Numerical recipes in C


Author: Press W.H. | Teukolsky S.A. | Vetterling W.T. | Flannery B.P.


Numerical Recipes in C The Art of Scientific Computing Second Edition

William H. Press Harvard-Smithsonian Center for Astrophysics

Saul A. Teukolsky Department of Physics, Cornell University

William T. Vetterling Polaroid Corporation

Brian P. Flannery EXXON Research and Engineering Company

CAMBRIDGE UNIVERSITY PRESS Cambridge New York Port Chester Melbourne Sydney

Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

Copyright © Cambridge University Press 1988, 1992 except for §13.10 and Appendix B, which are placed into the public domain, and except for all other computer programs and procedures, which are Copyright © Numerical Recipes Software 1987, 1988, 1992, 1997. All Rights Reserved.

Some sections of this book were originally published, in different form, in Computers in Physics magazine, Copyright © American Institute of Physics, 1988–1992.

First Edition originally published 1988; Second Edition originally published 1992. Reprinted with corrections, 1993, 1994, 1995, 1997. This reprinting is corrected to software version 2.08.

Printed in the United States of America. Typeset in TeX.

Without an additional license to use the contained software, this book is intended as a text and reference book, for reading purposes only. A free license for limited use of the software by the individual owner of a copy of this book who personally types one or more routines into a single computer is granted under terms described on p. xvii. See the section “License Information” (pp. xvi–xviii) for information on obtaining more general licenses at low cost. Machine-readable media containing the software in this book, with included licenses for use on a single screen, are available from Cambridge University Press. See the order form at the back of the book, email to “[email protected]” (North America) or “[email protected]” (rest of world), or write to Cambridge University Press, 110 Midland Avenue, Port Chester, NY 10573 (USA), for further information. The software may also be downloaded, with immediate purchase of a license also possible, from the Numerical Recipes Software Web Site (http://www.nr.com). Unlicensed transfer of Numerical Recipes programs to any other format, or to any computer except one that is specifically licensed, is strictly prohibited. Technical questions, corrections, and requests for information should be addressed to Numerical Recipes Software, P.O. Box 243, Cambridge, MA 02238 (USA), email “[email protected]”, or fax 781 863-1739.

Library of Congress Cataloging in Publication Data

Numerical recipes in C : the art of scientific computing / William H. Press . . . [et al.]. – 2nd ed. Includes bibliographical references (p. ) and index. ISBN 0-521-43108-5
1. Numerical analysis–Computer programs. 2. Science–Mathematics–Computer programs. 3. C (Computer program language) I. Press, William H.
QA297.N866 1992 519.4028553–dc20 92-8876

A catalog record for this book is available from the British Library.

ISBN 0 521 43108 5   Book
ISBN 0 521 43720 2   Example book in C
ISBN 0 521 43724 5   C diskette (IBM 3.5", 1.44M)
ISBN 0 521 57608 3   CDROM (IBM PC/Macintosh)
ISBN 0 521 57607 5   CDROM (UNIX)

Contents

Preface to the Second Edition
Preface to the First Edition
License Information
Computer Programs by Chapter and Section

1 Preliminaries
  1.0 Introduction
  1.1 Program Organization and Control Structures
  1.2 Some C Conventions for Scientific Computing
  1.3 Error, Accuracy, and Stability

2 Solution of Linear Algebraic Equations
  2.0 Introduction
  2.1 Gauss-Jordan Elimination
  2.2 Gaussian Elimination with Backsubstitution
  2.3 LU Decomposition and Its Applications
  2.4 Tridiagonal and Band Diagonal Systems of Equations
  2.5 Iterative Improvement of a Solution to Linear Equations
  2.6 Singular Value Decomposition
  2.7 Sparse Linear Systems
  2.8 Vandermonde Matrices and Toeplitz Matrices
  2.9 Cholesky Decomposition
  2.10 QR Decomposition
  2.11 Is Matrix Inversion an N^3 Process?

3 Interpolation and Extrapolation
  3.0 Introduction
  3.1 Polynomial Interpolation and Extrapolation
  3.2 Rational Function Interpolation and Extrapolation
  3.3 Cubic Spline Interpolation
  3.4 How to Search an Ordered Table
  3.5 Coefficients of the Interpolating Polynomial
  3.6 Interpolation in Two or More Dimensions

4 Integration of Functions
  4.0 Introduction
  4.1 Classical Formulas for Equally Spaced Abscissas
  4.2 Elementary Algorithms
  4.3 Romberg Integration
  4.4 Improper Integrals
  4.5 Gaussian Quadratures and Orthogonal Polynomials
  4.6 Multidimensional Integrals

5 Evaluation of Functions
  5.0 Introduction
  5.1 Series and Their Convergence
  5.2 Evaluation of Continued Fractions
  5.3 Polynomials and Rational Functions
  5.4 Complex Arithmetic
  5.5 Recurrence Relations and Clenshaw’s Recurrence Formula
  5.6 Quadratic and Cubic Equations
  5.7 Numerical Derivatives
  5.8 Chebyshev Approximation
  5.9 Derivatives or Integrals of a Chebyshev-approximated Function
  5.10 Polynomial Approximation from Chebyshev Coefficients
  5.11 Economization of Power Series
  5.12 Padé Approximants
  5.13 Rational Chebyshev Approximation
  5.14 Evaluation of Functions by Path Integration

6 Special Functions
  6.0 Introduction
  6.1 Gamma Function, Beta Function, Factorials, Binomial Coefficients
  6.2 Incomplete Gamma Function, Error Function, Chi-Square Probability Function, Cumulative Poisson Function
  6.3 Exponential Integrals
  6.4 Incomplete Beta Function, Student’s Distribution, F-Distribution, Cumulative Binomial Distribution
  6.5 Bessel Functions of Integer Order
  6.6 Modified Bessel Functions of Integer Order
  6.7 Bessel Functions of Fractional Order, Airy Functions, Spherical Bessel Functions
  6.8 Spherical Harmonics
  6.9 Fresnel Integrals, Cosine and Sine Integrals
  6.10 Dawson’s Integral
  6.11 Elliptic Integrals and Jacobian Elliptic Functions
  6.12 Hypergeometric Functions

7 Random Numbers
  7.0 Introduction
  7.1 Uniform Deviates
  7.2 Transformation Method: Exponential and Normal Deviates
  7.3 Rejection Method: Gamma, Poisson, Binomial Deviates
  7.4 Generation of Random Bits
  7.5 Random Sequences Based on Data Encryption
  7.6 Simple Monte Carlo Integration
  7.7 Quasi- (that is, Sub-) Random Sequences
  7.8 Adaptive and Recursive Monte Carlo Methods

8 Sorting
  8.0 Introduction
  8.1 Straight Insertion and Shell’s Method
  8.2 Quicksort
  8.3 Heapsort
  8.4 Indexing and Ranking
  8.5 Selecting the Mth Largest
  8.6 Determination of Equivalence Classes

9 Root Finding and Nonlinear Sets of Equations
  9.0 Introduction
  9.1 Bracketing and Bisection
  9.2 Secant Method, False Position Method, and Ridders’ Method
  9.3 Van Wijngaarden–Dekker–Brent Method
  9.4 Newton-Raphson Method Using Derivative
  9.5 Roots of Polynomials
  9.6 Newton-Raphson Method for Nonlinear Systems of Equations
  9.7 Globally Convergent Methods for Nonlinear Systems of Equations

10 Minimization or Maximization of Functions
  10.0 Introduction
  10.1 Golden Section Search in One Dimension
  10.2 Parabolic Interpolation and Brent’s Method in One Dimension
  10.3 One-Dimensional Search with First Derivatives
  10.4 Downhill Simplex Method in Multidimensions
  10.5 Direction Set (Powell’s) Methods in Multidimensions
  10.6 Conjugate Gradient Methods in Multidimensions
  10.7 Variable Metric Methods in Multidimensions
  10.8 Linear Programming and the Simplex Method
  10.9 Simulated Annealing Methods

11 Eigensystems
  11.0 Introduction
  11.1 Jacobi Transformations of a Symmetric Matrix
  11.2 Reduction of a Symmetric Matrix to Tridiagonal Form: Givens and Householder Reductions
  11.3 Eigenvalues and Eigenvectors of a Tridiagonal Matrix
  11.4 Hermitian Matrices
  11.5 Reduction of a General Matrix to Hessenberg Form
  11.6 The QR Algorithm for Real Hessenberg Matrices
  11.7 Improving Eigenvalues and/or Finding Eigenvectors by Inverse Iteration

12 Fast Fourier Transform
  12.0 Introduction
  12.1 Fourier Transform of Discretely Sampled Data
  12.2 Fast Fourier Transform (FFT)
  12.3 FFT of Real Functions, Sine and Cosine Transforms
  12.4 FFT in Two or More Dimensions
  12.5 Fourier Transforms of Real Data in Two and Three Dimensions
  12.6 External Storage or Memory-Local FFTs

13 Fourier and Spectral Applications
  13.0 Introduction
  13.1 Convolution and Deconvolution Using the FFT
  13.2 Correlation and Autocorrelation Using the FFT
  13.3 Optimal (Wiener) Filtering with the FFT
  13.4 Power Spectrum Estimation Using the FFT
  13.5 Digital Filtering in the Time Domain
  13.6 Linear Prediction and Linear Predictive Coding
  13.7 Power Spectrum Estimation by the Maximum Entropy (All Poles) Method
  13.8 Spectral Analysis of Unevenly Sampled Data
  13.9 Computing Fourier Integrals Using the FFT
  13.10 Wavelet Transforms
  13.11 Numerical Use of the Sampling Theorem

14 Statistical Description of Data
  14.0 Introduction
  14.1 Moments of a Distribution: Mean, Variance, Skewness, and So Forth
  14.2 Do Two Distributions Have the Same Means or Variances?
  14.3 Are Two Distributions Different?
  14.4 Contingency Table Analysis of Two Distributions
  14.5 Linear Correlation
  14.6 Nonparametric or Rank Correlation
  14.7 Do Two-Dimensional Distributions Differ?
  14.8 Savitzky-Golay Smoothing Filters

15 Modeling of Data
  15.0 Introduction
  15.1 Least Squares as a Maximum Likelihood Estimator
  15.2 Fitting Data to a Straight Line
  15.3 Straight-Line Data with Errors in Both Coordinates
  15.4 General Linear Least Squares
  15.5 Nonlinear Models
  15.6 Confidence Limits on Estimated Model Parameters
  15.7 Robust Estimation

16 Integration of Ordinary Differential Equations
  16.0 Introduction
  16.1 Runge-Kutta Method
  16.2 Adaptive Stepsize Control for Runge-Kutta
  16.3 Modified Midpoint Method
  16.4 Richardson Extrapolation and the Bulirsch-Stoer Method
  16.5 Second-Order Conservative Equations
  16.6 Stiff Sets of Equations
  16.7 Multistep, Multivalue, and Predictor-Corrector Methods

17 Two Point Boundary Value Problems
  17.0 Introduction
  17.1 The Shooting Method
  17.2 Shooting to a Fitting Point
  17.3 Relaxation Methods
  17.4 A Worked Example: Spheroidal Harmonics
  17.5 Automated Allocation of Mesh Points
  17.6 Handling Internal Boundary Conditions or Singular Points

18 Integral Equations and Inverse Theory
  18.0 Introduction
  18.1 Fredholm Equations of the Second Kind
  18.2 Volterra Equations
  18.3 Integral Equations with Singular Kernels
  18.4 Inverse Problems and the Use of A Priori Information
  18.5 Linear Regularization Methods
  18.6 Backus-Gilbert Method
  18.7 Maximum Entropy Image Restoration

19 Partial Differential Equations
  19.0 Introduction
  19.1 Flux-Conservative Initial Value Problems
  19.2 Diffusive Initial Value Problems
  19.3 Initial Value Problems in Multidimensions
  19.4 Fourier and Cyclic Reduction Methods for Boundary Value Problems
  19.5 Relaxation Methods for Boundary Value Problems
  19.6 Multigrid Methods for Boundary Value Problems

20 Less-Numerical Algorithms
  20.0 Introduction
  20.1 Diagnosing Machine Parameters
  20.2 Gray Codes
  20.3 Cyclic Redundancy and Other Checksums
  20.4 Huffman Coding and Compression of Data
  20.5 Arithmetic Coding
  20.6 Arithmetic at Arbitrary Precision

References
Appendix A: Table of Prototype Declarations
Appendix B: Utility Routines
Appendix C: Complex Arithmetic
Index of Programs and Dependencies
General Index

Preface to the Second Edition

Our aim in writing the original edition of Numerical Recipes was to provide a book that combined general discussion, analytical mathematics, algorithmics, and actual working programs. The success of the first edition puts us now in a difficult, though hardly unenviable, position. We wanted, then and now, to write a book that is informal, fearlessly editorial, unesoteric, and above all useful. There is a danger that, if we are not careful, we might produce a second edition that is weighty, balanced, scholarly, and boring.

It is a mixed blessing that we know more now than we did six years ago. Then, we were making educated guesses, based on existing literature and our own research, about which numerical techniques were the most important and robust. Now, we have the benefit of direct feedback from a large reader community. Letters to our alter-ego enterprise, Numerical Recipes Software, are in the thousands per year. (Please, don’t telephone us.) Our post office box has become a magnet for letters pointing out that we have omitted some particular technique, well known to be important in a particular field of science or engineering. We value such letters, and digest them carefully, especially when they point us to specific references in the literature.

The inevitable result of this input is that this Second Edition of Numerical Recipes is substantially larger than its predecessor, in fact about 50% larger both in words and number of included programs (the latter now numbering well over 300). “Don’t let the book grow in size,” is the advice that we received from several wise colleagues. We have tried to follow the intended spirit of that advice, even as we violate the letter of it. We have not lengthened, or increased in difficulty, the book’s principal discussions of mainstream topics. Many new topics are presented at this same accessible level. Some topics, both from the earlier edition and new to this one, are now set in smaller type that labels them as being “advanced.” The reader who ignores such advanced sections completely will not, we think, find any lack of continuity in the shorter volume that results.

Here are some highlights of the new material in this Second Edition:

• a new chapter on integral equations and inverse methods
• a detailed treatment of multigrid methods for solving elliptic partial differential equations
• routines for band diagonal linear systems
• improved routines for linear algebra on sparse matrices
• Cholesky and QR decomposition
• orthogonal polynomials and Gaussian quadratures for arbitrary weight functions
• methods for calculating numerical derivatives
• Padé approximants, and rational Chebyshev approximation
• Bessel functions, and modified Bessel functions, of fractional order; and several other new special functions
• improved random number routines
• quasi-random sequences
• routines for adaptive and recursive Monte Carlo integration in high-dimensional spaces
• globally convergent methods for sets of nonlinear equations
• simulated annealing minimization for continuous control spaces
• fast Fourier transform (FFT) for real data in two and three dimensions
• fast Fourier transform (FFT) using external storage
• improved fast cosine transform routines
• wavelet transforms
• Fourier integrals with upper and lower limits
• spectral analysis on unevenly sampled data
• Savitzky-Golay smoothing filters
• fitting straight line data with errors in both coordinates
• a two-dimensional Kolmogorov-Smirnov test
• the statistical bootstrap method
• embedded Runge-Kutta-Fehlberg methods for differential equations
• high-order methods for stiff differential equations
• a new chapter on “less-numerical” algorithms, including Huffman and arithmetic coding, arbitrary precision arithmetic, and several other topics.

Consult the Preface to the First Edition, following, or the Table of Contents, for a list of the more “basic” subjects treated.

Acknowledgments

It is not possible for us to list by name here all the readers who have made useful suggestions; we are grateful for these. In the text, we attempt to give specific attribution for ideas that appear to be original, and not known in the literature. We apologize in advance for any omissions.

Some readers and colleagues have been particularly generous in providing us with ideas, comments, suggestions, and programs for this Second Edition. We especially want to thank George Rybicki, Philip Pinto, Peter Lepage, Robert Lupton, Douglas Eardley, Ramesh Narayan, David Spergel, Alan Oppenheim, Sallie Baliunas, Scott Tremaine, Glennys Farrar, Steven Block, John Peacock, Thomas Loredo, Matthew Choptuik, Gregory Cook, L. Samuel Finn, P. Deuflhard, Harold Lewis, Peter Weinberger, David Syer, Richard Ferch, Steven Ebstein, Bradley Keister, and William Gould. We have been helped by Nancy Lee Snyder’s mastery of a complicated TeX manuscript. We express appreciation to our editors Lauren Cowles and Alan Harvey at Cambridge University Press, and to our production editor Russell Hahn. We remain, of course, grateful to the individuals acknowledged in the Preface to the First Edition.

Special acknowledgment is due to programming consultant Seth Finkelstein, who wrote, rewrote, or influenced many of the routines in this book, as well as in its FORTRAN-language twin and the companion Example books. Our project has benefited enormously from Seth’s talent for detecting, and following the trail of, even very slight anomalies (often compiler bugs, but occasionally our errors), and from his good programming sense. To the extent that this edition of Numerical Recipes in C has a more graceful and “C-like” programming style than its predecessor, most of the credit goes to Seth. (Of course, we accept the blame for the FORTRANish lapses that still remain.)

We prepared this book for publication on DEC and Sun workstations running the UNIX operating system, and on a 486/33 PC compatible running MS-DOS 5.0/Windows 3.0. (See §1.0 for a list of additional computers used in program tests.) We enthusiastically recommend the principal software used: GNU Emacs, TeX, Perl, Adobe Illustrator, and PostScript. Also used were a variety of C compilers – too numerous (and sometimes too buggy) for individual acknowledgment. It is a sobering fact that our standard test suite (exercising all the routines in this book) has uncovered compiler bugs in many of the compilers tried. When possible, we work with developers to see that such bugs get fixed; we encourage interested compiler developers to contact us about such arrangements.

WHP and SAT acknowledge the continued support of the U.S. National Science Foundation for their research on computational methods. D.A.R.P.A. support is acknowledged for §13.10 on wavelets.

June, 1992

William H. Press
Saul A. Teukolsky
William T. Vetterling
Brian P. Flannery

Preface to the First Edition

We call this book Numerical Recipes for several reasons. In one sense, this book is indeed a “cookbook” on numerical computation. However, there is an important distinction between a cookbook and a restaurant menu. The latter presents choices among complete dishes in each of which the individual flavors are blended and disguised. The former – and this book – reveals the individual ingredients and explains how they are prepared and combined.

Another purpose of the title is to connote an eclectic mixture of presentational techniques. This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines. Our task has been to find the right balance among these ingredients for each topic. You will find that for some topics we have tilted quite far to the analytic side; this is where we have felt there to be gaps in the “standard” mathematical training. For other topics, where the mathematical prerequisites are universally held, we have tilted towards more in-depth discussion of the nature of the computational algorithms, or towards practical questions of implementation.

We admit, therefore, to some unevenness in the “level” of this book. About half of it is suitable for an advanced undergraduate course on numerical computation for science or engineering majors. The other half ranges from the level of a graduate course to that of a professional reference. Most cookbooks have, after all, recipes at varying levels of complexity. An attractive feature of this approach, we think, is that the reader can use the book at increasing levels of sophistication as his/her experience grows. Even inexperienced readers should be able to use our most advanced routines as black boxes. Having done so, we hope that these readers will subsequently go back and learn what secrets are inside.

If there is a single dominant theme in this book, it is that practical methods of numerical computation can be simultaneously efficient, clever, and – important – clear. The alternative viewpoint, that efficient computational methods must necessarily be so arcane and complex as to be useful only in “black box” form, we firmly reject.

Our purpose in this book is thus to open up a large number of computational black boxes to your scrutiny. We want to teach you to take apart these black boxes and to put them back together again, modifying them to suit your specific needs. We assume that you are mathematically literate, i.e., that you have the normal mathematical preparation associated with an undergraduate degree in a physical science, or engineering, or economics, or a quantitative social science. We assume that you know how to program a computer. We do not assume that you have any prior formal knowledge of numerical analysis or numerical methods.

The scope of Numerical Recipes is supposed to be “everything up to, but not including, partial differential equations.” We honor this in the breach: First, we do have one introductory chapter on methods for partial differential equations (Chapter 19). Second, we obviously cannot include everything else. All the so-called “standard” topics of a numerical analysis course have been included in this book: linear equations (Chapter 2), interpolation and extrapolation (Chapter 3), integration (Chapter 4), nonlinear root-finding (Chapter 9), eigensystems (Chapter 11), and ordinary differential equations (Chapter 16). Most of these topics have been taken beyond their standard treatments into some advanced material which we have felt to be particularly important or useful.

Some other subjects that we cover in detail are not usually found in the standard numerical analysis texts. These include the evaluation of functions and of particular special functions of higher mathematics (Chapters 5 and 6); random numbers and Monte Carlo methods (Chapter 7); sorting (Chapter 8); optimization, including multidimensional methods (Chapter 10); Fourier transform methods, including FFT methods and other spectral methods (Chapters 12 and 13); two chapters on the statistical description and modeling of data (Chapters 14 and 15); and two-point boundary value problems, both shooting and relaxation methods (Chapter 17).

The programs in this book are included in ANSI-standard C. Versions of the book in FORTRAN, Pascal, and BASIC are available separately. We have more to say about the C language, and the computational environment assumed by our routines, in §1.1 (Introduction).

Acknowledgments

Many colleagues have been generous in giving us the benefit of their numerical and computational experience, in providing us with programs, in commenting on the manuscript, or in general encouragement. We particularly wish to thank George Rybicki, Douglas Eardley, Philip Marcus, Stuart Shapiro, Paul Horowitz, Bruce Musicus, Irwin Shapiro, Stephen Wolfram, Henry Abarbanel, Larry Smarr, Richard Muller, John Bahcall, and A.G.W. Cameron.

We also wish to acknowledge two individuals whom we have never met: Forman Acton, whose 1970 textbook Numerical Methods that Work (New York: Harper and Row) has surely left its stylistic mark on us; and Donald Knuth, both for his series of books on The Art of Computer Programming (Reading, MA: Addison-Wesley), and for TeX, the computer typesetting language which immensely aided production of this book.

Research by the authors on computational methods was supported in part by the U.S. National Science Foundation.

October, 1985

William H. Press
Brian P. Flannery
Saul A. Teukolsky
William T. Vetterling

License Information

Read this section if you want to use the programs in this book on a computer. You’ll need to read the following Disclaimer of Warranty, get the programs onto your computer, and acquire a Numerical Recipes software license. (Without this license, which can be the free “immediate license” under terms described below, the book is intended as a text and reference book, for reading purposes only.)

Disclaimer of Warranty

We make no warranties, express or implied, that the programs contained in this volume are free of error, or are consistent with any particular standard of merchantability, or that they will meet your requirements for any particular application. They should not be relied on for solving a problem whose incorrect solution could result in injury to a person or loss of property. If you do use the programs in such a manner, it is at your own risk. The authors and publisher disclaim all liability for direct or consequential damages resulting from your use of the programs.

How to Get the Code onto Your Computer

Pick one of the following methods:

• You can type the programs from this book directly into your computer. In this case, the only kind of license available to you is the free “immediate license” (see below). You are not authorized to transfer or distribute a machine-readable copy to any other person, nor to have any other person type the programs into a computer on your behalf. We do not want to hear bug reports from you if you choose this option, because experience has shown that virtually all reported bugs in such cases are typing errors!

• You can download the Numerical Recipes programs electronically from the Numerical Recipes On-Line Software Store, located at http://www.nr.com, our Web site. They are packaged as a password-protected file, and you’ll need to purchase a license to unpack them. You can get a single-screen license and password immediately, on-line, from the On-Line Store, with fees ranging from $50 (PC, Macintosh, educational institutions’ UNIX) to $140 (general UNIX). Downloading the packaged software from the On-Line Store is also the way to start if you want to acquire a more general (multiscreen, site, or corporate) license.

• You can purchase media containing the programs from Cambridge University Press. Diskette versions are available in IBM-compatible format for machines running Windows 3.1, 95, or NT. CDROM versions in ISO-9660 format for PC, Macintosh, and UNIX systems are also available; these include both C and Fortran versions on a single CDROM (as well as versions in Pascal and BASIC from the first edition). Diskettes purchased from Cambridge University Press include a single-screen license for PC or Macintosh only. The CDROM is available with a single-screen license for PC or Macintosh (order ISBN 0 521 57608 3), or (at a slightly higher price) with a single-screen license for UNIX workstations (order ISBN 0 521 57607 5). Orders for media from Cambridge University Press can be placed at 800 872-7423 (North America only) or by email to [email protected] (North America) or [email protected] (rest of world). Or, visit the Web sites http://www.cup.org (North America) or http://www.cup.cam.ac.uk (rest of world).


Types of License Offered

Here are the types of licenses that we offer. Note that some types are automatically acquired with the purchase of media from Cambridge University Press, or of an unlocking password from the Numerical Recipes On-Line Software Store, while other types of licenses require that you communicate specifically with Numerical Recipes Software (email: [email protected] or fax: 781 863-1739). Our Web site http://www.nr.com has additional information.

• [“Immediate License”] If you are the individual owner of a copy of this book and you type one or more of its routines into your computer, we authorize you to use them on that computer for your own personal and noncommercial purposes. You are not authorized to transfer or distribute machine-readable copies to any other person, or to use the routines on more than one machine, or to distribute executable programs containing our routines. This is the only free license.

• [“Single-Screen License”] This is the most common type of low-cost license, with terms governed by our Single Screen (Shrinkwrap) License document (complete terms available through our Web site). Basically, this license lets you use Numerical Recipes routines on any one screen (PC, workstation, X-terminal, etc.). You may also, under this license, transfer pre-compiled, executable programs incorporating our routines to other, unlicensed, screens or computers, providing that (i) your application is noncommercial (i.e., does not involve the selling of your program for a fee), (ii) the programs were first developed, compiled, and successfully run on a licensed screen, and (iii) our routines are bound into the programs in such a manner that they cannot be accessed as individual routines and cannot practicably be unbound and used in other programs. That is, under this license, your program user must not be able to use our programs as part of a program library or “mix-and-match” workbench. Conditions for other types of commercial or noncommercial distribution may be found on our Web site (http://www.nr.com).

• [“Multi-Screen, Server, Site, and Corporate Licenses”] The terms of the Single Screen License can be extended to designated groups of machines, defined by number of screens, number of machines, locations, or ownership. Significant discounts from the corresponding single-screen prices are available when the estimated number of screens exceeds 40. Contact Numerical Recipes Software (email: [email protected] or fax: 781 863-1739) for details.

• [“Course Right-to-Copy License”] Instructors at accredited educational institutions who have adopted this book for a course, and who have already purchased a Single Screen License (either acquired with the purchase of media, or from the Numerical Recipes On-Line Software Store), may license the programs for use in that course as follows: Mail your name, title, and address; the course name, number, dates, and estimated enrollment; and advance payment of $5 per (estimated) student to Numerical Recipes Software, at this address: P.O. Box 243, Cambridge, MA 02238 (USA). You will receive by return mail a license authorizing you to make copies of the programs for use by your students, and/or to transfer the programs to a machine accessible to your students (but only for the duration of the course).

About Copyrights on Computer Programs

Like artistic or literary compositions, computer programs are protected by copyright. Generally it is an infringement for you to copy into your computer a program from a copyrighted source. (It is also not a friendly thing to do, since it deprives the program’s author of compensation for his or her creative effort.) Under copyright law, all “derivative works” (modified versions, or translations into another computer language) also come under the same copyright as the original work.

Copyright does not protect ideas, but only the expression of those ideas in a particular form. In the case of a computer program, the ideas consist of the program’s methodology and algorithm, including the necessary sequence of steps adopted by the programmer. The expression of those ideas is the program source code (particularly any arbitrary or stylistic choices embodied in it), its derived object code, and any other derivative works.

If you analyze the ideas contained in a program, and then express those ideas in your own completely different implementation, then that new program implementation belongs to you. That is what we have done for those programs in this book that are not entirely of our own devising. When programs in this book are said to be “based” on programs published in copyright sources, we mean that the ideas are the same. The expression of these ideas as source code is our own. We believe that no material in this book infringes on an existing copyright.

Trademarks

Several registered trademarks appear within the text of this book: Sun is a trademark of Sun Microsystems, Inc. SPARC and SPARCstation are trademarks of SPARC International, Inc. Microsoft, Windows 95, Windows NT, PowerStation, and MS are trademarks of Microsoft Corporation. DEC, VMS, Alpha AXP, and ULTRIX are trademarks of Digital Equipment Corporation. IBM is a trademark of International Business Machines Corporation. Apple and Macintosh are trademarks of Apple Computer, Inc. UNIX is a trademark licensed exclusively through X/Open Co. Ltd. IMSL is a trademark of Visual Numerics, Inc. NAG refers to proprietary computer software of Numerical Algorithms Group (USA) Inc. PostScript and Adobe Illustrator are trademarks of Adobe Systems Incorporated. Last, and no doubt least, Numerical Recipes (when identifying products) is a trademark of Numerical Recipes Software.

Attributions

The fact that ideas are legally “free as air” in no way supersedes the ethical requirement that ideas be credited to their known originators. When programs in this book are based on known sources, whether copyrighted or in the public domain, published or “handed-down,” we have attempted to give proper attribution. Unfortunately, the lineage of many programs in common circulation is often unclear. We would be grateful to readers for new or corrected information regarding attributions, which we will attempt to incorporate in subsequent printings.

Computer Programs by Chapter and Section

1.0    flmoon    calculate phases of the moon by date
1.1    julday    Julian Day number from calendar date
1.1    badluk    Friday the 13th when the moon is full
1.1    caldat    calendar date from Julian day number

2.1    gaussj    Gauss-Jordan matrix inversion and linear equation solution
2.3    ludcmp    linear equation solution, LU decomposition
2.3    lubksb    linear equation solution, backsubstitution
2.4    tridag    solution of tridiagonal systems
2.4    banmul    multiply vector by band diagonal matrix
2.4    bandec    band diagonal systems, decomposition
2.4    banbks    band diagonal systems, backsubstitution
2.5    mprove    linear equation solution, iterative improvement
2.6    svbksb    singular value backsubstitution
2.6    svdcmp    singular value decomposition of a matrix
2.6    pythag    calculate (a^2 + b^2)^(1/2) without overflow
2.7    cyclic    solution of cyclic tridiagonal systems
2.7    sprsin    convert matrix to sparse format
2.7    sprsax    product of sparse matrix and vector
2.7    sprstx    product of transpose sparse matrix and vector
2.7    sprstp    transpose of sparse matrix
2.7    sprspm    pattern multiply two sparse matrices
2.7    sprstm    threshold multiply two sparse matrices
2.7    linbcg    biconjugate gradient solution of sparse systems
2.7    snrm      used by linbcg for vector norm
2.7    atimes    used by linbcg for sparse multiplication
2.7    asolve    used by linbcg for preconditioner
2.8    vander    solve Vandermonde systems
2.8    toeplz    solve Toeplitz systems
2.9    choldc    Cholesky decomposition
2.9    cholsl    Cholesky backsubstitution
2.10   qrdcmp    QR decomposition
2.10   qrsolv    QR backsubstitution
2.10   rsolv     right triangular backsubstitution
2.10   qrupdt    update a QR decomposition
2.10   rotate    Jacobi rotation used by qrupdt

3.1    polint    polynomial interpolation
3.2    ratint    rational function interpolation
3.3    spline    construct a cubic spline
3.3    splint    cubic spline interpolation
3.4    locate    search an ordered table by bisection
3.4    hunt      search a table when calls are correlated
3.5    polcoe    polynomial coefficients from table of values
3.5    polcof    polynomial coefficients from table of values
3.6    polin2    two-dimensional polynomial interpolation
3.6    bcucof    construct two-dimensional bicubic
3.6    bcuint    two-dimensional bicubic interpolation
3.6    splie2    construct two-dimensional spline
3.6    splin2    two-dimensional spline interpolation

4.2    trapzd    trapezoidal rule
4.2    qtrap     integrate using trapezoidal rule
4.2    qsimp     integrate using Simpson’s rule
4.3    qromb     integrate using Romberg adaptive method
4.4    midpnt    extended midpoint rule
4.4    qromo     integrate using open Romberg adaptive method
4.4    midinf    integrate a function on a semi-infinite interval
4.4    midsql    integrate a function with lower square-root singularity
4.4    midsqu    integrate a function with upper square-root singularity
4.4    midexp    integrate a function that decreases exponentially
4.5    qgaus     integrate a function by Gaussian quadratures
4.5    gauleg    Gauss-Legendre weights and abscissas
4.5    gaulag    Gauss-Laguerre weights and abscissas
4.5    gauher    Gauss-Hermite weights and abscissas
4.5    gaujac    Gauss-Jacobi weights and abscissas
4.5    gaucof    quadrature weights from orthogonal polynomials
4.5    orthog    construct nonclassical orthogonal polynomials
4.6    quad3d    integrate a function over a three-dimensional space

5.1    eulsum    sum a series by Euler–van Wijngaarden algorithm
5.3    ddpoly    evaluate a polynomial and its derivatives
5.3    poldiv    divide one polynomial by another
5.3    ratval    evaluate a rational function
5.7    dfridr    numerical derivative by Ridders’ method
5.8    chebft    fit a Chebyshev polynomial to a function
5.8    chebev    Chebyshev polynomial evaluation
5.9    chder     derivative of a function already Chebyshev fitted
5.9    chint     integrate a function already Chebyshev fitted
5.10   chebpc    polynomial coefficients from a Chebyshev fit
5.10   pcshft    polynomial coefficients of a shifted polynomial
5.11   pccheb    inverse of chebpc; use to economize power series
5.12   pade      Padé approximant from power series coefficients
5.13   ratlsq    rational fit by least-squares method

6.1    gammln    logarithm of gamma function
6.1    factrl    factorial function
6.1    bico      binomial coefficients function
6.1    factln    logarithm of factorial function
6.1    beta      beta function
6.2    gammp     incomplete gamma function
6.2    gammq     complement of incomplete gamma function
6.2    gser      series used by gammp and gammq
6.2    gcf       continued fraction used by gammp and gammq
6.2    erff      error function
6.2    erffc     complementary error function
6.2    erfcc     complementary error function, concise routine
6.3    expint    exponential integral E_n
6.3    ei        exponential integral Ei
6.4    betai     incomplete beta function
6.4    betacf    continued fraction used by betai
6.5    bessj0    Bessel function J_0
6.5    bessy0    Bessel function Y_0
6.5    bessj1    Bessel function J_1
6.5    bessy1    Bessel function Y_1
6.5    bessy     Bessel function Y of general integer order
6.5    bessj     Bessel function J of general integer order
6.6    bessi0    modified Bessel function I_0
6.6    bessk0    modified Bessel function K_0
6.6    bessi1    modified Bessel function I_1
6.6    bessk1    modified Bessel function K_1
6.6    bessk     modified Bessel function K of integer order
6.6    bessi     modified Bessel function I of integer order
6.7    bessjy    Bessel functions of fractional order
6.7    beschb    Chebyshev expansion used by bessjy
6.7    bessik    modified Bessel functions of fractional order
6.7    airy      Airy functions
6.7    sphbes    spherical Bessel functions j_n and y_n
6.8    plgndr    Legendre polynomials, associated (spherical harmonics)
6.9    frenel    Fresnel integrals S(x) and C(x)
6.9    cisi      cosine and sine integrals Ci and Si
6.10   dawson    Dawson’s integral
6.11   rf        Carlson’s elliptic integral of the first kind
6.11   rd        Carlson’s elliptic integral of the second kind
6.11   rj        Carlson’s elliptic integral of the third kind
6.11   rc        Carlson’s degenerate elliptic integral
6.11   ellf      Legendre elliptic integral of the first kind
6.11   elle      Legendre elliptic integral of the second kind
6.11   ellpi     Legendre elliptic integral of the third kind
6.11   sncndn    Jacobian elliptic functions
6.12   hypgeo    complex hypergeometric function
6.12   hypser    complex hypergeometric function, series evaluation
6.12   hypdrv    complex hypergeometric function, derivative of

7.1    ran0      random deviate by Park and Miller minimal standard
7.1    ran1      random deviate, minimal standard plus shuffle
7.1    ran2      random deviate by L’Ecuyer long period plus shuffle
7.1    ran3      random deviate by Knuth subtractive method
7.2    expdev    exponential random deviates
7.2    gasdev    normally distributed random deviates
7.3    gamdev    gamma-law distribution random deviates
7.3    poidev    Poisson distributed random deviates
7.3    bnldev    binomial distributed random deviates
7.4    irbit1    random bit sequence
7.4    irbit2    random bit sequence
7.5    psdes     “pseudo-DES” hashing of 64 bits
7.5    ran4      random deviates from DES-like hashing
7.7    sobseq    Sobol’s quasi-random sequence
7.8    vegas     adaptive multidimensional Monte Carlo integration
7.8    rebin     sample rebinning used by vegas
7.8    miser     recursive multidimensional Monte Carlo integration
7.8    ranpt     get random point, used by miser

8.1    piksrt    sort an array by straight insertion
8.1    piksr2    sort two arrays by straight insertion
8.1    shell     sort an array by Shell’s method
8.2    sort      sort an array by quicksort method
8.2    sort2     sort two arrays by quicksort method
8.3    hpsort    sort an array by heapsort method
8.4    indexx    construct an index for an array
8.4    sort3     sort, use an index to sort 3 or more arrays
8.4    rank      construct a rank table for an array
8.5    select    find the Nth largest in an array
8.5    selip     find the Nth largest, without altering an array
8.5    hpsel     find M largest values, without altering an array
8.6    eclass    determine equivalence classes from list
8.6    eclazz    determine equivalence classes from procedure

9.0    scrsho    graph a function to search for roots
9.1    zbrac     outward search for brackets on roots
9.1    zbrak     inward search for brackets on roots
9.1    rtbis     find root of a function by bisection
9.2    rtflsp    find root of a function by false-position
9.2    rtsec     find root of a function by secant method
9.2    zriddr    find root of a function by Ridders’ method
9.3    zbrent    find root of a function by Brent’s method
9.4    rtnewt    find root of a function by Newton-Raphson
9.4    rtsafe    find root of a function by Newton-Raphson and bisection
9.5    laguer    find a root of a polynomial by Laguerre’s method
9.5    zroots    roots of a polynomial by Laguerre’s method with deflation
9.5    zrhqr     roots of a polynomial by eigenvalue methods
9.5    qroot     complex or double root of a polynomial, Bairstow
9.6    mnewt     Newton’s method for systems of equations
9.7    lnsrch    search along a line, used by newt
9.7    newt      globally convergent multi-dimensional Newton’s method
9.7    fdjac     finite-difference Jacobian, used by newt
9.7    fmin      norm of a vector function, used by newt
9.7    broydn    secant method for systems of equations

10.1   mnbrak    bracket the minimum of a function
10.1   golden    find minimum of a function by golden section search
10.2   brent     find minimum of a function by Brent’s method
10.3   dbrent    find minimum of a function using derivative information
10.4   amoeba    minimize in N-dimensions by downhill simplex method
10.4   amotry    evaluate a trial point, used by amoeba
10.5   powell    minimize in N-dimensions by Powell’s method
10.5   linmin    minimum of a function along a ray in N-dimensions
10.5   f1dim     function used by linmin
10.6   frprmn    minimize in N-dimensions by conjugate gradient
10.6   dlinmin   minimum of a function along a ray using derivatives
10.6   df1dim    function used by dlinmin
10.7   dfpmin    minimize in N-dimensions by variable metric method
10.8   simplx    linear programming maximization of a linear function
10.8   simp1     linear programming, used by simplx
10.8   simp2     linear programming, used by simplx
10.8   simp3     linear programming, used by simplx
10.9   anneal    traveling salesman problem by simulated annealing
10.9   revcst    cost of a reversal, used by anneal
10.9   reverse   do a reversal, used by anneal
10.9   trncst    cost of a transposition, used by anneal
10.9   trnspt    do a transposition, used by anneal
10.9   metrop    Metropolis algorithm, used by anneal
10.9   amebsa    simulated annealing in continuous spaces
10.9   amotsa    evaluate a trial point, used by amebsa

11.1   jacobi    eigenvalues and eigenvectors of a symmetric matrix
11.1   eigsrt    eigenvectors, sorts into order by eigenvalue
11.2   tred2     Householder reduction of a real, symmetric matrix
11.3   tqli      eigensolution of a symmetric tridiagonal matrix
11.5   balanc    balance a nonsymmetric matrix
11.5   elmhes    reduce a general matrix to Hessenberg form
11.6   hqr       eigenvalues of a Hessenberg matrix

12.2   four1     fast Fourier transform (FFT) in one dimension
12.3   twofft    fast Fourier transform of two real functions
12.3   realft    fast Fourier transform of a single real function
12.3   sinft     fast sine transform
12.3   cosft1    fast cosine transform with endpoints
12.3   cosft2    “staggered” fast cosine transform
12.4   fourn     fast Fourier transform in multidimensions
12.5   rlft3     FFT of real data in two or three dimensions
12.6   fourfs    FFT for huge data sets on external media
12.6   fourew    rewind and permute files, used by fourfs

13.1   convlv    convolution or deconvolution of data using FFT
13.2   correl    correlation or autocorrelation of data using FFT
13.4   spctrm    power spectrum estimation using FFT
13.6   memcof    evaluate maximum entropy (MEM) coefficients
13.6   fixrts    reflect roots of a polynomial into unit circle
13.6   predic    linear prediction using MEM coefficients
13.7   evlmem    power spectral estimation from MEM coefficients
13.8   period    power spectrum of unevenly sampled data
13.8   fasper    power spectrum of unevenly sampled larger data sets
13.8   spread    extirpolate value into array, used by fasper
13.9   dftcor    compute endpoint corrections for Fourier integrals
13.9   dftint    high-accuracy Fourier integrals
13.10  wt1       one-dimensional discrete wavelet transform
13.10  daub4     Daubechies 4-coefficient wavelet filter
13.10  pwtset    initialize coefficients for pwt
13.10  pwt       partial wavelet transform
13.10  wtn       multidimensional discrete wavelet transform

14.1   moment    calculate moments of a data set
14.2   ttest     Student’s t-test for difference of means
14.2   avevar    calculate mean and variance of a data set
14.2   tutest    Student’s t-test for means, case of unequal variances
14.2   tptest    Student’s t-test for means, case of paired data
14.2   ftest     F-test for difference of variances
14.3   chsone    chi-square test for difference between data and model
14.3   chstwo    chi-square test for difference between two data sets
14.3   ksone     Kolmogorov-Smirnov test of data against model
14.3   kstwo     Kolmogorov-Smirnov test between two data sets
14.3   probks    Kolmogorov-Smirnov probability function
14.4   cntab1    contingency table analysis using chi-square
14.4   cntab2    contingency table analysis using entropy measure
14.5   pearsn    Pearson’s correlation between two data sets
14.6   spear     Spearman’s rank correlation between two data sets
14.6   crank     replaces array elements by their rank
14.6   kendl1    correlation between two data sets, Kendall’s tau
14.6   kendl2    contingency table analysis using Kendall’s tau
14.7   ks2d1s    K–S test in two dimensions, data vs. model
14.7   quadct    count points by quadrants, used by ks2d1s
14.7   quadvl    quadrant probabilities, used by ks2d1s
14.7   ks2d2s    K–S test in two dimensions, data vs. data
14.8   savgol    Savitzky-Golay smoothing coefficients


15.2   fit       least-squares fit data to a straight line
15.3   fitexy    fit data to a straight line, errors in both x and y
15.3   chixy     used by fitexy to calculate a χ^2
15.4   lfit      general linear least-squares fit by normal equations
15.4   covsrt    rearrange covariance matrix, used by lfit
15.4   svdfit    linear least-squares fit by singular value decomposition
15.4   svdvar    variances from singular value decomposition
15.4   fpoly     fit a polynomial using lfit or svdfit
15.4   fleg      fit a Legendre polynomial using lfit or svdfit
15.5   mrqmin    nonlinear least-squares fit, Marquardt’s method
15.5   mrqcof    used by mrqmin to evaluate coefficients
15.5   fgauss    fit a sum of Gaussians using mrqmin
15.7   medfit    fit data to a straight line robustly, least absolute deviation
15.7   rofunc    fit data robustly, used by medfit

16.1   rk4       integrate one step of ODEs, fourth-order Runge-Kutta
16.1   rkdumb    integrate ODEs by fourth-order Runge-Kutta
16.2   rkqs      integrate one step of ODEs with accuracy monitoring
16.2   rkck      Cash-Karp-Runge-Kutta step used by rkqs
16.2   odeint    integrate ODEs with accuracy monitoring
16.3   mmid      integrate ODEs by modified midpoint method
16.4   bsstep    integrate ODEs, Bulirsch-Stoer step
16.4   pzextr    polynomial extrapolation, used by bsstep
16.4   rzextr    rational function extrapolation, used by bsstep
16.5   stoerm    integrate conservative second-order ODEs
16.6   stiff     integrate stiff ODEs by fourth-order Rosenbrock
16.6   jacobn    sample Jacobian routine for stiff
16.6   derivs    sample derivatives routine for stiff
16.6   simpr     integrate stiff ODEs by semi-implicit midpoint rule
16.6   stifbs    integrate stiff ODEs, Bulirsch-Stoer step

17.1   shoot     solve two point boundary value problem by shooting
17.2   shootf    ditto, by shooting to a fitting point
17.3   solvde    two point boundary value problem, solve by relaxation
17.3   bksub     backsubstitution, used by solvde
17.3   pinvs     diagonalize a sub-block, used by solvde
17.3   red       reduce columns of a matrix, used by solvde
17.4   sfroid    spheroidal functions by method of solvde
17.4   difeq     spheroidal matrix coefficients, used by sfroid
17.4   sphoot    spheroidal functions by method of shoot
17.4   sphfpt    spheroidal functions by method of shootf

18.1   fred2     solve linear Fredholm equations of the second kind
18.1   fredin    interpolate solutions obtained with fred2
18.2   voltra    linear Volterra equations of the second kind
18.3   wwghts    quadrature weights for an arbitrarily singular kernel
18.3   kermom    sample routine for moments of a singular kernel
18.3   quadmx    sample routine for a quadrature matrix
18.3   fredex    example of solving a singular Fredholm equation

19.5   sor       elliptic PDE solved by successive overrelaxation method
19.6   mglin     linear elliptic PDE solved by multigrid method
19.6   rstrct    half-weighting restriction, used by mglin, mgfas
19.6   interp    bilinear prolongation, used by mglin, mgfas
19.6   addint    interpolate and add, used by mglin
19.6   slvsml    solve on coarsest grid, used by mglin
19.6   relax     Gauss-Seidel relaxation, used by mglin
19.6   resid     calculate residual, used by mglin
19.6   copy      utility used by mglin, mgfas
19.6   fill0     utility used by mglin
19.6   mgfas     nonlinear elliptic PDE solved by multigrid method
19.6   relax2    Gauss-Seidel relaxation, used by mgfas
19.6   slvsm2    solve on coarsest grid, used by mgfas
19.6   lop       applies nonlinear operator, used by mgfas
19.6   matadd    utility used by mgfas
19.6   matsub    utility used by mgfas
19.6   anorm2    utility used by mgfas

20.1   machar    diagnose computer’s floating arithmetic
20.2   igray     Gray code and its inverse
20.3   icrc1     cyclic redundancy checksum, used by icrc
20.3   icrc      cyclic redundancy checksum
20.3   decchk    decimal check digit calculation or verification
20.4   hufmak    construct a Huffman code
20.4   hufapp    append bits to a Huffman code, used by hufmak
20.4   hufenc    use Huffman code to encode and compress a character
20.4   hufdec    use Huffman code to decode and decompress a character
20.5   arcmak    construct an arithmetic code
20.5   arcode    encode or decode a character using arithmetic coding
20.5   arcsum    add integer to byte string, used by arcode
20.6   mpops     multiple precision arithmetic, simpler operations
20.6   mpmul     multiple precision multiply, using FFT methods
20.6   mpinv     multiple precision reciprocal
20.6   mpdiv     multiple precision divide and remainder
20.6   mpsqrt    multiple precision square root
20.6   mp2dfr    multiple precision conversion to decimal base
20.6   mppi      multiple precision example, compute many digits of π

Chapter 1.

Preliminaries

1.0 Introduction

This book, like its predecessor edition, is supposed to teach you methods of numerical computing that are practical, efficient, and (insofar as possible) elegant. We presume throughout this book that you, the reader, have particular tasks that you want to get done. We view our job as educating you on how to proceed. Occasionally we may try to reroute you briefly onto a particularly beautiful side road; but by and large, we will guide you along main highways that lead to practical destinations.

Throughout this book, you will find us fearlessly editorializing, telling you what you should and shouldn’t do. This prescriptive tone results from a conscious decision on our part, and we hope that you will not find it irritating. We do not claim that our advice is infallible! Rather, we are reacting against a tendency, in the textbook literature of computation, to discuss every possible method that has ever been invented, without ever offering a practical judgment on relative merit. We do, therefore, offer you our practical judgments whenever we can. As you gain experience, you will form your own opinion of how reliable our advice is.

We presume that you are able to read computer programs in C, that being the language of this version of Numerical Recipes (Second Edition). The book Numerical Recipes in FORTRAN (Second Edition) is separately available, if you prefer to program in that language. Earlier editions of Numerical Recipes in Pascal and Numerical Recipes Routines and Examples in BASIC are also available; while not containing the additional material of the Second Edition versions in C and FORTRAN, these versions are perfectly serviceable if Pascal or BASIC is your language of choice.

When we include programs in the text, they look like this:

    #include <math.h>
    #define RAD (3.14159265/180.0)

    void flmoon(int n, int nph, long *jd, float *frac)
    /* Our programs begin with an introductory comment summarizing their purpose and
       explaining their calling sequence. This routine calculates the phases of the moon.
       Given an integer n and a code nph for the phase desired (nph = 0 for new moon, 1 for
       first quarter, 2 for full, 3 for last quarter), the routine returns the Julian Day
       Number jd, and the fractional part of a day frac to be added to it, of the nth such
       phase since January, 1900. Greenwich Mean Time is assumed. */
    {
        void nrerror(char error_text[]);
        int i;
        float am,as,c,t,t2,xtra;

        c=n+nph/4.0;                            /* This is how we comment an individual line. */
        t=c/1236.85;
        t2=t*t;
        as=359.2242+29.105356*c;                /* You aren't really intended to understand */
        am=306.0253+385.816918*c+0.010730*t2;   /* this algorithm, but it does work! */
        *jd=2415020+28L*n+7L*nph;
        xtra=0.75933+1.53058868*c+((1.178e-4)-(1.55e-7)*t)*t2;
        if (nph == 0 || nph == 2)
            xtra += (0.1734-3.93e-4*t)*sin(RAD*as)-0.4068*sin(RAD*am);
        else if (nph == 1 || nph == 3)
            xtra += (0.1721-4.0e-4*t)*sin(RAD*as)-0.6280*sin(RAD*am);
        else nrerror("nph is unknown in flmoon");   /* This is how we will indicate error conditions. */
        i=(int)(xtra >= 0.0 ? floor(xtra) : ceil(xtra-1.0));
        *jd += i;
        *frac=xtra-i;
    }

If the syntax of the function definition above looks strange to you, then you are probably used to the older Kernighan and Ritchie (“K&R”) syntax, rather than that of the newer ANSI C. In this edition, we adopt ANSI C as our standard. You might want to look ahead to §1.2 where ANSI C function prototypes are discussed in more detail. Note our convention of handling all errors and exceptional cases with a statement like nrerror("some error message");. The function nrerror() is part of a small file of utility programs, nrutil.c, listed in Appendix B at the back of the book. This Appendix includes a number of other utilities that we will describe later in this chapter. Function nrerror() prints the indicated error message to your stderr device (usually your terminal screen), and then invokes the function exit(), which terminates execution. The function exit() is in every C library we know of; but if you find it missing, you can modify nrerror() so that it does anything else that will halt execution. For example, you can have it pause for input from the keyboard, and then manually interrupt execution. In some applications, you will want to modify nrerror() to do more sophisticated error handling, for example to transfer control somewhere else, with an error flag or error code set. We will have more to say about the C programming language, its conventions and style, in §1.1 and §1.2.
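As an illustration of the "more sophisticated error handling" mentioned above, the following is a minimal sketch, not the version listed in Appendix B, of an nrerror() that records the message and transfers control back to a recovery point instead of calling exit(). It assumes the standard setjmp()/longjmp() facility; the names nr_recover, nr_last_error, check_phase, and try_check_phase are hypothetical and were introduced only for this example.

```c
#include <setjmp.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names, introduced only for this sketch. */
static jmp_buf nr_recover;        /* where control returns after an error */
static char nr_last_error[128];   /* copy of the most recent error message */

/* Variant of nrerror(): save the message, print it to stderr, and jump
   back to the recovery point instead of terminating with exit(). */
void nrerror(char error_text[])
{
    strncpy(nr_last_error, error_text, sizeof nr_last_error - 1);
    nr_last_error[sizeof nr_last_error - 1] = '\0';
    fprintf(stderr, "Numerical Recipes run-time error...\n%s\n", error_text);
    longjmp(nr_recover, 1);
}

/* Stand-in for a routine that rejects a bad phase code, as flmoon does. */
static int check_phase(int nph)
{
    if (nph < 0 || nph > 3) nrerror("nph is unknown in flmoon");
    return nph;
}

/* Returns 0 on success, 1 if the routine signalled an error via nrerror(). */
int try_check_phase(int nph)
{
    if (setjmp(nr_recover) != 0) return 1;   /* reached through longjmp() */
    check_phase(nph);
    return 0;
}
```

With this arrangement a caller can test the return value of try_check_phase() and inspect nr_last_error, rather than having the whole program halt. Whether that is preferable to a plain exit() depends on the application.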

Computational Environment and Program Validation

Our goal is that the programs in this book be as portable as possible, across different platforms (models of computer), across different operating systems, and across different C compilers. C was designed with this type of portability in mind. Nevertheless, we have found that there is no substitute for actually checking all programs on a variety of compilers, in the process uncovering differences in library structure or contents, and even occasional differences in allowed syntax. As surrogates for the large number of possible combinations, we have tested all the programs in this book on the combinations of machines, operating systems, and compilers shown on the accompanying table. More generally, the programs should run without modification on any compiler that implements the ANSI C standard, as described for example in Harbison and Steele’s excellent book [1]. With small modifications, our programs should run on any compiler that implements the older, de facto K&R standard [2]. An example of the kind of trivial incompatibility to watch out for is that ANSI C requires the memory allocation functions malloc() and free() to be declared via the header stdlib.h; some older compilers require them to be declared with the header file malloc.h, while others regard them as inherent in the language and require no header file at all.

Tested Machines and Compilers

Hardware                    O/S Version               Compiler Version
IBM PC compatible 486/33    MS-DOS 5.0/Windows 3.1    Microsoft C/C++ 7.0
IBM PC compatible 486/33    MS-DOS 5.0                Borland C/C++ 2.0
IBM RS/6000                 AIX 3.2                   IBM xlc 1.02
DECstation 5000/25          ULTRIX 4.2A               CodeCenter (Saber) C 3.1.1
DECsystem 5400              ULTRIX 4.1                GNU C Compiler 2.1
Sun SPARCstation 2          SunOS 4.1                 GNU C Compiler 1.40
DECstation 5000/200         ULTRIX 4.2                DEC RISC C 2.1*
Sun SPARCstation 2          SunOS 4.1                 Sun cc 1.1*

*compiler version does not fully implement ANSI C; only K&R validated

In validating the programs, we have taken the program source code directly from the machine-readable form of the book’s manuscript, to decrease the chance of propagating typographical errors. “Driver” or demonstration programs that we used as part of our validations are available separately as the Numerical Recipes Example Book (C), as well as in machine-readable form. If you plan to use more than a few of the programs in this book, or if you plan to use programs in this book on more than one different computer, then you may find it useful to obtain a copy of these demonstration programs.

Of course we would be foolish to claim that there are no bugs in our programs, and we do not make such a claim. We have been very careful, and have benefitted from the experience of the many readers who have written to us. If you find a new bug, please document it and tell us!
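The header incompatibility described above can be made concrete with a short sketch. It assumes an ANSI C compiler, so malloc() and free() are declared by including stdlib.h; on an older compiler the include line is the part that might have to change. The helper alloc_zeroed_floats is a hypothetical name and is not one of the book's utilities.

```c
#include <stdlib.h>   /* ANSI C: declares malloc() and free() */

/* Hypothetical helper, not a routine from the book:
   allocate n floats and set each one to zero. */
float *alloc_zeroed_floats(size_t n)
{
    size_t i;
    float *v = (float *) malloc(n * sizeof(float));
    if (v == NULL) return NULL;   /* caller decides how to report failure */
    for (i = 0; i < n; i++) v[i] = 0.0f;
    return v;
}
```

The caller releases the storage with free(v) when it is no longer needed.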

Compatibility with the First Edition

If you are accustomed to the Numerical Recipes routines of the First Edition, rest assured: almost all of them are still here, with the same names and functionalities, often with major improvements in the code itself. In addition, we hope that you will soon become equally familiar with the added capabilities of the more than 100 routines that are new to this edition.

We have retired a small number of First Edition routines, those that we believe to be clearly dominated by better methods implemented in this edition. A table, following, lists the retired routines and suggests replacements.

Previous Routines Omitted from This Edition

Name(s)           Replacement(s)                              Comment
adi               mglin or mgfas                              better method
cosft             cosft1 or cosft2                            choice of boundary conditions
cel, el2          rf, rd, rj, rc                              better algorithms
des, desks        ran4 now uses psdes                         was too slow
mdian1, mdian2    select, selip                               more general
qcksrt            sort                                        name change (sort is now hpsort)
rkqc              rkqs                                        better method
smooft            use convlv with coefficients from savgol
sparse            linbcg                                      more general

First Edition users should also be aware that some routines common to both editions have alterations in their calling interfaces, so are not directly “plug compatible.” A fairly complete list is: chsone, chstwo, covsrt, dfpmin, laguer, lfit, memcof, mrqcof, mrqmin, pzextr, ran4, realft, rzextr, shoot, shootf. There may be others (depending in part on which printing of the First Edition is taken for the comparison). If you have written software of any appreciable complexity that is dependent on First Edition routines, we do not recommend blindly replacing them by the corresponding routines in this book. We do recommend that any new programming efforts use the new routines.

About References

You will find references, and suggestions for further reading, listed at the end of most sections of this book. References are cited in the text by bracketed numbers like this [3].

Because computer algorithms often circulate informally for quite some time before appearing in a published form, the task of uncovering “primary literature” is sometimes quite difficult. We have not attempted this, and we do not pretend to any degree of bibliographical completeness in this book. For topics where a substantial secondary literature exists (discussion in textbooks, reviews, etc.) we have consciously limited our references to a few of the more useful secondary sources, especially those with good references to the primary literature. Where the existing secondary literature is insufficient, we give references to a few primary sources that are intended to serve as starting points for further reading, not as complete bibliographies for the field.

The order in which references are listed is not necessarily significant. It reflects a compromise between listing cited references in the order cited, and listing suggestions for further reading in a roughly prioritized order, with the most useful ones first.

The remaining three sections of this chapter review some basic concepts of programming (control structures, etc.), discuss a set of conventions specific to C that we have adopted in this book, and introduce some fundamental concepts in numerical analysis (roundoff error, etc.). Thereafter, we plunge into the substantive material of the book.

CITED REFERENCES AND FURTHER READING:
Harbison, S.P., and Steele, G.L., Jr. 1991, C: A Reference Manual, 3rd ed. (Englewood Cliffs, NJ: Prentice-Hall). [1]
Kernighan, B., and Ritchie, D. 1978, The C Programming Language (Englewood Cliffs, NJ: Prentice-Hall). [2] [Reference for K&R “traditional” C. Later editions of this book conform to the ANSI C standard.]
Meeus, J. 1982, Astronomical Formulae for Calculators, 2nd ed., revised and enlarged (Richmond, VA: Willmann-Bell). [3]

1.1 Program Organization and Control Structures

We sometimes like to point out the close analogies between computer programs, on the one hand, and written poetry or written musical scores, on the other. All three present themselves as visual media, symbols on a two-dimensional page or computer screen. Yet, in all three cases, the visual, two-dimensional, frozen-in-time representation communicates (or is supposed to communicate) something rather different, namely a process that unfolds in time. A poem is meant to be read; music, played; a program, executed as a sequential series of computer instructions.

In all three cases, the target of the communication, in its visual form, is a human being. The goal is to transfer to him/her, as efficiently as can be accomplished, the greatest degree of understanding, in advance, of how the process will unfold in time. In poetry, this human target is the reader. In music, it is the performer. In programming, it is the program user.

Now, you may object that the target of communication of a program is not a human but a computer, that the program user is only an irrelevant intermediary, a lackey who feeds the machine. This is perhaps the case in the situation where the business executive pops a diskette into a desktop computer and feeds that computer a black-box program in binary executable form. The computer, in this case, doesn’t much care whether that program was written with “good programming practice” or not. We envision, however, that you, the readers of this book, are in quite a different situation. You need, or want, to know not just what a program does, but also how it does it, so that you can tinker with it and modify it to your particular application. You need others to be able to see what you have done, so that they can criticize or admire. In such cases, where the desired goal is maintainable or reusable code, the targets of a program’s communication are surely human, not machine.
One key to achieving good programming practice is to recognize that programming, music, and poetry — all three being symbolic constructs of the human brain — are naturally structured into hierarchies that have many different nested levels. Sounds (phonemes) form small meaningful units (morphemes) which in turn form words; words group into phrases, which group into sentences; sentences make paragraphs, and these are organized into higher levels of meaning. Notes form musical phrases, which form themes, counterpoints, harmonies, etc.; which form movements, which form concertos, symphonies, and so on. The structure in programs is equally hierarchical. Appropriately, good programming practice brings different techniques to bear on the different levels [1-3]. At a low level is the ascii character set. Then, constants, identifiers, operands,


operators. Then program statements, like a[j+1]=b+c/3.0;. Here, the best programming advice is simply be clear, or (correspondingly) don’t be too tricky. You might momentarily be proud of yourself at writing the single line k=(2-j)*(1+3*j)/2;

if you want to permute cyclically one of the values j = (0, 1, 2) into respectively k = (1, 2, 0). You will regret it later, however, when you try to understand that line. Better, and likely also faster, is k=j+1; if (k == 3) k=0;

Many programming stylists would even argue for the ploddingly literal

    switch (j) {
        case 0: k=1; break;
        case 1: k=2; break;
        case 2: k=0; break;
        default: {
            fprintf(stderr,"unexpected value for j");
            exit(1);
        }
    }

on the grounds that it is both clear and additionally safeguarded from wrong assumptions about the possible values of j. Our preference among the implementations is for the middle one. In this simple example, we have in fact traversed several levels of hierarchy: Statements frequently come in “groups” or “blocks” which make sense only taken as a whole. The middle fragment above is one example. Another is swap=a[j]; a[j]=b[j]; b[j]=swap;

which makes immediate sense to any programmer as the exchange of two variables, while ans=sum=0.0; n=1;

is very likely to be an initialization of variables prior to some iterative process. This level of hierarchy in a program is usually evident to the eye. It is good programming practice to put in comments at this level, e.g., “initialize” or “exchange variables.” The next level is that of control structures. These are things like the switch construction in the example above, for loops, and so on. This level is sufficiently important, and relevant to the hierarchical level of the routines in this book, that we will come back to it just below. At still higher levels in the hierarchy, we have functions and modules, and the whole “global” organization of the computational task to be done. In the musical analogy, we are now at the level of movements and complete works. At these levels,



modularization and encapsulation become important programming concepts, the general idea being that program units should interact with one another only through clearly defined and narrowly circumscribed interfaces. Good modularization practice is an essential prerequisite to the success of large, complicated software projects, especially those employing the efforts of more than one programmer. It is also good practice (if not quite as essential) in the less massive programming tasks that an individual scientist, or reader of this book, encounters.

Some computer languages, such as Modula-2 and C++, promote good modularization with higher-level language constructs absent in C. In Modula-2, for example, functions, type definitions, and data structures can be encapsulated into “modules” that communicate through declared public interfaces and whose internal workings are hidden from the rest of the program [4]. In the C++ language, the key concept is “class,” a user-definable generalization of data type that provides for data hiding, automatic initialization of data, memory management, dynamic typing, and operator overloading (i.e., the user-definable extension of operators like + and * so as to be appropriate to operands in any particular class) [5]. Properly used in defining the data structures that are passed between program units, classes can clarify and circumscribe these units’ public interfaces, reducing the chances of programming error and also allowing a considerable degree of compile-time and run-time error checking.

Beyond modularization, though depending on it, lie the concepts of object-oriented programming. Here a programming language, such as C++ or Turbo Pascal 5.5 [6], allows a module’s public interface to accept redefinitions of types or actions, and these redefinitions become shared all the way down through the module’s hierarchy (so-called polymorphism). For example, a routine written to invert a matrix of real numbers could (dynamically, at run time) be made able to handle complex numbers by overloading complex data types and corresponding definitions of the arithmetic operations. Additional concepts of inheritance (the ability to define a data type that “inherits” all the structure of another type, plus additional structure of its own), and object extensibility (the ability to add functionality to a module without access to its source code, e.g., at run time), also come into play.

We have not attempted to modularize, or make objects out of, the routines in this book, for at least two reasons. First, the chosen language, C, does not really make this possible. Second, we envision that you, the reader, might want to incorporate the algorithms in this book, a few at a time, into modules or objects with a structure of your own choosing. There does not exist, at present, a standard or accepted set of “classes” for scientific object-oriented computing. While we might have tried to invent such a set, doing so would have inevitably tied the algorithmic content of the book (which is its raison d’être) to some rather specific, and perhaps haphazard, set of choices regarding class definitions.

On the other hand, we are not unfriendly to the goals of modular and object-oriented programming. Within the limits of C, we have therefore tried to structure our programs to be “object friendly.” That is one reason we have adopted ANSI C with its function prototyping as our default C dialect (see §1.2). Also, within our implementation sections, we have paid particular attention to the practices of structured programming, as we now discuss.


Control Structures An executing program unfolds in time, but not strictly in the linear order in which the statements are written. Program statements that affect the order in which statements are executed, or that affect whether statements are executed, are called control statements. Control statements never make useful sense by themselves. They make sense only in the context of the groups or blocks of statements that they in turn control. If you think of those blocks as paragraphs containing sentences, then the control statements are perhaps best thought of as the indentation of the paragraph and the punctuation between the sentences, not the words within the sentences. We can now say what the goal of structured programming is. It is to make program control manifestly apparent in the visual presentation of the program. You see that this goal has nothing at all to do with how the computer sees the program. As already remarked, computers don’t care whether you use structured programming or not. Human readers, however, do care. You yourself will also care, once you discover how much easier it is to perfect and debug a well-structured program than one whose control structure is obscure. You accomplish the goals of structured programming in two complementary ways. First, you acquaint yourself with the small number of essential control structures that occur over and over again in programming, and that are therefore given convenient representations in most programming languages. You should learn to think about your programming tasks, insofar as possible, exclusively in terms of these standard control structures. In writing programs, you should get into the habit of representing these standard control structures in consistent, conventional ways. “Doesn’t this inhibit creativity?” our students sometimes ask. Yes, just as Mozart’s creativity was inhibited by the sonata form, or Shakespeare’s by the metrical requirements of the sonnet. 
The point is that creativity, when it is meant to communicate, does well under the inhibitions of appropriate restrictions on format. Second, you avoid, insofar as possible, control statements whose controlled blocks or objects are difficult to discern at a glance. This means, in practice, that you must try to avoid named labels on statements and goto’s. It is not the goto’s that are dangerous (although they do interrupt one’s reading of a program); the named statement labels are the hazard. In fact, whenever you encounter a named statement label while reading a program, you will soon become conditioned to get a sinking feeling in the pit of your stomach. Why? Because the following questions will, by habit, immediately spring to mind: Where did control come from in a branch to this label? It could be anywhere in the routine! What circumstances resulted in a branch to this label? They could be anything! Certainty becomes uncertainty, understanding dissolves into a morass of possibilities. Some examples are now in order to make these considerations more concrete (see Figure 1.1.1).

Catalog of Standard Structures Iteration.

In C, simple iteration is performed with a for loop, for example

    for (j=2;j

    if (b > 3)
        if (a > 3)
            b += 1;
    else
        b -= 1;        /* questionable! */

As judged by the indentation used on successive lines, the intent of the writer of this code is the following: ‘If b is greater than 3 and a is greater than 3, then increment b. If b is not greater than 3, then decrement b.’ According to the rules of C, however, the actual meaning is ‘If b is greater than 3, then evaluate a. If a is greater than 3, then increment b, and if a is less than or equal to 3, decrement b.’ The point is that an else clause is associated with the most recent open if statement, no matter how you lay it out on the page. Such confusions in meaning are easily resolved by the inclusion of braces. They may in some instances be technically superfluous; nevertheless, they clarify your intent and improve the program. The above fragment should be written as

    if (b > 3) {
        if (a > 3) b += 1;
    } else {
        b -= 1;
    }

Here is a working program that consists dominantly of if control statements:

    #include <math.h>
    #define IGREG (15+31L*(10+12L*1582))    /* Gregorian Calendar adopted Oct. 15, 1582. */

    long julday(int mm, int id, int iyyy)
    /* In this routine julday returns the Julian Day Number that begins at noon of the
       calendar date specified by month mm, day id, and year iyyy, all integer variables.
       Positive year signifies A.D.; negative, B.C. Remember that the year after 1 B.C.
       was 1 A.D. */
    {
        void nrerror(char error_text[]);
        long jul;
        int ja,jy=iyyy,jm;

        if (jy == 0) nrerror("julday: there is no year zero.");
        if (jy < 0) ++jy;
        if (mm > 2) {                           /* Here is an example of a block IF-structure. */
            jm=mm+1;
        } else {
            --jy;
            jm=mm+13;
        }
        jul = (long) (floor(365.25*jy)+floor(30.6001*jm)+id+1720995);
        if (id+31L*(mm+12L*iyyy) >= IGREG) {    /* Test whether to change to Gregorian Calendar. */
            ja=(int)(0.01*jy);
            jul += 2-ja+(int) (0.25*ja);
        }
        return jul;
    }

(Astronomers number each 24-hour period, starting and ending at noon, with a unique integer, the Julian Day Number [7]. Julian Day Zero was a very long time ago; a convenient reference point is that Julian Day 2440000 began at noon of May 23, 1968. If you know the Julian Day Number that begins at noon of a given calendar date, then the day of the week of that date is obtained by adding 1 and taking the result modulo base 7; a zero answer corresponds to Sunday, 1 to Monday, . . . , 6 to Saturday.)

While iteration. Most languages (though not FORTRAN, incidentally) provide for structures like the following C example:

    while (n < 1000) {
        n *= 2;
        j += 1;
    }

It is the particular feature of this structure that the control-clause (in this case n < 1000) is evaluated before each iteration. If the clause is not true, the enclosed statements will not be executed. In particular, if this code is encountered at a time when n is greater than or equal to 1000, the statements will not even be executed once.

Do-While iteration. Companion to the while iteration is a related control structure that tests its control-clause at the end of each iteration. In C, it looks like this:

    do {
        n *= 2;
        j += 1;
    } while (n < 1000);

In this case, the enclosed statements will be executed at least once, independent of the initial value of n.

Break. In this case, you have a loop that is repeated indefinitely until some condition tested somewhere in the middle of the loop (and possibly tested in more than one place) becomes true. At that point you wish to exit the loop and proceed with what comes after it. In C the structure is implemented with the simple break statement, which terminates execution of the innermost for, while, do, or switch construction and proceeds to the next sequential instruction. (In Pascal and standard FORTRAN, this structure requires the use of statement labels, to the detriment of clear programming.) A typical usage of the break statement is:

    for(;;) {
        [statements before the test]
        if (...) break;
        [statements after the test]
    }
    [next sequential instruction]

Here is a program that uses several different iteration structures. One of us was once asked, for a scavenger hunt, to find the date of a Friday the 13th on which the moon was full. This is a program which accomplishes that task, giving incidentally all other Fridays the 13th as a by-product.

    #include <stdio.h>
    #include <math.h>
    #define ZON -5.0        /* Time zone −5 is Eastern Standard Time. */
    #define IYBEG 1900      /* The range of dates to be searched. */
    #define IYEND 2000

    int main(void) /* Program badluk */
    {
        void flmoon(int n, int nph, long *jd, float *frac);
        long julday(int mm, int id, int iyyy);
        int ic,icon,idwk,im,iyyy,n;
        float timzon = ZON/24.0,frac;
        long jd,jday;

        printf("\nFull moons on Friday the 13th from %5d to %5d\n",IYBEG,IYEND);
        for (iyyy=IYBEG;iyyy= jd ? 1 : -1);
                            if (ic == (-icon)) break;   /* Another break, case of no match. */
                            icon=ic;
                            n += ic;
                        }
                    }
                }
            }
        }
        return 0;
    }

If you are merely curious, there were (or will be) occurrences of a full moon on Friday the 13th (time zone GMT−5) on: 3/13/1903, 10/13/1905, 6/13/1919, 1/13/1922, 11/13/1970, 2/13/1987, 10/13/2000, 9/13/2019, and 8/13/2049.

Other “standard” structures. Our advice is to avoid them. Every programming language has some number of “goodies” that the designer just couldn’t resist throwing in. They seemed like a good idea at the time. Unfortunately they don’t stand the test of time! Your program becomes difficult to translate into other languages, and difficult to read (because rarely used structures are unfamiliar to the reader). You can almost always accomplish the supposed conveniences of these structures in other ways.

In C, the most problematic control structure is the switch...case...default construction (see Figure 1.1.1), which has historically been burdened by uncertainty, from compiler to compiler, about what data types are allowed in its control expression. Data types char and int are universally supported. For other data types, e.g., float or double, the structure should be replaced by a more recognizable and translatable if...else construction. ANSI C allows the control expression to be of type long, but many older compilers do not. The continue; construction, while benign, can generally be replaced by an if construction with no loss of clarity.

About “Advanced Topics”

Material set in smaller type, like this, signals an “advanced topic,” either one outside of the main argument of the chapter, or else one requiring of you more than the usual assumed mathematical background, or else (in a few cases) a discussion that is more speculative or an algorithm that is less well-tested. Nothing important will be lost if you skip the advanced topics on a first reading of the book.

You may have noticed that, by its looping over the months and years, the program badluk avoids using any algorithm for converting a Julian Day Number back into a calendar date. A routine for doing just this is not very interesting structurally, but it is occasionally useful:

    #include <math.h>
    #define IGREG 2299161

    void caldat(long julian, int *mm, int *id, int *iyyy)
    /* Inverse of the function julday given above. Here julian is input as a Julian Day
       Number, and the routine outputs mm,id, and iyyy as the month, day, and year on
       which the specified Julian Day started at noon. */
    {
        long ja,jalpha,jb,jc,jd,je;

        if (julian >= IGREG) {          /* Cross-over to Gregorian Calendar produces this correction. */
            jalpha=(long)(((float) (julian-1867216)-0.25)/36524.25);
            ja=julian+1+jalpha-(long) (0.25*jalpha);
        } else if (julian < 0) {        /* Make day number positive by adding integer number of
                                           Julian centuries, then subtract them off at the end. */
            ja=julian+36525*(1-julian/36525);
        } else
            ja=julian;
        jb=ja+1524;
        jc=(long)(6680.0+((float) (jb-2439870)-122.1)/365.25);
        jd=(long)(365*jc+(0.25*jc));
        je=(long)((jb-jd)/30.6001);
        *id=jb-jd-(long) (30.6001*je);
        *mm=je-1;
        if (*mm > 12) *mm -= 12;
        *iyyy=jc-4715;
        if (*mm > 2) --(*iyyy);
        if (*iyyy =

1.2 Some C Conventions for Scientific Computing

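Operators, listed from highest to lowest precedence:

    ()         function call                           left-to-right
    []         array element
    .          structure or union member
    ->         pointer reference to structure

    !          logical not                             right-to-left
    ~          bitwise complement
    -          unary minus
    ++         increment
    --         decrement
    &          address of
    *          contents of
    (type)     cast to type
    sizeof     size in bytes

    *          multiply                                left-to-right
    /          divide
    %          remainder

    +          add                                     left-to-right
    -          subtract

    <<         bitwise left shift                      left-to-right
    >>         bitwise right shift

    <          arithmetic less than                    left-to-right
    >          arithmetic greater than
    <=         arithmetic less than or equal to
    >=         arithmetic greater than or equal to

    ==         arithmetic equal                        left-to-right
    !=         arithmetic not equal

    &          bitwise and                             left-to-right

    ^          bitwise exclusive or                    left-to-right

    |          bitwise or                              left-to-right

    &&         logical and                             left-to-right

    ||         logical or                              left-to-right

    ? :        conditional expression                  right-to-left

    =          assignment operator                     right-to-left
               also += -= *= /= %= <<= >>= &= ^= |=

    ,          sequential expression                   left-to-right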

We have already alluded to the problem of computing small integer powers of numbers, most notably the square and cube. The omission of this operation from C is perhaps the language’s most galling insult to the scientific programmer. All good FORTRAN compilers recognize expressions like (A+B)**4 and produce in-line code, in this case with only one add and two multiplies. It is typical for constant integer powers up to 12 to be thus recognized.


In C, the mere problem of squaring is hard enough! Some people “macro-ize” the operation as

    #define SQR(a) ((a)*(a))

However, this is likely to produce code where SQR(sin(x)) results in two calls to the sine routine! You might be tempted to avoid this by storing the argument of the squaring function in a temporary variable:

    static float sqrarg;
    #define SQR(a) (sqrarg=(a),sqrarg*sqrarg)

The global variable sqrarg now has (and needs to keep) scope over the whole module, which is a little dangerous. Also, one needs a completely different macro to square expressions of type int. More seriously, this macro can fail if there are two SQR operations in a single expression. Since in C the order of evaluation of pieces of the expression is at the compiler’s discretion, the value of sqrarg in one evaluation of SQR can be that from the other evaluation in the same expression, producing nonsensical results. When we need a guaranteed-correct SQR macro, we use the following, which exploits the guaranteed complete evaluation of subexpressions in a conditional expression:

    static float sqrarg;
    #define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg*sqrarg)

A collection of macros for other simple operations is included in the file nrutil.h (see Appendix B) and used by many of our programs. Here are the synopses:

    SQR(a)        Square a float value.
    DSQR(a)       Square a double value.
    FMAX(a,b)     Maximum of two float values.
    FMIN(a,b)     Minimum of two float values.
    DMAX(a,b)     Maximum of two double values.
    DMIN(a,b)     Minimum of two double values.
    IMAX(a,b)     Maximum of two int values.
    IMIN(a,b)     Minimum of two int values.
    LMAX(a,b)     Maximum of two long values.
    LMIN(a,b)     Minimum of two long values.
    SIGN(a,b)     Magnitude of a times sign of b.

Scientific programming in C may someday become a bed of roses; for now, watch out for the thorns!

CITED REFERENCES AND FURTHER READING:

Harbison, S.P., and Steele, G.L., Jr. 1991, C: A Reference Manual, 3rd ed. (Englewood Cliffs, NJ: Prentice-Hall). [1]
AT&T Bell Laboratories 1985, The C Programmer’s Handbook (Englewood Cliffs, NJ: Prentice-Hall).
Kernighan, B., and Ritchie, D. 1978, The C Programming Language (Englewood Cliffs, NJ: Prentice-Hall). [Reference for K&R “traditional” C. Later editions of this book conform to the ANSI C standard.]
Hogan, T. 1984, The C Programmer’s Handbook (Bowie, MD: Brady Communications).


1.3 Error, Accuracy, and Stability

Although we assume no prior training of the reader in formal numerical analysis, we will need to presume a common understanding of a few key concepts. We will define these briefly in this section.

Computers store numbers not with infinite precision but rather in some approximation that can be packed into a fixed number of bits (binary digits) or bytes (groups of 8 bits). Almost all computers allow the programmer a choice among several different such representations or data types. Data types can differ in the number of bits utilized (the wordlength), but also in the more fundamental respect of whether the stored number is represented in fixed-point (int or long) or floating-point (float or double) format.

A number in integer representation is exact. Arithmetic between numbers in integer representation is also exact, with the provisos that (i) the answer is not outside the range of (usually, signed) integers that can be represented, and (ii) that division is interpreted as producing an integer result, throwing away any integer remainder.

In floating-point representation, a number is represented internally by a sign bit s (interpreted as plus or minus), an exact integer exponent e, and an exact positive integer mantissa M. Taken together these represent the number

$$s \times M \times B^{e-E} \tag{1.3.1}$$

where B is the base of the representation (usually B = 2, but sometimes B = 16), and E is the bias of the exponent, a fixed integer constant for any given machine and representation. An example is shown in Figure 1.3.1.

Several floating-point bit patterns can represent the same number. If B = 2, for example, a mantissa with leading (high-order) zero bits can be left-shifted, i.e., multiplied by a power of 2, if the exponent is decreased by a compensating amount. Bit patterns that are “as left-shifted as they can be” are termed normalized. Most computers always produce normalized results, since these don’t waste any bits of the mantissa and thus allow a greater accuracy of the representation. Since the high-order bit of a properly normalized mantissa (when B = 2) is always one, some computers don’t store this bit at all, giving one extra bit of significance.

Arithmetic among numbers in floating-point representation is not exact, even if the operands happen to be exactly represented (i.e., have exact values in the form of equation 1.3.1). For example, two floating numbers are added by first right-shifting (dividing by two) the mantissa of the smaller (in magnitude) one, simultaneously increasing its exponent, until the two operands have the same exponent. Low-order (least significant) bits of the smaller operand are lost by this shifting. If the two operands differ too greatly in magnitude, then the smaller operand is effectively replaced by zero, since it is right-shifted to oblivion.

The smallest (in magnitude) floating-point number which, when added to the floating-point number 1.0, produces a floating-point result different from 1.0 is termed the machine accuracy $\epsilon_m$. A typical computer with B = 2 and a 32-bit wordlength has $\epsilon_m$ around $3 \times 10^{-8}$. (A more detailed discussion of machine characteristics, and a program to determine them, is given in §20.1.) Roughly speaking, the machine accuracy $\epsilon_m$ is the fractional accuracy to which floating-point numbers are represented, corresponding to a change of one in the least significant bit of the mantissa. Pretty much any arithmetic operation among floating numbers should be thought of as introducing an additional fractional error of at least $\epsilon_m$. This type of error is called roundoff error.

[Figure 1.3.1: bit layouts for panels (a) through (f), each divided into a sign bit, an 8-bit exponent, and a 23-bit mantissa whose leading bit could be “phantom.”]

Figure 1.3.1. Floating point representations of numbers in a typical 32-bit (4-byte) format. (a) The number 1/2 (note the bias in the exponent); (b) the number 3; (c) the number 1/4; (d) the number $10^{-7}$, represented to machine accuracy; (e) the same number $10^{-7}$, but shifted so as to have the same exponent as the number 3; with this shifting, all significance is lost and $10^{-7}$ becomes zero; shifting to a common exponent must occur before two numbers can be added; (f) sum of the numbers $3 + 10^{-7}$, which equals 3 to machine accuracy. Even though $10^{-7}$ can be represented accurately by itself, it cannot accurately be added to a much larger number.

It is important to understand that $\epsilon_m$ is not the smallest floating-point number that can be represented on a machine. That number depends on how many bits there are in the exponent, while $\epsilon_m$ depends on how many bits there are in the mantissa.

Roundoff errors accumulate with increasing amounts of calculation. If, in the course of obtaining a calculated value, you perform N such arithmetic operations, you might be so lucky as to have a total roundoff error on the order of $\sqrt{N}\,\epsilon_m$, if the roundoff errors come in randomly up or down. (The square root comes from a random-walk.) However, this estimate can be very badly off the mark for two reasons:

(i) It very frequently happens that the regularities of your calculation, or the peculiarities of your computer, cause the roundoff errors to accumulate preferentially in one direction. In this case the total will be of order $N\epsilon_m$.

(ii) Some especially unfavorable occurrences can vastly increase the roundoff error of single operations. Generally these can be traced to the subtraction of two very nearly equal numbers, giving a result whose only significant bits are those (few) low-order ones in which the operands differed. You might think that such a “coincidental” subtraction is unlikely to occur. Not always so. Some mathematical expressions magnify its probability of occurrence tremendously. For example, in the familiar formula for the solution of a quadratic equation,

$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \tag{1.3.2}$$

the addition becomes delicate and roundoff-prone whenever $ac \ll b^2$. (In §5.6 we will learn how to avoid the problem in this particular case.)


Roundoff error is a characteristic of computer hardware. There is another, different, kind of error that is a characteristic of the program or algorithm used, independent of the hardware on which the program is executed. Many numerical algorithms compute “discrete” approximations to some desired “continuous” quantity. For example, an integral is evaluated numerically by computing a function at a discrete set of points, rather than at “every” point. Or, a function may be evaluated by summing a finite number of leading terms in its infinite series, rather than all infinity terms. In cases like this, there is an adjustable parameter, e.g., the number of points or of terms, such that the “true” answer is obtained only when that parameter goes to infinity. Any practical calculation is done with a finite, but sufficiently large, choice of that parameter. The discrepancy between the true answer and the answer obtained in a practical calculation is called the truncation error. Truncation error would persist even on a hypothetical, “perfect” computer that had an infinitely accurate representation and no roundoff error. As a general rule there is not much that a programmer can do about roundoff error, other than to choose algorithms that do not magnify it unnecessarily (see discussion of “stability” below). Truncation error, on the other hand, is entirely under the programmer’s control. In fact, it is only a slight exaggeration to say that clever minimization of truncation error is practically the entire content of the field of numerical analysis! Most of the time, truncation error and roundoff error do not strongly interact with one another. A calculation can be imagined as having, first, the truncation error that it would have if run on an infinite-precision computer, “plus” the roundoff error associated with the number of operations performed. Sometimes, however, an otherwise attractive method can be unstable. 
This means that any roundoff error that becomes “mixed into” the calculation at an early stage is successively magnified until it comes to swamp the true answer. An unstable method would be useful on a hypothetical, perfect computer; but in this imperfect world it is necessary for us to require that algorithms be stable, or if unstable that we use them with great caution.

Here is a simple, if somewhat artificial, example of an unstable algorithm: Suppose that it is desired to calculate all integer powers of the so-called “Golden Mean,” the number given by

$$\phi \equiv \frac{\sqrt{5} - 1}{2} \approx 0.61803398 \tag{1.3.3}$$

It turns out (you can easily verify) that the powers $\phi^n$ satisfy a simple recursion relation,

$$\phi^{n+1} = \phi^{n-1} - \phi^n \tag{1.3.4}$$

Thus, knowing the first two values $\phi^0 = 1$ and $\phi^1 = 0.61803398$, we can successively apply (1.3.4) performing only a single subtraction, rather than a slower multiplication by $\phi$, at each stage.

Unfortunately, the recurrence (1.3.4) also has another solution, namely the value $-\frac{1}{2}(\sqrt{5} + 1)$. Since the recurrence is linear, and since this undesired solution has magnitude greater than unity, any small admixture of it introduced by roundoff errors will grow exponentially. On a typical machine with 32-bit wordlength, (1.3.4) starts to give completely wrong answers by about n = 16, at which point $\phi^n$ is down to only $10^{-4}$. The recurrence (1.3.4) is unstable, and cannot be used for the purpose stated.

We will encounter the question of stability in many more sophisticated guises, later in this book.

CITED REFERENCES AND FURTHER READING:

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), Chapter 1.
Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs, NJ: Prentice Hall), Chapter 2.
Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison-Wesley), §1.3.
Wilkinson, J.H. 1964, Rounding Errors in Algebraic Processes (Englewood Cliffs, NJ: Prentice-Hall).

Chapter 2. Solution of Linear Algebraic Equations

2.0 Introduction

A set of linear algebraic equations looks like this:

$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1N}x_N &= b_1 \\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \cdots + a_{2N}x_N &= b_2 \\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + \cdots + a_{3N}x_N &= b_3 \\
\cdots \qquad & \qquad \cdots \\
a_{M1}x_1 + a_{M2}x_2 + a_{M3}x_3 + \cdots + a_{MN}x_N &= b_M
\end{aligned}
\tag{2.0.1}
$$

Here the N unknowns $x_j$, j = 1, 2, . . . , N are related by M equations. The coefficients $a_{ij}$ with i = 1, 2, . . . , M and j = 1, 2, . . . , N are known numbers, as are the right-hand side quantities $b_i$, i = 1, 2, . . . , M.

Nonsingular versus Singular Sets of Equations If N = M then there are as many equations as unknowns, and there is a good chance of solving for a unique solution set of xj ’s. Analytically, there can fail to be a unique solution if one or more of the M equations is a linear combination of the others, a condition called row degeneracy, or if all equations contain certain variables only in exactly the same linear combination, called column degeneracy. (For square matrices, a row degeneracy implies a column degeneracy, and vice versa.) A set of equations that is degenerate is called singular. We will consider singular matrices in some detail in §2.6. Numerically, at least two additional things can go wrong: • While not exact linear combinations of each other, some of the equations may be so close to linearly dependent that roundoff errors in the machine render them linearly dependent at some stage in the solution process. In this case your numerical procedure will fail, and it can tell you that it has failed. 32


• Accumulated roundoff errors in the solution process can swamp the true solution. This problem particularly emerges if N is too large. The numerical procedure does not fail algorithmically. However, it returns a set of x’s that are wrong, as can be discovered by direct substitution back into the original equations. The closer a set of equations is to being singular, the more likely this is to happen, since increasingly close cancellations will occur during the solution. In fact, the preceding item can be viewed as the special case where the loss of significance is unfortunately total.

Much of the sophistication of complicated “linear equation-solving packages” is devoted to the detection and/or correction of these two pathologies. As you work with large linear sets of equations, you will develop a feeling for when such sophistication is needed. It is difficult to give any firm guidelines, since there is no such thing as a “typical” linear problem. But here is a rough idea: Linear sets with N as large as 20 or 50 can be routinely solved in single precision (32 bit floating representations) without resorting to sophisticated methods, if the equations are not close to singular. With double precision (60 or 64 bits), this number can readily be extended to N as large as several hundred, after which point the limiting factor is generally machine time, not accuracy. Even larger linear sets, N in the thousands or greater, can be solved when the coefficients are sparse (that is, mostly zero), by methods that take advantage of the sparseness. We discuss this further in §2.7.

At the other end of the spectrum, one seems just as often to encounter linear problems which, by their underlying nature, are close to singular. In this case, you might need to resort to sophisticated methods even for the case of N = 10 (though rarely for N = 5). Singular value decomposition (§2.6) is a technique that can sometimes turn singular problems into nonsingular ones, in which case additional sophistication becomes unnecessary.
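The check by "direct substitution back into the original equations" mentioned above might be sketched as follows. This is only an illustration, not a routine from this book: the name max_residual is hypothetical, and the flat, zero-offset, row-by-row storage is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not part of the book's library): returns the largest
   absolute residual max_i |sum_j a[i][j]*x[j] - b[i]| of an n x n system.
   The matrix is stored row by row in a flat, zero-offset array. */
double max_residual(const double *a, const double *x, const double *b, size_t n)
{
    double worst = 0.0;
    size_t i, j;

    assert(a != NULL && x != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        double sum = 0.0;
        for (j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];      /* left-hand side of equation i */
        if (fabs(sum - b[i]) > worst)
            worst = fabs(sum - b[i]);
    }
    return worst;
}
```

A residual that is large compared with the size of the b's signals that the returned x's should not be trusted; a small residual is necessary but, for nearly singular sets, not sufficient evidence of an accurate solution.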

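Matrices

Equation (2.0.1) can be written in matrix form as

$$
\mathbf{A} \cdot \mathbf{x} = \mathbf{b}
\tag{2.0.2}
$$

Here the raised dot denotes matrix multiplication, A is the matrix of coefficients, and b is the right-hand side written as a column vector,

$$
\mathbf{A} =
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1N} \\
a_{21} & a_{22} & \cdots & a_{2N} \\
 & \cdots & & \\
a_{M1} & a_{M2} & \cdots & a_{MN}
\end{pmatrix}
\qquad
\mathbf{b} =
\begin{pmatrix}
b_1 \\ b_2 \\ \cdots \\ b_M
\end{pmatrix}
\tag{2.0.3}
$$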

By convention, the first index on an element $a_{ij}$ denotes its row, the second index its column. For most purposes you don’t need to know how a matrix is stored in a computer’s physical memory; you simply reference matrix elements by their two-dimensional addresses, e.g., $a_{34}$ = a[3][4]. We have already seen, in §1.2, that this C notation can in fact hide a rather subtle and versatile physical storage scheme, “pointer to array of pointers to rows.” You might wish to review that section at this point. Occasionally it is useful to be able to peer through the veil, for example to pass a whole row a[i][j], j=1, . . . , N by the reference a[i].
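Passing a whole row by the reference a[i] might look like the sketch below. It assumes a zero-offset “pointer to array of pointers to rows” built by hand; the names alloc_rows, free_rows, and row_sum are hypothetical and are not the utility routines of §1.2.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator: one contiguous block of nrow*ncol floats plus an
   array of row pointers into it, so that a[i][j] works as usual. */
float **alloc_rows(size_t nrow, size_t ncol)
{
    float **a;
    size_t i;

    assert(nrow > 0 && ncol > 0);
    a = malloc(nrow * sizeof *a);
    if (a == NULL) return NULL;
    a[0] = calloc(nrow * ncol, sizeof **a);
    if (a[0] == NULL) { free(a); return NULL; }
    for (i = 1; i < nrow; i++)
        a[i] = a[0] + i * ncol;              /* row i starts ncol floats after row i-1 */
    return a;
}

void free_rows(float **a)
{
    if (a != NULL) { free(a[0]); free(a); }
}

/* Receives one row as an ordinary vector; the caller simply passes a[i]. */
float row_sum(const float row[], size_t ncol)
{
    float s = 0.0f;
    size_t j;

    for (j = 0; j < ncol; j++) s += row[j];
    return s;
}
```

A call such as row_sum(a[1], ncol) hands the function the address of the second row without copying any matrix elements.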

Tasks of Computational Linear Algebra

We will consider the following tasks as falling in the general purview of this chapter:

• Solution of the matrix equation A · x = b for an unknown vector x, where A is a square matrix of coefficients, raised dot denotes matrix multiplication, and b is a known right-hand side vector (§2.1–§2.10).
• Solution of more than one matrix equation $\mathbf{A} \cdot \mathbf{x}_j = \mathbf{b}_j$, for a set of vectors $\mathbf{x}_j$, j = 1, 2, . . . , each corresponding to a different, known right-hand side vector $\mathbf{b}_j$. In this task the key simplification is that the matrix A is held constant, while the right-hand sides, the b’s, are changed (§2.1–§2.10).
• Calculation of the matrix $\mathbf{A}^{-1}$ which is the matrix inverse of a square matrix A, i.e., $\mathbf{A} \cdot \mathbf{A}^{-1} = \mathbf{A}^{-1} \cdot \mathbf{A} = \mathbf{1}$, where 1 is the identity matrix (all zeros except for ones on the diagonal). This task is equivalent, for an N × N matrix A, to the previous task with N different $\mathbf{b}_j$’s (j = 1, 2, . . . , N), namely the unit vectors ($\mathbf{b}_j$ = all zero elements except for 1 in the jth component). The corresponding x’s are then the columns of the matrix inverse of A (§2.1 and §2.3).
• Calculation of the determinant of a square matrix A (§2.3).

If M < N, or if M = N but the equations are degenerate, then there are effectively fewer equations than unknowns. In this case there can be either no solution, or else more than one solution vector x. In the latter event, the solution space consists of a particular solution $\mathbf{x}_p$ added to any linear combination of (typically) N − M vectors (which are said to be in the nullspace of the matrix A). The task of finding the solution space of A involves

• Singular value decomposition of a matrix A. This subject is treated in §2.6.

In the opposite case there are more equations than unknowns, M > N. When this occurs there is, in general, no solution vector x to equation (2.0.1), and the set of equations is said to be overdetermined.
It happens frequently, however, that the best “compromise” solution is sought, the one that comes closest to satisfying all equations simultaneously. If closeness is defined in the least-squares sense, i.e., that the sum of the squares of the differences between the left- and right-hand sides of equation (2.0.1) be minimized, then the overdetermined linear problem reduces to a (usually) solvable linear problem, called the

• Linear least-squares problem. The reduced set of equations to be solved can be written as the N × N set of equations

$$
(\mathbf{A}^T \cdot \mathbf{A}) \cdot \mathbf{x} = (\mathbf{A}^T \cdot \mathbf{b})
\tag{2.0.4}
$$

where $\mathbf{A}^T$ denotes the transpose of the matrix A. Equations (2.0.4) are called the normal equations of the linear least-squares problem. There is a close connection between singular value decomposition and the linear least-squares problem, and the latter is also discussed in §2.6. You should be warned that direct solution of the normal equations (2.0.4) is not generally the best way to find least-squares solutions.

Some other topics in this chapter include

• Iterative improvement of a solution (§2.5)
• Various special forms: symmetric positive-definite (§2.9), tridiagonal (§2.4), band diagonal (§2.4), Toeplitz (§2.8), Vandermonde (§2.8), sparse (§2.7)
• Strassen’s “fast matrix inversion” (§2.11).
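To make the normal equations (2.0.4) concrete, here is a minimal sketch for the smallest interesting case, a straight line fitted to M points, so that N = 2. The function name fit_line_normal_eq and the zero-offset arrays are assumptions of this example. It is meant only to show what the matrices in (2.0.4) contain; as warned above, forming and solving the normal equations directly is not generally the best method.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: least-squares straight line y = c[0] + c[1]*t through
   m points, so that row i of A is (1, t[i]) and b[i] = y[i].
   Returns 0 on success, -1 if the 2 x 2 matrix A^T.A is singular. */
int fit_line_normal_eq(const double t[], const double y[], size_t m, double c[2])
{
    double s1 = 0.0, st = 0.0, stt = 0.0;    /* entries of A^T.A */
    double sy = 0.0, sty = 0.0;              /* entries of A^T.b */
    double det;
    size_t i;

    assert(m >= 2);
    for (i = 0; i < m; i++) {
        s1  += 1.0;
        st  += t[i];
        stt += t[i] * t[i];
        sy  += y[i];
        sty += t[i] * y[i];
    }
    det = s1 * stt - st * st;
    if (det == 0.0) return -1;
    c[0] = (stt * sy - st * sty) / det;      /* 2 x 2 solve by Cramer's rule */
    c[1] = (s1 * sty - st * sy) / det;
    return 0;
}
```

The subtraction that forms det is exactly the kind of close cancellation that makes this approach fragile when the columns of A are nearly dependent.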

Standard Subroutine Packages

We cannot hope, in this chapter or in this book, to tell you everything there is to know about the tasks that have been defined above. In many cases you will have no alternative but to use sophisticated black-box program packages. Several good ones are available, though not always in C. LINPACK was developed at Argonne National Laboratories and deserves particular mention because it is published, documented, and available for free use. A successor to LINPACK, LAPACK, is now becoming available. Packages available commercially (though not necessarily in C) include those in the IMSL and NAG libraries.

You should keep in mind that the sophisticated packages are designed with very large linear systems in mind. They therefore go to great effort to minimize not only the number of operations, but also the required storage. Routines for the various tasks are usually provided in several versions, corresponding to several possible simplifications in the form of the input coefficient matrix: symmetric, triangular, banded, positive definite, etc. If you have a large matrix in one of these forms, you should certainly take advantage of the increased efficiency provided by these different routines, and not just use the form provided for general matrices.

There is also a great watershed dividing routines that are direct (i.e., execute in a predictable number of operations) from routines that are iterative (i.e., attempt to converge to the desired answer in however many steps are necessary). Iterative methods become preferable when the battle against loss of significance is in danger of being lost, either due to large N or because the problem is close to singular. We will treat iterative methods only incompletely in this book, in §2.7 and in Chapters 18 and 19. These methods are important, but mostly beyond our scope.
We will, however, discuss in detail a technique which is on the borderline between direct and iterative methods, namely the iterative improvement of a solution that has been obtained by direct methods (§2.5).

CITED REFERENCES AND FURTHER READING:
Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins University Press).
Gill, P.E., Murray, W., and Wright, M.H. 1991, Numerical Linear Algebra and Optimization, vol. 1 (Redwood City, CA: Addison-Wesley).
Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), Chapter 4.
Dongarra, J.J., et al. 1979, LINPACK User’s Guide (Philadelphia: S.I.A.M.).
Coleman, T.F., and Van Loan, C. 1988, Handbook for Matrix Computations (Philadelphia: S.I.A.M.).
Forsythe, G.E., and Moler, C.B. 1967, Computer Solution of Linear Algebraic Systems (Englewood Cliffs, NJ: Prentice-Hall).
Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic Computation (New York: Springer-Verlag).
Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations (New York: Wiley).
Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison-Wesley), Chapter 2.
Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: McGraw-Hill), Chapter 9.

2.1 Gauss-Jordan Elimination

For inverting a matrix, Gauss-Jordan elimination is about as efficient as any other method. For solving sets of linear equations, Gauss-Jordan elimination produces both the solution of the equations for one or more right-hand side vectors b, and also the matrix inverse $\mathbf{A}^{-1}$. However, its principal weaknesses are (i) that it requires all the right-hand sides to be stored and manipulated at the same time, and (ii) that when the inverse matrix is not desired, Gauss-Jordan is three times slower than the best alternative technique for solving a single linear set (§2.3). The method’s principal strength is that it is as stable as any other direct method, perhaps even a bit more stable when full pivoting is used (see below).

If you come along later with an additional right-hand side vector, you can multiply it by the inverse matrix, of course. This does give an answer, but one that is quite susceptible to roundoff error, not nearly as good as if the new vector had been included with the set of right-hand side vectors in the first instance.

For these reasons, Gauss-Jordan elimination should usually not be your method of first choice, either for solving linear equations or for matrix inversion. The decomposition methods in §2.3 are better. Why do we give you Gauss-Jordan at all? Because it is straightforward, understandable, solid as a rock, and an exceptionally good “psychological” backup for those times that something is going wrong and you think it might be your linear-equation solver.

Some people believe that the backup is more than psychological, that Gauss-Jordan elimination is an “independent” numerical method. This turns out to be mostly myth. Except for the relatively minor differences in pivoting, described below, the actual sequence of operations performed in Gauss-Jordan elimination is very closely related to that performed by the routines in the next two sections.

For clarity, and to avoid writing endless ellipses (· · ·) we will write out equations only for the case of four equations and four unknowns, and with three different right-hand side vectors that are known in advance. You can write bigger matrices and extend the equations to the case of N × N matrices, with M sets of right-hand side vectors, in completely analogous fashion. The routine implemented below is, of course, general.


Elimination on Column-Augmented Matrices 

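Consider the linear matrix equation

$$
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\cdot
\left[
\begin{pmatrix} x_{11} \\ x_{21} \\ x_{31} \\ x_{41} \end{pmatrix}
\sqcup
\begin{pmatrix} x_{12} \\ x_{22} \\ x_{32} \\ x_{42} \end{pmatrix}
\sqcup
\begin{pmatrix} x_{13} \\ x_{23} \\ x_{33} \\ x_{43} \end{pmatrix}
\sqcup
\begin{pmatrix}
y_{11} & y_{12} & y_{13} & y_{14} \\
y_{21} & y_{22} & y_{23} & y_{24} \\
y_{31} & y_{32} & y_{33} & y_{34} \\
y_{41} & y_{42} & y_{43} & y_{44}
\end{pmatrix}
\right]
$$
$$
=
\left[
\begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{41} \end{pmatrix}
\sqcup
\begin{pmatrix} b_{12} \\ b_{22} \\ b_{32} \\ b_{42} \end{pmatrix}
\sqcup
\begin{pmatrix} b_{13} \\ b_{23} \\ b_{33} \\ b_{43} \end{pmatrix}
\sqcup
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\right]
\tag{2.1.1}
$$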

Here the raised dot (·) signifies matrix multiplication, while the operator $\sqcup$ just signifies column augmentation, that is, removing the abutting parentheses and making a wider matrix out of the operands of the $\sqcup$ operator.

It should not take you long to write out equation (2.1.1) and to see that it simply states that $x_{ij}$ is the ith component (i = 1, 2, 3, 4) of the vector solution of the jth right-hand side (j = 1, 2, 3), the one whose coefficients are $b_{ij}$, i = 1, 2, 3, 4; and that the matrix of unknown coefficients $y_{ij}$ is the inverse matrix of $a_{ij}$. In other words, the matrix solution of

$$
[\mathbf{A}] \cdot [\mathbf{x}_1 \sqcup \mathbf{x}_2 \sqcup \mathbf{x}_3 \sqcup \mathbf{Y}] = [\mathbf{b}_1 \sqcup \mathbf{b}_2 \sqcup \mathbf{b}_3 \sqcup \mathbf{1}]
\tag{2.1.2}
$$

where A and Y are square matrices, the $\mathbf{b}_i$’s and $\mathbf{x}_i$’s are column vectors, and 1 is the identity matrix, simultaneously solves the linear sets

$$
\mathbf{A} \cdot \mathbf{x}_1 = \mathbf{b}_1 \qquad
\mathbf{A} \cdot \mathbf{x}_2 = \mathbf{b}_2 \qquad
\mathbf{A} \cdot \mathbf{x}_3 = \mathbf{b}_3
\tag{2.1.3}
$$

and

$$
\mathbf{A} \cdot \mathbf{Y} = \mathbf{1}
\tag{2.1.4}
$$

Now it is also elementary to verify the following facts about (2.1.1):

• Interchanging any two rows of A and the corresponding rows of the b’s and of 1, does not change (or scramble in any way) the solution x’s and Y. Rather, it just corresponds to writing the same set of linear equations in a different order.
• Likewise, the solution set is unchanged and in no way scrambled if we replace any row in A by a linear combination of itself and any other row, as long as we do the same linear combination of the rows of the b’s and 1 (which then is no longer the identity matrix, of course).
• Interchanging any two columns of A gives the same solution set only if we simultaneously interchange corresponding rows of the x’s and of Y. In other words, this interchange scrambles the order of the rows in the solution. If we do this, we will need to unscramble the solution by restoring the rows to their original order.

Gauss-Jordan elimination uses one or more of the above operations to reduce the matrix A to the identity matrix. When this is accomplished, the right-hand side becomes the solution set, as one sees instantly from (2.1.2).
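The reduction just described can be sketched in a few lines of C. The version below is a simplified illustration and not the routine given later in this section: it uses only the first two operations (row interchanges and row combinations), keeps the right-hand sides in a separate array instead of building the inverse in place, and assumes flat, zero-offset, row-by-row storage. The name gauss_jordan_rows is hypothetical, and the rule of moving the largest available entry of the current column onto the diagonal before dividing is an assumption of this sketch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical sketch: reduce the n x n matrix a to the identity by row
   operations, applying the same operations to the n x m right-hand sides b.
   On success (return 0) b holds the solution vectors and a is destroyed.
   Returns -1 if no nonzero pivot can be found in some column. */
int gauss_jordan_rows(double *a, size_t n, double *b, size_t m)
{
    size_t col, row, k;

    assert(a != NULL && b != NULL && n > 0);
    for (col = 0; col < n; col++) {
        size_t piv = col;
        double pivinv;

        /* Row interchange: bring the largest entry on or below the diagonal
           of this column into the diagonal position. */
        for (row = col + 1; row < n; row++)
            if (fabs(a[row * n + col]) > fabs(a[piv * n + col])) piv = row;
        if (a[piv * n + col] == 0.0) return -1;
        if (piv != col) {
            for (k = 0; k < n; k++) {
                double t = a[col * n + k];
                a[col * n + k] = a[piv * n + k];
                a[piv * n + k] = t;
            }
            for (k = 0; k < m; k++) {
                double t = b[col * m + k];
                b[col * m + k] = b[piv * m + k];
                b[piv * m + k] = t;
            }
        }
        /* Scale the pivot row so that the diagonal element becomes one. */
        pivinv = 1.0 / a[col * n + col];
        for (k = 0; k < n; k++) a[col * n + k] *= pivinv;
        for (k = 0; k < m; k++) b[col * m + k] *= pivinv;
        /* Subtract the right multiple of the pivot row from every other row. */
        for (row = 0; row < n; row++) {
            double f;
            if (row == col) continue;
            f = a[row * n + col];
            for (k = 0; k < n; k++) a[row * n + k] -= f * a[col * n + k];
            for (k = 0; k < m; k++) b[row * m + k] -= f * b[col * m + k];
        }
    }
    return 0;
}
```

Passing the identity matrix as some of the columns of b returns the inverse of A in those columns, which is the content of equation (2.1.4).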



Pivoting

In “Gauss-Jordan elimination with no pivoting,” only the second operation in the above list is used. The first row is divided by the element $a_{11}$ (this being a trivial linear combination of the first row with any other row, with zero coefficient for the other row). Then the right amount of the first row is subtracted from each other row to make all the remaining $a_{i1}$’s zero. The first column of A now agrees with the identity matrix. We move to the second column and divide the second row by $a_{22}$, then subtract the right amount of the second row from rows 1, 3, and 4, so as to make their entries in the second column zero. The second column is now reduced to the identity form. And so on for the third and fourth columns. As we do these operations to A, we of course also do the corresponding operations to the b’s and to 1 (which by now no longer resembles the identity matrix in any way!).

Obviously we will run into trouble if we ever encounter a zero element on the (then current) diagonal when we are going to divide by the diagonal element. (The element that we divide by, incidentally, is called the pivot element or pivot.) Not so obvious, but true, is the fact that Gauss-Jordan elimination with no pivoting (no use of the first or third procedures in the above list) is numerically unstable in the presence of any roundoff error, even when a zero pivot is not encountered. You must never do Gauss-Jordan elimination (or Gaussian elimination, see below) without pivoting!

So what is this magic pivoting? Nothing more than interchanging rows (partial pivoting) or rows and columns (full pivoting), so as to put a particularly desirable element in the diagonal position from which the pivot is about to be selected.
Since we don’t want to mess up the part of the identity matrix that we have already built up, we can choose among elements that are both (i) on rows below (or on) the one that is about to be normalized, and also (ii) on columns to the right (or on) the column we are about to eliminate. Partial pivoting is easier than full pivoting, because we don’t have to keep track of the permutation of the solution vector. Partial pivoting makes available as pivots only the elements already in the correct column. It turns out that partial pivoting is “almost” as good as full pivoting, in a sense that can be made mathematically precise, but which need not concern us here (for discussion and references, see [1]). To show you both variants, we do full pivoting in the routine in this section, partial pivoting in §2.3. We have to state how to recognize a particularly desirable pivot when we see one. The answer to this is not completely known theoretically. It is known, both theoretically and in practice, that simply picking the largest (in magnitude) available element as the pivot is a very good choice. A curiosity of this procedure, however, is that the choice of pivot will depend on the original scaling of the equations. If we take the third linear equation in our original set and multiply it by a factor of a million, it is almost guaranteed that it will contribute the first pivot; yet the underlying solution of the equations is not changed by this multiplication! One therefore sometimes sees routines which choose as pivot that element which would have been largest if the original equations had all been scaled to have their largest coefficient normalized to unity. This is called implicit pivoting. There is some extra bookkeeping to keep track of the scale factors by which the rows would have been multiplied. (The routines in §2.3 include implicit pivoting, but the routine in this section does not.) Finally, let us consider the storage requirements of the method. 
With a little reflection you will see that at every stage of the algorithm, either an element of A is predictably a one or zero (if it is already in a part of the matrix that has been reduced to identity form) or else the exactly corresponding element of the matrix that started as 1 is predictably a one or zero (if its mate in A has not been reduced to the identity form). Therefore the matrix 1 does not have to exist as separate storage: The matrix inverse of A is gradually built up in A as the original A is destroyed. Likewise, the solution vectors x can gradually replace the right-hand side vectors b and share the same storage, since after each column in A is reduced, the corresponding row entry in the b’s is never again used.

Here is the routine for Gauss-Jordan elimination with full pivoting:

#include <math.h>
#include "nrutil.h"
#define SWAP(a,b) {temp=(a);(a)=(b);(b)=temp;}

void gaussj(float **a, int n, float **b, int m)
/* Linear equation solution by Gauss-Jordan elimination, equation (2.1.1) above. a[1..n][1..n] is the input matrix. b[1..n][1..m] is input containing the m right-hand side vectors. On output, a is replaced by its matrix inverse, and b is replaced by the corresponding set of solution vectors. */
{
	int *indxc,*indxr,*ipiv;
	int i,icol,irow,j,k,l,ll;
	float big,dum,pivinv,temp;

	indxc=ivector(1,n);    /* The integer arrays ipiv, indxr, and indxc are */
	indxr=ivector(1,n);    /* used for bookkeeping on the pivoting. */
	ipiv=ivector(1,n);
	for (j=1;j 1;
		if (x >= xx[jm] == ascnd) *jlo=jm;
		else jhi=jm;
	}
	if (x == xx[n]) *jlo=n-1;
	if (x == xx[1]) *jlo=1;
}

If your array xx is zero-offset, read the comment following locate, above.

After the Hunt

The problem: Routines locate and hunt return an index j such that your desired value lies between table entries xx[j] and xx[j+1], where xx[1..n] is the full length of the table. But, to obtain an m-point interpolated value using a routine like polint (§3.1) or ratint (§3.2), you need to supply much shorter xx and yy arrays, of length m. How do you make the connection?

The solution: Calculate

k = IMIN(IMAX(j-(m-1)/2,1),n+1-m)

(The macros IMIN and IMAX give the minimum and maximum of two integer arguments; see §1.2 and Appendix B.) This expression produces the index of the leftmost member of an m-point set of points centered (insofar as possible) between j and j+1, but bounded by 1 at the left and n at the right. C then lets you call the interpolation routine with array addresses offset by k, e.g.,

polint(&xx[k-1],&yy[k-1],m,...)
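The index calculation above can be written out without the macros, which makes the two clamping steps explicit. The function name window_start is hypothetical; the arithmetic is the expression for k given in the text, for a unit-offset table xx[1..n].

```c
#include <assert.h>

/* Hypothetical helper: index k of the leftmost of m points centered (insofar
   as possible) between j and j+1 in a unit-offset table xx[1..n]. */
int window_start(int j, int m, int n)
{
    int k;

    assert(m >= 1 && m <= n);
    k = j - (m - 1) / 2;                 /* center the m points on j, j+1 */
    if (k < 1) k = 1;                    /* IMAX(..., 1): stay inside the left end */
    if (k > n + 1 - m) k = n + 1 - m;    /* IMIN(..., n+1-m): stay inside the right end */
    return k;
}
```

For a table of n = 10 entries and m = 4, an interior result j = 5 gives k = 4 (points 4 through 7), while j = 9 is pulled back to k = 7 so that the last point used is xx[10].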

CITED REFERENCES AND FURTHER READING: Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading, MA: Addison-Wesley), §6.2.1.


Chapter 3.

Interpolation and Extrapolation

3.5 Coefficients of the Interpolating Polynomial

Occasionally you may wish to know not the value of the interpolating polynomial that passes through a (small!) number of points, but the coefficients of that polynomial. A valid use of the coefficients might be, for example, to compute simultaneous interpolated values of the function and of several of its derivatives (see §5.3), or to convolve a segment of the tabulated function with some other function, where the moments of that other function (i.e., its convolution with powers of x) are known analytically.

However, please be certain that the coefficients are what you need. Generally the coefficients of the interpolating polynomial can be determined much less accurately than its value at a desired abscissa. Therefore it is not a good idea to determine the coefficients only for use in calculating interpolating values. Values thus calculated will not pass exactly through the tabulated points, for example, while values computed by the routines in §3.1–§3.3 will pass exactly through such points.

Also, you should not mistake the interpolating polynomial (and its coefficients) for its cousin, the best fit polynomial through a data set. Fitting is a smoothing process, since the number of fitted coefficients is typically much less than the number of data points. Therefore, fitted coefficients can be accurately and stably determined even in the presence of statistical errors in the tabulated values. (See §14.8.) Interpolation, where the number of coefficients and number of tabulated points are equal, takes the tabulated values as perfect. If they in fact contain statistical errors, these can be magnified into oscillations of the interpolating polynomial in between the tabulated points.

As before, we take the tabulated points to be $y_i \equiv y(x_i)$. If the interpolating polynomial is written as

$$
y = c_0 + c_1 x + c_2 x^2 + \cdots + c_N x^N
\tag{3.5.1}
$$

then the $c_i$’s are required to satisfy the linear equation

$$
\begin{pmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^N \\
1 & x_1 & x_1^2 & \cdots & x_1^N \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^N
\end{pmatrix}
\cdot
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_N \end{pmatrix}
=
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_N \end{pmatrix}
\tag{3.5.2}
$$

This is a Vandermonde matrix, as described in §2.8. One could in principle solve equation (3.5.2) by standard techniques for linear equations generally (§2.3); however the special method that was derived in §2.8 is more efficient by a large factor, of order N , so it is much better. Remember that Vandermonde systems can be quite ill-conditioned. In such a case, no numerical method is going to give a very accurate answer. Such cases do not, please note, imply any difficulty in finding interpolated values by the methods of §3.1, but only difficulty in finding coefficients. Like the routine in §2.8, the following is due to G.B. Rybicki. Note that the arrays are all assumed to be zero-offset.
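One practical check on a set of coefficients, however they were obtained, is to evaluate (3.5.1) at the tabulated abscissas and compare with the tabulated $y_i$; each such evaluation is one row of equation (3.5.2). A minimal sketch using Horner's rule follows. The name poly_eval is hypothetical, and the coefficient array is assumed to be zero-offset, c[0..n].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: evaluate y = c[0] + c[1]*x + ... + c[n]*x^n,
   equation (3.5.1), by Horner's rule. The array c[0..n] is zero-offset. */
double poly_eval(const double c[], size_t n, double x)
{
    double y;
    size_t i;

    assert(c != NULL);
    y = c[n];
    for (i = n; i > 0; i--)
        y = y * x + c[i - 1];            /* fold in the next lower coefficient */
    return y;
}
```

For an ill-conditioned table the differences between poly_eval at the $x_i$ and the tabulated $y_i$ give a direct measure of how much accuracy was lost in finding the coefficients.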


#include "nrutil.h"

void polcoe(float x[], float y[], int n, float cof[])
/* Given arrays x[0..n] and y[0..n] containing a tabulated function $y_i = f(x_i)$, this routine returns an array of coefficients cof[0..n], such that $y_i = \sum_j \mathrm{cof}_j\, x_i^j$. */
{
	int k,j,i;
	float phi,ff,b,*s;

	s=vector(0,n);
	for (i=0;ir == x1.r && x->i == x1.i) return;    /* Converged. */
		if (iter % MT) *x=x1;
		else *x=Csub(*x,RCmul(frac[iter/MT],dx));
		/* Every so often we take a fractional step, to break any limit cycle (itself a rare occurrence). */
	}
	nrerror("too many iterations in laguer");
	/* Very unusual; can occur only for complex roots. Try a different starting guess for the root. */
	return;
}

Here is a driver routine that calls laguer in succession for each root, performs the deflation, optionally polishes the roots by the same Laguerre method (if you are not going to polish in some other way), and finally sorts the roots by their real parts. (We will use this routine in Chapter 13.)

#include <math.h>
#include "complex.h"
#define EPS 2.0e-6
#define MAXM 100
/* A small number, and maximum anticipated value of m. */

void zroots(fcomplex a[], int m, fcomplex roots[], int polish)
/* Given the degree m and the m+1 complex coefficients a[0..m] of the polynomial $\sum_{i=0}^{m} a(i)x^i$, this routine successively calls laguer and finds all m complex roots in roots[1..m]. The boolean variable polish should be input as true (1) if polishing (also by Laguerre’s method) is desired, false (0) if the roots will be subsequently polished by other means. */
{
	void laguer(fcomplex a[], int m, fcomplex *x, int *its);
	int i,its,j,jj;
	fcomplex x,b,c,ad[MAXM];

	for (j=0;j=1;j--) {    /* Loop over each root to be found. */
		x=Complex(0.0,0.0);    /* Start at zero to favor convergence to smallest remaining root, and find the root. */
		laguer(ad,j,&x,&its);
		if (fabs(x.i) =0;jj--) {
			c=ad[jj];
			ad[jj]=b;
			b=Cadd(Cmul(x,b),c);
		}
	}
	if (polish)
		for (j=1;j #include "nrutil.h"

void usrfun(float *x,int n,float *fvec,float **fjac);
#define FREERETURN {free_matrix(fjac,1,n,1,n);free_vector(fvec,1,n);\
	free_vector(p,1,n);free_ivector(indx,1,n);return;}

void mnewt(int ntrial, float x[], int n, float tolx, float tolf)
/* Given an initial guess x[1..n] for a root in n dimensions, take ntrial Newton-Raphson steps to improve the root. Stop if the root converges in either summed absolute variable increments tolx or summed absolute function values tolf. */
{
	void lubksb(float **a, int n, int *indx, float b[]);


	void ludcmp(float **a, int n, int *indx, float *d);
	int k,i,*indx;
	float errx,errf,d,*fvec,**fjac,*p;

	indx=ivector(1,n);
	p=vector(1,n);
	fvec=vector(1,n);
	fjac=matrix(1,n,1,n);
	for (k=1;k

500), then r is distributed approximately normally, with a mean of zero and a standard deviation of $1/\sqrt{N}$. In that case, the (double-sided) significance of the correlation, that is, the probability that |r| should be larger than its observed value in the null hypothesis, is

$$
\operatorname{erfc}\left(\frac{|r|\sqrt{N}}{\sqrt{2}}\right)
\tag{14.5.2}
$$

where erfc(x) is the complementary error function, equation (6.2.8), computed by the routines erffc or erfcc of §6.2. A small value of (14.5.2) indicates that the two distributions are significantly correlated. (See expression 14.5.9 below for a more accurate test.)

Most statistics books try to go beyond (14.5.2) and give additional statistical tests that can be made using r. In almost all cases, however, these tests are valid only for a very special class of hypotheses, namely that the distributions of x and y jointly form a binormal or two-dimensional Gaussian distribution around their mean values, with joint probability density

$$
p(x, y)\,dx\,dy = \text{const.} \times \exp\left[-\frac{1}{2}\left(a_{11}x^2 - 2a_{12}xy + a_{22}y^2\right)\right] dx\,dy
\tag{14.5.3}
$$

where $a_{11}$, $a_{12}$, and $a_{22}$ are arbitrary constants. For this distribution r has the value

$$
r = -\frac{a_{12}}{\sqrt{a_{11}a_{22}}}
\tag{14.5.4}
$$

There are occasions when (14.5.3) may be known to be a good model of the data. There may be other occasions when we are willing to take (14.5.3) as at least a rough and ready guess, since many two-dimensional distributions do resemble a binormal distribution, at least not too far out on their tails. In either situation, we can use (14.5.3) to go beyond (14.5.2) in any of several directions:

First, we can allow for the possibility that the number N of data points is not large. Here, it turns out that the statistic

$$
t = r\sqrt{\frac{N-2}{1-r^2}}
\tag{14.5.5}
$$

is distributed in the null case (of no correlation) like Student’s t-distribution with ν = N − 2 degrees of freedom, whose two-sided significance level is given by 1 − A(t|ν) (equation 6.4.7). As N becomes large, this significance and (14.5.2) become asymptotically the same, so that one never does worse by using (14.5.5), even if the binormal assumption is not well substantiated.

Second, when N is only moderately large (≥ 10), we can compare whether the difference of two significantly nonzero r’s, e.g., from different experiments, is itself significant. In other words, we can quantify whether a change in some control variable significantly alters an existing correlation between two other variables. This is done by using Fisher’s z-transformation to associate each measured r with a corresponding z,

$$
z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)
\tag{14.5.6}
$$

Then, each z is approximately normally distributed with a mean value

$$
\bar{z} = \frac{1}{2}\left[\ln\left(\frac{1+r_{\text{true}}}{1-r_{\text{true}}}\right) + \frac{r_{\text{true}}}{N-1}\right]
\tag{14.5.7}
$$

where $r_{\text{true}}$ is the actual or population value of the correlation coefficient, and with a standard deviation

$$
\sigma(z) \approx \frac{1}{\sqrt{N-3}}
\tag{14.5.8}
$$


Equations (14.5.7) and (14.5.8), when they are valid, give several useful statistical tests. For example, the significance level at which a measured value of r differs from some hypothesized value $r_{\text{true}}$ is given by

$$
\operatorname{erfc}\left(\frac{|z - \bar{z}|\sqrt{N-3}}{\sqrt{2}}\right)
\tag{14.5.9}
$$

where z and $\bar{z}$ are given by (14.5.6) and (14.5.7), with small values of (14.5.9) indicating a significant difference. (Setting $\bar{z} = 0$ makes expression 14.5.9 a more accurate replacement for expression 14.5.2 above.) Similarly, the significance of a difference between two measured correlation coefficients $r_1$ and $r_2$ is

$$
\operatorname{erfc}\left(\frac{|z_1 - z_2|}{\sqrt{2}\,\sqrt{\dfrac{1}{N_1-3} + \dfrac{1}{N_2-3}}}\right)
\tag{14.5.10}
$$

where $z_1$ and $z_2$ are obtained from $r_1$ and $r_2$ using (14.5.6), and where $N_1$ and $N_2$ are, respectively, the number of data points in the measurement of $r_1$ and $r_2$.

All of the significances above are two-sided. If you wish to disprove the null hypothesis in favor of a one-sided hypothesis, such as that $r_1 > r_2$ (where the sense of the inequality was decided a priori), then (i) if your measured $r_1$ and $r_2$ have the wrong sense, you have failed to demonstrate your one-sided hypothesis, but (ii) if they have the right ordering, you can multiply the significances given above by 0.5, which makes them more significant.

But keep in mind: These interpretations of the r statistic can be completely meaningless if the joint probability distribution of your variables x and y is too different from a binormal distribution.

#include <math.h>
#define TINY 1.0e-20    /* Will regularize the unusual case of complete correlation. */

void pearsn(float x[], float y[], unsigned long n, float *r, float *prob, float *z)
/* Given two arrays x[1..n] and y[1..n], this routine computes their correlation coefficient r (returned as r), the significance level at which the null hypothesis of zero correlation is disproved (prob whose small value indicates a significant correlation), and Fisher’s z (returned as z), whose value can be used in further statistical tests as described above. */
{
	float betai(float a, float b, float x);
	float erfcc(float x);
	unsigned long j;
	float yt,xt,t,df;
	float syy=0.0,sxy=0.0,sxx=0.0,ay=0.0,ax=0.0;

	for (j=1;j 0 vjn < 0

(19.1.27)

Note that this scheme is only first-order, not second-order, accurate in the calculation of the spatial derivatives. How can it be “better”? The answer is one that annoys the mathematicians: The goal of numerical simulations is not always “accuracy” in a strictly mathematical sense, but sometimes “fidelity” to the underlying physics in a sense that is looser and more pragmatic. In such contexts, some kinds of error are much more tolerable than others. Upwind differencing generally adds fidelity to problems where the advected variables are liable to undergo sudden changes of state, e.g., as they pass through shocks or other discontinuities. You will have to be guided by the specific nature of your own problem.

For the differencing scheme (19.1.27), the amplification factor (for constant v) is

$$
\xi = 1 - \left|\frac{v\Delta t}{\Delta x}\right|(1 - \cos k\Delta x) - i\,\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.28}
$$

$$
|\xi|^2 = 1 - 2\left|\frac{v\Delta t}{\Delta x}\right|\left(1 - \left|\frac{v\Delta t}{\Delta x}\right|\right)(1 - \cos k\Delta x)
\tag{19.1.29}
$$

So the stability criterion |ξ|² ≤ 1 is (again) simply the Courant condition (19.1.17).

[Figure 19.1.5. Representation of the staggered leapfrog differencing scheme (axes: t or n versus x or j). Note that information from two previous time slices is used in obtaining the desired point. This scheme is second-order accurate in both space and time.]

There are various ways of improving the accuracy of first-order upwind differencing. In the continuum equation, material originally a distance v∆t away arrives at a given point after a time interval ∆t. In the first-order method, the material always arrives from ∆x away. If v∆t ≪ ∆x (to ensure accuracy), this can cause a large error. One way of reducing this error is to interpolate u between j − 1 and j before transporting it. This gives effectively a second-order method. Various schemes for second-order upwind differencing are discussed and compared in [2-3].
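As a rough illustration of first-order upwind differencing, the following C sketch advances u by one timestep. The function name upwind_step, the argument layout, and the periodic treatment of the mesh ends are assumptions made for this example only; the one-sided difference is taken on the side the flow comes from (j − 1 where v > 0, j + 1 otherwise).

```c
#include <stddef.h>

/* One first-order upwind step for du/dt = -v du/dx on a periodic mesh.
   u, unew : arrays of length n (must not overlap)
   v       : velocity at each mesh point
   dt, dx  : timestep and mesh spacing (hypothetical parameter names) */
void upwind_step(const double *u, double *unew, const double *v,
                 size_t n, double dt, double dx)
{
    for (size_t j = 0; j < n; j++) {
        size_t jm = (j + n - 1) % n;   /* periodic neighbour j-1 */
        size_t jp = (j + 1) % n;       /* periodic neighbour j+1 */
        /* Difference toward the side the material arrives from. */
        double dudx = (v[j] > 0.0) ? (u[j] - u[jm]) / dx
                                   : (u[jp] - u[j]) / dx;
        unew[j] = u[j] - v[j] * dt * dudx;
    }
}
```

With v∆t = ∆x the update reduces to a pure shift by one mesh point, which is a convenient sanity check; for v∆t ≪ ∆x the scheme is stable but noticeably diffusive.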

Second-Order Accuracy in Time

When using a method that is first-order accurate in time but second-order accurate in space, one generally has to take v∆t significantly smaller than ∆x to achieve desired accuracy, say, by at least a factor of 5. Thus the Courant condition is not actually the limiting factor with such schemes in practice. However, there are schemes that are second-order accurate in both space and time, and these can often be pushed right to their stability limit, with correspondingly smaller computation times. For example, the staggered leapfrog method for the conservation equation (19.1.1) is defined as follows (Figure 19.1.5): Using the values of $u^n$ at time $t^n$, compute the fluxes $F_j^n$. Then compute new values $u^{n+1}$ using the time-centered values of the fluxes:

$$ u_j^{n+1} - u_j^{n-1} = -\frac{\Delta t}{\Delta x}\left(F_{j+1}^n - F_{j-1}^n\right) \tag{19.1.30} $$

The name comes from the fact that the time levels in the time derivative term “leapfrog” over the time levels in the space derivative term. The method requires that $u^{n-1}$ and $u^n$ be stored to compute $u^{n+1}$. For our simple model equation (19.1.6), staggered leapfrog takes the form

$$ u_j^{n+1} - u_j^{n-1} = -\frac{v\Delta t}{\Delta x}\left(u_{j+1}^n - u_{j-1}^n\right) \tag{19.1.31} $$
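A minimal C sketch of one staggered leapfrog step (19.1.31) might look as follows. The name leapfrog_step, the three-array calling convention, and the periodic mesh are assumptions for illustration; the caller is expected to rotate the arrays after each step so that two time levels are always available.

```c
#include <stddef.h>

/* One staggered leapfrog step for the model advective equation.
   uold : values at time level n-1
   u    : values at time level n
   unew : receives values at time level n+1
   All arrays have length n and the mesh is treated as periodic. */
void leapfrog_step(const double *uold, const double *u, double *unew,
                   size_t n, double v, double dt, double dx)
{
    double alpha = v * dt / dx;            /* Courant number */
    for (size_t j = 0; j < n; j++) {
        size_t jm = (j + n - 1) % n;
        size_t jp = (j + 1) % n;
        unew[j] = uold[j] - alpha * (u[jp] - u[jm]);
    }
}
```

Because the scheme needs two starting levels, the first step has to come from some other method (or from the initial data shifted analytically); that start-up choice is left to the caller here.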

The von Neumann stability analysis now gives a quadratic equation for ξ, rather than a linear one, because of the occurrence of three consecutive powers of ξ when the form (19.1.12) for an eigenmode is substituted into equation (19.1.31),

$$ \xi^2 - 1 = -2i\xi\,\frac{v\Delta t}{\Delta x}\sin k\Delta x \tag{19.1.32} $$

whose solution is

$$ \xi = -i\,\frac{v\Delta t}{\Delta x}\sin k\Delta x \pm \sqrt{1 - \left(\frac{v\Delta t}{\Delta x}\sin k\Delta x\right)^2} \tag{19.1.33} $$

Thus the Courant condition is again required for stability. In fact, in equation (19.1.33), |ξ|² = 1 for any v∆t ≤ ∆x. This is the great advantage of the staggered leapfrog method: There is no amplitude dissipation.

Staggered leapfrog differencing of equations like (19.1.20) is most transparent if the variables are centered on appropriate half-mesh points:

$$ \begin{aligned} r_{j+1/2}^n &\equiv v\left.\frac{\partial u}{\partial x}\right|_{j+1/2}^n = v\,\frac{u_{j+1}^n - u_j^n}{\Delta x} \\ s_j^{n+1/2} &\equiv \left.\frac{\partial u}{\partial t}\right|_j^{n+1/2} = \frac{u_j^{n+1} - u_j^n}{\Delta t} \end{aligned} \tag{19.1.34} $$

This is purely a notational convenience: we can think of the mesh on which r and s are defined as being twice as fine as the mesh on which the original variable u is defined. The leapfrog differencing of equation (19.1.20) is

$$ \begin{aligned} \frac{r_{j+1/2}^{n+1} - r_{j+1/2}^n}{\Delta t} &= v\,\frac{s_{j+1}^{n+1/2} - s_j^{n+1/2}}{\Delta x} \\ \frac{s_j^{n+1/2} - s_j^{n-1/2}}{\Delta t} &= v\,\frac{r_{j+1/2}^n - r_{j-1/2}^n}{\Delta x} \end{aligned} \tag{19.1.35} $$

If you substitute equation (19.1.22) in equation (19.1.35), you will find that once again the Courant condition is required for stability, and that there is no amplitude dissipation when it is satisfied. If we substitute equation (19.1.34) in equation (19.1.35), we find that equation (19.1.35) is equivalent to

$$ \frac{u_j^{n+1} - 2u_j^n + u_j^{n-1}}{(\Delta t)^2} = v^2\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2} \tag{19.1.36} $$

This is just the “usual” second-order differencing of the wave equation (19.1.2). We see that it is a two-level scheme, requiring both $u^n$ and $u^{n-1}$ to obtain $u^{n+1}$. In equation (19.1.35) this shows up as both $s^{n-1/2}$ and $r^n$ being needed to advance the solution.

For equations more complicated than our simple model equation, especially nonlinear equations, the leapfrog method usually becomes unstable when the gradients get large. The instability is related to the fact that odd and even mesh points are completely decoupled, like the black and white squares of a chess board, as shown in Figure 19.1.6. This mesh drifting instability is cured by coupling the two meshes through a numerical viscosity term, e.g., adding to the right side of (19.1.31) a small coefficient (≪ 1) times $u_{j+1}^n - 2u_j^n + u_{j-1}^n$. For more on stabilizing difference schemes by adding numerical dissipation, see, e.g., [4].

[Figure 19.1.6. Origin of mesh-drift instabilities in a staggered leapfrog scheme. If the mesh points are imagined to lie in the squares of a chess board, then white squares couple to themselves, black to themselves, but there is no coupling between white and black. The fix is to introduce a small diffusive mesh-coupling piece.]

The Two-Step Lax-Wendroff scheme is a second-order in time method that avoids large numerical dissipation and mesh drifting. One defines intermediate values $u_{j+1/2}$ at the half timesteps $t_{n+1/2}$ and the half mesh points $x_{j+1/2}$. These are calculated by the Lax scheme:

$$ u_{j+1/2}^{n+1/2} = \frac{1}{2}\left(u_{j+1}^n + u_j^n\right) - \frac{\Delta t}{2\Delta x}\left(F_{j+1}^n - F_j^n\right) \tag{19.1.37} $$

Using these variables, one calculates the fluxes $F_{j+1/2}^{n+1/2}$. Then the updated values $u_j^{n+1}$ are calculated by the properly centered expression

$$ u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(F_{j+1/2}^{n+1/2} - F_{j-1/2}^{n+1/2}\right) \tag{19.1.38} $$
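The two steps (19.1.37) and (19.1.38) can be sketched in C as below. This is only an illustration under stated assumptions: the flux is taken to be F = vu with constant v, the mesh is periodic, and the name lax_wendroff_step and its return convention are invented for the example.

```c
#include <stdlib.h>

/* One two-step Lax-Wendroff update for F = v*u on a periodic mesh.
   Returns 0 on success, -1 if the scratch array cannot be allocated. */
int lax_wendroff_step(const double *u, double *unew, size_t n,
                      double v, double dt, double dx)
{
    double *fhalf = malloc(n * sizeof *fhalf);  /* F at (j+1/2, n+1/2) */
    if (!fhalf) return -1;

    /* Step 1: provisional half-step values by the Lax scheme (19.1.37). */
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;
        double uhalf = 0.5 * (u[jp] + u[j])
                     - 0.5 * dt / dx * (v * u[jp] - v * u[j]);
        fhalf[j] = v * uhalf;
    }
    /* Step 2: properly centered update (19.1.38). */
    for (size_t j = 0; j < n; j++) {
        size_t jm = (j + n - 1) % n;
        unew[j] = u[j] - dt / dx * (fhalf[j] - fhalf[jm]);
    }
    free(fhalf);   /* the provisional values are not kept */
    return 0;
}
```

Only the half-step fluxes are stored, and only temporarily, which mirrors the remark that the halfstep points need no permanent storage on the grid.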

The provisional values $u_{j+1/2}^{n+1/2}$ are now discarded. (See Figure 19.1.7.) Let us investigate the stability of this method for our model advective equation, where F = vu. Substitute (19.1.37) in (19.1.38) to get

$$ u_j^{n+1} = u_j^n - \alpha\left[\frac{1}{2}\left(u_{j+1}^n + u_j^n\right) - \frac{1}{2}\alpha\left(u_{j+1}^n - u_j^n\right) - \frac{1}{2}\left(u_j^n + u_{j-1}^n\right) + \frac{1}{2}\alpha\left(u_j^n - u_{j-1}^n\right)\right] \tag{19.1.39} $$

where

$$ \alpha \equiv \frac{v\Delta t}{\Delta x} \tag{19.1.40} $$

Then

$$ \xi = 1 - i\alpha\sin k\Delta x - \alpha^2(1 - \cos k\Delta x) \tag{19.1.41} $$

so

$$ |\xi|^2 = 1 - \alpha^2(1 - \alpha^2)(1 - \cos k\Delta x)^2 \tag{19.1.42} $$

[Figure 19.1.7. Representation of the two-step Lax-Wendroff differencing scheme (axes: t or n versus x or j). Two halfstep points (⊗) are calculated by the Lax method. These, plus one of the original points, produce the new point via staggered leapfrog. Halfstep points are used only temporarily and do not require storage allocation on the grid. This scheme is second-order accurate in both space and time.]

The stability criterion |ξ|² ≤ 1 is therefore α² ≤ 1, or v∆t ≤ ∆x as usual. Incidentally, you should not think that the Courant condition is the only stability requirement that ever turns up in PDEs. It keeps doing so in our model examples just because those examples are so simple in form. The method of analysis is, however, general.

Except when α = 1, |ξ|² < 1 in (19.1.42), so some amplitude damping does occur. The effect is relatively small, however, for wavelengths large compared with the mesh size ∆x. If we expand (19.1.42) for small k∆x, we find

$$ |\xi|^2 = 1 - \alpha^2(1 - \alpha^2)\frac{(k\Delta x)^4}{4} + \cdots \tag{19.1.43} $$

The departure from unity occurs only at fourth order in k. This should be contrasted with equation (19.1.16) for the Lax method, which shows that

$$ |\xi|^2 = 1 - (1 - \alpha^2)(k\Delta x)^2 + \cdots \tag{19.1.44} $$

for small k∆x.


In summary, our recommendation for initial value problems that can be cast in flux-conservative form, and especially problems related to the wave equation, is to use the staggered leapfrog method when possible. We have personally had better success with it than with the Two-Step Lax-Wendroff method. For problems sensitive to transport errors, upwind differencing or one of its refinements should be considered.

Fluid Dynamics with Shocks As we alluded to earlier, the treatment of fluid dynamics problems with shocks has become a very complicated and very sophisticated subject. All we can attempt to do here is to guide you to some starting points in the literature. There are basically three important general methods for handling shocks. The oldest and simplest method, invented by von Neumann and Richtmyer, is to add artificial viscosity to the equations, modeling the way Nature uses real viscosity to smooth discontinuities. A good starting point for trying out this method is the differencing scheme in §12.11 of [1]. This scheme is excellent for nearly all problems in one spatial dimension. The second method combines a high-order differencing scheme that is accurate for smooth flows with a low order scheme that is very dissipative and can smooth the shocks. Typically, various upwind differencing schemes are combined using weights chosen to zero the low order scheme unless steep gradients are present, and also chosen to enforce various “monotonicity” constraints that prevent nonphysical oscillations from appearing in the numerical solution. References [2-3,5] are a good place to start with these methods. The third, and potentially most powerful method, is Godunov’s approach. Here one gives up the simple linearization inherent in finite differencing based on Taylor series and includes the nonlinearity of the equations explicitly. There is an analytic solution for the evolution of two uniform states of a fluid separated by a discontinuity, the Riemann shock problem. Godunov’s idea was to approximate the fluid by a large number of cells of uniform states, and piece them together using the Riemann solution. There have been many generalizations of Godunov’s approach, of which the most powerful is probably the PPM method [6]. 
Readable reviews of all these methods, discussing the difficulties arising when one-dimensional methods are generalized to multidimensions, are given in [7-9].

CITED REFERENCES AND FURTHER READING:
Ames, W.F. 1977, Numerical Methods for Partial Differential Equations, 2nd ed. (New York: Academic Press), Chapter 4.
Richtmyer, R.D., and Morton, K.W. 1967, Difference Methods for Initial Value Problems, 2nd ed. (New York: Wiley-Interscience). [1]
Centrella, J., and Wilson, J.R. 1984, Astrophysical Journal Supplement, vol. 54, pp. 229–249, Appendix B. [2]
Hawley, J.F., Smarr, L.L., and Wilson, J.R. 1984, Astrophysical Journal Supplement, vol. 55, pp. 211–246, §2c. [3]
Kreiss, H.-O. 1978, Numerical Methods for Solving Time-Dependent Problems for Partial Differential Equations (Montreal: University of Montreal Press), pp. 66ff. [4]
Harten, A., Lax, P.D., and Van Leer, B. 1983, SIAM Review, vol. 25, pp. 36–61. [5]
Woodward, P., and Colella, P. 1984, Journal of Computational Physics, vol. 54, pp. 174–201. [6]
Roache, P.J. 1976, Computational Fluid Dynamics (Albuquerque: Hermosa). [7]
Woodward, P., and Colella, P. 1984, Journal of Computational Physics, vol. 54, pp. 115–173. [8]
Rizzi, A., and Engquist, B. 1987, Journal of Computational Physics, vol. 72, pp. 1–69. [9]

19.2 Diffusive Initial Value Problems

Recall the model parabolic equation, the diffusion equation in one space dimension,

$$ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D\,\frac{\partial u}{\partial x}\right) \tag{19.2.1} $$

where D is the diffusion coefficient. Actually, this equation is a flux-conservative equation of the form considered in the previous section, with

$$ F = -D\,\frac{\partial u}{\partial x} \tag{19.2.2} $$

the flux in the x-direction. We will assume D ≥ 0, otherwise equation (19.2.1) has physically unstable solutions: A small disturbance evolves to become more and more concentrated instead of dispersing. (Don’t make the mistake of trying to find a stable differencing scheme for a problem whose underlying PDEs are themselves unstable!)

Even though (19.2.1) is of the form already considered, it is useful to consider it as a model in its own right. The particular form of flux (19.2.2), and its direct generalizations, occur quite frequently in practice. Moreover, we have already seen that numerical viscosity and artificial viscosity can introduce diffusive pieces like the right-hand side of (19.2.1) in many other situations.

Consider first the case when D is a constant. Then the equation

$$ \frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} \tag{19.2.3} $$

can be differenced in the obvious way:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D\left[\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2}\right] \tag{19.2.4} $$

This is the FTCS scheme again, except that it is a second derivative that has been differenced on the right-hand side. But this makes a world of difference! The FTCS scheme was unstable for the hyperbolic equation; however, a quick calculation shows that the amplification factor for equation (19.2.4) is

$$ \xi = 1 - \frac{4D\Delta t}{(\Delta x)^2}\sin^2\left(\frac{k\Delta x}{2}\right) \tag{19.2.5} $$

The requirement |ξ| ≤ 1 leads to the stability criterion

$$ \frac{2D\Delta t}{(\Delta x)^2} \le 1 \tag{19.2.6} $$
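To make (19.2.4) and (19.2.6) concrete, here is a small C sketch of an explicit FTCS diffusion step together with the largest timestep the stability criterion allows. The function names and the choice to hold the two end values fixed (a Dirichlet condition) are assumptions for the example.

```c
#include <stddef.h>

/* Largest stable FTCS timestep from 2*D*dt/dx^2 <= 1. */
double ftcs_max_dt(double D, double dx)
{
    return dx * dx / (2.0 * D);
}

/* One FTCS step of du/dt = D d2u/dx2 with constant D.
   u, unew : arrays of length n (must not overlap)
   The end values u[0] and u[n-1] are simply copied (assumed Dirichlet). */
void ftcs_diffusion_step(const double *u, double *unew, size_t n,
                         double D, double dt, double dx)
{
    double alpha = D * dt / (dx * dx);
    unew[0] = u[0];
    unew[n - 1] = u[n - 1];
    for (size_t j = 1; j + 1 < n; j++)
        unew[j] = u[j] + alpha * (u[j + 1] - 2.0 * u[j] + u[j - 1]);
}
```

A caller would typically choose dt as some fraction of ftcs_max_dt(D, dx); anything larger makes the shortest-wavelength modes grow.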



The physical interpretation of the restriction (19.2.6) is that the maximum allowed timestep is, up to a numerical factor, the diffusion time across a cell of width ∆x. More generally, the diffusion time τ across a spatial scale of size λ is of order

$$ \tau \sim \frac{\lambda^2}{D} \tag{19.2.7} $$

Usually we are interested in modeling accurately the evolution of features with spatial scales λ ≫ ∆x. If we are limited to timesteps satisfying (19.2.6), we will need to evolve through of order λ²/(∆x)² steps before things start to happen on the scale of interest. This number of steps is usually prohibitive. We must therefore find a stable way of taking timesteps comparable to, or perhaps (for accuracy) somewhat smaller than, the time scale of (19.2.7).

This goal poses an immediate “philosophical” question. Obviously the large timesteps that we propose to take are going to be woefully inaccurate for the small scales that we have decided not to be interested in. We want those scales to do something stable, “innocuous,” and perhaps not too physically unreasonable. We want to build this innocuous behavior into our differencing scheme. What should it be?

There are two different answers, each of which has its pros and cons. The first answer is to seek a differencing scheme that drives small-scale features to their equilibrium forms, e.g., satisfying equation (19.2.3) with the left-hand side set to zero. This answer generally makes the best physical sense; but, as we will see, it leads to a differencing scheme (“fully implicit”) that is only first-order accurate in time for the scales that we are interested in. The second answer is to let small-scale features maintain their initial amplitudes, so that the evolution of the larger-scale features of interest takes place superposed with a kind of “frozen in” (though fluctuating) background of small-scale stuff. This answer gives a differencing scheme (“Crank-Nicolson”) that is second-order accurate in time. Toward the end of an evolution calculation, however, one might want to switch over to some steps of the other kind, to drive the small-scale stuff into equilibrium. Let us now see where these distinct differencing schemes come from:

Consider the following differencing of (19.2.3),

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D\left[\frac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{(\Delta x)^2}\right] \tag{19.2.8} $$

This is exactly like the FTCS scheme (19.2.4), except that the spatial derivatives on the right-hand side are evaluated at timestep n + 1. Schemes with this character are called fully implicit or backward time, by contrast with FTCS (which is called fully explicit). To solve equation (19.2.8) one has to solve a set of simultaneous linear equations at each timestep for the $u_j^{n+1}$. Fortunately, this is a simple problem because the system is tridiagonal: Just group the terms in equation (19.2.8) appropriately:

$$ -\alpha u_{j-1}^{n+1} + (1 + 2\alpha)u_j^{n+1} - \alpha u_{j+1}^{n+1} = u_j^n, \qquad j = 1, 2, \ldots, J - 1 \tag{19.2.9} $$

where

$$ \alpha \equiv \frac{D\Delta t}{(\Delta x)^2} \tag{19.2.10} $$


Supplemented by Dirichlet or Neumann boundary conditions at j = 0 and j = J, equation (19.2.9) is clearly a tridiagonal system, which can easily be solved at each timestep by the method of §2.4.

What is the behavior of (19.2.8) for very large timesteps? The answer is seen most clearly in (19.2.9), in the limit α → ∞ (∆t → ∞). Dividing by α, we see that the difference equations are just the finite-difference form of the equilibrium equation

$$ \frac{\partial^2 u}{\partial x^2} = 0 \tag{19.2.11} $$
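The fully implicit step (19.2.9) can be sketched in C with a standard forward-elimination and back-substitution pass for the tridiagonal system. This is not the book's routine from §2.4; the function name, the Dirichlet treatment (end values held fixed), and the return codes are assumptions made for the illustration.

```c
#include <stdlib.h>

/* One fully implicit diffusion step:
     -alpha*x[j-1] + (1+2*alpha)*x[j] - alpha*x[j+1] = u[j],  j = 1..n-2,
   with x[0] = u[0] and x[n-1] = u[n-1] held fixed (assumed Dirichlet).
   alpha = D*dt/dx^2.  Returns 0 on success, -1 on bad size or allocation failure. */
int implicit_diffusion_step(const double *u, double *unew, size_t n,
                            double alpha)
{
    if (n < 3) return -1;
    size_t m = n - 2;                       /* number of interior unknowns */
    double *cp = malloc(m * sizeof *cp);    /* modified super-diagonal */
    double *dp = malloc(m * sizeof *dp);    /* modified right-hand side */
    if (!cp || !dp) { free(cp); free(dp); return -1; }

    double b = 1.0 + 2.0 * alpha;           /* diagonal element */
    unew[0] = u[0];
    unew[n - 1] = u[n - 1];

    /* Forward elimination; known boundary values move to the right side. */
    for (size_t i = 0; i < m; i++) {
        double rhs = u[i + 1];
        if (i == 0)     rhs += alpha * u[0];
        if (i == m - 1) rhs += alpha * u[n - 1];
        double denom = (i == 0) ? b : b + alpha * cp[i - 1];
        cp[i] = -alpha / denom;
        dp[i] = (i == 0) ? rhs / denom : (rhs + alpha * dp[i - 1]) / denom;
    }
    /* Back substitution. */
    unew[m] = dp[m - 1];
    for (size_t i = m - 1; i-- > 0; )
        unew[i + 1] = dp[i] - cp[i] * unew[i + 2];

    free(cp);
    free(dp);
    return 0;
}
```

Calling this with a very large alpha is a quick way to see the limit discussed above: the interior values relax to the straight line joining the two boundary values, i.e., the discrete solution of the equilibrium equation.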

What about stability? The amplification factor for equation (19.2.8) is

$$ \xi = \frac{1}{1 + 4\alpha\sin^2\left(\dfrac{k\Delta x}{2}\right)} \tag{19.2.12} $$

Clearly |ξ| < 1 for any stepsize ∆t. The scheme is unconditionally stable. The details of the small-scale evolution from the initial conditions are obviously inaccurate for large ∆t. But, as advertised, the correct equilibrium solution is obtained. This is the characteristic feature of implicit methods.

Here, on the other hand, is how one gets to the second of our above philosophical answers, combining the stability of an implicit method with the accuracy of a method that is second-order in both space and time. Simply form the average of the explicit and implicit FTCS schemes:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D}{2}\left[\frac{\left(u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}\right) + \left(u_{j+1}^n - 2u_j^n + u_{j-1}^n\right)}{(\Delta x)^2}\right] \tag{19.2.13} $$

Here both the left- and right-hand sides are centered at timestep n + 1/2, so the method is second-order accurate in time as claimed. The amplification factor is

$$ \xi = \frac{1 - 2\alpha\sin^2\left(\dfrac{k\Delta x}{2}\right)}{1 + 2\alpha\sin^2\left(\dfrac{k\Delta x}{2}\right)} \tag{19.2.14} $$

so the method is stable for any size ∆t. This scheme is called the Crank-Nicolson scheme, and is our recommended method for any simple diffusion problem (perhaps supplemented by a few fully implicit steps at the end). (See Figure 19.2.1.)

Now turn to some generalizations of the simple diffusion equation (19.2.3). Suppose first that the diffusion coefficient D is not constant, say D = D(x). We can adopt either of two strategies. First, we can make an analytic change of variable

$$ y = \int \frac{dx}{D(x)} \tag{19.2.15} $$

[Figure 19.2.1. Three differencing schemes for diffusive problems (shown as in Figure 19.1.2; axes: t or n versus x or j). (a) Forward Time Center Space (FTCS) is first-order accurate, but stable only for sufficiently small timesteps. (b) Fully Implicit is stable for arbitrarily large timesteps, but is still only first-order accurate. (c) Crank-Nicolson is second-order accurate, and is usually stable for large timesteps.]

Then

$$ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D(x)\,\frac{\partial u}{\partial x}\right) \tag{19.2.16} $$

becomes

$$ \frac{\partial u}{\partial t} = \frac{1}{D(y)}\,\frac{\partial^2 u}{\partial y^2} \tag{19.2.17} $$

and we evaluate D at the appropriate $y_j$. Heuristically, the stability criterion (19.2.6) in an explicit scheme becomes

$$ \Delta t \le \min_j\left[\frac{(\Delta y)^2}{2D_j^{-1}}\right] \tag{19.2.18} $$

Note that constant spacing ∆y in y does not imply constant spacing in x.

An alternative method that does not require analytically tractable forms for D is simply to difference equation (19.2.16) as it stands, centering everything appropriately. Thus the FTCS method becomes

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D_{j+1/2}\left(u_{j+1}^n - u_j^n\right) - D_{j-1/2}\left(u_j^n - u_{j-1}^n\right)}{(\Delta x)^2} \tag{19.2.19} $$

where

$$ D_{j+1/2} \equiv D(x_{j+1/2}) \tag{19.2.20} $$

and the heuristic stability criterion is

$$ \Delta t \le \min_j\left[\frac{(\Delta x)^2}{2D_{j+1/2}}\right] \tag{19.2.21} $$
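A short C sketch of (19.2.19) and (19.2.21) is given below. It assumes the caller has already tabulated D at the half-mesh points in an array Dh, with Dh[j] standing for D at x_{j+1/2}; that storage convention, the function names, and the fixed end values are assumptions for illustration.

```c
#include <float.h>
#include <stddef.h>

/* Heuristic timestep limit: min over j of dx^2 / (2*D_{j+1/2}).
   Dh has n-1 entries, Dh[j] = D(x_{j+1/2}).  Entries that are not
   positive impose no limit; DBL_MAX is returned if none do. */
double ftcs_variable_max_dt(const double *Dh, size_t n, double dx)
{
    double dtmax = DBL_MAX;
    for (size_t j = 0; j + 1 < n; j++) {
        if (Dh[j] > 0.0) {
            double lim = dx * dx / (2.0 * Dh[j]);
            if (lim < dtmax) dtmax = lim;
        }
    }
    return dtmax;
}

/* One FTCS step with spatially varying D, end values held fixed. */
void ftcs_variable_step(const double *u, double *unew, const double *Dh,
                        size_t n, double dt, double dx)
{
    double c = dt / (dx * dx);
    unew[0] = u[0];
    unew[n - 1] = u[n - 1];
    for (size_t j = 1; j + 1 < n; j++)
        unew[j] = u[j] + c * (Dh[j] * (u[j + 1] - u[j])
                            - Dh[j - 1] * (u[j] - u[j - 1]));
}
```

Differencing the flux at half points in this way keeps the scheme conservative: whatever leaves cell j through the face at j+1/2 enters cell j+1.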

The Crank-Nicolson method can be generalized similarly.

The second complication one can consider is a nonlinear diffusion problem, for example where D = D(u). Explicit schemes can be generalized in the obvious way. For example, in equation (19.2.19) write

$$ D_{j+1/2} = \frac{1}{2}\left[D(u_{j+1}^n) + D(u_j^n)\right] \tag{19.2.22} $$

Implicit schemes are not as easy. The replacement (19.2.22) with n → n + 1 leaves us with a nasty set of coupled nonlinear equations to solve at each timestep. Often there is an easier way: If the form of D(u) allows us to integrate

$$ dz = D(u)\,du \tag{19.2.23} $$

analytically for z(u), then the right-hand side of (19.2.1) becomes ∂²z/∂x², which we difference implicitly as

$$ \frac{z_{j+1}^{n+1} - 2z_j^{n+1} + z_{j-1}^{n+1}}{(\Delta x)^2} \tag{19.2.24} $$

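Now linearize each term on the right-hand side of equation (19.2.24), for example

$$ \begin{aligned} z_j^{n+1} \equiv z(u_j^{n+1}) &= z(u_j^n) + \left(u_j^{n+1} - u_j^n\right)\left.\frac{\partial z}{\partial u}\right|_{j,n} \\ &= z(u_j^n) + \left(u_j^{n+1} - u_j^n\right)D(u_j^n) \end{aligned} \tag{19.2.25} $$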

This reduces the problem to tridiagonal form again and in practice usually retains the stability advantages of fully implicit differencing.

Schrödinger Equation

Sometimes the physical problem being solved imposes constraints on the differencing scheme that we have not yet taken into account. For example, consider the time-dependent Schrödinger equation of quantum mechanics. This is basically a parabolic equation for the evolution of a complex quantity ψ. For the scattering of a wavepacket by a one-dimensional potential V(x), the equation has the form

$$ i\,\frac{\partial\psi}{\partial t} = -\frac{\partial^2\psi}{\partial x^2} + V(x)\psi \tag{19.2.26} $$

(Here we have chosen units so that Planck’s constant ħ = 1 and the particle mass m = 1/2.) One is given the initial wavepacket, ψ(x, t = 0), together with boundary conditions that ψ → 0 at x → ±∞. Suppose we content ourselves with first-order accuracy in time, but want to use an implicit scheme, for stability. A slight generalization of (19.2.8) leads to

$$ i\left[\frac{\psi_j^{n+1} - \psi_j^n}{\Delta t}\right] = -\left[\frac{\psi_{j+1}^{n+1} - 2\psi_j^{n+1} + \psi_{j-1}^{n+1}}{(\Delta x)^2}\right] + V_j\psi_j^{n+1} \tag{19.2.27} $$

for which

$$ \xi = \frac{1}{1 + i\left[\dfrac{4\Delta t}{(\Delta x)^2}\sin^2\left(\dfrac{k\Delta x}{2}\right) + V_j\Delta t\right]} \tag{19.2.28} $$

This is unconditionally stable, but unfortunately is not unitary. The underlying physical problem requires that the total probability of finding the particle somewhere remains unity. This is represented formally by the modulus-square norm of ψ remaining unity:

$$ \int_{-\infty}^{\infty} |\psi|^2\,dx = 1 \tag{19.2.29} $$

The initial wave function ψ(x, 0) is normalized to satisfy (19.2.29). The Schrödinger equation (19.2.26) then guarantees that this condition is satisfied at all later times.

Let us write equation (19.2.26) in the form

$$ i\,\frac{\partial\psi}{\partial t} = H\psi \tag{19.2.30} $$

where the operator H is

$$ H = -\frac{\partial^2}{\partial x^2} + V(x) \tag{19.2.31} $$

The formal solution of equation (19.2.30) is

$$ \psi(x, t) = e^{-iHt}\,\psi(x, 0) \tag{19.2.32} $$

where the exponential of the operator is defined by its power series expansion.

The unstable explicit FTCS scheme approximates (19.2.32) as

$$ \psi_j^{n+1} = (1 - iH\Delta t)\,\psi_j^n \tag{19.2.33} $$

where H is represented by a centered finite-difference approximation in x. The stable implicit scheme (19.2.27) is, by contrast,

$$ \psi_j^{n+1} = (1 + iH\Delta t)^{-1}\,\psi_j^n \tag{19.2.34} $$

These are both first-order accurate in time, as can be seen by expanding equation (19.2.32). However, neither operator in (19.2.33) or (19.2.34) is unitary.

The correct way to difference Schrödinger’s equation [1,2] is to use Cayley’s form for the finite-difference representation of $e^{-iHt}$, which is second-order accurate and unitary:

$$ e^{-iHt} \simeq \frac{1 - \frac{1}{2}iH\Delta t}{1 + \frac{1}{2}iH\Delta t} \tag{19.2.35} $$

In other words,

$$ \left(1 + \tfrac{1}{2}iH\Delta t\right)\psi_j^{n+1} = \left(1 - \tfrac{1}{2}iH\Delta t\right)\psi_j^n \tag{19.2.36} $$

On replacing H by its finite-difference approximation in x, we have a complex tridiagonal system to solve. The method is stable, unitary, and second-order accurate in space and time. In fact, it is simply the Crank-Nicolson method once again!

CITED REFERENCES AND FURTHER READING:
Ames, W.F. 1977, Numerical Methods for Partial Differential Equations, 2nd ed. (New York: Academic Press), Chapter 2.
Goldberg, A., Schey, H.M., and Schwartz, J.L. 1967, American Journal of Physics, vol. 35, pp. 177–186. [1]
Galbraith, I., Ching, Y.S., and Abraham, E. 1984, American Journal of Physics, vol. 52, pp. 60–68. [2]

19.3 Initial Value Problems in Multidimensions

The methods described in §19.1 and §19.2 for problems in 1 + 1 dimension (one space and one time dimension) can easily be generalized to N + 1 dimensions. However, the computing power necessary to solve the resulting equations is enormous. If you have solved a one-dimensional problem with 100 spatial grid points, solving the two-dimensional version with 100 × 100 mesh points requires at least 100 times as much computing. You generally have to be content with very modest spatial resolution in multidimensional problems.

Indulge us in offering a bit of advice about the development and testing of multidimensional PDE codes: You should always first run your programs on very small grids, e.g., 8 × 8, even though the resulting accuracy is so poor as to be useless. When your program is all debugged and demonstrably stable, then you can increase the grid size to a reasonable one and start looking at the results. We have actually heard someone protest, “my program would be unstable for a crude grid, but I am sure the instability will go away on a larger grid.” That is nonsense of a most pernicious sort, evidencing total confusion between accuracy and stability. In fact, new instabilities sometimes do show up on larger grids; but old instabilities never (in our experience) just go away.

Forced to live with modest grid sizes, some people recommend going to higher-order methods in an attempt to improve accuracy. This is very dangerous. Unless the solution you are looking for is known to be smooth, and the high-order method you are using is known to be extremely stable, we do not recommend anything higher than second-order in time (for sets of first-order equations). For spatial differencing, we recommend the order of the underlying PDEs, perhaps allowing second-order spatial differencing for first-order-in-space PDEs. When you increase the order of a differencing method to greater than the order of the original PDEs, you introduce spurious solutions to the difference equations. This does not create a problem if they all happen to decay exponentially; otherwise you are going to see all hell break loose!

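Lax Method for a Flux-Conservative Equation

As an example, we show how to generalize the Lax method (19.1.15) to two dimensions for the conservation equation

$$ \frac{\partial u}{\partial t} = -\nabla\cdot\mathbf{F} = -\left(\frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y}\right) \tag{19.3.1} $$

Use a spatial grid with

$$ x_j = x_0 + j\Delta, \qquad y_l = y_0 + l\Delta \tag{19.3.2} $$

We have chosen ∆x = ∆y ≡ ∆ for simplicity. Then the Lax scheme is

$$ u_{j,l}^{n+1} = \frac{1}{4}\left(u_{j+1,l}^n + u_{j-1,l}^n + u_{j,l+1}^n + u_{j,l-1}^n\right) - \frac{\Delta t}{2\Delta}\left(F_{j+1,l}^n - F_{j-1,l}^n + F_{j,l+1}^n - F_{j,l-1}^n\right) \tag{19.3.3} $$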

Note that as an abbreviated notation $F_{j+1}$ and $F_{j-1}$ refer to $F_x$, while $F_{l+1}$ and $F_{l-1}$ refer to $F_y$. Let us carry out a stability analysis for the model advective equation (analog of 19.1.6) with

$$ F_x = v_x u, \qquad F_y = v_y u \tag{19.3.4} $$

This requires an eigenmode with two dimensions in space, though still only a simple dependence on powers of ξ in time,

$$ u_{j,l}^n = \xi^n e^{ik_x j\Delta} e^{ik_y l\Delta} \tag{19.3.5} $$

Substituting in equation (19.3.3), we find

$$ \xi = \frac{1}{2}\left(\cos k_x\Delta + \cos k_y\Delta\right) - i\alpha_x\sin k_x\Delta - i\alpha_y\sin k_y\Delta \tag{19.3.6} $$

where

$$ \alpha_x = \frac{v_x\Delta t}{\Delta}, \qquad \alpha_y = \frac{v_y\Delta t}{\Delta} \tag{19.3.7} $$

The expression for |ξ|² can be manipulated into the form

$$ |\xi|^2 = 1 - \left(\sin^2 k_x\Delta + \sin^2 k_y\Delta\right)\left[\frac{1}{2} - \left(\alpha_x^2 + \alpha_y^2\right)\right] - \frac{1}{4}\left(\cos k_x\Delta - \cos k_y\Delta\right)^2 - \left(\alpha_y\sin k_x\Delta - \alpha_x\sin k_y\Delta\right)^2 \tag{19.3.8} $$

The last two terms are negative, and so the stability requirement |ξ|² ≤ 1 becomes

$$ \frac{1}{2} - \left(\alpha_x^2 + \alpha_y^2\right) \ge 0 \tag{19.3.9} $$

or

$$ \Delta t \le \frac{\Delta}{\sqrt{2}\,\left(v_x^2 + v_y^2\right)^{1/2}} \tag{19.3.10} $$

This is an example of the general result for the N-dimensional Courant condition: If |v| is the maximum propagation velocity in the problem, then

$$ \Delta t \le \frac{\Delta}{\sqrt{N}\,|v|} \tag{19.3.11} $$

is the Courant condition.

Diffusion Equation in Multidimensions

Let us consider the two-dimensional diffusion equation,

$$ \frac{\partial u}{\partial t} = D\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) \tag{19.3.12} $$

An explicit method, such as FTCS, can be generalized from the one-dimensional case in the obvious way. However, we have seen that diffusive problems are usually best treated implicitly. Suppose we try to implement the Crank-Nicolson scheme in two dimensions. This would give us

$$ u_{j,l}^{n+1} = u_{j,l}^n + \frac{1}{2}\alpha\left(\delta_x^2 u_{j,l}^{n+1} + \delta_x^2 u_{j,l}^n + \delta_y^2 u_{j,l}^{n+1} + \delta_y^2 u_{j,l}^n\right) \tag{19.3.13} $$

Here

$$ \alpha \equiv \frac{D\Delta t}{\Delta^2}, \qquad \Delta \equiv \Delta x = \Delta y \tag{19.3.14} $$

$$ \delta_x^2 u_{j,l}^n \equiv u_{j+1,l}^n - 2u_{j,l}^n + u_{j-1,l}^n \tag{19.3.15} $$

and similarly for $\delta_y^2 u_{j,l}^n$. This is certainly a viable scheme; the problem arises in solving the coupled linear equations. Whereas in one space dimension the system was tridiagonal, that is no longer true, though the matrix is still very sparse. One possibility is to use a suitable sparse matrix technique (see §2.7 and §19.0).

Another possibility, which we generally prefer, is a slightly different way of generalizing the Crank-Nicolson algorithm. It is still second-order accurate in time and space, and unconditionally stable, but the equations are easier to solve than (19.3.13). Called the alternating-direction implicit method (ADI), this embodies the powerful concept of operator splitting or time splitting, about which we will say more below. Here, the idea is to divide each timestep into two steps of size ∆t/2. In each substep, a different dimension is treated implicitly:

$$ \begin{aligned} u_{j,l}^{n+1/2} &= u_{j,l}^n + \frac{1}{2}\alpha\left(\delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^n\right) \\ u_{j,l}^{n+1} &= u_{j,l}^{n+1/2} + \frac{1}{2}\alpha\left(\delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^{n+1}\right) \end{aligned} \tag{19.3.16} $$

The advantage of this method is that each substep requires only the solution of a simple tridiagonal system.

Operator Splitting Methods Generally

The basic idea of operator splitting, which is also called time splitting or the method of fractional steps, is this: Suppose you have an initial value equation of the form

$$ \frac{\partial u}{\partial t} = \mathcal{L}u \tag{19.3.17} $$

where $\mathcal{L}$ is some operator. While $\mathcal{L}$ is not necessarily linear, suppose that it can at least be written as a linear sum of m pieces, which act additively on u,

$$ \mathcal{L}u = \mathcal{L}_1 u + \mathcal{L}_2 u + \cdots + \mathcal{L}_m u \tag{19.3.18} $$

Finally, suppose that for each of the pieces, you already know a differencing scheme for updating the variable u from timestep n to timestep n + 1, valid if that piece of the operator were the only one on the right-hand side. We will write these updatings symbolically as

$$ \begin{aligned} u^{n+1} &= U_1(u^n, \Delta t) \\ u^{n+1} &= U_2(u^n, \Delta t) \\ &\;\;\vdots \\ u^{n+1} &= U_m(u^n, \Delta t) \end{aligned} \tag{19.3.19} $$

Now, one form of operator splitting would be to get from n to n + 1 by the following sequence of updatings:

$$ \begin{aligned} u^{n+(1/m)} &= U_1(u^n, \Delta t) \\ u^{n+(2/m)} &= U_2(u^{n+(1/m)}, \Delta t) \\ &\;\;\vdots \\ u^{n+1} &= U_m(u^{n+(m-1)/m}, \Delta t) \end{aligned} \tag{19.3.20} $$
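The sequence (19.3.20) can be sketched in C with a small table of update functions applied one after another, each with the full ∆t. In this illustration the two pieces are an upwind advection update (assuming v > 0) and, purely to keep the example short, an explicit FTCS diffusion update rather than an implicit one; the struct, the function names, and the periodic mesh are all assumptions.

```c
#include <stddef.h>
#include <string.h>

typedef struct { double v, D, dx; } split_params;   /* hypothetical */

/* Each piece updates u in place over one full timestep dt. */
typedef void (*update_fn)(double *u, double *tmp, size_t n, double dt,
                          const split_params *p);

/* U1: du/dt = -v du/dx, first-order upwind, v > 0 assumed, periodic. */
static void advect_upwind(double *u, double *tmp, size_t n, double dt,
                          const split_params *p)
{
    double a = p->v * dt / p->dx;
    for (size_t j = 0; j < n; j++)
        tmp[j] = u[j] - a * (u[j] - u[(j + n - 1) % n]);
    memcpy(u, tmp, n * sizeof *u);
}

/* U2: du/dt = D d2u/dx2, explicit FTCS, periodic. */
static void diffuse_ftcs(double *u, double *tmp, size_t n, double dt,
                         const split_params *p)
{
    double a = p->D * dt / (p->dx * p->dx);
    for (size_t j = 0; j < n; j++)
        tmp[j] = u[j] + a * (u[(j + 1) % n] - 2.0 * u[j]
                             + u[(j + n - 1) % n]);
    memcpy(u, tmp, n * sizeof *u);
}

/* One split timestep: apply U1, then U2, each over the full dt.
   tmp is caller-supplied scratch of length n. */
void split_step(double *u, double *tmp, size_t n, double dt,
                const split_params *p)
{
    static const update_fn U[] = { advect_upwind, diffuse_ftcs };
    for (size_t i = 0; i < sizeof U / sizeof U[0]; i++)
        U[i](u, tmp, n, dt, p);
}
```

Swapping diffuse_ftcs for an implicit update would change only one entry in the table, which is the practical appeal of splitting: each piece can use whatever scheme suits it.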

For example, a combined advective-diffusion equation, such as

$$ \frac{\partial u}{\partial t} = -v\,\frac{\partial u}{\partial x} + D\,\frac{\partial^2 u}{\partial x^2} \tag{19.3.21} $$

might profitably use an explicit scheme for the advective term combined with a Crank-Nicolson or other implicit scheme for the diffusion term.

The alternating-direction implicit (ADI) method, equation (19.3.16), is an example of operator splitting with a slightly different twist. Let us reinterpret (19.3.19) to have a different meaning: Let $U_1$ now denote an updating method that includes algebraically all the pieces of the total operator $\mathcal{L}$, but which is desirably stable only for the $\mathcal{L}_1$ piece; likewise $U_2, \ldots, U_m$. Then a method of getting from $u^n$ to $u^{n+1}$ is

$$ \begin{aligned} u^{n+1/m} &= U_1(u^n, \Delta t/m) \\ u^{n+2/m} &= U_2(u^{n+1/m}, \Delta t/m) \\ &\;\;\vdots \\ u^{n+1} &= U_m(u^{n+(m-1)/m}, \Delta t/m) \end{aligned} \tag{19.3.22} $$

The timestep for each fractional step in (19.3.22) is now only 1/m of the full timestep, because each partial operation acts with all the terms of the original operator.

Equation (19.3.22) is usually, though not always, stable as a differencing scheme for the operator $\mathcal{L}$. In fact, as a rule of thumb, it is often sufficient to have stable $U_i$’s only for the operator pieces having the highest number of spatial derivatives (the other $U_i$’s can be unstable) to make the overall scheme stable!

It is at this point that we turn our attention from initial value problems to boundary value problems. These will occupy us for the remainder of the chapter.

CITED REFERENCES AND FURTHER READING:
Ames, W.F. 1977, Numerical Methods for Partial Differential Equations, 2nd ed. (New York: Academic Press).

19.4 Fourier and Cyclic Reduction Methods for Boundary Value Problems

As discussed in §19.0, most boundary value problems (elliptic equations, for example) reduce to solving large sparse linear systems of the form

$$ \mathbf{A}\cdot\mathbf{u} = \mathbf{b} \tag{19.4.1} $$

either once, for boundary value equations that are linear, or iteratively, for boundary value equations that are nonlinear.



Two important techniques lead to “rapid” solution of equation (19.4.1) when the sparse matrix is of certain frequently occurring forms. The Fourier transform method is directly applicable when the equations have coefficients that are constant in space. The cyclic reduction method is somewhat more general; its applicability is related to the question of whether the equations are separable (in the sense of “separation of variables”). Both methods require the boundaries to coincide with the coordinate lines. Finally, for some problems, there is a powerful combination of these two methods called FACR (Fourier Analysis and Cyclic Reduction). We now consider each method in turn, using equation (19.0.3), with finite-difference representation (19.0.6), as a model example. Generally speaking, the methods in this section are faster, when they apply, than the simpler relaxation methods discussed in §19.5; but they are not necessarily faster than the more complicated multigrid methods discussed in §19.6.

Fourier Transform Method

The discrete inverse Fourier transform in both x and y is

$$u_{jl} = \frac{1}{JL} \sum_{m=0}^{J-1} \sum_{n=0}^{L-1} \hat{u}_{mn}\, e^{-2\pi i jm/J}\, e^{-2\pi i ln/L} \tag{19.4.2}$$

This can be computed using the FFT independently in each dimension, or else all at once via the routine fourn of §12.4 or the routine rlft3 of §12.5. Similarly,

$$\rho_{jl} = \frac{1}{JL} \sum_{m=0}^{J-1} \sum_{n=0}^{L-1} \hat{\rho}_{mn}\, e^{-2\pi i jm/J}\, e^{-2\pi i ln/L} \tag{19.4.3}$$

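If we substitute expressions (19.4.2) and (19.4.3) in our model problem (19.0.6), we find

$$\hat{u}_{mn} \left( e^{2\pi i m/J} + e^{-2\pi i m/J} + e^{2\pi i n/L} + e^{-2\pi i n/L} - 4 \right) = \hat{\rho}_{mn} \Delta^2 \tag{19.4.4}$$

or

$$\hat{u}_{mn} = \frac{\hat{\rho}_{mn} \Delta^2}{2 \left( \cos\dfrac{2\pi m}{J} + \cos\dfrac{2\pi n}{L} - 2 \right)} \tag{19.4.5}$$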

Thus the strategy for solving equation (19.0.6) by FFT techniques is:

• Compute $\hat{\rho}_{mn}$ as the Fourier transform

$$\hat{\rho}_{mn} = \sum_{j=0}^{J-1} \sum_{l=0}^{L-1} \rho_{jl}\, e^{2\pi i mj/J}\, e^{2\pi i nl/L} \tag{19.4.6}$$

• Compute $\hat{u}_{mn}$ from equation (19.4.5).
• Compute $u_{jl}$ by the inverse Fourier transform (19.4.2).

The above procedure is valid for periodic boundary conditions. In other words, the solution satisfies

$$u_{jl} = u_{j+J,l} = u_{j,l+L} \tag{19.4.7}$$

Next consider a Dirichlet boundary condition u = 0 on the rectangular boundary. Instead of the expansion (19.4.2), we now need an expansion in sine waves:

$$u_{jl} = \frac{2}{J}\,\frac{2}{L} \sum_{m=1}^{J-1} \sum_{n=1}^{L-1} \hat{u}_{mn} \sin\frac{\pi jm}{J} \sin\frac{\pi ln}{L} \tag{19.4.8}$$

This satisfies the boundary conditions that u = 0 at j = 0, J and at l = 0, L. If we substitute this expansion and the analogous one for $\rho_{jl}$ into equation (19.0.6), we find that the solution procedure parallels that for periodic boundary conditions:

• Compute $\hat{\rho}_{mn}$ by the sine transform

$$\hat{\rho}_{mn} = \sum_{j=1}^{J-1} \sum_{l=1}^{L-1} \rho_{jl} \sin\frac{\pi jm}{J} \sin\frac{\pi ln}{L} \tag{19.4.9}$$

(A fast sine transform algorithm was given in §12.3.)

• Compute $\hat{u}_{mn}$ from the expression analogous to (19.4.5),

$$\hat{u}_{mn} = \frac{\Delta^2 \hat{\rho}_{mn}}{2 \left( \cos\dfrac{\pi m}{J} + \cos\dfrac{\pi n}{L} - 2 \right)} \tag{19.4.10}$$

• Compute $u_{jl}$ by the inverse sine transform (19.4.8).

If we have inhomogeneous boundary conditions, for example u = 0 on all boundaries except u = f(y) on the boundary x = J∆, we have to add to the above solution a solution $u^H$ of the homogeneous equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \tag{19.4.11}$$

that satisfies the required boundary conditions. In the continuum case, this would be an expression of the form

$$u^H = \sum_n A_n \sinh\frac{n\pi x}{J\Delta} \sin\frac{n\pi y}{L\Delta} \tag{19.4.12}$$

where $A_n$ would be found by requiring that u = f(y) at x = J∆. In the discrete case, we have

$$u^H_{jl} = \frac{2}{L} \sum_{n=1}^{L-1} A_n \sinh\frac{\pi nj}{J} \sin\frac{\pi nl}{L} \tag{19.4.13}$$



If $f(y = l\Delta) \equiv f_l$, then we get $A_n$ from the inverse formula

$$A_n = \frac{1}{\sinh \pi n} \sum_{l=1}^{L-1} f_l \sin\frac{\pi nl}{L} \tag{19.4.14}$$

The complete solution to the problem is

$$u = u_{jl} + u^H_{jl} \tag{19.4.15}$$

By adding appropriate terms of the form (19.4.12), we can handle inhomogeneous terms on any boundary surface.

A much simpler procedure for handling inhomogeneous terms is to note that whenever boundary terms appear on the left-hand side of (19.0.6), they can be taken over to the right-hand side since they are known. The effective source term is therefore $\rho_{jl}$ plus a contribution from the boundary terms. To implement this idea formally, write the solution as

$$u = u' + u^B \tag{19.4.16}$$

where $u' = 0$ on the boundary, while $u^B$ vanishes everywhere except on the boundary. There it takes on the given boundary value. In the above example, the only nonzero values of $u^B$ would be

$$u^B_{J,l} = f_l \tag{19.4.17}$$

The model equation (19.0.3) becomes

$$\nabla^2 u' = -\nabla^2 u^B + \rho \tag{19.4.18}$$

or, in finite-difference form,

$$u'_{j+1,l} + u'_{j-1,l} + u'_{j,l+1} + u'_{j,l-1} - 4u'_{j,l} = -\left( u^B_{j+1,l} + u^B_{j-1,l} + u^B_{j,l+1} + u^B_{j,l-1} - 4u^B_{j,l} \right) + \Delta^2 \rho_{j,l} \tag{19.4.19}$$

All the $u^B$ terms in equation (19.4.19) vanish except when the equation is evaluated at j = J − 1, where

$$u'_{J,l} + u'_{J-2,l} + u'_{J-1,l+1} + u'_{J-1,l-1} - 4u'_{J-1,l} = -f_l + \Delta^2 \rho_{J-1,l} \tag{19.4.20}$$

Thus the problem is now equivalent to the case of zero boundary conditions, except that one row of the source term is modified by the replacement

$$\Delta^2 \rho_{J-1,l} \rightarrow \Delta^2 \rho_{J-1,l} - f_l \tag{19.4.21}$$

The case of Neumann boundary conditions ∇u = 0 is handled by the cosine expansion (12.3.17):

$$u_{jl} = \frac{2}{J}\,\frac{2}{L} \sum_{m=0}^{J}{}'' \sum_{n=0}^{L}{}'' \hat{u}_{mn} \cos\frac{\pi jm}{J} \cos\frac{\pi ln}{L} \tag{19.4.22}$$



Here the double prime notation means that the terms for m = 0 and m = J should be multiplied by 1/2, and similarly for n = 0 and n = L. Inhomogeneous terms ∇u = g can be again included by adding a suitable solution of the homogeneous equation, or more simply by taking boundary terms over to the right-hand side. For example, the condition

$$\frac{\partial u}{\partial x} = g(y) \quad \text{at } x = 0 \tag{19.4.23}$$

becomes

$$\frac{u_{1,l} - u_{-1,l}}{2\Delta} = g_l \tag{19.4.24}$$

where $g_l \equiv g(y = l\Delta)$. Once again we write the solution in the form (19.4.16), where now $\nabla u' = 0$ on the boundary. This time $\nabla u^B$ takes on the prescribed value on the boundary, but $u^B$ vanishes everywhere except just outside the boundary. Thus equation (19.4.24) gives

$$u^B_{-1,l} = -2\Delta g_l \tag{19.4.25}$$

All the $u^B$ terms in equation (19.4.19) vanish except when j = 0:

$$u'_{1,l} + u'_{-1,l} + u'_{0,l+1} + u'_{0,l-1} - 4u'_{0,l} = 2\Delta g_l + \Delta^2 \rho_{0,l} \tag{19.4.26}$$

Thus $u'$ is the solution of a zero-gradient problem, with the source term modified by the replacement

$$\Delta^2 \rho_{0,l} \rightarrow \Delta^2 \rho_{0,l} + 2\Delta g_l \tag{19.4.27}$$

Sometimes Neumann boundary conditions are handled by using a staggered grid, with the u’s defined midway between zone boundaries so that first derivatives are centered on the mesh points. You can solve such problems using similar techniques to those described above if you use the alternative form of the cosine transform, equation (12.3.23).

Cyclic Reduction

Evidently the FFT method works only when the original PDE has constant coefficients, and boundaries that coincide with the coordinate lines. An alternative algorithm, which can be used on somewhat more general equations, is called cyclic reduction (CR). We illustrate cyclic reduction on the equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + b(y)\frac{\partial u}{\partial y} + c(y)\,u = g(x, y) \tag{19.4.28}$$

This form arises very often in practice from the Helmholtz or Poisson equations in polar, cylindrical, or spherical coordinate systems. More general separable equations are treated in [1].



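The finite-difference form of equation (19.4.28) can be written as a set of vector equations

$$\mathbf{u}_{j-1} + \mathbf{T} \cdot \mathbf{u}_j + \mathbf{u}_{j+1} = \mathbf{g}_j \Delta^2 \tag{19.4.29}$$

Here the index j comes from differencing in the x-direction, while the y-differencing (denoted by the index l previously) has been left in vector form. The matrix T has the form

$$\mathbf{T} = \mathbf{B} - 2\,\mathbf{1} \tag{19.4.30}$$

where the $2\,\mathbf{1}$ (twice the identity matrix) comes from the x-differencing and the matrix B from the y-differencing. The matrix B, and hence T, is tridiagonal with variable coefficients.

The CR method is derived by writing down three successive equations like (19.4.29):

$$\begin{aligned}
\mathbf{u}_{j-2} + \mathbf{T} \cdot \mathbf{u}_{j-1} + \mathbf{u}_j &= \mathbf{g}_{j-1} \Delta^2 \\
\mathbf{u}_{j-1} + \mathbf{T} \cdot \mathbf{u}_j + \mathbf{u}_{j+1} &= \mathbf{g}_j \Delta^2 \\
\mathbf{u}_j + \mathbf{T} \cdot \mathbf{u}_{j+1} + \mathbf{u}_{j+2} &= \mathbf{g}_{j+1} \Delta^2
\end{aligned} \tag{19.4.31}$$

Matrix-multiplying the middle equation by −T and then adding the three equations, we get

$$\mathbf{u}_{j-2} + \mathbf{T}^{(1)} \cdot \mathbf{u}_j + \mathbf{u}_{j+2} = \mathbf{g}^{(1)}_j \Delta^2 \tag{19.4.32}$$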

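This is an equation of the same form as (19.4.29), with

$$\mathbf{T}^{(1)} = 2\,\mathbf{1} - \mathbf{T}^2, \qquad \mathbf{g}^{(1)}_j = \mathbf{g}_{j-1} - \mathbf{T} \cdot \mathbf{g}_j + \mathbf{g}_{j+1} \tag{19.4.33}$$

(The factor $\Delta^2$ is already displayed explicitly on the right-hand side of (19.4.32), so it does not appear again in the definition of $\mathbf{g}^{(1)}_j$.)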

After one level of CR, we have reduced the number of equations by a factor of two. Since the resulting equations are of the same form as the original equation, we can repeat the process. Taking the number of mesh points to be a power of 2 for simplicity, we finally end up with a single equation for the central line of variables:

$$\mathbf{T}^{(f)} \cdot \mathbf{u}_{J/2} = \Delta^2 \mathbf{g}^{(f)}_{J/2} - \mathbf{u}_0 - \mathbf{u}_J \tag{19.4.34}$$

Here we have moved u0 and uJ to the right-hand side because they are known boundary values. Equation (19.4.34) can be solved for uJ/2 by the standard tridiagonal algorithm. The two equations at level f − 1 involve uJ/4 and u3J/4 . The equation for uJ/4 involves u0 and uJ/2 , both of which are known, and hence can be solved by the usual tridiagonal routine. A similar result holds true at every stage, so we end up solving J − 1 tridiagonal systems. In practice, equations (19.4.33) should be rewritten to avoid numerical instability. For these and other practical details, refer to [2].
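The index bookkeeping of cyclic reduction is easiest to see in a scalar analogue, where T is a single number t rather than a tridiagonal matrix, so that each "tridiagonal solve" becomes one division. The following sketch makes that simplifying assumption; the name `cyclic_reduction_scalar` is hypothetical, the array `r` plays the role of $\Delta^2 \mathbf{g}_j$, and the recurrence for t is applied in its raw form (19.4.33), which is adequate only for small J since, as noted above, it should be rewritten for stability in practice.

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar analogue of cyclic reduction: solve
       u[j-1] + t*u[j] + u[j+1] = r[j],   j = 1..J-1,
   with u[0] and u[J] given on input and J a power of 2.
   r[1..J-1] is overwritten by the reduced right-hand sides.
   Assumption: |t| >= 2, so that no reduced coefficient vanishes. */
void cyclic_reduction_scalar(double t, double *r, double *u, int J)
{
    int levels = 0, k, s, j;
    double *tk;

    assert(r != NULL && u != NULL);
    assert(J >= 2 && (J & (J - 1)) == 0);
    for (s = J; s > 1; s >>= 1) levels++;          /* levels = log2(J) */
    tk = malloc((size_t)levels * sizeof *tk);
    assert(tk != NULL);

    /* Reduction: level k couples points a distance s = 2^k apart. */
    tk[0] = t;
    for (k = 0, s = 1; k < levels - 1; k++, s *= 2) {
        for (j = 2 * s; j < J; j += 2 * s)
            r[j] = r[j - s] - tk[k] * r[j] + r[j + s];
        tk[k + 1] = 2.0 - tk[k] * tk[k];           /* T(1) = 2*1 - T^2 */
    }

    /* Back-substitution: s = J/2 gives the single central equation
       (19.4.34); each later pass fills in the points halfway between
       values that are already known. */
    for (k = levels - 1; k >= 0; k--, s /= 2)
        for (j = s; j < J; j += 2 * s)
            u[j] = (r[j] - u[j - s] - u[j + s]) / tk[k];

    free(tk);
}
```

In the full method each division by `tk[k]` is replaced by a solve with the matrix $\mathbf{T}^{(k)}$, and the products with T in the reduction step become matrix-vector products.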


FACR Method

The best way to solve equations of the form (19.4.28), including the constant coefficient problem (19.0.3), is a combination of Fourier analysis and cyclic reduction, the FACR method [3-6]. If at the rth stage of CR we Fourier analyze the equations of the form (19.4.32) along y, that is, with respect to the suppressed vector index, we will have a tridiagonal system in the x-direction for each y-Fourier mode:

$$\hat{u}^k_{j-2^r} + \lambda^{(r)}_k \hat{u}^k_j + \hat{u}^k_{j+2^r} = \Delta^2 g^{(r)k}_j \tag{19.4.35}$$

Here $\lambda^{(r)}_k$ is the eigenvalue of $\mathbf{T}^{(r)}$ corresponding to the kth Fourier mode. For the equation (19.0.3), equation (19.4.5) shows that $\lambda^{(r)}_k$ will involve terms like cos(2πk/L) − 2 raised to a power. Solve the tridiagonal systems for $\hat{u}^k_j$ at the levels $j = 2^r, 2 \times 2^r, 4 \times 2^r, \ldots, J - 2^r$. Fourier synthesize to get the y-values on these x-lines. Then fill in the intermediate x-lines as in the original CR algorithm.

The trick is to choose the number of levels of CR so as to minimize the total number of arithmetic operations. One can show that for a typical case of a 128×128 mesh, the optimal level is r = 2; asymptotically, $r \rightarrow \log_2(\log_2 J)$.

A rough estimate of running times for these algorithms for equation (19.0.3) is as follows: The FFT method (in both x and y) and the CR method are roughly comparable. FACR with r = 0 (that is, FFT in one dimension and solve the tridiagonal equations by the usual algorithm in the other dimension) gives about a factor of two gain in speed. The optimal FACR with r = 2 gives another factor of two gain in speed.

CITED REFERENCES AND FURTHER READING:
Swarztrauber, P.N. 1977, SIAM Review, vol. 19, pp. 490–501. [1]
Buzbee, B.L., Golub, G.H., and Nielson, C.W. 1970, SIAM Journal on Numerical Analysis, vol. 7, pp. 627–656; see also op. cit. vol. 11, pp. 753–763. [2]
Hockney, R.W. 1965, Journal of the Association for Computing Machinery, vol. 12, pp. 95–113. [3]
Hockney, R.W. 1970, in Methods of Computational Physics, vol. 9 (New York: Academic Press), pp. 135–211. [4]
Hockney, R.W., and Eastwood, J.W. 1981, Computer Simulation Using Particles (New York: McGraw-Hill), Chapter 6. [5]
Temperton, C. 1980, Journal of Computational Physics, vol. 34, pp. 314–329. [6]

19.5 Relaxation Methods for Boundary Value Problems

As we mentioned in §19.0, relaxation methods involve splitting the sparse matrix that arises from finite differencing and then iterating until a solution is found.

There is another way of thinking about relaxation methods that is somewhat more physical. Suppose we wish to solve the elliptic equation

$$\mathcal{L} u = \rho \tag{19.5.1}$$



where $\mathcal{L}$ represents some elliptic operator and ρ is the source term. Rewrite the equation as a diffusion equation,

$$\frac{\partial u}{\partial t} = \mathcal{L} u - \rho \tag{19.5.2}$$

An initial distribution u relaxes to an equilibrium solution as t → ∞. This equilibrium has all time derivatives vanishing. Therefore it is the solution of the original elliptic problem (19.5.1). We see that all the machinery of §19.2, on diffusive initial value equations, can be brought to bear on the solution of boundary value problems by relaxation methods.

Let us apply this idea to our model problem (19.0.3). The diffusion equation is

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} - \rho \tag{19.5.3}$$

If we use FTCS differencing (cf. equation 19.2.4), we get

$$u^{n+1}_{j,l} = u^n_{j,l} + \frac{\Delta t}{\Delta^2} \left( u^n_{j+1,l} + u^n_{j-1,l} + u^n_{j,l+1} + u^n_{j,l-1} - 4u^n_{j,l} \right) - \rho_{j,l} \Delta t \tag{19.5.4}$$

Recall from (19.2.6) that FTCS differencing is stable in one spatial dimension only if $\Delta t/\Delta^2 \le \frac{1}{2}$. In two dimensions this becomes $\Delta t/\Delta^2 \le \frac{1}{4}$. Suppose we try to take the largest possible timestep, and set $\Delta t = \Delta^2/4$. Then equation (19.5.4) becomes

$$u^{n+1}_{j,l} = \frac{1}{4} \left( u^n_{j+1,l} + u^n_{j-1,l} + u^n_{j,l+1} + u^n_{j,l-1} \right) - \frac{\Delta^2}{4} \rho_{j,l} \tag{19.5.5}$$

Thus the algorithm consists of using the average of u at its four nearest-neighbor points on the grid (plus the contribution from the source). This procedure is then iterated until convergence.

This method is in fact a classical method with origins dating back to the last century, called Jacobi’s method (not to be confused with the Jacobi method for eigenvalues). The method is not practical because it converges too slowly. However, it is the basis for understanding the modern methods, which are always compared with it.

Another classical method is the Gauss-Seidel method, which turns out to be important in multigrid methods (§19.6). Here we make use of updated values of u on the right-hand side of (19.5.5) as soon as they become available. In other words, the averaging is done “in place” instead of being “copied” from an earlier timestep to a later one. If we are proceeding along the rows, incrementing j for fixed l, we have

$$u^{n+1}_{j,l} = \frac{1}{4} \left( u^n_{j+1,l} + u^{n+1}_{j-1,l} + u^n_{j,l+1} + u^{n+1}_{j,l-1} \right) - \frac{\Delta^2}{4} \rho_{j,l} \tag{19.5.6}$$

This method is also slowly converging and only of theoretical interest when used by itself, but some analysis of it will be instructive.

Let us look at the Jacobi and Gauss-Seidel methods in terms of the matrix splitting concept. We change notation and call u “x,” to conform to standard matrix notation. To solve

$$\mathbf{A} \cdot \mathbf{x} = \mathbf{b} \tag{19.5.7}$$



we can consider splitting A as

$$\mathbf{A} = \mathbf{L} + \mathbf{D} + \mathbf{U} \tag{19.5.8}$$

where D is the diagonal part of A, L is the lower triangle of A with zeros on the diagonal, and U is the upper triangle of A with zeros on the diagonal. In the Jacobi method we write for the rth step of iteration

$$\mathbf{D} \cdot \mathbf{x}^{(r)} = -(\mathbf{L} + \mathbf{U}) \cdot \mathbf{x}^{(r-1)} + \mathbf{b} \tag{19.5.9}$$

For our model problem (19.5.5), D is simply the identity matrix. The Jacobi method converges for matrices A that are “diagonally dominant” in a sense that can be made mathematically precise. For matrices arising from finite differencing, this condition is usually met.

What is the rate of convergence of the Jacobi method? A detailed analysis is beyond our scope, but here is some of the flavor: The matrix $-\mathbf{D}^{-1} \cdot (\mathbf{L} + \mathbf{U})$ is the iteration matrix which, apart from an additive term, maps one set of x’s into the next. The iteration matrix has eigenvalues, each one of which reflects the factor by which the amplitude of a particular eigenmode of undesired residual is suppressed during one iteration. Evidently those factors had better all have modulus < 1 for the relaxation to work at all! The rate of convergence of the method is set by the rate for the slowest-decaying eigenmode, i.e., the factor with largest modulus. The modulus of this largest factor, therefore lying between 0 and 1, is called the spectral radius of the relaxation operator, denoted $\rho_s$.

The number of iterations r required to reduce the overall error by a factor $10^{-p}$ is thus estimated by

$$r \approx \frac{p \ln 10}{(-\ln \rho_s)} \tag{19.5.10}$$

In general, the spectral radius $\rho_s$ goes asymptotically to the value 1 as the grid size J is increased, so that more iterations are required. For any given equation, grid geometry, and boundary condition, the spectral radius can, in principle, be computed analytically. For example, for equation (19.5.5) on a J × J grid with Dirichlet boundary conditions on all four sides, the asymptotic formula for large J turns out to be

$$\rho_s \simeq 1 - \frac{\pi^2}{2J^2} \tag{19.5.11}$$

The number of iterations r required to reduce the error by a factor of $10^{-p}$ is thus

$$r \simeq \frac{2pJ^2 \ln 10}{\pi^2} \simeq \frac{1}{2} pJ^2 \tag{19.5.12}$$

In other words, the number of iterations is proportional to the number of mesh points, $J^2$. Since 100 × 100 and larger problems are common, it is clear that the Jacobi method is only of theoretical interest.



The Gauss-Seidel method, equation (19.5.6), corresponds to the matrix decomposition

$$(\mathbf{L} + \mathbf{D}) \cdot \mathbf{x}^{(r)} = -\mathbf{U} \cdot \mathbf{x}^{(r-1)} + \mathbf{b} \tag{19.5.13}$$

The fact that L is on the left-hand side of the equation follows from the updating in place, as you can easily check if you write out (19.5.13) in components. One can show [1-3] that the spectral radius is just the square of the spectral radius of the Jacobi method. For our model problem, therefore,

$$\rho_s \simeq 1 - \frac{\pi^2}{J^2} \tag{19.5.14}$$

$$r \simeq \frac{pJ^2 \ln 10}{\pi^2} \simeq \frac{1}{4} pJ^2 \tag{19.5.15}$$

The factor of two improvement in the number of iterations over the Jacobi method still leaves the method impractical.

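Successive Overrelaxation (SOR)

We get a better algorithm (one that was the standard algorithm until the 1970s) if we make an overcorrection to the value of $\mathbf{x}^{(r)}$ at the rth stage of Gauss-Seidel iteration, thus anticipating future corrections. Solve (19.5.13) for $\mathbf{x}^{(r)}$, add and subtract $\mathbf{x}^{(r-1)}$ on the right-hand side, and hence write the Gauss-Seidel method as

$$\mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - (\mathbf{L} + \mathbf{D})^{-1} \cdot [(\mathbf{L} + \mathbf{D} + \mathbf{U}) \cdot \mathbf{x}^{(r-1)} - \mathbf{b}] \tag{19.5.16}$$

The term in square brackets is just the residual vector $\boldsymbol{\xi}^{(r-1)}$, so

$$\mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - (\mathbf{L} + \mathbf{D})^{-1} \cdot \boldsymbol{\xi}^{(r-1)} \tag{19.5.17}$$

Now overcorrect, defining

$$\mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - \omega (\mathbf{L} + \mathbf{D})^{-1} \cdot \boldsymbol{\xi}^{(r-1)} \tag{19.5.18}$$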

Here ω is called the overrelaxation parameter, and the method is called successive overrelaxation (SOR). The following theorems can be proved [1-3]:

• The method is convergent only for 0 < ω < 2. If 0 < ω < 1, we speak of underrelaxation.
• Under certain mathematical restrictions generally satisfied by matrices arising from finite differencing, only overrelaxation (1 < ω < 2) can give faster convergence than the Gauss-Seidel method.
• If $\rho_{\rm Jacobi}$ is the spectral radius of the Jacobi iteration (so that the square of it is the spectral radius of the Gauss-Seidel iteration), then the optimal choice for ω is given by

$$\omega = \frac{2}{1 + \sqrt{1 - \rho^2_{\rm Jacobi}}} \tag{19.5.19}$$

• For this optimal choice, the spectral radius for SOR is

$$\rho_{\rm SOR} = \left( \frac{\rho_{\rm Jacobi}}{1 + \sqrt{1 - \rho^2_{\rm Jacobi}}} \right)^2 \tag{19.5.20}$$

As an application of the above results, consider our model problem for which $\rho_{\rm Jacobi}$ is given by equation (19.5.11). Then equations (19.5.19) and (19.5.20) give

$$\omega \simeq \frac{2}{1 + \pi/J} \tag{19.5.21}$$

$$\rho_{\rm SOR} \simeq 1 - \frac{2\pi}{J} \quad \text{for large } J \tag{19.5.22}$$

Equation (19.5.10) gives for the number of iterations to reduce the initial error by a factor of $10^{-p}$,

$$r \simeq \frac{pJ \ln 10}{2\pi} \simeq \frac{1}{3} pJ \tag{19.5.23}$$

Comparing with equation (19.5.12) or (19.5.15), we see that optimal SOR requires of order J iterations, as opposed to of order $J^2$. Since J is typically 100 or larger, this makes a tremendous difference! Equation (19.5.23) leads to the mnemonic that 3-figure accuracy (p = 3) requires a number of iterations equal to the number of mesh points along a side of the grid. For 6-figure accuracy, we require about twice as many iterations.

How do we choose ω for a problem for which the answer is not known analytically? That is just the weak point of SOR! The advantages of SOR obtain only in a fairly narrow window around the correct value of ω. It is better to take ω slightly too large, rather than slightly too small, but best to get it right.

One way to choose ω is to map your problem approximately onto a known problem, replacing the coefficients in the equation by average values. Note, however, that the known problem must have the same grid size and boundary conditions as the actual problem. We give for reference purposes the value of $\rho_{\rm Jacobi}$ for our model problem on a rectangular J × L grid, allowing for the possibility that ∆x ≠ ∆y:

$$\rho_{\rm Jacobi} = \frac{\cos\dfrac{\pi}{J} + \left( \dfrac{\Delta x}{\Delta y} \right)^2 \cos\dfrac{\pi}{L}}{1 + \left( \dfrac{\Delta x}{\Delta y} \right)^2} \tag{19.5.24}$$

Equation (19.5.24) holds for homogeneous Dirichlet or Neumann boundary conditions. For periodic boundary conditions, make the replacement π → 2π. A second way, which is especially useful if you plan to solve many similar elliptic equations each time with slightly different coefficients, is to determine the optimum value ω empirically on the first equation and then use that value for the remaining equations. Various automated schemes for doing this and for “seeking out” the best values of ω are described in the literature. While the matrix notation introduced earlier is useful for theoretical analyses, for practical implementation of the SOR algorithm we need explicit formulas.



Consider a general second-order elliptic equation in x and y, finite differenced on a square as for our model equation. Corresponding to each row of the matrix A is an equation of the form

$$a_{j,l} u_{j+1,l} + b_{j,l} u_{j-1,l} + c_{j,l} u_{j,l+1} + d_{j,l} u_{j,l-1} + e_{j,l} u_{j,l} = f_{j,l} \tag{19.5.25}$$

For our model equation, we had a = b = c = d = 1, e = −4. The quantity f is proportional to the source term. The iterative procedure is defined by solving (19.5.25) for $u_{j,l}$:

$$u^*_{j,l} = \frac{1}{e_{j,l}} \left( f_{j,l} - a_{j,l} u_{j+1,l} - b_{j,l} u_{j-1,l} - c_{j,l} u_{j,l+1} - d_{j,l} u_{j,l-1} \right) \tag{19.5.26}$$

Then $u^{\rm new}_{j,l}$ is a weighted average

$$u^{\rm new}_{j,l} = \omega u^*_{j,l} + (1 - \omega) u^{\rm old}_{j,l} \tag{19.5.27}$$

We calculate it as follows: The residual at any stage is

$$\xi_{j,l} = a_{j,l} u_{j+1,l} + b_{j,l} u_{j-1,l} + c_{j,l} u_{j,l+1} + d_{j,l} u_{j,l-1} + e_{j,l} u_{j,l} - f_{j,l} \tag{19.5.28}$$

and the SOR algorithm (19.5.18) or (19.5.27) is

$$u^{\rm new}_{j,l} = u^{\rm old}_{j,l} - \omega \frac{\xi_{j,l}}{e_{j,l}} \tag{19.5.29}$$

This formulation is very easy to program, and the norm of the residual vector $\xi_{j,l}$ can be used as a criterion for terminating the iteration.

Another practical point concerns the order in which mesh points are processed. The obvious strategy is simply to proceed in order down the rows (or columns). Alternatively, suppose we divide the mesh into “odd” and “even” meshes, like the red and black squares of a checkerboard. Then equation (19.5.26) shows that the odd points depend only on the even mesh values and vice versa. Accordingly, we can carry out one half-sweep updating the odd points, say, and then another half-sweep updating the even points with the new odd values. For the version of SOR implemented below, we shall adopt odd-even ordering.

The last practical point is that in practice the asymptotic rate of convergence in SOR is not attained until of order J iterations. The error often grows by a factor of 20 before convergence sets in. A trivial modification to SOR resolves this problem. It is based on the observation that, while ω is the optimum asymptotic relaxation parameter, it is not necessarily a good initial choice. In SOR with Chebyshev acceleration, one uses odd-even ordering and changes ω at each half-sweep according to the following prescription:

$$\begin{aligned}
\omega^{(0)} &= 1 \\
\omega^{(1/2)} &= 1/(1 - \rho^2_{\rm Jacobi}/2) \\
\omega^{(n+1/2)} &= 1/(1 - \rho^2_{\rm Jacobi}\, \omega^{(n)}/4), \qquad n = 1/2, 1, \ldots, \infty \\
\omega^{(\infty)} &\rightarrow \omega_{\rm optimal}
\end{aligned} \tag{19.5.30}$$



The beauty of Chebyshev acceleration is that the norm of the error always decreases with each iteration. (This is the norm of the actual error in $u_{j,l}$. The norm of the residual $\xi_{j,l}$ need not decrease monotonically.) While the asymptotic rate of convergence is the same as ordinary SOR, there is never any excuse for not using Chebyshev acceleration to reduce the total number of iterations required.

Here we give a routine for SOR with Chebyshev acceleration.

    #include <math.h>
    #define MAXITS 1000
    #define EPS 1.0e-5

    void sor(double **a, double **b, double **c, double **d, double **e,
        double **f, double **u, int jmax, double rjac)
    /* Successive overrelaxation solution of equation (19.5.25) with Chebyshev
       acceleration. a, b, c, d, e, and f are input as the coefficients of the
       equation, each dimensioned to the grid size [1..jmax][1..jmax]. u is input
       as the initial guess to the solution, usually zero, and returns with the
       final value. rjac is input as the spectral radius of the Jacobi iteration,
       or an estimate of it. */
    {
        void nrerror(char error_text[]);
        int ipass,j,jsw,l,lsw,n;
        double anorm,anormf=0.0,omega=1.0,resid;
        /* Double precision is a good idea for jmax bigger than about 25. */

        for (j=2;j<jmax;j++)
            /* Compute initial norm of residual and terminate iteration when
               norm has been reduced by a factor EPS. */
            for (l=2;l<jmax;l++)
                anormf += fabs(f[j][l]);    /* Assumes initial u is zero. */
        for (n=1;n

[...]

m = 2. Of course the P and R operators should enforce the boundary conditions for your problem. The easiest way to do this is to rewrite the difference equation to have homogeneous boundary conditions by modifying the source term if necessary (cf. §19.4). Enforcing homogeneous boundary conditions simply requires the P operator to produce zeros at the appropriate boundary points. The corresponding R is then found by $R = P^\dagger$.

Full Multigrid Algorithm

So far we have described multigrid as an iterative scheme, where one starts with some initial guess on the finest grid and carries out enough cycles (V-cycles, W-cycles, ...) to achieve convergence. This is the simplest way to use multigrid: Simply apply enough cycles until some appropriate convergence criterion is met. However, efficiency can be improved by using the Full Multigrid Algorithm (FMG), also known as nested iteration.

Instead of starting with an arbitrary approximation on the finest grid (e.g., $u_h = 0$), the first approximation is obtained by interpolating from a coarse-grid solution:

$$u_h = P u_H \tag{19.6.20}$$

The coarse-grid solution itself is found by a similar FMG process from even coarser grids. At the coarsest level, you start with the exact solution. Rather than proceed as in Figure 19.6.1, then, FMG gets to its solution by a series of increasingly tall “N’s,” each taller one probing a finer grid (see Figure 19.6.2).

Note that P in (19.6.20) need not be the same P used in the multigrid cycles. It should be at least of the same order as the discretization $L_h$, but sometimes a higher-order operator leads to greater efficiency.

It turns out that you usually need one or at most two multigrid cycles at each level before proceeding down to the next finer grid. While there is theoretical guidance on the required number of cycles (e.g., [2]), you can easily determine it empirically. Fix the finest level and study the solution values as you increase the number of cycles per level. The asymptotic value of the solution is the exact solution of the difference equations. The difference between this exact solution and the solution for a small number of cycles is the iteration error. Now fix the number of cycles to be large, and vary the number of levels, i.e., the smallest value of h used. In this way you can estimate the truncation error for a given h. In your final production code, there is no point in using more cycles than you need to get the iteration error down to the size of the truncation error.

The simple multigrid iteration (cycle) needs the right-hand side f only at the finest level. FMG needs f at all levels. If the boundary conditions are homogeneous, you can use $f_H = R f_h$. This prescription is not always safe for inhomogeneous boundary conditions. In that case it is better to discretize f on each coarse grid.

Note that the FMG algorithm produces the solution on all levels. It can therefore be combined with techniques like Richardson extrapolation.

[Figure 19.6.2 (image not reproduced): two panels, "4-grid, ncycle = 1" and "4-grid, ncycle = 2", with nodes labeled S and E.]

Figure 19.6.2. Structure of cycles for the full multigrid (FMG) method. This method starts on the coarsest grid, interpolates, and then refines (by “V’s”), the solution onto grids of increasing fineness.

We now give a routine mglin that implements the Full Multigrid Algorithm for a linear equation, the model problem (19.0.6). It uses red-black Gauss-Seidel as the smoothing operator, bilinear interpolation for P, and half-weighting for R. To change the routine to handle another linear problem, all you need do is modify the functions relax, resid, and slvsml appropriately. A feature of the routine is the dynamical allocation of storage for variables defined on the various grids.

    #include "nrutil.h"
    #define NPRE 1      /* Number of relaxation sweeps before ... */
    #define NPOST 1     /* ... and after the coarse-grid correction is computed. */
    #define NGMAX 15

    void mglin(double **u, int n, int ncycle)
    /* Full Multigrid Algorithm for solution of linear elliptic equation, here the
       model problem (19.0.6). On input u[1..n][1..n] contains the right-hand side
       rho, while on output it returns the solution. The dimension n must be of the
       form 2^j + 1 for some integer j. (j is actually the number of grid levels
       used in the solution, called ng below.) ncycle is the number of V-cycles to
       be used at each level. */
    {
        void addint(double **uf, double **uc, double **res, int nf);
        void copy(double **aout, double **ain, int n);
        void fill0(double **u, int n);
        void interp(double **uf, double **uc, int nf);
        void relax(double **u, double **rhs, int n);
        void resid(double **res, double **u, double **rhs, int n);
        void rstrct(double **uc, double **uf, int nc);
        void slvsml(double **u, double **rhs);
        unsigned int j,jcycle,jj,jpost,jpre,nf,ng=0,ngrid,nn;
        double **ires[NGMAX+1],**irho[NGMAX+1],**irhs[NGMAX+1],**iu[NGMAX+1];

        nn=n;
        while (nn >>= 1) ng++;
        if (n != 1+(1L [...] NGMAX) nrerror("increase NGMAX in mglin.");
        nn=n/2+1;
        ngrid=ng-1;
        irho[ngrid]=dmatrix(1,nn,1,nn);    /* Allocate storage for r.h.s. on grid ng-1, */
        rstrct(irho[ngrid],u,nn);          /* and fill it by restricting from the fine grid. */
        while (nn > 3) {                   /* Similarly allocate storage and fill r.h.s. on all coarse grids. */
            nn=nn/2+1;
            irho[--ngrid]=dmatrix(1,nn,1,nn);
            rstrct(irho[ngrid],irho[ngrid+1],nn);
        }
        nn=3;
        iu[1]=dmatrix(1,nn,1,nn);
        irhs[1]=dmatrix(1,nn,1,nn);
        slvsml(iu[1],irho[1]);             /* Initial solution on coarsest grid. */
        free_dmatrix(irho[1],1,nn,1,nn);
        ngrid=ng;
        for (j=2;j

[...]

        =0;n--,++(*nb)) {                  /* Loop over the bits in the stored Huffman code for ich. */
            nc=(*nb >> 3);
            if (++nc >= *lcode) {
                fprintf(stderr,"Reached the end of the 'code' array.\n");
                fprintf(stderr,"Attempting to expand its size.\n");
                *lcode *= 1.5;
                if ((*codep=(unsigned char *)realloc(*codep,
                    (unsigned)(*lcode*sizeof(unsigned char)))) == NULL) {
                    nrerror("Size expansion failed.");
                }
            }
            l=(*nb) & 7;
            if (!l) (*codep)[nc]=0;        /* Set appropriate bits in code. */
            if (hcode->icod[k] & setbit[n]) (*codep)[nc] |= setbit[l];
        }
    }

Decoding a Huffman-encoded message is slightly more complicated. The coding tree must be traversed from the top down, using up a variable number of bits:

    typedef struct {
        unsigned long *icod,*ncod,*left,*right,nch,nodemax;
    } huffcode;

    void hufdec(unsigned long *ich, unsigned char *code, unsigned long lcode,
        unsigned long *nb, huffcode *hcode)
    /* Starting at bit number nb in the character array code[1..lcode], use the
       Huffman code stored in the structure hcode to decode a single character
       (returned as ich in the range 0..nch-1) and increment nb appropriately.
       Repeated calls, starting with nb = 0 will return successive characters in
       a compressed message. The returned value ich=nch indicates end-of-message.
       The structure hcode must already have been defined and allocated in your
       main program, and also filled by a call to hufmak. */
    {
        long nc,node;
        static unsigned char setbit[8]={0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80};

        node=hcode->nodemax;
        for (;;) {                     /* Set node to the top of the decoding tree, and loop
                                          until a valid character is obtained. */
            nc=(*nb >> 3);
            if (++nc > lcode) {        /* Ran out of input; return with ich=nch indicating
                                          end of message. */
                *ich=hcode->nch;
                return;
            }
            node=(code[nc] & setbit[7 & (*nb)++] ?
                hcode->right[node] : hcode->left[node]);
            /* Branch left or right in tree, depending on its value. */
            if (node <= hcode->nch) {  /* If we reach a terminal node, we have a complete
                                          character and can return. */
                *ich=node-1;
                return;
            }
        }
    }

For simplicity, hufdec quits when it runs out of code bytes; if your coded message is not an integral number of bytes, and if Nch is less than 256, hufdec can return a spurious final character or two, decoded from the spurious trailing bits in your last code byte. If you have independent knowledge of the number of characters sent, you can readily discard these. Otherwise, you can fix this behavior by providing a bit, not byte, count, and modifying the routine accordingly. (When Nch is 256 or larger, hufdec will normally run out of code in the middle of a spurious character, and it will be discarded.)

Run-Length Encoding

For the compression of highly correlated bit-streams (for example the black or white values along a facsimile scan line), Huffman compression is often combined with run-length encoding: Instead of sending each bit, the input stream is converted to a series of integers indicating how many consecutive bits have the same value. These integers are then Huffman-compressed. The Group 3 CCITT facsimile standard functions in this manner, with a fixed, immutable, Huffman code, optimized for a set of eight standard documents [8,9].
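The conversion from bits to run lengths can be sketched as below. This is only an illustration of the idea, not the Group 3 scheme: the function name, the one-bit-per-byte input layout, and the choice to report the value of the first bit separately are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Convert bits[0..nbits-1] (one bit per array element, each 0 or 1) into
   a list of run lengths. runs must have room for nbits entries. The value
   of the first bit is returned through *first so that the stream can be
   reconstructed. Returns the number of runs written. Hypothetical helper. */
size_t run_lengths(const unsigned char *bits, size_t nbits,
                   unsigned long *runs, unsigned char *first)
{
    size_t i, last = 0;

    assert(bits != NULL && runs != NULL && first != NULL);
    if (nbits == 0) return 0;
    *first = bits[0];
    runs[0] = 1;
    for (i = 1; i < nbits; i++) {
        if (bits[i] == bits[i - 1])
            runs[last]++;              /* current run continues */
        else
            runs[++last] = 1;          /* value flipped: start a new run */
    }
    return last + 1;
}
```

The integers in `runs` are then the symbols that would be passed to a Huffman coder.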

CITED REFERENCES AND FURTHER READING:
Gallager, R.G. 1968, Information Theory and Reliable Communication (New York: Wiley).
Hamming, R.W. 1980, Coding and Information Theory (Englewood Cliffs, NJ: Prentice-Hall).
Storer, J.A. 1988, Data Compression: Methods and Theory (Rockville, MD: Computer Science Press).
Nelson, M. 1991, The Data Compression Book (Redwood City, CA: M&T Books).
Huffman, D.A. 1952, Proceedings of the Institute of Radio Engineers, vol. 40, pp. 1098–1101. [1]
Ziv, J., and Lempel, A. 1978, IEEE Transactions on Information Theory, vol. IT-24, pp. 530–536. [2]
Cleary, J.G., and Witten, I.H. 1984, IEEE Transactions on Communications, vol. COM-32, pp. 396–402. [3]
Welch, T.A. 1984, Computer, vol. 17, no. 6, pp. 8–19. [4]
Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. 1986, Communications of the ACM, vol. 29, pp. 320–330. [5]
Jones, D.W. 1988, Communications of the ACM, vol. 31, pp. 996–1007. [6]
Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), Chapter 22. [7]
Hunter, R., and Robinson, A.H. 1980, Proceedings of the IEEE, vol. 68, pp. 854–867. [8]
Marking, M.P. 1990, The C Users’ Journal, vol. 8, no. 6, pp. 45–54. [9]

Chapter 20.

Less-Numerical Algorithms

20.5 Arithmetic Coding

We saw in the previous section that a perfect (entropy-bounded) coding scheme would use L_i = −log2 p_i bits to encode character i (in the range 1 ≤ i ≤ Nch), if p_i is its probability of occurrence. Huffman coding gives a way of rounding the L_i’s to close integer values and constructing a code with those lengths. Arithmetic coding [1], which we now discuss, actually does manage to encode characters using noninteger numbers of bits! It also provides a convenient way to output the result not as a stream of bits, but as a stream of symbols in any desired radix. This latter property is particularly useful if you want, e.g., to convert data from bytes (radix 256) to printable ASCII characters (radix 94), or to case-independent alphanumeric sequences containing only A-Z and 0-9 (radix 36).

In arithmetic coding, an input message of any length is represented as a real number R in the range 0 ≤ R < 1. The longer the message, the more precision required of R. This is best illustrated by an example, so let us return to the fictitious language, Vowellish, of the previous section. Recall that Vowellish has a 5 character alphabet (A, E, I, O, U), with occurrence probabilities 0.12, 0.42, 0.09, 0.30, and 0.07, respectively. Figure 20.5.1 shows how a message beginning “IOU” is encoded: The interval [0, 1) is divided into segments corresponding to the 5 alphabetical characters; the length of a segment is the corresponding p_i. We see that the first message character, “I”, narrows the range of R to 0.37 ≤ R < 0.46. This interval is now subdivided into five subintervals, again with lengths proportional to the p_i’s. The second message character, “O”, narrows the range of R to 0.3763 ≤ R < 0.4033. The “U” character further narrows the range to 0.37630 ≤ R < 0.37819. Any value of R in this range can be sent as encoding “IOU”. In particular, the binary fraction .011000001 is in this range, so “IOU” can be sent in 9 bits. (Huffman coding took 10 bits for this example, see §20.4.)

Of course there is the problem of knowing when to stop decoding. The fraction .011000001 represents not simply “IOU,” but “IOU. . . ,” where the ellipsis represents an infinite string of successor characters. To resolve this ambiguity, arithmetic coding generally assumes the existence of a special (Nch + 1)th character, EOM (end of message), which occurs only once at the end of the input. Since EOM has a low probability of occurrence, it gets allocated only a very tiny piece of the number line.

In the above example, we gave R as a binary fraction. We could just as well have output it in any other radix, e.g., base 94 or base 36, whatever is convenient for the anticipated storage or communication channel.

You might wonder how one deals with the seemingly incredible precision required of R for a long message. The answer is that R is never actually represented all at once. At any given stage we have upper and lower bounds for R represented as a finite number of digits in the output radix. As digits of the upper and lower bounds become identical, we can left-shift them away and bring in new digits at the low-significance end. The routines below have a parameter NWK for the number of working digits to keep around. This must be large enough to make the chance of an accidental degeneracy vanishingly small. (The routines signal if a degeneracy ever occurs.) Since the process of discarding old digits and bringing in new ones is performed identically on encoding and decoding, everything stays synchronized.
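The interval narrowing for “IOU” can be checked with a few lines of ordinary floating-point arithmetic. The sketch below is for illustration only and is not how the routines in this section work (they never hold R all at once). The function name narrow_interval is hypothetical, and the ordering of the segments (U nearest 0, then O, I, E, A) is an assumption inferred from the intervals quoted above.

```c
#include <assert.h>
#include <string.h>

/* Vowellish symbols in the assumed order from 0 upward:
   U [0,0.07), O [0.07,0.37), I [0.37,0.46), E [0.46,0.88), A [0.88,1). */
static const char   symbols[] = "UOIEA";
static const double prob[]    = {0.07, 0.30, 0.09, 0.42, 0.12};

/* Narrow [*lo,*hi), starting from [0,1), once for each character of msg.
   Returns 0 on success, -1 if msg contains a character not in the alphabet. */
int narrow_interval(const char *msg, double *lo, double *hi)
{
    assert(msg != NULL && lo != NULL && hi != NULL);
    *lo = 0.0;
    *hi = 1.0;
    for (; *msg != '\0'; msg++) {
        const char *p = strchr(symbols, *msg);
        double width = *hi - *lo, cum = 0.0;
        int k, idx;

        if (p == NULL) return -1;
        idx = (int)(p - symbols);
        for (k = 0; k < idx; k++) cum += prob[k];  /* probability below symbol */
        *hi = *lo + width * (cum + prob[idx]);     /* uses the old lower bound */
        *lo = *lo + width * cum;
    }
    return 0;
}
```

Called with the message "IOU", the bounds come out as approximately 0.37630 and 0.37819, and the binary fraction .011000001 (that is, 1/4 + 1/8 + 1/512 = 0.376953125) lies between them.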

Figure 20.5.1. Arithmetic coding of the message “IOU...” in the fictitious language Vowellish. Successive characters give successively finer subdivisions of the initial interval between 0 and 1. The final value can be output as the digits of a fraction in any desired radix. Note how the subinterval allocated to a character is proportional to its probability of occurrence.

The routine arcmak constructs the cumulative frequency distribution table used to partition the interval at each stage. In the principal routine arcode, when an interval of size jdif is to be partitioned in the proportions of some n to some ntot, say, then we must compute (n*jdif)/ntot. With integer arithmetic, the numerator is likely to overflow; and, unfortunately, an expression like jdif/(ntot/n) is not equivalent. In the implementation below, we resort to double precision floating arithmetic for this calculation. Not only is this inefficient, but different roundoff errors can (albeit very rarely) make different machines encode differently, though any one type of machine will decode exactly what it encoded, since identical roundoff errors occur in the two processes. For serious use, one needs to replace this floating calculation with an integer computation in a double register (not available to the C programmer).

The internally set variable minint, which is the minimum allowed number of discrete steps between the upper and lower bounds, determines when new low-significance digits are added. minint must be large enough to provide resolution of all the input characters. That is, we must have p_i × minint > 1 for all i. A value of 100 Nch, or 1.1/min p_i, whichever is larger, is generally adequate. However, for safety, the routine below takes minint to be as large as possible, with the product minint*nradd just smaller than overflow. This results in some time inefficiency, and in a few unnecessary characters being output at the end of a message. You can

decrease minint if you want to live closer to the edge.

A final safety feature in arcmak is its refusal to believe zero values in the table nfreq; a 0 is treated as if it were a 1. If this were not done, the occurrence in a message of a single character whose nfreq entry is zero would result in scrambling the entire rest of the message. If you want to live dangerously, with a very slightly more efficient coding, you can delete the IMAX( ,1) operation.

#include "nrutil.h"
#include <limits.h>                  /* ANSI header file containing integer ranges. */
#define MC 512
#ifdef ULONG_MAX                     /* Maximum value of unsigned long. */
#define MAXINT (ULONG_MAX >> 1)
#else
#define MAXINT 2147483647
#endif
/* Here MC is the largest anticipated value of nchh; MAXINT is a large positive
   integer that does not overflow. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcmak(unsigned long nfreq[], unsigned long nchh, unsigned long nradd,
    arithcode *acode)
/* Given a table nfreq[1..nchh] of the frequency of occurrence of nchh symbols,
   and given a desired output radix nradd, initialize the cumulative frequency
   table and other variables for arithmetic compression in the structure acode. */
{
    unsigned long j;

    if (nchh > MC) nrerror("input radix may not exceed MC in arcmak.");
    if (nradd > 256) nrerror("output radix may not exceed 256 in arcmak.");
    acode->minint=MAXINT/nradd;
    acode->nch=nchh;
    acode->nrad=nradd;
    acode->ncumfq[1]=0;
    for (j=2;j<=acode->nch+1;j++)
        acode->ncumfq[j]=acode->ncumfq[j-1]+IMAX(nfreq[j-1],1);
    acode->ncum=acode->ncumfq[acode->nch+2]=acode->ncumfq[acode->nch+1]+1;
}
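To see what the cumulative table looks like, here is a stand-alone sketch of the same construction using ordinary zero-based C arrays. The name build_cumfreq and the zero-based layout are assumptions made for this illustration; the logic (zero frequencies counted as 1, plus one extra slot of width 1 for the end-of-message symbol) is meant to mirror arcmak.

```c
#include <assert.h>
#include <stddef.h>

/* Build cum[0..nch+1] from freq[0..nch-1].  cum[i] is the total count of all
   symbols before symbol i.  A zero frequency is counted as 1 so that every
   symbol keeps a nonempty subinterval.  One extra slot of width 1 is appended
   for the end-of-message symbol.  Returns the grand total cum[nch+1]. */
unsigned long build_cumfreq(const unsigned long freq[], size_t nch,
                            unsigned long cum[])
{
    size_t i;

    assert(nch > 0);
    cum[0] = 0;
    for (i = 0; i < nch; i++)
        cum[i + 1] = cum[i] + (freq[i] > 0 ? freq[i] : 1);
    cum[nch + 1] = cum[nch] + 1;     /* end-of-message slot */
    return cum[nch + 1];
}
```

With frequencies 12, 42, 9, 30, 7 (the Vowellish probabilities scaled by 100) the table is 0, 12, 54, 63, 93, 100, 101. Symbol i then owns the fraction of the current interval between cum[i]/101 and cum[i+1]/101.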

The structure acode must be defined and allocated in your main program with statements like this:

#include "nrutil.h"
#define MC 512                       /* Maximum anticipated value of nchh in arcmak. */
#define NWK 20                       /* Keep this value the same as in arcode, below. */
typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;
...
arithcode acode;
...
acode.ilob=(unsigned long *)lvector(1,NWK);      /* Allocate space within acode. */
acode.iupb=(unsigned long *)lvector(1,NWK);
acode.ncumfq=(unsigned long *)lvector(1,MC+2);

Individual characters in a message are coded or decoded by the routine arcode, which in turn uses the utility arcsum.

#include <stdio.h>
#include <stdlib.h>
#define NWK 20
#define JTRY(j,k,m) ((long)((((double)(k))*((double)(j)))/((double)(m))))
/* This macro is used to calculate (k*j)/m without overflow. Program efficiency
   can be improved by substituting an assembly language routine that does
   integer multiply to a double register. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcode(unsigned long *ich, unsigned char **codep, unsigned long *lcode,
    unsigned long *lcd, int isign, arithcode *acode)
/* Compress (isign = 1) or decompress (isign = -1) the single character ich into
   or out of the character array *codep[1..lcode], starting with byte
   *codep[lcd] and (if necessary) incrementing lcd so that, on return, lcd
   points to the first unused byte in *codep. Note that the structure acode
   contains both information on the code, and also state information on the
   particular output being written into the array *codep. An initializing call
   with isign=0 is required before beginning any *codep array, whether for
   encoding or decoding. This is in addition to the initializing call to arcmak
   that is required to initialize the code itself. A call with ich=nch (as set
   in arcmak) has the reserved meaning "end of message." */
{
    void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
        int nwk, unsigned long nrad, unsigned long nc);
    void nrerror(char error_text[]);
    int j,k;
    unsigned long ihi,ja,jh,jl,m;

    if (!isign) {    /* Initialize enough digits of the upper and lower bounds. */
        acode->jdif=acode->nrad-1;
        for (j=NWK;j>=1;j--) {
            acode->iupb[j]=acode->nrad-1;
            acode->ilob[j]=0;
            acode->nc=j;
            if (acode->jdif > acode->minint) return;    /* Initialization complete. */
            acode->jdif=(acode->jdif+1)*acode->nrad-1;
        }
        nrerror("NWK too small in arcode.");
    } else {
        if (isign > 0) {    /* If encoding, check for valid input character. */
            if (*ich > acode->nch) nrerror("bad ich in arcode.");
        } else {            /* If decoding, locate the character ich by bisection. */
            ja=(*codep)[*lcd]-acode->ilob[acode->nc];
            for (j=acode->nc+1;j<=NWK;j++) {
                ja *= acode->nrad;
                ja += ((*codep)[*lcd+j-acode->nc]-acode->ilob[j]);
            }
            ihi=acode->nch+1;
            *ich=0;
            while (ihi-(*ich) > 1) {
                m=(*ich+ihi)>>1;
                if (ja >= JTRY(acode->jdif,acode->ncumfq[m+1],acode->ncum))
                    *ich=m;
                else ihi=m;
            }
            if (*ich == acode->nch) return;    /* Detected end of message. */
        }
        /* Following code is common for encoding and decoding. Convert character
           ich to a new subrange [ilob,iupb). */
        jh=JTRY(acode->jdif,acode->ncumfq[*ich+2],acode->ncum);
        jl=JTRY(acode->jdif,acode->ncumfq[*ich+1],acode->ncum);
        acode->jdif=jh-jl;
        arcsum(acode->ilob,acode->iupb,jh,NWK,acode->nrad,acode->nc);
        arcsum(acode->ilob,acode->ilob,jl,NWK,acode->nrad,acode->nc);
        /* How many leading digits to output (if encoding) or skip over? */
        for (j=acode->nc;j<=NWK;j++) {
            if (*ich != acode->nch && acode->iupb[j] != acode->ilob[j]) break;
            if (*lcd > *lcode) {
                fprintf(stderr,"Reached the end of the 'code' array.\n");
                fprintf(stderr,"Attempting to expand its size.\n");
                *lcode += *lcode/2;
                if ((*codep=(unsigned char *)realloc(*codep,
                    (unsigned)(*lcode*sizeof(unsigned char)))) == NULL) {
                    nrerror("Size expansion failed");
                }
            }
            if (isign > 0) (*codep)[*lcd]=(unsigned char)acode->ilob[j];
            ++(*lcd);
        }
        if (j > NWK) return;    /* Ran out of message. Did someone forget to
                                   encode a terminating ncd? */
        acode->nc=j;
        for(j=0;acode->jdif<acode->minint;j++)    /* How many digits to shift? */
            acode->jdif *= acode->nrad;
        if (acode->nc-j < 1) nrerror("NWK too small in arcode.");
        if (j) {    /* Shift them. */
            for (k=acode->nc;k<=NWK;k++) {
                acode->iupb[k-j]=acode->iupb[k];
                acode->ilob[k-j]=acode->ilob[k];
            }
        }
        acode->nc -= j;
        for (k=NWK-j+1;k<=NWK;k++) acode->iupb[k]=acode->ilob[k]=0;
    }
    return;    /* Normal return. */
}
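On compilers that provide the C99 header <stdint.h>, the quantity (k*j)/m that JTRY obtains through double precision can instead be formed exactly in integer arithmetic, by holding the product in a 64-bit intermediate. The sketch below is one possible replacement and not part of the routines in this book. The name muldiv32 is hypothetical, and it assumes that all three operands fit in 32 bits and that k ≤ m (as holds when k is a cumulative frequency and m is the total), so that the quotient cannot exceed j.

```c
#include <assert.h>
#include <stdint.h>

/* Return floor(j*k/m) exactly.  The product of two 32-bit values always fits
   in 64 bits, so no overflow or roundoff can occur.  Requires m != 0 and
   k <= m, which guarantees that the result is at most j. */
uint32_t muldiv32(uint32_t j, uint32_t k, uint32_t m)
{
    uint64_t product;

    assert(m != 0 && k <= m);
    product = (uint64_t)j * (uint64_t)k;
    return (uint32_t)(product / m);
}
```

For example, muldiv32(4000000000u, 3, 4) gives 3000000000, whereas forming 4000000000*3 in 32-bit arithmetic would overflow. Because the result no longer depends on floating roundoff, this variant would also encode identically on different machines, under the stated assumptions.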

void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
    int nwk, unsigned long nrad, unsigned long nc)
/* Used by arcode. Add the integer ja to the radix nrad multiple-precision
   integer iin[nc..nwk]. Return the result in iout[nc..nwk]. */
{
    int j,karry=0;
    unsigned long jtmp;

    for (j=nwk;j>nc;j--) {
        jtmp=ja;
        ja /= nrad;
        iout[j]=iin[j]+(jtmp-ja*nrad)+karry;
        if (iout[j] >= nrad) {
            iout[j] -= nrad;
            karry=1;
        } else karry=0;
    }
    iout[nc]=iin[nc]+ja+karry;
}
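The digit-by-digit addition performed by arcsum is easiest to follow in radix 10. The following stand-alone sketch repeats the same steps with zero-based arrays; the name add_small_int and the argument order are assumptions for this illustration, and each input digit is assumed to be less than nrad.

```c
#include <assert.h>
#include <stddef.h>

/* Add the ordinary integer ja to the radix-nrad number in[0..n-1] (most
   significant digit first), storing the result in out[0..n-1].  The leading
   digit absorbs whatever is left over and is not reduced modulo nrad, as in
   arcsum.  in and out may be the same array. */
void add_small_int(const unsigned long in[], unsigned long out[], size_t n,
                   unsigned long ja, unsigned long nrad)
{
    size_t j;
    unsigned long carry = 0;

    assert(n > 0 && nrad > 1);
    for (j = n - 1; j > 0; j--) {
        unsigned long digit = in[j] + ja % nrad + carry;  /* peel off low digit */
        ja /= nrad;
        if (digit >= nrad) {
            digit -= nrad;
            carry = 1;
        } else carry = 0;
        out[j] = digit;
    }
    out[0] = in[0] + ja + carry;
}
```

Adding 12 to the digits 1, 9, 9 in radix 10 produces 2, 1, 1, that is, 199 + 12 = 211, with a carry propagating through both low-order positions.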

If radix-changing, rather than compression, is your primary aim (for example to convert an arbitrary file into printable characters) then you are of course free to set all the components of nfreq equal, say, to 1.

CITED REFERENCES AND FURTHER READING:
Bell, T.C., Cleary, J.G., and Witten, I.H. 1990, Text Compression (Englewood Cliffs, NJ: Prentice-Hall).
Nelson, M. 1991, The Data Compression Book (Redwood City, CA: M&T Books).
Witten, I.H., Neal, R.M., and Cleary, J.G. 1987, Communications of the ACM, vol. 30, pp. 520–540. [1]

20.6 Arithmetic at Arbitrary Precision

Let’s compute the number π to a couple of thousand decimal places. In doing so, we’ll learn some things about multiple precision arithmetic on computers and meet quite an unusual application of the fast Fourier transform (FFT). We’ll also develop a set of routines that you can use for other calculations at any desired level of arithmetic precision.

To start with, we need an analytic algorithm for π. Useful algorithms are quadratically convergent, i.e., they double the number of significant digits at each iteration. Quadratically convergent algorithms for π are based on the AGM (arithmetic geometric mean) method, which also finds application to the calculation of elliptic integrals (cf. §6.11) and in advanced implementations of the ADI method for elliptic partial differential equations (§19.5). Borwein and Borwein [1] treat this subject, which is beyond our scope here. One of their algorithms for π starts with the initializations

\[
X_0 = \sqrt{2}, \qquad \pi_0 = 2 + \sqrt{2}, \qquad Y_0 = \sqrt[4]{2} \tag{20.6.1}
\]

and then, for i = 0, 1, . . . , repeats the iteration

\[
\begin{aligned}
X_{i+1} &= \frac{1}{2}\left(\sqrt{X_i} + \frac{1}{\sqrt{X_i}}\right) \\
\pi_{i+1} &= \pi_i \left(\frac{X_{i+1} + 1}{Y_i + 1}\right) \\
Y_{i+1} &= \frac{Y_i \sqrt{X_{i+1}} + \dfrac{1}{\sqrt{X_{i+1}}}}{Y_i + 1}
\end{aligned} \tag{20.6.2}
\]
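Before moving to multiple precision, the iteration (20.6.2) can be tried in ordinary double precision to watch the quadratic convergence. The sketch below is only a check of the formulas, not the arbitrary-precision program developed in this section. The names borwein_pi and newton_sqrt are hypothetical, and square roots are taken by a fixed number of Newton steps (an assumption adequate for the arguments between 1 and 2 that occur here) so that only the four basic arithmetic operations are used.

```c
#include <assert.h>

/* Square root by Newton's iteration s <- (s + a/s)/2.  Eight steps from the
   starting guess s = a are ample for 1 <= a <= 2 in double precision. */
static double newton_sqrt(double a)
{
    double s = a;
    int k;

    for (k = 0; k < 8; k++) s = 0.5 * (s + a / s);
    return s;
}

/* Apply niter steps of iteration (20.6.2), starting from (20.6.1), and
   return the resulting approximation pi_niter. */
double borwein_pi(int niter)
{
    double x = newton_sqrt(2.0);             /* X_0 = sqrt(2)       */
    double p = 2.0 + newton_sqrt(2.0);       /* pi_0 = 2 + sqrt(2)  */
    double y = newton_sqrt(x);               /* Y_0 = 2^(1/4)       */
    int i;

    assert(niter >= 0);
    for (i = 0; i < niter; i++) {
        double sx = newton_sqrt(x);
        double xnew = 0.5 * (sx + 1.0 / sx);     /* X_{i+1}             */
        double sxn = newton_sqrt(xnew);
        p *= (xnew + 1.0) / (y + 1.0);           /* pi_{i+1}, uses Y_i  */
        y = (y * sxn + 1.0 / sxn) / (y + 1.0);   /* Y_{i+1}, uses Y_i   */
        x = xnew;
    }
    return p;
}
```

One iteration gives roughly 3.14261, two give about eight correct digits, and three already exhaust double precision, which is why the interesting part of the problem is the arithmetic rather than the algorithm.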

The value π emerges as the limit π_∞.

Now, to the question of how to do arithmetic to arbitrary precision: In a high-level language like C, a natural choice is to work in radix (base) 256, so that character arrays can be directly interpreted as strings of digits. At the very end of our calculation, we will want to convert our answer to radix 10, but that is essentially a frill for the benefit of human ears, accustomed to the familiar chant, “three point

one four one five nine. . . .” For any less frivolous calculation, we would likely never leave base 256 (or the thence trivially reachable hexadecimal, octal, or binary bases).

We will adopt the convention of storing digit strings in the “human” ordering, that is, with the first stored digit in an array being most significant, the last stored digit being least significant. The opposite convention would, of course, also be possible. “Carries,” where we need to partition a number larger than 255 into a low-order byte and a high-order carry, present a minor programming annoyance, solved, in the routines below, by the use of the macros LOBYTE and HIBYTE.

It is easy at this point, following Knuth [2], to write routines for the “fast” arithmetic operations: short addition (adding a single byte to a string), addition, subtraction, short multiplication (multiplying a string by a single byte), short division, ones-complement negation; and a couple of utility operations, copying and left-shifting strings. (On the diskette, these functions are all in the single file mpops.c.)

#define LOBYTE(x) ((unsigned char) ((x) & 0xff))
#define HIBYTE(x) ((unsigned char) ((x) >> 8 & 0xff))
/* Multiple precision arithmetic operations done on character strings,
   interpreted as radix 256 numbers. This set of routines collects the simpler
   operations. */

void mpadd(unsigned char w[], unsigned char u[], unsigned char v[], int n)
/* Adds the unsigned radix 256 integers u[1..n] and v[1..n] yielding the
   unsigned integer w[1..n+1]. */
{
    int j;
    unsigned short ireg=0;

    for (j=n;j>=1;j--) {
        ireg=u[j]+v[j]+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsub(int *is, unsigned char w[], unsigned char u[], unsigned char v[],
    int n)
/* Subtracts the unsigned radix 256 integer v[1..n] from u[1..n] yielding the
   unsigned integer w[1..n]. If the result is negative (wraps around), is is
   returned as -1; otherwise it is returned as 0. */
{
    int j;
    unsigned short ireg=256;

    for (j=n;j>=1;j--) {
        ireg=255+u[j]-v[j]+HIBYTE(ireg);
        w[j]=LOBYTE(ireg);
    }
    *is=HIBYTE(ireg)-1;
}

void mpsad(unsigned char w[], unsigned char u[], int n, int iv)
/* Short addition: the integer iv (in the range 0 <= iv <= 255) is added to the
   unsigned radix 256 integer u[1..n], yielding w[1..n+1]. */
{
    int j;
    unsigned short ireg;

    ireg=256*iv;
    for (j=n;j>=1;j--) {
        ireg=u[j]+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsmu(unsigned char w[], unsigned char u[], int n, int iv)
/* Short multiplication: the unsigned radix 256 integer u[1..n] is multiplied
   by the integer iv (in the range 0 <= iv <= 255), yielding w[1..n+1]. */
{
    int j;
    unsigned short ireg=0;

    for (j=n;j>=1;j--) {
        ireg=u[j]*iv+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsdv(unsigned char w[], unsigned char u[], int n, int iv, int *ir)
/* Short division: the unsigned radix 256 integer u[1..n] is divided by the
   integer iv (in the range 0 <= iv <= 255), yielding a quotient w[1..n] and a
   remainder ir (with 0 <= ir <= 255). */
{
    int i,j;

    *ir=0;
    for (j=1;j<=n;j++) {
        i=256*(*ir)+u[j];
        w[j]=(unsigned char) (i/iv);
        *ir=i % iv;
    }
}

void mpneg(unsigned char u[], int n)
/* Ones-complement negate the unsigned radix 256 integer u[1..n]. */
{
    int j;
    unsigned short ireg=256;

    for (j=n;j>=1;j--) {
        ireg=255-u[j]+HIBYTE(ireg);
        u[j]=LOBYTE(ireg);
    }
}

void mpmov(unsigned char u[], unsigned char v[], int n)
/* Move v[1..n] onto u[1..n]. */
{
    int j;

    for (j=1;j<=n;j++) u[j]=v[j];
}
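As an illustration of how short division serves the final conversion to radix 10, the following stand-alone sketch turns a radix 256 integer into a decimal string by dividing repeatedly by 10 and collecting the remainders. It is not one of the routines on the diskette: the names short_divide and to_decimal, the zero-based arrays, and the in-place division are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Divide the radix-256 integer u[0..n-1] (most significant byte first) in
   place by iv (1..255) and return the remainder. */
static unsigned int short_divide(unsigned char u[], size_t n, unsigned int iv)
{
    size_t j;
    unsigned int rem = 0;

    assert(iv >= 1 && iv <= 255);
    for (j = 0; j < n; j++) {
        unsigned int t = 256 * rem + u[j];   /* rem < iv, so t < 65536 */
        u[j] = (unsigned char)(t / iv);
        rem = t % iv;
    }
    return rem;
}

/* Write the decimal digits of u[0..n-1] into out as a NUL-terminated string
   and return the number of digits.  u is overwritten (left as zero).
   outsize must be at least 3*n+1, since n bytes need at most 3*n digits. */
size_t to_decimal(unsigned char u[], size_t n, char out[], size_t outsize)
{
    size_t len = 0, i, j;
    int nonzero;

    assert(n > 0 && outsize >= 3 * n + 1);
    do {
        out[len++] = (char)('0' + short_divide(u, n, 10)); /* low digit first */
        nonzero = 0;
        for (j = 0; j < n; j++)
            if (u[j]) nonzero = 1;
    } while (nonzero);
    for (i = 0; i < len / 2; i++) {          /* reverse into human ordering */
        char c = out[i];
        out[i] = out[len - 1 - i];
        out[len - 1 - i] = c;
    }
    out[len] = '\0';
    return len;
}
```

The two bytes 1, 0 (the number 256) come out as the string "256". Each pass costs one sweep over the whole array, so the conversion takes a number of operations proportional to n squared; that is acceptable here only because it is done once, at the very end.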
