A THEORETICAL INTRODUCTION TO NUMERICAL ANALYSIS
Victor S. Ryaben'kii
Semyon V. Tsynkov
Boca Raton
London
New York
Chapman & Hall/CRC is an imprint of the
Taylor & Francis Group, an Informa business
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

© 2007 by Taylor & Francis Group, LLC. Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business.

No claim to original U.S. Government works. Printed in the United States of America on acid-free paper. 10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 1-58488-607-2 (Hardcover)
International Standard Book Number-13: 978-1-58488-607-5 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Contents

Preface
Acknowledgments

1  Introduction
   1.1  Discretization
        Exercises
   1.2  Conditioning
        Exercises
   1.3  Error
        1.3.1  Unavoidable Error
        1.3.2  Error of the Method
        1.3.3  Round-off Error
        Exercises
   1.4  On Methods of Computation
        1.4.1  Accuracy
        1.4.2  Operation Count
        1.4.3  Stability
        1.4.4  Loss of Significant Digits
        1.4.5  Convergence
        1.4.6  General Comments
        Exercises

Part I  Interpolation of Functions. Quadratures

2  Algebraic Interpolation
   2.1  Existence and Uniqueness of Interpolating Polynomial
        2.1.1  The Lagrange Form of Interpolating Polynomial
        2.1.2  The Newton Form of Interpolating Polynomial. Divided Differences
        2.1.3  Comparison of the Lagrange and Newton Forms
        2.1.4  Conditioning of the Interpolating Polynomial
        2.1.5  On Poor Convergence of Interpolation with Equidistant Nodes
        Exercises
   2.2  Classical Piecewise Polynomial Interpolation
        2.2.1  Definition of Piecewise Polynomial Interpolation
        2.2.2  Formula for the Interpolation Error
        2.2.3  Approximation of Derivatives for a Grid Function
        2.2.4  Estimate of the Unavoidable Error and the Choice of Degree for Piecewise Polynomial Interpolation
        2.2.5  Saturation of Piecewise Polynomial Interpolation
        Exercises
   2.3  Smooth Piecewise Polynomial Interpolation (Splines)
        2.3.1  Local Interpolation of Smoothness s and Its Properties
        2.3.2  Nonlocal Smooth Piecewise Polynomial Interpolation
        2.3.3  Proof of Theorem 2.11
        Exercises
   2.4  Interpolation of Functions of Two Variables
        2.4.1  Structured Grids
        2.4.2  Unstructured Grids
        Exercises

3  Trigonometric Interpolation
   3.1  Interpolation of Periodic Functions
        3.1.1  An Important Particular Choice of Interpolation Nodes
        3.1.2  Sensitivity of the Interpolating Polynomial to Perturbations of the Function Values
        3.1.3  Estimate of Interpolation Error
        3.1.4  An Alternative Choice of Interpolation Nodes
   3.2  Interpolation of Functions on an Interval. Relation between Algebraic and Trigonometric Interpolation
        3.2.1  Periodization
        3.2.2  Trigonometric Interpolation
        3.2.3  Chebyshev Polynomials. Relation between Algebraic and Trigonometric Interpolation
        3.2.4  Properties of Algebraic Interpolation with Roots of the Chebyshev Polynomial T_{n+1}(x) as Nodes
        3.2.5  An Algorithm for Evaluating the Interpolating Polynomial
        3.2.6  Algebraic Interpolation with Extrema of the Chebyshev Polynomial T_n(x) as Nodes
        3.2.7  More on the Lebesgue Constants and Convergence of Interpolants
        Exercises

4  Computation of Definite Integrals. Quadratures
   4.1  Trapezoidal Rule, Simpson's Formula, and the Like
        4.1.1  General Construction of Quadrature Formulae
        4.1.2  Trapezoidal Rule
        4.1.3  Simpson's Formula
        Exercises
   4.2  Quadrature Formulae with No Saturation. Gaussian Quadratures
        Exercises
   4.3  Improper Integrals. Combination of Numerical and Analytical Methods
        Exercises
   4.4  Multiple Integrals
        4.4.1  Repeated Integrals and Quadrature Formulae
        4.4.2  The Use of Coordinate Transformations
        4.4.3  The Notion of Monte Carlo Methods

Part II  Systems of Scalar Equations

5  Systems of Linear Algebraic Equations: Direct Methods
   5.1  Different Forms of Consistent Linear Systems
        5.1.1  Canonical Form of a Linear System
        5.1.2  Operator Form
        5.1.3  Finite-Difference Dirichlet Problem for the Poisson Equation
        Exercises
   5.2  Linear Spaces, Norms, and Operators
        5.2.1  Normed Spaces
        5.2.2  Norm of a Linear Operator
        Exercises
   5.3  Conditioning of Linear Systems
        5.3.1  Condition Number
        5.3.2  Characterization of a Linear System by Means of Its Condition Number
        Exercises
   5.4  Gaussian Elimination and Its Tri-Diagonal Version
        5.4.1  Standard Gaussian Elimination
        5.4.2  Tri-Diagonal Elimination
        5.4.3  Cyclic Tri-Diagonal Elimination
        5.4.4  Matrix Interpretation of the Gaussian Elimination. LU Factorization
        5.4.5  Cholesky Factorization
        5.4.6  Gaussian Elimination with Pivoting
        5.4.7  An Algorithm with a Guaranteed Error Estimate
        Exercises
   5.5  Minimization of Quadratic Functions and Its Relation to Linear Systems
        Exercises
   5.6  The Method of Conjugate Gradients
        5.6.1  Construction of the Method
        5.6.2  Flexibility in Specifying the Operator A
        5.6.3  Computational Complexity
        Exercises
   5.7  Finite Fourier Series
        5.7.1  Fourier Series for Grid Functions
        5.7.2  Representation of Solution as a Finite Fourier Series
        5.7.3  Fast Fourier Transform
        Exercises

6  Iterative Methods for Solving Linear Systems
   6.1  Richardson Iterations and the Like
        6.1.1  General Iteration Scheme
        6.1.2  A Necessary and Sufficient Condition for Convergence
        6.1.3  The Richardson Method for A = A* > 0
        6.1.4  Preconditioning
        6.1.5  Scaling
        Exercises
   6.2  Chebyshev Iterations and Conjugate Gradients
        6.2.1  Chebyshev Iterations
        6.2.2  Conjugate Gradients
        Exercises
   6.3  Krylov Subspace Iterations
        6.3.1  Definition of Krylov Subspaces
        6.3.2  GMRES
        Exercises
   6.4  Multigrid Iterations
        6.4.1  Idea of the Method
        6.4.2  Description of the Algorithm
        6.4.3  Bibliography Comments
        Exercises

7  Overdetermined Linear Systems. The Method of Least Squares
   7.1  Examples of Problems that Result in Overdetermined Systems
        7.1.1  Processing of Experimental Data. Empirical Formulae
        7.1.2  Improving the Accuracy of Experimental Results by Increasing the Number of Measurements
   7.2  Weak Solutions of Full Rank Systems. QR Factorization
        7.2.1  Existence and Uniqueness of Weak Solutions
        7.2.2  Computation of Weak Solutions. QR Factorization
        7.2.3  Geometric Interpretation of the Method of Least Squares
        7.2.4  Overdetermined Systems in the Operator Form
        Exercises
   7.3  Rank Deficient Systems. Singular Value Decomposition
        7.3.1  Singular Value Decomposition and Moore-Penrose Pseudoinverse
        7.3.2  Minimum Norm Weak Solution
        Exercises

8  Numerical Solution of Nonlinear Equations and Systems
   8.1  Commonly Used Methods of Rootfinding
        8.1.1  The Bisection Method
        8.1.2  The Chord Method
        8.1.3  The Secant Method
        8.1.4  Newton's Method
   8.2  Fixed Point Iterations
        8.2.1  The Case of One Scalar Equation
        8.2.2  The Case of a System of Equations
        Exercises
   8.3  Newton's Method
        8.3.1  Newton's Linearization for One Scalar Equation
        8.3.2  Newton's Linearization for Systems
        8.3.3  Modified Newton's Methods
        Exercises

Part III  The Method of Finite Differences for the Numerical Solution of Differential Equations

9  Numerical Solution of Ordinary Differential Equations
   9.1  Examples of Finite-Difference Schemes. Convergence
        9.1.1  Examples of Difference Schemes
        9.1.2  Convergent Difference Schemes
        9.1.3  Verification of Convergence for a Difference Scheme
   9.2  Approximation of Continuous Problem by a Difference Scheme. Consistency
        9.2.1  Truncation Error δf^(h)
        9.2.2  Evaluation of the Truncation Error δf^(h)
        9.2.3  Accuracy of Order h^k
        9.2.4  Examples
        9.2.5  Replacement of Derivatives by Difference Quotients
        9.2.6  Other Approaches to Constructing Difference Schemes
        Exercises
   9.3  Stability of Finite-Difference Schemes
        9.3.1  Definition of Stability
        9.3.2  The Relation between Consistency, Stability, and Convergence
        9.3.3  Convergent Scheme for an Integral Equation
        9.3.4  The Effect of Rounding
        9.3.5  General Comments. A-stability
        Exercises
   9.4  The Runge-Kutta Methods
        9.4.1  The Runge-Kutta Schemes
        9.4.2  Extension to Systems
        Exercises
   9.5  Solution of Boundary Value Problems
        9.5.1  The Shooting Method
        9.5.2  Tri-Diagonal Elimination
        9.5.3  Newton's Method
        Exercises
   9.6  Saturation of Finite-Difference Methods by Smoothness
        Exercises
   9.7  The Notion of Spectral Methods
        Exercises

10  Finite-Difference Schemes for Partial Differential Equations
    10.1  Key Definitions and Illustrating Examples
          10.1.1  Definition of Convergence
          10.1.2  Definition of Consistency
          10.1.3  Definition of Stability
          10.1.4  The Courant, Friedrichs, and Lewy Condition
          10.1.5  The Mechanism of Instability
          10.1.6  The Kantorovich Theorem
          10.1.7  On the Efficacy of Finite-Difference Schemes
          10.1.8  Bibliography Comments
          Exercises
    10.2  Construction of Consistent Difference Schemes
          10.2.1  Replacement of Derivatives by Difference Quotients
          10.2.2  The Method of Undetermined Coefficients
          10.2.3  Other Methods. Phase Error
          10.2.4  Predictor-Corrector Schemes
          Exercises
    10.3  Spectral Stability Criterion for Finite-Difference Cauchy Problems
          10.3.1  Stability with Respect to Initial Data
          10.3.2  A Necessary Spectral Condition for Stability
          10.3.3  Examples
          10.3.4  Stability in C
          10.3.5  Sufficiency of the Spectral Stability Condition in l2
          10.3.6  Scalar Equations vs. Systems
          Exercises
    10.4  Stability for Problems with Variable Coefficients
          10.4.1  The Principle of Frozen Coefficients
          10.4.2  Dissipation of Finite-Difference Schemes
          Exercises
    10.5  Stability for Initial Boundary Value Problems
          10.5.1  The Babenko-Gelfand Criterion
          10.5.2  Spectra of the Families of Operators. The Godunov-Ryaben'kii Criterion
          10.5.3  The Energy Method
          10.5.4  A Necessary and Sufficient Condition of Stability. The Kreiss Criterion
          Exercises
    10.6  Maximum Principle for the Heat Equation
          10.6.1  An Explicit Scheme
          10.6.2  An Implicit Scheme
          Exercises

11  Discontinuous Solutions and Methods of Their Computation
    11.1  Differential Form of an Integral Conservation Law
          11.1.1  Differential Equation in the Case of Smooth Solutions
          11.1.2  The Mechanism of Formation of Discontinuities
          11.1.3  Condition at the Discontinuity
          11.1.4  Generalized Solution of a Differential Problem
          11.1.5  The Riemann Problem
          Exercises
    11.2  Construction of Difference Schemes
          11.2.1  Artificial Viscosity
          11.2.2  The Method of Characteristics
          11.2.3  Conservative Schemes. The Godunov Scheme
          Exercises

12  Discrete Methods for Elliptic Problems
    12.1  A Simple Finite-Difference Scheme. The Maximum Principle
          12.1.1  Consistency
          12.1.2  Maximum Principle and Stability
          12.1.3  Variable Coefficients
          Exercises
    12.2  The Notion of Finite Elements. Ritz and Galerkin Approximations
          12.2.1  Variational Problem
          12.2.2  The Ritz Method
          12.2.3  The Galerkin Method
          12.2.4  An Example of Finite Element Discretization
          12.2.5  Convergence of Finite Element Approximations
          Exercises

Part IV  The Methods of Boundary Equations for the Numerical Solution of Boundary Value Problems

13  Boundary Integral Equations and the Method of Boundary Elements
    13.1  Reduction of Boundary Value Problems to Integral Equations
    13.2  Discretization of Integral Equations and Boundary Elements
    13.3  The Range of Applicability for Boundary Elements

14  Boundary Equations with Projections and the Method of Difference Potentials
    14.1  Formulation of Model Problems
          14.1.1  Interior Boundary Value Problem
          14.1.2  Exterior Boundary Value Problem
          14.1.3  Problem of Artificial Boundary Conditions
          14.1.4  Problem of Two Subdomains
          14.1.5  Problem of Active Shielding
    14.2  Difference Potentials
          14.2.1  Auxiliary Difference Problem
          14.2.2  The Potential u+ = P+vγ
          14.2.3  Difference Potential u− = P−vγ
          14.2.4  Cauchy Type Difference Potential w± = P±vγ
          14.2.5  Analogy with Classical Cauchy Type Integral
    14.3  Solution of Model Problems
          14.3.1  Interior Boundary Value Problem
          14.3.2  Exterior Boundary Value Problem
          14.3.3  Problem of Artificial Boundary Conditions
          14.3.4  Problem of Two Subdomains
          14.3.5  Problem of Active Shielding
    14.4  General Remarks
    14.5  Bibliography Comments

List of Figures
Referenced Books
Referenced Journal Articles
Index
Preface
This book introduces the key ideas and concepts of numerical analysis. The discussion focuses on how one can represent different mathematical models in a form that enables their efficient study by means of a computer. The material learned from this book can be applied in various contexts that require the use of numerical methods. The general methodology and principles of numerical analysis are illustrated by specific examples of the methods for real analysis, linear algebra, and differential equations. The reason for this particular selection of subjects is that these methods are proven, provide a number of well-known efficient algorithms, and are used for solving different applied problems that are often quite distinct from one another.

The contemplated readership of this book consists of beginning graduate and senior undergraduate students in mathematics, science, and engineering. It may also be of interest to working scientists and engineers. The book offers a first mathematical course on the subject of numerical analysis. It is carefully structured and can be read in its entirety, as well as by selected parts. The portions of the text considered more difficult are clearly identified; they can be skipped during the first reading without creating any substantial gaps in the material studied otherwise. In particular, more difficult subjects are discussed in Sections 2.3.1 and 2.3.3, Sections 3.1.3 and 3.2.7, parts of Sections 4.2 and 9.7, Section 10.5, Section 12.2, and Chapter 14.

Hereafter, numerical analysis is interpreted as a mathematical discipline. The basic concepts, such as discretization, error, efficiency, complexity, numerical stability, consistency, convergence, and others, are explained and illustrated in different parts of the book with varying levels of depth using different subject material. Moreover, some ideas and views that are addressed, or at least touched upon, in the text may also draw the attention of more advanced readers.

First and foremost, this applies to the key notion of the saturation of numerical methods by smoothness. A given method of approximation is said to be saturated by smoothness if, because of its design, it may stop short of reaching the intrinsic accuracy limit (unavoidable error) determined by the smoothness of the approximated solution and by the discretization parameters. If, conversely, the accuracy of approximation self-adjusts to the smoothness, then the method does not saturate. Examples include algebraic vs. trigonometric interpolation, Newton-Cotes vs. Gaussian quadratures, finite-difference vs. spectral methods for differential equations, etc.

Another advanced subject is an introduction to the method of difference potentials in Chapter 14. This is the first account of difference potentials in the educational literature. The method employs discrete analogues of modified Calderon's potentials and boundary projection operators. It has been successfully applied to solving a variety of direct and inverse problems in fluids, acoustics, and electromagnetism.

This book covers three semesters of instruction in the framework of a commonly used curriculum with three credit hours per semester. Three semester-long courses can be designed based on Parts I, II, and III of the book, respectively. Part I includes interpolation of functions and numerical evaluation of definite integrals. Part II covers direct and iterative solution of consistent linear systems, solution of overdetermined linear systems, and solution of nonlinear equations and systems. Part III discusses finite-difference methods for differential equations. The first chapter in this part, Chapter 9, is devoted to ordinary differential equations and serves an introductory purpose. Chapters 10, 11, and 12 cover different aspects of finite-difference approximation for both steady-state and evolution partial differential equations, including rigorous analysis of stability for initial boundary value problems and approximation of the weak solutions for nonlinear conservation laws. Alternatively, for curricula that introduce numerical differentiation right after the interpolation of functions and quadratures, the material from Chapter 9 can be added to a course based predominantly on Part I of the book.

A rigorous mathematical style is maintained throughout the book, yet very little use is made of the apparatus of functional analysis. This approach makes the book accessible to a much broader audience than only mathematicians and mathematics majors, while not compromising any fundamentals in the field. A thorough explanation of the key ideas in the simplest possible setting is always prioritized over various technicalities and generalizations. All important mathematical results are accompanied by proofs. At the same time, a large number of examples are provided that illustrate how those results apply to the analysis of individual problems.

This book has no objective whatsoever of describing as many different methods and techniques as possible. On the contrary, it treats only a limited number of well-known methodologies, and only for the purpose of exemplifying the most fundamental concepts that unite different branches of the discipline. A number of important results are given as exercises for independent study. Altogether, many exercises supplement the core material; they range from elementary to quite challenging.

Some exercises require computer implementation of the corresponding techniques. However, no substantial emphasis is put on issues related to programming. In other words, any computer implementation serves only as an illustration of the relevant mathematical concepts and does not carry an independent learning objective. For example, it may be useful to have different iteration schemes implemented for a system of linear algebraic equations. By comparing how their convergence rates depend on the condition number, one can subsequently judge the efficiency from a mathematical standpoint. However, other efficiency issues, e.g., runtime efficiency determined by the software and/or computer platform, are not addressed, as there is no direct relation between them and the mathematical analysis of numerical methods.

Likewise, no substantial emphasis is put on any specific applications. Indeed, the goal is to clearly and concisely present the key mathematical concepts pertinent to the analysis of numerical methods. This provides a foundation for subsequent specialized training. Subjects such as computational fluid dynamics, computational acoustics, computational electromagnetism, etc., are very well addressed in the literature. Most corresponding books require some numerical background from the reader, the background of precisely the kind that the current text offers.
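The notion of saturation by smoothness mentioned in this preface can be made concrete with a small numerical sketch. The example below is an illustration of ours, not taken from the book; the integrand exp(x) and the interval [0, 1] are arbitrary choices. It shows that the composite trapezoidal rule saturates: halving the step cuts the error by a factor of about 4 (second-order accuracy) no matter how smooth the integrand is, whereas a non-saturating method such as Gaussian quadrature would exploit the extra smoothness and converge far faster.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        s += f(a + i * h)
    return s * h

# Integrate exp(x) over [0, 1]; the exact value is e - 1.
exact = math.e - 1.0
errors = [abs(trapezoid(math.exp, 0.0, 1.0, n) - exact) for n in (32, 64, 128)]

# exp(x) is infinitely smooth, yet the error still decays only as O(h^2):
# each halving of h divides the error by about 4 -- the hallmark of a
# method saturated by smoothness.
for coarse, fine in zip(errors, errors[1:]):
    print(round(coarse / fine, 2))  # close to 4
```

The observed ratio near 4 stays pinned at the method's design order regardless of how many continuous derivatives the integrand has; this is precisely the behavior contrasted with Gaussian quadratures in Chapter 4.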
Acknowledgments
This book has a Russian-language prototype [Rya00] that withstood two editions: in 1994 and in 2000. It serves as the main numerical analysis text at the Moscow Institute for Physics and Technology. The authors are most grateful to the rector of the Institute at the time, Academician O. M. Belotserkovskii, who has influenced the original concept of this textbook.

Compared to [Rya00], the current book is completely rewritten. It accommodates the differences that exist between the Russian-language culture and the English-language culture of mathematics education. Moreover, the current textbook includes a very considerable amount of additional material. When writing Part III of the book, we exploited the ideas and methods previously developed in [GR64] and [GR87]. When writing Chapter 14, we used the approach of [Rya02, Introduction].

We are indebted to all our colleagues and friends with whom we discussed the subject of teaching numerical analysis. The book has greatly benefited from all those discussions. In particular, we would like to thank S. Abarbanel, K. Brushlinskii, V. Demchenko, A. Chertock, L. Choudov, L. Demkowicz, A. Ditkowski, R. Fedorenko, G. Fibich, P. Gremaud, T. Hagstrom, V. Ivanov, C. Kelley, D. Keyes, A. Kholodov, V. Kosarev, A. Kurganov, C. Meyer, N. Onofrieva, I. Petrov, V. Pirogov, L. Strygina, E. Tadmor, E. Turkel, S. Utyuzhnikov, and A. Zabrodin. We also remember the late K. Babenko, O. Lokutsievskii, and Yu. Radvogin.

We would like to specially thank Alexandre Chorin of UC Berkeley and David Gottlieb of Brown University, who read the manuscript prior to publication.

A crucial and painstaking task of proofreading the manuscript was performed by the students who took classes on the subject of this book when it was in preparation. We are most grateful to L. Bilbro, A. Constantinescu, S. Ernsberger, S. Grove, A. Peterson, H. Qasimov, A. Sampat, and W. Weiselquist. All the imperfections still remaining are the sole responsibility of the authors.

It is also a pleasure for the second author to thank Arje Nachman and Richard Albanese of the US Air Force for their consistent support of the second author's research work during and beyond the period of time when the book was written.

And last, but not least, we are very grateful to the CRC Press Editor, Sunil Nair, as well as to the company staff in London and in Florida, for their advice and assistance. Finally, our deepest thanks go to our families for their patience and understanding, without which this book project would have never been completed.

V. Ryaben'kii, Moscow, Russia
S. Tsynkov, Raleigh, USA

August 2006
Chapter 1 Introduction
Modern numerical mathematics provides a theoretical foundation for the use of electronic computers for solving applied problems. A mathematical approach to any such problem typically begins with building a model for the phenomenon of interest (situation, process, object, device, laboratory/experimental setting, etc.). Classical examples of mathematical models include definite integrals, the equation of a pendulum, the heat equation, equations of elasticity, equations of electromagnetic waves, and many other equations of mathematical physics. For comparison, we should also mention here a model used in formal logic: the Boolean algebra.

Analytical methods have always been considered a fundamental means for studying mathematical models. In particular, these methods allow one to obtain closed-form exact solutions for some special cases (for example, tabular integrals). There are also classes of problems for which one can obtain a solution in the form of a power series, Fourier series, or some other expansion. In addition, a certain role has always been played by approximate computations. For example, quadrature formulae are used for the evaluation of definite integrals.

The advent of computers in the middle of the twentieth century has drastically increased our capability of performing approximate computations. Computers have essentially transformed approximate computations into a dominant tool for the analysis of mathematical models. Analytical methods have not lost their importance, and have even gained some additional "functionality" as components of combined analytical/computational techniques and as verification tools. Yet sophisticated mathematical models are analyzed nowadays mostly with the help of computers. Computers have dramatically broadened the applicability range of mathematical methods in many traditional areas, such as mechanics, physics, and engineering. They have also facilitated a rapid expansion of mathematical methods into various non-traditional fields, such as management, economics, finance, chemistry, biology, psychology, linguistics, ecology, and others.

Computers provide a capability of storing large (but still finite) arrays of numbers, and performing arithmetic operations with these numbers according to a given program that runs with a fast (but still finite) execution speed. Therefore, computers may only be appropriate for studying those particular models that are described by finite sets of numbers and require no more than finite sequences of arithmetic operations to be performed. Besides the arithmetic operations per se, a computer model can also contain comparisons between numbers that are typically needed for the automated control of subsequent computations.

In the traditional fields, one frequently employs such mathematical models as
functions, derivatives, integrals, and differential equations. To enable the use of com puters, these original models must therefore be (approximately) replaced by the new models that would only be based on finite arrays of numbers supplemented by finite sequences of arithmetic operations for their processing (i.e., finite algorithms). For example, a function can be replaced by a table of its numerical values; the derivative
df =lim f(x+h) f(x) dx h>O h can be replaced by an approximate formula, such as
j'(x) � f(x+h)  f(x) ,
h
where h is fixed (and small); a definite integral can be replaced by its integral sum; a boundary value problem for the differential equation can be replaced by the problem of finding its solution at the discrete nodes of some grid, so that by taking a suitable (i.e., sufficiently small) grid size an arbitrary desired accuracy can be achieved. In so doing, among the two methods that could seem equivalent at first glance, one may produce good results while the other may turn out completely inapplicable. The reason can be that the approximate solution it generates would not approach the exact solution as the grid size decreases, or that the approximate solution would turn out overly sensitive to the small roundoff errors. The subject of numerical analysis is precisely the theory of those models and al gorithms that are applicable, i.e., that can be efficiently implemented on comput ers. This theory is intimately connected with many other branches of mathematics: Approximation theory and interpolation of functions, ordinary and partial differen tial equations, integral equations, complexity theory for functional classes and algo rithms, etc., as well as with the theory and practice of programming languages. In general, both the exploratory capacity and the methodological advantages that com puters deliver to numerous applied areas are truly unparalleled. Modern numerical methods allow, for example, the computation of the flow of fluid around a given aerodynamic configuration, e.g., an airplane, which in most cases would present an insurmountable task for analytical methods (like a nontabular integral). Moreover, the use of computers has enabled an entirely new scientific methodol ogy known as computational experiment, i.e., computations aimed at verifying the hypotheses, as well as at monitoring the behavior of the model, when it is not known ahead of time what may interest the researcher. 
In fact, computational experiment may provide a sufficient level of feedback for the original formulation of the problem to be noticeably refined. In other words, numerical computations help accumulate the vital information that eventually allows one to identify the most interesting cases and results in a given area of study. Many remarkable observations, and even discoveries, have been made along this route that empowered the development of the theory and have found important practical applications as well.

Computers have also facilitated the application of mathematical methods to nontraditional areas, for which few or no "compact" mathematical models, such as differential equations, are readily available. However, other models can be built that lend themselves to the analysis by means of a computer. A model of this kind can often be interpreted as a direct numerical counterpart (such as an encoding) of the object of interest and of the pertinent relations between its elements (e.g., a language or its abridged subset and the corresponding words and phrases). The very possibility of studying such models on a computer prompts their construction, which, in turn, requires that the rules and guiding principles that govern the original object be clearly and unambiguously identified. On the other hand, the results of computer simulations, e.g., a machine translation of a simplified text from one language to another, provide a practical criterion for assessing the adequacy of the theories that constitute the foundation of the corresponding mathematical model (e.g., linguistic theories). Furthermore, computers have made it possible to analyze probabilistic models that require large amounts of test computations, as well as the so-called imitation models that describe the object or phenomenon of interest without simplifications (e.g., functional properties of a telephone network).

The variety of problems that can benefit from the use of computers is huge. For solving a given problem, one would obviously need to know enough specific detail. Clearly, this knowledge cannot be obtained ahead of time for all possible scenarios. Therefore, the purpose of this book is rather to provide a systematic perspective on those fundamental ideas and concepts that span different applied disciplines and can be considered established in the field of numerical analysis. Having mastered the material of this book, one should encounter little or no difficulty when receiving the subsequent specialized training required for successful work in a given research or industrial field.
The general methodology and principles of numerical analysis are illustrated in the book by "sampling" the methods designed for mathematical analysis, linear algebra, and differential equations. The reason for this particular selection is that the aforementioned methods are most mature, lead to a number of well-known, efficient algorithms, and are extensively used for solving various applied problems that are often quite distant from one another.

Let us mention here some of the general ideas and concepts that require the most thorough attention in every particular setting. These general ideas acquire a concrete interpretation and meaning in the context of each specific problem that needs to be solved on a computer. They are the discretization of the problem, conditioning of the problem, numerical error, and computational stability of a given algorithm. In addition, comparison of the algorithms along different lines obviously plays a central role when selecting a specific method. The key criteria for comparison are accuracy, storage, and operation count requirements, as well as the efficiency of utilization of the input information. On top of that, different algorithms may vary in how amenable they are to parallelization, a technique that allows one to conduct computations simultaneously on multiprocessor computer platforms.

In the rest of the Introduction, we provide a brief overview of the foregoing notions and concepts. It helps create a general perspective on the subject of numerical mathematics, and establishes a foundation for studying the subsequent material.
A Theoretical Introduction to Numerical Analysis
1.1 Discretization
Let f(x) be a function of the continuous argument x ∈ [0, 1]. Assume that this function provides (some of) the required input data for a given problem that needs to be approximately solved on a computer. The value of the function f at every given x can be either measured or obtained numerically. Then, to store this function in the memory of a computer, one may need to approximately characterize it with a table of values at a finite set of points: x_1, x_2, …, x_n. This is an elementary example of discretization: the problem of storing the function defined on the interval [0, 1], which is a continuum of points, is replaced by the problem of storing a table of its discrete values at the subset of points x_1, x_2, …, x_n that all belong to this interval.

Let now f(x) be sufficiently smooth, and assume that we need to calculate its derivative at a given point x. The problem of exactly evaluating the expression

f'(x) = lim_{h→0} (f(x + h) − f(x)) / h

that contains a limit can be replaced by the problem of computing an approximate value of this expression using one of the following formulae:

f'(x) ≈ (f(x + h) − f(x)) / h,   (1.1)

f'(x) ≈ (f(x) − f(x − h)) / h,   (1.2)

f'(x) ≈ (f(x + h) − f(x − h)) / (2h).   (1.3)

Similarly, the second derivative f″(x) can be replaced by the finite difference formula:

f″(x) ≈ (f(x + h) − 2f(x) + f(x − h)) / h².   (1.4)

One can show that all these formulae become more and more accurate as h becomes smaller; this is the subject of Exercise 1, and the details of the analysis can be found in Section 9.2.1. Moreover, for every fixed h, each formula (1.1)-(1.4) will only require a finite set of values of f and a finite number of arithmetic operations. These formulae are examples of discretization for the derivatives f'(x) and f″(x).
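These accuracy claims are easy to check numerically. The following sketch (an illustration only; the test function sin x and the point x = 1 are arbitrary choices, and the rigorous analysis is deferred to Section 9.2.1) compares the first-order formula (1.1) with the second-order formula (1.3):

```python
import math

def forward_diff(f, x, h):
    # Formula (1.1): first-order accurate, error O(h)
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # Formula (1.3): second-order accurate, error O(h^2)
    return (f(x + h) - f(x - h)) / (2 * h)

f, x = math.sin, 1.0
exact = math.cos(1.0)
for h in (1e-1, 1e-2, 1e-3):
    e_fwd = abs(forward_diff(f, x, h) - exact)
    e_cen = abs(central_diff(f, x, h) - exact)
    print(f"h={h:g}  forward error={e_fwd:.2e}  central error={e_cen:.2e}")
```

Shrinking h by a factor of ten reduces the forward-difference error roughly tenfold and the central-difference error roughly a hundredfold, in line with the O(h) and O(h²) estimates.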
Let us now consider a boundary value problem:

d²y/dx² − x²y = cos x,  y(0) = 2,  y(1) = 3,   (1.5)

where the unknown function y = y(x) is defined on the interval 0 ≤ x ≤ 1. To construct a discrete approximation of problem (1.5), let us first partition the interval [0, 1] into N equal subintervals of size h = N⁻¹. Instead of the continuous function y(x), we will be looking for a finite set of its values y_0, y_1, …, y_N on the grid x_k = kh, k = 0, 1, …, N. At the interior nodes of this grid, x_k, k = 1, 2, …, N−1, we can approximately replace the second derivative y″(x) by expression (1.4). After substituting into the differential equation of (1.5), this yields:

(y_{k+1} − 2y_k + y_{k−1}) / h² − (kh)² y_k = cos(kh),  k = 1, 2, …, N−1.   (1.6)

Furthermore, the boundary conditions at x = 0 and at x = 1 from (1.5) translate into:

y_0 = 2,  y_N = 3.   (1.7)
The system of N + 1 linear algebraic equations (1.6), (1.7) contains exactly as many unknowns y_0, y_1, …, y_N, and renders a discrete counterpart of the boundary value problem (1.5). One can, in fact, show that the finer the grid, i.e., the larger the N, the more accurate will the approximation be that the discrete solution of problem (1.6), (1.7) provides for the continuous solution of problem (1.5). Later, this fact will be formulated and proven rigorously.

Let us denote the continuous boundary value problem (1.5) by M_∞, and the discrete boundary value problem (1.6), (1.7) by M_N. By taking N = 2, 3, …, we associate an infinite sequence of discrete problems {M_N} with the continuous problem M_∞. When computing the solution to a given problem M_N for any fixed N, we only have to work with a finite array of numbers that specify the input data, and with a finite set of unknown quantities y_0, y_1, y_2, …, y_N. It is, however, the entire infinite sequence of finite discrete models {M_N} that plays the central role from the standpoint of numerical mathematics. Indeed, as those models happen to be more and more accurate, we can always choose a sufficiently large N that would guarantee any desired accuracy of approximation.

In general, there are many different ways of transitioning from a given continuous problem M_∞ to the sequence {M_N} of its discrete counterparts. In other words, the approximation (1.6), (1.7) of the boundary value problem (1.5) is by no means the only one possible. Let {M_N} and {M̃_N} be two sequences of approximations, and let us also assume that the computational costs of obtaining the discrete solutions of M_N and M̃_N are the same. Then, a better method of discretization would be the one that provides the same accuracy of approximation with a smaller value of N.
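To make the discretization concrete, here is a minimal sketch that assembles and solves system (1.6), (1.7); the tridiagonal elimination used below (the Thomas algorithm) is one standard choice of solver and is not prescribed by the text:

```python
import math

def solve_bvp(N):
    """Discretize y'' - x^2 y = cos x, y(0)=2, y(1)=3 on the grid x_k = k*h."""
    h = 1.0 / N
    # Tridiagonal system (1.6) for the interior unknowns y_1 .. y_{N-1}
    a = [1.0 / h**2] * (N - 1)                             # sub-diagonal
    b = [-2.0 / h**2 - (k * h) ** 2 for k in range(1, N)]  # main diagonal
    c = [1.0 / h**2] * (N - 1)                             # super-diagonal
    d = [math.cos(k * h) for k in range(1, N)]             # right-hand side
    y0, yN = 2.0, 3.0                                      # conditions (1.7)
    d[0] -= a[0] * y0
    d[-1] -= c[-1] * yN
    # Thomas algorithm: forward elimination, then back substitution
    for i in range(1, N - 1):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    y = [0.0] * (N - 1)
    y[-1] = d[-1] / b[-1]
    for i in range(N - 3, -1, -1):
        y[i] = (d[i] - c[i] * y[i + 1]) / b[i]
    return [y0] + y + [yN]

y = solve_bvp(100)
print(y[50])   # approximation to y(0.5)
```

Refining the grid (increasing N) changes the computed values less and less, consistent with the convergence claim above.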
Let us also note that for two seemingly equivalent discretization methods M_N and M̃_N, it may happen that one will approximate the continuous solution of problem M_∞ with increasingly high accuracy as N increases, whereas the other will yield "an approximate solution" that bears less and less resemblance to the continuous solution of M_∞. We will encounter situations like this in Part III of the book, where we also discuss how the corresponding difficulties can be partially or fully overcome.

Exercises

1. Let f(x) have as many bounded derivatives as needed. Show that the approximation error of formulae (1.1), (1.2), (1.3), and (1.4) is O(h), O(h), O(h²), and O(h²), respectively.
1.2 Conditioning
Speaking in most general terms, for any given problem one can basically identify the input data and the output result(s), i.e., the solution, so that the former determine the latter. In this book, we will mostly analyze problems for which the solution exists and is unique. If, in addition, the solution depends continuously on the data, i.e., if for a vanishing perturbation of the data the corresponding perturbation of the solution will also be vanishing, then the problem is said to be well-posed. A somewhat more subtle characterization of the problem, on top of its well-posedness, is known as conditioning. It has to do with quantifying the sensitivity of the solution, or some of its key characteristics, to perturbations of the input data. This sensitivity may vary strongly for different problems that could otherwise look very similar. If it is "low" (weak), then the problem is said to be well conditioned; if, conversely, the sensitivity is "high," then the problem is ill conditioned. The notions of low and high are, of course, problem-specific. We emphasize that the concept of conditioning pertains to both continuous and discrete problems. Typically, ill conditioned problems not only require excessively accurate definition of the input data, but also appear more difficult for computations.

Consider, for example, the quadratic equation x² − 2ax + 1 = 0 for |a| > 1. It has two real roots that can be expressed as functions of the argument a:

x_{1,2} = a ± √(a² − 1).

We will interpret a as the datum in the problem, and x_1 = x_1(a) and x_2 = x_2(a) as the corresponding solution. Clearly, the sensitivity of the solution to perturbations of a can be characterized by the magnitude of the derivatives

dx_{1,2}/da = 1 ± a/√(a² − 1).

Indeed, Δx_{1,2} ≈ (dx_{1,2}/da) Δa. We can easily see that the derivatives dx_{1,2}/da are small for large |a|, but they become large when a approaches 1. We can therefore conclude that the problem of finding the roots of x² − 2ax + 1 = 0 is well conditioned when |a| ≫ 1, and ill conditioned when |a| is close to 1. We should also note that conditioning can be improved if, instead of the original quadratic equation, we consider its equivalent x² − (β + β⁻¹)x + 1 = 0, where β = a + √(a² − 1). In this case, x_1 = β and x_2 = β⁻¹; the two roots coincide for |β| = 1, or equivalently, |a| = 1. However, the problem of evaluating β = β(a) is still ill conditioned near |a| = 1.

Our next example involves a simple ordinary differential equation. Let y = y(t) be the concentration of some substance at the time t, and assume that it satisfies:

dy/dt − 10y = 0.

Let us take an arbitrary t_0, 0 ≤ t_0 ≤ 1, and perform an approximate measurement of the actual concentration y_0 = y(t_0) at this moment of time, thus obtaining ŷ_0: y|_{t=t_0} = ŷ_0. Our overall task will be to determine the concentration y = y(t) at all other moments of time t from the interval [0, 1].
If we knew the quantity y_0 = y(t_0) exactly, then we could have used the exact formula available for the concentration:

y(t) = y_0 e^{10(t−t_0)}.   (1.8)

We, however, only know the approximate value ŷ_0 ≈ y_0 of the unknown quantity y_0. Therefore, instead of (1.8), the next best thing is to employ the approximate formula:

y*(t) = ŷ_0 e^{10(t−t_0)}.   (1.9)

Clearly, the error y* − y of the approximate formula (1.9) is given by:

y*(t) − y(t) = (ŷ_0 − y_0) e^{10(t−t_0)},  0 ≤ t ≤ 1.

Assume now that we need to measure y_0 to the accuracy δ, |ŷ_0 − y_0| < δ, that would be sufficient to guarantee an initially prescribed tolerance ε for determining y(t) everywhere on the interval 0 ≤ t ≤ 1, i.e., would guarantee the error estimate:

|y*(t) − y(t)| < ε,  0 ≤ t ≤ 1.

It is easy to see that max_{0≤t≤1} |y*(t) − y(t)| = |y*(1) − y(1)| = |ŷ_0 − y_0| e^{10(1−t_0)}. Consequently, to guarantee the tolerance ε it is sufficient to require that δ e^{10(1−t_0)} ≤ ε, i.e., δ ≤ ε e^{−10(1−t_0)}. The closer t_0 is to the right endpoint t = 1, the weaker this accuracy requirement becomes; in this sense, the problem is conditioned best when the measurement is taken at t_0 = 1.
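The dependence of the required measurement accuracy δ on t_0 can be tabulated directly; in this sketch the tolerance ε = 10⁻³ is an arbitrary illustrative value:

```python
import math

TOL = 1e-3   # prescribed tolerance for y(t) on [0, 1] (assumed value)

for t0 in (0.0, 0.5, 0.9, 1.0):
    amplification = math.exp(10 * (1 - t0))   # worst-case factor, reached at t = 1
    delta = TOL / amplification               # required measurement accuracy
    print(f"t0={t0:3.1f}  amplification={amplification:10.1f}  delta={delta:.2e}")
```

A measurement at t_0 = 0 must be roughly e¹⁰ ≈ 2.2·10⁴ times more accurate than the prescribed tolerance, whereas a measurement at t_0 = 1 needs only δ ≤ ε.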
1.3 Error

1.3.1 Unavoidable Error

Suppose that we need to evaluate the quantity y = sin x_0 for a given value of the argument x = x_0. Assume, however, that the function sin x itself is only available through approximate measurement: instead of sin x we can only obtain a function f(x) known to within some accuracy ε > 0:

sin x − ε ≤ f(x) ≤ sin x + ε.   (1.11)

Let the value of the argument x = x_0 be also measured approximately: x = x_0*, so that regarding the actual x_0 we can only say that

x_0* − δ ≤ x_0 ≤ x_0* + δ,   (1.12)

where δ > 0 characterizes the accuracy of the measurement. One can easily see from Figure 1.1 that any point on the interval [a, b] of the variable y, where a = sin(x_0* − δ) − ε and b = sin(x_0* + δ) + ε, can serve in the capacity of y = f(x_0). Clearly, by taking an arbitrary y* ∈ [a, b] as the approximate value of y = f(x_0), one can always guarantee the error estimate:

|y − y*| ≤ |b − a|.   (1.13)

FIGURE 1.1: Unavoidable error.

For the given uncertainty in the input data, see formulae (1.11) and (1.12), this estimate cannot be considerably improved. In fact, the best error estimate that one can guarantee is obtained by choosing y* exactly in the middle of the interval [a, b]:

y* = y*_opt = (a + b)/2.

From Figure 1.1 we then conclude that

|y − y*_opt| ≤ |b − a|/2.   (1.14)
This inequality transforms into an exact equality when y(x_0) = a or when y(x_0) = b. As such, the quantity |b − a|/2 is precisely the unavoidable (or irreducible) error, i.e., the minimum error content that will always be present in the solution and that cannot be "dodged" no matter how the approximation y* is actually chosen, simply because of the uncertainty that exists in the input data. For the optimal choice of the approximate solution y*_opt the smallest error (1.14) can be guaranteed; otherwise, the appropriate error estimate is (1.13). We see, however, that the optimal error estimate (1.14) is not that much better than the general estimate (1.13). We will therefore stay within reason if we interpret any arbitrary point y* ∈ [a, b], rather than only y*_opt, as an approximate solution for y(x_0) obtained within the limits of the unavoidable error. In so doing, the quantity |b − a| shall replace |b − a|/2 of (1.14) as the estimate of the unavoidable error.

Along with the simplest illustrative example of Figure 1.1, let us consider another example that is a little more realistic and involves one of the most common problem formulations in numerical analysis, namely, that of reconstructing a function of a continuous argument from its tabulated values at a discrete set of points. More precisely, let the values f(x_k) of the function f = f(x) be known at the equidistant grid nodes x_k = kh, h > 0, k = 0, ±1, ±2, …. Let us also assume that the first derivative of f(x) is bounded everywhere: |f'(x)| ≤ 1, and that together with f(x_k), this is basically all the information that we have about f(x). We need to be able to obtain the (approximate) value of f(x) at an arbitrary "intermediate" point x that does not necessarily coincide with any of the nodes x_k. A large variety of methods have been developed in the literature for solving this problem.
Later, we will consider interpolation by means of algebraic (Chapter 2) and trigonometric (Chapter 3) polynomials. There are other ways of building the approximating polynomials, e.g., the least squares fit, and there are other types of functions that can be used as a basis for the approximation, e.g., wavelets. Each specific method will obviously have its own accuracy. We, however, are going to show that irrespective of any particular technique used for reconstructing f(x), there will always be error due to incomplete specification of the input data. This error merely reflects the uncertainty in the formulation; it is unavoidable and cannot be suppressed by any "smart" choice of the reconstruction procedure.

Consider the simplest case f(x_k) = 0 for all k = 0, ±1, ±2, …. Clearly, the function f_1(x) ≡ 0 has the required trivial table of values, and also |f_1'(x)| ≤ 1. Along with f_1(x), it is easy to find another function that would satisfy the same constraints, e.g., f_2(x) = (h/π) sin(πx/h). Indeed, f_2(x_k) = 0, and |f_2'(x)| = |cos(πx/h)| ≤ 1. We therefore see that there are at least two different functions that cannot be told apart based on the available information. Consequently, the error max_x |f_1(x) − f_2(x)| = h/π = O(h) is unavoidable when reconstructing the function f(x), given its tabulated values f(x_k) and the fact that its first derivative is bounded, no matter what specific reconstruction methodology may be employed. For more on the notion of the unavoidable error in the context of reconstructing continuous functions from their discrete values, see Section 2.2.4 of Chapter 2.
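The two functions in this argument can be checked numerically; the sketch below (with the arbitrary choice h = 0.1) verifies that f_2(x) = (h/π) sin(πx/h) vanishes at the grid nodes, has derivative bounded by 1, and deviates from f_1 ≡ 0 by h/π between the nodes:

```python
import math

h = 0.1

def f2(x):
    # f2 vanishes at every node x_k = k*h, yet is not identically zero
    return (h / math.pi) * math.sin(math.pi * x / h)

def f2_prime(x):
    # derivative of f2: cos(pi*x/h), bounded by 1 in absolute value
    return math.cos(math.pi * x / h)

grid_residual = max(abs(f2(k * h)) for k in range(-5, 6))
print(grid_residual)            # zero up to roundoff
print(f2(h / 2), h / math.pi)   # both equal h/pi: the deviation from f1 = 0
```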
1.3.2 Error of the Method
Let y* = sin x_0*. The number y* belongs to the interval [a, b]; it can be considered a non-improvable approximate solution of the first problem analyzed in Section 1.3.1. For this solution, the error satisfies estimate (1.13) and is unavoidable. The point y* = sin x_0* has been selected among other points of the interval [a, b] only because it is given by a formula convenient for subsequent analysis.

To evaluate the quantity y* = sin x_0* on a computer, let us use Taylor's expansion for the function sin x:

sin x = x − x³/3! + x⁵/5! − ⋯.

Thus, for computing y* one can take one of the following approximate expressions:
y* ≈ y_1* = x_0*,

y* ≈ y_2* = x_0* − (x_0*)³/3!,

⋯

y* ≈ y_n* = Σ_{k=1}^{n} (−1)^{k−1} (x_0*)^{2k−1} / (2k−1)!.   (1.15)

By choosing a specific formula from (1.15) for the approximate evaluation of y*, we select our method of computation. The quantity |y* − y_n*| is then known as the error of the computational method. In fact, we are considering a family of methods parameterized by the integer n. The larger the n, the smaller the error, see (1.15); and by taking a sufficiently large n we can always make sure that the associated error will be smaller than any initially prescribed threshold. It, however, does not make sense to drive the computational error much further down than the level of the unavoidable error. Therefore, the number n does not need to be taken excessively large. On the other hand, if n is taken too small, so that the error of the method appears much larger than the unavoidable error, then one can say that the chosen method does not fully utilize the information about the solution that is contained in the input data, or equivalently, loses a part of this information.
1.3.3 Roundoff Error
Assume that we have fixed the computational method by selecting a particular n in (1.15), i.e., by setting y* ≈ y_n*. When calculating this y_n* on an actual computer, we will, generally speaking, obtain a different value ỹ_n* due to rounding. Rounding is an intrinsic feature of the floating-point arithmetic on computers, as they only operate with numbers that can be represented as finite binary fractions of a given fixed length. As such, all other real numbers (e.g., infinite fractions) may only be stored approximately in the computer memory, and the corresponding approximation procedure is known as rounding. The error |y_n* − ỹ_n*| is called the roundoff error. This error shall not noticeably exceed the error of the computational method. Otherwise, a loss of the overall accuracy will be incurred due to the roundoff error.

Exercises¹
1. Assume that we need to calculate the value y = f(x) of some function f(x), while there is an uncertainty in the input data x*: x* − δ ≤ x ≤ x* + δ. How does the corresponding unavoidable error depend on x* and on δ for the following functions:

a) f(x) = sin x;

b) f(x) = ln x, where x > 0?

For what values of x*, obtained by approximately measuring the "loose" quantity x with the accuracy δ, can one guarantee only a one-sided upper bound for ln x in problem b)? Find this upper bound.

2. Let the function f(x) be defined by its values sampled on the grid x_k = kh, where h = 1/N and k = 0, ±1, ±2, …. In addition to these discrete values, assume that

max_x |f″(x)| ≤ 1.

Prove that as the available input data are incomplete, they do not, generally speaking, allow one to reconstruct the function at an arbitrary given point x with accuracy better than the unavoidable error ε(h) = h²/π². Hint. Show that along with the function f(x) ≡ 0, which obviously has all its grid values equal to zero, another function, φ(x) = (h²/π²) sin(Nπx), also has all its grid values equal to zero, and satisfies the condition max_x |φ″(x)| ≤ 1, while max_x |f(x) − φ(x)| = h²/π².
3. Let f = f(x) be a function such that the absolute value of its second derivative does not exceed 1. Show that the approximation error for the formula:

f'(x) ≈ (f(x + h) − f(x)) / h

will not exceed h.
4. Let f = f(x) be a function that has bounded second derivative: ∀x: |f″(x)| ≤ 1. For any x, the value of the function f(x) is measured and comes out to be equal to some f*(x); in so doing, we assume that the accuracy of the measurement guarantees the following estimate: |f(x) − f*(x)| ≤ ε, ε > 0. Suppose now that we need to approximately evaluate the first derivative f'(x).
¹Hereafter, we will be using the symbol * to indicate the increased level of difficulty for a given problem.
f (x + h)  f* (x) . J' (x) � * h b) S how that g iven the ex is ting u ncertainty in the inpu t data, the u navoidable error of evalu ating J' (x) is at leas t 6 ( J£) , no matter w hat sp ecifi c method is u s ed. Cons ider tw o fu nctions, f(x) == 0 and f* (x) = c s in(xl J£). Clearly, the abs olu te valu e of the s econd derivativefor either of thes e tw ofu nctions does not ex ceed 1 . M oreover, max If(x)  * (x) I S; c. A t the s ame time,
Hint.
x
*
I d/dx
f
/
d l I
x
I
= 6 (J£) . dx = J£ cos J£
B y compar ing the s olu tions of su bp roblems a) and b), verif y that the sp ecifi c app rox i matefor mu la forJ' (x) g iven in a) y ields the er ror of the ir redu cible order 6 ( J£) ; and als o s how that the u navoidable error is, in fact, ex actly of order 6 ( J£).
5. For storing the information about a linear function f(x) = kx + b, α ≤ x ≤ β, that satisfies the inequalities 0 ≤ f(x) ≤ 1, we use a table with six available cells, such that one of the ten digits 0, 1, 2, …, 9 can be written into each cell. What is the unavoidable error of reconstructing the function, if the foregoing six cells of the table are filled according to one of the following recipes?

a) The first three cells contain the first three digits that appear right after the decimal point when the number f(α) is represented as a normalized decimal fraction; and the remaining three cells contain the first three digits after the decimal point in the normalized decimal fraction for f(β).
b) Let α = 0 and β = 10². The first three cells contain the first three digits in the normalized decimal fraction for k, the fourth cell contains either 0 or 1 depending on the sign of k, and the remaining two cells contain the first two digits after the decimal point in the normalized decimal fraction for b.

c)* Show that irrespective of any specific strategy for filling out the aforementioned six-cell table, the unavoidable error of reconstructing the linear function f(x) = kx + b is always at least 10⁻³. Hint. Build more than 10⁶ different functions from the foregoing class, such that the maximum modulus of the difference between any two of them will be at least 10⁻³.
1.4 On Methods of Computation
Suppose that a mathematical model is constructed for studying a given object or phenomenon, and subsequently this model is analyzed using mathematical and computational means. For example, under certain assumptions the following problem:
d²y/dt² + y = 0,  y(0) = 0,  (dy/dt)|_{t=0} = 1,   (1.16)
can provide an adequate mathematical model for small oscillations of a pendulum, where y(t) is the pendulum displacement from its equilibrium at the time t. A study of harmonic oscillations based on this mathematical model, i.e., on the Cauchy problem (1.16), can benefit from a priori knowledge about the physical nature of the object of study. In particular, one can predict, based on physical reasoning, that the motion of the pendulum will be periodic. However, once the mathematical model (1.16) has been built, it becomes a separate and independent object that can be investigated using any available mathematical tools, including those that have little or no relation to the physical origins of the problem. For example, the numerical value of the solution y = sin t to problem (1.16) at any given moment of time t = z can be obtained by expanding sin z into the Taylor series:

sin z = z − z³/3! + z⁵/5! − ⋯,

and subsequently taking its appropriate partial sum. In so doing, representation of the function sin t as a power series hardly admits any tangible physical interpretation.

In general, when solving a given problem on the computer, many different methods, or different algorithms, can be used. Some of them may prove far superior to others. In subsequent parts of the book, we are going to describe a number of established, robust and efficient algorithms for frequently encountered classes of problems in numerical analysis. In the meantime, let us briefly explain how the algorithms may differ. Assume that for computing the solution y to a given problem we can employ two algorithms, A_1 and A_2, that yield the approximate solutions y_1* = A_1(X) and y_2* = A_2(X), respectively, where X denotes the entire required set of the input data. In so doing, a variety of situations may occur.
1.4.1 Accuracy
The algorithm A_2 may be more accurate than the algorithm A_1, that is:

|y − y_1*| ≫ |y − y_2*|.

For example, let us approximately evaluate y = sin x|_{x=0.1} using the expansion:

y ≈ y_n* = Σ_{k=1}^{n} (−1)^{k−1} x^{2k−1} / (2k−1)!.   (1.17)

The algorithm A_1 will correspond to taking n = 1 in formula (1.17), and the algorithm A_2 will correspond to taking n = 2 in formula (1.17). Then, obviously,

|sin 0.1 − y_1*| ≫ |sin 0.1 − y_2*|.
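This comparison is easy to reproduce; the sketch below evaluates the partial sums (1.17) for n = 1 and n = 2 at x = 0.1 (the helper name is illustrative):

```python
import math

def partial_sum(x, n):
    # y_n* from formula (1.17): sum of the first n terms of the sine series
    return sum((-1) ** (k - 1) * x ** (2 * k - 1) / math.factorial(2 * k - 1)
               for k in range(1, n + 1))

x = 0.1
y1 = partial_sum(x, 1)   # algorithm A1: y1* = x
y2 = partial_sum(x, 2)   # algorithm A2: y2* = x - x^3/3!
print(abs(math.sin(x) - y1))   # error of A1, about 1.7e-4
print(abs(math.sin(x) - y2))   # error of A2, about 8.3e-8
```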
1.4.2 Operation Count
Both algorithms may provide the same accuracy, but the computation of y_1* = A_1(X) may require many more arithmetic operations than the computation of y_2* = A_2(X). Suppose, for example, that we need to find the value of

y = 1 + x + x² + ⋯ + x^1023

(clearly, y = (1 − x^1024)/(1 − x)) for x = 0.99. Let A_1 be the algorithm that would perform the computations directly using the given formula, i.e., by raising 0.99 to the powers 1, 2, …, 1023 one after another, and subsequently adding the results. Let A_2 be the algorithm that would perform the computations according to the formula:

y = (1 − 0.99^1024) / (1 − 0.99).

The accuracy of these two algorithms is the same: both are absolutely accurate provided that there are no roundoff errors. However, the first algorithm requires considerably more arithmetic operations, i.e., it is computationally more expensive. Namely, for successively computing x, x² = x·x, …, x^1023, one will have to perform 1022 multiplications. On the other hand, to compute 0.99^1024 one only needs 10 multiplications: 0.99² = 0.99·0.99, 0.99⁴ = 0.99²·0.99², …, 0.99^1024 = (0.99^512)².
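A sketch that makes the multiplication counts explicit (the function names are illustrative):

```python
def sum_powers_directly(x, n):
    """Algorithm A1: accumulate y = 1 + x + ... + x^(n-1) term by term."""
    total, power, mults = 1.0 + x, x, 0
    for _ in range(n - 2):
        power *= x            # one multiplication per new power of x
        mults += 1
        total += power
    return total, mults       # n - 2 = 1022 multiplications for n = 1024

def power_by_squaring(x, k):
    """x raised to the power 2**k via k successive squarings."""
    for _ in range(k):
        x *= x                # 0.99^2, 0.99^4, ..., 0.99^1024
    return x

y1, mults = sum_powers_directly(0.99, 1024)
y2 = (1.0 - power_by_squaring(0.99, 10)) / (1.0 - 0.99)   # algorithm A2
print(mults, abs(y1 - y2))    # 1022 multiplications; results agree to roundoff
```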
1.4.3 Stability
The algorithms, again, may yield the same accuracy, but A_1(X) may be computationally unstable, whereas A_2(X) may be stable. For example, to evaluate y = sin x with the prescribed tolerance ε = 10⁻³, i.e., to guarantee |y − y*| ≤ 10⁻³, let us employ the same finite Taylor expansion as in formula (1.17):

y_1* = y_1*(x) = Σ_{k=1}^{n} (−1)^{k−1} x^{2k−1} / (2k−1)!,   (1.18)

where n = n(ε) is to be chosen to ensure that the inequality |y − y_1*| ≤ 10⁻³ will hold. The first algorithm A_1 will compute the result directly according to (1.18). If |x| ≤ π/2, then by noticing that the following inequality holds already for n = 5:

(1/(2n−1)!) (π/2)^{2n−1} < 10⁻³,

we can reduce the sum (1.18) to

y_1* = A_1(x) = x − x³/3! + x⁵/5! − x⁷/7! + x⁹/9!.
Clearly, the computations by this formula will only be weakly sensitive to roundoff errors when evaluating each term on the right-hand side. Moreover, as for |x| ≤ π/2 those terms rapidly decay when the power grows, there is no room for the cancellation of significant digits, and the algorithm A_1 will be computationally stable.

Consider now |x| ≫ 1; for example, x = 100. Then, for achieving the prescribed accuracy of ε = 10⁻³, the number n should satisfy the inequality:

100^{2n−1} / (2n−1)! < 10⁻³,

which yields an obvious conservative lower bound for n: n > 49. This implies that the terms in sum (1.18) become small only for sufficiently large n. At the same time, the first few leading terms in this sum will be very large. A small relative error committed when computing those terms will result in a large absolute error; and since taking a difference of large quantities to evaluate a small quantity sin x (|sin x| ≤ 1) is prone to the loss of significant digits (see Section 1.4.4), the algorithm A_1 in this case will be computationally unstable.

On the other hand, in the case of large x a stable algorithm A_2 for evaluating sin x is also easy to build. Let us represent a given x in the form x = lπ + z, where |z| ≤ π/2 and l is integer. Then,
sin x = (−1)^l sin z,

and we can set

y_2* = A_2(x) = (−1)^l (z − z³/3! + z⁵/5! − z⁷/7!).
This algorithm has the same stability properties as the algorithm A_1 for |x| ≤ π/2.
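The contrast between the two algorithms can be reproduced with the following sketch; here the terms of the Taylor sum are generated recursively (to avoid overflow of the large powers), and the term counts are chosen ad hoc for illustration:

```python
import math

def sin_taylor(x, n):
    """Algorithm A1: partial sum of the sine series with n terms."""
    term, total = x, x
    for k in range(1, n):
        term *= -x * x / ((2 * k) * (2 * k + 1))   # next term of the series
        total += term
    return total

def sin_reduced(x, n=5):
    """Algorithm A2: reduce x = l*pi + z with |z| <= pi/2, then sum in z."""
    l = round(x / math.pi)          # nearest integer multiple of pi
    z = x - l * math.pi             # reduced argument, |z| <= pi/2
    return (-1) ** l * sin_taylor(z, n)

x = 100.0
print(abs(math.sin(x) - sin_taylor(x, 200)))  # enormous: cancellation of huge terms
print(abs(math.sin(x) - sin_reduced(x)))      # well within the 1e-3 tolerance
```

Even with far more than 49 terms, the direct sum at x = 100 is destroyed by cancellation among terms of magnitude up to about 10⁴², while the reduced-argument variant stays accurate.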
1.4.4 Loss of Significant Digits
Most typically, numerical instability manifests itself through a strong amplification of the small roundoff errors in the course of computations. A key mechanism for the amplification is the loss of significant digits, which is a purely computer-related phenomenon that only occurs because the numbers inside a computer are represented as finite (binary) fractions (see Section 1.3.3). If computers could operate with infinite fractions (no rounding), then this phenomenon would not take place.
Consider two real numbers a and b represented in a computer by finite fractions with m significant digits after the decimal point:

a = 0.a_1 a_2 a_3 … a_m,  b = 0.b_1 b_2 b_3 … b_m.

We are assuming that both numbers are normalized and that they have the same exponent, which we are leaving out for simplicity. Suppose that these two numbers are close to one another, i.e., that the first k out of the total of m digits coincide:

a_1 = b_1,  a_2 = b_2,  …,  a_k = b_k.

Then the difference a − b will only have m − k < m significant digits (provided that a_{k+1} > b_{k+1}, which we, again, assume for simplicity):

a − b = 0.00…0 c_{k+1} … c_m   (k zeros after the decimal point).

The reason for this reduction from m to m − k, which is called the loss of significant digits, is obvious. Even though the actual numbers a and b may be represented by fractions much longer than m digits, or by infinite fractions, the computer simply has no information about anything beyond digit number m. Even if the result a − b is subsequently normalized:

a − b = 0.c_{k+1} … c_m c̃_{m+1} … c̃_{m+k} · β^{−k},
where f3 is the radix, or base (f3 = 2 for all computers), then the digits from cm + I through Cm +k will still be completely artificial and will have nothing to do with the true representation of a  b. It is clear that the loss of significant digits may lead to a very considerable degra dation of the overall accuracy. The error once committed at an intermediate stage of the computation will not disappear and will rather "propagate" further and con taminate the subsequent results. Therefore, when organizing the computations, it is not advisable to compute small numbers as differences of large numbers. For example, suppose that we need to evaluate the function = 1  cosx for x which is close to 1. Then cosx will also be close to and significant digits could be lost when computing A l (x) 1  cosx. Of course, there is an easy fix for this difficulty. Instead of the original formula we should use = A2 (x) = 2 sin2 � . The loss o f significant digits may cause an instability even i f the original continu ous problem is well conditioned. Indeed, assume that we need to compute the value of the function = Vi  v:x=1. Conditioning of this problem can be judged by evaluating the maximum ratio of the resulting relative error in the solution over the eliciting relative error in the input data:
1,
=
sup fu;
f(x) I1LVru:I1//llfxll 1!'(x) I BIfI �
f(x)
f(x)
1 1_ 2 Vi
=�
_
1_
_
1
Ixl
v:x=1 I Vi  v:x=11
=
Ixl
2 Viv:x=1 '
x
For large x the previous quantity is approximately equal to 1/2, which means that the problem is perfectly well conditioned. Yet we can expect to incur a loss of significant digits when x ≫ 1. Consider, for example, x = 12345, and assume that we are operating in a six-digit decimal arithmetic. Then:

√(x−1) = 111.10355529865… ≈ 111.104,
√x = 111.10805551354… ≈ 111.108,

and consequently, f(x) ≈ 111.108 − 111.104 = 0.004. At the same time, the true answer is f(x) = 0.004500214891…, which implies that our approximate computation carries an error of roughly 11%. To understand where this error is coming from, consider f as a function of two arguments: f = f(t_1, t_2) = t_1 − t_2, where t_1 = √x and t_2 = √(x−1). Conditioning with respect to the second argument can be estimated as follows:

|∂f/∂t_2| · |t_2| / |f| = t_2 / |t_1 − t_2|,

and we conclude that this number is large when t_2 is close to t_1, which is precisely the case for large x. In other words, although the entire function f(x) is well conditioned, there is an intermediate stage that is ill conditioned, and it gives rise to large errors in the course of the computation. This example illustrates why it is practically impossible to design a stable numerical procedure for an ill conditioned continuous problem. A remedy to overcome the previous hurdle is quite easy to find:

f(x) = A_2(x) = 1/(√x + √(x−1)) ≈ 1/(111.104 + 111.108) = 0.00450020701… .

This is a considerably more accurate answer.

Yet another example is given by the same quadratic equation x^2 − 2ax + 1 = 0 as we considered in Section 1.2. The roots x_{1,2}(a) = a ± √(a^2 − 1) have been found to be ill conditioned for a close to 1. However, for a ≫ 1 both roots are clearly well conditioned. In particular, for x_2(a) = a − √(a^2 − 1) we have:

|dx_2/da| · a/|x_2| = a/√(a^2 − 1) → 1  as a → +∞.

Nevertheless, the computation by the formula x_2(a) = a − √(a^2 − 1) will obviously be prone to the loss of significant digits for large a. A cure may be to compute x_1(a) = a + √(a^2 − 1) first, and then x_2 = 1/x_1. Note that even for the equation x^2 − 2ax − 1 = 0, for which both roots x_{1,2}(a) = a ± √(a^2 + 1) are well conditioned for all a, the computation of x_2(a) = a − √(a^2 + 1) is still prone to the loss of significant digits and, as such, to instability.
1.4.5 Convergence
Finally, the algorithm may be either convergent or divergent. Suppose we need to compute the value of y = ln(1 + x). Let us employ the power series:

y = ln(1 + x) = x − x^2/2 + x^3/3 − x^4/4 + …   (1.19)

and set

y(x) ≈ y_n*(x) = Σ_{k=1}^{n} (−1)^(k+1) x^k/k.   (1.20)

In doing so, we will obtain a method of approximately evaluating y = ln(1 + x) that will depend on n as a parameter. If |x| ≤ q < 1, then lim_{n→∞} y_n*(x) = y(x), i.e., the error committed when computing y(x) according to formula (1.20) will be vanishing as n increases. If, however, x > 1, then lim_{n→∞} y_n*(x) = ∞, because the convergence radius for the series (1.19) is r = 1. In this case the algorithm based on formula (1.20) diverges, and cannot be used for computations.
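This dichotomy is easy to observe in a short computation. The sketch below (the helper name is ours) evaluates the partial sums (1.20) for a convergent and a divergent choice of x:

```python
import math

def ln1p_partial_sum(x, n):
    """Partial sum y_n* = sum_{k=1}^{n} (-1)^(k+1) x^k / k  (formula (1.20))."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, n + 1))

# |x| < 1: the partial sums approach ln(1 + x) as n grows.
for n in (5, 20, 80):
    print(n, abs(ln1p_partial_sum(0.5, n) - math.log(1.5)))

# x > 1: the terms x^k / k grow without bound, so the partial sums diverge.
print(abs(ln1p_partial_sum(2.0, 80)))
```

For x = 0.5 the error decays geometrically, while for x = 2 the eightieth partial sum already has magnitude on the order of 10^22.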
1.4.6 General Comments
Basically, the properties of continuous well-posedness and numerical stability, as well as those of ill and well conditioning, are independent. There are, however, certain relations between these concepts.

• First of all, it is clear that no numerical method can ever fix a continuous ill-posedness.²

• For a well-posed continuous problem there may be stable and unstable discretizations.

• Even for a well conditioned continuous problem one can still obtain both stable and unstable discretizations.

• For an ill-conditioned continuous problem a discretization will typically be unstable.

Altogether, we can say that numerical methods cannot improve things in the perspective of well-posedness and conditioning.

In the book, we are going to discuss some other characteristics of numerical algorithms as well. We will see algorithms that admit easy parallelization, and those that are limited to sequential computations; algorithms that automatically adapt to specific characteristics of the input data, such as their smoothness, and those that only partially take it into account; algorithms that have a straightforward logical structure, as well as more elaborate ones.

²The opposite of well-posedness, when there is no continuous dependence of the solution on the data.
Exercises
1. Propose an algorithm for evaluating y = ln(1 + x) that would also apply to x > 1.

2. Show that the intermediate stages of the algorithm A_2 from page 17 are well conditioned, and there is no danger of losing significant digits when computing:

f(x) = A_2(x) = 1/(√x + √(x−1)).
3. Consider the problem of evaluating the sequence of numbers x_0, x_1, …, x_N that satisfy the difference equations:

2x_n − x_{n+1} = 1 + n^2/N^2,  n = 0, 1, …, N − 1,

and the additional condition (1.21). We introduce two algorithms for computing x_n. First, let

x_n = u_n + c·v_n,  n = 0, 1, …, N.   (1.22)

Then, in the algorithm A_1 we define u_n, n = 0, 1, …, N, as the solution of the system:

2u_n − u_{n+1} = 1 + n^2/N^2,  n = 0, 1, …, N − 1,   (1.23)

subject to the initial condition:

u_0 = 0.   (1.24)

Consequently, the sequence v_n, n = 0, 1, …, N, is defined by the equalities:

2v_n − v_{n+1} = 0,  n = 0, 1, …, N − 1,   (1.25)
v_0 = 1,   (1.26)

and the constant c of (1.22) is obtained from the condition (1.21). In so doing, the actual values of u_n and v_n are computed consecutively using the formulae:

u_{n+1} = 2u_n − (1 + n^2/N^2),  n = 0, 1, …,
v_{n+1} = 2^(n+1),  n = 0, 1, … .

In the algorithm A_2, u_n, n = 0, 1, …, N, is still defined as the solution to system (1.23), but instead of the condition (1.24) an alternative condition u_N = 0 is employed. The sequence v_n, n = 0, 1, …, N, is again defined as a solution to system (1.25), but instead of the condition (1.26) we use v_N = 1.

a) Verify that the second algorithm, A_2, is stable, while the first one, A_1, is ("violently") unstable.

b) Implement both algorithms on the computer and try to compare their performance for N = 10 and for N = 100.
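As a hint for part b), the sketch below runs only the u_n recursions of A_1 and A_2 (the constant c and condition (1.21) are left out); it already exposes the instability of the forward march:

```python
def u_forward(N):
    # Algorithm A1: u_0 = 0, march u_{n+1} = 2 u_n - (1 + n^2/N^2) forward;
    # any component along the homogeneous solution 2^n doubles at each step.
    u = [0.0] * (N + 1)
    for n in range(N):
        u[n + 1] = 2.0 * u[n] - (1.0 + (n / N) ** 2)
    return u

def u_backward(N):
    # Algorithm A2: u_N = 0, march backward, u_n = (u_{n+1} + 1 + n^2/N^2)/2;
    # errors are halved at every step, so the recursion is stable.
    u = [0.0] * (N + 1)
    for n in range(N - 1, -1, -1):
        u[n] = 0.5 * (u[n + 1] + 1.0 + (n / N) ** 2)
    return u

for N in (10, 100):
    print(N, max(map(abs, u_forward(N))), max(map(abs, u_backward(N))))
```

Already for N = 100 the forward values reach magnitudes near 2^100 ≈ 10^30, so the final combination x_n = u_n + c·v_n in A_1 must cancel huge numbers and loses all significant digits, whereas A_2 operates throughout on quantities of order one.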
Part I Interpolation of Functions. Quadratures
One of the key concepts in mathematics is that of a function. In the simplest case, the function y = f(x), a ≤ x ≤ b, can be specified in closed form, i.e., defined by means of a finite formula, say, y = x^2. This formula can subsequently be transformed into a computer code that will calculate the value of y = x^2 for every given x. In real life settings, however, the functions of interest are rarely available in closed form. Instead, a finite array of numbers, commonly referred to as the table, would often be associated with the function y = f(x). By processing the numbers from the table in a particular prescribed way, one should be able to obtain an approximate value of the function f(x) at any point x. For instance, a table can contain several leading coefficients of a power series for f(x). In this case, processing the table would mean calculating the corresponding partial sum of the series. Let us, for example, take the function

y = e^x,  0 ≤ x ≤ 1,

for which the power series converges for all x, and consider the table

1, 1/1!, 1/2!, …, 1/n!

of its n + 1 leading Taylor coefficients, where n > 0 is given. The larger the n, the more accurately one can reconstruct the function f(x) = e^x from this table. In so doing, the formula

e^x ≈ 1 + x/1! + x^2/2! + … + x^n/n!

is used for processing the table.

In most cases, however, the table that is supposed to characterize the function y = f(x) would not contain its Taylor coefficients, and would rather be obtained by sampling the values of this function at some finite set of points x_0, x_1, …, x_n ∈ [a, b]. In practice, sampling can be rendered by either measurements or computations. This naturally gives rise to the problem of reconstructing (e.g., interpolating) the function f(x) at the "intermediate" locations x that do not necessarily coincide with any of the nodes x_0, x_1, …, x_n.

The two most widely used and most efficient interpolation techniques are algebraic interpolation and trigonometric interpolation. We are going to analyze both of them. In addition, in the current Part I of the book we will also consider the problem of evaluating definite integrals of a given function when the latter, again, is specified by a finite table of its numerical values. The motivation behind considering this problem along with interpolation is that the main approaches to the approximate evaluation of definite integrals, i.e., to obtaining the so-called quadrature formulae, are very closely related to the interpolation techniques.

Before proceeding further, let us also mention several books for additional reading on the subject: [Hen64, IK66, CdB80, Atk89, PT96, QSS00, Sch02, DB03].
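For the e^x example above, processing the table of Taylor coefficients can be sketched as follows (Horner-style evaluation of the partial sum; the function name is ours):

```python
import math

def exp_from_table(x, n):
    """Approximate e^x from the table 1, 1/1!, ..., 1/n! of Taylor coefficients."""
    coeffs = [1.0 / math.factorial(k) for k in range(n + 1)]  # the "table"
    # Horner-style evaluation of the partial sum 1 + x/1! + ... + x^n/n!.
    s = 0.0
    for c in reversed(coeffs):
        s = s * x + c
    return s

for n in (2, 5, 10):
    print(n, abs(exp_from_table(1.0, n) - math.e))  # error shrinks with n
```

The error at x = 1 is governed by the first omitted term 1/(n+1)!, in line with the remark that a larger n reconstructs f(x) = e^x more accurately.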
Chapter 2
Algebraic Interpolation
Let x_0, x_1, …, x_n be a given set of points, and let f(x_0), f(x_1), …, f(x_n) be the values of the function f(x) at these points (assumed known). The one-to-one correspondence

x_0      x_1      …      x_n
f(x_0)   f(x_1)   …      f(x_n)

will be called a table of values of the function f(x) at the nodes x_0, x_1, …, x_n. We need to realize, of course, that for actual computer implementations one may only use the numbers that can be represented as finite binary fractions (Section 1.3.3 of the Introduction), whereas the values f(x_j) do not necessarily have to belong to this class (e.g., √3). Therefore, the foregoing table may, in fact, contain rounded rather than true values of the function f(x).

A polynomial P_n(x) ≡ P_n(x, f, x_0, x_1, …, x_n) of degree no greater than n that has the form

P_n(x) = c_0 + c_1 x + … + c_n x^n

and coincides with f(x_0), f(x_1), …, f(x_n) at the nodes x_0, x_1, …, x_n, respectively, is called the algebraic interpolating polynomial.
2.1 Existence and Uniqueness of Interpolating Polynomial

2.1.1 The Lagrange Form of Interpolating Polynomial
THEOREM 2.1
Let x_0, x_1, …, x_n be a given set of distinct interpolation nodes, and let the values f(x_0), f(x_1), …, f(x_n) of the function f(x) be known at these nodes. There is one and only one algebraic polynomial P_n(x) ≡ P_n(x, f, x_0, x_1, …, x_n) of degree no greater than n that would coincide with the given f(x_k) at the nodes x_k, k = 0, 1, …, n.

PROOF We will first show that there may be no more than one interpolating polynomial, and will subsequently construct it explicitly. Assume that there are two algebraic interpolating polynomials, P_n^(1)(x) and P_n^(2)(x). Then, the difference between these two polynomials, R_n(x) = P_n^(1)(x) − P_n^(2)(x), is also a polynomial of degree no greater than n that vanishes at the n + 1 points x_0, x_1, …, x_n. However, for any polynomial that is not identically equal to zero, the number of roots (counting their multiplicities) cannot exceed the degree. Therefore, R_n(x) ≡ 0, i.e., P_n^(1)(x) = P_n^(2)(x), which proves uniqueness.

Let us now introduce the auxiliary polynomials

l_k(x) = [(x − x_0)…(x − x_{k−1})(x − x_{k+1})…(x − x_n)] / [(x_k − x_0)…(x_k − x_{k−1})(x_k − x_{k+1})…(x_k − x_n)],  k = 0, 1, …, n.

It is clear that each l_k(x) is a polynomial of degree no greater than n, and that the following equalities hold:

l_k(x_j) = 0 for j ≠ k,  l_k(x_k) = 1,  j = 0, 1, …, n.

Then, the polynomial P_n(x) given by the equality

P_n(x) ≡ P_n(x, f, x_0, x_1, …, x_n) = f(x_0)l_0(x) + f(x_1)l_1(x) + … + f(x_n)l_n(x)   (2.1)

is precisely the interpolating polynomial that we are seeking. Indeed, its degree is no greater than n, because each term f(x_j)l_j(x) is a polynomial of degree no greater than n. Moreover, it is clear that this polynomial satisfies the equalities P_n(x_j) = f(x_j) for all j = 0, 1, …, n. □

Let us emphasize that not only have we proven Theorem 2.1, but we have also written the interpolating polynomial explicitly using formula (2.1). This formula is known as the Lagrange form of the interpolating polynomial. There are other convenient forms of the unique interpolating polynomial P_n(x, f, x_0, x_1, …, x_n). The Newton form is used particularly often.
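Formula (2.1) translates into code almost verbatim; the following is a minimal sketch (the function name is ours), not an industrial implementation:

```python
def lagrange(xs, ys, x):
    """Evaluate the interpolating polynomial in the Lagrange form (2.1).

    xs are distinct nodes x_0..x_n, ys are the nodal values f(x_0)..f(x_n).
    """
    total = 0.0
    for k, xk in enumerate(xs):
        # Auxiliary polynomial l_k(x): equals 1 at x_k and 0 at the other nodes.
        lk = 1.0
        for j, xj in enumerate(xs):
            if j != k:
                lk *= (x - xj) / (xk - xj)
        total += ys[k] * lk
    return total

# P_2 built for f(x) = x^2 on three nodes reproduces f exactly (uniqueness):
xs = [0.0, 1.0, 2.0]
ys = [x * x for x in xs]
print(lagrange(xs, ys, 3.0))  # 9.0 = 3^2
```

By Theorem 2.1, interpolating a quadratic on three nodes must return the quadratic itself, which the example confirms.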
2.1.2 The Newton Form of Interpolating Polynomial. Divided Differences

Let f(x_a), f(x_b), f(x_c), f(x_d), etc., be values of the function f(x) at the given points x_a, x_b, x_c, x_d, etc. A Newton's divided difference of order zero of the function f(x) at the point x_k is defined as simply the value of the function at this point:

f(x_k),  k = a, b, c, d, … .

A divided difference of order one, f(x_k, x_m), of the function f(x) is defined for an arbitrary pair of points x_k and x_m (the points do not have to be neighbors, and we allow x_k ≷ x_m) through the previously introduced divided differences of order zero:

f(x_k, x_m) = [f(x_m) − f(x_k)] / (x_m − x_k).

In general, a divided difference of order n, f(x_0, x_1, …, x_n), of the function f(x) is defined through the preceding divided differences of order n − 1 as follows:

f(x_0, x_1, …, x_n) = [f(x_1, x_2, …, x_n) − f(x_0, x_1, …, x_{n−1})] / (x_n − x_0).   (2.2)

Note that all the points x_0, x_1, …, x_n in formula (2.2) have to be distinct, but they do not have to be arranged in any particular way, say, from the smallest to the largest value of x_j or vice versa.

Having defined the Newton divided differences according to (2.2), we can now represent the interpolating polynomial P_n(x, f, x_0, x_1, …, x_n) in the following Newton form:¹

P_n(x, f, x_0, x_1, …, x_n) = f(x_0) + (x − x_0)f(x_0, x_1) + …
+ (x − x_0)(x − x_1)…(x − x_{n−1}) f(x_0, x_1, …, x_n).   (2.3)
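The recursion (2.2) and the Newton form (2.3) can be prototyped as follows (function names ours); the example interpolates f(x) = x^2 + 1, for which the order-3 divided difference must vanish:

```python
def divided_differences(xs, ys):
    """Return [f(x0), f(x0,x1), ..., f(x0,...,xn)] built via recursion (2.2)."""
    table = list(ys)            # order-zero divided differences
    coeffs = [table[0]]
    for order in range(1, len(xs)):
        # One sweep produces all divided differences of the current order.
        table = [(table[i + 1] - table[i]) / (xs[i + order] - xs[i])
                 for i in range(len(table) - 1)]
        coeffs.append(table[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate the Newton form (2.3) at x."""
    total, prod = 0.0, 1.0
    for k, c in enumerate(coeffs):
        total += c * prod
        prod *= (x - xs[k])
    return total

xs = [0.0, 1.0, 2.0, 4.0]
ys = [1.0, 2.0, 5.0, 17.0]   # values of f(x) = x^2 + 1 at the nodes
c = divided_differences(xs, ys)
# Order-3 divided difference is 0 for a degree-2 function; P_n(3) = 3^2 + 1:
print(c[-1], newton_eval(xs, c, 3.0))  # 0.0 10.0
```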
Formula (2.3) itself will be proven later. In the meantime, we will rather establish several useful corollaries that it implies.

COROLLARY 2.1
The following equality holds:

P_n(x, f, x_0, x_1, …, x_n) = P_{n−1}(x, f, x_0, x_1, …, x_{n−1})
+ (x − x_0)(x − x_1)…(x − x_{n−1}) f(x_0, x_1, …, x_n).   (2.4)

PROOF Immediately follows from formula (2.3). □

COROLLARY 2.2
The divided difference f(x_0, x_1, …, x_n) of order n is equal to the coefficient c_n in front of the term x^n in the interpolating polynomial

P_n(x, f, x_0, x_1, …, x_n) = c_n x^n + c_{n−1} x^(n−1) + … + c_0.

In other words, the following equality holds:

c_n = f(x_0, x_1, …, x_n).   (2.5)

¹A more detailed account of divided differences and their role in building the interpolating polynomials can be found, e.g., in [CdB80].
PROOF It is clear that the monomial x^n on the right-hand side of expression (2.3) is multiplied by the coefficient f(x_0, x_1, …, x_n). □
COROLLARY 2.3
The divided difference f(x_0, x_1, …, x_n) may be equal to zero if and only if the quantities f(x_0), f(x_1), …, f(x_n) are nodal values of some polynomial Q_m(x) of degree m that is strictly less than n (m < n).

PROOF If f(x_0, x_1, …, x_n) = 0, then formula (2.3) implies that the degree of the interpolating polynomial P_n(x, f, x_0, x_1, …, x_n) is less than n, because according to equality (2.5) the coefficient c_n in front of x^n is equal to zero. As the nodal values of this interpolating polynomial are equal to f(x_j), j = 0, 1, …, n, we can simply set Q_m(x) = P_n(x). Conversely, as the interpolating polynomial of degree no greater than n is unique (Theorem 2.1), the polynomial Q_m(x) with nodal values f(x_0), f(x_1), …, f(x_n) must coincide with the interpolating polynomial P_n(x, f, x_0, x_1, …, x_n) = c_n x^n + c_{n−1} x^(n−1) + … + c_0. As m < n, the equality Q_m(x) = P_n(x) implies that c_n = 0. Then, according to formula (2.5), f(x_0, x_1, …, x_n) = 0. □
COROLLARY 2.4
The divided difference f(x_0, x_1, …, x_n) remains unchanged under any arbitrary permutation of its arguments x_0, x_1, …, x_n.

PROOF Due to its uniqueness, the interpolating polynomial P_n(x) will not be affected by the order of the interpolation nodes. Let x'_0, x'_1, …, x'_n be a permutation of x_0, x_1, …, x_n; then, ∀x: P_n(x, f, x_0, x_1, …, x_n) = P_n(x, f, x'_0, x'_1, …, x'_n). Consequently, along with formula (2.3) one can write:

P_n(x, f, x_0, x_1, …, x_n) = f(x'_0) + (x − x'_0)f(x'_0, x'_1) + …
+ (x − x'_0)(x − x'_1)…(x − x'_{n−1}) f(x'_0, x'_1, …, x'_n).

According to Corollary 2.2, one can therefore conclude that

c_n = f(x'_0, x'_1, …, x'_n).   (2.6)

By comparing formulae (2.5) and (2.6), one can see that f(x'_0, x'_1, …, x'_n) = f(x_0, x_1, …, x_n). □
COROLLARY 2.5
The following equality holds:

f(x_0, x_1, …, x_n) = [f(x_n) − P_{n−1}(x_n, f, x_0, x_1, …, x_{n−1})] / [(x_n − x_0)(x_n − x_1)…(x_n − x_{n−1})].   (2.7)

PROOF Let us set x = x_n in equality (2.4); then its left-hand side becomes equal to f(x_n), and formula (2.7) follows. □
THEOREM 2.2
The interpolating polynomial P_n(x, f, x_0, x_1, …, x_n) can be represented in the Newton form, i.e., equality (2.3) does hold.

PROOF We will use induction with respect to n. For n = 0 (and n = 1) formula (2.3) obviously holds. Assume now that it has already been justified for n = 1, 2, …, k, and let us show that it will also hold for n = k + 1. In other words, let us prove the following equality:

P_{k+1}(x, f, x_0, x_1, …, x_k, x_{k+1}) = P_k(x, f, x_0, x_1, …, x_k)
+ f(x_0, x_1, …, x_k, x_{k+1})(x − x_0)(x − x_1)…(x − x_k).   (2.8)

Notice that due to the assumption of the induction, formula (2.3) is valid for n ≤ k. Consequently, the proofs of Corollaries 2.1 through 2.5 that we have carried out on the basis of formula (2.3) will also remain valid for n ≤ k.

To prove equality (2.8), we will first demonstrate that the polynomial P_{k+1}(x, f, x_0, x_1, …, x_k, x_{k+1}) can be represented in the form:

P_{k+1}(x, f, x_0, x_1, …, x_k, x_{k+1}) = P_k(x, f, x_0, x_1, …, x_k)
+ [f(x_{k+1}) − P_k(x_{k+1}, f, x_0, x_1, …, x_k)] · (x − x_0)(x − x_1)…(x − x_k) / [(x_{k+1} − x_0)(x_{k+1} − x_1)…(x_{k+1} − x_k)].   (2.9)

Indeed, it is clear that on the right-hand side of formula (2.9) we have a polynomial of degree no greater than k + 1 that is equal to f(x_j) at all nodes x_j, j = 0, 1, …, k + 1. Therefore, the expression on the right-hand side of (2.9) is actually the interpolating polynomial, which proves that (2.9) is a true equality.

Next, by comparing formulae (2.8) and (2.9) we see that in order to justify (2.8) we need to establish the equality:

f(x_0, x_1, …, x_k, x_{k+1}) = [f(x_{k+1}) − P_k(x_{k+1}, f, x_0, x_1, …, x_k)] / [(x_{k+1} − x_0)(x_{k+1} − x_1)…(x_{k+1} − x_k)].   (2.10)

Using the same argument as in the proof of Corollary 2.4, and also employing Corollary 2.1, we can write:

P_k(x, f, x_0, x_1, …, x_k) = P_k(x, f, x_1, x_2, …, x_k, x_0)
= P_{k−1}(x, f, x_1, x_2, …, x_k) + f(x_1, x_2, …, x_k, x_0)(x − x_1)(x − x_2)…(x − x_k).   (2.11)

Then, by substituting x = x_{k+1} into (2.11), we can transform the right-hand side of equality (2.10) into:

[f(x_{k+1}) − P_k(x_{k+1}, f, x_0, x_1, …, x_k)] / [(x_{k+1} − x_0)(x_{k+1} − x_1)…(x_{k+1} − x_k)]
= (1/(x_{k+1} − x_0)) · [f(x_{k+1}) − P_{k−1}(x_{k+1}, f, x_1, …, x_k)] / [(x_{k+1} − x_1)…(x_{k+1} − x_k)]
− (1/(x_{k+1} − x_0)) · f(x_1, x_2, …, x_k, x_0).   (2.12)

By virtue of Corollary 2.5, the minuend on the right-hand side of equality (2.12) is equal to

f(x_1, x_2, …, x_k, x_{k+1}) / (x_{k+1} − x_0),

whereas in the subtrahend, according to Corollary 2.4, one can change the order of the arguments so that it would coincide with

f(x_0, x_1, …, x_k) / (x_{k+1} − x_0).

Consequently, the right-hand side of equality (2.12) is equal to

[f(x_1, x_2, …, x_{k+1}) − f(x_0, x_1, …, x_k)] / (x_{k+1} − x_0) = f(x_0, x_1, …, x_{k+1}).

In other words, equality (2.12) coincides with equality (2.10) that we need to establish in order to justify formula (2.8). This completes the proof. □
THEOREM 2.3
Let x_0 < x_1 < … < x_n; assume also that the function f(x) is defined on the interval x_0 ≤ x ≤ x_n, and is n times differentiable on this interval. Then,

f(x_0, x_1, …, x_n) = (1/n!) · (d^n f/dx^n)|_{x=ξ} = f^(n)(ξ)/n!,   (2.13)

where ξ is some point from the interval [x_0, x_n].

PROOF Consider an auxiliary function

φ(x) = f(x) − P_n(x, f, x_0, x_1, …, x_n)   (2.14)

defined on [x_0, x_n]; it obviously has a minimum of n + 1 zeros on this interval located at the nodes x_0, x_1, …, x_n. Then, according to the Rolle (mean value) theorem, its first derivative vanishes at least at one point in between every two neighboring zeros of φ(x). Therefore, the function φ′(x) will have a minimum of n zeros on the interval [x_0, x_n]. Similarly, the function φ″(x) vanishes at least at one point in between every two neighboring zeros of φ′(x), and will therefore have a minimum of n − 1 zeros on [x_0, x_n].

By continuing this line of argument, we conclude that the nth derivative φ^(n)(x) will have at least one zero on the interval [x_0, x_n]. Let us denote this zero by ξ, so that φ^(n)(ξ) = 0. Next, we differentiate identity (2.14) exactly n times and subsequently substitute x = ξ, which yields:

0 = φ^(n)(ξ) = f^(n)(ξ) − (d^n/dx^n) P_n(x, f, x_0, x_1, …, x_n)|_{x=ξ}.   (2.15)

On the other hand, according to Corollary 2.2, the divided difference f(x_0, x_1, …, x_n) is equal to the leading coefficient of the interpolating polynomial P_n, i.e., P_n(x, f, x_0, x_1, …, x_n) = f(x_0, x_1, …, x_n) x^n + c_{n−1} x^(n−1) + … + c_0. Consequently, (d^n/dx^n) P_n(x, f, x_0, x_1, …, x_n) = n! f(x_0, x_1, …, x_n), and therefore, equality (2.15) implies (2.13). □
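Relation (2.13) is easy to test numerically. For f(x) = e^x, the scaled divided difference n!·f(x_0, …, x_n) must equal e^ξ for some ξ inside the span of the nodes; a sketch (the helper name is ours):

```python
import math

def divided_difference(xs, ys):
    # Highest-order divided difference via recursion (2.2).
    t = list(ys)
    for order in range(1, len(xs)):
        t = [(t[i + 1] - t[i]) / (xs[i + order] - xs[i])
             for i in range(len(t) - 1)]
    return t[0]

xs = [0.1 * k for k in range(6)]          # 6 nodes in [0, 0.5], so n = 5
dd = divided_difference(xs, [math.exp(x) for x in xs])
val = math.factorial(5) * dd              # should equal e^xi, xi in [0, 0.5]
print(math.exp(0.0) <= val <= math.exp(0.5))  # True
```

The check confirms that 5!·f(x_0, …, x_5) lands between e^0 and e^0.5, exactly as (2.13) predicts for the mean-value point ξ.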
THEOREM 2.4
The values f(x_0), f(x_1), …, f(x_n) of the function f(x) are expressed through the divided differences f(x_0), f(x_0, x_1), …, f(x_0, x_1, …, x_n) by the formulae:

f(x_j) = f(x_0) + (x_j − x_0)f(x_0, x_1) + (x_j − x_0)(x_j − x_1)f(x_0, x_1, x_2) + …
+ (x_j − x_0)(x_j − x_1)…(x_j − x_{n−1}) f(x_0, x_1, …, x_n),  j = 0, 1, …, n,

i.e., by linear combinations of the type:

f(x_j) = a_{j0} f(x_0) + a_{j1} f(x_0, x_1) + … + a_{jn} f(x_0, x_1, …, x_n),  j = 0, 1, …, n.   (2.16)

PROOF The result follows immediately from formula (2.3) and the equalities f(x_j) = P_n(x, f, x_0, x_1, …, x_n)|_{x=x_j} for j = 0, 1, …, n. □

2.1.3 Comparison of the Lagrange and Newton Forms
To evaluate the function f(x) at a point x that is not one of the interpolation nodes, one can approximately set: f(x) ≈ P_n(x, f, x_0, x_1, …, x_n). Assume that the polynomial P_n(x, f, x_0, x_1, …, x_n) has already been built, but in order to try and improve the accuracy we incorporate an additional interpolation node x_{n+1} and the corresponding function value f(x_{n+1}). Then, to construct the interpolating polynomial P_{n+1}(x, f, x_0, x_1, …, x_{n+1}) using the Lagrange formula (2.1) one basically needs to start from scratch. At the same time, to use the Newton formula (2.3), see also Corollary 2.1:

P_{n+1}(x, f, x_0, x_1, …, x_{n+1}) = P_n(x, f, x_0, x_1, …, x_n)
+ (x − x_0)(x − x_1)…(x − x_n) f(x_0, x_1, …, x_{n+1}),

one only needs to obtain the correction

(x − x_0)(x − x_1)…(x − x_n) f(x_0, x_1, …, x_{n+1}).

Moreover, one will immediately be able to see how large this correction is.
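The incremental character of the Newton form can be sketched as follows (helper names ours): here f(x) = x^3 is interpolated first on three nodes, and the single correction term contributed by a fourth node is then evaluated.

```python
def divided_difference(xs, ys):
    # Top-order divided difference f(x_0, ..., x_n), recursion (2.2).
    t = list(ys)
    for order in range(1, len(xs)):
        t = [(t[i + 1] - t[i]) / (xs[i + order] - xs[i])
             for i in range(len(t) - 1)]
    return t[0]

def newton_correction(xs, ys, x_new, y_new, x):
    """Correction (x - x_0)...(x - x_n) * f(x_0, ..., x_{n+1}) from one new node."""
    dd = divided_difference(xs + [x_new], ys + [y_new])
    prod = 1.0
    for xk in xs:
        prod *= (x - xk)
    return prod * dd

f = lambda x: x ** 3
xs, x_new, x_eval = [0.0, 1.0, 2.0], 4.0, 3.0
ys = [f(x) for x in xs]
corr = newton_correction(xs, ys, x_new, f(x_new), x_eval)
# P_2 underresolves the cubic: P_2(3) = 21 while f(3) = 27; the correction is 6:
print(corr)  # 6.0
```

The previously built P_2 is reused untouched; only one divided difference and one product of linear factors are computed for the upgrade to P_3.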
2.1.4 Conditioning of the Interpolating Polynomial
Let all the interpolation nodes x_0, x_1, …, x_n belong to some interval a ≤ x ≤ b. Let also the values f(x_0), f(x_1), …, f(x_n) of the function at these nodes be given. Hereafter, we will be using a shortened notation P_n(x, f) ≡ P_n(x, f, x_0, x_1, …, x_n) for the interpolating polynomial. Let us now perturb the values f(x_j), j = 0, 1, …, n, by some quantities δf_j. Then, the interpolating polynomial will change and become P_n(x, f + δf). One can clearly see from the Lagrange formula (2.1) that P_n(x, f + δf) = P_n(x, f) + P_n(x, δf). Therefore, the corresponding perturbation of the interpolating polynomial, i.e., its response to δf, will be P_n(x, δf). For a given fixed set of nodes x_0, x_1, …, x_n, this perturbation depends only on δf and not on f itself. As such, one can introduce the minimum number L_n such that the following inequality would hold for any δf:

max_{a≤x≤b} |P_n(x, δf)| ≤ L_n · max_{0≤j≤n} |δf_j|.   (2.17)

The numbers L_n = L_n(x_0, x_1, …, x_n, a, b) are called the Lebesgue constants.² They provide a natural measure for the sensitivity of the interpolating polynomial to the perturbations δf_j of the interpolated function f(x) at the nodes x_j. The Lebesgue constants are known to grow as n increases. Their specific behavior strongly depends on how the interpolation nodes x_j, j = 0, 1, …, n, are located on the interval [a, b]. If, for example, n = 1, x_0 = a, x_1 = b, then L_1 = 1. If, however, x_0 ≠ a and/or x_1 ≠ b, then L_1 > 1, and if x_0 and x_1 are sufficiently close to one another, the interpolation may appear arbitrarily sensitive to the perturbations of f(x). The reader can easily verify the foregoing statements regarding L_1.

In the case of equally spaced interpolation nodes:

x_j = a + j·h,  j = 0, 1, …, n,  h = (b − a)/n,

one can show that

L_n > 2^(n−2) · n^(−1/2).   (2.18)

In other words, the sensitivity of the interpolant to any errors committed when specifying the values of f(x_j) will grow rapidly (exponentially) as n increases. Note that in practice it is impossible to specify the values of f(x_j) without any error, no matter how these values are actually obtained, i.e., whether they are measured (with inevitable experimental inaccuracies) or computed (subject to rounding errors).

For a rigorous proof of inequalities (2.18) we refer the reader to the literature on the theory of approximation, in particular, the monographs and texts cited in Section 3.2.7 of Chapter 3. However, an elementary treatment can also be given, and one can easily provide a qualitative argument of why the Lebesgue constants for equidistant nodes grow exponentially as the grid dimension n increases.

²Note that the Lebesgue constant L_n corresponds to interpolation on n + 1 nodes: x_0, …, x_n.
From the Lagrange form of the interpolating polynomial (2.1) and definition (2.17) it is clear that:

L_n ≥ max_{a≤x≤b} |l_k(x)|,  k = 0, 1, …, n   (2.19)

(later, see Section 3.2.7 of Chapter 3, we will prove an even more precise statement). Take k ≤ n/2 and x very close to one of the edges a or b, say, x − a = η ≪ h. Then, a direct estimate of the products that enter l_k(x) shows that |l_k(x)| grows exponentially with n. The foregoing estimate for |l_k(x)|, along with the previous formula (2.19), does imply the exponential growth of the Lebesgue constants on uniform (equally spaced) interpolation grids.

Let now a = −1, b = 1, and let the interpolation nodes on [a, b] be rather given by the formula:

x_j = −cos[(2j + 1)π / (2(n + 1))],  j = 0, 1, …, n.   (2.20)

It is possible to show that placing the nodes according to (2.20) guarantees a much better estimate for the Lebesgue constants (again, see Section 3.2.7):

L_n ≈ (2/π) ln(n + 1) + 1.   (2.21)
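The contrast between the two node placements can be probed numerically by maximizing the Lebesgue function Σ_k |l_k(x)| (which can be shown to coincide with L_n) over a fine sampling grid — a sketch; sampling only approximates the true maximum:

```python
import math

def lebesgue_constant(nodes, a=-1.0, b=1.0, samples=2001):
    # max over [a, b] of sum_k |l_k(x)|, approximated on a sampling grid.
    best = 0.0
    for i in range(samples):
        x = a + (b - a) * i / (samples - 1)
        s = sum(abs(math.prod((x - xj) / (xk - xj)
                              for xj in nodes if xj != xk)) for xk in nodes)
        best = max(best, s)
    return best

for n in (4, 8, 12, 16):
    equi = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    cheb = [-math.cos((2 * j + 1) * math.pi / (2 * (n + 1))) for j in range(n + 1)]
    print(n, lebesgue_constant(equi), lebesgue_constant(cheb))
```

The equidistant values grow into the hundreds already at n = 16, while the values for the nodes (2.20) stay below 3, in agreement with (2.18) and (2.21).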
We therefore conclude that, in contradistinction to the previous case (2.18), the Lebesgue constants may, in fact, grow slowly rather than rapidly, as they do on the non-equally spaced nodes (2.20). As such, even the high-degree interpolating polynomials in this case will not be overly sensitive to perturbations of the input data. Interpolation nodes (2.20) are known as the Chebyshev nodes. They will be discussed in detail in Chapter 3.

2.1.5 On Poor Convergence of Interpolation with Equidistant Nodes
One should not think that for any continuous function f(x), x ∈ [a, b], the algebraic interpolating polynomials P_n(x, f) built on the equidistant nodes x_j = a + j·h, x_0 = a, x_n = b, will converge to f(x) as n increases, i.e., that the deviation of P_n(x, f) from f(x) will decrease. For example, as has been shown by Bernstein, the sequence of interpolating polynomials obtained for the function f(x) = |x| on equally spaced nodes diverges at every point of the interval [a, b] = [−1, 1] except at {−1, 0, 1}.

The next example is attributed to Runge. Consider the function f(x) = 1/(x^2 + 1/4) on the same interval [a, b] = [−1, 1]; not only is this function continuous, but it also has continuous derivatives of all orders. It is, however, possible to show that for the sequence of interpolating polynomials with equally spaced nodes the maximum difference max_{−1≤x≤1} |f(x) − P_n(x, f)| will not approach zero as n increases. Moreover, by working on Exercise 4 below, one will be able to see that the areas of no convergence for this function are located next to the endpoints of the interval [−1, 1]. For larger intervals the situation may even deteriorate, and the sequence of interpolating polynomials P_n(x, f) may diverge. In other words, the quantity max_{a≤x≤b} |f(x) − P_n(x, f)| may become arbitrarily large for large n's (see, e.g., [IK66]).

Altogether, these convergence difficulties can be accounted for by the fact that on the complex plane the function f(z) = 1/(z^2 + 1/4) is not an entire function of its argument z, and has singularities at z = ±i/2. On the other hand, if, instead of the equidistant nodes, we use the Chebyshev nodes (2.20) to interpolate either the Bernstein function f(x) = |x| or the Runge function f(x) = 1/(x^2 + 1/4), then in both cases the sequence of interpolating polynomials P_n(x, f) converges to f(x) uniformly as n increases (see Exercise 5).
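The contrast described above is easy to reproduce — a sketch in the spirit of Exercises 4 and 5 below (the maximum error is only sampled, not computed exactly):

```python
import math

def interp_max_error(f, nodes, samples=1001):
    # Max |f(x) - P_n(x, f)| over a sampling of [-1, 1], Lagrange form (2.1).
    err = 0.0
    for i in range(samples):
        x = -1.0 + 2.0 * i / (samples - 1)
        p = sum(f(xk) * math.prod((x - xj) / (xk - xj)
                                  for xj in nodes if xj != xk) for xk in nodes)
        err = max(err, abs(f(x) - p))
    return err

runge = lambda x: 1.0 / (x * x + 0.25)
for n in (5, 10, 20):
    equi = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    cheb = [-math.cos((2 * j + 1) * math.pi / (2 * (n + 1))) for j in range(n + 1)]
    print(n, interp_max_error(runge, equi), interp_max_error(runge, cheb))
```

On the equidistant nodes the sampled error stops decreasing (the trouble concentrating near x = ±1), whereas on the Chebyshev nodes it shrinks rapidly with n.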
Exercises
1. Evaluate f(1.14) by means of linear, quadratic, and cubic interpolation using the following table of values:

x      …   1.31   …
f(x)   …   1.284  …

Implement the interpolating polynomials in both the Lagrange and Newton form.

2. Let x_j = j·h, j = 0, ±1, ±2, …, be equidistant nodes with spacing h. Verify that the following equality holds:

f(x_{k−1}, x_k, x_{k+1}) = [f(x_{k+1}) − 2f(x_k) + f(x_{k−1})] / (2! h^2).

3. Let a = x_0, a < x_1 < b, x_2 = b. Find the value of the Lebesgue constant L_2 when x_1 is the midpoint of [a, b]: x_1 = (a + b)/2. Show that if, conversely, x_1 → a or x_1 → b, then the Lebesgue constant L_2 = L_2(x_0, x_1, x_2, a, b) grows with no bound.

4. Plot the graphs of f(x) = 1/(x^2 + 1/4) and P_n(x, f) from Section 2.1.5 (the Runge example) on the computer and thus corroborate experimentally that there is no convergence of the interpolating polynomial on equally spaced nodes when n increases.

5. Use the Chebyshev nodes (2.20) to interpolate f(x) = |x| and f(x) = 1/(x^2 + 1/4) on the interval [−1, 1], plot the graphs of each f(x) and the corresponding P_n(x, f) for n = 10, 20, 40, and 80, evaluate numerically the error max |f(x) − P_n(x, f)|, and show that it decreases as n increases.
1 s cannot increase the order of accuracy beyond the threshold set by the unavoidable error 6(hs+ I ) , and therefore cannot speed u p the convergence as h t 0. As such, the degree s of piecewise polynomial interpolation is optimal for the functions that satisfy
f(x)
(2.36).
REMARK 2.1 The considerations of the current section pertain primarily to the asymptotic behavior of the error as h \to 0. For a given fixed h > 0, interpolation of some degree r < s may, in fact, appear more accurate than the interpolation of degree s. Besides, in practice the tabulated values f(x_k), k = 0, 1, \ldots, n, may only be specified approximately, rather than exactly, with a finite fixed number of decimal (or binary) digits. In this case, the loss of interpolation accuracy due to rounding is going to increase as s increases, because of the growth of the Lebesgue constants (defined by formula (2.18) of Section 2.1.4). Therefore, piecewise polynomial interpolation of high degree (higher than the third) is not used routinely. □
REMARK 2.2 Error estimate (2.32) does, in fact, imply uniform convergence of the interpolant f_s(x, f_k) (a piecewise polynomial) to the interpolated function f(x) with the rate \mathcal{O}(h^{s+1}) as the grid is refined, i.e., as h \to 0. Estimate (2.33), in particular, indicates that piecewise linear interpolation converges uniformly with the rate \mathcal{O}(h^2). Likewise, estimate (2.35) in the case of a uniform grid with size h will imply uniform convergence of the qth derivative of the interpolant f_s^{(q)}(x, f_k) to the qth derivative of the interpolated function f^{(q)}(x) with the rate \mathcal{O}(h^{s-q+1}) as h \to 0. □
REMARK 2.3 The notion of unavoidable error as presented in this section (see also Section 1.3) illustrates the concept of Kolmogorov diameters for compact sets of functions (see Section 12.2.5 for more detail). Let W be a linear normed space, and let U \subset W. Introduce also an N-dimensional linear manifold W^{(N)} \subset W, for example, W^{(N)} = \mathrm{span}\{w_1^{(N)}, w_2^{(N)}, \ldots, w_N^{(N)}\}, where the functions w_n^{(N)} \in W, n = 1, 2, \ldots, N, are given. The N-dimensional Kolmogorov diameter of the set U with respect to the space W is defined as:

K_N(U, W) = \inf_{W^{(N)} \subset W} \; \sup_{u \in U} \; \inf_{w \in W^{(N)}} \| w - u \|_W.   (2.39)
This quantity tells us how accurately we can approximate an arbitrary u from a given set U \subset W by selecting the optimal approximating subspace W^{(N)} whose dimension N is fixed. The Kolmogorov diameter and related concepts play a fundamental role in the modern theory of approximation; in particular, for the analysis of the so-called best approximations, for the analysis of saturation of numerical methods by smoothness (Section 2.2.5), as well as in the theory of ε-entropy and the related theory of transmission and processing of information. □
2.2.5 Saturation of Piecewise Polynomial Interpolation
As we have seen previously (Sections 1.3.1 and 2.2.4), reconstruction of continuous functions from discrete data is normally accompanied by the unavoidable error. This error is caused by the loss of information which inevitably occurs in the course of discretization. The unavoidable error is typically determined by the regularity of the continuous function and by the discretization parameters. Hence, it is not related to any particular reconstruction technique and rather presents a common intrinsic accuracy limit for all such techniques, in particular, for algebraic interpolation. Therefore, an important question regarding every specific reconstruction method is whether or not it can reach the aforementioned intrinsic accuracy limit. If the accuracy of a given method is limited by its own design and does not, generally speaking, reach the level of the unavoidable error determined by the smoothness of the approximated function, then the method is said to be saturated by smoothness. Otherwise, if the accuracy of the method automatically adjusts to the smoothness of the approximated function, then the method does not get saturated.

Let f(x) be defined on the interval [a, b], and let its table of values f(x_k) be known for the equally spaced nodes x_k = a + kh, k = 0, 1, \ldots, n, h = (b-a)/n. In Section 2.2.2, we saw that the error of piecewise polynomial interpolation of degree s is of order \mathcal{O}(h^{s+1}), provided that the polynomial P_s(x, f_k) is used to approximate f(x) on the interval x_k \le x \le x_{k+1}, and that the derivative f^{(s+1)}(x) exists and is bounded. Assume now that the only thing we know about f(x) besides the table of values is that it has a bounded derivative of the maximum order q + 1. If q < s, then the unavoidable error of reconstructing f(x) from its tabulated values is \mathcal{O}(h^{q+1}). This is not as good as the \mathcal{O}(h^{s+1}) error that the method can potentially achieve, and the reason for the deterioration is obviously the lack of smoothness.
On the other hand, if q > s, then the accuracy of interpolation still remains of order \mathcal{O}(h^{s+1}) and does not reach the intrinsic limit \mathcal{O}(h^{q+1}). In other words, the order of the interpolation error does not react in any way to the additional smoothness of the function f(x) beyond the required s + 1 derivatives. This is a manifestation of the susceptibility of algebraic piecewise polynomial interpolation to saturation by smoothness. In Chapter 3, we discuss an alternative interpolation strategy based on the use of trigonometric polynomials. That type of interpolation appears not to be susceptible to saturation by smoothness.
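The saturation effect is easy to observe in practice: for f(x) = cos x, which has bounded derivatives of all orders, halving h still reduces the error of piecewise linear interpolation only by a factor of about four, i.e., the method stays at \mathcal{O}(h^2). A rough numerical sketch (the test function, interval, and sampling density are our own illustrative choices):

```python
import math

def piecewise_linear_error(f, a, b, n):
    """Max error of piecewise linear interpolation of f on n equal subintervals,
    estimated on a fine auxiliary sampling grid."""
    h = (b - a) / n
    err = 0.0
    for j in range(10 * n):
        x = a + (b - a) * (j + 0.5) / (10 * n)
        k = min(int((x - a) / h), n - 1)       # subinterval containing x
        xk = a + k * h
        t = (x - xk) / h
        p = (1.0 - t) * f(xk) + t * f(xk + h)  # linear interpolant on [x_k, x_{k+1}]
        err = max(err, abs(p - f(x)))
    return err

# cos has bounded derivatives of all orders, yet halving h only divides the
# error by about 4: the piecewise linear method saturates at O(h^2).
e1 = piecewise_linear_error(math.cos, 0.0, 1.0, 50)
e2 = piecewise_linear_error(math.cos, 0.0, 1.0, 100)
ratio = e1 / e2
```

The observed ratio close to 4 for an infinitely smooth function is exactly the saturation phenomenon described above.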
Exercises
1. What size of the grid h guarantees that the error of piecewise linear interpolation for the function f(x) = sin x will never exceed 10^{-6}?

2. What size of the grid h guarantees that the error of piecewise quadratic interpolation for the function f(x) = sin x will never exceed 10^{-6}?

3. The values of f(x) can be measured at any given point x with the accuracy |\delta f| \le 10^{-4}. What is the optimal grid size for tabulating f(x), if the function is to be subsequently reconstructed by means of a piecewise linear interpolation?
Algebraic Interpolation
Hint. Choosing h excessively small can make the interpolation error smaller than the perturbation of the interpolating polynomial due to the perturbations in the data; see Section 2.1.4.
4.* The same question as in problem 3, but for piecewise quadratic interpolation.

5. Consider two approximation formulae for the first derivative f'(x):

f'(x) \approx \frac{f(x+h) - f(x)}{h},   (2.40)

f'(x) \approx \frac{f(x+h) - f(x-h)}{2h},   (2.41)

and let |f''(x)| \le 1 and |f'''(x)| \le 1.
a) Find h such that the error of either formula will not exceed 10^{-3}.
b)* Assume that the function f itself is only known with the error \delta. What is the best accuracy that one can achieve using formulae (2.40) and (2.41), and how should one properly choose h?
c)* Show that the asymptotic order of the error with respect to \delta, obtained by formula (2.41) with the optimal h, cannot be improved.
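For part b), one balances the truncation error of formula (2.41), which is at most h^2/6 when |f'''| \le 1, against the data-error contribution \delta/h; the total bound h^2/6 + \delta/h is minimized near h_* = (3\delta)^{1/3}. The sketch below (the worst-case noise model and the test function f = sin are our own choices) shows that shrinking h far below h_* makes the computed derivative worse, not better:

```python
import math

def central_diff_noisy(f, x, h, delta):
    """Formula (2.41) applied to noisy data: each sample of f is off by at most
    delta; we take the worst case by perturbing the two samples oppositely."""
    return ((f(x + h) + delta) - (f(x - h) - delta)) / (2.0 * h)

delta = 1e-6
x = 0.5
# Error bound h^2/6 * max|f'''| + delta/h, with max|f'''| <= 1, is minimized at:
h_opt = (3.0 * delta) ** (1.0 / 3.0)
err_opt = abs(central_diff_noisy(math.sin, x, h_opt, delta) - math.cos(x))
# Taking h one hundred times smaller lets the noise term delta/h dominate:
err_small = abs(central_diff_noisy(math.sin, x, h_opt / 100.0, delta) - math.cos(x))
```

With \delta = 10^{-6} the optimal step is about 1.4 \cdot 10^{-2} and the achievable accuracy is of order \delta^{2/3}, which is the point of part c).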
2.3 Smooth Piecewise Polynomial Interpolation (Splines)

A classical piecewise polynomial interpolation of any degree s, e.g., piecewise linear, piecewise quadratic, etc., see Section 2.2, yields an interpolant that, generally speaking, is not differentiable even once at the interpolation nodes. There are, however, two alternative types of piecewise polynomial interpolants (local and nonlocal splines) that do have a given number of continuous derivatives everywhere, including the interpolation nodes.

2.3.1 Local Interpolation of Smoothness s and Its Properties
Assume that the interpolation nodes x_k and the function values f(x_k) are given. Let us then specify a positive integer number s and also fix another integer j: s > 0 and 0 \le j \le s - 1. However, a formal substitution of s = 0, j = 0 into (2.45)-(2.48) does yield a piecewise linear interpolant.
Then, taking into account that

\frac{d^m}{dx^m} = \frac{1}{(x_{k+1} - x_k)^m} \frac{d^m}{dX^m}, \qquad X = \frac{x - x_k}{x_{k+1} - x_k},

and using formulae (2.46) and (2.47), we obtain:

\left| \frac{d^m R_{2s+1}(x, k)}{dx^m} \right| = (x_{k+1} - x_k)^{s+1-m} \, \bigl| f(x_{k-j}, x_{k-j+1}, \ldots, x_{k-j+s+1}) \bigr| \cdot \left| \left[ \prod_{i=0}^{s+1} \left( X - \frac{x_{k-j+i} - x_k}{x_{k+1} - x_k} \right) \right]^{(m)} \right|.   (2.54)

Finally, according to Theorem 2.3 we can write:

f(x_{k-j}, x_{k-j+1}, \ldots, x_{k-j+s+1}) = \frac{f^{(s+1)}(\xi)}{(s+1)!}, \qquad \xi \in [x_{k-j}, x_{k-j+s+1}].   (2.55)

Then, substitution of (2.55) into (2.54) yields the estimate:

\left| \frac{d^m R_{2s+1}(x, k)}{dx^m} \right| \le \mathrm{const}_2 \cdot (x_{k+1} - x_k)^{s+1-m} \, \max_{\xi} \bigl| f^{(s+1)}(\xi) \bigr|.
… as n \to \infty, with the rate determined by the smoothness of f(x), i.e., there is no saturation.

REMARK 3.1 If the interpolated function f(x) has derivatives of all orders, then the rate of convergence of the trigonometric interpolating polynomials to f(x) will be faster than any inverse power of n. In the literature, this type of convergence is often referred to as spectral. □
3.2 Interpolation of Functions on an Interval. Relation between Algebraic and Trigonometric Interpolation

Let f = f(x) be defined on the interval -1 \le x \le 1, and let it have there a bounded derivative of order r + 1. We have chosen this specific interval -1 \le x \le 1 as the domain of f(x), rather than an arbitrary interval a \le x \le b, for the only reason of simplicity and convenience. Indeed, the transformation x = \frac{a+b}{2} + t \frac{b-a}{2} renders a transition from the function f(x) defined on an arbitrary interval a \le x \le b to the function F(t) \equiv f\!\left( \frac{a+b}{2} + t \frac{b-a}{2} \right) defined on the interval -1 \le t \le 1.

3.2.1 Periodization
According to Theorem 3.5 of Section 3.1, trigonometric interpolation is only suitable for the reconstruction of smooth periodic functions from their tables of values.
Therefore, to be able to apply it to the function f(x) given on -1 \le x \le 1, one should first equivalently replace f(x) by some smooth periodic function. However, a straightforward extension of the function f(x) from its domain -1 \le x \le 1 to the entire real axis may, generally speaking, yield a discontinuous periodic function with the period L = 2, see Figure 3.1.
FIGURE 3.1: Straightforward periodization.
Therefore, instead of the function f(x), -1 \le x \le 1, let us consider a new function

F(\varphi) = f(\cos\varphi), \qquad x = \cos\varphi.   (3.55)

FIGURE 3.2: Periodization according to formula (3.55).

It will be convenient to think that the function F(\varphi) of (3.55) is defined on the unit circle as a function of the polar angle \varphi. The value of F(\varphi) is obtained by merely translating the value of f(x) from the point x \in [-1, 1] to the corresponding point \varphi \in [0, \pi] on the unit circle, see Figure 3.2. In so doing, one can interpret the resulting function F(\varphi) as an even, F(-\varphi) = F(\varphi), 2\pi-periodic function of its argument \varphi. Moreover, it is easy to see from definition (3.55) that the derivative \frac{d^{r+1} F(\varphi)}{d\varphi^{r+1}} exists and is bounded.
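The periodization (3.55) is straightforward to realize on a computer. The following sketch (the particular function f is our own illustrative choice) checks that F(\varphi) = f(\cos\varphi) is indeed even and 2\pi-periodic:

```python
import math

def periodize(f):
    """Map f defined on [-1, 1] to F(phi) = f(cos(phi)),
    an even 2*pi-periodic function of the polar angle phi."""
    return lambda phi: f(math.cos(phi))

f = lambda x: abs(x) ** 3   # a function defined on [-1, 1]
F = periodize(f)

phi = 1.234
even_gap = abs(F(phi) - F(-phi))                    # evenness: F(-phi) = F(phi)
periodic_gap = abs(F(phi) - F(phi + 2.0 * math.pi)) # periodicity with period 2*pi
```

Both gaps vanish up to roundoff, which is exactly the structure that makes trigonometric (cosine) interpolation applicable to a non-periodic f.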
3.2.2 Trigonometric Interpolation
Let us choose the following interpolation nodes:

\varphi_m = \frac{\pi (2m+1)}{2(n+1)}, \quad m = 0, \pm 1, \ldots, \pm n, -(n+1), \qquad N = 2(n+1).   (3.56)
According to (3.55), the values F_m = F(\varphi_m) of the function F(\varphi) at the nodes \varphi_m of (3.56) coincide with the values f_m = f(x_m) of the original function f(x) at the points x_m = \cos\varphi_m. To interpolate a 2\pi-periodic even function F(\varphi) using its tabulated values at the nodes (3.56), one can employ formula (3.25) of Section 3.1:

Q_n(\cos\varphi, \sin\varphi, F) = \sum_{k=0}^{n} a_k \cos k\varphi.   (3.57)

As F_m = f_m for all m, the coefficients a_k of the trigonometric interpolating polynomial (3.57) are given by formulae (3.26), (3.27) of Section 3.1:

a_0 = \frac{1}{n+1} \sum_{m=0}^{n} f_m, \qquad a_k = \frac{2}{n+1} \sum_{m=0}^{n} f_m \cos k\varphi_m, \quad k = 1, 2, \ldots, n.   (3.58)

3.2.3 Chebyshev Polynomials. Relation between Algebraic and Trigonometric Interpolation
Let us use the equality \cos\varphi = x and introduce the functions:

T_k(x) = \cos k\varphi = \cos(k \arccos x), \quad k = 0, 1, 2, \ldots.   (3.59)
THEOREM 3.7
The functions T_k(x) defined by formula (3.59) are polynomials of degree k = 0, 1, 2, \ldots. Specifically, T_0(x) = 1, T_1(x) = x, and all other polynomials T_2(x), T_3(x), etc., can be obtained consecutively using the recursion formula

T_{k+1}(x) = 2x \, T_k(x) - T_{k-1}(x).   (3.60)

PROOF It is clear that T_0(x) = \cos 0 = 1 and T_1(x) = \cos(\arccos x) = x. Then, we employ a well-known trigonometric identity:

\cos(k+1)\varphi = 2 \cos\varphi \cos k\varphi - \cos(k-1)\varphi, \quad k = 1, 2, \ldots,

which immediately yields formula (3.60) when \varphi = \arccos x. It only remains to prove that T_k(x) is a polynomial of degree k; we will use induction with respect to k to do that. For k = 0 and k = 1 it has been proven directly. Let us fix some k > 1 and assume that for all j = 0, 1, \ldots, k we have already shown
that T_j(x) are polynomials of degree j. Then, the expression on the right-hand side of (3.60), and as such, T_{k+1}(x), is a polynomial of degree k + 1. □

The polynomials T_k(x) were first introduced and studied by Chebyshev. We provide here the formulae for the first few Chebyshev polynomials, along with their graphs, see Figure 3.3:
T_0(x) = 1, \quad T_1(x) = x, \quad T_2(x) = 2x^2 - 1, \quad T_3(x) = 4x^3 - 3x, \quad T_4(x) = 8x^4 - 8x^2 + 1.

FIGURE 3.3: Chebyshev polynomials.

Next, by substituting \varphi = \arccos x into the right-hand side of formula (3.57), we can recast it as a function of x, thus obtaining:
Q_n(\cos\varphi, \sin\varphi, F) \equiv P_n(x, f),   (3.61)

where

P_n(x, f) = \sum_{k=0}^{n} a_k T_k(x),   (3.62)

and

a_0 = \frac{1}{n+1} \sum_{m=0}^{n} f(x_m), \qquad a_k = \frac{2}{n+1} \sum_{m=0}^{n} f(x_m) T_k(x_m), \quad k = 1, 2, \ldots, n.   (3.63)
Trigonometric Interpolation Therefore, we can conclude using for mulae (3.6 1 ), (3.62) and Theorem 3.7, that P,,(x,j) is an algebraic polynomial of degree no greater than n that coincides with the given function values f(xm) = j,n at the interpolation nodes x", = cos CPI/1 ' In accordance with (3.56), these interpola tion nodes can be defined by the formula:
n(2m + 1) , 2(n + l ) (3.64) m = O, I , . . . , n,
ko . However, due to the equalities [qt[ = [q� [ = 1 , k = 1 , 2, . . . , this error will not get amplified as k increases. This implies numerical stability of the computations according to formula (3.60) for I xl < 1 . 3.2.6
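The recursion (3.60) thus gives a simple and, for |x| < 1, numerically stable way to evaluate T_k(x). A minimal sketch (the sample point and the range of k are our own choices) cross-checks the recursion against the defining formula (3.59):

```python
import math

def chebyshev_T(k, x):
    """Evaluate T_k(x) via the recursion (3.60): T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x      # T_0 and T_1
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

# Cross-check against the defining formula (3.59): T_k(x) = cos(k * arccos(x)).
x = 0.37
gap = max(abs(chebyshev_T(k, x) - math.cos(k * math.acos(x)))
          for k in range(20))
```

The two evaluations agree to near machine precision, illustrating the stability claim for |x| < 1.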
3.2.6 Algebraic Interpolation with Extrema of the Chebyshev Polynomial T_n(x) as Nodes
To interpolate the function F(\varphi) = f(\cos\varphi), let us now use the nodes:

\bar\varphi_m = \frac{\pi m}{n}, \quad m = 0, 1, \ldots, n.
In accordance with Theorem 3.6 of Section 3.1, and the discussion on page 72 that follows this theorem, we obtain the trigonometric interpolating polynomial:

\bar Q_n(\cos\varphi, \sin\varphi, F) = \sum_{k=0}^{n} \bar a_k \cos k\varphi,

\bar a_0 = \frac{1}{2n}(f_0 + f_n) + \frac{1}{n} \sum_{m=1}^{n-1} f_m, \qquad \bar a_n = \frac{1}{2n}\bigl(f_0 + (-1)^n f_n\bigr) + \frac{1}{n} \sum_{m=1}^{n-1} (-1)^m f_m,

\bar a_k = \frac{1}{n}\bigl(f_0 + (-1)^k f_n\bigr) + \frac{2}{n} \sum_{m=1}^{n-1} f_m \cos k\bar\varphi_m, \quad k = 1, 2, \ldots, n-1.

Changing the variable to x = \cos\varphi and denoting \bar Q_n(\cos\varphi, \sin\varphi, F) = \bar P_n(x, f), we have:

\bar P_n(x, f) = \sum_{k=0}^{n} \bar a_k T_k(x),

\bar a_0 = \frac{1}{2n}(f_0 + f_n) + \frac{1}{n} \sum_{m=1}^{n-1} f_m, \qquad \bar a_n = \frac{1}{2n}\bigl(f_0 + (-1)^n f_n\bigr) + \frac{1}{n} \sum_{m=1}^{n-1} (-1)^m f_m,

\bar a_k = \frac{1}{n}\bigl(f_0 + (-1)^k f_n\bigr) + \frac{2}{n} \sum_{m=1}^{n-1} f_m T_k(\bar x_m), \quad k = 1, 2, \ldots, n-1.
Similarly to the polynomial P_n(x, f) of (3.62), the algebraic interpolating polynomial \bar P_n(x, f) built on the grid:

\bar x_m = \cos\bar\varphi_m = \cos\frac{\pi m}{n}, \quad m = 0, 1, \ldots, n,   (3.71)

also inherits the two foremost advantageous properties from the trigonometric interpolating polynomial \bar Q_n(\cos\varphi, \sin\varphi, F). They are the slow growth of the Lebesgue constants as n increases (which translates into numerical stability with respect to the perturbations of f_m), as well as convergence with a rate that automatically takes into account the smoothness of f(x), i.e., no susceptibility to saturation. Finally, we notice that the Chebyshev polynomial T_n(x) reaches its extreme values on the interval -1 \le x \le 1 precisely at the interpolation nodes \bar x_m of (3.71): T_n(\bar x_m) = \cos \pi m = (-1)^m, m = 0, 1, \ldots, n. In the literature, the grid nodes \bar x_m of (3.71) are known as the Chebyshev-Gauss-Lobatto nodes or simply the Gauss-Lobatto nodes.
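The Gauss-Lobatto construction lends itself to a direct computation. The sketch below implements the even-cosine (DCT-I type) coefficient formulas in the form restated above (treat the restatement as ours) with e^x as our own choice of test function, and verifies that \bar P_n(x, f) reproduces f at the nodes and remains accurate in between:

```python
import math

def lobatto_cheb_interp(f, n):
    """Chebyshev interpolant on the Gauss-Lobatto nodes x_m = cos(pi*m/n);
    returns (nodes, callable polynomial). Requires n >= 2."""
    xs = [math.cos(math.pi * m / n) for m in range(n + 1)]
    fs = [f(x) for x in xs]
    a = [0.0] * (n + 1)
    a[0] = (fs[0] + fs[n]) / (2.0 * n) + sum(fs[1:n]) / n
    a[n] = (fs[0] + (-1) ** n * fs[n]) / (2.0 * n) \
        + sum((-1) ** m * fs[m] for m in range(1, n)) / n
    for k in range(1, n):
        a[k] = (fs[0] + (-1) ** k * fs[n]) / n \
            + 2.0 / n * sum(fs[m] * math.cos(k * math.pi * m / n) for m in range(1, n))

    def p(x):
        # Evaluate sum_k a_k T_k(x) using the recursion T_{k+1} = 2x T_k - T_{k-1}.
        t_prev, t_curr = 1.0, x
        s = a[0] + a[1] * x
        for k in range(2, n + 1):
            t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
            s += a[k] * t_curr
        return s

    return xs, p

xs, p = lobatto_cheb_interp(math.exp, 8)
node_gap = max(abs(p(x) - math.exp(x)) for x in xs)  # ~0: interpolation property
mid_err = abs(p(0.123) - math.exp(0.123))            # small between nodes as well
```

For the infinitely smooth e^x, already n = 8 delivers errors far below single precision, consistent with the absence of saturation.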
3.2.7 More on the Lebesgue Constants and Convergence of Interpolants
In this section, we discuss the problem of interpolation from the general perspective of approximation of functions by polynomials. Our considerations, in a substantially abridged form, follow those of [LG95], see also [Bab86]. We quote many of the fundamental results without a proof (the theorems of Jackson, Weierstrass, Faber-Bernstein, and Bernstein). The justification of these results, along with a broader and more comprehensive account of the subject, can
be found in the literature on the classical theory of approximation, see, e.g., [Jac94, Ber52, Ber54, Ach92, Nat64, Nat65a, Nat65b, Lor86, Che66, Riv74]. In the numerical analysis literature, some of these issues are addressed in [Wen66]. In these books, the reader will also find references to research articles. The material of this section is more advanced, and can be skipped during the first reading.

The meaning of the Lebesgue constants L_n introduced in Chapter 2 as minimum numbers that for each n guarantee the estimate [see formula (2.17)]:

\max_{a \le x \le b} |P_n(x, f)| \le L_n \max_{0 \le k \le n} |f_k|,

is basically that of an operator norm. Indeed, interpolation by means of the polynomial P_n(x, f) can be interpreted as a linear operator that maps the finite-dimensional space of vectors f = [f_0, f_1, \ldots, f_n] into the space C[a, b] of all continuous functions f(x) defined on [a, b]. The space C[a, b] is equipped with the maximum norm \|f\| = \max_{a \le x \le b} |f(x)|. Likewise, the space of vectors f = [f_0, f_1, \ldots, f_n] can also be equipped with the maximum norm \|f\| = \max_k |f_k|. …

… Take \alpha > 1 and stretch the vector f: \alpha f = [\alpha f_0, \alpha f_1, \ldots, \alpha f_n], so that \|\alpha f\| = 1. As interpolation by means of the polynomials P_n is a linear operator, we obviously have P_n(x, \alpha f) = \alpha P_n(x, f), and consequently, \|P_n(x, \alpha f)\| > \|P_n(x, f)\|. We have therefore found a unit vector \alpha f for which the norm of the corresponding interpolating polynomial is greater than the left-hand side of (3.76). The contradiction proves that the two definitions of the Lebesgue constants are indeed equivalent. □
LEMMA 3.3
The Lebesgue constant of (3.74), (3.75) is equal to

L_n = \Lambda_n,   (3.77)

where the quantity \Lambda_n is defined by formula (3.73).

PROOF When proving Lemma 3.1, we have seen that \|\mathscr{P}_n\| \le \Lambda_n. We therefore need to show that \Lambda_n \le \|\mathscr{P}_n\|. As has been mentioned, the function \psi(x) = \sum_{k=0}^{n} |l_k(x)| is continuous on [a, b]. Consequently, \exists x^* \in [a, b]: \psi(x^*) = \Lambda_n. Let us now consider a function f_0 \in C[a, b] such that f_0(x_k) = \operatorname{sign} l_k(x^*), k = 0, 1, 2, \ldots, n, and also \|f_0\| = 1. For this function we have:

\|\mathscr{P}_n[f_0]\| \ge |\mathscr{P}_n[f_0](x^*)| = \sum_{k=0}^{n} |l_k(x^*)| = \Lambda_n.

On the other hand, \|f_0\| = 1, which implies \Lambda_n \le \|\mathscr{P}_n\|. It only remains to construct a specific example of f_0 \in C[a, b]. This can be done easily by taking f_0(x) as a piecewise linear function with the values \operatorname{sign} l_k(x^*) at the points x_k, k = 0, 1, 2, \ldots, n. □

We can therefore conclude that
L_n = \max_{a \le x \le b} \sum_{k=0}^{n} |l_k(x)|.
We have used a somewhat weaker form of this result in Section 2.1.4 of Chapter 2. The Lebesgue constants of Definition 3.2 play a fundamental role when studying the convergence of interpolating polynomials. To actually see that, we will first need to introduce another key concept and formulate some important results.
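The expression L_n = \max_x \sum_k |l_k(x)| can be evaluated by brute force. The sketch below (the grid dimension and sampling resolution are our own choices) contrasts equidistant nodes with the Chebyshev nodes (3.64): for n = 12 the equidistant Lebesgue constant is already an order of magnitude larger, while the Chebyshev one stays within the logarithmic bound 8 + (4/\pi)\ln(n+1) quoted later in this section:

```python
import math

def lebesgue_constant(nodes, samples=2000):
    """Approximate L_n = max over [a,b] of sum_k |l_k(x)|, where l_k are the
    Lagrange basis polynomials, by sampling on a fine auxiliary grid."""
    a, b = min(nodes), max(nodes)
    best = 0.0
    for j in range(samples + 1):
        x = a + (b - a) * j / samples
        s = 0.0
        for k, xk in enumerate(nodes):
            lk = 1.0
            for i, xi in enumerate(nodes):
                if i != k:
                    lk *= (x - xi) / (xk - xi)
            s += abs(lk)
        best = max(best, s)
    return best

n = 12
equidistant = [-1.0 + 2.0 * k / n for k in range(n + 1)]
chebyshev = [math.cos(math.pi * (2 * m + 1) / (2 * (n + 1))) for m in range(n + 1)]
L_equi = lebesgue_constant(equidistant)   # grows exponentially with n
L_cheb = lebesgue_constant(chebyshev)     # grows only logarithmically
bernstein_bound = 8.0 + 4.0 / math.pi * math.log(n + 1)
```

The sampled values slightly underestimate the true maxima, but the qualitative gap between the two node families is unmistakable even at this modest n.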
DEFINITION 3.3 The quantity

\varepsilon(f, P_n) = \min_{P_n(x)} \max_{a \le x \le b} |P_n(x) - f(x)|   (3.78)
is called the best approximation of a given function f(x) by polynomials of degree no greater than n on the interval [a, b]. Note that the minimum in formula (3.78) is taken with respect to all algebraic polynomials of degree no greater than n on the interval [a, b], not only the interpolating polynomials. In other words, the polynomials in (3.78) do not, generally speaking, have to coincide with f(x) at any given point of [a, b]. It is possible to show existence of a particular polynomial that realizes the best approximation (3.78). In most cases, however, this polynomial is difficult to obtain constructively. In general, polynomials of the best approximation can only be built using sophisticated iterative algorithms of nonsmooth optimization. On the other hand, their theoretical properties are well studied. Perhaps the most fundamental property is given by the Jackson inequality.

THEOREM 3.8 (Jackson inequality)
Let f = f(x) be defined on the interval [a, b], let it be r - 1 times continuously differentiable, and let the derivative f^{(r-1)}(x) be Lipschitz-continuous:

|f^{(r-1)}(x') - f^{(r-1)}(x'')| \le M |x' - x''|, \qquad M > 0.

Then, for any n \ge r the following inequality holds:

\varepsilon(f, P_n) \le C_r \left( \frac{b-a}{2n} \right)^r M,   (3.79)

where C_r = \left( \frac{\pi}{2} \right)^r are universal constants that depend neither on f, nor on n, nor on M.
The Jackson inequality [Jac94] reinforces, for sufficiently smooth functions, the result of the following classical theorem established in real analysis (see, e.g., [Rud87]):

THEOREM 3.9 (Weierstrass)
Let f \in C[a, b]. Then, for any \varepsilon > 0 there is an algebraic polynomial P_\varepsilon(x) such that \forall x \in [a, b]: |f(x) - P_\varepsilon(x)| \le \varepsilon.

A classical proof of Theorem 3.9 is based on a periodization of f that preserves its continuity (the period should obviously be larger than [a, b]) and then on the approximation by partial sums of a Taylor series that converges uniformly. The Weierstrass theorem implies that for f \in C[a, b] the best approximation defined by (3.78) converges to zero: \varepsilon(f, P_n) \to 0 when n \to \infty. This is basically as much
as one can tell regarding the behavior of \varepsilon(f, P_n) if nothing else is known about f(x) except that it is continuous. On the other hand, the Jackson inequality specifies the rate of decay for the best approximation as a particular inverse power of n, see formula (3.79), provided that f(x) is smooth. Let us also note that the value of \pi/2 that enters the expression for C_r = (\pi/2)^r in the Jackson inequality (3.79) may, in fact, be replaced by smaller values:

K_0 = 1, \quad K_1 = \frac{\pi}{2}, \quad \ldots,

known as the Favard constants. The Favard constants can be obtained explicitly for all r = 0, 1, 2, \ldots, and it is possible to show that all K_r < 2. The key consideration regarding the Favard constants is that substituting them into (3.79) makes this inequality sharp.

The main result that connects the properties of the best approximation (Definition 3.3) and the quality of interpolation by means of algebraic polynomials is given by the following

THEOREM 3.10 (Lebesgue inequality)
Let f \in C[a, b] and let \{x_0, x_1, \ldots, x_n\} be an arbitrary set of distinct interpolation nodes on [a, b]. Then,
\varepsilon(f, P_n) \le \| f - \mathscr{P}_n[f] \| \le (1 + L_n) \, \varepsilon(f, P_n).   (3.80)

Note that according to Definition 3.1, the operator \mathscr{P}_n in formula (3.80), generally speaking, depends on the choice of the interpolation nodes.

PROOF
It is obvious that we only need to prove the second inequality in (3.80), i.e., the upper bound. Consider an arbitrary polynomial Q(x) \in \{P_n(x)\} of degree no greater than n. As the algebraic interpolating polynomial is unique (Theorem 2.1 of Chapter 2), we obviously have \mathscr{P}_n[Q] = Q. Next,

\| f - \mathscr{P}_n[f] \| = \| f - Q + \mathscr{P}_n[Q] - \mathscr{P}_n[f] \| \le \| f - Q \| + \| \mathscr{P}_n[Q] - \mathscr{P}_n[f] \| = \| f - Q \| + \| \mathscr{P}_n[Q - f] \| \le \| f - Q \| + \| \mathscr{P}_n \| \, \| f - Q \| = (1 + L_n) \| f - Q \|.

Let us now introduce \delta > 0 and denote by Q_\delta(x) \in \{P_n(x)\} a polynomial for which \| f - Q_\delta \| < \varepsilon(f, P_n) + \delta. Then,

\| f - \mathscr{P}_n[f] \| \le (1 + L_n) \| f - Q_\delta \| < (1 + L_n)(\varepsilon(f, P_n) + \delta).

Finally, by taking the limit \delta \to 0, we obtain the desired inequality (3.80). □
The Lebesgue inequality (3.80) essentially provides an upper bound for the interpolation error \| f - \mathscr{P}_n[f] \| in terms of a product of the best approximation (3.78)
times the Lebesgue constant (3.74). Often, this estimate allows one to judge the convergence of algebraic interpolating polynomials as n increases. It is therefore clear that the behavior of the Lebesgue constants is of central importance for the convergence study.

THEOREM 3.11 (Faber-Bernstein)
For any choice of interpolation nodes x_0, x_1, \ldots, x_n on the interval [a, b], the following inequality holds:

L_n > \frac{1}{8\sqrt{\pi}} \ln(n+1).   (3.81)
Theorem 3.11 shows that the Lebesgue constants always grow as the grid dimension n increases. As such, the best one can generally hope for is to be able to place the interpolation nodes in such a way that this growth will be optimal, i.e., logarithmic. As far as the problem of interpolation is concerned, if, for example, nothing is known about the function f \in C[a, b] except that it is continuous, then nothing can be said about the behavior of the error beyond the estimate given by the Lebesgue inequality (3.80). The Weierstrass theorem (Theorem 3.9) indicates that \varepsilon(f, P_n) \to 0 as n \to \infty, and the Faber-Bernstein theorem (Theorem 3.11) says that L_n \to \infty as n \to \infty. We therefore have the uncertainty 0 \cdot \infty on the right-hand side of the Lebesgue inequality; and the behavior of this right-hand side is determined by which of the two processes dominates: the decay of the best approximations or the growth of the Lebesgue constants. In particular, if \lim_{n \to \infty} L_n \varepsilon(f, P_n) = 0, then the interpolating polynomials uniformly converge to f(x).

If the function f(x) is sufficiently smooth (as formulated in Theorem 3.8), then combining the Lebesgue inequality (3.80) and the Jackson inequality (3.79) we obtain the following error estimate:³

\| f - \mathscr{P}_n[f] \| < (L_n + 1) \, C_r \left( \frac{b-a}{2n} \right)^r M,   (3.82)

which implies that the convergence rate (if there is convergence) will depend on the behavior of L_n when n increases. If the interpolation grid is uniform (equidistant nodes), then the Lebesgue constants grow exponentially as n increases, see inequalities (2.18) of Section 2.1, Chapter 2. In this case, the limit (as n \to \infty) on the right-hand side of (3.82) is infinite for any finite value of r. This does not necessarily mean that the sequence of interpolating polynomials \mathscr{P}_n[f](x) diverges, because inequality (3.82) only provides an upper bound for the error. It does mean, though, that in this case convergence of the interpolating polynomials simply cannot be judged using the arguments based on the inequalities of Lebesgue and Jackson.

³ Note that in estimate (3.82) the function f(x) is assumed to have a maximum of r - 1 derivatives, and the derivative f^{(r-1)}(x) is required to be Lipschitz-continuous (Theorem 3.8), which basically makes f(x) "almost" r times differentiable. Previously, we have used a slightly different notation, and in the error estimate (3.67) the function was assumed to have a maximum of r + 1 derivatives.
On the other hand, for the Chebyshev interpolation grid (3.64), the following theorem asserts that the asymptotic behavior of the Lebesgue constants is optimal:

THEOREM 3.12 (Bernstein)
Let the interpolation nodes x_0, x_1, \ldots, x_n on the interval [-1, 1] be given by the roots of the Chebyshev polynomial T_{n+1}(x). Then,

L_n < 8 + \frac{4}{\pi} \ln(n+1).   (3.83)
Therefore, according to (3.82) and (3.83), if the derivative f^{(r-1)}(x) of the function f(x) is Lipschitz-continuous, then the sequence of algebraic interpolating polynomials built on the Chebyshev nodes converges uniformly to f(x) with the rate \mathcal{O}(n^{-r} \ln(n+1)) as n \to \infty.
Let us now recall that a Lipschitz-continuous function f(x) on the interval [-1, 1]:

|f(x') - f(x'')| \le \mathrm{const} \, |x' - x''|, \qquad x', x'' \in [-1, 1],

is absolutely continuous on [-1, 1], and as such, according to the Lebesgue theorem [KF75], its derivative is integrable, i.e., exists in L_1[-1, 1]. In this sense, we can say that Lipschitz-continuity is "not very far" from differentiability, although this is, of course, not sufficient to claim that the derivative f^{(r)}(x) is bounded. On the other hand, if f(x) is r times differentiable on [-1, 1] and f^{(r)}(x) is bounded, then f^{(r-1)}(x) is Lipschitz-continuous. Therefore, for a function with its rth derivative bounded, the rate of convergence of Chebyshev interpolating polynomials is at least as fast as the inverse of the grid dimension raised to the power r (the smoothness of the interpolated function), times an additional slowly increasing factor ~ ln n. At the same time, recall that the unavoidable error of reconstructing a function with r derivatives on a uniform grid with n nodes is \mathcal{O}(n^{-r}). This is also true for the Chebyshev grid, because the Chebyshev grid on the diameter is equivalent to a uniform grid on the circle, see Figure 3.4. Consequently, the accuracy of the Chebyshev interpolating polynomials appears to be only a logarithmic factor away from the level of the unavoidable error. As such, we have shown that Chebyshev interpolation is not saturated by smoothness and practically reaches the intrinsic accuracy limit.

Altogether, we see that the type of interpolation grid may indeed have a drastic effect on convergence, which corroborates our previous observations. For the Bernstein example f(x) = |x|, -1 \le x \le 1 (Section 2.1.5 of Chapter 2), the sequence of interpolating polynomials constructed on a uniform grid diverges. On the Chebyshev grid we have seen experimentally that it converges. Now, using estimates (3.82) and (3.83), we can say that the rate of this convergence is at least \mathcal{O}(n^{-1} \ln(n+1)).
To conclude, let us also note that, strictly speaking, the behavior of L_n on the Chebyshev grid is only asymptotically optimal rather than optimal, because the constants in the lower bound (3.81) and in the upper bound (3.83) are different. Better values of these constants than those guaranteed by the Bernstein theorem (Theorem 3.12)
have been obtained more recently, see inequality (3.66). However, there is still a gap between (3.81) and (3.66).

REMARK 3.3 Formula (3.78) that introduces the best approximation according to Definition 3.3 can obviously be recast as
\varepsilon(f, P_n) = \min_{P_n(x)} \| P_n(x) - f(x) \|_C,

where the norm on the right-hand side is taken in the sense of the space C[a, b]. In general, the notion of the best approximation admits a much broader interpretation, when both the norm (currently, \| \cdot \|_C) and the class of approximating functions (currently, polynomials P_n(x)) may be different. In fact, one can consider the problem of approximating a given element of a linear space by linear combinations of a predefined set of elements from the same space in the sense of a selected norm. This is the same concept as exploited when introducing the Kolmogorov diameters, see Remark 2.3 on page 41. For example, consider the space L_2[a, b] of all square integrable functions f(x), x \in [a, b], equipped with the norm:

\| f \|_2 = \left[ \int_a^b f^2(x) \, dx \right]^{1/2}.
This space is known to be a Hilbert space. Let us take an arbitrary f \in L_2[a, b] and consider the set of all trigonometric polynomials Q_n(x) of type (3.6), where L = b - a is the length of the interval. Similarly to the algebraic polynomials P_n(x) employed in Definition 3.3, the trigonometric polynomials Q_n(x) do not have to be interpolating polynomials. Then, it is known that the best approximation in the sense of L_2:
£2(j, Qn ) = min I l f (x)  Qn (X) 11 2 Qn(x) is, in fact, realized by the partial sum Sn (x) , see formula (3.38), of the Fourier series for f (x) with the coefficients defined by ( 3.40 ) . An upper bound for the actual magnitude of the L2 best approximation is then given by estimate ( 3.41 ) for the remainder 8Sn(x) of the series, see formula (3.39 ) :
£2(j, QII )
:::;
.f;,
nr+ z
where Sn
= 0( 1 ), n > 00,
where r + 1 is the maximum smoothness of f(x). Having identified what the best approximation in the sense of L2 is, we can easily see now that both the Lebesgue inequality (Theorem 3.10 ) and the error estimate for trigono metric interpolation (Theorem 3.5 ) are, in fact, justified using the same argu ment. It employs uniqueness of the corresponding interpolating polynomial, the estimate for the best approximation, and the estimate of sensitivity to perturbations given by the Lebesgue constants. 0
Exercises

1. Let the function f = f(x) be defined on an arbitrary interval [a, b], rather than on [-1, 1]. Construct the Chebyshev interpolation nodes for [a, b], and write down the interpolating polynomials P_n(x, f) and \bar P_n(x, f) similar to those obtained in Sections 3.2.3 and 3.2.6, respectively.
2. For the function f(x) = \frac{1}{x^2 + 1/4}, -1 \le x \le 1, construct the algebraic interpolating polynomial P_n(x, f) using the roots of the Chebyshev polynomial T_{n+1}(x), see (3.64), as interpolation nodes. Plot the graphs of f(x) and P_n(x, f) for n = 5, 10, 20, 30, and 40. Do the same for the interpolating polynomial P_n(x, f) built on the equally spaced nodes x_k = -1 + 2k/n, k = 0, 1, 2, \ldots, n (Runge example of Section 2.1.5, Chapter 2). Explain the observable qualitative difference between the two interpolation techniques.
3. Introduce the normalized Chebyshev polynomial \hat T_n(x) of degree n by setting \hat T_n(x) = 2^{1-n} T_n(x).

a) Show that the coefficient in front of x^n in the polynomial \hat T_n(x) is equal to one.

b) Show that the deviation \max_{-1 \le x \le 1} |\hat T_n(x)| of the polynomial \hat T_n(x) from zero on the interval -1 \le x \le 1 is equal to 2^{1-n}.

c)* Show that among all the polynomials of degree n with the leading coefficient (i.e., the coefficient in front of x^n) equal to one, the normalized Chebyshev polynomial \hat T_n(x) has the smallest deviation from zero on the interval -1 \le x \le 1.

d) How can one choose the interpolation nodes t_0, t_1, \ldots, t_n on the interval [-1, 1], so that the polynomial (t - t_0)(t - t_1) \cdots (t - t_n), which is a part of the formula for the interpolation error (2.23), Chapter 2, would have the smallest possible deviation from zero on the interval [-1, 1]?
4. Find a set of interpolation nodes for an even $2\pi$-periodic function $F(\varphi)$, see formula (3.55), for which the Lebesgue constants would coincide with the Lebesgue constants of algebraic interpolation on equidistant nodes.
Chapter 4 Computation of Definite Integrals. Quadratures
A definite integral $\int_a^b f(x)\,dx$ can only be evaluated exactly if the primitive (antiderivative) of the integrand, $F(x) = \int f(x)\,dx$, is available in closed form, e.g., as a tabular integral. Then we can use the fundamental theorem of calculus (the Newton–Leibniz formula) and obtain:
$$\int_a^b f(x)\,dx = F(b) - F(a).$$
It is an even more rare occasion that a multiple integral, say,
$$\iint_{\Omega} f(x,y)\,dx\,dy,$$
where $\Omega$ is a given domain, can be evaluated exactly. Therefore, numerical methods play an important role for the approximate computation of definite integrals.
In this chapter, we will introduce several numerical methods for computing definite integrals. We will identify the methods that automatically take into account the regularity of the integrand, i.e., do not get saturated by smoothness (Section 4.2), as opposed to those methods that are prone to saturation (Section 4.1). We will also discuss the difficulties associated with the increase of dimension for multiple integrals (Section 4.4) and outline some combined analytical/numerical approaches that can be used for computing improper integrals (Section 4.3). Formulae that are used for the approximate evaluation of definite integrals given a table of values of the integrand are called quadrature formulae.
4.1 Trapezoidal Rule, Simpson's Formula, and the Like

Recall that the value of the definite integral $\int_a^b f(x)\,dx$ can be interpreted geometrically as the area under the graph of $f = f(x)$.
4.1.1 General Construction of Quadrature Formulae
To compute an approximate value of the foregoing area under the graph, we first introduce a grid of distinct nodes and thus partition the interval $[a,b]$ into $n$ subintervals: $a = x_0 < x_1 < \cdots < x_n = b$. Then, we employ a piecewise polynomial interpolation of some degree $s$, and replace the integrand $f = f(x)$ by the piecewise polynomial function $P_s(x)$. In other words, on every subinterval $[x_k, x_{k+1}]$, the function $P_s(x)$ is defined as a polynomial of degree no greater than $s$ built on the $s+1$ consecutive grid nodes $x_{k-j}, x_{k-j+1}, \ldots, x_{k-j+s}$, where $j$ may also depend on $k$, see Section 2.2. Finally, we use the following approximation of the integral:
$$\int_a^b f(x)\,dx \approx \int_a^b P_s(x)\,dx = \int_{x_0}^{x_1} P_s(x,f_{0j})\,dx + \int_{x_1}^{x_2} P_s(x,f_{1j})\,dx + \cdots + \int_{x_{n-1}}^{x_n} P_s(x,f_{n-1,j})\,dx. \qquad (4.1)$$

Each term on the right-hand side of this formula is the integral of a polynomial of degree no greater than $s$. For every $k = 0, 1, \ldots, n-1$, the polynomial is given by $P_s(x,f_{kj})$, and the corresponding integral over $[x_k, x_{k+1}]$ can be evaluated explicitly using the fundamental theorem of calculus. The resulting expression will only contain the locations of the grid nodes $x_{k-j}, x_{k-j+1}, \ldots, x_{k-j+s}$ and the function values $f(x_{k-j}), f(x_{k-j+1}), \ldots, f(x_{k-j+s})$. As such, formula (4.1) is a quadrature formula. It is clear that the following error estimate holds for the quadrature formula (4.1):
$$\left| \int_a^b f(x)\,dx - \int_a^b P_s(x)\,dx \right| \le \int_a^b |f(x) - P_s(x)|\,dx. \qquad (4.2)$$
THEOREM 4.1 Let the function $f(x)$ have a continuous derivative of order $s+1$ on $[a,b]$, and denote $\max_{a\le x\le b} |f^{(s+1)}(x)| = M_{s+1}$. Let also $\max_{k=0,1,\ldots,n-1} |x_{k+1} - x_k| = h$. Then
$$\left| \int_a^b f(x)\,dx - \int_a^b P_s(x)\,dx \right| \le \mathrm{const} \cdot |b-a|\, M_{s+1}\, h^{s+1}. \qquad (4.3)$$
According to Theorem 4.1, if the integrand $f(x)$ is $s+1$ times continuously differentiable, then the quadrature formula (4.1) has order of accuracy $s+1$ with respect to the grid size $h$.

PROOF In Section 2.2.2, we obtained the following error estimate for the piecewise polynomial algebraic interpolation, see formula (2.32):
$$\max_{x_k \le x \le x_{k+1}} |f(x) - P_s(x,f_{kj})| \le \mathrm{const} \cdot \max_{x_{k-j} \le x \le x_{k-j+s}} |f^{(s+1)}(x)|\, h^{s+1}.$$
Consequently, we can write:
$$\max_{a \le x \le b} |f(x) - P_s(x)| \le \mathrm{const} \cdot M_{s+1}\, h^{s+1}.$$
Combining this estimate with inequality (4.2), we immediately arrive at the desired result (4.3). □

Quadrature formulae of type (4.1) are most commonly used when the underlying interpolation is piecewise linear (the trapezoidal rule, see Section 4.1.2) or piecewise quadratic (Simpson's formula, see Section 4.1.3).

4.1.2 Trapezoidal Rule
Let $s = 1$, and let the grid be uniform:
$$x_k = a + k \cdot h, \quad k = 0, 1, \ldots, n, \qquad h = \frac{b-a}{n}.$$
Then the integral $\int_{x_k}^{x_{k+1}} P_1(x)\,dx \equiv \int_{x_k}^{x_{k+1}} P_1(x,f_{k0})\,dx$ is equal to the area of the trapezoid under the graph of the linear function $P_1(x,f_{k0}) = f(x_k)(x_{k+1}-x)/h + f(x_{k+1})(x-x_k)/h$ on the interval $[x_k, x_{k+1}]$:
$$\int_{x_k}^{x_{k+1}} P_1(x)\,dx = \frac{b-a}{n} \cdot \frac{f_k + f_{k+1}}{2}.$$
Adding these equalities together for all $k = 0, 1, \ldots, n-1$, we transform the general formula (4.1) to:
$$\int_a^b f(x)\,dx \approx \int_a^b P_1(x)\,dx = \frac{b-a}{n}\left( \frac{f_0}{2} + f_1 + \cdots + f_{n-1} + \frac{f_n}{2} \right). \qquad (4.4)$$
Quadrature formula (4.4) is called the trapezoidal rule. Error estimate (4.3) for the general quadrature formula (4.1) can be refined in the case of the trapezoidal rule (4.4). Recall that in Section 2.2.2 we obtained the following error estimate for the piecewise linear interpolation itself, see formula (2.33):
$$\max_{x_k \le x \le x_{k+1}} |f(x) - P_1(x)| \le \frac{h^2}{8} \max_{x_k \le x \le x_{k+1}} |f''(x)|.$$
Consequently,
$$|f(x) - P_1(x)| \le \frac{h^2}{8} \max_{x_k \le x \le x_{k+1}} |f''(x)| \le \frac{h^2}{8} \max_{a \le x \le b} |f''(x)|,$$
and as the right-hand side of this inequality does not depend on what particular subinterval $[x_k, x_{k+1}]$ the point $x$ belongs to, the same estimate holds for all $x \in [a,b]$:
$$\max_{a \le x \le b} |f(x) - P_1(x)| \le \frac{h^2}{8} \max_{a \le x \le b} |f''(x)|.$$
According to formula (4.2), the error of the trapezoidal rule built on equidistant nodes is then estimated as:
$$\left| \int_a^b f(x)\,dx - \int_a^b P_1(x)\,dx \right| \le \frac{b-a}{8} \max_{a \le x \le b} |f''(x)|\, h^2. \qquad (4.5)$$
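Estimate (4.5) is easy to check numerically. The following sketch (our illustrative Python helper, not part of the book) implements rule (4.4) and evaluates its error for $f(x) = x^2$ on $[0,1]$:

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (4.4) on n equal subintervals of [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

# For f(x) = x^2 on [0, 1], the quadrature evaluates to 1/3 + h^2/6, so the
# error stays exactly h^2/6 no matter how many derivatives the integrand has.
errors = {n: trapezoid(lambda x: x * x, 0.0, 1.0, n) - 1.0 / 3.0 for n in (4, 8, 16)}
```

The computed errors match $h^2/6$ to rounding.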
This estimate guarantees that the integration error will decay no slower than $\mathscr{O}(h^2)$ as $h \to 0$. Alternatively, we say that the order of accuracy of the trapezoidal rule is second with respect to the grid size $h$. Furthermore, estimate (4.5) is in fact sharp in the sense that even for infinitely differentiable functions [as opposed to the merely twice differentiable functions that enable estimate (4.5)] the error will only decrease as $\mathscr{O}(h^2)$ and not faster. This is demonstrated by the following example.

Example. Let $f(x) = x^2$ and let $[a,b] = [0,1]$. Then $\int_0^1 x^2\,dx = \frac13$. On the other hand, using the trapezoidal rule (4.4) with $h = 1/n$ we can write:
$$\frac{1}{n}\left( \frac{x_0^2}{2} + x_1^2 + \cdots + x_{n-1}^2 + \frac{x_n^2}{2} \right) = \frac13 + \frac{h^2}{6}.$$
Therefore, the integration error always remains equal to $h^2/6$ even though the integrand $f(x) = x^2$ has continuous derivatives of all orders rather than only up to the second order. In other words, the error is insensitive to the smoothness of $f(x)$ [beyond $f''(x)$], which means that, generally speaking, the trapezoidal quadrature formula is prone to saturation by smoothness.

There is, however, one special case when the trapezoidal rule does not get saturated and rather adjusts its accuracy automatically to the smoothness of the integrand. This is the case of a smooth and periodic function $f = f(x)$ that has period $L = b - a$. For such a function, the order of accuracy with respect to the grid size $h$ will no longer
be limited by $s + 1 = 2$ and will rather increase as the regularity of $f(x)$ increases. More precisely, the following theorem holds.

THEOREM 4.2 Let $f = f(x)$ be an $L$-periodic function that has continuous $L$-periodic derivatives up to the order $r > 0$ and a square integrable derivative of order $r+1$:
+
+ J [ir+ l) (x) ] 2 dx < 00.
aL a
Then the error of the trapezoidal quadrature rule (4.4) can be estimated as:
a+L a+L f(x)dx  PI (X)dX a a where 'n 0 as n + 00.
/J
a+L ' f(X)dX  h 1 a
/ /J
J
=:
+
�0 (� + fl; 1 ) / I
::; L ·
:� nr /2 ' (4.6)
The result of Theorem 4.2 is considered well-known; a proof, for instance, can be found in [DR84, Section 2.9]. Our proof below is based on a simpler argument than the one exploited in [DR84].

PROOF We first note that since the function $f$ is integrated over a full period, we may assume without loss of generality that $a = 0$. Let us represent $f(x)$ as the sum of its Fourier series (cf. Section 3.1.3):
$$f(x) = S_{n-1}(x) + \delta S_{n-1}(x), \qquad (4.7)$$
where
$$S_{n-1}(x) = \frac{\alpha_0}{2} + \sum_{k=1}^{n-1} \alpha_k \cos\frac{2\pi kx}{L} + \sum_{k=1}^{n-1} \beta_k \sin\frac{2\pi kx}{L} \qquad (4.8)$$
is a partial sum of order $n-1$, and
$$\delta S_{n-1}(x) = \sum_{k=n}^{\infty} \alpha_k \cos\frac{2\pi kx}{L} + \sum_{k=n}^{\infty} \beta_k \sin\frac{2\pi kx}{L} \qquad (4.9)$$
is the corresponding remainder. The coefficients $\alpha_k$ and $\beta_k$ of the Fourier series for $f(x)$ are given by the formulae:
$$\alpha_k = \frac{2}{L} \int_0^L f(x) \cos\frac{2\pi kx}{L}\,dx, \qquad \beta_k = \frac{2}{L} \int_0^L f(x) \sin\frac{2\pi kx}{L}\,dx.$$
According to (4.7), we can write:
$$\int_0^L f(x)\,dx = \int_0^L S_{n-1}(x)\,dx + \int_0^L \delta S_{n-1}(x)\,dx. \qquad (4.10)$$
Let us first apply the trapezoidal quadrature formula to the first integral on the right-hand side of (4.10). This can be done individually for each term in the sum (4.8). First of all, for the constant component $k = 0$ we immediately derive:
$$h \sum_{l=0}^{n-1} \left( \frac{\alpha_0/2}{2} + \frac{\alpha_0/2}{2} \right) = L\,\frac{\alpha_0}{2} = \int_0^L S_{n-1}(x)\,dx = \int_0^L f(x)\,dx.$$
For all other terms $k = 1, 2, \ldots, n-1$, we exploit periodicity with the period $L$, use the definition of the grid size $h = L/n$, and obtain:
$$h \sum_{l=0}^{n-1} \left( \frac12 \cos\frac{2\pi klh}{L} + \frac12 \cos\frac{2\pi k(l+1)h}{L} \right) = \frac{h}{2} \sum_{l=0}^{n-1} \left( e^{i\frac{2\pi klh}{L}} + e^{-i\frac{2\pi klh}{L}} \right) = \frac{h}{2} \left( \frac{e^{i\frac{2\pi knh}{L}} - 1}{e^{i\frac{2\pi kh}{L}} - 1} + \frac{e^{-i\frac{2\pi knh}{L}} - 1}{e^{-i\frac{2\pi kh}{L}} - 1} \right) = 0,$$
since $e^{\pm i 2\pi knh/L} = e^{\pm i 2\pi k} = 1$. Analogously,
$$h \sum_{l=0}^{n-1} \left( \frac12 \sin\frac{2\pi klh}{L} + \frac12 \sin\frac{2\pi k(l+1)h}{L} \right) = 0.$$
Altogether we conclude that the trapezoidal rule integrates the partial sum $S_{n-1}(x)$ given by formula (4.8) exactly:
$$h \sum_{l=0}^{n-1} \left( \frac{S_{n-1}(x_l)}{2} + \frac{S_{n-1}(x_{l+1})}{2} \right) = \int_0^L S_{n-1}(x)\,dx = \frac{L}{2}\,\alpha_0. \qquad (4.11)$$
We also note that choosing the partial sum of order $n-1$, where the number of grid cells is $n$, is not accidental. From the previous derivation it is easy to see that equality (4.11) would no longer hold if we were to take $S_n(x)$ instead of $S_{n-1}(x)$.

Next, we need to apply the trapezoidal rule to the remainder of the series $\delta S_{n-1}(x)$ given by formula (4.9). Recall that the magnitude of this remainder, or equivalently, the rate of convergence of the Fourier series, is determined by the smoothness of the function $f(x)$. More precisely, as a part of the proof of Theorem 3.5 (page 68), we have shown that for the function $f(x)$ that has a square integrable derivative of order $r+1$, the following estimate holds, see formula (3.41):
$$\sup_{0 \le x \le L} |\delta S_{n-1}(x)| \le \frac{\xi_n}{n^{r+1/2}}, \qquad (4.12)$$
where $\xi_n = o(1)$ when $n \to \infty$. Therefore,
$$\left| h \sum_{l=0}^{n-1} \left( \frac{\delta S_{n-1}(x_l)}{2} + \frac{\delta S_{n-1}(x_{l+1})}{2} \right) \right| \le \frac{h}{2} \sum_{l=0}^{n-1} \Big( |\delta S_{n-1}(x_l)| + |\delta S_{n-1}(x_{l+1})| \Big) \le \frac{h}{2}\, 2n \sup_{0 \le x \le L} |\delta S_{n-1}(x)| \le L\,\frac{\xi_n}{n^{r+1/2}},$$
which completes the proof of the theorem. □
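The conclusion of Theorem 4.2 can be observed numerically. In the sketch below (ours, not the book's), the trapezoidal rule is applied to the smooth $2\pi$-periodic function $e^{\cos x}$ over a full period, where rule (4.4) collapses to a plain sum of $n$ samples because $f(0) = f(L)$:

```python
import math

def trapezoid_periodic(f, L, n):
    # Over a full period f(0) = f(L), so rule (4.4) reduces to h * (f_0 + ... + f_{n-1}).
    h = L / n
    return h * sum(f(l * h) for l in range(n))

f = lambda x: math.exp(math.cos(x))  # smooth and 2*pi-periodic
ref = trapezoid_periodic(f, 2 * math.pi, 64)   # already converged to machine precision
err8 = abs(trapezoid_periodic(f, 2 * math.pi, 8) - ref)
err16 = abs(trapezoid_periodic(f, 2 * math.pi, 16) - ref)
# err8 is already tiny and err16 sits at rounding level: spectral convergence.
```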
Thus, if the integrand is periodic, then the trapezoidal quadrature rule does not get saturated by smoothness, because according to (4.6) the integration error will self-adjust to the regularity of the integrand without us having to change anything in the quadrature formula. Moreover, if the function $f(x)$ is infinitely differentiable, then Theorem 4.2 implies spectral convergence. In other words, the rate of decay of the error will be faster than any polynomial, i.e., faster than $\mathscr{O}(n^{-r})$ for any $r > 0$.

REMARK 4.1 The result of Theorem 4.2 can, in fact, be extended to a large family of quadrature formulae (4.1) called the Newton–Cotes formulae. Namely, let $n = M \cdot s$, where $M$ and $s$ are positive integers, and partition the original grid of $n$ cells into $M$ clusters of $s$ $h$-size cells each:
$$\hat x_m = m \cdot sh, \quad m = 0, 1, \ldots, M.$$
Clearly, the relation between the original grid nodes $x_k$, $k = 0, 1, \ldots, n$, and the coarsened grid nodes $\hat x_m$, $m = 0, 1, \ldots, M$, is given by $\hat x_m = x_{m \cdot s}$. On each interval $[\hat x_m, \hat x_{m+1}]$, $m = 0, 1, \ldots, M-1$, we interpolate the integrand by an algebraic polynomial of degree no greater than $s$:
$$f(x) \approx P_s(x, f, x_{m \cdot s}, x_{m \cdot s + 1}, \ldots, x_{(m+1) \cdot s}), \qquad \hat x_m \equiv x_{m \cdot s} \le x \le x_{(m+1) \cdot s} \equiv \hat x_{m+1}.$$
This polynomial is obviously unique. In other words, we can say that if $x \in [x_k, x_{k+1}]$, $k = 0, 1, \ldots, n-1$, then
$$f(x) \approx P_s(x, f, x_{k-j}, x_{k-j+1}, \ldots, x_{k-j+s}), \quad \text{where } j = k - [k/s] \cdot s,$$
and $[\,\cdot\,]$ denotes the integer part. The quadrature formula is subsequently built by integrating the resulting piecewise polynomial function $P_s(x)$; it clearly belongs to the class (4.1). In particular, the Simpson formula of Section 4.1.3 is obtained by taking $n$ even, setting $s = 2$ and $M = n/2$. For periodic integrands, quadrature formulae of this type do not get saturated by smoothness, see Exercise 2. Quadrature formulae of a completely different type, which do not saturate for non-periodic integrands either, are discussed in Section 4.2. □

4.1.3 Simpson's Formula
This quadrature formula is obtained by replacing the integrand $f(x)$ with a piecewise quadratic interpolant. For sufficiently smooth functions $f(x)$, the error of the Simpson formula decays with the rate $\mathscr{O}(h^4)$ as $h \to 0$, i.e., faster than the error of the trapezoidal rule.

Assume that $n$ is even, $n = 2M$, and that the grid nodes are equidistant: $x_{k+1} - x_k = h = (b-a)/n$, $k = 0, 1, \ldots, n-1$. On every interval $[x_{2m}, x_{2m+2}]$, $m = 0, 1, \ldots, M-1$, we replace the integrand $f(x)$ by a quadratic interpolating polynomial $P_2(x, f, x_{2m}, x_{2m+1}, x_{2m+2}) \equiv P_2(x, f_{2m}, f_{2m+1}, f_{2m+2})$, which is defined uniquely and can be written, e.g., using the Lagrange formula (see Section 2.1.1):
$$f(x) \approx P_2(x, f_{2m}, f_{2m+1}, f_{2m+2}) = f_{2m}\,\frac{(x - x_{2m+1})(x - x_{2m+2})}{(x_{2m} - x_{2m+1})(x_{2m} - x_{2m+2})} + f_{2m+1}\,\frac{(x - x_{2m})(x - x_{2m+2})}{(x_{2m+1} - x_{2m})(x_{2m+1} - x_{2m+2})} + f_{2m+2}\,\frac{(x - x_{2m})(x - x_{2m+1})}{(x_{2m+2} - x_{2m})(x_{2m+2} - x_{2m+1})}.$$
Equivalently, we can say that for $x \in [x_k, x_{k+1}]$, $k = 0, 1, \ldots, n-1$, we approximate $f(x)$ by means of the quadratic polynomial $P_2(x, f_{kj}) \equiv P_2(x, f, x_{k-j}, x_{k-j+1}, x_{k-j+2})$, where $j = 0$ if $k$ is even, and $j = 1$ if $k$ is odd. The polynomial $P_2(x, f_{2m}, f_{2m+1}, f_{2m+2})$ can be easily integrated over the interval $[x_{2m}, x_{2m+2}]$ with the help of the fundamental theorem of calculus, which yields:
$$\int_{x_{2m}}^{x_{2m+2}} P_2(x)\,dx = \frac{h}{3}\left( f_{2m} + 4 f_{2m+1} + f_{2m+2} \right).$$
By adding these equalities together for all $m = 0, 1, \ldots, M-1$, we obtain:
$$\int_a^b f(x)\,dx \approx \sigma_n(f) \overset{\mathrm{def}}{=} \frac{b-a}{3n}\left( f_0 + 4 f_1 + 2 f_2 + 4 f_3 + \cdots + 4 f_{n-1} + f_n \right). \qquad (4.13)$$
Formula (4.13) for the approximate computation of the definite integral is known as the Simpson quadrature formula.
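A minimal implementation of formula (4.13) (our sketch, not code from the book):

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson rule (4.13); n must be even."""
    assert n % 2 == 0
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, n))
    return s * h / 3.0

# For the smooth integrand sin(x) on [0, pi] (exact integral: 2), halving h
# reduces the error by a factor of roughly 2**4 = 16, i.e., fourth-order accuracy.
err_coarse = abs(simpson(math.sin, 0.0, math.pi, 10) - 2.0)
err_fine = abs(simpson(math.sin, 0.0, math.pi, 20) - 2.0)
ratio = err_coarse / err_fine
```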
THEOREM 4.3 Let the third derivative $f^{(3)}(x)$ of the integrand $f(x)$ be continuous on the interval $[a,b]$, and let $\max_{a \le x \le b} |f^{(3)}(x)| = M_3$. Then the error of the Simpson formula satisfies the following estimate:
$$\left| \int_a^b f(x)\,dx - \sigma_n(f) \right| \le \frac{b-a}{9\sqrt{3}}\, M_3\, h^3. \qquad (4.14)$$
PROOF To estimate the integration error for the Simpson formula, we first need to recall the error estimate for the interpolation of the integrand $f(x)$ by piecewise quadratic polynomials. Let us denote the interpolation error by $R_2(x) \overset{\mathrm{def}}{=} f(x) - P_2(x)$, where the function $P_2(x)$ coincides with the quadratic polynomial $P_2(x, f_{2m}, f_{2m+1}, f_{2m+2})$ if $x \in [x_{2m}, x_{2m+2}]$. Then we can use Theorem 2.5 [formula (2.23) on page 36] and write for $x \in [x_{2m}, x_{2m+2}]$:
$$R_2(x) = \frac{f^{(3)}(\xi)}{3!}\,(x - x_{2m})(x - x_{2m+1})(x - x_{2m+2}), \qquad \xi \in (x_{2m}, x_{2m+2}).$$
Consequently,
$$|R_2(x)| \le \frac{M_3}{6} \max_{x_{2m} \le x \le x_{2m+2}} |(x - x_{2m})(x - x_{2m+1})(x - x_{2m+2})| = \frac{M_3}{9\sqrt{3}}\, h^3.$$
The previous inequality holds for $x \in [x_{2m}, x_{2m+2}]$, where $m$ can be any number: $m = 0, 1, \ldots, M-1$. Hence it holds for any $x \in [a,b]$, and we therefore obtain:
$$\max_{a \le x \le b} |R_2(x)| \le \frac{M_3}{9\sqrt{3}}\, h^3.$$
For the integration error, we thus have:
$$\left| \int_a^b f(x)\,dx - \sigma_n(f) \right| = \left| \int_a^b f(x)\,dx - \int_a^b P_2(x)\,dx \right| = \left| \int_a^b R_2(x)\,dx \right| \le \int_a^b |R_2(x)|\,dx \le \frac{b-a}{9\sqrt{3}}\, M_3\, h^3,$$
which completes the proof of the theorem. □
Theorem 4.3 implies that for the function $f(x)$ that is three times continuously differentiable on $[a,b]$, the error of the Simpson formula will decrease cubically with respect to the grid size $h$, i.e., with the rate $\mathscr{O}(h^3)$ as $h \to 0$, see estimate (4.14). For a smoother integrand $f(x)$, one can, in fact, obtain an even faster convergence.

THEOREM 4.4 Let the fourth derivative $f^{(4)}(x)$ of the integrand $f(x)$ be continuous on the interval $[a,b]$, and let $\max_{a \le x \le b} |f^{(4)}(x)| = M_4$. Then the error of the Simpson formula satisfies the following estimate:
provided only that the derivative $f^{(r)}(x)$ of the function $f(x)$ is Lipschitz-continuous. Consequently, for the error of the Gaussian quadrature we have:
$$\left| \int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx - \frac{\pi}{n+1} \sum_{m=0}^{n} f(x_m) \right| = \mathscr{O}\!\left( \frac{\ln(n+1)}{n^{r+1}} \right) \quad \text{as } n \to \infty. \qquad (4.26)$$
Estimate (4.26) implies, in particular, that the Gaussian quadrature (4.23) converges at least for any Lipschitz-continuous function $f(x)$ (page 87), and that the rate of decay of the error is not slower than $\mathscr{O}(n^{-1}\ln(n+1))$ as $n \to \infty$. Otherwise, for smoother functions the convergence speeds up as the regularity increases, and becomes spectral for infinitely differentiable functions. □
REMARK 4.3 When proving Lemma 4.2, we have, in fact, shown that equality (4.22) holds for all Chebyshev polynomials $T_k(x)$ up to the degree $k = 2n+1$, and not only up to the degree $k = n$. This circumstance reflects on another important property of the Gaussian quadrature (4.23). It happens to be exact, i.e., it generates no error, for any algebraic polynomial of degree no higher than $2n+1$ on the interval $-1 \le x \le 1$. In other words, if $Q_{2n+1}(x)$ is a polynomial of degree $\le 2n+1$, then
$$\int_{-1}^{1} \frac{Q_{2n+1}(x)}{\sqrt{1 - x^2}}\,dx = \frac{\pi}{n+1} \sum_{m=0}^{n} Q_{2n+1}(x_m).$$
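This exactness property is easy to verify numerically. In the sketch below (ours, not the book's), the nodes are taken as the roots of $T_{n+1}(x)$ from (4.17) with the equal weights $\pi/(n+1)$, and the exact even moments $\int_{-1}^{1} x^{2j}/\sqrt{1-x^2}\,dx = \pi\,(2j-1)!!/(2j)!!$ are standard facts used only for comparison:

```python
import math

def gauss_chebyshev(f, n):
    # Chebyshev-Gauss nodes (4.17): x_m = cos((2m+1)pi/(2(n+1))), m = 0..n,
    # with equal weights pi/(n+1).
    nodes = [math.cos((2 * m + 1) * math.pi / (2 * (n + 1))) for m in range(n + 1)]
    return math.pi / (n + 1) * sum(f(x) for x in nodes)

# With n + 1 = 4 nodes the rule integrates x**k / sqrt(1 - x**2) exactly for
# all degrees k <= 2n + 1 = 7.  Odd moments vanish by symmetry.
n = 3
exact = {0: math.pi, 2: math.pi / 2, 4: 3 * math.pi / 8, 6: 5 * math.pi / 16}
max_err = max(abs(gauss_chebyshev(lambda x, k=k: x ** k, n) - v) for k, v in exact.items())
odd_err = max(abs(gauss_chebyshev(lambda x, k=k: x ** k, n)) for k in (1, 3, 5, 7))
```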
Moreover, it is possible to show that the Gaussian quadrature (4.23) is optimal in the following sense. Consider a family of quadrature formulae on $[-1, 1]$:
$$\int_{-1}^{1} f(x)\,w(x)\,dx \approx \sum_{j=0}^{n} a_j f(x_j), \qquad (4.27)$$
where the dimension $n$ is fixed, the weight $w(x)$ is defined as before, $w(x) = 1/\sqrt{1 - x^2}$, and both the nodes $x_j$ and the coefficients $a_j$, $j = 0, 1, \ldots, n$, are to be determined. Require that this quadrature be exact for the polynomials of the highest degree possible. Then, it turns out that the corresponding degree will be equal to $2n+1$, while the nodes $x_j$ will coincide with those of the Chebyshev–Gauss grid (4.17), and the coefficients $a_j$ will all be equal to $\pi/(n+1)$. Altogether, quadrature (4.27) will coincide with (4.23). □
Exercises

1. Prove an analogue of Lemma 4.2 for the Gaussian quadrature on the Chebyshev–Gauss–Lobatto grid (4.18). Namely, let $\hat P_n(x,f) = \sum_{k=0}^{n} \hat b_k T_k(x)$ be the corresponding interpolating polynomial (constructed in Section 3.2.6). Show that
$$\int_{-1}^{1} \frac{\hat P_n(x,f)}{\sqrt{1 - x^2}}\,dx = \frac{\pi}{n} \sum_{m=0}^{n} \beta_m f(x_m),$$
where $\beta_0 = \beta_n = 1/2$ and $\beta_m = 1$ for $m = 1, 2, \ldots, n-1$.

4.3 Improper Integrals. Combination of Numerical and Analytical Methods
Even for the simplest improper integrals, a direct application of the quadrature formulae may encounter serious difficulties. For example, the trapezoidal rule (4.4):
$$\int_a^b f(x)\,dx \approx \frac{b-a}{n}\left( \frac{f_0}{2} + f_1 + \cdots + f_{n-1} + \frac{f_n}{2} \right)$$
will fail for the case $a = 0$, $b = 1$, and $f(x) = \cos x/\sqrt{x}$, because $f(x)$ has a singularity at $x = 0$ and consequently, $f_0 = f(0)$ is not defined. Likewise, the Simpson formula (4.13) will fail. For the Gaussian quadrature (4.23), the situation may seem a little better at first glance, because the Chebyshev–Gauss nodes (4.17) do not include the endpoints of the interval. However, the unboundedness of the function and its derivatives will still prevent one from obtaining any reasonable error estimates. At the same time, the integral itself, $\int_0^1 \frac{\cos x}{\sqrt{x}}\,dx$, obviously exists, and procedures for its efficient numerical approximation need to be developed.

To address the difficulties that arise when computing the values of improper integrals, it is natural to try and employ a combination of analytical and numerical techniques. The role of the analytical part is to reduce the original problem to a new problem that would only require one to evaluate the integral of a smooth and bounded function. The latter can then be done on a computer with the help of a quadrature formula. In the previous example, we can first use integration by parts:
$$\int_0^1 \frac{\cos x}{\sqrt{x}}\,dx = \cos x \,\big(2\sqrt{x}\big)\Big|_0^1 + \int_0^1 2\sqrt{x}\,\sin x\,dx,$$
and subsequently approximate the integral on the right-hand side to a desired accuracy using any of the quadrature formulae introduced in Section 4.1 or 4.2.

An alternative approach is based on splitting the original integral into two:
$$\int_0^1 \frac{\cos x}{\sqrt{x}}\,dx = \int_0^c \frac{\cos x}{\sqrt{x}}\,dx + \int_c^1 \frac{\cos x}{\sqrt{x}}\,dx, \qquad (4.28)$$
where $c > 0$ can be considered arbitrary as of yet. To evaluate the first integral on the right-hand side of equality (4.28), we can use the Taylor expansion:
$$\cos x = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots.$$
Then we have:
$$\int_0^c \frac{\cos x}{\sqrt{x}}\,dx = \int_0^c \frac{dx}{\sqrt{x}} - \frac{1}{2!}\int_0^c x^{3/2}\,dx + \frac{1}{4!}\int_0^c x^{7/2}\,dx - \frac{1}{6!}\int_0^c x^{11/2}\,dx + \cdots, \qquad (4.29)$$
and every integral on the right-hand side of equality (4.29) can be evaluated explicitly using the fundamental theorem of calculus.

Let the tolerance $\varepsilon > 0$ be given, and assume that the overall integral needs to be approximated numerically with an error that may not exceed $\varepsilon$. It is then sufficient to approximate each of the two integrals on the right-hand side of (4.28) with the accuracy $\varepsilon/2$. As far as the first integral is concerned, this can be achieved by taking only a few leading terms in the sum on the right-hand side of formula (4.29). The precise number of terms to be taken will, of course, depend on $\varepsilon$ and on $c$. In doing so, the value of $c$ should not be taken too large, say, $c = 1$, because in this case even if $\varepsilon$ is not very small, the number of terms to be evaluated on the right-hand side of formula (4.29) will still be substantial. On the other hand, the value of $c$ should not be taken too small either, say, $c \ll 1$, because in this case the first integral on the right-hand side of (4.28), $\int_0^c \frac{\cos x}{\sqrt{x}}\,dx$, can be evaluated very efficiently, but the integrand of the second integral $\int_c^1 \frac{\cos x}{\sqrt{x}}\,dx$ will have large derivatives. This will necessitate taking large values of $n$ (grid dimension) for evaluating this integral with accuracy $\varepsilon/2$, say, using the trapezoidal rule or the Simpson formula.

A more universal and generally more efficient technique for the numerical computation of improper integrals is based on regularization, i.e., on isolation of the singularity. Unlike the previous approach, it does not require finding a fine balance between too small and too large values of $c$. Assume that we need to compute the integral
$$\int_0^1 \frac{f(x)}{\sqrt{x}}\,dx,$$
where the function $f(x)$ is smooth and bounded. Regularization consists of recasting this integral in an equivalent form:
$$\int_0^1 \frac{f(x)}{\sqrt{x}}\,dx = \int_0^1 \frac{f(x) - \varphi(x)}{\sqrt{x}}\,dx + \int_0^1 \frac{\varphi(x)}{\sqrt{x}}\,dx. \qquad (4.30)$$
In doing so, the auxiliary function $\varphi(x)$ is to be chosen so as to remove the singularity from the first integral on the right-hand side of (4.30) and enable its efficient and accurate approximation using, e.g., the Simpson formula. The second integral on the right-hand side of (4.30) will still contain a singularity, but we select $\varphi(x)$ so that this integral can be computed analytically. In our particular case, this goal will be achieved if $\varphi(x)$ is taken in the form of just two leading terms of the Taylor expansion of $f(x)$ around $x = 0$: $\varphi(x) = f(0) + f'(0)x$. Substituting this expression into formula (4.30), we obtain:
$$\int_0^1 \frac{f(x)}{\sqrt{x}}\,dx = \int_0^1 \frac{f(x) - f(0) - f'(0)x}{\sqrt{x}}\,dx + f(0)\int_0^1 \frac{dx}{\sqrt{x}} + f'(0)\int_0^1 \sqrt{x}\,dx.$$
Returning to our original example with $f(x) = \cos x$, we arrive at the following expression for the integral:
$$\int_0^1 \frac{\cos x}{\sqrt{x}}\,dx = \int_0^1 \frac{\cos x - 1}{\sqrt{x}}\,dx + \int_0^1 \frac{dx}{\sqrt{x}}.$$
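The regularized computation can be checked numerically. The sketch below (our code; the Simpson helper simply repeats formula (4.13)) compares it against a reference value obtained from the substitution $x = t^2$ discussed shortly:

```python
import math

def simpson(f, a, b, n):
    assert n % 2 == 0
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, n))
    return s * h / 3.0

# Reference via x = t**2: the integral equals 2 * int_0^1 cos(t**2) dt, whose
# integrand is smooth, so a fine Simpson grid converges very quickly.
ref = simpson(lambda t: 2.0 * math.cos(t * t), 0.0, 1.0, 1000)

# Regularized form: (cos x - 1)/sqrt(x) is bounded (about -x**1.5/2 near 0),
# and the split-off singular part integrates to exactly 2.
def g(x):
    return (math.cos(x) - 1.0) / math.sqrt(x) if x > 0.0 else 0.0

approx = simpson(g, 0.0, 1.0, 4) + 2.0   # deliberately coarse grid, h = 1/4
err = abs(approx - ref)
```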
The second integral on the right-hand side is equal to 2, while for the evaluation of the first integral by the Simpson formula, say, with the accuracy $\varepsilon = 10^{-3}$, taking a fairly coarse grid $n = 4$, i.e., $h = 1/4$, already appears sufficient.

Sometimes we can also use a change of variable for removing the singularity. For instance, setting $x = t^2$ in the previous example yields:
$$\int_0^1 \frac{\cos x}{\sqrt{x}}\,dx = \int_0^1 \frac{\cos t^2}{t}\, 2t\,dt = 2\int_0^1 \cos t^2\,dt,$$
and the resulting integral no longer contains a singularity.

Altogether, a key to success is a sensible and adequate partition of the overall task into the analytical and numerical parts. The analytic derivations and transformations are to be performed by a human researcher, whereas routine number crunching is to be performed by a computer. This statement basically applies to solving any problem with the help of a computer.

Exercises
1. Propose algorithms for the approximate computation of the following improper integrals:
$$\int_0^1 \frac{\sin x}{x^{3/2}}\,dx, \qquad \int_0^{10} \frac{\ln(1+x)}{x^{3/2}}\,dx, \qquad \int_0^{\pi/2} \frac{\cos x}{\sqrt{\pi/2 - x}}\,dx.$$

4.4 Multiple Integrals
Computation of multiple integrals:
$$I^{(m)} = I^{(m)}(f,\Omega) = \int \cdots \int_{\Omega} f(x_1, \ldots, x_m)\,dx_1 \cdots dx_m$$
over a given domain $\Omega$ of the space $\mathbb{R}^m$ of independent variables $x_1, \ldots, x_m$ can be reduced to computing integrals of the type $\int_a^b g(x)\,dx$ using quadrature formulae that we have studied previously. However, the complexity of the corresponding algorithms and the number of arithmetic operations required for obtaining the answer with the prescribed tolerance $\varepsilon > 0$ will rapidly grow as the dimension $m$ of the space increases. This is true even in the simplest case when the domain of integration is a cube: $\Omega = \{ x = (x_1, \ldots, x_m) \mid |x_j| \le 1,\ j = 1, 2, \ldots, m \}$, and if $\Omega$ has a more general shape then the problem gets considerably exacerbated.

In the current section, we only provide several simple examples of how the problem of numerical computation of multiple integrals can be approached. For more detail, we refer the reader to [DR84, Chapter 5].
4.4.1 Repeated Integrals and Quadrature Formulae

We will illustrate the situation using a two-dimensional setup, $m = 2$. Consider the integral:
$$I^{(2)} = \iint_{\Omega} f(x,y)\,dx\,dy$$
over the domain $\Omega$ of the Cartesian plane, which is schematically shown in Figure 4.1 (FIGURE 4.1: Domain of integration). Assume that this domain lies in between the graphs of two functions of $x$: $\varphi_1(x) \le y \le \varphi_2(x)$ for $a \le x \le b$. Then, clearly,
$$I^{(2)} = \iint_{\Omega} f(x,y)\,dx\,dy = \int_a^b F(x)\,dx,$$
where
$$F(x) = \int_{\varphi_1(x)}^{\varphi_2(x)} f(x,y)\,dy, \qquad a \le x \le b.$$
The integral $I^{(2)} = \int_a^b F(x)\,dx$ can be approximated with the help of a quadrature formula. For simplicity, let us use the trapezoidal rule:
$$I^{(2)} = \int_a^b F(x)\,dx \approx \frac{b-a}{n}\left( \frac{F(x_0)}{2} + F(x_1) + \cdots + F(x_{n-1}) + \frac{F(x_n)}{2} \right), \quad x_j = a + j\,\frac{b-a}{n}, \quad j = 0, 1, \ldots, n. \qquad (4.31)$$
In their own turn, the expressions
$$F(x_j) = \int_{\varphi_1(x_j)}^{\varphi_2(x_j)} f(x_j, y)\,dy, \qquad j = 0, 1, \ldots, n,$$
that appear in formula (4.31) can also be computed using the trapezoidal rule:
$$F(x_j) \approx \frac{\varphi_2(x_j) - \varphi_1(x_j)}{n}\left( \frac{f(x_j, y_0^{(j)})}{2} + f(x_j, y_1^{(j)}) + \cdots + f(x_j, y_{n-1}^{(j)}) + \frac{f(x_j, y_n^{(j)})}{2} \right), \quad y_i^{(j)} = \varphi_1(x_j) + i\,\frac{\varphi_2(x_j) - \varphi_1(x_j)}{n}. \qquad (4.32)$$
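Formulae (4.31) and (4.32) amount to nesting two one-dimensional trapezoidal rules. A small illustration (our sketch, using the same number $n$ of nodes in each direction):

```python
def trap(f, a, b, n):
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

def repeated_trap(f, a, b, phi1, phi2, n):
    # Outer rule (4.31) in x; inner rule (4.32) in y on every slice x = x_j.
    return trap(lambda x: trap(lambda y: f(x, y), phi1(x), phi2(x), n), a, b, n)

# Triangle 0 <= y <= x, 0 <= x <= 1 with f = 1: area 1/2, reproduced exactly
# because the trapezoidal rule integrates linear functions without error.
area = repeated_trap(lambda x, y: 1.0, 0.0, 1.0, lambda x: 0.0, lambda x: x, 8)

# Unit square with f = x**2 + y**2: exact value 2/3; the error decays as O(n**-2).
val = repeated_trap(lambda x, y: x * x + y * y, 0.0, 1.0,
                    lambda x: 0.0, lambda x: 1.0, 32)
```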
In doing so, the approximation error $\varepsilon = \varepsilon(n)$ that arises when computing the integral $I^{(2)}$ by means of formulae (4.31) and (4.32) decreases as the number $n$ increases. At the same time, the number of arithmetic operations increases as $\mathscr{O}(n^2)$. If the integrand $f = f(x,y)$ has continuous partial derivatives up to the second order on $\Omega$, and if, in addition, the functions $\varphi_1(x)$ and $\varphi_2(x)$ are sufficiently smooth, then the properties of the one-dimensional trapezoidal rule established in Section 4.1.2 guarantee convergence of the two-dimensional quadrature (4.31), (4.32) with second order with respect to $n$: $\varepsilon(n) = \mathscr{O}(n^{-2})$.

To summarize, the foregoing approach is based on the transition from a multiple integral to the repeated integral. The latter is subsequently computed by the repeated application of a one-dimensional quadrature formula. This technique can be extended to the general case of the integrals $I^{(m)}$, $m > 2$, provided that the domain $\Omega$ is specified in a sufficiently convenient way. In doing so, the number of arithmetic operations needed for the implementation of the quadrature formulae will be $\mathscr{O}(n^m)$, where $n$ is the number of nodes in one coordinate direction. If the function $f(x_1, \ldots, x_m)$ has continuous partial derivatives of order two, the boundary of the domain $\Omega$ is sufficiently smooth, and the trapezoidal rule (Section 4.1.2) is used in the capacity of the one-dimensional quadrature formula, then the rate of convergence will be quadratic with respect to $n$: $\varepsilon(n) = \mathscr{O}(n^{-2})$. In practice, if the tolerance $\varepsilon > 0$ is specified ahead of time, then the dimension $n$ is chosen experimentally, by starting the computation with some $n_0$ and then refining the grid, say, by doubling the number of nodes (in each direction) several times until the answer no longer changes within the prescribed tolerance.

4.4.2 The Use of Coordinate Transformations
Let us analyze an elementary example that shows how a considerable simplification of the original formulation can be achieved by carefully taking into account the specific structure of the problem. Suppose that we need to compute the double integral $\iint_{\Omega} f(x,y)\,dx\,dy$, where the domain $\Omega$ is a disk: $\Omega = \{(x,y) \mid x^2 + y^2 \le R^2\}$, and the function $f(x,y)$ does not, in fact, depend on the two individual variables $x$ and $y$, but rather depends only on $x^2 + y^2 = r^2$, so that $f(x,y) \equiv f(r)$. Then we can write:
$$I^{(2)} = \iint_{\Omega} f(x,y)\,dx\,dy = 2\pi \int_0^R f(r)\,r\,dr,$$
and the problem of approximately evaluating a double integral is reduced to the problem of evaluating a conventional definite integral.

4.4.3 The Notion of Monte Carlo Methods
The foregoing reduction of multiple integrals to repeated integrals may become problematic already for $m = 2$ if the domain $\Omega$ has a complicated shape. We therefore introduce a completely different approach to the approximate computation of multiple integrals.

Assume for simplicity that the domain $\Omega$ lies inside the unit square $D = \{(x,y) \mid 0 \le x \le 1,\ 0 \le y \le 1\}$. The function $f = f(x,y)$ is originally defined on $\Omega$, and we extend it into $D \setminus \Omega$ by setting $f(x,y) = 0$ for $(x,y) \in D \setminus \Omega$. Suppose that there is a random number generator that produces numbers uniformly distributed across the interval $[0,1]$. This means that the probability for a given number to fall into the interval $[\alpha_1, \alpha_2]$, where $0 \le \alpha_1 < \alpha_2 \le 1$, is equal to $\alpha_2 - \alpha_1$. Using this generator, let us obtain a sequence of random pairs $P_k = (x_k, y_k)$, $k = 1, 2, \ldots$, and then construct the numerical sequence:
$$S_k = \frac{1}{k} \sum_{j=1}^{k} f(x_j, y_j).$$
It seems natural to expect that for a given £ > 0, the probability that the error [/( 2) Sk i is less than £ : will approach one as the number of tests k increases. This statement can, in fact, be put into the framework of rigorous theorems. The meaning of the Monte Carlo method becomes particularly apparent if = I on n. Then the integral 1(2 ) is the area of n, and Sk is the ratio of the number of those points 1 j :S k, that fall into n to the overall number k of these points. It is clear that this ratio is approximately equal to the area of n, because the overall area of the square D is equal to one. The Monte Carlo method is attractive because the corresponding numerical algo rithms have a very simple logical structure. It does not become more complicated when the dimension m of the space increases, or when the shape of the domain n becomes more elaborate. However, in the case of simple domains and special narrow classes of functions (smooth) the Monte Carlo method typically appears less efficient than the methods that take explicit advantage of the specifics of a given problem. The Monte Carlo method has many other applications, see, e.g., [LiuOl , BH02, CH06]; an interesting recent application area is in security pricing [Gla04].
f(x,y)
Pj , :S
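As a quick illustration of this simple logical structure, here is a minimal sketch (not from the book; the quarter-disk region and all names are illustrative assumptions) that computes $S_k$ for $f \equiv 1$ on $\Omega$, i.e., estimates an area:

```python
import random

def monte_carlo_area(indicator, k, seed=0):
    """Estimate the area of a region Omega inside the unit square D = [0,1]^2.

    `indicator(x, y)` plays the role of f extended by zero outside Omega:
    it returns 1.0 inside the region and 0.0 elsewhere.  The return value
    is S_k = (1/k) * sum_{j=1}^{k} f(x_j, y_j).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        x, y = rng.random(), rng.random()  # uniform random pair P_j in D
        total += indicator(x, y)
    return total / k

# Omega = quarter of the unit disk; its exact area is pi/4 = 0.785398...
in_quarter_disk = lambda x, y: 1.0 if x * x + y * y <= 1.0 else 0.0
print(monte_carlo_area(in_quarter_disk, 100_000))
```

With $k = 10^5$ samples the statistical error is on the order of $1/\sqrt{k} \approx 0.003$, which matches the slow but dimension-independent convergence mentioned above.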
Part II Systems of Scalar Equations
In mathematics, plain numbers are often referred to as scalars; they may be real or complex. Equations or systems of equations with respect to one or several scalar unknowns appear in many branches of mathematics, as well as in applications. Accordingly, computational methods for solving these equations and systems are at the core of numerical analysis. Along with the scalar equations, one frequently encounters functional equations and systems, in which the unknown quantities are functions rather than plain numbers. A particular class of functional equations are differential equations that contain derivatives of the unknown functions. Numerical methods for solving differential equations are discussed in Part III of the book. In the general category of scalar equations and systems one can identify a very important subclass of linear algebraic equations and systems. Numerical methods for solving various systems of this type are discussed in Chapters 5, 6, and 7. In Chapter 8, we address numerical solution of nonlinear equations and systems. Before we actually discuss the solution of linear and nonlinear equations, let us mention some other books that address similar subjects: [Hen64, IK66, CdB80, Atk89, PT96, QSS00, Sch02, DB03]. More specific references will be given later.
Chapter 5. Systems of Linear Algebraic Equations: Direct Methods
Systems of linear algebraic equations (or simply linear systems) arise in numerous applications; often, these systems appear to have high dimension. A linear system of order (dimension) $n$ can be written as follows:

$$a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n = f_1,$$
$$\dots$$
$$a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n = f_n,$$

where $a_{ij}$, $i, j = 1, 2, \dots, n$, are the coefficients, $f_i$, $i = 1, 2, \dots, n$, are the right-hand sides (both assumed given), and $x_j$, $j = 1, 2, \dots, n$, are the unknown quantities. A more general and often more convenient form of representing a linear system is its operator form, see Section 5.1.2. In this chapter, we first outline different forms of consistent linear systems and illustrate those with examples (Section 5.1), and then provide a concise yet rigorous review of the material from linear algebra (Section 5.2). Next, we introduce the notion of conditioning of a linear operator (matrix) and that of the corresponding system of linear algebraic equations (Section 5.3), and subsequently describe several direct methods for solving linear systems (Sections 5.4, 5.6, and 5.7). We recall that a method is referred to as direct if it produces the exact solution of the system after a finite number of arithmetic operations.¹ (Exactness is interpreted within the machine precision if the method is implemented on a computer.) Also in this chapter we discuss what types of numerical algorithms (and why) are better suited for solving particular classes of linear systems. Additionally, in Section 5.5 we establish a relation between systems of linear algebraic equations and the problem of minimization of multivariable quadratic functions. Before proceeding any further, let us note that the well-known Cramer's rule is never used for computing solutions of linear systems, even for moderate dimensions n > 5. The reason is that it requires a large amount of arithmetic operations and may also develop computational instabilities.
¹An alternative to direct methods are iterative methods, analyzed in Chapter 6.
A Theoretical Introduction to Numerical Analysis
5.1 Different Forms of Consistent Linear Systems

A linear equation with respect to the unknowns $z_1, z_2, \dots, z_n$ is an equation of the form

$$a_1 z_1 + a_2 z_2 + \dots + a_n z_n = f,$$

where $a_1, a_2, \dots, a_n$, and $f$ are some given numbers (constants).

5.1.1 Canonical Form of a Linear System

Let a system of $n$ linear algebraic equations be specified with respect to as many unknowns. This system can obviously be written in the following canonical form:
$$a_{i1}x_1 + a_{i2}x_2 + \dots + a_{in}x_n = f_i, \quad i = 1, 2, \dots, n. \quad (5.1)$$

Both the unknowns $x_j$ and the equations in system (5.1) are numbered by the consecutive integers $1, 2, \dots, n$. Accordingly, the coefficients $a_{ij}$ in front of the unknowns have double subscripts: $i, j = 1, 2, \dots, n$, where $i$ reflects the number of the equation and $j$ the number of the unknown. The right-hand side of equation number $i$ is denoted by $f_i$ for all $i = 1, 2, \dots, n$. From linear algebra we know that system (5.1) is consistent, i.e., has a solution, for any choice of the right-hand sides $f_i$ if and only if the corresponding homogeneous system² only has a trivial solution: $x_1 = x_2 = \dots = x_n = 0$. For the homogeneous system to have only the trivial solution, it is necessary and sufficient that the determinant of the system matrix $A$:

$$A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix} \quad (5.2)$$

differ from zero: $\det A \ne 0$. The condition $\det A \ne 0$ that guarantees consistency of system (5.1) for any right-hand side also implies uniqueness of the solution. Having introduced the matrix $A$ of (5.2), we can rewrite system (5.1) as follows:

$$\begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix} \quad (5.3)$$

or, using an alternative shorter matrix notation: $Ax = f$.

²Obtained by setting $f_i = 0$ for all $i = 1, 2, \dots, n$.
5.1.2 Operator Form
Let $\mathbb{R}^n$ be a linear space of dimension $n$ (see Section 5.2), $A$ be a linear operator, $A: \mathbb{R}^n \mapsto \mathbb{R}^n$, and $f \in \mathbb{R}^n$ be a given element of the space $\mathbb{R}^n$. Consider the equation:

$$Ax = f \quad (5.4)$$

with respect to the unknown $x \in \mathbb{R}^n$. It is known that this equation is solvable for an arbitrary $f \in \mathbb{R}^n$ if and only if the only solution of the corresponding homogeneous equation $Ax = 0$ is trivial: $x = 0 \in \mathbb{R}^n$.

Assume that the space $\mathbb{R}^n$ consists of the elements:

$$z = \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix}, \quad (5.5)$$

where $\{z_1, \dots, z_n\}$ are ordered sets of numbers. When the elements of the space $\mathbb{R}^n$ are given in the form (5.5), we say that they are specified by means of their components. From linear algebra we know that a unique matrix of type (5.2) can be associated with any linear operator $A$ in this case, so that for a given $x \in \mathbb{R}^n$ the mapping $A: \mathbb{R}^n \mapsto \mathbb{R}^n$ will be rendered by the formula $y = Ax$, i.e.,

$$\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. \quad (5.6)$$

Then equation (5.4) coincides with the matrix equation (5.3) or, equivalently, with the canonical form (5.1). In this perspective, equation (5.4) appears to be no more than a mere operator interpretation of system (5.1). However, the elements of the space $\mathbb{R}^n$ do not necessarily have to be specified as ordered sets of numbers (5.5). Likewise, the operator $A$ does not necessarily have to be originally defined in the form (5.6) either. Thus, the operator form (5.4) of a linear system may, generally speaking, differ from the canonical form (5.1). In the next section, we provide an example of an applied problem that naturally leads to a high-order system of linear algebraic equations in the operator form. This example will be used throughout the book on multiple occasions.

5.1.3 Finite-Difference Dirichlet Problem for the Poisson Equation
Let $D$ be a domain of square shape on the Cartesian plane: $D = \{(x,y) \mid 0 < x < 1,\ 0 < y < 1\}$. Assume that a Dirichlet problem for the Poisson equation:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y), \quad (x,y) \in D, \qquad u\big|_{\partial D} = 0, \quad (5.7)$$

needs to be solved on the domain $D$. In formula (5.7), $\partial D$ denotes the boundary of $D$, and $f = f(x,y)$ is a given right-hand side.

To solve problem (5.7) numerically, we introduce a positive integer $M$, specify $h = 1/M$, and construct a uniform Cartesian grid on the square $D$:

$$(x_{m_1}, y_{m_2}) = (m_1 h,\ m_2 h), \quad m_1, m_2 = 0, 1, \dots, M. \quad (5.8)$$

Instead of the continuous solution $u = u(x,y)$ to the original problem (5.7), we will rather be looking for its trace, or projection, $[u]_h$ onto the grid (5.8). To compute the values of this projection approximately, we replace the derivatives in the Poisson equation by the second order difference quotients at every interior node of the grid (see Section 9.2.1 for more detail), and thus obtain the following system of difference equations:

$$\Delta^{(h)} u^{(h)}\big|_{m_1,m_2} \equiv \frac{u_{m_1+1,m_2} - 2u_{m_1,m_2} + u_{m_1-1,m_2}}{h^2} + \frac{u_{m_1,m_2+1} - 2u_{m_1,m_2} + u_{m_1,m_2-1}}{h^2} = f_{m_1,m_2}, \quad m_1, m_2 = 1, 2, \dots, M-1. \quad (5.9)$$

The right-hand side of each equation (5.9) is defined as $f_{m_1,m_2} \equiv f(m_1 h, m_2 h)$. Clearly, the difference equation (5.9) is only valid at the interior nodes of the grid (5.8), see Figure 5.1(a). For every fixed pair of $m_1$ and $m_2$ inside the square $D$, it connects the values of the solution at the five points that are shown in Figure 5.1(b). These points are said to form the stencil of the finite-difference scheme (5.9).
[Figure 5.1: Finite-difference scheme for the Poisson equation. Panel (a) shows the grid (5.8) with its interior and boundary nodes; panel (b) shows the five-node stencil.]

At the boundary points of the grid (5.8), see Figure 5.1(a), we obviously cannot write the finite-difference equation (5.9). Instead, according to (5.7), we specify the
homogeneous Dirichlet boundary conditions:

$$u_{m_1,m_2} = 0 \quad \text{for} \quad m_1 = 0, M \quad \& \quad m_2 = 0, M. \quad (5.10)$$

Substituting expressions (5.10) into (5.9), we obtain a system of $(M-1)^2$ linear algebraic equations with respect to as many unknowns: $u_{m_1,m_2}$, $m_1, m_2 = 1, 2, \dots, M-1$. This system is a discretization of problem (5.7). Let us note that in general there are many different ways of constructing a discrete counterpart to problem (5.7). Its particular form (5.9), (5.10) introduced in this section has certain merits that we discuss later in Chapters 10 and 12. Here we only mention that in order to solve the Dirichlet problem (5.7) with high accuracy, i.e., in order to make the error between the exact solution $[u]_h$ and the approximate solution $u^{(h)}$ small, we should employ a sufficiently fine grid, or in other words, take a sufficiently large $M$. In this case, the linear system (5.9), (5.10) will be a system of high dimension.

We will now demonstrate how to convert system (5.9), (5.10) to the operator form (5.4). For that purpose, we introduce a linear space $U^{(h)} = \mathbb{R}^n$, $n \equiv (M-1)^2$, that will contain all real-valued functions $z_{m_1,m_2}$ defined on the grid $(m_1 h, m_2 h)$, $m_1, m_2 = 1, 2, \dots, M-1$. This grid is the interior subset of the grid (5.8), see Figure 5.1(a). Let us also introduce the operator $A \equiv -\Delta^{(h)}: U^{(h)} \mapsto U^{(h)}$ that will map a given grid function $u^{(h)} \in U^{(h)}$ onto some $v^{(h)} \in U^{(h)}$ according to the following formula:

$$v^{(h)}_{m_1,m_2} = -\Delta^{(h)} u^{(h)}\big|_{m_1,m_2} = -\frac{u_{m_1+1,m_2} - 2u_{m_1,m_2} + u_{m_1-1,m_2}}{h^2} - \frac{u_{m_1,m_2+1} - 2u_{m_1,m_2} + u_{m_1,m_2-1}}{h^2}. \quad (5.11)$$

In doing so, boundary conditions (5.10) are to be employed when using formula (5.11) for $m_1 = 1, M-1$ and $m_2 = 1, M-1$. The system of linear algebraic equations (5.9), (5.10) will then reduce to the operator form (5.4):

$$-\Delta^{(h)} u^{(h)} = f^{(h)}, \quad (5.12)$$

where $f^{(h)} = \{-f_{m_1,m_2}\} \in U^{(h)}$ is a given grid function, and $u^{(h)} \in U^{(h)}$ is the unknown grid function.
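The operator $-\Delta^{(h)}$ of (5.11) is straightforward to apply in code. The following sketch (an illustration assuming NumPy is available; it is not part of the book) evaluates $-\Delta^{(h)} u^{(h)}$ on the interior grid, with the homogeneous boundary conditions (5.10) supplying the zero values needed next to the boundary:

```python
import numpy as np

def minus_laplace_h(u, h):
    """Apply the operator A = -Delta^(h) of formula (5.11) to a grid function.

    `u` is an (M-1) x (M-1) array of interior values u_{m1,m2}; padding with
    zeros realizes the homogeneous Dirichlet conditions (5.10).
    """
    up = np.pad(u, 1)  # u = 0 on the boundary rows/columns of the grid (5.8)
    v = (-(up[2:, 1:-1] - 2.0 * up[1:-1, 1:-1] + up[:-2, 1:-1])
         - (up[1:-1, 2:] - 2.0 * up[1:-1, 1:-1] + up[1:-1, :-2])) / h**2
    return v

# Sanity check on u(x,y) = sin(pi x) sin(pi y), for which the continuous
# operator gives -Delta u = 2 pi^2 u; the discrete operator reproduces
# this up to the O(h^2) truncation error.
M = 32
h = 1.0 / M
x = np.arange(1, M) * h
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(np.pi * X) * np.sin(np.pi * Y)
err = np.max(np.abs(minus_laplace_h(u, h) - 2.0 * np.pi**2 * u))
print(err)  # small, and decreases roughly 4x each time M is doubled
```

The second-order accuracy visible here is exactly the property of the five-point stencil that is analyzed later in Chapters 10 and 12.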
REMARK 5.1. System (5.9), (5.10), as well as any other system of $n$ linear algebraic equations with $n$ unknowns, could be written in the canonical form (5.1). However, regardless of how it is done, i.e., how the equations and unknowns are numbered, the resulting system matrix (5.2) would still be inferior in its simplicity and readability to the specification of the system in its original form (5.9), (5.10) or in the equivalent operator form (5.12). □

REMARK 5.2. An abstract operator equation (5.4) can always be reduced to the canonical form (5.1). To do so, one needs to choose a basis $\{e_1, e_2, \dots, e_n\}$ in the space $\mathbb{R}^n$, expand the elements $x$ and $f$ of $\mathbb{R}^n$ with respect to this basis,
and identify these elements with the sets of their components so that:
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad f = \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix}.$$
1.
The mapping A : ]R2 ]R2 of the space ]R2 = {x I x = (X I ,X2 )} into itself is defined as follows. Consider a Cauchy problem for the system of two ordinary differential equations: >>
ZI
I 1=0 = X I ,
::2 1 1
=0
= X2 ·
The element y Ax ]R2 is defined as y = (Y I ,Y2 ) , YI = Z I (t ) I 1= 1 , Y2 = Z2 (t ) 1 . a) Prove that the operator A is linear, i.e., A( au {3 w) = aAu {3Aw.2 b) Write down the matrix of the operator A if the basis in the space ]R is chosen as follows: =
E
+
+
1= 1
c) The same as in Part b) of this exercise, but for the basis: d) Evaluatex=Aly, wherey = [�l ]. 5.2
Linear Spaces, Norms, and Operators
As we have seen, a solution of a linear system of order n can be assumed to belong to an ndimensional linear space. The same is true regarding the error between any
Systems of Linear Algebraic Equations: Direct Methods
1 25
approximate solution of such a system and its exact solution, which is another key object of our study. In this section, we will therefore provide some background about linear spaces and related concepts. The account of the material below is necessar ily brief and sketchy, and we refer the reader to the fundamental treatises on linear algebra, such as [Gan59] or [HJ85], for a more comprehensive treatment. We recall that the space lL of the elements x,y, Z, . . . is called linear if for any two elements x, y E lL their sum x + y is uniquely defined in lL, so that the operation of addition satisfies the following properties: 1. x + y = y +x (commutativity); 2. x + (y + z) = (x + y) + Z (associativity);
3.
There is a special element 0 E lL, such that \:Ix E lL : x + 0 = x (existence of zero);
4. For any x E lL there is another element denoted x, such that x + ( x) = 0 (existence of an opposite element, or inverse).
Besides, for any x E lL and any scalar number a the product ax is uniquely de fined in lL so that the operation of multiplication by a scalar satisfies the following properties:
I. a (f3x) = ( af3 )x; 2. I · x = x;
3.
(a + f3 )x = ax + f3x;
4. a (x +y) = ax + ay. Normally, the field of scalars associated with a given linear space lL can be either the field of all real numbers: a, f3 , E JR., or the field of all complex numbers: a, f3, . E C. Accordingly we refer to lL as to a real linear space or a complex linear space, respectively. A set of elements Z I , Z2 , ' " , Z/1 E lL is called linearly independent if the equality alZ I + a2Z2 + . . . + anZ/1 = 0 necessarily implies that a l = a2 = . . . = a/1 = O. Oth erwise, the set is called linearly dependent. The maximum number of elements in a linearly independent set in lL is called the dimension of the space. If the dimension is finite and equal to a positive integer n, then the linear space lL is also referred to as an ndimensional vector space. An example is the space of all finite sequences (vectors) composed of n real numbers called components; this space is denoted by JR./) . Similarly, the space of all complex vectors with n components is denoted by e". Any linearly independent set of n vectors in an ndimensional space forms a basis. To quantify the notion of the error for an approximate solution of a linear system, say in the space JR./1 or in the space e", we need to be able to measure the length of a vector. In linear algebra, one normally introduces the norm of a vector for that purpose. Hence let us recall the definition of a normed linear space, as well as that of the norm of a linear operator acting in this space. . .
. . .
5.2.1 Normed Spaces
A linear space $\mathbb{L}$ is called normed if a nonnegative number $\|x\|$ is associated with every element $x \in \mathbb{L}$, so that the following conditions (axioms) hold:

1. $\|x\| > 0$ if $x \ne 0$, and $\|x\| = 0$ if $x = 0$;
2. For any $x \in \mathbb{L}$ and for any scalar $\lambda$ (real or complex): $\|\lambda x\| = |\lambda| \, \|x\|$;
3. For any $x \in \mathbb{L}$ and $y \in \mathbb{L}$ the triangle inequality holds: $\|x + y\| \le \|x\| + \|y\|$.
Let us consider several examples. The spaces $\mathbb{R}^n$ and $\mathbb{C}^n$ are composed of the elements $x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$, where $x_j$, $j = 1, 2, \dots, n$, are real or complex numbers, respectively. It is possible to show that the functions $\|x\|_\infty$ and $\|x\|_1$ defined by the equalities:

$$\|x\|_\infty = \max_j |x_j| \quad (5.13)$$

and

$$\|x\|_1 = \sum_{j=1}^{n} |x_j| \quad (5.14)$$
satisfy the foregoing three axioms of the norm. They are called the maximum norm and the $l_1$ norm, respectively. Alternatively, the maximum norm (5.13) is also referred to as the $l_\infty$ norm or the $C$ norm, and the $l_1$ norm (5.14) is sometimes called the first norm or the Manhattan norm.

Another commonly used norm is based on the notion of a scalar product or, equivalently, inner product. A scalar product $(x,y)$ on the linear space $\mathbb{L}$ is a scalar function of a pair of vector arguments $x \in \mathbb{L}$ and $y \in \mathbb{L}$. This function is supposed to satisfy the following four axioms:³

1. $(x,y) = \overline{(y,x)}$;
2. For any scalar $\lambda$: $(\lambda x, y) = \lambda (x,y)$;
3. $(x + y, z) = (x,z) + (y,z)$;
4. $(x,x) \ge 0$, and $(x,x) > 0$ if $x \ne 0$.

³The overbar in axiom 1 and later in the text denotes complex conjugation.
Note that in the foregoing definition of the scalar product the linear space $\mathbb{L}$ is, generally speaking, assumed complex, i.e., the scalars $\lambda$ may be complex, and the inner product $(x,y)$ itself may be complex as well. The only exception is the inner product of a vector with itself, which is always real (axiom 4). If, however, we were to assume ahead of time that the space $\mathbb{L}$ was real, then axioms 2, 3, and 4 would remain unchanged, while axiom 1 would get simplified: $(x,y) = (y,x)$. In other words, the scalar product on a real linear space is commutative. The simplest example of a scalar product on the space $\mathbb{R}^n$ is given by

$$(x,y) = x_1 y_1 + x_2 y_2 + \dots + x_n y_n, \quad (5.15)$$

whereas for the complex space $\mathbb{C}^n$ we can choose:

$$(x,y) = x_1 \bar{y}_1 + x_2 \bar{y}_2 + \dots + x_n \bar{y}_n. \quad (5.16)$$

Recall that a real linear space equipped with an inner product is called Euclidean, whereas a complex space with an inner product is called unitary. One can show that the function:

$$\|x\|_2 = (x,x)^{1/2} \quad (5.17)$$

provides a norm on both $\mathbb{R}^n$ and $\mathbb{C}^n$. In the case of a real space this norm is called Euclidean, and in the case of a complex space it is called Hermitian. In the literature, the norm defined by formula (5.17) is also referred to as the $l_2$ norm.

In the previous examples, we have employed a standard vector form $x = [x_1, x_2, \dots, x_n]^T$ for the elements of the space $\mathbb{L}$ ($\mathbb{L} = \mathbb{R}^n$ or $\mathbb{L} = \mathbb{C}^n$). In much the same way, norms can be introduced on linear spaces without having to enumerate consecutively all the components of every vector. Consider, for example, the space $U^{(h)} = \mathbb{R}^n$, $n = (M-1)^2$, of the grid functions $u^{(h)} = \{u_{m_1,m_2}\}$, $m_1, m_2 = 1, 2, \dots, M-1$, that we have introduced and exploited in Section 5.1.3 for the construction of a finite-difference scheme for the Poisson equation. The maximum norm and the $l_1$ norm can be introduced on this space as follows:

$$\|u^{(h)}\|_\infty = \max_{m_1,m_2} |u_{m_1,m_2}|, \quad (5.18)$$
$$\|u^{(h)}\|_1 = \sum_{m_1,m_2=1}^{M-1} |u_{m_1,m_2}|. \quad (5.19)$$

Moreover, a scalar product $(u^{(h)}, v^{(h)})$ can be introduced in the real linear space $U^{(h)}$ according to the formula [cf. formula (5.15)]:

$$(u^{(h)}, v^{(h)}) = \sum_{m_1,m_2=1}^{M-1} u_{m_1,m_2} v_{m_1,m_2}. \quad (5.20)$$

Then the corresponding Euclidean norm is defined as follows:

$$\|u^{(h)}\|_2 = (u^{(h)}, u^{(h)})^{1/2}. \quad (5.21)$$
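A brief sketch of how the norms (5.18)-(5.21) might be computed for a grid function stored as a two-dimensional array (an illustration, not from the book; the sample grid function is an arbitrary assumption):

```python
import numpy as np

def norm_inf_h(u):
    """Maximum norm (5.18) of a grid function u = {u_{m1,m2}}."""
    return np.max(np.abs(u))

def norm_1_h(u):
    """l1 norm (5.19): sum of absolute values over the interior nodes."""
    return np.sum(np.abs(u))

def inner_h(u, v):
    """Scalar product (5.20) on the real space U^(h)."""
    return np.sum(u * v)

def norm_2_h(u):
    """Euclidean norm (5.21) generated by the product (5.20)."""
    return np.sqrt(inner_h(u, u))

# A sample grid function on the 3x3 interior grid (i.e., M = 4).
u = np.array([[1.0, -2.0, 0.0],
              [0.5,  1.0, 0.0],
              [0.0,  0.0, 2.0]])
print(norm_inf_h(u), norm_1_h(u), norm_2_h(u))
```

Note that no consecutive enumeration of the $(M-1)^2$ components is ever needed, which mirrors the point made in the text.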
In general, there are many different ways of defining a scalar product on a real or complex vector space $\mathbb{L}$, besides using the simplest expressions (5.15) and (5.16). Any appropriate scalar product $(x,y)$ on $\mathbb{L}$ generates the corresponding Euclidean or Hermitian norm according to formula (5.17). It turns out that one can, in fact, describe all scalar products that satisfy axioms 1-4 on a given space and consequently, all Euclidean (Hermitian) norms. For that purpose, one first needs to fix a particular scalar product $(x,y)$ on $\mathbb{L}$, and then consider the self-adjoint linear operators with respect to this product. Recall that the operator $B^*: \mathbb{L} \mapsto \mathbb{L}$ is called adjoint to a given linear operator $B: \mathbb{L} \mapsto \mathbb{L}$ if the following identity holds for any $x \in \mathbb{L}$ and $y \in \mathbb{L}$:

$$(Bx, y) = (x, B^*y).$$

It is known that given a particular inner product on $\mathbb{L}$, one and only one adjoint operator $B^*$ exists for every $B: \mathbb{L} \mapsto \mathbb{L}$. If a particular basis is selected in $\mathbb{L}$ that enables the representation of the operators by matrices (see Remark 5.2), then for a real space and the inner product (5.15), the adjoint operator is given by the transposed matrix $B^T$ that has the entries $b^T_{ij} = b_{ji}$, $i, j = 1, 2, \dots, n$, where $b_{ij}$ are the entries of $B$. For a complex space $\mathbb{L}$ and the inner product (5.16), the adjoint operator is represented by the conjugate matrix $B^*$ with the entries $b^*_{ij} = \bar{b}_{ji}$, $i, j = 1, 2, \dots, n$. The operator $B$ is called self-adjoint if $B = B^*$. Again, in the case of a real space with the scalar product (5.15) the matrix of a self-adjoint operator is symmetric: $b_{ij} = b_{ji}$, $i, j = 1, 2, \dots, n$; whereas in the case of a complex space with the scalar product (5.16) the matrix of a self-adjoint operator is Hermitian: $b_{ij} = \bar{b}_{ji}$, $i, j = 1, 2, \dots, n$.

The operator $B: \mathbb{L} \mapsto \mathbb{L}$ is called positive definite and denoted $B > 0$ if $(Bx, x) > 0$ for every $x \in \mathbb{L}$, $x \ne 0$. It is known that if $B = B^* > 0$, then the expression

$$[x,y]_B \equiv (Bx, y) \quad (5.22)$$

satisfies all four axioms of the scalar product. Moreover, it turns out that any legitimate scalar product on $\mathbb{L}$ can be obtained using formula (5.22) with an appropriate choice of a self-adjoint positive definite operator $B = B^* > 0$. Consequently, any operator of this type ($B = B^* > 0$) generates a Euclidean (Hermitian) norm by the formula:

$$\|x\|_B = ([x,x]_B)^{1/2}. \quad (5.23)$$

In other words, there is a one-to-one correspondence between Euclidean norms and symmetric positive definite (SPD) matrices in the real case, and between Hermitian norms and Hermitian positive definite matrices in the complex case. In particular, the Euclidean norm (5.17) generated by the inner product (5.15) can be represented in the form (5.23) if we choose the identity operator for $B$: $B = I$:

$$\|x\|_2 = \|x\|_I = ([x,x]_I)^{1/2} = (x,x)^{1/2}.$$
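The correspondence between SPD matrices and Euclidean norms is easy to experiment with numerically. The sketch below (an illustration, not from the book; the matrix B is an arbitrary SPD example chosen as an assumption) implements $[x,y]_B$ of (5.22) and $\|x\|_B$ of (5.23):

```python
import numpy as np

def b_inner(B, x, y):
    """[x, y]_B = (Bx, y) for a symmetric positive definite B, cf. (5.22)."""
    return np.dot(B @ x, y)

def b_norm(B, x):
    """||x||_B = ([x, x]_B)^{1/2}, the Euclidean norm generated by B, cf. (5.23)."""
    return np.sqrt(b_inner(B, x, x))

# A sample SPD matrix (any SPD B would do).
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, -1.0])
y = np.array([0.5, 2.0])

# With B = I the construction falls back to the ordinary l2 norm (5.17).
print(b_norm(np.eye(2), x), np.linalg.norm(x))  # both sqrt(2)
print(b_norm(B, x))                             # the "energy" norm of x
# The triangle inequality ||x + y||_B <= ||x||_B + ||y||_B holds:
print(b_norm(B, x + y) <= b_norm(B, x) + b_norm(B, y))
```

Such $B$-weighted ("energy") norms reappear when analyzing iterative methods in Chapter 6.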
5.2.2 Norm of a Linear Operator
The operator $A: \mathbb{L} \mapsto \mathbb{L}$ is called linear if for any pair $x, y \in \mathbb{L}$ and any two scalars $\alpha$ and $\beta$ it satisfies: $A(\alpha x + \beta y) = \alpha Ax + \beta Ay$. We introduce the norm $\|A\|$ of a linear operator $A: \mathbb{L} \mapsto \mathbb{L}$ as follows:

$$\|A\| = \max_{x \in \mathbb{L},\, x \ne 0} \frac{\|Ax\|}{\|x\|}, \quad (5.24)$$

where $\|Ax\|$ and $\|x\|$ are norms of the elements $Ax$ and $x$, respectively, from the linear normed space $\mathbb{L}$. The norm $\|A\|$ given by (5.24) is said to be induced by the vector norm chosen in the space $\mathbb{L}$. According to formula (5.24), the norm $\|A\|$ is the coefficient of stretching of that particular vector $x' \in \mathbb{L}$ which is stretched at least as much as any other vector $x \in \mathbb{L}$ under the transformation $A$.

As before, a square matrix

$$A = \begin{bmatrix} a_{11} & \dots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nn} \end{bmatrix}$$

will be identified with an operator $A: \mathbb{L} \mapsto \mathbb{L}$ that acts in the space $\mathbb{L} = \mathbb{R}^n$ or $\mathbb{L} = \mathbb{C}^n$. Recall, either space consists of the elements (vectors) $x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$, where $x_j$, $j = 1, 2, \dots, n$, are real or complex numbers, respectively. The operator $A$ maps a given $x \in \mathbb{L}$ onto $y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$ according to the formula

$$y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, 2, \dots, n, \quad (5.25)$$

where $a_{ij}$ are entries of the matrix $A$. Therefore, the norm of the matrix $A$ can naturally be introduced according to (5.24) as the norm of the linear operator defined by formula (5.25).

THEOREM 5.1
The norms of the matrix $A = \{a_{ij}\}$ induced by the vector norms $\|x\|_\infty = \max_j |x_j|$ of (5.13) and $\|x\|_1 = \sum_j |x_j|$ of (5.14), respectively, in the space $\mathbb{L} = \mathbb{R}^n$ or $\mathbb{L} = \mathbb{C}^n$, are given by:

$$\|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}| \quad (5.26)$$

and

$$\|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}|. \quad (5.27)$$
1 30
Accordingly, the norm I IA 11 = of (5.26) is known as the maximum row norm, and the norm II A ll l of (5.27) is known as the maximum column norm of the matrix A. The proof of Theorem 5.1 is fairly easy, and we leave it to the reader as an exercise. THEOREM 5.2 Let (x,y) denote a scalar product in an ndimensional Euclidean space IL, and let Il x ll = (x,x) l /2 be the corresponding Euclidean norm. Let A be a selfadjoint operator on IL with respect to the chosen inner product: Vx,y E lL : (Ax,y) = (x,Ay) . Then for the induced operator norm (formula (5.24)) we have:
IIAx l 1 max = max I A) I , II A II = xEIRn, xiO } IIxII
(5.28)
where Aj , j = 1 , 2, . . . , n are eigenvalues of the operator (matrix) A .
The set of all eigenvalues {Aj} of the matrix (operator) A is known as its spectrum. The quantity maxj IAjl on the righthand side of formula (5.28) is called the spectral radius of the matrix (operator) A and is denoted p (A) . PRO O F From linear algebra it is known that when A = A * , then an orthonormal basis exists in the space IL:
(5.29) that consists of the eigenvectors of the operator A:
Take an arbitrary x E IL and expand it with respect to the basis
Then,
(5.29) :
for the vector Ax E IL we can write:
. ( ).
As the basIS
5.29
IS orthonormal:
of the scalar product (axioms
( ) { I,
2 and 3)
i
j, we use the linearity . 0, 1 i ), and obtain:
ej, ej
=
=
.
Consequently,
IIAx l1 ::; m�x IAj l [xT + x� + . . . + x�l � }
=
max I Aj l ll x l l · }
Therefore, for any $x \in \mathbb{L}$, $x \ne 0$, the following inequality holds:

$$\frac{\|Ax\|}{\|x\|} \le \max_j |\lambda_j|,$$

while for $x = e_k$, where $e_k$ is the eigenvector of $A$ that corresponds to the maximum eigenvalue, $\max_j |\lambda_j| = |\lambda_k|$, the previous inequality transforms into a precise equality:

$$\frac{\|Ae_k\|}{\|e_k\|} = |\lambda_k| = \max_j |\lambda_j|.$$

As such, formula (5.28) is indeed true. □
Note that if, for example, we were to take $\mathbb{L} = \mathbb{R}^n$ in Theorem 5.2, then the scalar product $(x,y)$ does not necessarily have to be the simplest product (5.15). It can, in fact, be any product of type (5.22) on $\mathbb{L}$, with the corresponding Euclidean norm given by formula (5.23). Also note that the argument given when proving Theorem 5.2 easily extends to a unitary space $\mathbb{L}$, for example, $\mathbb{L} = \mathbb{C}^n$. Finally, the result of Theorem 5.2 can be generalized to the case of the so-called normal matrices $A$, defined by $AA^* = A^*A$. A matrix is normal if and only if there is an orthonormal basis composed of its eigenvectors, see [HJ85, Chapter 2].

Exercises
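A numerical illustration of Theorem 5.2 and of the remark on non-normal matrices (a sketch, not from the book; both sample matrices are assumptions chosen purely for illustration):

```python
import numpy as np

# For a symmetric (self-adjoint) matrix, Theorem 5.2 gives
# ||A||_2 = max_j |lambda_j| = rho(A).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
rho = np.max(np.abs(np.linalg.eigvals(A)))
print(np.linalg.norm(A, 2), rho)  # both equal 3.0

# For a non-normal matrix the induced norm can greatly exceed the
# spectral radius (cf. Exercises 3 and 4 below).
N = np.array([[1.0, 100.0],
              [0.0, 1.0]])
print(np.max(np.abs(np.linalg.eigvals(N))))  # rho(N) = 1
print(np.linalg.norm(N, 2) > 100.0)          # yet ||N||_2 is about 100
```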
1. Show that $\|x\|_\infty$ of (5.13) and $\|x\|_1$ of (5.14) satisfy all axioms of the norm.

2. Consider the space $\mathbb{R}^2$ that consists of all vectors $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ with two real components. Represent the elements of this space by the points $(x_1, x_2)$ on the two-dimensional Cartesian plane. On this plane, draw the curve specified by the equation $\|x\| = 1$, i.e., the unit circle, if the norm is defined as:

$$\|x\| = \|x\|_\infty = \max\{|x_1|, |x_2|\},$$
$$\|x\| = \|x\|_1 = |x_1| + |x_2|,$$
$$\|x\| = \|x\|_2 = (x_1^2 + x_2^2)^{1/2}.$$

3. Let $A: \mathbb{L} \mapsto \mathbb{L}$ be a linear operator. Prove that for any choice of the vector norm on $\mathbb{L}$, the corresponding induced operator norm $\|A\|$ satisfies the inequality:

$$\|A\| \ge \rho(A),$$

where $\rho(A) = \max_j |\lambda_j|$ is the spectral radius of the matrix $A$.

4. Construct an example of a linear operator $A: \mathbb{R}^2 \mapsto \mathbb{R}^2$ that is represented by a $2 \times 2$ matrix with the eigenvalues $\lambda_1 = \lambda_2 = 1$ and such that $\|A\|_\infty > 1000$, $\|A\|_1 > 1000$, and $\|A\|_2 > 1000$.

5. Let $\mathbb{L}$ be a vector space, and $A$ and $B$ be two linear operators acting on this space. Prove that $\|AB\| \le \|A\| \, \|B\|$.

6. Prove Theorem 5.1.
1 32
For the scal ar product (x,y) introduced on the linear space prove the Cauchy Schwarz inequality: \fx, y l (x,y) 1 2 ::; (x, x) (y,y) . Consider two cases: a) The field of scalars for is b)* The field of scalars for is 8. Prove that the Euclidean (Hermitian) norm I lx [ = (x,x) 1 /2 , where (x,y) is an appro priate scalar product on the given linear space, is indeed a norm, i.e., that it satisfies all three axioms of the norm from page 1 26. 9. Prove Theorem 5.2 for the case of a unitary space 10. Let be a Euclidean space, A be an arbitrary linear operator, and B : be an orthogonal linear operator, i.e., such that \fx (Bx Bx) (x, x) or equiva lently, BTB = I. Prove that I AB [ 1 2 = [ BA I1 2 = I[A [ 2 . A: and U and I I . Let A be a linear operator on the Euclidean (unitary) space be orthogonal (unitary) operators on this space, i.e., UU* = I. Show that [[UAW[[ 2 = [ A [b . 12. Let be an Euclidean space and A = A' be a selfadjoint operator: A Prove that [IA22 [1 2 = I [A [ � . Construct an example showing that if A i A *, then it can happen that [[A [b [IA I � . 1 3 .* Let be a Euclidean (unitary) space, and A : be an arbitrary linear operator. a) Prove thatA*A is a selfadjoint nonnegative definite operator on b) Prove that [ IA I[ 2 = [ A*A I = Amax (A*A), where Amax (A*A) is the maximum eigenvalue ofA*A. 1 4. Let lR2 be a twodimensional space of real vectors x = [��] with the inner product (5. 1 5): = X I Y I + X2Y2 , and the corresponding 12 norm (5. 1 7): [ X[ 2 (x, x) I /2 . Evaluate(x,y) the induced norm [jA [ 2 for the following two matrices: 7.
lL,
E lL :
lL
lL
lR;
C.
lL.
:
lL
lL f7 lL
lL ...... lL
,
E lL :
W
lL,
=
lL
=
I, WW*
lL ...... lL,
:
O. Evaluate the I norms matrices from Exercise 1 4 induced by the vector norm (5.23): [jX[jB =I A( [XI B,XoflB )the1 /2 two = (Bx,x) I /2 . 1 6. Let be a Euclidean (unitary) space, and !lxl[ = (x, x) 1 /2 be the corresponding vector norm. Let A : be an arbitrary linear operator on Prove that [IA Ii = [IA 1[I. If, additionally, A is a nonsingular1 operator, i.e., if the inverse operator (matrix) Aexists, then prove (A 1)* = (A* ) . =
lL
lL ...... lL
lL.
*
5.3 Conditioning of Linear Systems
Two linear systems that look quite similar at first glance may, in fact, have a very different degree of sensitivity of their solutions to the errors committed when specifying the input data. This phenomenon can be observed already for the systems $Ax = f$ of order two:

$$a_{11}x_1 + a_{12}x_2 = f_1,$$
$$a_{21}x_1 + a_{22}x_2 = f_2. \quad (5.30)$$

With no loss of generality, we will assume that the coefficients of system (5.30) are normalized: $a_{i1}^2 + a_{i2}^2 = 1$, $i = 1, 2$. Geometrically, each individual equation of system (5.30) defines a straight line on the Cartesian plane $(x_1, x_2)$. Accordingly, the solution of system (5.30) can be interpreted as the intersection point of these two lines, as shown in Figure 5.2.
[Figure 5.2: Sensitivity of the solution of system (5.30) to perturbations of the data. Panel (a): weak sensitivity; panel (b): strong sensitivity.]
Let us qualitatively analyze two antipodal situations. The straight lines that correspond to linear equations (5.30) can intersect "almost normally," as shown in Figure 5.2(a), or they can intersect "almost tangentially," as shown in Figure 5.2(b). If we slightly perturb the input data, i.e., the right-hand sides $f_i$, $i = 1, 2$, and/or the coefficients $a_{ij}$, $i, j = 1, 2$, then each line may move parallel to itself and/or tilt. In doing so, it is clear that the intersection point (i.e., the solution) in Figure 5.2(a) will only move slightly, whereas the intersection point in Figure 5.2(b) will move much more visibly. Accordingly, one can say that in the case of Figure 5.2(a) the sensitivity of the solution to perturbations of the input data is weak, whereas in the case of Figure 5.2(b) it is strong.

Quantitatively, the sensitivity of the solution to perturbations of the input data can be characterized with the help of the so-called condition number $\mu(A)$. We will later
1 34
A Theoretical Introduction to Numerical Analysis
see that not only does the condition number determine the aforementioned sensitivity, but also that it directly influences the performance of iterative methods for solving Ax = f. Namely, it affects the number of iterations and as such, the number of arithmetic operations, required for finding an approximate solution of Ax = f within a prescribed tolerance, see Chapter 6 . 5.3.1
Condition Number
The condition number of a linear operator A acting on a normed vector space L is defined as follows:

\[
\mu(A) = \|A\| \cdot \|A^{-1}\|. \tag{5.31}
\]
The same quantity μ(A) given by formula (5.31) is also referred to as the condition number of a linear system Ax = f. Recall that we have previously identified every matrix

\[
A = \begin{bmatrix} a_{11} & \ldots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \ldots & a_{nn} \end{bmatrix}
\]

with a linear operator acting on an n-dimensional vector space L of the elements x = [x_1, ..., x_n]^T (for example, L = R^n or L = C^n). For a given x ∈ L, the operator yields y = Ax, where y = [y_1, ..., y_n]^T is computed as follows:

\[
y_i = \sum_{j=1}^{n} a_{ij} x_j, \quad i = 1, 2, \ldots, n.
\]
Accordingly, the definition of the condition number μ(A) by formula (5.31) also makes sense for matrices, so that one can refer to the condition number of a matrix A, as well as to the condition number of a system of linear algebraic equations specified in its canonical form (5.1) rather than only in the operator form. The norms ||A|| and ||A^{-1}|| of the direct and inverse operators, respectively, in formula (5.31) are assumed induced by the vector norm chosen in L. Consequently, the condition number μ(A) also depends on the choice of the norm in L. If A is a matrix, and we use the maximum norm for the vectors from L, ||x||_∞ = max_j |x_j|, then we write μ = μ_∞(A); if we employ the first norm ||x||_1 = Σ_j |x_j|, then μ = μ_1(A). If L is a Euclidean (unitary) space with the scalar product (x, y) given, for example, by formula (5.15) or (5.16), and the corresponding Euclidean (Hermitian) norm is defined by formula (5.17): ||x||_2 = (x, x)^{1/2}, then μ = μ_2(A). If an alternative scalar product [x, y]_B = (Bx, y) is introduced in L based on the original product (x, y) and on the operator B = B* > 0, see formula (5.22), and if a new norm is set up accordingly by formula (5.23): ||x||_B = ([x, x]_B)^{1/2}, then the corresponding condition number is denoted by μ = μ_B(A).
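The dependence of μ on the chosen norm can be illustrated with a short computation. The sketch below (a Python illustration, not part of the book) evaluates μ_∞(A) and μ_1(A) for a 2×2 matrix by forming the inverse explicitly via the adjugate formula; the sample matrix is an assumption chosen for illustration.

```python
# Sketch: condition numbers of a 2x2 matrix in the maximum and first norms.
# The matrix is small enough to invert by the adjugate formula; for larger
# systems one would use a factorization instead.

def inv2(a):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular matrix")
    return [[ a22 / det, -a12 / det],
            [-a21 / det,  a11 / det]]

def norm_inf(a):
    """Induced maximum norm: largest row sum of absolute values (5.26)."""
    return max(sum(abs(v) for v in row) for row in a)

def norm_1(a):
    """Induced first norm: largest column sum of absolute values (5.27)."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a)))

def mu_inf(a):
    return norm_inf(a) * norm_inf(inv2(a))

def mu_1(a):
    return norm_1(a) * norm_1(inv2(a))

A = [[2.0, 1.0],
     [1.0, 2.0]]
print(mu_inf(A))  # approximately 3.0: ||A||_inf = 3, ||A^-1||_inf = 1
print(mu_1(A))    # also approximately 3.0, since A is symmetric
```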
Let us now explain the geometric meaning of the condition number μ(A). To do so, consider the set S ⊂ L of all vectors with the norm equal to one, i.e., the unit sphere in the space L. Among these vectors, choose a particular two, x_max and x_min, that satisfy the following equalities:
\[
\|A x_{\max}\| = \max_{x \in S} \|Ax\| \quad \text{and} \quad \|A x_{\min}\| = \min_{x \in S} \|Ax\|.
\]

It is easy to see that

\[
\|A\| = \|A x_{\max}\| \quad \text{and} \quad \|A^{-1}\| = \frac{1}{\|A x_{\min}\|}.
\]

Indeed, according to formula (5.24) we can write:

\[
\|A\| = \max_{x \in \mathbb{L},\, x \neq 0} \frac{\|Ax\|}{\|x\|}
= \max_{x \in \mathbb{L},\, x \neq 0} \left\| A \frac{x}{\|x\|} \right\|
= \max_{x \in S} \|Ax\| = \|A x_{\max}\|.
\]

Likewise:

\[
\|A^{-1}\| = \max_{f \in \mathbb{L},\, f \neq 0} \frac{\|A^{-1} f\|}{\|f\|}
= \max_{x \in \mathbb{L},\, x \neq 0} \frac{\|x\|}{\|Ax\|}
= \left[ \min_{x \in \mathbb{L},\, x \neq 0} \frac{\|Ax\|}{\|x\|} \right]^{-1}
= \frac{1}{\|A x_{\min}\|}.
\]
Substituting these expressions into the definition of μ(A) by means of formula (5.31), we obtain:

\[
\mu(A) = \frac{\max_{x \in S} \|Ax\|}{\min_{x \in S} \|Ax\|}. \tag{5.32}
\]

This, in particular, implies that we always have μ(A) ≥ 1.
According to formula (5.32), the condition number μ(A) is the ratio of magnitudes of the maximally stretched unit vector to the minimally stretched (i.e., maximally shrunk) unit vector. If the operator A is singular, i.e., if the inverse operator A^{-1} does not exist, then we formally set μ(A) = ∞.

The geometric meaning of the quantity μ(A) becomes particularly apparent in the case of a Euclidean space L = R^2 equipped with the norm ||x||_2 = (x, x)^{1/2} = (x_1^2 + x_2^2)^{1/2}. In other words, L is the Euclidean plane of the variables x_1 and x_2. In this case, S is a conventional unit circle: x_1^2 + x_2^2 = 1. A linear mapping by means of the operator A obviously transforms this circle into an ellipse. Formula (5.32) then implies that the condition number μ(A) is the ratio of the large semi-axis of this ellipse to its small semi-axis.

THEOREM 5.3
Let A = A* be an operator on the vector space L, self-adjoint in the sense of a given scalar product [x, y]_B. Then,

\[
\mu_B(A) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|}, \tag{5.33}
\]
where λ_max and λ_min are the eigenvalues of A with the largest and smallest moduli, respectively.
PROOF. Let {e_1, e_2, ..., e_n} be an orthonormal basis in the n-dimensional space L composed of the eigenvectors of A. Orthonormality of the basis is understood in the sense of the scalar product [x, y]_B. Let λ_j be the corresponding eigenvalues: Ae_j = λ_j e_j, j = 1, 2, ..., n, which are all real. Also assume with no loss of generality that the eigenvalues are arranged in the descending order:

\[
|\lambda_1| \geq |\lambda_2| \geq \ldots \geq |\lambda_n|.
\]

An arbitrary vector x ∈ L can be expanded with respect to this basis:

\[
x = x_1 e_1 + x_2 e_2 + \ldots + x_n e_n.
\]

In doing so, Ax = λ_1 x_1 e_1 + λ_2 x_2 e_2 + ... + λ_n x_n e_n, and because of the orthonormality of the basis:

\[
\|Ax\|_B = \left( |\lambda_1 x_1|^2 + |\lambda_2 x_2|^2 + \ldots + |\lambda_n x_n|^2 \right)^{1/2}.
\]

Consequently, ||Ax||_B ≤ |λ_1| ||x||_B, and if ||x||_B = 1, i.e., if x ∈ S, then ||Ax||_B ≤ |λ_1|. At the same time, for x = e_1 ∈ S we have ||Ae_1||_B = |λ_1|. Therefore,

\[
\max_{x \in S} \|Ax\|_B = |\lambda_1| = |\lambda_{\max}|.
\]

A similar argument immediately yields:

\[
\min_{x \in S} \|Ax\|_B = |\lambda_n| = |\lambda_{\min}|.
\]

Then, formula (5.32) implies (5.33). □
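Theorem 5.3 can also be checked numerically. The sketch below takes B = I (so that [x, y]_B is the ordinary dot product), a symmetric 2×2 matrix chosen as an assumption, and compares the ratio (5.32), approximated by sampling the unit circle S, with |λ_max| / |λ_min|.

```python
import math

# Numerical check of Theorem 5.3 with B = I: for a symmetric matrix,
# mu_2(A) = |lambda_max| / |lambda_min|.

A = [[2.0, 1.0],
     [1.0, 2.0]]   # symmetric; eigenvalues 3 and 1

# Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = A[0][0], A[0][1], A[1][1]
mean = (a + c) / 2.0
disc = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
lam_max, lam_min = mean + disc, mean - disc

# Ratio of the largest to the smallest stretch over the unit circle S,
# as in formula (5.32), approximated by sampling x in S.
stretches = []
for k in range(3600):
    t = 2.0 * math.pi * k / 3600
    x = (math.cos(t), math.sin(t))
    ax = (A[0][0] * x[0] + A[0][1] * x[1],
          A[1][0] * x[0] + A[1][1] * x[1])
    stretches.append(math.hypot(ax[0], ax[1]))

mu_sampled = max(stretches) / min(stretches)
mu_exact = abs(lam_max) / abs(lam_min)
print(mu_sampled, mu_exact)  # both close to 3
```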
Note that similarly to Theorem 5.2, the result of Theorem 5.3 can also be generalized to the case of normal matrices AA* = A*A.

5.3.2 Characterization of a Linear System by Means of Its Condition Number
THEOREM 5.4
Let L be a normed vector space (e.g., L = R^n or L = C^n), and let A : L → L be a nonsingular linear operator. Consider a system of linear algebraic equations:

\[
Ax = f, \tag{5.34}
\]
where x ∈ L is the vector of unknowns and f ∈ L is a given right-hand side. Let Δf ∈ L be a perturbation of the right-hand side that leads to the perturbation Δx ∈ L of the solution, so that:

\[
A(x + \Delta x) = f + \Delta f. \tag{5.35}
\]

Then the relative error of the solution ||Δx|| / ||x|| satisfies the inequality:

\[
\frac{\|\Delta x\|}{\|x\|} \leq \mu(A) \frac{\|\Delta f\|}{\|f\|}, \tag{5.36}
\]

where μ(A) is the condition number of the operator A, see formula (5.31). Moreover, there are particular f ∈ L and Δf ∈ L for which (5.36) transforms into a precise equality.
PROOF. Formulae (5.34) and (5.35) imply that AΔx = Δf, and consequently, Δx = A^{-1}Δf. Let us also employ the expression Ax = f that defines the original system itself. Then,

\[
\frac{\|\Delta x\|}{\|x\|} = \frac{\|A^{-1}\Delta f\|}{\|x\|}
= \frac{\|Ax\|}{\|x\|} \cdot \frac{\|A^{-1}\Delta f\|}{\|\Delta f\|} \cdot \frac{\|\Delta f\|}{\|Ax\|}
= \frac{\|Ax\|}{\|x\|} \cdot \frac{\|A^{-1}\Delta f\|}{\|\Delta f\|} \cdot \frac{\|\Delta f\|}{\|f\|}. \tag{5.37}
\]

According to the definition of the operator norm, see formula (5.24), we have:

\[
\frac{\|Ax\|}{\|x\|} \leq \|A\| \quad \text{and} \quad \frac{\|A^{-1}\Delta f\|}{\|\Delta f\|} \leq \|A^{-1}\|. \tag{5.38}
\]

Combining formulae (5.37) and (5.38), we obtain for any f ∈ L and Δf ∈ L:

\[
\frac{\|\Delta x\|}{\|x\|} \leq \|A\| \, \|A^{-1}\| \, \frac{\|\Delta f\|}{\|f\|} = \mu(A) \frac{\|\Delta f\|}{\|f\|}, \tag{5.39}
\]

which means that inequality (5.36) holds. Furthermore, if Δf is the specific vector from the space L for which

\[
\frac{\|A^{-1}\Delta f\|}{\|\Delta f\|} = \|A^{-1}\|,
\]

and f = Ax is the element of L for which

\[
\frac{\|Ax\|}{\|x\|} = \|A\|,
\]

then expression (5.37) coincides with inequalities (5.39) and (5.36), which transform into precise equalities for these particular f and Δf. □
Let us emphasize that the sensitivity of x to the perturbations Δf is, generally speaking, not the same for all vectors f. It may vary quite dramatically for different right-hand sides of the system Ax = f. In fact, Theorem 5.4 describes the worst-case scenario. Otherwise, for a given fixed f and x = A^{-1}f it may happen that ||Ax|| / ||x|| ≪ ||A||, so that expression (5.37) will yield a much weaker sensitivity of the relative error ||Δx|| / ||x|| to the relative error ||Δf|| / ||f|| than provided by estimate (5.36).
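The estimate (5.36) is easy to observe in computation. In the sketch below (an illustration, not part of the book), an ill-conditioned 2×2 system is solved by Cramer's rule before and after a small perturbation of f, and the relative error of the solution is compared against μ_∞(A) times the relative error of the data; all the numbers are assumptions picked to make the two lines of the system nearly parallel, as in Figure 5.2(b).

```python
# Sketch: the bound (5.36) in action for an ill-conditioned 2x2 system,
# with all norms taken as maximum norms.

def solve2(A, f):
    """Solve a 2x2 system by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x1 = (f[0] * A[1][1] - A[0][1] * f[1]) / det
    x2 = (A[0][0] * f[1] - f[0] * A[1][0]) / det
    return [x1, x2]

def norm_inf_vec(v):
    return max(abs(c) for c in v)

A = [[1.0, 1.0],
     [1.0, 1.001]]            # nearly parallel lines
f = [2.0, 2.001]              # exact solution x = [1, 1]
df = [0.0, 0.0005]            # small perturbation of the data

x = solve2(A, f)
x_pert = solve2(A, [f[0] + df[0], f[1] + df[1]])
dx = [x_pert[0] - x[0], x_pert[1] - x[1]]

rel_err_x = norm_inf_vec(dx) / norm_inf_vec(x)
rel_err_f = norm_inf_vec(df) / norm_inf_vec(f)

# mu_inf(A) = ||A||_inf * ||A^-1||_inf, with the inverse taken explicitly.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[ A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det,  A[0][0] / det]]
row_sum = lambda M: max(sum(abs(v) for v in r) for r in M)
mu = row_sum(A) * row_sum(A_inv)

print(rel_err_x, mu * rel_err_f)  # the left number never exceeds the right
```

A tiny relative perturbation of the data (about 0.025%) produces a relative error of about 50% in the solution, yet the product μ(A) · ||Δf||_∞ / ||f||_∞ still bounds it, exactly as Theorem 5.4 states.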
To accurately evaluate the condition number μ(A), one needs to be able to compute the norms of the operators A and A^{-1}. Typically, this is a cumbersome task. If, for example, the operator A is specified by means of its matrix, and we are interested in finding μ_∞(A) or μ_1(A), then we first need to obtain the inverse matrix A^{-1} and subsequently compute ||A||_∞, ||A^{-1}||_∞ or ||A||_1, ||A^{-1}||_1 using formulae (5.26) or (5.27), respectively, see Theorem 5.1. Evaluating the condition number μ_B(A) with respect to the Euclidean (Hermitian) norm given by an operator B = B* > 0 could be even more difficult. Therefore, one often restricts the consideration to obtaining a satisfactory upper bound for the condition number μ(A) using the specifics of the operator A available. We will later see examples of such estimates, e.g., in Section 5.7.2.

In the meantime, we will identify a special class of matrices called matrices with diagonal dominance, for which an estimate of μ_∞(A) can be obtained without evaluating the inverse matrix A^{-1}. A matrix

\[
A = \begin{bmatrix} a_{11} & \ldots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \ldots & a_{nn} \end{bmatrix}
\]

is said to have a (strict) diagonal dominance (by rows) of magnitude δ if

\[
|a_{ii}| \geq \sum_{j \neq i} |a_{ij}| + \delta, \quad i = 1, 2, \ldots, n, \tag{5.40}
\]

where δ > 0 is one and the same constant for all rows of A.
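A routine check of definition (5.40) can be sketched as follows; the helper function and the sample matrix are assumptions introduced for illustration.

```python
# Sketch: measure the magnitude of the (row) diagonal dominance (5.40).
# The function returns the largest delta such that
#   |a_ii| >= sum_{j != i} |a_ij| + delta
# holds in every row; a non-positive result means the matrix is not
# strictly diagonally dominant.

def dominance_magnitude(A):
    n = len(A)
    return min(abs(A[i][i]) - sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

A = [[4.0, 1.0, 0.5],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 3.0]]
print(dominance_magnitude(A))  # 1.0: the tightest row is the third (3 - 2 = 1)
```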
THEOREM 5.5
Let A be a matrix with diagonal dominance of magnitude δ > 0. Then the inverse matrix A^{-1} exists, and its maximum row norm (5.26), i.e., its norm induced by the l_∞ vector norm ||x||_∞ = max_j |x_j|, satisfies the estimate:

\[
\|A^{-1}\|_\infty \leq \frac{1}{\delta}. \tag{5.41}
\]

PROOF. Take an arbitrary f = [f_1, ..., f_n]^T and assume that the system of linear algebraic equations Ax = f has a solution x = [x_1, ..., x_n]^T such that max_j |x_j| = |x_k|.
Let us write down equation number k from the system Ax = f:

\[
a_{k1}x_1 + a_{k2}x_2 + \ldots + a_{kn}x_n = f_k.
\]

Taking into account that |x_k| ≥ |x_j| for j = 1, 2, ..., n, we arrive at the following estimate:

\[
|f_k| \geq |a_{kk}| |x_k| - \sum_{j \neq k} |a_{kj}| |x_j|
\geq \Big( |a_{kk}| - \sum_{j \neq k} |a_{kj}| \Big) |x_k| \geq \delta |x_k|.
\]

Consequently, |x_k| ≤ |f_k| / δ. On the other hand, |f_k| ≤ max_i |f_i| = ||f||_∞. Therefore, |x_k| = max_j |x_j| = ||x||_∞ and

\[
\|x\|_\infty \leq \frac{1}{\delta} \|f\|_\infty. \tag{5.42}
\]

In particular, estimate (5.42) means that if f = 0 ∈ L (e.g., L = R^n or L = C^n), then ||x||_∞ = 0, and consequently, the homogeneous system Ax = 0 only has a trivial solution x = 0. As such, the inhomogeneous system Ax = f has a unique solution for every f ∈ L. In other words, the inverse matrix A^{-1} exists. Estimate (5.42) also implies that for any f ∈ L, f ≠ 0, the following estimate holds for x = A^{-1}f:

\[
\frac{\|A^{-1} f\|_\infty}{\|f\|_\infty} \leq \frac{1}{\delta},
\]

so that

\[
\|A^{-1}\|_\infty = \max_{f \in \mathbb{L},\, f \neq 0} \frac{\|A^{-1} f\|_\infty}{\|f\|_\infty} \leq \frac{1}{\delta}. \qquad \square
\]
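Estimate (5.42) can be observed directly: pick any x, form f = Ax for a diagonally dominant A, and compare ||x||_∞ with ||f||_∞ / δ. The sketch below does exactly that; the matrix and the vector are assumptions chosen for the test.

```python
# Numerical illustration of estimate (5.42): for a matrix with diagonal
# dominance of magnitude delta, any solution of Ax = f satisfies
#   ||x||_inf <= ||f||_inf / delta.

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

A = [[4.0, 1.0, 0.5],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 3.0]]
delta = min(abs(A[i][i]) - sum(abs(A[i][j]) for j in range(3) if j != i)
            for i in range(3))             # here delta = 1.0

x = [1.0, -2.0, 0.5]
f = matvec(A, x)

norm = lambda v: max(abs(c) for c in v)
print(norm(x), norm(f) / delta)  # the bound holds: 2.0 <= 8.0
```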
COROLLARY 5.1
Let A be a matrix with diagonal dominance of magnitude δ > 0. Then,

\[
\mu_\infty(A) \leq \frac{\|A\|_\infty}{\delta}. \tag{5.43}
\]

The proof is obtained as an immediate implication of the result of Theorem 5.5.

Exercises
1. Prove that the condition numbers μ_∞(A) and μ_1(A) of the matrix A will not change after a permutation of rows or columns.
2. Prove that for a square matrix A and its transpose A^T, the following equalities hold: μ_∞(A) = μ_1(A^T), μ_1(A) = μ_∞(A^T).

3. Show that the condition number of the operator A does not change if the operator is multiplied by an arbitrary nonzero real number.

4. Let L be a Euclidean space, and let A : L → L. Show that the condition number μ_B(A) = 1 if and only if at least one of the following conditions holds: a) A = αI, where α ∈ R; b) A is an orthogonal operator, i.e., ∀x ∈ L : [Ax, Ax]_B = [x, x]_B; c) A is a composition of αI and an orthogonal operator.

5.* Prove that μ_B(A) = μ_B(A*), where A* is the operator adjoint to A in the sense of the scalar product [x, y]_B.

6. Let A be a nonsingular matrix, det A ≠ 0. Multiply one row of the matrix A by some scalar α, and denote the new matrix by A_α. Show that μ(A_α) → ∞ as α → ∞.

7.* Prove that for any linear operator A : L → L:

\[
\mu_B(A^* A) = \left( \mu_B(A) \right)^2,
\]

where A* is the operator adjoint to A in the sense of the scalar product [x, y]_B.

8.* Let A = A* > 0 and B = B* > 0 in the sense of some scalar product introduced on the linear space L. Let the following inequalities hold for every x ∈ L:

\[
\gamma_1 (Bx, x) \leq (Ax, x) \leq \gamma_2 (Bx, x),
\]

where γ_1 > 0 and γ_2 > 0 are two real numbers. Consider the operator C = B^{-1}A and prove that the condition number μ_B(C) satisfies the estimate: μ_B(C) ≤ γ_2 / γ_1.

Remark. We will solve this problem in Section 6.1.4, as it has numerous applications.
5.4 Gaussian Elimination and Its Tri-Diagonal Version

We will describe both the standard Gaussian elimination algorithm and the Gaussian elimination with pivoting, as they apply to solving an n × n system of linear algebraic equations in its canonical form:

\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n &= f_1, \\
a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n &= f_2, \\
&\ \vdots \\
a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n &= f_n.
\end{aligned}
\tag{5.44}
\]

Recall that the Gaussian elimination procedures belong to the class of direct methods, i.e., they produce the exact solution of system (5.44) after a finite number of arithmetic operations.
5.4.1 Standard Gaussian Elimination
From the first equation of system (5.44), express x_1 through the other variables:

\[
x_1 = a'_{12}x_2 + a'_{13}x_3 + \ldots + a'_{1n}x_n + f'_1. \tag{5.45}
\]

Substitute expression (5.45) for x_1 into the remaining n − 1 equations of system (5.44) and obtain a new system of n − 1 linear algebraic equations with respect to the n − 1 unknowns x_2, x_3, ..., x_n. From the first equation of this updated system express x_2 as a function of all other variables:

\[
x_2 = a'_{23}x_3 + a'_{24}x_4 + \ldots + a'_{2n}x_n + f'_2, \tag{5.46}
\]

substitute into the remaining n − 2 equations and obtain yet another system of further reduced dimension: n − 2 equations with n − 2 unknowns. Repeat this procedure, called elimination, for k = 3, 4, ..., n − 1:

\[
x_k = a'_{k,k+1}x_{k+1} + a'_{k,k+2}x_{k+2} + \ldots + a'_{kn}x_n + f'_k, \tag{5.47}
\]

and every time obtain a new smaller linear system of dimension (n − k) × (n − k). At the last stage k = n, no substitution is needed, and we simply solve the 1 × 1 system, i.e., a single scalar linear equation from the previous stage k = n − 1, which yields:

\[
x_n = f'_n. \tag{5.48}
\]
Having obtained the value of x_n by formula (5.48), we can find the values of the unknowns x_{n−1}, x_{n−2}, ..., x_1 one after another by employing formulae (5.47) in the ascending order: k = n − 1, n − 2, ..., 1.

This procedure, however, is not fail-proof. It may break down because of a division by zero. Even if no division by zero occurs, the algorithm may still end up generating a large error in the solution due to an amplification of the small round-off errors. Let us explain these phenomena.

If the coefficient a_{11} in front of the unknown x_1 in the first equation of system (5.44) is equal to zero, a_{11} = 0, then already the first step of the elimination algorithm, i.e., expression (5.45), becomes invalid, because a'_{12} = −a_{12}/a_{11}. A division by zero can obviously be encountered at any stage of the algorithm. For example, having the coefficient in front of x_2 equal to zero in the first equation of the (n − 1) × (n − 1) system invalidates expression (5.46). But even if no division by zero is ever encountered, and formulae (5.47) are obtained for all k = 1, 2, ..., n, then the method can still develop a computational instability when solving for x_n, x_{n−1}, ..., x_1. For example, if it happens that a'_{k,k+1} = 2 for k = n − 1, n − 2, ..., 1, while all other a'_{kj} are equal to zero, then the small round-off error committed when evaluating x_n by formula (5.48) increases by a factor of two when computing x_{n−1}, then by another factor of two when computing x_{n−2}, and eventually by a factor of 2^{n−1} when computing x_1. Already for n = 11 the error grows more than a thousand times.

Another potential danger that one needs to keep in mind is that the quantities f'_k can rapidly increase when k increases. If this happens, then even small relative errors committed when computing f'_k in expression (5.47) can lead to a large absolute error in the value of x_k.
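The elimination and back-substitution stages described above can be sketched in code as follows. This is an illustration of the standard procedure without pivoting, so it inherits the failure modes just discussed; the test system is an assumption.

```python
# Sketch of the standard Gaussian elimination of Section 5.4.1: the forward
# stage realizes the substitutions (5.45)-(5.48), and the backward stage
# recovers x_n, x_{n-1}, ..., x_1. No pivoting is performed, so the routine
# may fail by division by zero on matrices that are solvable otherwise;
# Theorem 5.6 gives a sufficient condition for safety.

def gauss_solve(A, f):
    n = len(A)
    a = [row[:] for row in A]        # work on copies
    g = f[:]
    # Forward elimination: after step k, the equations below row k no
    # longer contain x_k, as in formula (5.47).
    for k in range(n):
        if a[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot at step %d" % (k + 1))
        for i in range(k + 1, n):
            t = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= t * a[k][j]
            g[i] -= t * g[k]
    # Back substitution in the reverse order k = n, n-1, ..., 1.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(a[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (g[k] - s) / a[k][k]
    return x

A = [[4.0, 1.0, 0.5],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 3.0]]
f = [2.25, -8.0, -2.5]
print(gauss_solve(A, f))  # recovers [1.0, -2.0, 0.5] up to round-off
```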
Let us now formulate a sufficient condition that would guarantee the computational stability of the Gaussian elimination algorithm.

THEOREM 5.6
Let the matrix A of system (5.44) be a matrix with diagonal dominance of magnitude δ > 0, see formula (5.40). Then, no division by zero will be encountered in the standard Gaussian elimination algorithm. Moreover, the following inequalities will hold:

\[
\sum_{j=1}^{n-k} |a'_{k,k+j}| < 1, \quad k = 1, 2, \ldots, n-1, \tag{5.49}
\]

\[
|f'_k| \leq \frac{2}{\delta} \max_i |f_i|, \quad k = 1, 2, \ldots, n. \tag{5.50}
\]
To prove Theorem 5.6, we will first need the following Lemma.

LEMMA 5.1
Let

\[
\begin{aligned}
b_{11}y_1 + b_{12}y_2 + \ldots + b_{1m}y_m &= g_1, \\
b_{21}y_1 + b_{22}y_2 + \ldots + b_{2m}y_m &= g_2, \\
&\ \vdots \\
b_{m1}y_1 + b_{m2}y_2 + \ldots + b_{mm}y_m &= g_m,
\end{aligned}
\tag{5.51}
\]

be a system of linear algebraic equations with diagonal dominance of magnitude δ > 0:

\[
|b_{ll}| \geq \sum_{j \neq l} |b_{lj}| + \delta, \quad l = 1, 2, \ldots, m. \tag{5.52}
\]

Then, when reducing the first equation of system (5.51) to the form

\[
y_1 = b'_{12}y_2 + b'_{13}y_3 + \ldots + b'_{1m}y_m + g'_1, \tag{5.53}
\]

there will be no division by zero. Besides, the inequality

\[
\sum_{j=2}^{m} |b'_{1j}| < 1 \tag{5.54}
\]

will hold. Finally, if m > 1, then the variable y_1 can be eliminated from system (5.51) with the help of expression (5.53). In doing so, the resulting (m − 1) × (m − 1) system with respect to the unknowns y_2, y_3, ..., y_m will also be a system with diagonal dominance of the same magnitude δ as in formula (5.52).

PROOF.
According to formula (5.52), for l = 1 we have:

\[
|b_{11}| \geq \sum_{j=2}^{m} |b_{1j}| + \delta.
\]

Consequently, b_{11} ≠ 0, and expression (5.53) makes sense, where

\[
b'_{1j} = -\frac{b_{1j}}{b_{11}} \quad \text{for } j = 2, 3, \ldots, m, \quad \text{and} \quad g'_1 = \frac{g_1}{b_{11}}.
\]

Moreover,

\[
\sum_{j=2}^{m} |b'_{1j}| = \frac{|b_{12}| + |b_{13}| + \ldots + |b_{1m}|}{|b_{11}|} \leq \frac{|b_{11}| - \delta}{|b_{11}|} < 1,
\]

hence inequality (5.54) is satisfied. It remains to prove the last assertion of the Lemma for m > 1. Substituting expression (5.53) into the equation number j (j > 1) of system (5.51), we obtain:

\[
(b_{j2} + b_{j1}b'_{12})y_2 + (b_{j3} + b_{j1}b'_{13})y_3 + \ldots + (b_{jm} + b_{j1}b'_{1m})y_m = g_j - b_{j1}g'_1,
\quad j = 2, 3, \ldots, m.
\]

In this system of m − 1 equations, equation number l (l = 1, 2, ..., m − 1) is the equation:

\[
(b_{l+1,2} + b_{l+1,1}b'_{12})y_2 + \ldots + (b_{l+1,m} + b_{l+1,1}b'_{1m})y_m = g_{l+1} - b_{l+1,1}g'_1,
\quad l = 1, 2, \ldots, m-1. \tag{5.55}
\]

Consequently, the entries in row number l of the matrix of system (5.55) are:

\[
(b_{l+1,2} + b_{l+1,1}b'_{12}),\ (b_{l+1,3} + b_{l+1,1}b'_{13}),\ \ldots,\ (b_{l+1,m} + b_{l+1,1}b'_{1m}),
\]

and the corresponding diagonal entry is (b_{l+1,l+1} + b_{l+1,1}b'_{1,l+1}). Let us show that there is a diagonal dominance of magnitude δ, i.e., that the following estimate holds:

\[
|b_{l+1,l+1} + b_{l+1,1}b'_{1,l+1}| \geq \sum_{j=2,\, j \neq l+1}^{m} |b_{l+1,j} + b_{l+1,1}b'_{1j}| + \delta. \tag{5.56}
\]

We will prove an even stronger inequality:

\[
|b_{l+1,l+1}| - |b_{l+1,1}b'_{1,l+1}| \geq \sum_{j=2,\, j \neq l+1}^{m} \left[ |b_{l+1,j}| + |b_{l+1,1}b'_{1j}| \right] + \delta,
\]

which, in turn, is equivalent to the inequality:

\[
|b_{l+1,l+1}| \geq \sum_{j=2,\, j \neq l+1}^{m} |b_{l+1,j}| + |b_{l+1,1}| \sum_{j=2}^{m} |b'_{1j}| + \delta. \tag{5.57}
\]

Let us replace the quantity \(\sum_{j=2}^{m} |b'_{1j}|\) in formula (5.57) by the number 1:

\[
|b_{l+1,l+1}| \geq \sum_{j=2,\, j \neq l+1}^{m} |b_{l+1,j}| + |b_{l+1,1}| + \delta
= \sum_{j=1,\, j \neq l+1}^{m} |b_{l+1,j}| + \delta. \tag{5.58}
\]
According to estimate (5.54), if inequality (5.58) holds, then inequality (5.57) will automatically hold. However, inequality (5.58) is true because of estimate (5.52). Thus, we have proven inequality (5.56). □

PROOF OF THEOREM 5.6. We will first establish the validity of formula (5.47) and prove inequality (5.49). To do so, we will use induction with respect to k. If k = 1 and n > 1, formula (5.47) and inequality (5.49) are equivalent to formula (5.53) and inequality (5.54), respectively, proven in Lemma 5.1. In addition, Lemma 5.1 implies that the (n − 1) × (n − 1) system with respect to x_2, x_3, ..., x_n obtained from (5.44) by eliminating the variable x_1 using (5.45) will also be a system with diagonal dominance of magnitude δ.

Assume now that formula (5.47) and inequality (5.49) have already been proven for all k = 1, 2, ..., l, where l < n. Also assume that the (n − l) × (n − l) system with respect to x_{l+1}, x_{l+2}, ..., x_n obtained from (5.44) by consecutively eliminating the variables x_1, x_2, ..., x_l using (5.47) is a system with diagonal dominance of magnitude δ. Then this system can be considered in the capacity of system (5.51), in which case Lemma 5.1 immediately implies that the assumption of the induction is true for k = l + 1 as well. This completes the proof by induction.

Next, let us justify inequality (5.50). According to Theorem 5.5, the following estimate holds for the solution of system (5.44):

\[
\|x\|_\infty \leq \frac{1}{\delta} \|f\|_\infty.
\]

Employing this inequality along with formula (5.47) and inequality (5.49) that we have just proven, we obtain for any k = 1, 2, ..., n:

\[
|f'_k| = \Big| x_k - \sum_{j=1}^{n-k} a'_{k,k+j} x_{k+j} \Big|
\leq \|x\|_\infty \Big( 1 + \sum_{j=1}^{n-k} |a'_{k,k+j}| \Big)
< 2 \|x\|_\infty \leq \frac{2}{\delta} \max_i |f_i|.
\]

This estimate obviously coincides with (5.50). □
Let us emphasize that the hypothesis of Theorem 5.6 (diagonal dominance) provides a sufficient but not a necessary condition for the applicability of the standard Gaussian elimination procedure. There are other linear systems (5.44) that lend themselves to the solution by this method. If, for a given system (5.44), the Gaussian elimination is successfully implemented on a computer (no divisions by zero and no instabilities), then the accuracy of the resulting exact solution will only be limited by the machine precision, i.e., by the round-off errors. Having computed the approximate solution of system (5.44), one can substitute it into the left-hand side and thus obtain the residual Δf of the right-hand side. Then, the estimate

\[
\frac{\|\Delta x\|}{\|x\|} \leq \mu(A) \frac{\|\Delta f\|}{\|f\|}
\]

can be used to judge the error of the solution, provided that the condition number μ(A) or its upper bound is known. This is one of the ideas behind the so-called a posteriori analysis.

Computational complexity of the standard Gaussian elimination algorithm is cubic with respect to the dimension n of the system. More precisely, it requires roughly (2/3)n^3 + 2n^2 = O(n^3) arithmetic operations to obtain the solution. In doing so, the cubic component of the cost comes from the elimination per se, whereas the quadratic component is the cost of solving back for x_n, x_{n−1}, ..., x_1.

5.4.2 Tri-Diagonal Elimination
The Gaussian elimination algorithm of Section 5.4.1 is particularly efficient for a system (5.44) of the following special kind:

\[
\begin{aligned}
b_1 x_1 + c_1 x_2 &= f_1, \\
a_2 x_1 + b_2 x_2 + c_2 x_3 &= f_2, \\
a_3 x_2 + b_3 x_3 + c_3 x_4 &= f_3, \\
&\ \vdots \\
a_{n-1} x_{n-2} + b_{n-1} x_{n-1} + c_{n-1} x_n &= f_{n-1}, \\
a_n x_{n-1} + b_n x_n &= f_n.
\end{aligned}
\tag{5.59}
\]

The matrix of system (5.59) is tri-diagonal, i.e., all its entries are equal to zero except those on the main diagonal, on the super-diagonal, and on the sub-diagonal. In other words, if A = {a_ij}, then a_ij = 0 for j > i + 1 and j < i − 1. Adopting the notation of formula (5.59), we say that a_ii = b_i, i = 1, 2, ..., n; a_{i,i−1} = a_i, i = 2, 3, ..., n; and a_{i,i+1} = c_i, i = 1, 2, ..., n − 1.

The conditions of diagonal dominance (5.40) for system (5.59) read:

\[
\begin{aligned}
|b_1| &\geq |c_1| + \delta, \\
|b_k| &\geq |a_k| + |c_k| + \delta, \quad k = 2, 3, \ldots, n-1, \\
|b_n| &\geq |a_n| + \delta.
\end{aligned}
\tag{5.60}
\]

Equalities (5.45)–(5.48) transform into:

\[
x_k = A_k x_{k+1} + F_k, \quad k = 1, 2, \ldots, n-1, \qquad x_n = F_n, \tag{5.61}
\]

where A_k and F_k are some coefficients. Define A_n = 0 and rewrite formulae (5.61) as a unified expression:

\[
x_k = A_k x_{k+1} + F_k, \quad k = n, n-1, \ldots, 1. \tag{5.62}
\]
From the first equation of system (5.59) it is clear that for k = 1 the coefficients in formula (5.62) are:

\[
A_1 = -\frac{c_1}{b_1}, \qquad F_1 = \frac{f_1}{b_1}. \tag{5.63}
\]

Suppose that all the coefficients A_k and F_k have already been computed up to some fixed k, 1 ≤ k ≤ n − 1. Substituting the expression x_k = A_k x_{k+1} + F_k into the equation number k + 1 of system (5.59), we obtain:

\[
x_{k+1} = -\frac{c_{k+1}}{b_{k+1} + a_{k+1} A_k}\, x_{k+2} + \frac{f_{k+1} - a_{k+1} F_k}{b_{k+1} + a_{k+1} A_k}.
\]

Therefore, the coefficients A_k and F_k satisfy the following recurrence relations:

\[
A_{k+1} = -\frac{c_{k+1}}{b_{k+1} + a_{k+1} A_k}, \qquad
F_{k+1} = \frac{f_{k+1} - a_{k+1} F_k}{b_{k+1} + a_{k+1} A_k}, \quad k = 1, 2, \ldots, n-1. \tag{5.64}
\]
As such, the algorithm of solving system (5.59) gets split into two stages. At the first stage, we evaluate the coefficients A_k and F_k for k = 1, 2, ..., n using formulae (5.63) and (5.64). At the second stage, we solve back for the actual unknowns x_n, x_{n−1}, ..., x_1 using formulae (5.62) for k = n, n − 1, ..., 1.

In the literature, one can find several alternative names for the tri-diagonal Gaussian elimination procedure that we have described. Sometimes, the term marching is used. The first stage of the algorithm is also referred to as the forward stage or forward marching, when the marching coefficients A_k and F_k are computed. Accordingly, the second stage of the algorithm, when relations (5.62) are applied consecutively in the reverse order, is called backward marching.

We will now estimate the computational complexity of the tri-diagonal elimination. At the forward stage, the elimination according to formulae (5.63) and (5.64) requires O(n) arithmetic operations. At the backward stage, formula (5.62) is applied n times, which also requires O(n) operations. Altogether, the complexity of the tri-diagonal elimination is O(n) arithmetic operations. It is clear that no algorithm can be built that would be asymptotically cheaper than O(n), because the number of unknowns in the system is also O(n).

Let us additionally note that the tri-diagonal elimination is apparently the only example available in the literature of a direct method with linear complexity, i.e., of a method that produces the exact solution of a linear system at a cost of O(n) operations. In other words, the computational cost is directly proportional to the dimension of the system. We will later see examples of direct methods that produce the exact solution at a cost of O(n ln n) operations, and examples of iterative methods that cost O(n) operations but only produce an approximate solution. However, no other method of computing the exact solution with a genuinely linear complexity is known.

The algorithm can also be generalized to the case of banded matrices. Matrices of this type may contain nonzero entries on several neighboring diagonals, including the main diagonal. Normally we would assume that the number m of the nonzero
diagonals, i.e., the bandwidth, satisfies 3 ≤ m ≪ n. The complexity of the Gaussian elimination algorithm when applied to a banded system (5.44) is O(m^2 n) operations. If m is fixed and n is arbitrary, then the complexity, again, scales as O(n).

High-order systems of type (5.59) drew the attention of researchers in the fifties. They appeared when solving the heat equation numerically with the help of the so-called implicit finite-difference schemes. These schemes, their construction and their importance, will be discussed later in Part III of the book; see, in particular, Section 10.6. The foregoing tri-diagonal marching algorithm was apparently introduced for the first time by I. M. Gelfand and O. V. Lokutsievskii around the late forties or early fifties. They conducted a complete analysis of the algorithm, showed that it was computationally stable, and also built its continuous "closure"; see Appendix II to the book [GR64] written by Gelfand and Lokutsievskii. Alternatively, the tri-diagonal elimination algorithm is attributed to L. H. Thomas [Tho49], and is referred to as the Thomas algorithm.

The aforementioned work by Gelfand and Lokutsievskii was one of the first papers in the literature where the question of stability of a computational algorithm was accurately formulated and solved for a particular class of problems. This question has since become one of the key issues for the entire large field of knowledge called scientific computing. Having stability is crucial, as otherwise computer codes that implement the algorithms will not execute properly. In Part III of the book, we study computational stability for the finite-difference schemes.

Theorem 5.6, which provides sufficient conditions for applicability of the Gaussian elimination, is a generalization to the case of full matrices of the result by Gelfand and Lokutsievskii on stability of the tri-diagonal marching.
Note also that the conditions of strict diagonal dominance (5.60) that are sufficient for stability of the tri-diagonal elimination can actually be relaxed. In fact, one can only require that the coefficients of system (5.59) satisfy the inequalities:

\[
|b_1| \geq |c_1|, \qquad |b_k| \geq |a_k| + |c_k|, \quad k = 2, 3, \ldots, n-1, \qquad |b_n| \geq |a_n|, \tag{5.65}
\]

so that at least one out of the total of n inequalities (5.65) actually be strict, i.e., ">" rather than "≥". We refer the reader to [SN89a, Chapter II] for detail.

Overall, the idea of transporting, or marching, the condition b_1 x_1 + c_1 x_2 = f_1 specified by the first equation of the tri-diagonal system (5.59) is quite general. In the previous algorithm, this idea is put to use when obtaining the marching coefficients (5.63), (5.64) and relations (5.62). It is also exploited in many other elimination algorithms, see [SN89a, Chapter II]. We briefly describe one of those algorithms, known as the cyclic tri-diagonal elimination, in Section 5.4.3.
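The two-stage marching procedure (5.62)–(5.64) can be sketched as follows; the test system is an assumption chosen to satisfy the dominance conditions (5.60).

```python
# Sketch of the tri-diagonal (Thomas) algorithm of Section 5.4.2. The
# arrays a, b, c hold the sub-, main, and super-diagonal of system (5.59),
# with a[0] and c[n-1] unused. The forward stage computes the marching
# coefficients A_k and F_k by (5.63)-(5.64); the backward stage applies
# x_k = A_k x_{k+1} + F_k, formula (5.62), in the reverse order.

def thomas(a, b, c, f):
    n = len(b)
    A = [0.0] * n
    F = [0.0] * n
    A[0] = -c[0] / b[0]                              # (5.63)
    F[0] = f[0] / b[0]
    for k in range(1, n):
        denom = b[k] + a[k] * A[k - 1]
        A[k] = -c[k] / denom if k < n - 1 else 0.0   # A_n = 0
        F[k] = (f[k] - a[k] * F[k - 1]) / denom      # (5.64)
    x = [0.0] * n
    x[n - 1] = F[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = A[k] * x[k + 1] + F[k]                # (5.62)
    return x

# Diagonally dominant test system (conditions (5.60) hold with delta = 1).
a = [0.0, 1.0, 1.0, 1.0]
b = [3.0, 3.0, 3.0, 3.0]
c = [1.0, 1.0, 1.0, 0.0]
x_exact = [1.0, 2.0, -1.0, 0.5]
# Build f = Ax for the tri-diagonal A.
f = [b[0] * x_exact[0] + c[0] * x_exact[1],
     a[1] * x_exact[0] + b[1] * x_exact[1] + c[1] * x_exact[2],
     a[2] * x_exact[1] + b[2] * x_exact[2] + c[2] * x_exact[3],
     a[3] * x_exact[2] + b[3] * x_exact[3]]
print(thomas(a, b, c, f))  # close to [1.0, 2.0, -1.0, 0.5]
```

Both loops visit each unknown once, which makes the O(n) operation count of the marching procedure evident.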
5.4.3 Cyclic Tri-Diagonal Elimination
In many applications, one needs to solve a system which is "almost" tridiagonal, but is not quite equivalent to system (5.59):
\[
\begin{aligned}
b_1 x_1 + c_1 x_2 + a_1 x_n &= f_1, \\
a_2 x_1 + b_2 x_2 + c_2 x_3 &= f_2, \\
a_3 x_2 + b_3 x_3 + c_3 x_4 &= f_3, \\
&\ \vdots \\
a_{n-1} x_{n-2} + b_{n-1} x_{n-1} + c_{n-1} x_n &= f_{n-1}, \\
c_n x_1 + a_n x_{n-1} + b_n x_n &= f_n.
\end{aligned}
\tag{5.66}
\]
In Section 2.3.2, we have analyzed one particular example of this type that arises when constructing the nonlocal Schoenberg splines; see the matrix A given by formula (2.66) on page 52. Other typical examples include the so-called central difference schemes (see Section 9.2.1) for the solution of second order ordinary differential equations with periodic boundary conditions, as well as many schemes built for solving partial differential equations in the cylindrical or spherical coordinates. Periodicity of the boundary conditions gave rise to the name cyclic attached to the version of the tri-diagonal elimination that we are about to describe.

The coefficients a_1 and c_n in the first and last equations of system (5.66), respectively, are, generally speaking, nonzero. Their presence does not allow one to apply the tri-diagonal elimination algorithm of Section 5.4.2 to system (5.66) directly. Let us therefore consider two auxiliary linear systems of dimension (n − 1) × (n − 1):
\[
\begin{aligned}
b_2 u_2 + c_2 u_3 &= f_2, \\
a_3 u_2 + b_3 u_3 + c_3 u_4 &= f_3, \\
&\ \vdots \\
a_{n-1} u_{n-2} + b_{n-1} u_{n-1} + c_{n-1} u_n &= f_{n-1}, \\
a_n u_{n-1} + b_n u_n &= f_n,
\end{aligned}
\tag{5.67}
\]

and

\[
\begin{aligned}
b_2 v_2 + c_2 v_3 &= -a_2, \\
a_3 v_2 + b_3 v_3 + c_3 v_4 &= 0, \\
&\ \vdots \\
a_{n-1} v_{n-2} + b_{n-1} v_{n-1} + c_{n-1} v_n &= 0, \\
a_n v_{n-1} + b_n v_n &= -c_n.
\end{aligned}
\tag{5.68}
\]

Having obtained the solutions {u_2, u_3, ..., u_n} and {v_2, v_3, ..., v_n} to systems (5.67) and (5.68), respectively, we can represent the solution {x_1, x_2, ..., x_n} to system (5.66) in the form:

\[
x_i = u_i + x_1 v_i, \quad i = 1, 2, \ldots, n, \tag{5.69}
\]

where for convenience we additionally define u_1 = 0 and v_1 = 1. Indeed, multiplying each equation of system (5.68) by x_1, adding with the corresponding equation of system (5.67), and using representation (5.69), we immediately see that the equations number 2 through n of system (5.66) are satisfied. It only remains to satisfy equation number 1 of system (5.66). To do so, we use formula (5.69) for i = 2 and i = n, and
substitute x_2 = u_2 + x_1 v_2 and x_n = u_n + x_1 v_n into the first equation of system (5.66), which yields one scalar equation for x_1:

\[
b_1 x_1 + c_1 (u_2 + x_1 v_2) + a_1 (u_n + x_1 v_n) = f_1.
\]

As such, we find:

\[
x_1 = \frac{f_1 - a_1 u_n - c_1 u_2}{b_1 + a_1 v_n + c_1 v_2}. \tag{5.70}
\]
Altogether, the solution algorithm for system (5.66) reduces to first solving the two auxiliary systems (5.67) and (5.68), then finding x_1 with the help of formula (5.70), and finally obtaining x_2, x_3, ..., x_n according to formula (5.69). As both systems (5.67) and (5.68) are genuinely tri-diagonal, they can be solved by the original tri-diagonal elimination described in Section 5.4.2. In doing so, the overall computational complexity of solving system (5.66) obviously remains linear with respect to the dimension of the system n, i.e., O(n) arithmetic operations.
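The complete cyclic algorithm can be sketched as follows. A plain Thomas solver is reproduced here so that the sketch is self-contained; the test system is an assumption with nonzero corner coefficients, and the array indexing is 0-based, so x_1 of the text corresponds to x[0].

```python
# Sketch of the cyclic tri-diagonal elimination of Section 5.4.3: solve
# the two auxiliary tri-diagonal systems (5.67) and (5.68), recover x_1
# by formula (5.70), and the rest by x_i = u_i + x_1 v_i, formula (5.69).

def thomas(a, b, c, f):
    """Genuinely tri-diagonal solver; a[0] and c[-1] are unused."""
    n = len(b)
    A, F = [0.0] * n, [0.0] * n
    A[0], F[0] = -c[0] / b[0], f[0] / b[0]
    for k in range(1, n):
        d = b[k] + a[k] * A[k - 1]
        A[k] = -c[k] / d if k < n - 1 else 0.0
        F[k] = (f[k] - a[k] * F[k - 1]) / d
    x = [0.0] * n
    x[-1] = F[-1]
    for k in range(n - 2, -1, -1):
        x[k] = A[k] * x[k + 1] + F[k]
    return x

def cyclic_thomas(a, b, c, f):
    n = len(b)
    # Auxiliary (n-1)-dimensional systems in the unknowns x_2, ..., x_n.
    a2, b2, c2 = a[1:], b[1:], c[1:]
    u = thomas(a2, b2, c2, f[1:])                       # system (5.67)
    rhs_v = [-a[1]] + [0.0] * (n - 3) + [-c[n - 1]]
    v = thomas(a2, b2, c2, rhs_v)                       # system (5.68)
    # One scalar equation (5.70) for x_1.
    x1 = (f[0] - a[0] * u[-1] - c[0] * u[0]) / (b[0] + a[0] * v[-1] + c[0] * v[0])
    return [x1] + [ui + x1 * vi for ui, vi in zip(u, v)]  # formula (5.69)

# Test system; a[0] and c[n-1] are the corner coefficients multiplying
# x_n in the first equation and x_1 in the last one.
a = [0.5, 1.0, 1.0, 1.0]
b = [4.0, 4.0, 4.0, 4.0]
c = [1.0, 1.0, 1.0, 0.5]
x_exact = [1.0, -1.0, 2.0, 0.5]
f = [b[0] * x_exact[0] + c[0] * x_exact[1] + a[0] * x_exact[3],
     a[1] * x_exact[0] + b[1] * x_exact[1] + c[1] * x_exact[2],
     a[2] * x_exact[1] + b[2] * x_exact[2] + c[2] * x_exact[3],
     c[3] * x_exact[0] + a[3] * x_exact[2] + b[3] * x_exact[3]]
print(cyclic_thomas(a, b, c, f))  # close to [1.0, -1.0, 2.0, 0.5]
```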
5.4.4 Matrix Interpretation of the Gaussian Elimination. LU Factorization

Consider system (5.44) written as Ax = f, or alternatively, A^{(0)}x = f^{(0)}, where A^{(0)} = A, f^{(0)} = f, and the superscript "(0)" emphasizes that this is the beginning, i.e., the zeroth stage of the Gaussian elimination procedure. The notations in this section are somewhat different from those of Section 5.4.1, but we will later show the correspondence. At the first stage of the algorithm we assume that a_{11}^{(0)} ≠ 0 and introduce the transformation matrix:
TI
1 0 0 ... 0  t2 1 1 0 . . . 0 t 31 0 1 . . . 0 =
tni 0 0 . . .
i = 2,3,
O ( 1 ] � )I
where
1
. . . , fl .
Applying this matrix is equivalent to eliminating the variable X l from equations num ber 2 through n of system (5.44):
o o
(0) (0) a(0) l l a l 2I ) . . . a lnI x I a22( a2( n) X2 [ .. ' . .: .: . ' .: I I xn ( ( ) ) an2 . . . ann • • .
=
II 1 12( ) :.
n( 1 )
=1
(1 ) = T
J
(0) .
At the second stage, we define:

T_2 = [ 1   0        0  ...  0
        0   1        0  ...  0
        0  -t_{32}   1  ...  0
        ...
        0  -t_{n2}   0  ...  1 ],    where  t_{i2} = a_{i2}^(1) / a_{22}^(1),   i = 3, 4, ..., n,

and eliminate x_2 from equations 3 through n, thus obtaining the system A^(2) x = f^(2), where

A^(2) = T_2 A^(1) = [ a_{11}^(0)  a_{12}^(0)  a_{13}^(0)  ...  a_{1n}^(0)
                      0           a_{22}^(1)  a_{23}^(1)  ...  a_{2n}^(1)
                      0           0           a_{33}^(2)  ...  a_{3n}^(2)
                      ...
                      0           0           a_{n3}^(2)  ...  a_{nn}^(2) ]

and f^(2) = T_2 f^(1).

In general, at stage number k we have the transformation matrix:

T_k = [ 1  ...  0            0  ...  0
        ...
        0  ...  1            0  ...  0
        0  ...  -t_{k+1,k}   1  ...  0
        ...
        0  ...  -t_{n,k}     0  ...  1 ],    where  t_{i,k} = a_{i,k}^(k-1) / a_{k,k}^(k-1),   i = k+1, ..., n,        (5.71)

and the system A^(k) x = f^(k), where:

A^(k) = T_k A^(k-1) = [ a_{11}^(0)  ...  a_{1,k}^(0)    a_{1,k+1}^(0)    ...  a_{1,n}^(0)
                        ...
                        0           ...  a_{k,k}^(k-1)  a_{k,k+1}^(k-1)  ...  a_{k,n}^(k-1)
                        0           ...  0              a_{k+1,k+1}^(k)  ...  a_{k+1,n}^(k)
                        ...
                        0           ...  0              a_{n,k+1}^(k)    ...  a_{n,n}^(k) ]

and f^(k) = T_k f^(k-1).

Performing all n-1 stages of the elimination, we arrive at the system:

A^(n-1) x = f^(n-1),   where   f^(n-1) = T_{n-1} T_{n-2} ... T_1 f^(0).
The resulting matrix A^(n-1) is upper triangular, and we redenote it by U. All entries of this matrix below its main diagonal are equal to zero:

A^(n-1) ≡ U = [ a_{11}^(0)  ...  a_{1,k}^(0)    a_{1,k+1}^(0)    ...  a_{1,n}^(0)
                ...
                0           ...  a_{k,k}^(k-1)  a_{k,k+1}^(k-1)  ...  a_{k,n}^(k-1)
                0           ...  0              a_{k+1,k+1}^(k)  ...  a_{k+1,n}^(k)
                ...
                0           ...  0              0                ...  a_{n,n}^(n-1) ].        (5.72)

The entries on the main diagonal, a_{k,k}^(k-1), k = 1, 2, ..., n, are called pivots, and none of them may turn into zero, as otherwise the standard Gaussian elimination procedure will fail. Solving the system Ux = f^(n-1) ≡ g is a fairly straightforward task, as we have seen in Section 5.4.1. It amounts to computing the values of all the unknowns one after another in the reverse order, i.e., starting from x_n = f_n^(n-1) / a_{n,n}^(n-1), then x_{n-1} = (f_{n-1}^(n-2) - a_{n-1,n}^(n-2) x_n) / a_{n-1,n-1}^(n-2), etc., and all the way up to x_1.

We will now show that the representation U = T_{n-1} T_{n-2} ... T_1 A implies that A = LU, where L is a lower triangular matrix, i.e., a matrix that only has zero entries above its main diagonal. In other words, we will need to demonstrate that L ≡ (T_{n-1} T_{n-2} ... T_1)^{-1} = T_1^{-1} T_2^{-1} ... T_{n-1}^{-1} is a lower triangular matrix. Then, the formula

A = LU        (5.73)
will be referred to as an LU factorization of the matrix A. One can easily verify by direct multiplication that the inverse matrix for a given T_k, k = 1, 2, ..., n-1, see formula (5.71), is:

T_k^{-1} = [ 1  ...  0           0  ...  0
             ...
             0  ...  1           0  ...  0
             0  ...  t_{k+1,k}   1  ...  0
             ...
             0  ...  t_{n,k}     0  ...  1 ].        (5.74)
It is also clear that if the meaning of the operation rendered by T_k is "take equation number k, multiply it consecutively by t_{k+1,k}, ..., t_{n,k}, and subtract from equations number k+1 through n, respectively," then the meaning of the inverse operation has to be "take equation number k, multiply it consecutively by the factors t_{k+1,k}, ..., t_{n,k}, and add to equations number k+1 through n, respectively," which is precisely what the foregoing matrix T_k^{-1} does.

We see that all the matrices T_k^{-1}, k = 1, 2, ..., n-1, given by formula (5.74) are lower triangular matrices with the diagonal entries equal to one. Their product can be calculated directly, which yields:

L = T_1^{-1} T_2^{-1} ... T_{n-1}^{-1} = [ 1          0          ...               0  ...  0
                                           t_{2,1}    1          ...               0  ...  0
                                           ...
                                           t_{k,1}    t_{k,2}    ...  1            0  ...  0
                                           t_{k+1,1}  t_{k+1,2}  ...  t_{k+1,k}    1  ...  0
                                           ...
                                           t_{n,1}    t_{n,2}    ...  t_{n,k}      ...     1 ].        (5.75)

Consequently, the matrix L is indeed a lower triangular matrix (with all its diagonal entries equal to one), and the factorization formula (5.73) holds.

The LU factorization of the matrix A allows us to analyze the computational complexity of the Gaussian elimination algorithm as it applies to solving multiple linear systems that have the same matrix A but different right-hand sides. The cost of obtaining the factorization itself, i.e., that of computing the matrix U, is cubic: O(n^3) arithmetic operations. This factorization obviously stays the same when the right-hand side changes. For a given right-hand side f, we need to solve the system LUx = f. This amounts to first solving the system Lg = f with the lower triangular matrix L of (5.75) and obtaining g = L^{-1} f = T_{n-1} T_{n-2} ... T_1 f, and then solving the system Ux = g with the upper triangular matrix U of (5.72) and obtaining x. The cost of either solution is O(n^2) operations. Consequently, once the LU factorization has been built, each additional right-hand side can be accommodated at a quadratic cost.

In particular, consider the problem of finding the inverse matrix A^{-1} using Gaussian elimination. By definition, AA^{-1} = I. In other words, each column of the matrix A^{-1} is the solution to the system Ax = f with the right-hand side f equal to the corresponding column of the identity matrix I. Altogether, there are n columns, each adding an O(n^2) solution cost to the O(n^3) initial cost of the LU factorization that is performed only once ahead of time. We therefore conclude that the overall cost of computing A^{-1} using Gaussian elimination is also cubic: O(n^3) operations.

Finally, let us note that for a given matrix A, its LU factorization is, generally speaking, not unique. The procedure that we have described yields a particular form of the LU factorization (5.73), defined by an additional constraint that all diagonal entries of the matrix L of (5.75) be equal to one.
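The two-phase use of the factorization described above — build L and U once at cubic cost, then handle each right-hand side at quadratic cost — can be sketched as follows. This is an illustrative dense implementation without pivoting, assuming all pivots are nonzero; it is not code from the book.

```python
import numpy as np

def lu_factor(A):
    """LU factorization with L unit lower triangular, cf. (5.75), and U upper
    triangular, cf. (5.72); assumes all pivots are nonzero (no pivoting)."""
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            t = U[i, k] / U[k, k]      # the multiplier t_{i,k} of (5.71)
            L[i, k] = t                # L collects the multipliers, see (5.75)
            U[i, k:] -= t * U[k, k:]   # action of T_k on row i
    return L, U

def lu_solve(L, U, f):
    """Solve LUx = f at O(n^2) cost: forward substitution for Lg = f,
    then back substitution for Ux = g."""
    n = len(f)
    g = np.zeros(n)
    for i in range(n):
        g[i] = f[i] - L[i, :i] @ g[:i]               # L has unit diagonal
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (g[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x
```

With L and U in hand, every additional right-hand side costs only the two triangular sweeps in `lu_solve`.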
Instead, we could have required, for example, that the diagonal entries of U be equal to one. Then, the matrices T_k of (5.71) get replaced by:

T̃_k = [ 1  ...  0            0  ...  0
        ...
        0  ...  t̃_{k,k}      0  ...  0
        0  ...  -t_{k+1,k}   1  ...  0
        ...
        0  ...  -t_{n,k}     0  ...  1 ],    where  t̃_{k,k} = 1 / a_{k,k}^(k-1),        (5.76)

and instead of (5.72) we have:

Ũ = [ 1  ...  ...               ...
      0  ...  1   -a_{k,k+1}    ...  -a_{k,n}
      0  ...  0    1            ...  -a_{k+1,n}
      ...
      0  ...  0    0            ...   1 ],        (5.77)

where the off-diagonal entries of the matrix Ũ are the same as those introduced in the beginning of Section 5.4.1. The matrices T̃_k^{-1} and L̃ are obtained accordingly; this is the subject of Exercise 1 after this section.
5.4.5
Cholesky Factorization

If A is a symmetric positive definite (SPD) matrix (with real entries), A = A^T > 0, then it admits a special form of LU factorization, namely:

A = L L^T,        (5.78)

where L is a lower triangular matrix with positive entries on its main diagonal. Factorization (5.78) is known as the Cholesky factorization. The proof of formula (5.78), which uses induction with respect to the dimension of the matrix, can be found, e.g., in [GL81, Chapter 2]. Moreover, one can derive explicit formulae for the entries of the matrix L:

l_{jj} = ( a_{jj} - Σ_{k=1}^{j-1} l_{jk}^2 )^{1/2},   j = 1, 2, ..., n,
l_{ij} = ( a_{ij} - Σ_{k=1}^{j-1} l_{ik} l_{jk} ) / l_{jj},   i = j+1, j+2, ..., n.

The computational complexity of obtaining the Cholesky factorization (5.78) is also cubic with respect to the dimension of the matrix: O(n^3) arithmetic operations. However, the actual cost is roughly one half of that of the standard Gaussian elimination: n^3/3 vs. 2n^3/3 arithmetic operations. This reduction of the cost is the key benefit of using the Cholesky factorization, as opposed to the standard Gaussian elimination, for large symmetric matrices. The reduction of the cost is only enabled by the special SPD structure of the matrix A; there is no Cholesky factorization for general matrices, and one has to resort back to the standard LU. This is one of many examples in numerical analysis where a more efficient computational procedure can be obtained for a narrower class of objects that define the problem.
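The explicit entry formulae translate directly into code. Below is a minimal sketch for the real SPD case, following the same column-by-column order as the formulae above (an illustration, not the book's code).

```python
import numpy as np

def cholesky(A):
    """Cholesky factorization A = L L^T of a real SPD matrix, using the
    explicit entry formulae; L is lower triangular with positive diagonal."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: l_jj = sqrt(a_jj - sum_k l_jk^2)
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        for i in range(j + 1, n):
            # off-diagonal: l_ij = (a_ij - sum_k l_ik l_jk) / l_jj
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L
```

Only the lower triangle of A is ever read, which is one source of the factor-of-two saving over general elimination.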
5.4.6
Gaussian Elimination with Pivoting

The standard Gaussian procedure of Section 5.4.1 fails if a pivotal entry in the matrix appears equal to zero. Consider, for example, the following linear system:

x_1 + 2x_2 + 3x_3 = 1,
2x_1 + 4x_2 + 5x_3 = 2,
7x_1 + 8x_2 + 9x_3 = 3.

After the first stage of elimination we obtain:

x_1 + 2x_2 + 3x_3 = 1,
0·x_1 + 0·x_2 - x_3 = 0,
0·x_1 - 6x_2 - 12x_3 = -4.

The pivotal entry in the second equation of this system, i.e., the coefficient in front of x_2, is equal to zero. Therefore, this equation cannot be used for eliminating x_2 from the third equation. As such, the algorithm cannot proceed any further. The problem, however, can be easily fixed by changing the order of the equations:

x_1 + 2x_2 + 3x_3 = 1,
    - 6x_2 - 12x_3 = -4,
             - x_3 = 0,

and we see that the system is already reduced to the upper triangular form.

Different strategies of pivoting are designed to help overcome or alleviate the difficulties that the standard Gaussian elimination may encounter: division by zero and the loss of accuracy due to an instability. Pivoting exploits pretty much the same idea as outlined in the previous simple example: changing the order of equations and/or the order of unknowns in the equations. In the case det A ≠ 0, the elimination procedure with pivoting guarantees that there will be no division by zero, and also improves the computational stability compared to the standard Gaussian elimination. We will describe partial pivoting and complete pivoting.

When performing partial pivoting for a nonsingular system, we start with finding the entry with the maximum absolute value in the first column of the matrix A (if several entries have the same largest absolute value, we take one of those). Assume that this is the entry a_{i1}; clearly, a_{i1} ≠ 0, because otherwise det A = 0. Then, we change the order of the equations in the system: the first equation becomes equation number i, and equation number i becomes the first equation. The system after this operation obviously remains equivalent to the original one. Finally, one step of the standard Gaussian elimination is performed, see Section 5.4.1. In doing so, the entries t_{21}, t_{31}, ..., t_{n1} of the matrix T_1, see formula (5.71), will all have absolute values less than one. This improves stability, because multiplication of the subtrahend by a quantity smaller than one before subtraction helps reduce the
effect of the round-off errors. Having eliminated the variable x_1, we apply the same approach to the resulting smaller system of order (n-1) x (n-1). Namely, among all the equations we first find the equation that has the maximum absolute value of the coefficient in front of x_2. We then change the order of the equations so that this coefficient moves to the position of the pivot, and only after that eliminate x_2.

Complete pivoting is similar to partial pivoting, except that at every stage of elimination the entry with the maximum absolute value is sought not only in the first column, but across the entire current-stage matrix. Once the maximum entry has been determined, it needs to be moved to the position of the pivot a_{k,k}^(k-1). This is achieved by the appropriate permutations of both the equations and the unknowns. It is clear that the system after the permutations remains equivalent to the original one.

Computational complexity of Gaussian elimination with pivoting remains cubic with respect to the dimension of the system: O(n^3) arithmetic operations. Of course, this estimate is only asymptotic, and the actual constants for the algorithm with partial pivoting are larger than those for the standard Gaussian elimination. This is the payoff for its improved robustness. The algorithm with complete pivoting is even more expensive than the algorithm with partial pivoting, but it is typically more robust as well. Note that for the same reason of improved robustness, the use of Gaussian elimination with pivoting can be recommended in practice, even for those systems for which the standard algorithm does not fail.

Gaussian elimination with pivoting can be put into the matrix framework in much the same way as it has been done in Section 5.4.4 for the standard Gaussian elimination. Moreover, a very similar result on the LU factorization can be obtained; this is the subject of Exercise 3. To do so, one needs to exploit the so-called permutation matrices, i.e., the matrices that change the order of rows or columns in a given matrix when applied as multipliers, see [HJ85, Section 0.9]. For example, to swap the second and the third equations in the example analyzed in the beginning of this section, one would need to multiply the system matrix from the left by the permutation matrix:

P = [ 1 0 0
      0 0 1
      0 1 0 ].
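A minimal dense-matrix sketch of partial pivoting (an illustration, not the book's code); run on the 3 x 3 example above, it succeeds where the standard procedure stalls on a zero pivot.

```python
import numpy as np

def gauss_partial_pivot(A, f):
    """Gaussian elimination with partial pivoting: at stage k, the row with the
    largest |entry| in column k (rows k..n-1) is swapped into the pivot
    position, so every multiplier has absolute value at most one."""
    A, f = A.astype(float).copy(), f.astype(float).copy()
    n = len(f)
    for k in range(n - 1):
        m = k + np.argmax(np.abs(A[k:, k]))   # row of the maximal pivot candidate
        if m != k:                            # swap equations k and m
            A[[k, m]], f[[k, m]] = A[[m, k]], f[[m, k]]
        for i in range(k + 1, n):
            t = A[i, k] / A[k, k]             # |t| <= 1 by the choice of pivot
            A[i, k:] -= t * A[k, k:]
            f[i] -= t * f[k]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):            # back substitution
        x[i] = (f[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x
```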
5.4.7
An Algorithm with a Guaranteed Error Estimate
As has already been discussed, all numbers in a computer are represented as fractions with a finite fixed number of significant digits. Consequently, many actual numbers have to be truncated, or in other words, rounded, before they can be stored in the computer memory, see Section 1.3.3. As such, when performing computations on a real machine with a fixed number of significant digits, along with the effect of inaccuracies in the input data, round-off errors are introduced at every arithmetic operation. For linear systems, the impact of these round-off errors depends on the number of significant digits, on the condition number of the matrix, as well as on the particular algorithm chosen for the solution. In work [GAKK93], a family of algorithms has been proposed that directly take into account the effect of rounding on a given computer. These algorithms either produce the result with a guaranteed accuracy, or otherwise determine in the course of computations that the system is conditioned so poorly that no accuracy can be guaranteed when computing its solution on the machine with a given number of significant digits.

Exercises
1. Write explicitly the matrices T̃_k^{-1} and L̃ that correspond to the LU decomposition of A with T̃_k and Ũ defined by formulae (5.76) and (5.77), respectively.

2. Compute the solution of the 2 x 2 system:

10^{-3} x + y = 5,
x - y = 6,

using standard Gaussian elimination and Gaussian elimination with pivoting. Conduct all computations with two significant (decimal) digits. Compare and explain the results.

3.* Show that when performing Gaussian elimination with partial pivoting, the LU factorization is obtained in the form: PA = LU, where P is the composition (i.e., product) of all permutation matrices used at every stage of the algorithm.

4. Consider a boundary value problem for the second order ordinary differential equation:

d^2 u / dx^2 - u = f(x),   x ∈ [0, 1],
u(0) = 0,   u(1) = 0,
where u = u(x) is the unknown function and f = f(x) is a given right-hand side. To solve this problem numerically, we first partition the interval 0 ≤ x ≤ 1 into N equal subintervals and thus build a uniform grid of N+1 nodes: x_j = jh, h = 1/N, j = 0, 1, 2, ..., N. Then, instead of looking for the continuous function u = u(x), we will be looking for its approximate table of values {u_0, u_1, u_2, ..., u_N} at the grid nodes x_0, x_1, x_2, ..., x_N, respectively. At every interior node x_j, j = 1, 2, ..., N-1, we approximately replace the second derivative by the difference quotient (for more detail, see Section 9.2):

d^2 u / dx^2 |_{x = x_j}  ≈  ( u_{j+1} - 2u_j + u_{j-1} ) / h^2,

and arrive at the following finite-difference counterpart of the original problem (a central-difference scheme):

( u_{j+1} - 2u_j + u_{j-1} ) / h^2 - u_j = f_j,   j = 1, 2, ..., N-1,
u_0 = 0,   u_N = 0,

where f_j = f(x_j) is the discrete right-hand side assumed given, and u_j is the unknown discrete solution.

a) Write down the previous scheme as a system of linear algebraic equations (N-1 equations with N-1 unknowns) both in the canonical form and in an operator form, with the operator specified by an (N-1) x (N-1) matrix.

b) Show that the system is tridiagonal and satisfies the sufficient conditions of Section 5.4.2, so that the tridiagonal elimination can be applied.

c) ... that the exact solution is known: u(x) = sin(πx)e ...

5.5
Minimization of Quadratic Functions and Its Relation to Linear Systems

Consider a quadratic function of the vector argument x ∈ ℝ^n:

F(x) = (Ax, x) - 2(f, x) + c,        (5.79)

where (·,·) is a scalar product on ℝ^n, f ∈ ℝ^n is a fixed vector, c is a real constant, and the operator A is self-adjoint and positive definite: A = A* > 0 in formula (5.79), i.e., (Ax, x) > 0 for any x ∈ ℝ^n, x ≠ 0. Given the function F(x) of (5.79), let us formulate
the problem of its minimization, i.e., the problem of finding a particular element z ∈ ℝ^n that delivers the minimum value to F(x):

F(z) = min_{x ∈ ℝ^n} F(x).        (5.80)
It turns out that the minimization problem (5.80) and the problem of solving the system of linear algebraic equations:

Ax = f,   where   A = A* > 0,        (5.81)

are equivalent. We formulate this result in the form of a theorem.

THEOREM 5.7
Let A = A* > 0. There is one and only one element z ∈ ℝ^n that delivers a minimum to the quadratic function F(x) of (5.79). This vector z is the solution to the linear system (5.81).

PROOF   As the operator A is positive definite, it is nonsingular. Consequently, system (5.81) has a unique solution z ∈ ℝ^n. Let us now show that for any ξ ∈ ℝ^n, ξ ≠ 0, we have F(z + ξ) > F(z):

F(z + ξ) = (A(z + ξ), z + ξ) - 2(f, z + ξ) + c
         = [(Az, z) - 2(f, z) + c] + 2(Az, ξ) - 2(f, ξ) + (Aξ, ξ)
         = F(z) + 2(Az - f, ξ) + (Aξ, ξ) = F(z) + (Aξ, ξ) > F(z).

Consequently, the solution z of system (5.81) indeed delivers a minimum to the function F(x), because any nonzero deviation ξ implies a larger value of the function. □

Equivalence of problems (5.80) and (5.81) established by Theorem 5.7 allows one to reduce the solution of either problem to the solution of the other. Note that linear systems of type (5.81) that have a self-adjoint positive definite matrix represent an important class of systems. Indeed, a general linear system
Ax = f,

where A : ℝ^n ↦ ℝ^n is an arbitrary nonsingular operator, can be easily reduced to a new system of type (5.81): Cx = g, where C = C* > 0. To do so, one merely needs to set C = A*A and g = A*f. This approach, however, may only have a theoretical significance, because in practice the extra matrix multiplication, performed on a machine with finite precision, is prone to generating large additional errors.
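The reduction is easy to examine numerically. In the sketch below A* is taken as A^T (the real Euclidean inner product is assumed, and the matrix is randomly generated purely for illustration); the last line points at the caveat: the spectral condition number of C = A^T A is the square of that of A.

```python
import numpy as np

# Reduce a general nonsingular system Ax = f to an equivalent system Cx = g
# with a self-adjoint positive definite matrix: C = A^T A, g = A^T f.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
f = rng.standard_normal(5)

C = A.T @ A                # C = C^T, and (Cx, x) = |Ax|^2 > 0 for x != 0
g = A.T @ f
x = np.linalg.solve(C, g)  # also solves the original system Ax = f

# The practical caveat: cond(C) = cond(A)^2 in the 2-norm, which is why
# this reduction can amplify round-off errors.
cond_ratio = np.linalg.cond(C) / np.linalg.cond(A) ** 2
```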
However, linear systems with self-adjoint matrices can also arise naturally. For example, many boundary value problems for elliptic partial differential equations can be interpreted as the Lagrange-Euler problems for minimizing certain quadratic functionals. It is therefore natural to expect that a "correct," i.e., appropriate, discretization of the corresponding variational problem will lead to a problem of minimizing a quadratic function on a finite-dimensional space. The latter problem will, in turn, be equivalent to the linear system (5.81) according to Theorem 5.7.

Exercises
1. Consider a quadratic function of two scalar arguments x_1 and x_2:

F(x_1, x_2) = x_1^2 + 2 x_1 x_2 + 4 x_2^2 - 2 x_1 + 3 x_2 + 5.

Recast this function in the form (5.79) as a function of the vector argument x = [x_1, x_2]^T that belongs to the Euclidean space ℝ^2 supplied with the inner product (x, y) = x_1 y_1 + x_2 y_2. Verify that A = A* > 0, and solve the corresponding problem (5.81).

2.* Recast the function F(x_1, x_2) from the previous exercise in the form F(x) = [Ax, x]_B - 2[f, x]_B + c, where [x, y]_B = (Bx, y), B is a fixed symmetric positive definite 2 x 2 matrix, and A = A* in the sense of the inner product [x, y]_B.

5.6
The Method of Conjugate Gradients
Consider a system of linear algebraic equations:

Ax = f,   A = A* > 0,   f ∈ ℝ^n,   x ∈ ℝ^n,        (5.82)

where ℝ^n is an n-dimensional real vector space with the scalar product (x, y), and A : ℝ^n ↦ ℝ^n is a self-adjoint positive definite operator with respect to this product. To introduce the method of conjugate gradients, we will use the equivalence of system (5.82) and the problem of minimizing the quadratic function F(x) = (Ax, x) - 2(f, x) that was established in Section 5.5, see Theorem 5.7.

5.6.1
Construction of the Method

At the core of the method is the sequence of vectors x^(p) ∈ ℝ^n, p = 0, 1, 2, ..., which is to be built so that it converges to the vector that minimizes the value of F(x). Along with this sequence, the method requires constructing another sequence of vectors d^(p) ∈ ℝ^n, p = 0, 1, 2, ..., that provide the so-called descent directions. In other words, we define x^(p+1) = x^(p) + α_p d^(p), and choose the scalar α_p in order to minimize the composite function F(x^(p) + α d^(p)).
Note that the original function F(x) = (Ax, x) - 2(f, x) + c is a function of the vector argument x ∈ ℝ^n. Given two fixed vectors x^(p) ∈ ℝ^n and d^(p) ∈ ℝ^n, we can consider the function F(x^(p) + α d^(p)), which is a function of the scalar argument α. Let us find the value of α that delivers a minimum to this function:

dF/dα = (A d^(p), x^(p) + α d^(p)) + (A (x^(p) + α d^(p)), d^(p)) - 2(f, d^(p))
      = (A d^(p), x^(p)) + (A x^(p), d^(p)) + 2α (A d^(p), d^(p)) - 2(f, d^(p))
      = 2 (A x^(p), d^(p)) + 2α (A d^(p), d^(p)) - 2(f, d^(p)) = 0.

Consequently,

α_p = argmin_α F(x^(p) + α d^(p)) = - (A x^(p) - f, d^(p)) / (A d^(p), d^(p)) = - (r^(p), d^(p)) / (A d^(p), d^(p)),        (5.83)

where the vector quantity r^(p) := A x^(p) - f is called the residual.

Consider x^(p+1) = x^(p) + α_p d^(p), where α_p is given by (5.83), and introduce F(x^(p+1) + α d^(p)). Clearly, argmin_α F(x^(p+1) + α d^(p)) = 0, i.e., (d/dα) F(x^(p+1) + α d^(p)) |_{α=0} = 0. Consequently, according to formula (5.83), we will have (r^(p+1), d^(p)) = 0, i.e., the residual r^(p+1) = A x^(p+1) - f of the next vector x^(p+1) will be orthogonal to the previous descent direction d^(p).

Altogether, the sequence of vectors x^(p), p = 1, 2, ..., will be built precisely so as to maintain orthogonality of the residuals to all the previous descent directions, rather than only to the immediately preceding direction. Let d be given and assume that (r^(p), d) = 0, where r^(p) = A x^(p) - f. Let also x^(p+1) = x^(p) + g, and require that (r^(p+1), d) = 0. Since r^(p+1) = A x^(p+1) - f = A g + r^(p), we obtain (A g, d) = 0. In other words, we need to require that the increment g = x^(p+1) - x^(p) be orthogonal to d in the sense of the inner product [g, d]_A = (A g, d). The vectors that satisfy the equality [g, d]_A = 0 are often called A-orthogonal or A-conjugate. We therefore see that to maintain the orthogonality of r^(p) to the previous descent directions, the latter must be mutually A-conjugate.

The descent directions are directions for the one-dimensional minimization of the function F, see formula (5.83). As minimization of multivariable functions is traditionally performed along their local gradients, the descent directions that we are going to use are called conjugate gradients. This is where the method draws its name from, even though none of these directions except d^(0) will actually be a gradient of F.

Define d^(0) = r^(0) = A x^(0) - f, where x^(0) ∈ ℝ^n is arbitrary, and set

d^(p+1) = r^(p+1) - β_p d^(p),   p = 0, 1, 2, ...,        (5.84)

where the real scalars β_p are to be chosen so as to obtain the A-conjugate descent directions:

[d^(j), d^(p+1)]_A = 0,   j = 0, 1, ..., p.        (5.85)

For j = p, this immediately yields the following value of β_p:

β_p = (A d^(p), r^(p+1)) / (A d^(p), d^(p)),   p = 0, 1, 2, ....        (5.86)
Next, we will use induction with respect to p to show that [d^(j), d^(p+1)]_A = 0 for j = 0, 1, ..., p-1. Indeed, if p = 0, then we only need to guarantee [d^(0), d^(1)]_A = 0, and we simply choose β_0 = (A d^(0), r^(1)) / (A d^(0), d^(0)) as prescribed by formula (5.86). Next, assume that the desired property has already been established for all integers 0, 1, ..., p, and we want to justify it for p+1. In other words, suppose that [d^(j), d^(p)]_A = 0 for j = 0, 1, ..., p-1 and for all integer p between 0 and some prescribed positive value. Then we can write:

[d^(j), d^(p+1)]_A = [d^(j), (r^(p+1) - β_p d^(p))]_A = [d^(j), r^(p+1)]_A,        (5.87)

because [d^(j), d^(p)]_A = 0 for j < p. In its own turn, the sequence x^(p), p = 0, 1, 2, ..., is built so that the residuals be orthogonal to all previous descent directions, i.e., (d^(j), r^(p+1)) = 0, j = 0, 1, 2, ..., p. We therefore conclude that the residual r^(p+1) is orthogonal to the entire linear span of the p+1 descent vectors: span{d^(0), d^(1), ..., d^(p)}. According to the assumption of the induction, all these vectors are A-conjugate and as such, linearly independent. Moreover, we have:

r^(0) = d^(0),
r^(1) = d^(1) + β_0 d^(0),
. . .
r^(p) = d^(p) + β_{p-1} d^(p-1),        (5.88)

and consequently, the residuals r^(0), r^(1), ..., r^(p) are also linearly independent, and span{d^(0), d^(1), ..., d^(p)} = span{r^(0), r^(1), ..., r^(p)}. In other words,

r^(p+1) ⊥ span{r^(0), r^(1), ..., r^(p)}.        (5.89)

Finally, for every j = 0, 1, ..., p-1 we can write, starting from (5.87):

[d^(j), r^(p+1)]_A = (A d^(j), r^(p+1)) = (1/α_j) (A (x^(j+1) - x^(j)), r^(p+1)) = (1/α_j) ((r^(j+1) - r^(j)), r^(p+1)) = 0        (5.90)

because of the relation (5.89). Consequently, formulae (5.87), (5.90) imply (5.85). This completes the induction-based argument and shows that the descent directions chosen according to (5.84), (5.86) are indeed mutually A-orthogonal.

Altogether, the method of conjugate gradients for solving system (5.82) can now be summarized as follows. We start with an arbitrary x^(0) ∈ ℝ^n and simultaneously build two sequences of vectors in the space ℝ^n: {x^(p)} and {d^(p)}, p = 0, 1, 2, .... The first descent direction is chosen as d^(0) = r^(0) = A x^(0) - f. Then, for every p = 0, 1, 2, ... we compute:
x^(p+1) = x^(p) + α_p d^(p),   α_p = - (r^(p), d^(p)) / (A d^(p), d^(p)),        (5.91a)
d^(p+1) = r^(p+1) - β_p d^(p),   β_p = (A d^(p), r^(p+1)) / (A d^(p), d^(p)),        (5.91b)

while in between the stages (5.91a) and (5.91b) we also evaluate the residual r^(p+1) = A x^(p+1) - f.
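Formulae (5.91) can be transcribed directly into code. In the sketch below the operator is supplied as a routine `matvec` computing y → Ay rather than as a stored matrix (anticipating Section 5.6.2); the names and the early-exit tolerance are our illustrative choices.

```python
import numpy as np

def conjugate_gradients(matvec, f, x0, n_steps):
    """The method (5.91): matvec is any routine computing y -> Ay, where A is
    assumed self-adjoint positive definite with respect to the Euclidean
    inner product. In exact arithmetic, n steps yield the exact solution."""
    x = x0.astype(float)
    r = matvec(x) - f                 # r^(0) = A x^(0) - f
    d = r.copy()                      # d^(0) = r^(0)
    for _ in range(n_steps):
        if r @ r < 1e-28:             # residual already (numerically) zero
            break
        Ad = matvec(d)
        alpha = -(r @ d) / (Ad @ d)   # (5.91a)
        x = x + alpha * d
        r = matvec(x) - f             # residual of the new iterate
        beta = (Ad @ r) / (Ad @ d)    # (5.91b)
        d = r - beta * d
    return x
```

Passing a closure such as `lambda y: A @ y` recovers the dense-matrix case, but nothing in the routine requires A to be stored explicitly.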
Note that in practice the descent vectors d^(p) can be eliminated, and the formulae of the method can be rewritten using only x^(p) and the residuals r^(p); this is the subject of Exercise 2.

THEOREM 5.8
There is a positive integer number p_0 ≤ n such that the vector x^(p_0) obtained by formulae (5.91) of the conjugate gradients method yields the exact solution of the linear system (5.82).
PROOF   By construction of the method, all the descent directions d^(p) are A-conjugate and as such, are linearly independent. Clearly, any linearly independent system in the n-dimensional space ℝ^n has no more than n vectors. Consequently, there may be at most n different descent directions: {d^(0), d^(1), ..., d^(n-1)}. According to formulae (5.88), the residuals r^(p) are also linearly independent, and we have established that each subsequent residual r^(p+1) is orthogonal to the linear span of all the previous ones, see formula (5.89). Therefore, there may be two scenarios. Either at some intermediate step p_0 < n it so happens that r^(p_0) = 0, i.e., A x^(p_0) = f, or at the step number n we have:

r^(n) ⊥ span{r^(0), r^(1), ..., r^(n-1)}.

Hence r^(n) is orthogonal to all available linearly independent directions in the space ℝ^n, and as such, r^(n) = 0. Consequently, p_0 = n and A x^(n) = f. □

In simple words, the method of conjugate gradients minimizes the quadratic function F(x) = (Ax, x) - 2(f, x) along a sequence of linearly independent (A-orthogonal) directions in the space ℝ^n. When the pool of such directions is exhausted (altogether there may be no more than n), the method yields a global minimum, which coincides with the exact solution of system (5.82) according to Theorem 5.7. According to Theorem 5.8, the method of conjugate gradients is a direct method. Additionally, we should note that the sequence of vectors x^(0), x^(1), ..., x^(p), ... provides increasingly more accurate approximations to the exact solution when the number p increases. For a given small ε > 0, the error ||x - x^(p)|| of the approximation x^(p) may become smaller than ε already for p << n, see Section 6.2.2. That is why the method of conjugate gradients is mostly used in the capacity of an iterative method these days,⁵ especially when the condition number μ(A) is moderate and the dimension n of the system is large.

When the dimension of the system n is large, computing the exact solution of system (5.82) with the help of conjugate gradients may require a large number of steps (5.91a), (5.91b). If the situation is also "exacerbated" by a large condition number μ(A), the computation may encounter an obstacle: numerical stability can be lost when obtaining the successive approximations x^(p). However, when the condition number μ(A) is not excessively large, the method of conjugate gradients has an important advantage compared to the elimination methods of Section 5.4. We comment on this advantage in Section 5.6.2.

⁵As opposed to direct methods that yield the exact solution after a finite number of operations, iterative methods provide a sequence of increasingly accurate approximations to the exact solution, see Chapter 6.

5.6.2
Flexibility in Specifying the Operator A
From formulae (5.91a), (5.91b) it is clear that to implement the method of conjugate gradients we only need to be able to perform two fundamental operations: for a given y ∈ ℝ^n compute z = Ay, and for a given pair of vectors y, z ∈ ℝ^n compute their scalar product (y, z). Consequently, the system (5.82) does not necessarily have to be specified in its canonical form. It is not even necessary to store the entire matrix A that has n^2 entries, whereas the vectors y and z = Ay in their coordinate form are represented by only n real numbers each. In fact, the method is very well suited for the case when the operator A is defined as a procedure of obtaining z = Ay once y is given, for example, by finite-difference formulae of type (5.11).

5.6.3
Computational Complexity
The overall computational complexity of obtaining the exact solution of system (5.82) using the conjugate gradients method is O(n^3) arithmetic operations. Indeed, the computation of each step by formulae (5.91a), (5.91b) clearly requires O(n^2) operations, while altogether no more than n steps are needed according to Theorem 5.8. We see that asymptotically the computational cost of the method of conjugate gradients is the same as that of the Gaussian elimination.

Note also that the two key elements of the method identified in Section 5.6.2, i.e., the matrix-vector multiplication and the computation of a scalar product, can be efficiently implemented on modern computer platforms that involve vector and/or multiprocessor architectures. This is true regardless of whether the operator A is specified explicitly by its matrix or otherwise, say, by a formula of type (5.11) that involves grid functions. Altogether, a vector or multiprocessor implementation of the algorithm may considerably reduce its total execution time.

Historically, a version of what is known as the conjugate gradients method today was introduced by C. Lanczos in the early fifties. In the literature, the method of conjugate gradients is sometimes referred to as the Lanczos algorithm.

Exercises
1. Show that the residuals r^(p-1), r^(p), r^(p+1) of the conjugate gradients method satisfy a single relation that contains three consecutive terms.

2.* Show that the method of conjugate gradients (5.91) can be recast as follows:

x^(1) = (I - τ_1 A) x^(0) + τ_1 f,
x^(p+1) = γ_{p+1} (I - τ_{p+1} A) x^(p) + (1 - γ_{p+1}) x^(p-1) + γ_{p+1} τ_{p+1} f,

where

τ_{p+1} = (r^(p), r^(p)) / (A r^(p), r^(p)),   r^(p) = A x^(p) - f,   p = 0, 1, 2, ...,
γ_1 = 1,   γ_{p+1} = ( 1 - (τ_{p+1} / τ_p) · (r^(p), r^(p)) / (r^(p-1), r^(p-1)) · (1 / γ_p) )^{-1},   p = 1, 2, ....

5.7
Finite Fourier Series
Along with the universal methods, special techniques have been developed in the literature that are "fine-tuned" for solving only particular, narrow classes of linear systems. Trading off the universality brings along a considerably better efficiency than that of the general methods. A typical example is given by the direct methods based on representing the solution to a linear system in the form of a finite Fourier series.

Let 𝕃 be a Euclidean or unitary vector space of dimension n, and let us consider a linear system in the operator form:

Ax = f,   x ∈ 𝕃,   f ∈ 𝕃,   A : 𝕃 ↦ 𝕃,        (5.92)

where the operator A is assumed self-adjoint, A = A*. Every self-adjoint linear operator has an orthonormal basis composed of its eigenvectors. Let us denote by {e_1, e_2, ..., e_n} the eigenvectors of the operator A that form an orthonormal basis in 𝕃, and by {λ_1, λ_2, ..., λ_n} the corresponding eigenvalues that are all real, so that

A e_j = λ_j e_j,   j = 1, 2, ..., n.        (5.93)

As we are assuming that A is nonsingular, we have λ_j ≠ 0, j = 1, 2, ..., n. For the solution methodology that we are about to build, not only do we need the existence of the basis {e_1, e_2, ..., e_n}, but we need to explicitly know these eigenvectors and the corresponding eigenvalues of (5.93). This clearly narrows the scope of admissible operators A, yet it facilitates the construction of a very efficient computational algorithm.

To solve system (5.92), we expand both its unknown solution x and its given right-hand side f with respect to the basis {e_1, e_2, ..., e_n}, i.e., represent the vectors x and f in the form of their finite Fourier series:

f = f_1 e_1 + f_2 e_2 + ... + f_n e_n,   where  f_j = (f, e_j),
x = x_1 e_1 + x_2 e_2 + ... + x_n e_n,   where  x_j = (x, e_j).        (5.94)
Clearly, the coefficients x_j, although formally defined as x_j = (x, e_j), j = 1, 2, ..., n, are not known yet. To determine these coefficients, we substitute expressions (5.94) into the system (5.92) and, using formula (5.93), obtain:

$$ \sum_{j=1}^{n} \lambda_j x_j e_j = \sum_{j=1}^{n} f_j e_j. \qquad (5.95) $$

Next, we take the scalar product of both the left-hand side and the right-hand side of equality (5.95) with every vector e_j, j = 1, 2, ..., n. As the eigenvectors are orthonormal:

$$ (e_i, e_j) = \begin{cases} 1, & i = j, \\ 0, & i \ne j, \end{cases} $$

we can find the coefficients x_j of the Fourier series (5.94) for the solution x:

$$ x_j = f_j / \lambda_j, \qquad j = 1, 2, \ldots, n. \qquad (5.96) $$
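As a sketch, the solution procedure (5.94)-(5.96) reads as follows in Python. The code below is an illustration, not part of the original text; the eigenpairs are assumed given, and in this demonstration they are simply produced by `numpy.linalg.eigh`, whereas in the settings the method is intended for they would be known analytically.

```python
import numpy as np

def fourier_solve(eigvecs, eigvals, f):
    """Solve Ax = f via the expansion (5.94)-(5.96) for a self-adjoint,
    nonsingular A, given an orthonormal eigenbasis: the columns of eigvecs
    are the vectors e_j, and eigvals[j] is the matching eigenvalue lambda_j."""
    f_coef = eigvecs.T @ f       # f_j = (f, e_j): Fourier coefficients of f
    x_coef = f_coef / eigvals    # x_j = f_j / lambda_j, formula (5.96)
    return eigvecs @ x_coef      # assemble x from its finite Fourier series
```

The cost is dominated by the two matrix-vector products; the division by the eigenvalues itself takes only n operations.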
We will illustrate the foregoing abstract scheme using the specific example of the finite-difference Dirichlet problem for the Poisson equation from Section 5.1.3:

$$ -\Delta^{(h)} u^{(h)} = f^{(h)}. \qquad (5.97) $$

In this case, one can explicitly write down the eigenvectors and eigenvalues of the operator −Δ^{(h)}.

5.7.1 Fourier Series for Grid Functions
Consider the set of all real-valued functions v = {v_m} defined on the grid x_m = mh, m = 0, 1, ..., M, Mh = 1, and equal to zero at the endpoints: v₀ = v_M = 0. This set of functions, supplemented by the conventional componentwise operations of addition and multiplication by real numbers, forms a real vector space ℝ^{M−1}. The dimension of this space is indeed equal to M − 1, because the system of functions

$$ \bar\psi^{(k)}_m = \begin{cases} 1, & m = k, \\ 0, & m \ne k, \end{cases} \qquad k = 1, 2, \ldots, M-1, $$

clearly provides a basis. This is easy to see, since any function v = (v₀, v₁, ..., v_M) with v₀ = 0 and v_M = 0 can be uniquely represented as a linear combination of the functions \bar\psi^{(1)}, \bar\psi^{(2)}, ..., \bar\psi^{(M-1)}:

$$ v = v_1 \bar\psi^{(1)} + v_2 \bar\psi^{(2)} + \cdots + v_{M-1} \bar\psi^{(M-1)}. $$

Let us introduce a scalar product in the space we are considering:

$$ (v, w) = h \sum_{m=0}^{M} v_m w_m. \qquad (5.98) $$
We will show that the following system of M − 1 functions:

$$ \psi^{(k)} = \left\{ \sqrt{2}\, \sin\frac{k\pi m}{M} \right\}, \qquad k = 1, 2, \ldots, M-1, \qquad (5.99) $$

forms an orthonormal basis in our (M − 1)-dimensional space, i.e.,

$$ (\psi^{(k)}, \psi^{(r)}) = \begin{cases} 1, & k = r, \\ 0, & k \ne r, \end{cases} \qquad k, r = 1, 2, \ldots, M-1. \qquad (5.100) $$
For the proof, we first notice that

$$ \sum_{m=0}^{M-1} \cos\frac{l\pi m}{M} = \frac{1}{2} \sum_{m=0}^{M-1} \left( e^{il\pi m/M} + e^{-il\pi m/M} \right) = \frac{1}{2}\,\frac{1 - e^{il\pi}}{1 - e^{il\pi/M}} + \frac{1}{2}\,\frac{1 - e^{-il\pi}}{1 - e^{-il\pi/M}} = \begin{cases} 0, & l \text{ is even}, \ 0 < l < 2M, \\ 1, & l \text{ is odd}. \end{cases} $$

Then, for k ≠ r we have:

$$ (\psi^{(k)}, \psi^{(r)}) = 2h \sum_{m=0}^{M} \sin\frac{k\pi m}{M} \sin\frac{r\pi m}{M} = 2h \sum_{m=0}^{M-1} \sin\frac{k\pi m}{M} \sin\frac{r\pi m}{M} = h \sum_{m=0}^{M-1} \cos\frac{(k-r)\pi m}{M} - h \sum_{m=0}^{M-1} \cos\frac{(k+r)\pi m}{M} = 0, $$

because k − r and k + r are either both even, in which case each cosine sum vanishes, or both odd, in which case the two sums are each equal to one and cancel. And for k = r:

$$ (\psi^{(k)}, \psi^{(r)}) = h \sum_{m=0}^{M-1} \cos 0 - h \sum_{m=0}^{M-1} \cos\frac{2k\pi m}{M} = hM - h \cdot 0 = 1. $$

Therefore, the system of grid functions (5.99) is indeed orthonormal and, as such, linearly independent. As the number of functions, M − 1, is the same as the dimension of the space, the functions ψ^{(k)}, k = 1, 2, ..., M − 1, do provide a basis. Any grid function v = (v₀, v₁, ..., v_M) with v₀ = 0 and v_M = 0 can be expanded with respect to the orthonormal basis (5.99):

$$ v = \sum_{k=1}^{M-1} c_k \psi^{(k)}, $$

or alternatively,

$$ v_m = \sqrt{2} \sum_{k=1}^{M-1} c_k \sin\frac{k\pi m}{M}, \qquad (5.101) $$
where the coefficients c_k are given by:

$$ c_k = (v, \psi^{(k)}) = \sqrt{2}\, h \sum_{m=0}^{M} v_m \sin\frac{k\pi m}{M}. \qquad (5.102) $$

It is clear that as the basis (5.99) is orthonormal, we have:

$$ (v, v) = c_1^2 + c_2^2 + \cdots + c_{M-1}^2. \qquad (5.103) $$
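The orthonormality relations (5.100) and the Parseval equality (5.103) are easy to confirm numerically. The following short sketch (illustrative only; the function names are not the book's) builds the basis (5.99) and the scalar product (5.98):

```python
import math

def psi(k, M):
    """Grid function psi^(k) of (5.99) on the grid m = 0, 1, ..., M."""
    return [math.sqrt(2) * math.sin(k * math.pi * m / M) for m in range(M + 1)]

def inner(v, w, M):
    """Scalar product (5.98): (v, w) = h * sum_m v_m w_m, with h = 1/M."""
    return sum(a * b for a, b in zip(v, w)) / M
```

With these two helpers, (psi(k, M), psi(r, M)) evaluates to 1 for k = r and to 0 otherwise, and for any grid function vanishing at the endpoints the sum of squared coefficients c_k = (v, psi^(k)) reproduces (v, v).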
The sum (5.101) is precisely what is called the expansion of the grid function v = {v_m} into a finite (or discrete) Fourier series. Equality (5.103) is a counterpart of the Parseval equality in the conventional theory of Fourier series.

In much the same way, finite Fourier series can be introduced for functions defined on two-dimensional grids. Consider a uniform rectangular grid on the square {(x, y) | 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}:

$$ x_{m_1} = m_1 h, \quad m_1 = 0, 1, \ldots, M, \qquad y_{m_2} = m_2 h, \quad m_2 = 0, 1, \ldots, M, \qquad (5.104) $$

where h = 1/M and M is a positive integer. The set of all real-valued functions v = {v_{m₁,m₂}} on this grid that turn into zero at the boundary of the square, i.e., for m₁ = 0, M and m₂ = 0, M, forms a vector space of dimension (M − 1)² once supplemented by the conventional componentwise operations of addition and multiplication by (real) scalars. An inner product can be introduced in this space:

$$ (v, w) = h^2 \sum_{m_1, m_2 = 0}^{M} v_{m_1,m_2}\, w_{m_1,m_2}. $$
Then, it is easy to see that the system of grid functions:

$$ \psi^{(k,l)} = \left\{ 2 \sin\frac{k\pi m_1}{M} \sin\frac{l\pi m_2}{M} \right\}, \qquad k, l = 1, 2, \ldots, M-1, \qquad (5.105) $$

forms an orthonormal basis in the space that we are considering:

$$ (\psi^{(k,l)}, \psi^{(r,s)}) = \begin{cases} 1, & k = r \text{ and } l = s, \\ 0, & k \ne r \text{ or } l \ne s. \end{cases} \qquad (5.106) $$

This follows from formula (5.100), because:

$$ (\psi^{(k,l)}, \psi^{(r,s)}) = h^2 \sum_{m_1=0}^{M} \sum_{m_2=0}^{M} \left( 2 \sin\frac{k\pi m_1}{M} \sin\frac{l\pi m_2}{M} \right) \left( 2 \sin\frac{r\pi m_1}{M} \sin\frac{s\pi m_2}{M} \right) = (\psi^{(k)}, \psi^{(r)})\,(\psi^{(l)}, \psi^{(s)}). $$
Any function v = {v_{m₁,m₂}} that is equal to zero at the boundary of the grid square can be expanded into a two-dimensional finite Fourier series [cf. formula (5.101)]:

$$ v_{m_1,m_2} = 2 \sum_{k,l=1}^{M-1} c_{kl} \sin\frac{k\pi m_1}{M} \sin\frac{l\pi m_2}{M}, \qquad (5.107) $$

where c_{kl} = (v, ψ^{(k,l)}). In doing so, the Parseval equality holds [cf. formula (5.103)]:

$$ (v, v) = \sum_{k,l=1}^{M-1} c_{kl}^2. $$
5.7.2 Representation of Solution as a Finite Fourier Series
By direct calculation one can easily verify⁶ that the following equalities hold:

$$ -\Delta^{(h)} \psi^{(k,l)} = \frac{4}{h^2} \left( \sin^2\frac{k\pi}{2M} + \sin^2\frac{l\pi}{2M} \right) \psi^{(k,l)}, \qquad (5.108) $$

where the operator −Δ^{(h)}: V^{(h)} ↦ V^{(h)} is defined by formulae (5.11), (5.10), and the space V^{(h)} contains all discrete functions specified on the interior subset of the grid (5.104), i.e., for m₁, m₂ = 1, 2, ..., M − 1. Note that even though the functions ψ^{(k,l)} of (5.105) are formally defined on the entire grid (5.104) rather than on its interior subset only, we can still assume that ψ^{(k,l)} ∈ V^{(h)} with no loss of generality, because the boundary values of ψ^{(k,l)} are fixed: ψ^{(k,l)}_{m₁,m₂} = 0 for m₁, m₂ = 0, M.

Equalities (5.108) imply that the grid functions ψ^{(k,l)}, which form an orthonormal basis in the space V^{(h)} (see formula (5.106)), are eigenfunctions of the operator −Δ^{(h)}: V^{(h)} ↦ V^{(h)}, whereas the numbers:

$$ \lambda_{kl} = \frac{4}{h^2} \left( \sin^2\frac{k\pi}{2M} + \sin^2\frac{l\pi}{2M} \right), \qquad k, l = 1, 2, \ldots, M-1, \qquad (5.109) $$

are the corresponding eigenvalues. As all the eigenvalues (5.109) are real (and positive), the operator −Δ^{(h)} is self-adjoint, i.e., symmetric (and positive definite). Indeed, for an arbitrary pair v, w ∈ V^{(h)} we first use the expansion (5.107): v = Σ_{k,l=1}^{M−1} c_{kl} ψ^{(k,l)}, w = Σ_{r,s=1}^{M−1} d_{rs} ψ^{(r,s)}, and then substitute these expressions into formula (5.108), which yields [with the help of formula (5.106)]:

$$ (-\Delta^{(h)} v, w) = \left( -\Delta^{(h)} \sum_{k,l=1}^{M-1} c_{kl} \psi^{(k,l)}, \ \sum_{r,s=1}^{M-1} d_{rs} \psi^{(r,s)} \right) = \sum_{k,l=1}^{M-1} \lambda_{kl}\, c_{kl}\, d_{kl} = (v, -\Delta^{(h)} w). $$

According to the general scheme of the Fourier method given by formulae (5.94), (5.95), and (5.96), the solution of problem (5.97) can be written in the form:

$$ u_{m_1,m_2} = \sum_{k,l=1}^{M-1} \frac{f_{kl}}{\lambda_{kl}}\, 2 \sin\frac{k\pi m_1}{M} \sin\frac{l\pi m_2}{M}, \qquad (5.110) $$

where

$$ f_{kl} = (f^{(h)}, \psi^{(k,l)}) = 2h^2 \sum_{m_1,m_2=0}^{M} f_{m_1,m_2} \sin\frac{k\pi m_1}{M} \sin\frac{l\pi m_2}{M}. \qquad (5.111) $$

⁶See also the analysis on page 269.

REMARK 5.3  In the case M = 2^p, where p is a positive integer, there is a special algorithm that allows one to compute the Fourier coefficients f_{kl} according to formula (5.111), as well as the solution u_{m₁,m₂} according to formula (5.110), at a cost of only O(M² ln M) arithmetic operations, as opposed to the straightforward estimate O(M⁴). This remarkable algorithm is called the fast Fourier transform (FFT). We describe it in Section 5.7.3. □
Let us additionally note that, according to formula (5.109):

$$ \lambda_{\min} = \lambda_{11} = \lambda_{11}(h) = \frac{8}{h^2} \sin^2\frac{\pi}{2M} = \frac{8}{h^2} \sin^2\frac{\pi h}{2}. \qquad (5.112) $$

Consequently, since sin(πh/2) ≥ h for 0 ≤ h ≤ 1,

$$ \lambda_{11} \ge 8. \qquad (5.113) $$

At the same time,

$$ \lambda_{\max} = \lambda_{M-1,M-1} = \frac{8}{h^2} \cos^2\frac{\pi h}{2} < \frac{8}{h^2}. \qquad (5.114) $$

Therefore, for the Euclidean condition number of the operator −Δ^{(h)} we can write, with the help of Theorem 5.3:

$$ \mu(-\Delta^{(h)}) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{\lambda_{M-1,M-1}}{\lambda_{11}} = O(h^{-2}), \quad \text{as } h \to 0. \qquad (5.115) $$

Moreover, the spectral radius of the inverse operator (−Δ^{(h)})^{−1}, which provides its Euclidean norm, and which is equal to its maximum eigenvalue λ₁₁^{−1}, does not exceed 1/8 according to estimate (5.113). Consequently, the following inequality (a stability estimate) always holds for the solution u^{(h)} of problem (5.97):

$$ \| u^{(h)} \| \le \tfrac{1}{8}\, \| f^{(h)} \|. $$
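The estimates (5.112)-(5.115) are easy to probe numerically. The following sketch (function names are illustrative) tabulates the eigenvalues (5.109) and the resulting condition number; doubling M, i.e., halving h, roughly quadruples the condition number, consistent with O(h⁻²).

```python
import math

def lam(k, l, M):
    """Eigenvalues (5.109) of -Delta^(h) on the unit square, with h = 1/M."""
    return 4.0 * M * M * (math.sin(k * math.pi / (2 * M)) ** 2
                          + math.sin(l * math.pi / (2 * M)) ** 2)

def cond(M):
    """Euclidean condition number (5.115): lambda_max over lambda_min."""
    return lam(M - 1, M - 1, M) / lam(1, 1, M)
```

For instance, cond(2 * M) / cond(M) approaches 4 as M grows, and lam(1, 1, M) stays bounded below by 8, in line with (5.113).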
5.7.3 Fast Fourier Transform
We will introduce the idea of a fast algorithm in the one-dimensional context, i.e., we will show how the Fourier coefficients (5.102):

$$ c_k = (v, \psi^{(k)}) = \sqrt{2}\, h \sum_{m=0}^{M} v_m \sin\frac{k\pi m}{M} $$

can be evaluated considerably more rapidly than at a cost of O(M²) arithmetic operations. Rapid evaluation of the Fourier coefficients on a two-dimensional grid, as well as rapid summation of the two-dimensional discrete Fourier series (see Section 5.7.2), can then be performed similarly.

We begin by recasting expression (5.102) in complex form. To do so, we artificially extend the grid function v_m antisymmetrically with respect to the point m = M: v_{M+j} = −v_{M−j}, j = 1, 2, ..., M, so that for m = M + 1, M + 2, ..., 2M it becomes: v_m = −v_{2M−m}. Then we can write:

$$ c_k = \frac{\sqrt{2}\, h}{2i} \sum_{n=0}^{N-1} v_n\, e^{i 2\pi k n / N}, \qquad N = 2M, \qquad (5.116) $$
because given the antisymmetric extension of v_m beyond m = M, we clearly have Σ_{n=0}^{2M−1} v_n cos(kπn/M) = 0. Hereafter, we will analyze formula (5.116) instead of (5.102).

First of all, it is clear that a straightforward implementation of the summation formula (5.116) requires O(N) arithmetic operations per coefficient c_k, even if all the exponential factors e^{i2πkn/N} are precomputed and available. Consequently, the computation of all c_k, k = 0, 1, 2, ..., N − 1, requires O(N²) arithmetic operations. Let, however, N = N₁·N₂, where N₁ and N₂ are positive integers. Then we can represent the numbers k and n as follows:

$$ k = k_1 N_2 + k_2, \quad k_1 = 0, 1, \ldots, N_1 - 1, \quad k_2 = 0, 1, \ldots, N_2 - 1, $$
$$ n = n_1 + n_2 N_1, \quad n_1 = 0, 1, \ldots, N_1 - 1, \quad n_2 = 0, 1, \ldots, N_2 - 1. $$

Consequently, by noticing that e^{i2π(k₁N₂)(n₂N₁)/N} = e^{i2πk₁n₂} = 1, we obtain from expression (5.116):

$$ c_{k_1 N_2 + k_2} = \frac{\sqrt{2}\, h}{2i} \sum_{n_1=0}^{N_1-1} e^{i 2\pi k n_1 / N} \sum_{n_2=0}^{N_2-1} v_{n_1 + n_2 N_1}\, e^{i 2\pi k_2 n_2 / N_2}. $$

Therefore, the sum (5.116) can be computed in two stages. First, we evaluate the intermediate quantity:

$$ \hat{c}_{k_2}(n_1) = \sum_{n_2=0}^{N_2-1} v_{n_1 + n_2 N_1}\, e^{i 2\pi k_2 n_2 / N_2}, \qquad (5.117a) $$
and then the final result:

$$ c_{k_1 N_2 + k_2} = \frac{\sqrt{2}\, h}{2i} \sum_{n_1=0}^{N_1-1} e^{i 2\pi k n_1 / N}\, \hat{c}_{k_2}(n_1), \qquad (5.117b) $$

where ĉ_{k₂}(n₁) denotes the intermediate quantity (5.117a). The key advantage of computing c_k consecutively by formulae (5.117a)-(5.117b), rather than directly by formula (5.116), is that the cost of implementing formula (5.117a) is O(N₂) arithmetic operations, and the cost of subsequently implementing formula (5.117b) is O(N₁) arithmetic operations, which altogether yields O(N₁ + N₂) operations. At the same time, the cost of computing a given c_k by formula (5.116) is O(N) = O(N₁·N₂) operations. For example, if N is a large even number and N₁ = 2, then one basically obtains a speedup by roughly a factor of two. The cost of computing all coefficients c_k, k = 0, 1, 2, ..., N − 1, by formula (5.116) is O(N²) operations, whereas with the help of formulae (5.117a)-(5.117b) it can be reduced to O(N(N₁ + N₂)) operations.

Assume now that N₁ is a prime number, whereas N₂ is a composite number. Then N₂ can be represented as a product of two factors, and a similar transformation can be applied to the sum (5.117a), for which n₁ is simply a parameter. This will further reduce the computational cost. In general, if N = N₁·N₂···N_p, then instead of the O(N²) complexity of the original summation we obtain an algorithm for computing c_k, k = 0, 1, 2, ..., N − 1, at a cost of O(N(N₁ + N₂ + ... + N_p)) arithmetic operations. This algorithm is known as the fast Fourier transform (FFT).

Efficient versions of the FFT have been developed for N_i = 2, 3, 4, although the algorithm can also be built for other prime factors. In practice, the most commonly used and the most convenient to implement version is the one for N = 2^p, where p is a positive integer. The computational complexity of the FFT in this case is

$$ O(N \underbrace{(2 + 2 + \cdots + 2)}_{p\ \text{times}}) = O(N \ln N) $$

arithmetic operations.

The algorithm of fast Fourier transform was introduced in a landmark 1965 paper by Cooley and Tukey [CT65], and has since become one of the most popular computational tools for solving large linear systems obtained as discretizations of differential equations on the grid. Besides that, the FFT finds many applications in signal processing and in statistics. Practically all numerical software libraries today have the FFT as a part of their standard implementation.

Exercises
1. Consider an inhomogeneous Dirichlet problem for the finite-difference Poisson equation on a grid square [cf. formula (5.11)]:

$$ -\Delta^{(h)} u^{(h)} = f_{m_1,m_2}, \qquad m_1 = 1, 2, \ldots, M-1, \quad m_2 = 1, 2, \ldots, M-1, \qquad (5.118a) $$

with the boundary conditions:

$$ u_{0,m_2} = \varphi_{m_2}, \quad u_{M,m_2} = \xi_{m_2}, \qquad m_2 = 1, 2, \ldots, M-1, $$
$$ u_{m_1,0} = \eta_{m_1}, \quad u_{m_1,M} = \zeta_{m_1}, \qquad m_1 = 1, 2, \ldots, M-1, \qquad (5.118b) $$

where f_{m₁,m₂}, φ_{m₂}, ξ_{m₂}, η_{m₁}, and ζ_{m₁} are given functions of their respective arguments. Write down the solution of this problem in the form of a finite Fourier series (5.110).

Hint. Take an arbitrary grid function w^{(h)} = {w_{m₁,m₂}}, m₁ = 0, 1, ..., M, m₂ = 0, 1, ..., M, that satisfies boundary conditions (5.118b). Apply the finite-difference Laplace operator of (5.118a) to this function and obtain: −Δ^{(h)} w^{(h)} ≝ g^{(h)}. Then consider a new grid function v^{(h)} ≝ u^{(h)} − w^{(h)}. This function satisfies the finite-difference Poisson equation −Δ^{(h)} v^{(h)} = f^{(h)} − g^{(h)}, for which the right-hand side is known, and it also satisfies the homogeneous boundary conditions: v_{m₁,m₂} = 0 for m₁ = 0, M and m₂ = 0, M. Consequently, the new solution v^{(h)} can be obtained using the discrete Fourier series, after which the original solution is reconstructed as u^{(h)} = v^{(h)} + w^{(h)}, where w^{(h)} is known.
Chapter 6

Iterative Methods for Solving Linear Systems
Consider a system of linear algebraic equations:

$$ Ax = f, \qquad f \in L, \quad x \in L, \qquad (6.1) $$

where L is a vector space and A: L ↦ L is a linear operator. From the standpoint of applications, having the capability to compute its exact solution may be "nice," but it is typically not necessary. Quite the opposite: one can normally use an approximate solution, provided that it guarantees the accuracy sufficient for a particular application. On the other hand, one usually cannot obtain the exact solution anyway. The reason is that the input data of the problem (the right-hand side f, as well as the operator A itself) are always specified with some degree of uncertainty. This necessarily leads to a certain unavoidable error in the result. Besides, as all the numbers inside the machine are only specified with finite precision, round-off errors are also inevitable in the course of computations.

Therefore, instead of solving system (6.1) by a direct method, e.g., by Gaussian elimination, in many cases it may be advantageous to use an iterative method of solution. This is particularly true when the dimension n of system (6.1) is very large, and unless a special fast algorithm such as the FFT (see Section 5.7.3) can be employed, the O(n³) cost of a direct method (see Sections 5.4 and 5.6) would be unbearable.

A typical iterative method (or iteration scheme) consists of building a sequence of vectors {x^{(p)}} ⊂ L, p = 0, 1, 2, ..., that are supposed to provide successively more accurate approximations of the exact solution x. The initial guess x^{(0)} ∈ L for an iteration scheme is normally taken arbitrarily. The notion of successively more accurate approximations can, of course, be quantified. It means that the sequence x^{(p)} has to converge to the exact solution x as the number p increases: x^{(p)} → x as p → ∞. This means that for any ε > 0 we can always find p = p(ε) such that the inequality

$$ \| x - x^{(p)} \| \le \varepsilon $$

will hold for all p ≥ p(ε). Accordingly, by specifying a sufficiently small ε > 0 we can terminate the iteration process after a finite number p = p(ε) of steps, and subsequently use the iterate x^{(p)} in the capacity of an approximate solution that meets the accuracy requirements of a given problem.

In this chapter, we will describe some popular iterative methods, and outline the conditions under which it may be advisable to use an iterative method rather than a direct method, or to prefer one particular iterative method over another.
6.1 Richardson Iterations and the Like
We will build a family of iterative methods by first recasting system (6.1) as follows:

$$ x = (I - \tau A)x + \tau f. \qquad (6.2) $$

In doing so, the new system (6.2) will be equivalent to the original one for any value of the parameter τ, τ > 0. In general, there are many different ways of replacing the system Ax = f by its equivalent of the type:

$$ x = Bx + \varphi, \qquad x \in L, \quad \varphi \in L. \qquad (6.3) $$

System (6.2) is a particular case of (6.3) with B = (I − τA) and φ = τf.

6.1.1 General Iteration Scheme
The general scheme of what is known as the first order linear stationary iteration process consists of successively computing the terms of the sequence:

$$ x^{(p+1)} = Bx^{(p)} + \varphi, \qquad p = 0, 1, 2, \ldots, \qquad (6.4) $$

where the initial guess x^{(0)} is specified arbitrarily. The matrix B is known as the iteration matrix. Clearly, if the sequence x^{(p)} converges, i.e., if there is a limit lim_{p→∞} x^{(p)} = x, then x is the solution of system (6.1). Later we will identify the conditions that guarantee convergence of the sequence (6.4). Iterative method (6.4) is first order because the next iterate x^{(p+1)} depends only on one previous iterate, x^{(p)}; it is linear because the latter dependence is linear; and finally, it is stationary because if we formally rewrite (6.4) as x^{(p+1)} = F(x^{(p)}, A, f), then the function F does not depend on p.

A particular form of the iteration scheme (6.4) based on system (6.2) is known as the stationary Richardson method:

$$ x^{(p+1)} = (I - \tau A)x^{(p)} + \tau f, \qquad p = 0, 1, 2, \ldots. \qquad (6.5) $$

Formula (6.5) can obviously be rewritten as:

$$ x^{(p+1)} = x^{(p)} - \tau r^{(p)}, \qquad (6.6) $$

where r^{(p)} = Ax^{(p)} − f is the residual of the iterate x^{(p)}. Instead of keeping the parameter τ constant, we can allow it to depend on p. Then, departing from formula (6.6), we arrive at the so-called non-stationary Richardson method:

$$ x^{(p+1)} = x^{(p)} - \tau_{p+1} r^{(p)}. \qquad (6.7) $$

Note that in order to actually compute the iterations according to formula (6.5), we only need to be able to obtain the vector Ax^{(p)} once x^{(p)} ∈ L is given. This does not necessarily entail explicit knowledge of the matrix A. In other words, an iterative method of solving Ax = f can also be realized when the system is specified in an operator form. Building the iteration sequence does not require choosing a particular basis in L and reducing the system to its canonical form:
$$ \sum_{j=1}^{n} a_{ij} x_j = f_i, \qquad i = 1, 2, \ldots, n. \qquad (6.8) $$
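By way of illustration, here is a hedged sketch of the stationary Richardson iteration (6.5)-(6.6) in which the operator A is supplied only as a routine computing the action x ↦ Ax; the three-point operator below is an arbitrary stand-in for such a matrix-free A, and all names are illustrative.

```python
def richardson(apply_A, f, tau, num_iters):
    """Stationary Richardson iteration (6.5)-(6.6): x <- x - tau*(A x - f).
    Only the action x -> A x is required; the matrix A is never formed."""
    x = [0.0] * len(f)                      # initial guess x^(0) = 0
    for _ in range(num_iters):
        r = [ai - fi for ai, fi in zip(apply_A(x), f)]   # residual r = A x - f
        x = [xi - tau * ri for xi, ri in zip(x, r)]
    return x

def apply_tridiag(x):
    """Matrix-free stand-in for A: the three-point operator
    (A x)_i = 2 x_i - x_{i-1} - x_{i+1}, with zero values outside the range."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

For this particular A all eigenvalues lie strictly between 0 and 4, so any τ in (0, 1/2) keeps the error contracting; only the current n-vector x is ever stored, never the n² entries of a matrix.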
Moreover, when computing the terms of the sequence x^{(p)} with the help of formula (6.5), we do not necessarily need to store all n² entries of the matrix A in the computer memory. Instead, to implement the iteration scheme (6.5) we may only store the current vector x^{(p)} ∈ L that has n components. In addition to the memory savings and the flexibility in specifying A, we will see that for certain classes of linear systems the computational cost of obtaining a sufficiently accurate solution of Ax = f with the help of an iterative method may be considerably lower than the O(n³) operations characteristic of direct methods.

THEOREM 6.1
Let L be an n-dimensional normed vector space (say, ℝⁿ or ℂⁿ), and assume that the induced operator norm of the iteration matrix B of (6.4) satisfies:

$$ \| B \| = q < 1. \qquad (6.9) $$

Then system (6.3), or equivalently system (6.1), has a unique solution x ∈ L. Moreover, the iteration sequence (6.4) converges to this solution x for an arbitrary initial guess x^{(0)}, and the error of iterate number p:

$$ e^{(p)} \stackrel{\rm def}{=} x - x^{(p)} $$

satisfies the estimate:

$$ \| e^{(p)} \| \le q^p \| x - x^{(0)} \|. \qquad (6.10) $$

In other words, the norm of the error ‖e^{(p)}‖ vanishes as p → ∞ at least as fast as the geometric sequence q^p.
PROOF  If φ = 0, then system (6.3) has only the trivial solution x = 0. Indeed, otherwise for a solution x ≠ 0, φ = 0, we could write:

$$ \| x \| = \| Bx \| \le \| B \| \| x \| = q \| x \| < \| x \|, $$

i.e., ‖x‖ < ‖x‖, which may not hold. The contradiction proves that system (6.3) is uniquely solvable for any φ, and as such, so is system (6.1) for any f.

Next, let x be the solution of system (6.3). Take an arbitrary x^{(0)} ∈ L and subtract equality (6.3) from equality (6.4), which yields: e^{(p+1)} = Be^{(p)}, p = 0, 1, 2, .... Consequently, the following error estimate holds:

$$ \| x - x^{(p)} \| = \| e^{(p)} \| = \| Be^{(p-1)} \| \le q \| e^{(p-1)} \| \le q^2 \| e^{(p-2)} \| \le \cdots \le q^p \| e^{(0)} \| = q^p \| x - x^{(0)} \|, $$

which is equivalent to (6.10). □
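A small numerical illustration of Theorem 6.1 (a sketch only: the 2×2 iteration matrix below is chosen arbitrarily so that its induced Euclidean norm q is below one, and the exact fixed point is obtained by a direct solve purely for comparison):

```python
import numpy as np

def iterate(B, phi, x0, p):
    """Run p steps of the stationary scheme (6.4): x^(p+1) = B x^(p) + phi."""
    x = x0.copy()
    for _ in range(p):
        x = B @ x + phi
    return x

# An iteration matrix with ||B|| = q < 1 in the induced Euclidean norm:
B = np.array([[0.5, 0.2],
              [0.1, 0.4]])
q = np.linalg.norm(B, 2)                        # spectral (induced 2-) norm
phi = np.array([1.0, 2.0])
x_exact = np.linalg.solve(np.eye(2) - B, phi)   # fixed point of x = Bx + phi
```

For every p, the error ‖x − x^{(p)}‖ stays below the geometric bound q^p ‖x − x^{(0)}‖ of estimate (6.10).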
REMARK 6.1  Condition (6.9) can be violated for a different choice of norm on the space L: ‖x‖′ instead of ‖x‖. This, however, will not disrupt the convergence: x^{(p)} → x as p → ∞. Moreover, the error estimate (6.10) will be replaced by

$$ \| x - x^{(p)} \|' \le c\, q^p \| x - x^{(0)} \|', \qquad (6.11) $$

where c is a constant that will, generally speaking, depend on the new norm ‖·‖′, whereas the value of q will remain the same.
To justify this remark, one needs to employ the equivalence of any two norms on a vector (i.e., finite-dimensional) space; see [HJ85, Section 5.4]. This important result says that if ‖·‖ and ‖·‖′ are two norms on L, then we can always find two constants c₁ > 0 and c₂ > 0 such that ∀x ∈ L: c₁‖x‖′ ≤ ‖x‖ ≤ c₂‖x‖′, where c₁ and c₂ do not depend on x. Therefore, inequality (6.10) implies (6.11) with c = c₂/c₁. □

Example 1: The Jacobi Method
Let the matrix A = {a_{ij}} of system (6.1) be diagonally dominant:

$$ |a_{ii}| > \sum_{j \ne i} |a_{ij}|, \qquad i = 1, 2, \ldots, n. $$

For |λ| ≥ ‖B‖ + η, η > 0, series (6.22) is majorized (componentwise) by the convergent geometric series ‖e^{(0)}‖ Σ_{p=0}^{∞} (‖B‖/(‖B‖+η))^p. According to the Weierstrass theorem proven in courses of complex analysis (see, e.g., [Mar77, Chapter 3]), the sum of a uniformly converging series of holomorphic functions is holomorphic. Therefore, in the region of convergence the function W(λ) is a holomorphic vector function of its argument λ, and the series (6.22) is its Laurent series. It is also easy to see that λW(λ) − λe^{(0)} = BW(λ), which, in turn, means: W(λ) = −λ(B − λI)^{−1} e^{(0)}. Moreover, by multiplying the series (6.22) by λ^{p−1} and then integrating (counterclockwise) along the circle |λ| = r on the complex plane, where the number r is chosen so that the contour of integration lies within the area of convergence, i.e., r ≥ ‖B‖ + η, we obtain:

$$ e^{(p)} = \frac{1}{2\pi i} \oint_{|\lambda|=r} \lambda^{p-1}\, W(\lambda)\, d\lambda = -\frac{1}{2\pi i} \oint_{|\lambda|=r} \lambda^{p}\, (B - \lambda I)^{-1} e^{(0)}\, d\lambda. \qquad (6.23) $$

Indeed, integrating the individual powers of λ on the complex plane, we have:

$$ \oint_{|\lambda|=r} \lambda^{k}\, d\lambda = \begin{cases} 2\pi i, & k = -1, \\ 0, & k \ne -1. \end{cases} $$
In other words, formula (6.23) implies that e^{(p)} is the residue of the vector function λ^{p−1}W(λ) at infinity.

Next, according to inequality (6.21), all the eigenvalues of the operator B belong to the disk of radius ρ < 1 centered at the origin on the complex plane: |λ_j| ≤ ρ < 1, j = 1, 2, ..., n. Then the integrand in the second integral of formula (6.23) is an analytic vector function of λ outside of this disk, i.e., for |λ| > ρ, because the operator (B − λI)^{−1} exists (i.e., is bounded) for all λ: |λ| > ρ. This function is the analytic continuation of the function λ^{p−1}W(λ), where W(λ) is originally defined by the series (6.22), which can only be proven to converge outside of the larger disk |λ| ≤ ‖B‖ + η. Consequently, the contour of integration in (6.23) can be altered: instead of r ≥ ‖B‖ + η one can take r = ρ + δ, where δ > 0 is arbitrary, without changing the value of the integral. Therefore, the error can be estimated as follows:

$$ \| e^{(p)} \| = \left\| \frac{1}{2\pi i} \oint_{|\lambda|=\rho+\delta} \lambda^{p} (B - \lambda I)^{-1} e^{(0)}\, d\lambda \right\| \le (\rho + \delta)^{p+1} \max_{|\lambda|=\rho+\delta} \| (B - \lambda I)^{-1} \|\, \| e^{(0)} \|. \qquad (6.24) $$

In formula (6.24), let us take δ > 0 sufficiently small so that ρ + δ < 1. Then the right-hand side of inequality (6.24) vanishes as p increases, which implies the convergence: ‖e^{(p)}‖ → 0 as p → ∞. This completes the proof of sufficiency.

To prove the necessity, suppose that inequality (6.21) does not hold, i.e., that for some λ_k we have |λ_k| ≥ 1. At the same time, contrary to the conclusion of the theorem, assume that the convergence still takes place for any choice of x^{(0)}: x^{(p)} → x as p → ∞. Then we can choose x^{(0)} so that e^{(0)} = x − x^{(0)} = e_k, where e_k is the eigenvector of the operator B that corresponds to the eigenvalue λ_k. In this case, e^{(p)} = B^p e^{(0)} = B^p e_k = λ_k^p e_k. As |λ_k| ≥ 1, the sequence λ_k^p e_k does not converge to 0 as p increases. The contradiction proves the necessity. □
REMARK 6.2  Let us make an interesting and important observation about a situation that we encounter here for the first time. The problem of computing the limit x = lim_{p→∞} x^{(p)} is ultimately well conditioned, because the result x does not depend on the initial data at all, i.e., it does not depend on the initial guess x^{(0)}. Yet the algorithm for computing the sequence x^{(p)} that converges according to Theorem 6.2 may still turn out to be computationally unstable. The instability may take place if, along with the inequality max_j |λ_j| = ρ < 1, we have ‖B‖ > 1. This situation is typical of non-self-adjoint (or non-normal) matrices B (opposite of Theorem 5.2). Indeed, if ‖B‖ < 1, then the norm of the error ‖e^{(p)}‖ = ‖B^p e^{(0)}‖ decreases monotonically; this is the result of Theorem 6.1. Otherwise, if ‖B‖ > 1, then for some e^{(0)} the norm ‖e^{(p)}‖ will initially grow, and only then decrease. The behavior will be qualitatively similar to that shown in Figure 10.13 (see page 394), which pertains to the study of stability for finite-difference initial boundary value problems. In doing so, the height of the intermediate "hump" on the curve showing the dependence of ‖e^{(p)}‖ on p may be arbitrarily high. A small relative error committed near the value of p that corresponds to the maximum of the "hump" will subsequently increase (i.e., its norm will increase); this error will also evolve and undergo a maximum, etc. The resulting instability may appear so strong that the computation becomes practically impossible already for moderate dimensions n and for norms ‖B‖ only slightly larger than one. □
The Richardson Method for
A = A* > 0
Consider equation (6.3): x = Bx + rp, x E lL, assuming that lL is an ndimensional Euclidean space with the inner product (x, y) and the norm I l x l l = V(x, x) (e.g., IL ]R n ). Also assume that B : lL f+ lL is a selfadjoint operator: B = B*, with respect to the chosen inner product. Let Vj, j = 1 , 2, . . , n, be the eigenvalues of B and let p = p (B) = max I v} 1 =
.
J
be its spectral radius. Specify an arbitrary x(O) E lL and build a sequence of iterations:
x(p+I) = Bx(p) + rp, p = O, 1 , 2, . . . .
(6.25)
LEMMA 6.1 1. If P < 1 then the system x = Bx + rp has a unique solution x E lL; the iterates x(p) of (6. 25) converge to x; and the Euclidean norm of the error Ilx  x(p) I satisfies the estimate:
IIx  x(p) l I ::; pP l l x  x (O) I I ,
p = O, 1 , 2, . . . .
(6.26)
Moreover, there is a particular x(O) E lL for which inequality (6. 26) trans forms into a precise equality. 2. Let the system x = Bx + rp have a solution x E lL for a given rp E lL, and let p ::::: 1 . Then there is an initial guess x(O) E lL such that the corresponding sequence of iterations (6.25) does not converge to x.
PROOF  According to Theorem 5.2, the Euclidean norm of a self-adjoint operator B = B* coincides with its spectral radius ρ. Therefore, the first conclusion of the lemma, except its last statement, holds by virtue of Theorem 6.1.

To find the initial guess that would turn (6.26) into an equality, we first introduce our standard notation e^{(p)} = x − x^{(p)} for the error of the iterate x^{(p)}, and subtract equation (6.25) from x = Bx + φ, which yields: e^{(p+1)} = Be^{(p)}, p = 0, 1, 2, .... Next, suppose that |ν_k| = max_j |ν_j| = ρ and take e^{(0)} = x − x^{(0)} = e_k, where e_k is the eigenvector of B that corresponds to the eigenvalue with maximum magnitude. Then we obtain: ‖e^{(p)}‖ = |ν_k|^p ‖e^{(0)}‖ = ρ^p ‖e^{(0)}‖.

To prove the second conclusion of the lemma, we take the particular eigenvalue ν_k that delivers the maximum: |ν_k| = max_j |ν_j| = ρ ≥ 1, and again select e^{(0)} = x − x^{(0)} = e_k, where e_k is the corresponding eigenvector. In this case the error obviously does not vanish as p → ∞, because:

$$ e^{(p)} = Be^{(p-1)} = \cdots = B^p e^{(0)} = \nu_k^p\, e_k, $$

and consequently, ‖e^{(p)}‖ = ρ^p ‖e_k‖, where ρ^p will either stay bounded but not vanish, or will increase as p → ∞. □
Lemma 6.1 analyzes the special case B = B* and provides a simple illustration of the general conclusion of Theorem 6.2, namely that for the convergence of a first order linear stationary iteration it is necessary and sufficient that the spectral radius of the iteration matrix be strictly less than one. With the help of this lemma, we will now analyze the convergence of the stationary Richardson iteration (6.5) for the case A = A* > 0.

THEOREM 6.3
Consider a system of linear algebraic equations:

$$ Ax = f, \qquad A = A^* > 0, \qquad (6.27) $$

where x, f ∈ L, and L is an n-dimensional Euclidean space (e.g., L = ℝⁿ). Let λ_min and λ_max be the smallest and the largest eigenvalues of the operator A, respectively. Specify some τ ≠ 0 and recast system (6.27) in an equivalent form:

$$ x = (I - \tau A)x + \tau f. \qquad (6.28) $$

Given an arbitrary initial guess x^{(0)} ∈ L, consider a sequence of Richardson iterations:

$$ x^{(p+1)} = (I - \tau A)x^{(p)} + \tau f, \qquad p = 0, 1, 2, \ldots. \qquad (6.29) $$

1. If the parameter τ satisfies the inequalities:

$$ 0 < \tau < 2/\lambda_{\max}, \qquad (6.30) $$

then the sequence x^{(p)} of (6.29) converges to the solution x of system (6.27). Moreover, the norm of the error ‖x − x^{(p)}‖ is guaranteed to decrease as p increases, with the rate given by the following estimate:
Iterative Methods for Solving Linear Systems
1 83
The quantity p in formula (6. 31) is defined as p = p (r) = max{ 1 1  rAminl , 1 1

rAmax l } .
(6. 3 2)
This quantity is less than one, p < 1 , as it is the maximum of two numbers, / 1  rAmin l and 1 1  rAmax l , neither of which may exceed one provided that inequalities (6. 30) hold. 2. Let the number r satisfy (6. 30). Then there is a special initial guess x(O) for which estimate (6.31) cannot be improved, because for this x(O) inequality (6. 31) transforms into a precise equality. 3. If condition (6.30) is violated, so that either r :::: 2/Amax or r ::; 0, then there is an initial guess x(O) for which the sequence x(p) of (6.29) does not converge to the solution x of system (6. 27} . 4. The number p p ( r ) given by formula (6.32) assumes its minimal (i. e. , optimal) value popt = p (ropt) when r = ropt = 2/(Am in + Amax ) . In this case, =
p = popt =
Amax  Amin .u (A)  1 = Amax + Amin .u (A ) + 1 '
(6.33)
where μ(A) = λ_max/λ_min is the condition number of the operator A (see Theorem 5.3).

PROOF   To prove Theorem 6.3, we will use Lemma 6.1. In this lemma, let us set B = I − τA. Note that if A = A* then the operator B = I − τA is also self-adjoint, i.e., B = B*:

(Bx, y) = ((I − τA)x, y) = (x, y) − τ(Ax, y) = (x, y) − τ(x, Ay) = (x, (I − τA)y) = (x, By).

Suppose that λ_j, j = 1, 2, …, n, are the eigenvalues of the operator A arranged in the ascending order:
λ_1 ≤ λ_2 ≤ … ≤ λ_n,   (6.34)
and {e_1, e_2, …, e_n} are the corresponding eigenvectors: Ae_j = λ_j e_j, j = 1, 2, …, n, that form an orthonormal basis in the space L. Then clearly, the same vectors e_j, j = 1, 2, …, n, are also eigenvectors of the operator B, whereas the respective eigenvalues are given by:

ν_j = 1 − τλ_j,  j = 1, 2, …, n.   (6.35)

Indeed,

Be_j = (I − τA)e_j = e_j − τλ_j e_j = (1 − τλ_j)e_j = ν_j e_j,  j = 1, 2, …, n.
A Theoretical Introduction to Numerical Analysis
1 84
According to (6.34), if τ > 0 then the eigenvalues ν_j given by formula (6.35) are arranged in the descending order, see Figure 6.1:

1 > ν_1 ≥ ν_2 ≥ … ≥ ν_n.

FIGURE 6.1: Eigenvalues of the matrix B = I − τA.

From Figure 6.1 it is also easy to see that the largest among the absolute values |ν_j|, j = 1, 2, …, n, may be either |ν_1| = |1 − τλ_1| ≡ |1 − τλ_min| or |ν_n| = |1 − τλ_n| ≡ |1 − τλ_max|; the case |ν_n| = max_j |ν_j| is realized when ν_n = 1 − τλ_max < 0 and |1 − τλ_max| > |1 − τλ_min|. Consequently, the condition:

max_j |ν_j| < 1   (6.36)

of Lemma 6.1 coincides with the condition [see formula (6.32)]:

ρ = max{ |1 − τλ_min|, |1 − τλ_max| } < 1.

Clearly, if τ > 0 we can only guarantee ρ < 1 provided that the point ν_n on Figure 6.1 is located to the right of the point −1, i.e., if ν_n = 1 − τλ_max > −1. This means that along with τ > 0 the second inequality of (6.30) also holds. Otherwise, if τ ≥ 2/λ_max, then ρ ≥ 1. If τ < 0, then ν_j = 1 − τλ_j = 1 + |τ|λ_j > 1 for all j = 1, 2, …, n, and we will always have ρ = max_j |ν_j| > 1. Hence, condition (6.30) is equivalent to the requirement (6.36) of Lemma 6.1 for B = I − τA (or to requirement (6.21) of Theorem 6.2). We have thus proven the first three implications of Theorem 6.3.

To prove the remaining fourth implication, we need to analyze the behavior of the quantities |ν_1| = |1 − τλ_min| and |ν_n| = |1 − τλ_max| as functions of τ. We schematically show this behavior in Figure 6.2. From this figure, we determine that for smaller values of τ the quantity |ν_1| dominates, i.e., |1 − τλ_min| > |1 − τλ_max|, whereas for larger values of τ the quantity |ν_n| dominates, i.e., |1 − τλ_max| > |1 − τλ_min|.

FIGURE 6.2: |ν_1| and |ν_n| as functions of τ.

The value of ρ(τ) = max{ |1 − τλ_min|, |1 − τλ_max| } is shown by a bold
polygonal line in Figure 6.2; it coincides with |1 − τλ_min| before the intersection point, and after this point it coincides with |1 − τλ_max|. Consequently, the minimum value of ρ = ρ_opt is achieved precisely at the intersection, i.e., at the value of τ = τ_opt obtained from the following condition: ν_1(τ) = |ν_n(τ)| = −ν_n(τ). This condition reads:

1 − τλ_min = τλ_max − 1,

which yields:
τ_opt = 2/(λ_min + λ_max).

Consequently,

ρ_opt = ρ(τ_opt) = 1 − τ_opt λ_min = (λ_max − λ_min)/(λ_max + λ_min) = (μ(A) − 1)/(μ(A) + 1).

This expression is identical to (6.33), which completes the proof.   □
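The iteration (6.29) with τ = τ_opt can be sketched in a few lines of code. The following illustration is not from the book: the 3×3 diagonal matrix, the exact solution, and the iteration count are all hypothetical values chosen so that the eigenvalues (and hence τ_opt and ρ_opt) are known exactly.

```python
# Hypothetical illustration of the Richardson iteration (6.29) with the
# optimal parameter tau_opt = 2/(lambda_min + lambda_max).

def mat_vec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

# Diagonal SPD matrix: its eigenvalues are simply the diagonal entries.
A = [[1.0, 0.0, 0.0],
     [0.0, 4.0, 0.0],
     [0.0, 0.0, 10.0]]
lam_min, lam_max = 1.0, 10.0
tau_opt = 2.0 / (lam_min + lam_max)
rho_opt = (lam_max - lam_min) / (lam_max + lam_min)   # = (mu - 1)/(mu + 1)

x_exact = [1.0, -2.0, 0.5]
f = mat_vec(A, x_exact)

x = [0.0, 0.0, 0.0]                                   # initial guess x(0)
for _ in range(60):
    r = [ax - fi for ax, fi in zip(mat_vec(A, x), f)]  # r(p) = A x(p) - f
    x = [xi - tau_opt * ri for xi, ri in zip(x, r)]    # x(p+1) = x(p) - tau r(p)

err = max(abs(u - v) for u, v in zip(x, x_exact))
```

After 60 sweeps the error is bounded by ρ_opt^60 times the initial error, in agreement with estimate (6.31).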
Let us emphasize the following important consideration. Previously, we saw that the condition number of a matrix determines how sensitive the solution of the corresponding linear system will be to the perturbations of the input data (Section 5.3.2). The result of Theorem 6.3 provides the first evidence that the condition number also determines the rate of convergence of an iterative method. Indeed, from formula (6.33) it is clear that the closer the value of μ(A) to one, the closer the value of ρ_opt to zero, and consequently, the faster is the decay of the error according to estimate (6.31). When the condition number μ(A) increases, so does the quantity ρ_opt (while still remaining less than one), and the convergence slows down.

According to formulae (6.31) and (6.33), the optimal choice of the iteration parameter τ = τ_opt enables the following error estimate:
‖x − x(p)‖ ≤ ((1 − ξ)/(1 + ξ))^p ‖x − x(0)‖,  where ξ = λ_min/λ_max = 1/μ(A).

Moreover, Lemma 6.1 implies that this estimate is actually attained for some particular e(0), i.e., that there is an initial guess for which the inequality transforms into a precise equality. Therefore, in order to guarantee that the initial error drops by a prescribed factor in the course of the iteration, i.e., in order to guarantee the estimate:

‖x − x(p)‖ ≤ σ ‖x − x(0)‖,   (6.37)

where σ > 0 is given, it is necessary and sufficient to select p that would satisfy:

((1 − ξ)/(1 + ξ))^p < σ,  i.e.,  p > −ln σ / [ln(1 + ξ) − ln(1 − ξ)].
A more practical estimate for the number p can also be obtained. Note that

ln(1 + ξ) − ln(1 − ξ) = 2ξ Σ_{k=0}^{∞} ξ^{2k}/(2k + 1),

where

1 ≤ Σ_{k=0}^{∞} ξ^{2k}/(2k + 1) ≤ 1/(1 − ξ²).

Therefore, for the estimate (6.37) to hold it is sufficient that p satisfy:

p ≥ (μ(A)/2) ln(1/σ),  μ(A) = ξ⁻¹,   (6.38a)

and it is necessary that

p ≥ (1 − ξ²)(μ(A)/2) ln(1/σ).   (6.38b)
Altogether, the number of Richardson iterations required for reducing the initial error by a predetermined factor is proportional to the condition number of the matrix.

REMARK 6.3   In many cases, for example when approximating elliptic boundary value problems using finite differences (see, e.g., Section 5.1.3), the operator A : L ↦ L of the resulting linear system (typically, L = ℝⁿ) appears self-adjoint and positive definite (A = A* > 0) in the sense of some natural inner product. However, most often one cannot find the precise minimum and maximum eigenvalues for such operators. Instead, only the estimates a and b for the boundaries of the spectrum may be available:
0 < a ≤ λ_min ≤ λ_max ≤ b.   (6.39)
In this case, the Richardson iteration (6.29) can still be used for solving the system Ax = f.   □

The key difference, though, between the more general case outlined in Remark 6.3 and the case of Theorem 6.3, for which the precise boundaries of the spectrum are known, is the way the iteration parameter τ is selected. If instead of λ_min and λ_max we only know a and b, see formula (6.39), then the best we can do is take τ′ = 2/(a + b) instead of τ_opt = 2/(λ_min + λ_max). Then, instead of ρ_opt given by formula (6.33): ρ_opt = (λ_max − λ_min)/(λ_max + λ_min), another quantity

ρ′ = max{ |1 − τ′λ_min|, |1 − τ′λ_max| },

which is larger than ρ_opt, will appear in the guaranteed error estimate (6.31). As has been shown, for any value of τ within the limits (6.30), and for the respective value of ρ = ρ(τ) given by formula (6.32), there is always an initial guess x(0) for which estimate (6.31) becomes a precise equality. Therefore, for τ = τ′ ≠ τ_opt we obtain an unimprovable estimate (6.31) with ρ = ρ′ > ρ_opt. In doing so, the rougher the estimate for the boundaries of the spectrum, the slower the convergence.
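The effect of rough spectral bounds can be quantified with a short computation. The eigenvalue bounds and the loose estimates a, b below are hypothetical numbers picked for illustration only:

```python
# Hypothetical illustration: the convergence factor with loose spectral
# bounds, rho' = rho(2/(a+b)), is worse than rho_opt = rho(tau_opt).

lam_min, lam_max = 1.0, 10.0        # sharp boundaries of the spectrum
a, b = 0.5, 20.0                    # rough bounds, 0 < a <= lam_min, lam_max <= b

tau_opt = 2.0 / (lam_min + lam_max)
tau_prime = 2.0 / (a + b)

def rho(tau):                       # formula (6.32)
    return max(abs(1.0 - tau * lam_min), abs(1.0 - tau * lam_max))

rho_opt = rho(tau_opt)              # 9/11 ~ 0.818
rho_prime = rho(tau_prime)          # ~0.902, i.e. slower convergence
```

Both factors remain below one (the iteration still converges), but the error now shrinks by the larger factor ρ′ per step.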
Example
Let us apply the Richardson iterative method to solving the finite-difference Dirichlet problem for the Poisson equation: −Δ(h)u(h) = f(h) that we introduced in Section 5.1.3. In this case, formula (6.29) becomes:

u(p+1) = (I + τΔ(h))u(p) + τf(h).
The eigenvalues of the operator −Δ(h) are given by formula (5.109) of Section 5.7.2. In the same Section 5.7.2, we have shown that the operator −Δ(h) : V(h) ↦ V(h) is self-adjoint with respect to the natural scalar product (5.20) on V(h).
Finally, we have estimated its condition number: μ(−Δ(h)) = O(h⁻²), see formula (5.115). Therefore, according to formulae (6.37) and (6.38), when τ = τ_opt,¹ the number of iterations p required for reducing the initial error, say, by a factor of e is p ≳ ½μ(−Δ(h)) = O(h⁻²). Every iteration requires O(h⁻²) arithmetic operations and consequently, the overall number of operations is O(h⁻⁴).

In Section 5.7.2, we represented the exact solution of the system −Δ(h)u(h) = f(h) in the form of a finite Fourier series, see formula (5.110). However, in the case of a non-rectangular domain, or in the case of an equation with variable coefficients (as opposed to the Poisson equation):
∂/∂x (a ∂u/∂x) + ∂/∂y (b ∂u/∂y) = f,  a = a(x, y) > 0,  b = b(x, y) > 0,   (6.40)
we typically do not know the eigenvalues and eigenvectors of the problem and as such, cannot use the discrete Fourier series. At the same time, an iterative algorithm, such as the Richardson method, can still be implemented in quite the same way as it was done previously. In doing so, we only need to make sure that the discrete operator is self-adjoint, and also obtain reasonable estimates for the boundaries of its spectrum.

In Section 6.2, we will analyze other iterative methods for the system Ax = f, where A = A* > 0, and will show that even for an ill-conditioned operator A, say, with the condition number μ(A) = O(h⁻²), it is possible to build the methods that will be far more efficient than the Richardson iteration. A better efficiency will be achieved by obtaining a more favorable dependence of the number of required iterations p on the condition number μ(A). We will have p ≳ √μ(A) as opposed to p ≳ μ(A), which is guaranteed by formulae (6.38).

¹In this case, τ_opt = 2/(λ_{1,1} + λ_{M−1,M−1}) ≈ h²/[4(1 + sin²(πh/2))], see formulae (5.112) and (5.114).
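The O(h⁻²) growth of the iteration count can be observed directly on the 1D analog of the model problem. The sketch below is not from the book: it uses the one-dimensional operator with eigenvalues (4/h²) sin²(kπh/2) (a standard fact for the tridiagonal second-difference matrix), starts from the slowest error mode, and counts the Richardson steps needed to reduce the error tenfold; halving h should roughly quadruple the count.

```python
# Hypothetical 1D illustration: the number of Richardson iterations needed
# to reduce the error by a fixed factor grows like O(h^{-2}).
import math

def matvec_lap(u, h):
    # y = A u with A = (1/h^2) tridiag(-1, 2, -1), zero boundary values
    n = len(u)
    y = []
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        y.append((2.0 * u[j] - left - right) / h**2)
    return y

def count_richardson_steps(N, reduce_by=10.0):
    h = 1.0 / N
    lam_min = (4.0 / h**2) * math.sin(math.pi * h / 2.0)**2
    lam_max = (4.0 / h**2) * math.cos(math.pi * h / 2.0)**2
    tau = 2.0 / (lam_min + lam_max)                 # tau_opt
    u = [math.sin(math.pi * j * h) for j in range(1, N)]   # slowest error mode
    e0 = max(abs(v) for v in u)
    p = 0
    while max(abs(v) for v in u) > e0 / reduce_by:  # f = 0, exact solution = 0
        Au = matvec_lap(u, h)
        u = [ui - tau * ai for ui, ai in zip(u, Au)]
        p += 1
    return p

p10 = count_richardson_steps(10)    # coarse grid, h = 1/10
p20 = count_richardson_steps(20)    # finer grid, h = 1/20
```

Since the condition number of this operator scales like h⁻², the ratio p20/p10 comes out close to 4.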
6.1.4  Preconditioning
As has been mentioned, some alternative methods that are less expensive computationally than the Richardson iteration will be described in Section 6.2. They will have a slower than linear rate of increase of p as a function of μ(A). A complementary strategy for reducing the number of iterations p consists of modifying the system Ax = f itself, to keep the solution intact and at the same time make the condition number μ smaller.

Let P be a nonsingular square matrix of the same dimension as that of A. We can equivalently recast system (6.1) by multiplying it from the left by P⁻¹:

P⁻¹Ax = P⁻¹f.   (6.41)
The matrix P is known as a preconditioner. The solution x of system (6.41) is obviously the same as that of system (6.1). Accordingly, instead of the standard stationary Richardson iteration (6.5) or (6.6), we now obtain its preconditioned version:

x(p+1) = (I − τP⁻¹A)x(p) + τP⁻¹f,   (6.42)

or equivalently,

x(p+1) = x(p) − τP⁻¹r(p),  where  r(p) = Ax(p) − f.   (6.43)
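Iteration (6.43) can be sketched as follows. The example is hypothetical: it uses the diagonal preconditioner P = D = diag{a_ii} (which, with τ = 1, reproduces the Jacobi method discussed next) on a small diagonally dominant symmetric matrix, so that convergence is assured.

```python
# Sketch of the preconditioned Richardson iteration (6.43) with P = D
# (diagonal part of A); with tau = 1 this is exactly the Jacobi method.

def mat_vec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]              # symmetric, strictly diagonally dominant
x_exact = [1.0, 2.0, -1.0]
f = mat_vec(A, x_exact)

tau = 1.0
x = [0.0, 0.0, 0.0]
for _ in range(200):
    r = [ax - fi for ax, fi in zip(mat_vec(A, x), f)]        # r(p) = A x(p) - f
    # x(p+1) = x(p) - tau * P^{-1} r(p), where P^{-1} divides by the diagonal
    x = [xi - tau * ri / A[i][i] for i, (xi, ri) in enumerate(zip(x, r))]

err_jacobi = max(abs(u - v) for u, v in zip(x, x_exact))
```

Applying P⁻¹ here costs only a division per component, which is the point of choosing an easily invertible preconditioner.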
As a matter of fact, we have already seen some examples of a preconditioned Richardson method. By comparing formulae (6.13) and (6.42), we conclude that the Jacobi method (Example 1 of Section 6.1.1) can be interpreted as a preconditioned Richardson iteration with τ = 1 and P = D = diag{a_ii}. Similarly, by comparing formulae (6.17) and (6.42) and by noticing that

−(A − D)⁻¹D = −(A − D)⁻¹(A − (A − D)) = I − (A − D)⁻¹A,

we conclude that the Gauss-Seidel method (Example 2 of Section 6.1.1) can be interpreted as a preconditioned Richardson iteration with τ = 1 and P = A − D.

Of course, we need to remember that the purpose of preconditioning is not to analyze the equivalent system (6.41) "for the sake of it," but rather to reduce the condition number, so that μ(P⁻¹A) < μ(A) or ideally, μ(P⁻¹A) ≪ μ(A). Unfortunately, relatively little systematic theory is available in the literature for the design of efficient preconditioners. Different types of problems may require special individual tools for analysis, and we refer the reader, e.g., to [Axe94] for detail.

For our subsequent considerations, let us assume that the operator A is self-adjoint and positive definite, as in Section 6.1.3, so that we need to solve the system:
Ax = f,  A = A* > 0,  x ∈ L,  f ∈ L.   (6.44)
Introduce an operator P = P* > 0, which can be taken arbitrarily in the meantime, and multiply both sides of system (6.44) by P⁻¹, which yields an equivalent system:

Cx = g,  C = P⁻¹A,  g = P⁻¹f.   (6.45)
Note that the new operator C of (6.45) is, generally speaking, no longer selfadjoint.
Let us, however, introduce a new inner product on the space L by means of the operator P: [x, y]_P = (Px, y). Then, the operator C of (6.45) appears self-adjoint and positive definite in the sense of this new inner product. Indeed, as the inverse of a self-adjoint operator is also self-adjoint, we can write:

[Cx, y]_P = (PCx, y) = (PP⁻¹Ax, y) = (Ax, y) = (x, Ay) = (P⁻¹Px, Ay) = (Px, P⁻¹Ay) = [x, Cy]_P,
[Cx, x]_P = (PCx, x) = (PP⁻¹Ax, x) = (Ax, x) > 0,  if x ≠ 0.
As of yet, the choice of the preconditioner P for system (6.45) was arbitrary. For example, we can choose P = A and obtain C = A⁻¹A = I, which immediately yields the solution x. As such, P = A can be interpreted as the ideal preconditioner; it provides an indication of what the ultimate goal should be. However, in real life setting P = A is totally impractical. Indeed, the application of the operator P⁻¹ = A⁻¹ is equivalent to solving the system Ax = f directly, which is precisely what we are trying to avoid by employing an iterative scheme. Recall, an iterative method only requires computing Az for a given z ∈ L, but does not require computing A⁻¹f. It therefore only makes sense to select the operator P among those for which the computation of P⁻¹z for a given z is considerably easier than the computation of A⁻¹z. The other extreme, however, would be setting P = I, which does not require doing anything, but does not bring along any benefits either. In other words, the preconditioner P should be chosen so as to be easily invertible on one hand, and on the other hand, to "resemble" the operator A. In this case we can expect that the operator C = P⁻¹A will "resemble" the unit operator I, and the boundaries of its spectrum λ_min and λ_max, as well as the condition number, will all be "closer" to one.
THEOREM 6.4
Let P = P* > 0, let the two numbers γ1 > 0 and γ2 > 0 be fixed, and let the following inequalities hold:

γ1(Px, x) ≤ (Ax, x) ≤ γ2(Px, x)   (6.46)
for all x ∈ L. Then the eigenvalues λ_min(C), λ_max(C) and the condition number μ_P(C) of the operator C = P⁻¹A satisfy the inequalities:

γ1 ≤ λ_min(C) ≤ λ_max(C) ≤ γ2,  μ_P(C) ≤ γ2/γ1.   (6.47)
PROOF   From the courses of linear algebra it is known that the eigenvalues of a self-adjoint operator can be obtained in the form of the Rayleigh-Ritz quotients (see, e.g., [HJ85, Section 4.2]):

λ_min(C) = min_{x∈L, x≠0} [Cx, x]_P/[x, x]_P = min_{x∈L, x≠0} (Ax, x)/(Px, x),
λ_max(C) = max_{x∈L, x≠0} [Cx, x]_P/[x, x]_P = max_{x∈L, x≠0} (Ax, x)/(Px, x).

By virtue of (6.46), these relations immediately yield (6.47).   □
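The Rayleigh-Ritz bracketing behind Theorem 6.4 is easy to probe numerically. The 2×2 matrices A and P below are hypothetical; the check shows that every sampled quotient (Ax, x)/(Px, x) falls between the extreme eigenvalues of C = P⁻¹A, so any γ1, γ2 satisfying (6.46) must bracket the spectrum of C.

```python
# Numerical check of the Rayleigh-Ritz quotients used in the proof of
# Theorem 6.4, on a hypothetical 2x2 example.
import math
import random

A = [[3.0, 1.0],
     [1.0, 2.0]]        # A = A* > 0
P = [[2.0, 0.0],
     [0.0, 1.0]]        # P = P* > 0 (diagonal, trivially invertible)

def quad(M, x):         # (Mx, x)
    return sum(M[i][j] * x[j] * x[i] for i in range(2) for j in range(2))

random.seed(1)
quotients = []
for _ in range(2000):
    x = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    if abs(x[0]) + abs(x[1]) > 1e-12:
        quotients.append(quad(A, x) / quad(P, x))

# eigenvalues of C = P^{-1} A = [[1.5, 0.5], [1.0, 2.0]] by the quadratic formula
C = [[A[0][0] / P[0][0], A[0][1] / P[0][0]],
     [A[1][0] / P[1][1], A[1][1] / P[1][1]]]
tr = C[0][0] + C[1][1]
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
lam_min_C, lam_max_C = (tr - disc) / 2.0, (tr + disc) / 2.0
```

For this example λ_min(C) = 1 and λ_max(C) = 2.5, and all sampled quotients lie in between.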
The operators A and P that satisfy inequalities (6.46) are called equivalent by spectrum or equivalent by energy, with the equivalence constants γ1 and γ2. Let us emphasize that the transition from system (6.44) to system (6.45) is only justified if

μ_P(C) ≤ γ2/γ1 ≪ μ(A).

Indeed, since in this case the condition number of the transformed system becomes much smaller than that of the original system, the convergence of the iteration noticeably speeds up. In other words, the rate of decay of the error e(p) = x − x(p) in the norm ‖·‖_P increases substantially compared to (6.31):

‖e(p)‖_P = ‖x − x(p)‖_P ≤ ρ_P^p ‖x − x(0)‖_P = ρ_P^p ‖e(0)‖_P,  ρ_P = (μ_P(C) − 1)/(μ_P(C) + 1).

The mechanism of the increase is the drop ρ_P ≪ ρ that takes place because ρ = (μ(A) − 1)/(μ(A) + 1) according to formula (6.33), and μ_P(C) ≪ μ(A). Consequently, for one and the same value of p we will have ρ_P^p ≪ ρ^p.
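The payoff of the drop from μ(A) to μ_P(C) can be made concrete with a small calculation; the two condition numbers below are hypothetical values chosen for illustration.

```python
# Back-of-the-envelope comparison of the convergence factors rho and rho_P
# for hypothetical condition numbers mu(A) = 100 and mu_P(C) = 4.
import math

mu_A = 100.0                               # condition number of A
mu_P = 4.0                                 # condition number of C = P^{-1} A

rho = (mu_A - 1.0) / (mu_A + 1.0)          # ~0.980 without preconditioning
rho_P = (mu_P - 1.0) / (mu_P + 1.0)        # 0.6 with preconditioning

# iterations needed to reduce the error by a factor of 10^6 in each case
p_plain = math.ceil(math.log(1e6) / -math.log(rho))
p_precond = math.ceil(math.log(1e6) / -math.log(rho_P))
```

Here the preconditioned iteration needs roughly an order of magnitude fewer steps for the same error reduction.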
A typical situation when preconditioners of type (6.46) prove efficient arises in the context of discrete approximations for elliptic boundary value problems. It was first identified and studied by D'yakonov in the beginning of the sixties; the key results have then been summarized in a later monograph [D'y96]. Consider a system of linear algebraic equations:

A_n x = f,

obtained as a discrete approximation of an elliptic boundary value problem. For example, it may be a system of finite-difference equations introduced in Section 5.1.3. In doing so, the better the operator A_n approximates the original elliptic differential operator, the higher the dimension n of the space ℝⁿ is. As such, we are effectively dealing with a sequence of approximating spaces ℝⁿ, n → ∞, that we will assume Euclidean with the scalar product (x, y)⁽ⁿ⁾. Let A_n : ℝⁿ ↦ ℝⁿ be a sequence of operators such that A_n = A_n* > 0, and let A_n x = f, where x, f ∈ ℝⁿ, be a sequence of systems to be solved in the respective spaces. Suppose that the condition number μ(A_n) increases when the dimension n increases so that μ(A_n) ∼ n^s, where s > 0 is a constant. Then, according to formulae (6.38), it will take O(n^s ln σ) iterations to find the solution x ∈ ℝⁿ with the guaranteed accuracy σ > 0.

Next, let P_n : ℝⁿ ↦ ℝⁿ be a sequence of operators equivalent to the respective operators A_n by energy, with the equivalence constants γ1 and γ2 that do not depend on n. Then, for C_n = P_n⁻¹A_n we obtain:

μ_{P_n}(C_n) ≤ γ2/γ1 = const.   (6.48)
Hence we can replace the original system (6.44): A_n x = f by its equivalent (6.45):

C_n x = g_n,  C_n = P_n⁻¹A_n,  g_n = P_n⁻¹f.   (6.49)

In doing so, because of a uniform boundedness of the condition number with respect to n, see formula (6.48), the number of iterations required for reducing the ‖·‖_{P_n} norm of the initial error by a predetermined factor of σ:

‖x − x(p)‖_{P_n} ≤ σ ‖x − x(0)‖_{P_n}   (6.50)

will not increase when the dimension n increases, and will remain O(ln σ).

Furthermore, let the norms ‖x‖ = √(x, x)⁽ⁿ⁾ and ‖x‖_{P_n} = √(P_n x, x)⁽ⁿ⁾ be related to one another via the inequalities:

c1 ‖x‖ ≤ ‖x‖_{P_n} ≤ c2 n^l ‖x‖.

Then, in order to guarantee the original error estimate (6.37):

‖x − x(p)‖ ≤ σ ‖x − x(0)‖   (6.51)

when iterating system (6.49), it is sufficient that the following inequality hold:

‖x − x(p)‖_{P_n} ≤ σ n^{−l} ‖x − x(0)‖_{P_n}.   (6.52)
Inequality (6.52) is obtained from (6.50) by replacing σ with σn^{−l}, so that solving the preconditioned system (6.49) by the Richardson method will only require O(−ln(σn^{−l})) = O(ln n − ln σ) iterations, as opposed to O(n^s ln σ) iterations required for solving the original non-preconditioned system (6.44).

Of course, the key question remains of how to design a preconditioner equivalent by spectrum to the operator A, see (6.46). In the context of elliptic boundary value problems, good results can often be achieved when preconditioning a discretized operator with variable coefficients, such as the one from equation (6.40), with the discretized Laplace operator. On a regular grid, a preconditioner of this type P = −Δ(h), see Section 5.1.3, can be easily inverted with the help of the FFT, see Section 5.7.3.
Overall, the task of designing an efficient preconditioner is highly problem-dependent. One general approach is based on availability of some a priori knowledge of where the matrix A originates from. A typical example here is the aforementioned spectrally equivalent elliptic preconditioners. Another approach is purely algebraic and only uses the information contained in the structure of a given matrix A. Examples include incomplete factorizations (LU, Cholesky, modified unperturbed and perturbed incomplete LU), polynomial preconditioners (e.g., truncated Neumann series), and various ordering strategies, foremost the multilevel recursive orderings that are conceptually close to the idea of multigrid (Section 6.4). For further detail, we refer the reader to specialized monographs [Axe94, Saa03, vdV03].

6.1.5  Scaling
One reason for a given matrix A to be poorly conditioned, i.e., to have a large condition number μ(A), may be large disparity in the magnitudes of its entries. If this is the case, then scaling the rows of the matrix so that the largest magnitude among the entries in each row becomes equal to one often helps improve the conditioning. Let D be a nonsingular diagonal matrix (d_ii ≠ 0), and instead of the original system Ax = f let us consider its equivalent:

DAx = Df.   (6.53)
The entries d_ii, i = 1, 2, …, n, of the matrix D are to be chosen so that the maximum absolute value of the entry in each row of the matrix DA be equal to one:

max_j |d_ii a_ij| = 1,  i = 1, 2, …, n.   (6.54)
The transition from the matrix A to the matrix DA is known as scaling (of the rows of A). By comparing equations (6.53) and (6.41) we conclude that it can be interpreted as a particular approach to preconditioning with P⁻¹ = D. Note that different strategies of scaling can be employed; instead of (6.54) we can require, for example, that all diagonal entries of DA have the same magnitude. Scaling typically reduces the condition number of a system: μ(DA) < μ(A).

To solve the system Ax = f by iterations, we can first transform it to an equivalent system Cx = g with a self-adjoint positive definite matrix C = A*A and the right-hand side g = A*f, and then apply the Richardson method. According to Theorem 6.3, the rate of convergence of the Richardson iteration will be determined by the condition number μ(C). It is possible to show that for the Euclidean condition numbers we have: μ(C) = μ²(A) (see Exercise 7 after Section 5.3). If the matrix A is scaled ahead of time, see formula (6.53), then the convergence of the iterations will be faster, because μ((DA)*(DA)) = μ²(DA) < μ²(A). Note that the transition from a given A to C = A*A is almost never used in practice as a means of enabling the solution by iterations that require a self-adjoint matrix, because an additional matrix multiplication may eventually lead to large errors when computing with finite precision. Therefore, the foregoing example shall only be regarded as an illustration.
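The construction (6.54) amounts to one pass over the rows. In the hypothetical example below, the matrix mixes entries of very different magnitudes, and each row of DA ends up with maximum absolute entry equal to one:

```python
# Row scaling (6.54): d_ii = 1 / max_j |a_ij| makes the largest entry of each
# row of DA equal to one in absolute value. The matrix is hypothetical.

A = [[1000.0, 2.0, 1.0],
     [2.0, 3.0, 1.0],
     [1.0, 1.0, 0.002]]

d = [1.0 / max(abs(a) for a in row) for row in A]
DA = [[d[i] * a for a in row] for i, row in enumerate(A)]

row_maxes = [max(abs(a) for a in row) for row in DA]
```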
6.2  Chebyshev Iterations and Conjugate Gradients

For a system of linear algebraic equations:

Ax = f,  A = A* > 0,  x ∈ ℝⁿ,  f ∈ ℝⁿ,   (6.60)
we will describe two iterative methods of solution that offer a better performance (faster convergence) compared to the Richardson method of Section 6.1. We will also discuss the conditions that may justify preferring one of these methods over the other. The two methods are known as the Chebyshev iterative method and the method of conjugate gradients; both are described in detail, e.g., in [SN89b].

As we require that A = A* > 0, all eigenvalues λ_j, j = 1, 2, …, n, of the operator A are strictly positive. With no loss of generality, we will assume that they are arranged in ascending order. We will also assume that two numbers a > 0 and b > 0 are known such that:

a ≤ λ_1,  λ_n ≤ b.   (6.61)

The two numbers a and b in formula (6.61) are called boundaries of the spectrum of the operator A. If a = λ_1 and b = λ_n, these boundaries are referred to as sharp. As in Section 6.1.3, we will also introduce their ratio:

ξ = a/b < 1.

If the boundaries of the spectrum are sharp, then clearly ξ = μ(A)⁻¹, where μ(A) is the Euclidean condition number of A (Theorem 5.3).

6.2.1  Chebyshev Iterations
Let us specify the initial guess x(0) ∈ ℝⁿ arbitrarily, and let us then compute the iterates x(p), p = 1, 2, …, according to the following formulae:

x(1) = (I − τA)x(0) + τf,
x(p+1) = α_{p+1}(I − τA)x(p) + (1 − α_{p+1})x(p−1) + τα_{p+1}f,  p = 1, 2, …,   (6.62a)
where the parameters τ and α_p are given by:

τ = 2/(a + b),  α_1 = 2,  α_{p+1} = 4/(4 − ρ_0²α_p),  ρ_0 = (1 − ξ)/(1 + ξ),  p = 1, 2, … .   (6.62b)
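A second order Chebyshev scheme of the form (6.62a) can be sketched as follows. The parameter recurrence used here (τ = 2/(a+b), α_1 = 2, α_{p+1} = 4/(4 − ρ_0²α_p) with ρ_0 = (1−ξ)/(1+ξ)) is one standard choice and is an assumption to be checked against [SN89b]; the diagonal test matrix is hypothetical.

```python
# Sketch of a second order Chebyshev iteration of the form (6.62) on a
# hypothetical diagonal SPD matrix with sharp bounds a = 1, b = 10.

def mat_vec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

A = [[1.0, 0.0],
     [0.0, 10.0]]
a, b = 1.0, 10.0                 # sharp spectral bounds
xi = a / b
tau = 2.0 / (a + b)
rho0 = (1.0 - xi) / (1.0 + xi)

x_exact = [1.0, -1.0]
f = mat_vec(A, x_exact)

def sweep(x):                    # one Richardson-type sweep: (I - tau A)x + tau f
    Ax = mat_vec(A, x)
    return [u - tau * (av - fv) for u, av, fv in zip(x, Ax, f)]

x_prev = [0.0, 0.0]
x_cur = sweep(x_prev)            # first iteration coincides with Richardson
alpha = 2.0
for _ in range(29):
    alpha = 4.0 / (4.0 - rho0**2 * alpha)
    y = sweep(x_cur)
    x_next = [alpha * yv + (1.0 - alpha) * xv for yv, xv in zip(y, x_prev)]
    x_prev, x_cur = x_cur, x_next

err_cheb = max(abs(u - v) for u, v in zip(x_cur, x_exact))
```

After 30 iterations the error is far smaller than what 30 plain Richardson sweeps would give on the same matrix.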
We see that the first iteration of (6.62a) coincides with that of the Richardson method, and altogether formulae (6.62) describe a second order linear non-stationary iteration scheme. It is possible to show that the error ε(p) = x − x(p) of the iterate x(p) satisfies the estimate:

‖ε(p)‖ ≤ (2ρ_1^p/(1 + ρ_1^{2p})) ‖ε(0)‖,  ρ_1 = (1 − √ξ)/(1 + √ξ),  p = 1, 2, …,   (6.63)
where ‖·‖ = √(·, ·) is a Euclidean norm on ℝⁿ. Based on formula (6.63), the analysis very similar to the one performed in the end of Section 6.1.3 will lead us to the conclusion that in order to reduce the norm of the initial error by a predetermined factor σ, i.e., in order to guarantee the following estimate:

‖x − x(p)‖ ≤ σ ‖x − x(0)‖,   (6.64)

it is sufficient to choose the number of iterations p so that:

p ≥ (1/(2√ξ)) ln(2/σ).   (6.65)

This number is about √(1/ξ) times smaller than the number of iterations required for achieving the same error estimate (6.64) using the Richardson method. In particular, when the sharp boundaries of the spectrum are available, we have:

ξ = a/b = 1/μ(A).
A Theoretical Introduction to Numerical Analysis
1 96
known as the first order Chebyshev iterative method, see [SN89b 1 . Its key advantage is that for computing the iterate x(p+ I ) this method only requires the knowledge of x( p) so that the iterate X(p  I ) no longer needs to be stored in the computer memory. However, the first order Chebyshev method is prone to computational instabilities. 6.2.2
Conjugate Gradients
The method of conjugate gradients was introduced in Section 5.6 as a direct method for solving linear systems of type (6.60). Starting from an arbitrary initial guess x(O) E ]RtI, the method computes successive approximations x(p ), p = 1 , 2, . . . , to the solution x of Ax = 1 according to the formulae (see Exercise 2 after Sec tion 5.6):
x( l ) = (1  rI A)x(O) + rtf, x(p+ i) = Yp+ 1 (/  rp + IA)x( p) + ( 1  Yp+I )X(P J ) + Yp+1 rp+Ji,
(6.66a)
where
(r( p) , r(p) ) p p+ 1  (Ar(p) , r(p) ) ' r( ) = Ax(p) I, p  0 1 2 , . . , (6.66b) rp + 1 (r(p) , r(p ) ) YI = I , Yp+ l = 1  :r;; p l p l ) p = I , 2, . . . . (r( ), r( ) yp . After at most n iterations, the method produces the exact solution x of system (6.60), ,.. •
_
1 ) 1
(
,
,
.
'
and the computational complexity of obtaining this solution is 0'(n3 ) arithmetic oper ations, i.e., asymptotically the same as that of the Gaussian elimination (Section 5.4). However, in practice the method of conjugate gradients is never used as a direct method. The reason is that its convergence to the exact solution after a finite num ber of steps can only be guaranteed if the computations are conducted with infinite precision. On a real computer with finite precision the method is prone to instabili ties, especially when the condition number .u(A) is large. Therefore, the method of conjugate gradients is normally used in the capacity of an iteration scheme. It is a second order linear nonstationary iterative method and a member of a larger family of iterative methods known as methods of conjugate directions, see [SN89b]. The rate of convergence of the method of conjugate gradients (6.66) is at least as fast as that of the Chebyshev method (6.62) when the latter is built based on the knowledge of the sharp boundaries a and b, see formula (6. 6 1 ). In other words, the same error estimate (6.63) as pertains to the Chebyshev method is guaranteed by the method of conjugate gradients as well. In doing so, the quantity ; in formula (6.63) needs to be interpreted as: ;
_
a
b
_
Amin
_
1
 Amax  .u (A) "
The key advantage of the method of conjugate gradients compared to the Chebyshev method is that the boundaries of the spectrum do not need to be known explicitly, yet
Iterative Methods jar Solving Linear Systems
1 97
the method guarantees that the number of iterations required to reduce the initial error by a prescribed factor will only be proportional to the square root of the condition number J,u(A), see formula (6.65). The key shortcoming of the method of conjugate gradients is its computational instability that manifests itself stronger for larger condition numbers ,u (A ). The most favorable situation for applying the method of conjugate gradients is when it is known that the condition number ,u (A ) is not too large, while the boundaries of the spectrum are unknown, and the dimension n of the system is much higher than the number of iterations p needed for achieving a given accuracy. In practice, one can also use special stabilization procedures for the method of conjugate gradients, as well as implement restarts after every so many iterations. The latter, however, slow down the convergence. Note also that both the Chebyshev iterative method and the method of conjugate gradients can be preconditioned, similarly to how the Richardson method was pre conditioned in Section 6. 1 .4. Preconditioning helps reduce the condition number and as such, speeds up the convergence. Exercises
1. Consider the same second order central difference scheme as the one built in Exercise 4 after Section 5.4 (page 156):
Uj+ l  2uj + Uj t h2
 Uj = h,
1 h = N ' 110 = 0,
j=
1 , 2, . . . ,N  I ,
UN = O.
a) Write down this scheme as a system of linear algebraic equations with an (N 1) x (N  1 ) matrix A. Use the apparatus of finite Fourier series (Section 5.7) to prove that the matrix A is symmetric positive definite: A = A * > 0, find sharp boundaries of its spectrum Amin and Amax, end estimate its Euclidean condition number J1 ( A) as it depends on h = 1/ N for large N. Hint.
j,k
=
lJIY)
The eigenvectors and eigenvalues of A are given by = J2 sin kZ.i , 1 , 2, . . . , N, and Ak = � sin2 �� + k = 1 , 2, . . . ,N  I , respectively.
b) Solve the system
(
I),
Au = f
on the computer by iterations, with the initial guess u(O) = 0 and the same right hand side f as in in Exercise 4 after Section 5.4: h = f(xj), j = I , . . . , N  1 , where f(x) = (_n2 sin(nx) + 2ncos(nx) ) e< and Xj = j . h ; the corresponding values of the exact solution are u(xj), j = 1, . . . ,N  1, where u(x) = sin(nx)e 0, where x &1 E ]Rn, by building a sequence of descent directions d(p), p = 0 , 1 , 2, . . . , and a sequence of iterates x(p), p = 0 , 1 , 2 , . . . , such that every residual r(p+ l ) = Ax(p + I ) 1 is orthogonal to all the previous descent directions dU) , j = 0 , 1 , . . . , p. In their own turn, the descent directions dU) are chosen Aconjugate, i.e., orthogonal in the sense of the inner product defined by A, so that for every p = 1 , 2, . . . : [dU) , d(P ) ]A = 0, j = 0 , 1, . . . , P  1. Then, given an arbitrary d(O), the pool of all available linearly independent directions dU), j = 0 , 1 , 2, . . . , p, in the space ]Rn gets exhausted at 1 p = n  1 , and because of the orthogonality: r( p + l ) 1. span{ d(O) , d( ), . . . , d( p ) } , the method converges to the exact solution x after at most n steps. A key provision that facilitates implementation of the method of conjugate gra dients is that a scalar product can be introduced in the space ]Rn by means of the matrix A. To enable that, the matrix A must be symmetric positive definite. Then, a system of linearly independent directions dU) can be built in ]Rn using the notion of orthogonality in the sense of [ . , · ]A. Iterative methods based on Krylov subspaces also exploit the idea of building a sequence of linearly independent directions in the space ]Rn . Then, some quantity related to the error is minimized on the subspaces that span those directions. However, the Krylov subspace iterations apply to more general systems Ax = 1 rather than only A = A * > 0. In doing so, the orthogonality in the sense of [ . 
, ' ]A is no longer available, and the problem of building the linearly independent directions for the iteration scheme becomes more challenging. In this section, we only provide a very brief and introductory account of Krylov subspace iterative methods. This family of methods was developed fairly recently, and in many aspects it still represents an active research area. For further detail, we refer the reader, e.g., to [Axe94, Saa03, vdV03].
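The finite-termination property of conjugate gradients recalled above is easy to observe numerically. The following sketch (an illustration assumed here, not the book's own code) runs the textbook CG loop on a small symmetric positive definite system and checks that the residual vanishes after at most n steps, up to round-off.

```python
import numpy as np

def conjugate_gradients(A, f, x0):
    """Textbook CG sketch for A = A* > 0: each step makes the new residual
    orthogonal to the current descent direction, and directions are A-conjugate."""
    x = x0.astype(float)
    r = A @ x - f                         # residual r^(p) = A x^(p) - f
    d = -r                                # initial descent direction d^(0)
    for _ in range(len(f)):
        if np.linalg.norm(r) < 1e-13:
            break
        Ad = A @ d
        alpha = -(r @ d) / (d @ Ad)       # step making r^(p+1) orthogonal to d^(p)
        x = x + alpha * d
        r = A @ x - f
        beta = (r @ Ad) / (d @ Ad)        # enforces A-conjugacy [d^(j), d^(p)]_A = 0
        d = -r + beta * d
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)               # a symmetric positive definite test matrix
f = rng.standard_normal(5)
x = conjugate_gradients(A, f, np.zeros(5))
print(np.linalg.norm(A @ x - f))          # tiny: finite termination after <= n steps
```

In exact arithmetic the loop would terminate with a zero residual after at most n = 5 iterations; in floating point the residual is small but nonzero.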
Iterative Methods for Solving Linear Systems

6.3.1 Definition of Krylov Subspaces
Consider a non-preconditioned non-stationary Richardson iteration (6.7):

x^{(p+1)} = x^{(p)} − τ_p r^{(p)},   p = 0, 1, 2, . . . .    (6.67)

For the residuals r^{(p)} = Ax^{(p)} − f, formula (6.67) yields:

r^{(p+1)} = r^{(p)} − τ_p A r^{(p)} = (I − τ_p A) r^{(p)},   p = 0, 1, 2, . . . .

Consequently, we obtain:

r^{(p)} = (I − τ_{p−1} A)(I − τ_{p−2} A) · · · (I − τ_0 A) r^{(0)}.    (6.68)
In other words, the residual r^{(p)} can be written in the form r^{(p)} = Q_p(A) r^{(0)}, where Q_p(A) is a matrix polynomial with respect to A of degree no greater than p.

DEFINITION 6.1   For a given n × n matrix A and a given u ∈ ℝⁿ, the Krylov subspace of order m is defined as a linear span of the m vectors obtained by successively applying the powers of A to the vector u:

K_m(A, u) = span{u, Au, . . . , A^{m−1} u}.    (6.69)

Clearly, the Krylov subspace is the space of all n-dimensional vectors that can be represented in the form Q_{m−1}(A) u, where Q_{m−1} is a polynomial of degree no greater than m − 1. From formulae (6.68) and (6.69) it is clear that

r^{(p)} ∈ K_{p+1}(A, r^{(0)}),   p = 0, 1, 2, . . . .
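The product form (6.68) of the residual can be verified directly. The short check below (an illustration with hypothetical iteration parameters τ_p) runs a few Richardson steps and compares the resulting residual with the polynomial expression applied to r^{(0)}.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
f = rng.standard_normal(n)
taus = [0.1, 0.05, 0.2]                 # hypothetical iteration parameters tau_p
x = np.zeros(n)
r0 = A @ x - f
r = r0
for tau in taus:
    x = x - tau * r                     # Richardson step (6.67)
    r = A @ x - f
P = np.eye(n)                           # accumulate (I - tau_{p-1}A) ... (I - tau_0 A)
for tau in taus:
    P = (np.eye(n) - tau * A) @ P
print(np.allclose(r, P @ r0))           # True: r^(p) = Q_p(A) r^(0), as in (6.68)
```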
As for the iterates x^{(p)} themselves, from formula (6.67) we find:

x^{(p)} = x^{(0)} − Σ_{j=0}^{p−1} τ_j r^{(j)}.

Therefore, we can write:

x^{(p)} ∈ N_p ≡ x^{(0)} + K_p(A, r^{(0)}),    (6.70)

which means that x^{(p)} belongs to an affine space N_p obtained as a translation of K_p(A, r^{(0)}) from Definition 6.1 by a fixed vector x^{(0)}. Of course, a specific x^{(p)} ∈ N_p can be chosen in a variety of ways. Different iterative methods from the Krylov subspace family differ precisely in how one selects a particular x^{(p)} ∈ N_p, see (6.70). Normally, one employs some optimization criterion. Of course, the best thing would be to minimize the norm of the error ‖x − z‖:

x^{(p)} = arg min_{z ∈ N_p} ‖x − z‖,
A Theoretical Introduction to Numerical Analysis
where x is the solution to Ax = f. This, however, is obviously impossible in practice, because the solution x is not known. Therefore, alternative strategies must be used. For example, one can enforce the orthogonality r^{(p)} ⊥ K_p(A, r^{(0)}), i.e., require

∀u ∈ K_p(A, r^{(0)}):   (u, Ax^{(p)} − f) = 0.

This leads to the Arnoldi method, which is also called the full orthogonalization method (FOM). This method is known to reduce to conjugate gradients (Section 5.6) if A = A* > 0. Alternatively, one can minimize the Euclidean norm of the residual r^{(p)} = Ax^{(p)} − f:

x^{(p)} = arg min_{z ∈ N_p} ‖Az − f‖₂.
This strategy defines the so-called method of generalized minimal residuals (GMRES) that we describe in Section 6.3.2.

It is clear, though, that before FOM, GMRES, or any other Krylov subspace method can actually be implemented, we need to thoroughly describe the minimization search space N_p of (6.70). This is equivalent to describing the Krylov subspace K_p(A, r^{(0)}). In other words, we need to construct a basis in the space K_p(A, r^{(0)}) introduced by Definition 6.1 for p = m and r^{(0)} = u.

The first known result is that, in general, the dimension of the space K_m(A, u) is a non-decreasing function of m. This dimension, however, may actually be lower than m, as it is not guaranteed ahead of time that all the vectors u, Au, A²u, . . . , A^{m−1}u are linearly independent.

For a given m, one can obtain an orthonormal basis in the space K_m(A, u) with the help of the so-called Arnoldi process, which is based on the well-known Gram-Schmidt orthonormalization algorithm (all norms are Euclidean):

u_1 = u / ‖u‖,
v_1 = Au_1 − (u_1, Au_1) u_1,   u_2 = v_1 / ‖v_1‖,
v_2 = Au_2 − (u_1, Au_2) u_1 − (u_2, Au_2) u_2,   u_3 = v_2 / ‖v_2‖,
. . . . . . . . . . . . . . . . . . . . . . . . . . .    (6.71)
v_k = Au_k − Σ_{j=1}^{k} (u_j, Au_k) u_j,   u_{k+1} = v_k / ‖v_k‖,
. . . . . . . . . . . . . . . . . . . . . . . . . . .

Obviously, all the resulting vectors u_1, u_2, . . . , u_k, . . . are orthonormal. If the Arnoldi process terminates at step m, then the vectors {u_1, u_2, . . . , u_m} will form an orthonormal basis in K_m(A, u). The process can also terminate prematurely, i.e., yield v_k = 0 at some k < m. This will indicate that the dimension of the corresponding Krylov subspace is lower than m. Note also that the classical Gram-Schmidt orthogonalization is prone to numerical instabilities. Therefore, in practice one often uses its stabilized version (see Remark 7.4 on page 219). The latter is not completely fail-proof either, yet it is more robust, albeit somewhat more expensive computationally.
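The Arnoldi process (6.71) translates almost line by line into code. The sketch below is an assumed illustration (the function name and test matrix are made up); it builds the orthonormal vectors by the classical Gram-Schmidt form and stops early if some v_k vanishes.

```python
import numpy as np

def arnoldi_basis(A, u, m):
    """Arnoldi process (6.71): orthonormal basis u_1, ..., u_m of K_m(A, u),
    classical Gram-Schmidt form; stops early if some v_k vanishes."""
    U = np.zeros((len(u), m))
    U[:, 0] = u / np.linalg.norm(u)
    for k in range(m - 1):
        w = A @ U[:, k]                          # the vector A u_k
        v = w - sum((U[:, j] @ w) * U[:, j] for j in range(k + 1))
        nv = np.linalg.norm(v)
        if nv == 0:                              # premature termination: dim K_m < m
            return U[:, :k + 1]
        U[:, k + 1] = v / nv
    return U

rng = np.random.default_rng(2)
A = rng.standard_normal((7, 7))
u = rng.standard_normal(7)
U = arnoldi_basis(A, u, 4)
print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # True: the columns are orthonormal
```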
Formulae (6.71) imply that each vector Au_k can be expanded with respect to the orthonormal vectors u_1, . . . , u_k, u_{k+1}. Consequently, denoting by U_k the matrix with columns u_1, . . . , u_k, we can write:

A U_k = U_{k+1} H_k,    (6.72)

where H_k is a matrix in the upper Hessenberg form. This matrix has k + 1 rows and k columns, and all its entries below the first subdiagonal are equal to zero:

H_k = { η_{i,j} | i = 1, 2, . . . , k + 1,  j = 1, 2, . . . , k,  η_{i,j} = 0 for i > j + 1 }.
Having introduced the procedure for building a basis in a Krylov subspace, we can now analyze a particular iteration method, the GMRES.

6.3.2 GMRES

In GMRES, we are minimizing the Euclidean norm of the residual, ‖·‖ ≡ ‖·‖₂:

‖r^{(p)}‖ = ‖Ax^{(p)} − f‖ = min_{z ∈ N_p} ‖Az − f‖,   N_p = x^{(0)} + K_p(A, r^{(0)}).    (6.73)
Let U_p be an n × p matrix with orthonormal columns such that these columns form a basis in the space K_p(A, r^{(0)}); this matrix is obtained by means of the Arnoldi process described in Section 6.3.1. Then we can write:

x^{(p)} = x^{(0)} + U_p w^{(p)},

where w^{(p)} is a vector of dimension p that needs to be determined. Hence, for the residual r^{(p)} we have, according to formula (6.72):

r^{(p)} = r^{(0)} + A U_p w^{(p)} = r^{(0)} + U_{p+1} H_p w^{(p)} = U_{p+1} (q^{(p+1)} + H_p w^{(p)}),    (6.74)

where q^{(p+1)} is a (p + 1)-dimensional vector defined as follows:

q^{(p+1)} = [‖r^{(0)}‖, 0, 0, . . . , 0]^T.

Indeed, as the first column of the matrix U_{p+1} generated by the Arnoldi process (6.71) is given by r^{(0)}/‖r^{(0)}‖, we clearly have U_{p+1} q^{(p+1)} = r^{(0)} in formula (6.74). Consequently, minimization in the sense of (6.73) reduces to:

‖r^{(p)}‖ = min_{w^{(p)} ∈ ℝ^p} ‖q^{(p+1)} + H_p w^{(p)}‖.    (6.75)

Equality (6.75) holds because U*_{p+1} U_{p+1} = I_{p+1}, which implies that for the Euclidean norm ‖·‖ ≡ ‖·‖₂ we have: ‖U_{p+1}(q^{(p+1)} + H_p w^{(p)})‖ = ‖q^{(p+1)} + H_p w^{(p)}‖.
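The pieces assembled so far — the Arnoldi basis, the Hessenberg matrix (6.72), and the small minimization (6.75) — combine into a bare-bones GMRES. The sketch below is an assumed illustration; it solves (6.75) with a generic least squares routine rather than the QR-based update used in production implementations.

```python
import numpy as np

def gmres_sketch(A, f, x0, p):
    """Bare-bones GMRES: p Arnoldi steps, then the small least squares
    problem (6.75) for w^(p); returns x^(p) = x^(0) + U_p w^(p)."""
    n = len(f)
    r0 = A @ x0 - f
    U = np.zeros((n, p + 1))
    H = np.zeros((p + 1, p))                    # (p+1) x p upper Hessenberg, cf. (6.72)
    U[:, 0] = r0 / np.linalg.norm(r0)
    for k in range(p):
        v = A @ U[:, k]
        for j in range(k + 1):
            H[j, k] = U[:, j] @ v
            v -= H[j, k] * U[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] > 1e-12:                 # guard against (welcome) breakdown
            U[:, k + 1] = v / H[k + 1, k]
    q = np.zeros(p + 1)
    q[0] = np.linalg.norm(r0)                   # the vector q^(p+1)
    w, *_ = np.linalg.lstsq(H, -q, rcond=None)  # minimize ||q + H w||, cf. (6.75)
    return x0 + U[:, :p] @ w

rng = np.random.default_rng(3)
n = 8
A = rng.standard_normal((n, n)) + 4 * np.eye(n)
f = rng.standard_normal(n)
x = gmres_sketch(A, f, np.zeros(n), n)          # p = n: GMRES acts as a direct solver
print(np.linalg.norm(A @ x - f))
```

With p = n the method recovers the exact solution up to round-off, in line with the discussion of GMRES as a formally direct method below.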
Problem (6.75) is a typical problem of solving an overdetermined system of linear algebraic equations in the sense of the least squares. Indeed, the matrix H_p has p + 1 rows and p columns, i.e., there are p + 1 equations and only p unknowns. Solutions of such systems can, generally speaking, only be found in the weak sense, in particular, in the sense of the least squares. The concept of weak, or generalized, solutions, as well as the methods for their computation, are discussed in Chapter 7. In the meantime, let us mention that if the Arnoldi orthogonalization process (6.71) does not terminate on or before k = p, then the minimization problem (6.75) has a unique solution. As shown in Section 7.2, this is an implication of the matrix H_p being full rank. The latter assertion, in turn, is true because, according to formula (6.72), equation number k of (6.71) can be recast as follows:

A u_k = v_k + Σ_{j=1}^{k} η_{jk} u_j = u_{k+1} ‖v_k‖ + Σ_{j=1}^{k} η_{jk} u_j = Σ_{j=1}^{k+1} η_{jk} u_j,

which means that η_{k+1,k} = ‖v_k‖ ≠ 0. As such, all columns of the matrix H_p are linearly independent, since every column has an additional non-zero entry η_{k+1,k} compared to the previous column. Consequently, the vector w^{(p)} can be obtained as a solution to the linear system:

H*_p H_p w^{(p)} = −H*_p q^{(p+1)}.    (6.76)

The solution w^{(p)} of system (6.76) is unique because the matrix H*_p H_p is non-singular (Exercise 2). In practice, one does not normally reduce the least squares minimization problem (6.75) to the linear system (6.76), since this reduction may lead to the introduction of large additional errors (amplification of round-off). Instead, problem (6.75) is solved using the QR factorization of the matrix H_p, see Section 7.2.2.

Note that in the course of the previous analysis we assumed that the dimension of the Krylov subspaces K_p(A, r^{(0)}) would increase monotonically as a function of p. Let us now see what happens in the alternative situation, i.e., if the Arnoldi process terminates prematurely.

THEOREM 6.5
Let p be the smallest integer number for which the Arnoldi process (6.71) terminates:

A u_p − Σ_{j=1}^{p} (u_j, A u_p) u_j = 0.
Then the corresponding iterate yields the exact solution:

x^{(p)} = x = A^{−1} f.

PROOF   By hypothesis of the theorem, A u_p ∈ K_p. Consequently, A K_p ⊂ K_p. This implies [cf. formula (6.72)]:

A U_p = U_p H̃,    (6.77)

where H̃ is a p × p matrix. This matrix is nonsingular, because otherwise it would have had linearly dependent rows. Then, according to formula (6.77), each column of A U_p could be represented as a linear combination of only a subset of the columns from U_p rather than as a linear combination of all of its columns. This, in turn, means that the Arnoldi process terminates earlier than k = p, which contradicts the hypothesis of the theorem.

For the norm of the residual r^{(p)} = Ax^{(p)} − f we can write:

‖r^{(p)}‖ = ‖A(x^{(p)} − x^{(0)}) + Ax^{(0)} − f‖ = ‖A(x^{(p)} − x^{(0)}) + r^{(0)}‖.    (6.78)

Next, we notice that since x^{(p)} ∈ N_p, then x^{(p)} − x^{(0)} ∈ K_p(A, r^{(0)}), and consequently ∃w ∈ ℝ^p: x^{(p)} − x^{(0)} = U_p w, because the columns of the matrix U_p provide a basis in the space K_p(A, r^{(0)}). Let us also introduce a p-dimensional vector q^{(p)} with real components:

q^{(p)} = [‖r^{(0)}‖, 0, 0, . . . , 0]^T,

so that U_p q^{(p)} = r^{(0)}. Then, taking into account equality (6.77), as well as the orthonormality of the columns of the matrix U_p, U*_p U_p = I_p, we obtain from formula (6.78):

‖r^{(p)}‖ = ‖U_p H̃ w + U_p q^{(p)}‖ = ‖H̃ w + q^{(p)}‖.    (6.79)

Finally, we recall that on every iteration of GMRES we minimize the norm of the residual: ‖r^{(p)}‖ → min. Then, we can simply set w = −H̃^{−1} q^{(p)} in formula (6.79), which immediately yields ‖r^{(p)}‖ = 0. This is obviously a minimum of the norm, and it implies r^{(p)} = 0, i.e., Ax^{(p)} = f ⟹ x^{(p)} = A^{−1} f = x.   □

We can now summarize the two possible scenarios of behavior of the GMRES iteration. If the Arnoldi process terminates prematurely at some p < n (n is the dimension of the space), then, according to Theorem 6.5, x^{(p)} is the exact solution to Ax = f. Otherwise, the maximum number of iterations that the GMRES can perform is equal to n. Indeed, if the Arnoldi process does not terminate prematurely, then U_n will contain n linearly independent vectors of dimension n, and consequently K_n(A, r^{(0)}) = ℝⁿ. As such, the last minimization of the residual in the sense of (6.73) will be performed over the entire space ℝⁿ, which obviously yields the exact solution x = A^{−1} f. Therefore, technically speaking, the GMRES can be regarded as a direct method for solving Ax = f, in much the same way as we regarded the method of conjugate gradients as a direct method (see Section 5.6). In practice, however, the GMRES is never used in the capacity of a direct solver; it is only used as an iterative scheme. The reason is that for high dimensions n it is only feasible to perform very few iterations, and one should hope that the approximate solution obtained after these iterations will be sufficiently accurate in the given context. The limitations on the number of iterations come primarily from the large storage
A matrix B: ℝ^m ↦ ℝ^m, B = B* > 0 [recall, ∀f ∈ ℝ^m, f ≠ 0: (Bf, f) > 0], defines a scalar product as follows:

[f, g]_B^{(m)} = (Bf, g)^{(m)},   f, g ∈ ℝ^m.    (7.12)
It is also known that any scalar product on ℝ^m (see the axioms given on page 126) can be represented by formula (7.12) with an appropriate choice of the matrix B. In general, system (7.8) [or, alternatively, system (7.10)] does not have a classical solution. In other words, no set of n numbers {x_1, x_2, . . . , x_n} will simultaneously turn every equation of (7.8) into an identity. We therefore define a solution of system (7.8) in the following weak sense.

DEFINITION 7.1   Let B: ℝ^m ↦ ℝ^m, B = B* > 0, be fixed. Introduce a scalar function Φ = Φ(x) of the vector argument x ∈ ℝⁿ:

Φ(x) = [Ax − f, Ax − f]_B^{(m)}.    (7.13)

A generalized, or weak, solution x_B of system (7.8) is the vector x_B ∈ ℝⁿ that delivers a minimum value to the quadratic function (7.13).
REMARK 7.1   There is a fair degree of flexibility in choosing the matrix B. The meaning of this matrix is that of a "weight" matrix (see Section 7.1.1). It can be chosen based on some "external" considerations to emphasize certain components of the residual of system (7.10) and de-emphasize the others.   □

REMARK 7.2   The functions Φ = Φ(x_1, x_2) introduced by formulae (7.3) and (7.6) as particular examples of the method of least squares can also be obtained according to the general formula (7.13). In this sense, Definition 7.1 can be regarded as the most general definition of the least squares weak solution for a full rank system.   □
THEOREM 7.1
Let the rectangular matrix A of (7.9) have full rank equal to n, i.e., let it have linearly independent columns. Then, there exists one and only one generalized solution x_B of system (7.8) in the sense of Definition 7.1. This generalized solution coincides with the classical solution of the system:

A* B A x = A* B f    (7.14)

that contains n linear algebraic equations with respect to precisely as many scalar unknowns: x_1, x_2, . . . , x_n.
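Before turning to the proof, the claim of Theorem 7.1 can be illustrated numerically. The sketch below (names and test data are assumptions of this illustration) solves the n × n system (7.14) for a random full-rank A and a random symmetric positive definite weight B, and checks that the result indeed minimizes the quadratic function (7.13).

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A = rng.standard_normal((m, n))       # full rank with probability one
f = rng.standard_normal(m)
M = rng.standard_normal((m, m))
B = M @ M.T + m * np.eye(m)           # a symmetric positive definite "weight" matrix

# classical solution of the n x n system (7.14)
xB = np.linalg.solve(A.T @ B @ A, A.T @ B @ f)

def phi(x):                           # the quadratic function (7.13)
    r = A @ x - f
    return r @ (B @ r)

# any perturbation away from xB increases the weighted residual
print(phi(xB) <= phi(xB + 0.1 * rng.standard_normal(n)))   # True
```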
PROOF   Denote the columns of the matrix A by a_k:

a_k = [a_{1k}, . . . , a_{mk}]^T ∈ ℝ^m,   k = 1, 2, . . . , n.
The matrix C = A* B A of system (7.14) is a symmetric square matrix of dimension n × n. Indeed, the entry c_{ij} of this matrix that is located at the intersection of row number i and column number j is given by:

c_{ij} = (B a_j, a_i)^{(m)} = [a_j, a_i]_B^{(m)},

and we conclude that c_{ij} = c_{ji}, i.e., C = C*. Let us now show that the matrix C is nonsingular and, moreover, positive definite:

(C ξ, ξ)^{(n)} > 0   ∀ξ ∈ ℝⁿ, ξ ≠ 0.    (7.15)

To prove inequality (7.15), we first recall a well-known formula:

(A ξ, f)^{(m)} = (ξ, A* f)^{(n)},   ξ ∈ ℝⁿ,   f ∈ ℝ^m.    (7.16)

This formula can be verified directly, by writing the matrix-vector products and the inner products involved via the individual entries and components. Next, let ξ ∈ ℝⁿ, ξ ≠ 0.

In other words, the matrix R in (7.19) is indeed upper triangular.   □

COROLLARY 7.1
The matrix R in formula (7.19) coincides with the upper triangular Cholesky factor of the symmetric positive definite matrix A* A, see formula (5.78).
PROOF   Using the orthonormality of the columns of Q, Q* Q = I, we have:

A* A = (QR)* (QR) = R* Q* Q R = R* R,

and the desired result follows because the Cholesky factorization is unique (Section 5.4.5).   □

The next theorem will show how the QR factorization can be employed for computing the generalized solutions of full rank overdetermined systems in the sense of Definition 7.1. For simplicity, we consider the case B = I.²
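Corollary 7.1 is easy to check numerically. One caveat in the sketch below (an assumed illustration): numerical QR routines do not force positive diagonal entries in R, while the Cholesky factor does have them, so the comparison must normalize the row signs first.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))
Q, R = np.linalg.qr(A)              # reduced QR: Q is 6x3 with orthonormal columns
L = np.linalg.cholesky(A.T @ A)     # lower triangular factor: A*A = L L*
# normalize the signs of the rows of R so that its diagonal is positive
S = np.diag(np.sign(np.diag(R)))
print(np.allclose(S @ R, L.T))      # True: R equals the upper Cholesky factor of A*A
```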
² All norms in this section are Euclidean: ‖·‖ ≡ ‖·‖₂ = √(·, ·).
Overdetermined Linear Systems. The Method of Least Squares
THEOREM 7.2
Let A be an m × n matrix with real entries, m ≥ n, and let rank A = n. Then, there is a unique vector x ∈ ℝⁿ that minimizes the quadratic function:

Φ(x) = (Ax − f, Ax − f)^{(m)}.    (7.21)

This vector is given by:

x = R^{−1} Q* f,    (7.22)

where the matrices Q and R are introduced in Lemma 7.1.
PROOF   Consider an arbitrary complement Q̃ of the matrix Q to an m × m orthogonal square matrix: Q̃* Q̃ = Q̃ Q̃* = I_m. Consider also a rectangular m × n matrix R̃ with its first n rows equal to those of R and the remaining m − n rows filled with zeros. Clearly, Q̃ R̃ = Q R. Then, using definition (7.21) of the function Φ(x), we can write:

Φ(x) = (Ax − f, Ax − f)^{(m)} = (Q̃ R̃ x − f, Q̃ R̃ x − f)^{(m)}
     = (Q̃ R̃ x − Q̃ Q̃* f, Q̃ R̃ x − Q̃ Q̃* f)^{(m)}
     = (R̃ x − Q̃* f, R̃ x − Q̃* f)^{(m)}
     = (R x − Q* f, R x − Q* f)^{(n)} + Σ_{k=n+1}^{m} ([Q̃* f]_k)².
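Formula (7.22) gives a direct recipe for least squares via QR. The short sketch below (an assumed illustration) applies it and compares the result with a generic least squares solver.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 7, 3
A = rng.standard_normal((m, n))
f = rng.standard_normal(m)
Q, R = np.linalg.qr(A)                          # A = QR with Q* Q = I
x = np.linalg.solve(R, Q.T @ f)                 # formula (7.22): x = R^{-1} Q* f
x_ref, *_ = np.linalg.lstsq(A, f, rcond=None)   # generic least squares solver
print(np.allclose(x, x_ref))                    # True: both give the unique minimizer
```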
Clearly, the minimum of the previous expression is attained when the first term in the sum is equal to zero, i.e., when R x = Q* f. This implies (7.22).   □

REMARK 7.4   In practice, the classical Gram-Schmidt orthogonalization (7.20) cannot be employed for computing the QR factorization of A, because it is prone to numerical instabilities due to amplification of the round-off errors. Instead, one can use its stabilized version. In this version, one does not compute the orthogonal projection of a_{k+1} onto span{q_1, . . . , q_k} first and then subtract according to formula (7.20b), but rather performs the projections and subtractions successively for the individual vectors q_1, . . . , q_k:

v_{k+1}^{(j)} = v_{k+1}^{(j−1)} − (q_j, v_{k+1}^{(j−1)}) q_j,   j = 1, 2, . . . , k,   v_{k+1}^{(0)} = a_{k+1}.

In the end, the normalization is performed as before, according to formula (7.20c): q_{k+1} = v_{k+1}^{(k)} / ‖v_{k+1}^{(k)}‖. This modified Gram-Schmidt algorithm is more robust than the original one, albeit somewhat more expensive computationally.   □
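The instability of the classical variant is not hypothetical. The sketch below (an assumed illustration, not the book's code) runs both variants on a standard nearly rank-deficient test matrix; the classical version visibly loses orthogonality, while the modified one keeps it to about the level of the column near-dependence.

```python
import numpy as np

def gram_schmidt(A, modified):
    """Orthonormalize the columns of A: the classical variant subtracts
    projections of the original column, the modified (stabilized) variant
    subtracts projections of the running, already-reduced vector."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for k in range(n):
        v = A[:, k].astype(float).copy()
        for j in range(k):
            target = v if modified else A[:, k]
            v = v - (Q[:, j] @ target) * Q[:, j]
        Q[:, k] = v / np.linalg.norm(v)
    return Q

eps = 1e-8                                  # makes the columns nearly dependent
A = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])
for modified in (False, True):
    Q = gram_schmidt(A, modified)
    print(modified, np.linalg.norm(Q.T @ Q - np.eye(3)))
```

On this example the classical loop produces an orthogonality error of order one, whereas the modified loop keeps it near eps.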
REMARK 7.5   Note that there is an alternative way of computing the QR factorization (7.19). It employs special types of transformations for matrices, known as the Householder transformations (orthogonal reflections, or symmetries, with respect to a given hyperplane in the vector space) and Givens transformations (two-dimensional rotations); see, e.g., [GVL89, Chapter 5].   □
7.2.3 Geometric Interpretation of the Method of Least Squares
Linear system (7.8) can be written as follows:

x_1 a_1 + x_2 a_2 + . . . + x_n a_n = f,    (7.23)

where the vectors

a_k = [a_{1k}, . . . , a_{mk}]^T ∈ ℝ^m,   k = 1, 2, . . . , n,

are columns of the matrix A of (7.9), and the right-hand side f ∈ ℝ^m is given. According to Definition 7.1, we need to find such coefficients x_1, x_2, . . . , x_n of the linear combination Σ_{k=1}^{n} x_k a_k that the deviation of this linear combination from the vector f be minimal:

‖f − Σ_{k=1}^{n} x_k a_k‖ → min.

The columns a_k, k = 1, 2, . . . , n, of the matrix A are assumed linearly independent. Therefore, their linear span span{a_1, a_2, . . . , a_n} forms a subspace of dimension n in the space ℝ^m: ℝⁿ(A) ⊂ ℝ^m. If {x_1, x_2, . . . , x_n} is a weak solution of system (7.8) in the sense of Definition 7.1, then the linear combination Σ_{k=1}^{n} x_k a_k is the orthogonal projection of the vector f ∈ ℝ^m onto the subspace ℝⁿ(A), where the orthogonality is understood in the sense of the inner product [·, ·]_B^{(m)} given by formula (7.12). Indeed, according to formula (7.13) we have Φ(x) = ‖f − Ax‖²_B. Therefore, the element Ax of the space ℝⁿ(A) that has minimum deviation from the given f ∈ ℝ^m minimizes Φ(x), i.e., yields the generalized solution of system (7.10) in the sense of Definition 7.1. In other words, the minimum deviation is attained for A x_B ∈ ℝⁿ(A), where x_B is the solution to system (7.14). On the other hand, any vector from ℝⁿ(A) can be represented as A ξ = ξ_1 a_1 + . . . + ξ_n a_n, where ξ ∈ ℝⁿ. Formula (7.18) then implies that f − A x_B is orthogonal to any A ξ ∈ ℝⁿ(A) in the sense of [·, ·]_B^{(m)}. As such, A x_B is the orthogonal projection of f onto ℝⁿ(A).

If a different basis {ã_1, ã_2, . . . , ã_n} is introduced in the space ℝⁿ(A) instead of the original basis {a_1, a_2, . . . , a_n}, then system (7.14) is replaced by

Ã* B Ã x̃ = Ã* B f,    (7.24)

where Ã is the matrix composed of the columns ã_k, k = 1, 2, . . . , n. Instead of the solution x_B of system (7.14) we will obtain a new solution x̃_B of system (7.24).
However, the projection of f onto ℝⁿ(A) will obviously remain the same, so that the following equality will hold:

Σ_{k=1}^{n} x_k a_k = Σ_{k=1}^{n} x̃_k ã_k.

In general, if we need to find a projection of a given vector f ∈ ℝ^m onto a given subspace ℝⁿ ⊂ ℝ^m, then it is beneficial to try and choose a basis {ã_1, ã_2, . . . , ã_n} in this subspace that would be as close to orthonormal as possible (in the sense of a particular scalar product [·, ·]_B^{(m)}). On one hand, the projection itself does not depend on the choice of the basis. On the other hand, the closer this basis is to orthonormal, the better is the conditioning of the system matrix in (7.24). Indeed, the entries c̃_{ij} of the matrix C̃ = Ã* B Ã are given by c̃_{ij} = [ã_i, ã_j]_B^{(m)}, so that for an orthonormal basis we obtain C̃ = I, which is an "ultimately" well conditioned matrix. (See Exercise 8 after the section for a specific example.)
7.2.4 Overdetermined Systems in the Operator Form

Besides its canonical form (7.8), an overdetermined system can be specified in the operator form. Namely, let ℝⁿ and ℝ^m be two real vector spaces, n < m, let A: ℝⁿ ↦ ℝ^m be a linear operator, and let f ∈ ℝ^m be a fixed vector. One needs to find such a vector x ∈ ℝⁿ that satisfies Ax = f.

lim_{p→∞} x^{(p)} = x̄.
Furthermore, the method is said to converge with the order κ ≥ 1 if there is a constant c > 0 such that

|x̄ − x^{(p+1)}| ≤ c |x̄ − x^{(p)}|^κ.    (8.1)
Note that in the case of a first order convergence, κ = 1, it is necessary that c < 1. Then the rate of convergence will be that of a geometric sequence with the base c. Let us also emphasize that, unlike in the linear case, the convergence of iterations for nonlinear equations typically depends on the initial guess x^{(0)}. In other words, one and the same iteration scheme may converge for some particular initial guesses and diverge for other initial guesses.

Similarly to the linear case, the notion of conditioning can also be introduced for the problem of nonlinear root-finding. As always, conditioning provides a quantitative measure of how sensitive the solution (root) is to the perturbations of the input data. Let x̄ ∈ ℝ be the desired root of the equation F(x) = 0. Assume that this root has multiplicity m + 1 ≥ 1, which means that

F(x̄) = F′(x̄) = . . . = F^{(m)}(x̄) = 0,   F^{(m+1)}(x̄) ≠ 0.

Moreover, there exists R > 0 such that, if the initial guess x^{(0)} is chosen according to |x̄ − x^{(0)}| < R, then all terms of the sequence (8.5) are well defined, and the following error estimate of quadratic convergence holds:

|x̄ − x^{(p+1)}| ≤ c |x̄ − x^{(p)}|².
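Quadratic convergence is easy to watch in practice: the number of correct digits roughly doubles with each step. The sketch below (an assumed illustration on a simple test equation) prints the error of the Newton iterates for F(x) = x² − 2.

```python
import math

def newton(F, dF, x0, steps):
    """Newton's method: x^(p+1) = x^(p) - F(x^(p)) / F'(x^(p))."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - F(xs[-1]) / dF(xs[-1]))
    return xs

# F(x) = x^2 - 2 has the simple root sqrt(2); for a simple root the error
# roughly squares on every step (quadratic convergence)
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 4)
for x in xs:
    print(abs(x - math.sqrt(2)))
```

For a root of higher multiplicity the same iteration degrades to first order, which is one reason why conditioning of root-finding worsens with multiplicity.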
3. Let φ = φ(x) and ψ = ψ(x) be two functions with bounded second derivatives. For solving the equation F(x) = φ(x) − ψ(x) = 0 one can use Newton's method.

a) Let the graphs of the functions φ = φ(x) and ψ = ψ(x) be plotted on the Cartesian plane (x, y), and let their intersection point [the root of F(x) = 0] have the abscissa x = x̄. Assume that the Newton iterate x^{(p)} is already computed, and provide a geometric interpretation of how one obtains the next iterate x^{(p+1)}. Use the following form of Newton's linearization:

φ(x^{(p)}) + φ′(x^{(p)}) (x^{(p+1)} − x^{(p)}) = ψ(x^{(p)}) + ψ′(x^{(p)}) (x^{(p+1)} − x^{(p)}).

b) Assume, for definiteness, that φ(x) > ψ(x) for x > x̄, and φ′(x) − ψ′(x) > 0. Let also the graph of φ(x) be convex, i.e., φ″(x) > 0, and the graph of ψ(x) be concave, i.e., ψ″(x) < 0. Show that Newton's method will converge for any x^{(0)} > x̄.
4. Use Newton's method to compute real solutions to the following systems of nonlinear equations. Obtain the results with five significant digits:

a) sin x − y = 1.30;   cos y − x = 0.84.

b) x² + 4y² = 1;   x⁴ + y⁴ = 0.5.
5. A nonlinear boundary value problem:

d²u/dx² − u³ = x²,   x ∈ [0, 1],
u(0) = 1,   u(1) = 3,

with the unknown function u = u(x) is approximated on the uniform grid x_j = j·h, j = 0, 1, . . . , N, h = 1/N, using the following second-order central-difference scheme:

(u_{j+1} − 2u_j + u_{j−1})/h² − u_j³ = (jh)²,   j = 1, 2, . . . , N − 1,
u_0 = 1,   u_N = 3,

where u_j, j = 0, 1, 2, . . . , N, is the unknown discrete solution on the grid. Solve the foregoing discrete system of nonlinear equations with respect to u_j, j = 0, 1, 2, . . . , N, using Newton's method. Introduce the following initial guess: u_j^{(0)} = 1 + 2(jh), j = 0, 1, 2, . . . , N. This grid function obviously satisfies the boundary conditions:
u_0^{(0)} = 1, u_N^{(0)} = 3. For every iterate {u_j^{(p)}}, p ≥ 0, the next iterate {u_j^{(p+1)}} is defined as u_j^{(p+1)} = u_j^{(p)} + δu_j^{(p)}, j = 0, 1, 2, . . . , N, where the increment {δu_j^{(p)}} should satisfy a new system of equations obtained by
Newton's linearization.
a) Show that this linear system of equations with respect to δu_j^{(p)}, j = 0, 1, 2, . . . , N, can be written as follows:

(δu_{j+1}^{(p)} − 2δu_j^{(p)} + δu_{j−1}^{(p)})/h² − 3 (u_j^{(p)})² δu_j^{(p)} = (jh)² − (u_{j+1}^{(p)} − 2u_j^{(p)} + u_{j−1}^{(p)})/h² + (u_j^{(p)})³,
for j = 1, 2, . . . , N − 1,
δu_0^{(p)} = 0,   δu_N^{(p)} = 0.

Prove that it can be solved by the tridiagonal elimination of Section 5.4.2.
b) Implement Newton's iteration on the computer. For every p ≥ 0, use the tridiagonal elimination algorithm (Section 5.4.2) to solve the linear system with respect to δu_j^{(p)}, j = 1, 2, . . . , N − 1. Consider a sequence of successively finer grids: N = 32, 64, . . . . On every grid, iterate until max_j |δu_j^{(p)}| drops below a prescribed tolerance.
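A possible solution of this exercise is sketched below. All names here are assumptions of the illustration, and the tolerance is an arbitrary choice; the tridiagonal solver follows the standard forward-elimination/back-substitution pattern of Section 5.4.2, and the Newton loop solves the linearized system of part a) at every step.

```python
import numpy as np

def thomas(a, b, c, d):
    """Tridiagonal elimination: solves b[i] x[i-1] + a[i] x[i] + c[i] x[i+1] = d[i]
    (b[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = np.zeros(n), np.zeros(n)
    cp[0], dp[0] = c[0] / a[0], d[0] / a[0]
    for i in range(1, n):
        den = a[i] - b[i] * cp[i - 1]
        cp[i] = c[i] / den
        dp[i] = (d[i] - b[i] * dp[i - 1]) / den
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def solve_bvp(N, tol=1e-12, max_iter=50):
    """Newton's method for u'' - u^3 = x^2, u(0) = 1, u(1) = 3, on x_j = j h."""
    h = 1.0 / N
    xj = np.arange(N + 1) * h
    u = 1 + 2 * xj                         # initial guess, satisfies the BCs
    for _ in range(max_iter):
        res = (u[2:] - 2 * u[1:-1] + u[:-2]) / h**2 - u[1:-1]**3 - xj[1:-1]**2
        diag = -2.0 / h**2 - 3.0 * u[1:-1]**2
        off = np.full(N - 1, 1.0 / h**2)   # sub- and superdiagonal of the Jacobian
        du = thomas(diag, off, off, -res)  # linearized system; du = 0 at j = 0, N
        u[1:-1] += du
        if np.max(np.abs(du)) < tol:
            break
    return xj, u

xj, u = solve_bvp(64)
res = (u[2:] - 2 * u[1:-1] + u[:-2]) / (1 / 64)**2 - u[1:-1]**3 - xj[1:-1]**2
print(np.max(np.abs(res)))                 # residual of the discrete equations
```

The Jacobian is strictly diagonally dominant (|−2/h² − 3u_j²| > 2/h²), which is what makes the tridiagonal elimination applicable, as part a) asks to prove.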
‖[u]_h − u^{(h)}‖_{U_h} → 0, when h → 0,    (9.12)

where [u]_h = {u(x_n) | n = 0, 1, . . . , N}. If, in addition, k > 0 happens to be the largest integer such that the following inequality holds for all sufficiently small h:

‖[u]_h − u^{(h)}‖_{U_h} ≤ c h^k,   c = const,    (9.13)
Numerical Solution of Ordinary Differential Equations
where c does not depend on h, then we say that the convergence rate of scheme (9.11) is 𝒪(h^k), or alternatively, that the error of the approximate solution, which is the quantity (rather, function) under the norm on the left-hand side of (9.13), has order k with respect to the grid size h. Sometimes we would also say that the order of convergence is equal to k > 0.

Convergence is a fundamental requirement that one imposes on the difference scheme (9.11) so as to make it an appropriate tool for the numerical solution of the original differential (also referred to as continuous) problem (9.1). If convergence does take place, then the solution [u]_h can be computed using scheme (9.11) with any initially prescribed accuracy by simply choosing a sufficiently small grid size h.

Having rigorously defined the concept of convergence, we have now arrived at the central question of how to construct a convergent difference scheme (9.11) for computing the solution of problem (9.1). The examples of Section 9.1.1 suggest a simple initial idea for building the schemes: One should first generate a grid and subsequently replace the derivatives in the governing equation(s) by appropriate difference quotients. However, for the same continuous problem (9.1) one can obviously obtain a large variety of schemes (9.11) by choosing different grids D_h and different ways to approximate the derivatives by difference quotients. In doing so, it turns out that some of the schemes do converge, while the others do not.

9.1.3 Verification of Convergence for a Difference Scheme
Let us therefore reformulate our central question in a somewhat different way. Suppose that the finite-difference scheme L_h u^{(h)} = f^{(h)} has already been constructed, and we expect that it could be convergent. How can we actually check whether it really converges or not?

Assume that problem (9.11) has a unique solution u^{(h)}, and let us substitute the grid function [u]_h instead of u^{(h)} into the left-hand side of (9.11). If equality (9.11) were to hold exactly upon this substitution, then uniqueness would have implied u^{(h)} = [u]_h, which is ideal for convergence. Indeed, it would have meant that the solution u^{(h)} of the discrete problem L_h u^{(h)} = f^{(h)} coincides with the grid function [u]_h that is sought for and that we have agreed to interpret as the unknown exact solution. However, most often one cannot construct system (9.11) so that the solution [u]_h would satisfy it exactly. Instead, the substitution of [u]_h into the left-hand side of (9.11) would typically generate a residual δf^{(h)}:

L_h [u]_h = f^{(h)} + δf^{(h)},    (9.14)

also known as the truncation error. If the truncation error δf^{(h)} tends to zero as h → 0, then we say that the finite-difference scheme L_h u^{(h)} = f^{(h)} is consistent, or alternatively, that it approximates the differential problem Lu = f on the solution u(x) of the latter. This notion indeed makes sense because smaller residuals for
smaller grid sizes mean that the exact solution [u]_h satisfies equation (9.11) with better and better accuracy as h vanishes.

When the scheme is consistent, one can think that equation (9.14) for [u]_h is obtained from equation (9.11) for u^{(h)} by adding a small perturbation δf^{(h)} to the right-hand side f^{(h)} (δf^{(h)} is small provided that h is small). Consequently, if the solution u^{(h)} of problem (9.11) happens to be only weakly sensitive to the perturbations of the right-hand side, or in other words, if small changes of the right-hand side f^{(h)} may only induce small changes in the solution u^{(h)}, then the difference between the solution u^{(h)} of problem (9.11) and the solution [u]_h of problem (9.14) will be small. In other words, consistency δf^{(h)} → 0 will imply convergence.

The aforementioned weak sensitivity of the finite-difference solution u^{(h)} to perturbations of f^{(h)} can, in fact, be defined using rigorous terms. This definition will lead us to the fundamental notion of stability for finite-difference schemes. Altogether, we can now outline our approach to verifying the convergence (9.12). We basically suggest splitting this difficult task into two potentially simpler tasks. First, we would need to see whether the scheme (9.11) is consistent. Then, we will need to find out whether the scheme (9.11) is stable. This approach also indicates how one might actually construct convergent schemes for the numerical solution of problem (9.1). Namely, one would first need to obtain a consistent scheme, and then among many such schemes select those that would also be stable.

The foregoing general approach to the analysis of convergence obviously requires that both consistency and stability be defined rigorously, so that a theorem can eventually be proven on convergence as an implication of consistency and stability. The previous definitions of consistency and stability are, however, vague. As far as consistency goes, we need to be more specific on what the truncation error (residual) δf^{(h)} is in the general case, and how to properly define its magnitude. As far as stability goes, we need to assign a precise meaning to the words "small changes of the right-hand side f^{(h)} may only induce small changes in the solution u^{(h)}" of problem (9.11). We will devote two separate sections to the rigorous definitions of consistency and stability.
9.2 Approximation of a Continuous Problem by a Difference Scheme. Consistency
In this section, we will rigorously define the concept of consistency, and explain what it actually means when we say that the finite-difference scheme (9.11) approximates the original differential problem (9.1) on its solution u = u(x).
9.2.1 Truncation Error δf^{(h)}
To do so, we will first need to elaborate on what the truncation error δf^{(h)} is, and show how to introduce its magnitude in a meaningful way. According to formula (9.14), δf^{(h)} is the residual that arises when the exact solution [u]_h is substituted into the left-hand side of (9.11). The decay of the magnitude of δf^{(h)} as h → 0 is precisely what we term as consistency of the finite-difference scheme (9.11). We begin by analyzing an example of a difference scheme for solving the following second order initial value problem:

d²u/dx² + a(x) du/dx + b(x) u = cos x,   0 ≤ x ≤ 1,
u(0) = 1,   u′(0) = 2.    (9.15)

The grid D_h on the interval [0, 1] will be uniform: x_n = nh, n = 0, 1, . . . , N, h = 1/N. To obtain the scheme, we approximately replace all the derivatives in relations (9.15) by difference quotients:
d²u/dx² ≈ (u(x + h) − 2u(x) + u(x − h))/h²,    (9.16a)
du/dx ≈ (u(x + h) − u(x − h))/(2h),    (9.16b)
du/dx |_{x=0} ≈ (u(h) − u(0))/h.    (9.16c)
Substituting expressions (9.16a)-(9.16c) into (9.15), we arrive at the system of equations that can be used for the approximate computation of [u]_h:

(u_{n+1} − 2u_n + u_{n−1})/h² + a(x_n) (u_{n+1} − u_{n−1})/(2h) + b(x_n) u_n = cos x_n,
n = 1, 2, . . . , N − 1,    (9.17)
u_0 = 1,   (u_1 − u_0)/h = 2.
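To make the construction concrete, the sketch below assembles and solves scheme (9.17) in the illustrative special case a(x) ≡ 0, b(x) ≡ 0 (an assumption made here so that the exact solution u(x) = 2 + 2x − cos x of (9.15) is available in closed form).

```python
import numpy as np

def solve_scheme(N):
    """Assemble and solve scheme (9.17) for a = b = 0, i.e. u'' = cos x,
    u(0) = 1, u'(0) = 2; returns the grid and the discrete solution."""
    h = 1.0 / N
    x = np.arange(N + 1) * h
    M = np.zeros((N + 1, N + 1))
    rhs = np.zeros(N + 1)
    M[0, 0] = 1.0                              # boundary condition u_0 = 1
    rhs[0] = 1.0
    M[1, 0], M[1, 1] = -1.0 / h, 1.0 / h       # (u_1 - u_0)/h = 2
    rhs[1] = 2.0
    for n in range(1, N):                      # interior equations of (9.17)
        M[n + 1, n - 1] = M[n + 1, n + 1] = 1.0 / h**2
        M[n + 1, n] = -2.0 / h**2
        rhs[n + 1] = np.cos(x[n])
    return x, np.linalg.solve(M, rhs)

x, u = solve_scheme(100)
exact = 2 + 2 * x - np.cos(x)
print(np.max(np.abs(u - exact)))               # O(h) error, dominated by (9.16c)
```

Halving h roughly halves the error: the first-order boundary approximation (9.16c) limits the overall accuracy even though the interior quotients are second-order, anticipating the truncation-error analysis below.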
Scheme (9.17) can easily be converted to the form (9.11) if we denote:

    L_h u^(h) = { [u_{n+1} − 2u_n + u_{n−1}]/h² + a(x_n)[u_{n+1} − u_{n−1}]/(2h) + b(x_n)u_n,
                      n = 1, 2, ..., N − 1,
                  u_0,
                  (u_1 − u_0)/h,

    f^(h) = { cos x_n,  n = 1, 2, ..., N − 1,
              1,                                                   (9.18)
              2.
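Scheme (9.17) is solved by marching: u_0 and u_1 come from the two initial conditions, and the difference equation is solved for u_{n+1} at each node. A minimal sketch, assuming the illustrative constant coefficients a(x) = 1 and b(x) = 1 (the text does not fix them), with a fine-grid run used as the reference value:

```python
import math

def solve_917(N, a, b):
    """March scheme (9.17): central differences in the ODE, forward
    difference in the initial condition u'(0) = 2."""
    h = 1.0 / N
    u = [0.0] * (N + 1)
    u[0] = 1.0                # u(0) = 1
    u[1] = u[0] + 2.0 * h     # (u_1 - u_0)/h = 2
    for n in range(1, N):
        x = n * h
        # solve the difference equation of (9.17) for u_{n+1}
        c_plus = 1.0 / h**2 + a(x) / (2.0 * h)
        rhs = (math.cos(x) - b(x) * u[n] + 2.0 * u[n] / h**2
               + (a(x) / (2.0 * h) - 1.0 / h**2) * u[n - 1])
        u[n + 1] = rhs / c_plus
    return u

a = lambda x: 1.0             # illustrative coefficient choices
b = lambda x: 1.0
ref = solve_917(6400, a, b)[-1]          # fine-grid reference for u(1)
err100 = abs(solve_917(100, a, b)[-1] - ref)
err200 = abs(solve_917(200, a, b)[-1] - ref)
print(err100, err200)
```

Halving h roughly halves the error, in line with the first order accuracy of the scheme.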
To estimate the magnitude of the truncation error δf^(h) that arises when the grid function [u]_h is substituted into the left-hand side of (9.11), we will need to "sharpen" the approximate equalities (9.16a)-(9.16c). This can be done by evaluating the quantities on the right-hand sides of (9.16a)-(9.16c) with the help of the Taylor formula. For the central difference approximation of the second derivative (9.16a), we will need four terms of the Taylor expansion of u at the point x:

    u(x+h) = u(x) + hu'(x) + (h²/2!)u''(x) + (h³/3!)u'''(x) + (h⁴/4!)u⁽⁴⁾(ξ_1),
    u(x−h) = u(x) − hu'(x) + (h²/2!)u''(x) − (h³/3!)u'''(x) + (h⁴/4!)u⁽⁴⁾(ξ_2).    (9.19a)

For the central difference approximation of the first derivative (9.16b), it will be sufficient to use three terms of the expansion:

    u(x+h) = u(x) + hu'(x) + (h²/2!)u''(x) + (h³/3!)u'''(ξ_3),
    u(x−h) = u(x) − hu'(x) + (h²/2!)u''(x) − (h³/3!)u'''(ξ_4),                     (9.19b)

and for the forward difference approximation of the first derivative (9.16c) one would only need two terms of the Taylor formula (to be used at x = 0):

    u(x+h) = u(x) + hu'(x) + (h²/2!)u''(ξ_5).                                      (9.19c)

Note that the last term in each of the expressions (9.19a)-(9.19c) is the remainder (or error) of the Taylor formula in the so-called Lagrange form, where ξ_1, ξ_3, ξ_5 ∈ [x, x+h] and ξ_2, ξ_4 ∈ [x−h, x]. Substituting expressions (9.19a), (9.19b), and (9.19c) into formulae (9.16a), (9.16b), and (9.16c), respectively, we obtain:
    [u(x+h) − 2u(x) + u(x−h)]/h² = u''(x) + (h²/24)[u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)],         (9.20a)
    [u(x+h) − u(x−h)]/(2h) = u'(x) + (h²/12)[u'''(ξ_3) + u'''(ξ_4)],                (9.20b)
    [u(h) − u(0)]/h = u'(x)|_{x=0} + (h/2)u''(ξ_5).                                 (9.20c)
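The remainder terms in (9.20a)-(9.20c) predict second order errors for the two central differences and a first order error for the forward difference. This can be observed numerically; the test function u(x) = eˣ and the point x = 0.5 are illustrative choices, not taken from the text:

```python
import math

u = math.exp           # test function u(x) = e^x; every derivative is e^x
x = 0.5

def errors(h):
    second = (u(x + h) - 2 * u(x) + u(x - h)) / h**2   # approximates u''(x)
    first_c = (u(x + h) - u(x - h)) / (2 * h)          # central, approximates u'(x)
    first_f = (u(x + h) - u(x)) / h                    # forward, approximates u'(x)
    exact = math.exp(x)
    return abs(second - exact), abs(first_c - exact), abs(first_f - exact)

e2a, e2b, e1 = errors(1e-2)
f2a, f2b, f1 = errors(5e-3)
print(e2a / f2a, e2b / f2b, e1 / f1)   # ratios near 4, 4, 2: orders h^2, h^2, h
```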
9.2.2 Evaluation of the Truncation Error δf^(h)
Let us now assume that the solution u = u(x) of problem (9.15) has bounded derivatives up to the fourth order everywhere on the interval 0 ≤ x ≤ 1. Then, according to formulae (9.20a)-(9.20c), one can write the sharpened equalities above with ξ_1, ξ_2, ξ_3, ξ_4 ∈ [x−h, x+h]. Consequently, the expression for L_h[u]_h given by formula (9.18):

    L_h[u]_h = { [u(x_n+h) − 2u(x_n) + u(x_n−h)]/h² + a(x_n)[u(x_n+h) − u(x_n−h)]/(2h) + b(x_n)u(x_n),
                     n = 1, 2, ..., N − 1,
                 u(0),
                 [u(h) − u(0)]/h,

can be rewritten as follows:

    L_h[u]_h = f^(h) + δf^(h),

where f^(h) is also defined in (9.18), and

    δf^(h) = { h²{ [u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)]/24 + a(x_n)[u'''(ξ_3) + u'''(ξ_4)]/12 },
                   n = 1, 2, ..., N − 1,
               0,                                                  (9.21)
               (h/2)u''(ξ_5),

where ξ_1, ξ_2, ξ_3, ξ_4 ∈ [x_n−h, x_n+h], n = 1, 2, ..., N − 1, and ξ_5 ∈ [0, h].

To quantify the truncation error δf^(h), i.e., to introduce its magnitude, it will be convenient to assume that f^(h) and δf^(h) belong to a linear normed space F_h that consists of the following elements:

    g^(h) = { φ_n,  n = 1, 2, ..., N − 1,
              ψ_0,                                                 (9.22)
              ψ_1,

where φ_1, φ_2, ..., φ_{N−1}, ψ_0, ψ_1 is an arbitrary ordered system of numbers. One can think of g^(h) given by formula (9.22) as a combination of the grid function φ_n, n = 1, 2, ..., N − 1, and an ordered pair of numbers (ψ_0, ψ_1). Addition of the elements of the space F_h, as well as their multiplication by scalars, is performed componentwise. Hence, it is clear that F_h is a linear (vector) space of dimension N + 1. It can be supplemented with a variety of different norms. If the norm in F_h is introduced as the maximum absolute value of all the components of g^(h):

    ‖g^(h)‖_{F_h} = max{ max_n |φ_n|, |ψ_0|, |ψ_1| },

then according to (9.21) we obviously have, for all sufficiently small grid sizes h:

    ‖δf^(h)‖_{F_h} ≤ ch,                                           (9.23)

where c is a constant that may, generally speaking, depend on u(x), but does not depend on h. Inequality (9.23) guarantees that the truncation error δf^(h) vanishes as h → 0, and that the scheme (9.17) has first order accuracy. If the difference equation (9.17) that we have considered as an example is represented in the form L_h u^(h) = f^(h), see formula (9.18), then one can interpret L_h as an operator, L_h : V_h → F_h. Given a grid function v^(h) = {v_n}, n = 0, 1, ..., N, that belongs to the linear space V_h of functions defined on the grid D_h, the operator L_h maps it onto some element g^(h) ∈ F_h of type (9.22) according to the following formula:

    L_h v^(h) = { [v_{n+1} − 2v_n + v_{n−1}]/h² + a(x_n)[v_{n+1} − v_{n−1}]/(2h) + b(x_n)v_n,
                      n = 1, 2, ..., N − 1,
                  v_0,
                  (v_1 − v_0)/h.

The operator interpretation holds for the general finite-difference problem (9.11) as well. Assume that the right-hand sides of all individual equations contained in L_h u^(h) = f^(h) are components of the vector f^(h) that belongs to some linear normed space F_h. Then L_h of (9.11) becomes an operator that maps any given grid function u^(h) ∈ V_h onto some element f^(h) ∈ F_h, i.e., L_h : V_h → F_h. Accordingly, the expression L_h[u]_h stands for the result of applying the operator L_h to the function [u]_h, which is an element of V_h; as such, L_h[u]_h ∈ F_h. Consequently, δf^(h) = L_h[u]_h − f^(h) ∈ F_h as a difference between two elements of the space F_h. The magnitude of the residual δf^(h), i.e., the magnitude of the truncation error, is given by the norm ‖δf^(h)‖_{F_h}.
9.2.3 Accuracy of Order h^k

DEFINITION 9.1  We will say that the finite-difference scheme L_h u^(h) = f^(h) approximates the differential problem Lu = f on its solution u = u(x) if the truncation error vanishes when the grid is refined, i.e., ‖δf^(h)‖_{F_h} → 0 as h → 0. A scheme that approximates the original differential problem is called consistent. If, in addition, k > 0 happens to be the largest integer that guarantees the following estimate:

    ‖δf^(h)‖_{F_h} ≤ c_1 h^k,  c_1 = const,

where c_1 does not depend on the grid size h, then we say that the scheme has order of accuracy O(h^k), or that its accuracy is of order k with respect to h.²

Note that in Definition 9.1 the function u = u(x) is considered a solution of problem Lu = f. This assumption can provide useful information about u, e.g., bounds for its derivatives, that can subsequently be exploited when constructing the scheme, as well as when verifying its consistency. This is the reason for incorporating a solution of problem (9.1) into Definition 9.1. Let us emphasize, however, that the notion of approximation of the problem Lu = f by the scheme L_h u^(h) = f^(h) does not rely on the equality Lu = f for the function u. The central requirement of Definition 9.1 is only the decay of the quantity ‖L_h[u]_h − f^(h)‖_{F_h} as the grid size h vanishes. Therefore, if there is a function u = u(x) that meets this requirement, then we could simply say that the truncation error of the scheme L_h u^(h) = f^(h) has some order k > 0 with respect to h on the given function u, without going into detail regarding the origin of this function. In this context, it may often be helpful to use a slightly different concept of approximation, namely that of the differential operator L by the difference operator L_h (see Section 10.2.2 of Chapter 10 for more detail). We will say that the finite-difference operator L_h approximates the differential operator L on some function u = u(x) if ‖L_h[u]_h − [Lu]_h‖_{F_h} → 0 as h → 0. In the previous expression, [u]_h denotes the trace of the continuous function u(x) on the grid as before, and likewise, [Lu]_h denotes the trace of the continuous function Lu on the grid. For example, equality (9.20a) can be interpreted as approximation of the differential operator Lu ≡ u'' by the central difference on the left-hand side of (9.20a), with accuracy O(h²). In so doing, u(x) may be any function with a bounded fourth derivative. Consequently, the approximation can be considered on the class of all functions u with bounded derivatives up to the order four, rather than on a single function u. Similarly, equality (9.20b) can be interpreted as approximation of the differential operator Lu ≡ u' by the central difference on the left-hand side of (9.20b), with accuracy O(h²), and this approximation holds on the class of all functions u = u(x) that have bounded third derivatives.
² Sometimes also referred to as the order of approximation of the continuous problem by the scheme.

9.2.4 Examples

Example 1

According to formula (9.21) and estimate (9.23), the finite-difference scheme (9.17) is consistent and has accuracy O(h), i.e., it approximates problem (9.15) with first order with respect to h. In fact, scheme (9.17) can easily be improved so that it gains second order accuracy. To achieve that, let us first notice that every component of the vector δf^(h) except for the last one, see (9.21), decays with the rate O(h²) as h decreases, and the second to last component is even exactly equal to zero. It is only the last component of the vector δf^(h) that displays a slower rate of decay, namely O(h). This last component is the residual generated by substituting u = u(x) into the last equation (u_1 − u_0)/h = 2 of system (9.17). Fortunately, the slower decay of this component, which hampers the overall accuracy of the scheme, can easily be remedied. Using the Taylor formula with the remainder in the Lagrange form, we can write:

    [u(h) − u(0)]/h = u'(0) + (h/2)u''(0) + (h²/6)u'''(ξ),  where 0 ≤ ξ ≤ h.

At the same time, the original differential equation, along with its initial conditions, see (9.15), yields:

    u''(0) = −a(0)u'(0) − b(0)u(0) + cos 0 = −2a(0) − b(0) + 1.

Therefore, if we replace the last equality of (9.17) with the following:

    (u_1 − u_0)/h = 2 + (h/2)u''(0) = 2 − (h/2)[2a(0) + b(0) − 1],  (9.24)

then we obtain a new expression for the right-hand side f^(h), instead of the one given in formula (9.18):

    f^(h) = { cos x_n,  n = 1, 2, ..., N − 1,
              1,
              2 − (h/2)[2a(0) + b(0) − 1].

This, in turn, yields a new expression for the truncation error:

    δf^(h) = { h²{ [u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)]/24 + a(x_n)[u'''(ξ_3) + u'''(ξ_4)]/12 },
                   n = 1, 2, ..., N − 1,
               0,                                                  (9.25)
               (h²/6)u'''(ξ).

Formula (9.25) implies that under the previous assumption of boundedness of all the derivatives up to the order four, we have ‖δf^(h)‖ ≤ ch², where the constant c does not depend on h. Thus, the accuracy of the scheme becomes O(h²) instead of O(h). Let us emphasize that in order to obtain the new difference initial condition (9.24) we have used not only the original continuous initial conditions of (9.15), but also the differential equation itself. It is possible to say that we have exploited the additional initial condition:

    u''(x) + a(x)u'(x) + b(x)u(x)|_{x=0} = cos x|_{x=0},

which is a direct implication of the given differential equation of (9.15) at x = 0.
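The effect of the corrected initial condition (9.24) is easy to observe numerically. In the sketch below (constant coefficients a(x) = b(x) = 1 are an illustrative choice), the error ratio between grids with steps h and h/2 is about 2 for the original scheme (9.17) and about 4 after the replacement (9.24):

```python
import math

A, B = 1.0, 1.0   # illustrative constant coefficients a(x) = A, b(x) = B

def march(N, corrected):
    h = 1.0 / N
    u0 = 1.0
    # (9.17): u_1 = u_0 + 2h;  (9.24): u_1 = u_0 + h*(2 - (h/2)*(2a(0) + b(0) - 1))
    if corrected:
        u1 = u0 + h * (2.0 - (h / 2.0) * (2.0 * A + B - 1.0))
    else:
        u1 = u0 + 2.0 * h
    um, uc = u0, u1
    for n in range(1, N):
        x = n * h
        cp = 1.0 / h**2 + A / (2.0 * h)
        rhs = math.cos(x) - B * uc + 2.0 * uc / h**2 + (A / (2.0 * h) - 1.0 / h**2) * um
        um, uc = uc, rhs / cp
    return uc   # approximation to u(1)

ref = march(6400, True)   # fine-grid reference value
for corrected in (False, True):
    ratio = abs(march(100, corrected) - ref) / abs(march(200, corrected) - ref)
    print(corrected, ratio)   # near 2 (first order) vs near 4 (second order)
```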
Example 2

Let

    L_h u^(h) = { [u_{n+1} − u_{n−1}]/(2h) + Au_n,  n = 1, 2, ..., N − 1,
                  u_0,                                             (9.26)
                  u_1,

    f^(h) = { 1 + x_n²,  n = 1, 2, ..., N − 1,
              b,
              b,

and let us determine the order of accuracy of the scheme:

    L_h u^(h) = f^(h)                                              (9.27)

on the solution u = u(x) of the initial value problem:

    du/dx + Au = 1 + x²,  u(0) = b.                                (9.28)

For the exact solution we obtain:

    L_h[u]_h = { [u(x_n+h) − u(x_n−h)]/(2h) + Au(x_n),  n = 1, 2, ..., N − 1,
                 u(0),
                 u(h),

which, according to formula (9.20b), yields:

    L_h[u]_h = { [du(x_n)/dx + Au(x_n)] + (h²/12)[u'''(ξ_3) + u'''(ξ_4)],
                     n = 1, 2, ..., N − 1,
                 u(0),
                 u(0) + hu'(ξ_0),

where ξ_3 ∈ [x_n, x_n+h] and ξ_4 ∈ [x_n−h, x_n] for each n, and ξ_0 ∈ [0, h]. As u is a solution to (9.28), we have:

    du(x_n)/dx + Au(x_n) = 1 + x_n²,

and for the truncation error δf^(h) we obtain:

    δf^(h) = { (h²/12)[u'''(ξ_3) + u'''(ξ_4)],  n = 1, 2, ..., N − 1,
               0,
               hu'(ξ_0).

Consequently, scheme (9.26)-(9.27) altogether has first order accuracy with respect to h. It is interesting to note, however, that similarly to Example 1, different components of the truncation error have different orders with respect to the grid size. The system of difference equations:

    [u_{n+1} − u_{n−1}]/(2h) + Au_n = 1 + x_n²,  n = 1, 2, ..., N − 1,

is satisfied by the exact solution [u]_h with a residual of order O(h²). The first initial condition u_0 = b is satisfied exactly, and only the second initial condition u_1 = b is satisfied by [u]_h with a residual of order O(h).
Example 3

Finally, let us illustrate the comments we made right after the definition of consistency, see Definition 9.1 in Section 9.2.3. For simplicity, we will be considering problems with no initial or boundary conditions, defined on the grid: x_n = nh, h > 0, n = 0, ±1, ±2, .... Let

    L_h u^(h) = [u(x+h) − 2u(x) + u(x−h)]/h² + u(x).

Formula (9.20a) immediately implies that L_h approximates the differential operator:

    Lu = u'' + u

with second order accuracy on the class of functions with bounded fourth derivatives. Indeed, according to (9.20a), for any such function we can write:

    L_h[u]_h − [Lu]_h = (h²/24)[u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)],

and consequently, ‖L_h[u]_h − [Lu]_h‖ ≤ ch². Let us now take u(x) = sin x and show that the homogeneous difference scheme L_h u^(h) = 0 is consistent on this function. Indeed,

    L_h[u]_h = u''(x) + (h²/24)[u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)] + u(x)
             = −sin x + (h²/24)[u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)] + sin x
             = (h²/24)[u⁽⁴⁾(ξ_1) + u⁽⁴⁾(ξ_2)],

which means that ‖L_h[u]_h − 0‖ ≤ const · h²/12. Yet we note that in the previous argument we have used the consideration that (sin x)'' + sin x = 0, i.e., that u = sin x is not just a function with bounded fourth derivatives, but a solution to the homogeneous differential equation Lu ≡ u'' + u = 0. In other words, while the definition of consistency per se does not explicitly require that u(x) be a solution to the underlying differential problem, this fact may still be needed when actually evaluating the magnitude of the truncation error. However, for the specific example that we are currently analyzing, consistency can also be established independently. We have:

    L_h[u]_h = [sin(x+h) − 2 sin(x) + sin(x−h)]/h² + sin(x)
             = [sin x cos h + cos x sin h − 2 sin(x) + sin x cos h − cos x sin h]/h² + sin(x)
             = [2(cos h − 1)/h²] sin x + sin x
             = (1 − (4/h²) sin²(h/2)) sin x.

Using the Taylor formula, one can easily show that for small grid sizes h the expression in the brackets above is O(h²), and as sin x = u(x) is bounded, we again conclude that the scheme L_h u^(h) = 0 has second order accuracy.
Replacement of Derivatives by D ifference Quotients
In all our previous examples, in order to obtain a finitedifference scheme we would replace the derivatives in the original continuous problem (differential equa tion and initial conditions) by the appropriate difference quotients. This approach is quite universal, and for any differential problem with a sufficiently smooth solution (x) it enables construction of a scheme with any prescribed order of accuracy.
u
9.2.6
Ot her Approaches to Constructing D ifference Schemes
However, the replacement of derivatives by difference quotients is by no means the only method of building the schemes, and often not the best one either. In Sec tion 9.4, we describe a family of popular methods of approximation that lead to the widely used RungeKutta schemes. Here, we only provide some examples. The simplest scheme: U + 1  Un
n
=0, n=O,I, . . , N I , h=I N, uo=a, / theforward Euler scheme dldxl = 0, 0::::; x::::; I, u(O) a. = ll h
 G(Xn,lln)
(9.29)
known as is consistent and has first order accuracy with respect to h on the solutions of the differential problem:
 G(x, u)
=
.0
(9 3 )
Solution of the finitedifference equation (9.29) can be found by marching, Le., it can be computed consecutively one grid node after another using the formula: Un+ 1
U
+ hG(xll, un) .
(9.3 1 )
270
A Theoretical Introduction to Numerical Analysis explicit. a predictor
This fonnula yields Un+! in the closed fonn once the values of Un at the previous nodes are available. That's why the forward Euler scheme is called Let us now take the same fonnula (9.3 1 ) as in the forward Euler method but use it only as at the preliminary stage of computation. At this stage we obtain the intermediate quantity it: it
=
Un + hG(xn, un) .
Then, the actual value of the finitedifference solution at the next grid node Un+ ! is obtained by applying at the final stage:
The overall resulting
the cor ector predictorcorrector scheme: h =a,
it Un + hG (xn, un) , Un+l  Un = 2"1 [G(x , Un) + G(x , u) , n+! ] n Uo =
_
(9.32)
h
can be shown to have second order accuracy with respect to on the solutions of problem (9.30). It is also explicit, and is, in fact, a member of the RungeKutta family of methods, see Section 9.4. In the literature, scheme (9.32) is sometimes also referred to as the Heun scheme. Let us additionally note that the corrector stage of scheme (9.32) can actually be considered as an independent single stage finitedifference method of its own:
Un+!  Un = 2"1 [G(xn, un) + G(xn+ , u )] , ! n+! Uo =
h
(9.33)
a. CrankNicolson scheme.
Scheme (9.33) is known as the It approximates problem (9.30) with second order of accuracy. Solution of the finitedifference equation (9.33) can also be found by marching. However, unlike in the forward Euler method (9.29), equation (9.33) contains Un+1 both on its lefthand side and on its righthand side, as an argument of G(Xn+1 , Un+ I ). Therefore, to obtain Un+ 1 for each = 0, 1 , 2 , . . . , one basically needs to solve an algebraic equation. This may require special techniques, such as Newton's method of Section 8.3 of Chapter 8, when the function G happens to be nonlinear with respect to its argument u. Altogether, we see that for the Crank Nicolson scheme (9.33) the value of un+! is not immediately available in the closed form as a function of the previous Un. That's why the CrankNicolson method is called Finally, perhaps the simplest implicit scheme for problem (9.30) is the socalled scheme:
n
implicit. backward Euler
Un+!  Un G (Xn+! , U +! ) = 0, Uo = Il
h

a,
(9.34)
Numerical Solution ofOrdinary Differential Equations
271
that has first order accuracy with respect to h. Nowadays, a number of well established universal and efficient methods for solv ing the Cauchy problem for ordinary differential equations and systems are avail able as standard implementations in numerical software libraries. There is also a large number of specialized methods developed for solving particular (often, narrow) classes of problems that may present difficulties of a specific type. Exercises
1. Verify that the forward Euler scheme (9.29) has first order accuracy on a smooth solu tion u = u (x) of problem (9.30).
2. Modify the second initial condition u, the overall second order accuracy.
=
b in the scheme (9.26)(9.27) so that to achieve
3.* Verify that the predictorcorrector scheme (9.32) has second order accuracy on a smooth solution u = u (x) of problem (9.30). Hint. See the analysis in Section 9.4. 1 . 4 . Verify that the CrankNicolson scheme (9.33) has second order accuracy on a smooth solution u = u (x) of problem (9.30).
5.
9.3
Verify that the backward Euler scheme (9.34) has first order accuracy on a smooth solution u = u (x) of problem (9.30).
Stability of FiniteDifference Schemes
In the previous sections, we have constructed a number of consistent finite difference schemes for ordinary differential equations. It is, in fact, possible to show that those schemes are also convergent, and that the convergence rate for each scheme coincides with its respective order of accuracy. One can, however, build examples of consistent yet divergent (Le., non convergent) finitedifference schemes. For instance, it is easy to see that the scheme: n =
1 , 2, . . .
,N

1,
(9.35)
approximates the Cauchy problem:
dudx Au x u( A ) u(h) u u(x) =
+
O
=
b,
0, 0 :::: =
::::
1,
(9.36)
const,
on its solution = with first order accuracy with respect to h. However, the solution obtained with the help of this scheme does not converge to and does not even remain bounded as h + O.
[ul h '
272
A Theoretical Introduction to Numerical Analysis _q2 + + Ah)q =q, q2 ] I + UI1 = Uo [q2 qq2 l ql  q2 qlq, q21 ] +u, [ q , q 2 q2lunl qi hq2 qio. . Stability
Indeed, the general solution of the difference equation from (9.35) has the form: where C I and C2 are arbitrary constants, and characteristic equation: 2 (3 to satisfy the initial conditions, we obtain: 
11
are roots of the algebraic and C2
and
O . Choosing the constants CI

1

� 00
Analysis of formula (9.37) shows that max
O� l1h� I
1
n
as
�
Therefore, con
sistency alone is generally not sufficient for convergence. 9.3.1
(9.37)
is also required.
Definition of Stability
Consider an initial or boundary value problem
Lu=l, LtJ(hkh)U, (h) fil) u 1 ) 1( h h ) L ) , u l + 8f f [ , h u(x) Dh, [ulh h < c hk ( 8 ) I 1 c, h. h < ho £(1 ) Fh, 1I£(h) < 8, ho 8 £(1 ) u (l ) I z(h) u(h) h, £( 1)c.2 1 £(I') I F" , c2
(9.38)
and assume that the finitedifference scheme =
(9.39) of problem (9.38).
is consistent and has accuracy k > 0, on the solution defined by the formula: This means that the truncation error D =
(9.40)
where is the projection of the exact solution following inequality (cf. Definition 9. 1):
II F"
_
I
onto the grid
,
satisfies the
(9.41)
where is a constant that does not depend on We will now introduce two alternative definitions of stability for scheme (9.39).
DEFINITION 9.2 Scheme (9. 39) will be called 8table if one can choose E two number8 > 0 and > 0 87Lch that for any and for any the finite differen ce pro blem: I II'),
(9.42)
obtained from (9. 39) by adding the perturbation to the righthand side, of has one and only one solution Z( ll ) wh08e deviation from the 80i'u tion the 7Lnperturbed problem (9. 39) 8atisfies the estimate: where C2 does not depend on
Ilu"
::;
= const,
(9.43)
Numerical Solution ofOrdinary Differential Equations u(ll) c(ll)
273
f(h)
Inequality (9.43) implies that a small perturbation of the righthand side may only cause a small perturbation z(ll) of the corresponding solution. In other words, it implies weak sensitivity of the solution to perturbations of the right hand side that drives it. This is a setup that applies to both linear and nonlinear problems, because it discusses stability of individual solutions with respect to small perturbations of their respective source terms. Hence, Definition 9.2 can basically be referred to as the definition for nonlinear equations. Note that for a different solution that corresponds to another righthand side fit), the value of C2 in inequality (9.43) may, generally speaking, change. The key point though is that C2 does not increase when the grid size h becomes smaller. is linear. Then, Definition 9.2 f+ Let us now assume that the operator is equivalent to the following definition of stability for linear equations:
u(h)
DEFINITION 9.3 stable if for any given such that
u(h)
where
C2
Lh : Vh Fh fh) Fh L"u(h) f(h) . Lh
Lh
Scheme (9. 39) with a linear operator will be called " E the equation = f ) has a unique solution
does not depend on
(9.44)
h,
We emphasize that the value of C2 in inequality (9.44) is one and the same for all E Fil . The equivalence of Definitions 9.3 and 9.2 (stability in the nonlinear and will be rigorously proven in in the linear sense) for the case of a linear operator the context of partial differential equations, see Lemma 10. 1 of Chapter 10.
fh)
9.3.2
The Relation between Consistency, Stability, and Con vergence
The following theorem is of central importance for the entire analysis. THEOREM 9.1 Let the scheme = be consistent with the order of accuracy tJ(hk ), k > 0, o n the solution = of the problem = and let this scheme also be stable. Then, the finitedifference solution converges to the exact and the following estimate holds: solution
Lhu(h) u f(hu() x)
[ul, ,
Luu(h)f,
(9.45) where
CI
and
C2
are the constants from estimates (9.41) and (9.43).
The result of Theorem 9.45 basically means that a consistent and stable scheme converges with the rate equal to its order of accuracy.
A Theoretical Introduction to Numerical Analysis
274 PROOF
Let
=
e (hl = Dj(h l and [ UJ h z(hl .
Then, estimate (9.43) becomes:
II [U] h  u(h l lluh ::; c211oihl IIFh · o
Taking into account (9.4 1 ) , we immediately obtain (9.45).
Note that when proving Theorem 9 . 1 , we interpreted stability in the nonlinear sense (Definition 9.2). Because of the equivalence of Definitions 9.2 and 9.3 for a linear operator Lh , Theorem 9. 1 , of course, covers the linear case as well. There may be situations, however (mostly encountered for partial differential equations, see Chapter 10), when stability can be established for the linear case only, whereas sta bility for nonlinear problems may not necessarily lend itself to easy analysis. Then, the result of Theorem 9 . 1 will hold only for linear finitedifference schemes. Let us now analyze some examples. First, we will prove stability of the forward Euler scheme: un+ !  Un
n = O,I . . ,N l, h G = nh, n = , N, h N. G G x, u = x u =ux x u: G x, u I �� I ::; M = u (Xn , Un )
,
= CPn ,
Uo
=
1/1,
(9.46)
where Xn 0, 1 , . . . and = 1 / Scheme (9.46) can be employed for the numerical solution of the initial value problem: (9.47)
Henceforth, we will assume that the functions = ( ) and cP cp( ) are such that problem (9.47) has a solution ( ) with the bounded second derivative so that lu" ( ) I ::; const. We will also assume that ( ) has a bounded partial derivative with respect to = const.
(9.48)
Next, we will introduce the norms:
l I u(hl lluh max n l n/, IIihl l [Fh = max{ [ I/I[, max l cpn [ } , n
and verify that the forward Euler scheme (9.46) is indeed stable. This scheme can be recast in the operator form (9.39) if we define:
u = { u h G xn, un , Un+ l  Un
Lh (hl ihl
=
{
o,
CP (Xn ) , 1/1.
(
)
n = 0, 1 , 2, . . . ,
N
n = 0, 1 , 2, . . . ,  l ,
N
1,
Numerical Solution ofOrdinary Difef rential Equations 275 ZIl+ h ZI/ (XI/,Zn) = XIl) + £n, n=O, l , . . ,N l, zo = o/ + £, £(hl = { ££n. , n=0, 1 , 2, . . ,N  I, Zn Ull = WIl h XIl,ZI1  Xn, lIn = Wn = Mi(l )WIl, Zwn (h) =Un.{WO, WI , . . , WN}: WIl+IhWIl  Mn(h)Wn lOn, n=O,l, . . , N I, (9.50) Wo = £. "In: IM�h) 1 M. (9.50) +hM,�2hI)wWn_In +hI +£h(ln l +(Mh)I +Mh)I£Il1lw+nl h+hl£, l1£, 1 IWn+ d = (I1 ( 1+Mh) +( I + Mh)Mh)32IIwwnn_d_21 ++ 2h(13h(1 ++ Mh)Mh)2I 1£I£(h(hl Il FIl"Fh (I2(1+Mh)+ Mhnt+ Il l£w(oh)l +I F"(n +2elM)hI(£l(+lI) IMh)Fh , "l £(hl I Fh (n + l)h Nh = IWn+ l 2eMI £(h) I Fh · C2 = 2eM &(h) 9.u = u(2)x. ) 9.2.
At the same time, the perturbed problem (9.42) in detail reads [ef. formula (9.46)] : I

G
qJ (
(9.49)
which means that in formula (9.42) we can take:
Let us now subtract equations (9.46) from the respective equations (9.49) termwise. In so doing, let us also denote and take into account that
G(
) G(
aG(xll' �") au
)
where �n is some number in between system of equations with the unknown
and
_
We will thus obtain the following
_
According to (9.48), we can claim that
�
Then, system
yields:
�
�
� (1 �
�
�
�
because
�
1 . Hence:
�
This inequality obviously implies an estimate of type (9.43):
which is equivalent to stability with the constant in the nonlinear sense (Definition Let us also recall that scheme (9.46) has accuracy on the solution of problem (9.47), see Exercise 1 of the previous Section
A
276
Theoretical Introduction to Numerical Analysis h
Theorem 9.1 therefore guarantees first order convergence with respect to h of the forward Euler scheme (9.46) on the interval 0 ≤ x ≤ 1. In our next example, we will analyze convergence of the finite-difference scheme (9.7) for the second order boundary value problem (9.4). This scheme is consistent and guarantees second order accuracy due to formula (9.20a). However, stability of the scheme (9.7) remains to be verified. Due to the linearity, it is sufficient to establish the unique solvability of the problem:

    [u_{n+1} − 2u_n + u_{n−1}]/h² − (1 + x_n²)u_n = g_n,  n = 1, 2, ..., N − 1,
    u_0 = α,  u_N = β,                                             (9.51)

for arbitrary {g_n}, α, and β, and to obtain the estimate:

    max_n |u_n| ≤ c max{ |α|, |β|, max_n |g_n| }.                  (9.52)

We have, in fact, previously considered a problem of type (9.51) in the context of the tridiagonal elimination, see Section 5.4.2 of Chapter 5. For the problem:

    a_n u_{n−1} + b_n u_n + c_n u_{n+1} = g_n,
    u_0 = α,  u_N = β,

under the assumption:

    |b_n| > |a_n| + |c_n| + δ,  δ > 0,  for all n,

we have proven its unique solvability and the estimate:

    max_n |u_n| ≤ max{ |α|, |β|, max_m |g_m|/δ }.                  (9.53)

In the case of problem (9.51), we have:

    a_n = 1/h²,   b_n = −(2/h² + 1 + x_n²),   c_n = 1/h².

Consequently, |b_n| > |a_n| + |c_n| + 1, so that (9.53) holds with δ = 1, and it therefore implies estimate (9.52) with c = 1. We have thus proven stability of the scheme (9.7). Along with the second order accuracy of the scheme, stability yields its convergence with the rate O(h²). Note that the scheme (9.7) is linear, and its stability was proven in the sense of Definition 9.3. Because of the linearity, it is equivalent to Definition 9.2 that was used when proving convergence in Theorem 9.1. Let us also emphasize that the approach to proving convergence of the scheme via independently verifying its consistency and stability is, in fact, quite general. The formula Lu = f does not necessarily have to denote an initial or boundary value problem for an ordinary differential equation; it can actually be an operator equation from a rather broad class. We outline this general framework in Section 10.1.6 of Chapter 10, where we discuss the Kantorovich theorem. Here we just mention that
it is not very important what type of problem the function u solves. The continuous operator equation Lu = f is only needed in order to be able to construct its difference counterpart L_h u^(h) = f^(h). In Section 9.3.3, we elaborate on the latter consideration. Namely, we provide an example of a consistent, stable, and therefore convergent scheme for an integral rather than a differential equation.
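The tridiagonal elimination argument above can be exercised directly: the sketch below solves a system of the form (9.51) by the forward and backward sweeps of Section 5.4 and checks the maximum-principle estimate (9.52) (the data g ≡ 1, α = 0.5, β = −0.5 are illustrative):

```python
def solve_951(N, alpha, beta, g):
    """Tridiagonal (Thomas) elimination for
    (u_{n+1} - 2u_n + u_{n-1})/h^2 - (1 + x_n^2) u_n = g_n, u_0 = alpha, u_N = beta."""
    h = 1.0 / N
    # forward sweep: represent u_n = P_n u_{n+1} + Q_n
    P, Q = [0.0], [alpha]
    for n in range(1, N):
        a_n = c_n = 1.0 / h**2
        b_n = -2.0 / h**2 - 1.0 - (n * h) ** 2
        denom = b_n + a_n * P[n - 1]
        P.append(-c_n / denom)
        Q.append((g(n * h) - a_n * Q[n - 1]) / denom)
    # backward sweep
    u = [0.0] * (N + 1)
    u[N] = beta
    for n in range(N - 1, -1, -1):
        u[n] = P[n] * u[n + 1] + Q[n]
    return u

u = solve_951(200, 0.5, -0.5, lambda x: 1.0)
# estimate (9.52) with c = 1: max |u_n| <= max{|alpha|, |beta|, max |g_n|}
print(max(abs(v) for v in u) <= max(0.5, 0.5, 1.0))
```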
9.3.3 Convergent Scheme for an Integral Equation
We will construct and analyze a difference scheme for solving the equation:

    Lu ≡ u(x) − ∫₀¹ K(x, y)u(y) dy = f(x),  0 ≤ x ≤ 1,             (9.54)

while assuming that the kernel K(x, y) in (9.54) is bounded: |K(x, y)| ≤ p < 1. Let us specify a positive integer N, set h = 1/N, and introduce a uniform grid D_h on [0, 1] as before: x_n = nh, n = 0, 1, 2, ..., N. As usual, our unknown grid function will be the table of values [u]_h of the exact solution u(x) on the grid D_h. To obtain a finite-difference scheme for the approximate computation of [u]_h, we will replace the integral in each equality:

    u(x_n) − ∫₀¹ K(x_n, y)u(y) dy = f(x_n),  n = 0, 1, 2, ..., N,  (9.55)

by a finite sum using one of the numerical quadrature formulas discussed in Chapter 4, namely, the trapezoidal rule of Section 4.1.2. It says that for an arbitrary twice differentiable function φ = φ(y) on the interval 0 ≤ y ≤ 1 the following approximate equality holds:

    ∫₀¹ φ(y) dy ≈ h(φ_0/2 + φ_1 + ... + φ_{N−1} + φ_N/2),  h = 1/N,

and the corresponding approximation error is O(h²). Having replaced the integral in every equality (9.55) with the aforementioned sum, we obtain:

    u_n − h[K(x_n, 0)u_0/2 + K(x_n, h)u_1 + ... + K(x_n, (N−1)h)u_{N−1} + K(x_n, 1)u_N/2] = f_n,
        n = 0, 1, 2, ..., N.                                       (9.56)

The system of equations (9.56) can be recast in the standardized form L_h u^(h) = f^(h) if we set:

    f^(h) = { f(0), f(h), ..., f(Nh) = f(1) },
    L_h u^(h) = { g_0, g_1, ..., g_N },

where
A Theoretical Introduction to Numerical Analysis gn =un h (K(X2n,O) uo+Kxn,( . h)UNI .+ . . + K(x2n,l) UN) , . n=0,1, 2 , , Lhu(h fh)u = u(x) tJ(h2). (9.56) Lu = f (9.54), u(h {uo, UI, . · . , UN} (9.56), Us lusl lunl, n=0,1, 2, . . ,N. n s ( 9 . 5 6) I!(xs) I I us h (K(X2s,O) uo +K(xs, h)uj + . . + K(x2s, l) UN)1 IUsl  h (% + �+%}usl = ( l Nhp) l usl = (l p) l usl, IK(x,y) 1 p < 1. 1 1 lIu(h) lIu" = n lunl = lu.d : 0 just a little further. Every solution of the differential equation u' + Au = 0 remains bounded for all :2 It is natural to expect a similar behavior of the finitedifference solution as well. More precisely, consider the same model problem (9 . 36), but on the semiinfinite interval :2 0 rather than on o ::::: ::::: 1 . Let rP) be its finitedifference solution defined on the grid Xn = nh, n = 0, 1,2 , . . . . The corresponding scheme is called asymptotically stable, or Astable, if
Let us now take the case A > 0 just a little further. Every solution of the differential equation u' + Au = 0, A > 0, remains bounded for all x ≥ 0. It is natural to expect a similar behavior of the finite-difference solution as well. More precisely, consider the same model problem (9.36), but on the semi-infinite interval x ≥ 0 rather than on 0 ≤ x ≤ 1. Let u^(h) be its finite-difference solution defined on the grid x_n = nh, n = 0, 1, 2, .... The corresponding scheme is called asymptotically stable, or A-stable, if
282
A Theoretical Introduction to Numerical Analysis
for any fixed grid size h the solution is bounded on the entire grid:

sup_{0 ≤ n < ∞} |u_n| < ∞.
0, one can employ the second order scheme:

(y_{n+1} - 2y_n + y_{n-1})/h² - p(x_n) y_n = f(x_n),  n = 1, 2, ..., N-1,
y₀ = Y₀,  y_N = Y₁,
and subsequently use the tridiagonal elimination for solving the resulting system of linear algebraic equations. One can easily see that the sufficient conditions for using tridiagonal elimination, see Section 5.4 of Chapter 5, will be satisfied.
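A sketch of the tridiagonal elimination itself (plain Python; the coefficient notation a_i y_{i-1} + b_i y_i + c_i y_{i+1} = d_i is local to this sketch):

```python
def thomas(a, b, c, d):
    """Tridiagonal elimination for a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = d[i],
    i = 0..n-1, with a[0] = c[n-1] = 0.  The forward sweep computes the
    coefficients in y[i] = alpha[i]*y[i+1] + beta[i]; back substitution follows."""
    n = len(d)
    alpha, beta = [0.0] * n, [0.0] * n
    alpha[0], beta[0] = -c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] + a[i] * alpha[i - 1]
        alpha[i] = -c[i] / denom if i < n - 1 else 0.0
        beta[i] = (d[i] - a[i] * beta[i - 1]) / denom
    y = [0.0] * n
    y[-1] = beta[-1]
    for i in range(n - 2, -1, -1):
        y[i] = alpha[i] * y[i + 1] + beta[i]
    return y
```

For the scheme above, the interior rows are a_n = c_n = 1/h², b_n = -2/h² - p(x_n), d_n = f(x_n), and the boundary rows are the identities y_0 = Y_0, y_N = Y_1.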
9.5.3 Newton's Method
As we have seen, the shooting method of Section 9.5.1 may become unstable even if the original boundary value problem is well conditioned. On the other hand, the disadvantage of the tridiagonal elimination as it applies to solving boundary value problems (Section 9.5.2) is that it can only be used in the linear case. An alternative is offered by the direct application of Newton's method. Recall, in Section 9.5.1 we have indicated that Newton's method can be used for finding roots of the scalar algebraic equation (9.73) in the context of shooting. This method, however, can also be used directly for solving systems of nonlinear algebraic and differential equations, see Section 8.3 of Chapter 8. Newton's method is based on linearization, i.e., on reducing the solution of a nonlinear problem to the solution of a sequence of linear problems. Consider the same boundary value problem (9.71) and assume that there is a function y = y_0(x) that satisfies the boundary conditions of (9.71) exactly, and also roughly approximates the unknown solution y = y(x). We will use this function
as the initial guess for Newton's iteration. Let

y(x) = y_0(x) + v(x),   (9.79)

where v(x) is the correction to the initial guess y_0(x). Substitute equality (9.79) into formula (9.71) and linearize the problem around y_0(x) using the following relations:

y'(x) = y_0'(x) + v'(x),  y''(x) = y_0''(x) + v''(x),

f(x, y_0 + v, y_0' + v') = f(x, y_0, y_0') + (∂f(x, y_0, y_0')/∂y) v + (∂f(x, y_0, y_0')/∂y') v' + O(v² + |v'|²).

Disregarding the quadratic remainder O(v² + |v'|²), we arrive at the linear problem for the correction v:

v'' - p(x)v' - q(x)v = φ(x),  v(0) = 0,  v(1) = 0,   (9.80)

where

p(x) = ∂f(x, y_0, y_0')/∂y',  q(x) = ∂f(x, y_0, y_0')/∂y,  φ(x) = f(x, y_0, y_0') - y_0''(x).

By solving the linear problem (9.80) either analytically or numerically, we determine the correction v and subsequently define the next Newton iteration as follows: y_1 = y_0 + v.
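A minimal computational sketch of this iteration, assuming NumPy; the partial derivatives ∂f/∂y and ∂f/∂y' entering (9.80) are approximated below by difference quotients, which is an implementation convenience rather than part of the method (analytic derivatives work equally well):

```python
import numpy as np

def newton_bvp(f, ya, yb, N=100, tol=1e-10, max_iter=20):
    """Solve y'' = f(x, y, y'), y(0) = ya, y(1) = yb by Newton linearization:
    each pass solves the linear problem (9.80) for the correction v with
    second order central differences and updates y := y + v."""
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    y = ya + (yb - ya) * x                 # initial guess satisfying the BCs
    eps = 1e-7                             # step for the difference quotients
    for _ in range(max_iter):
        yp = np.gradient(y, h)             # y_0'
        ypp = np.zeros_like(y)
        ypp[1:-1] = (y[2:] - 2.0 * y[1:-1] + y[:-2]) / h**2
        q = (f(x, y + eps, yp) - f(x, y - eps, yp)) / (2.0 * eps)   # df/dy
        p = (f(x, y, yp + eps) - f(x, y, yp - eps)) / (2.0 * eps)   # df/dy'
        phi = f(x, y, yp) - ypp            # residual f(x, y_0, y_0') - y_0''
        # assemble v'' - p v' - q v = phi with v(0) = v(1) = 0
        A = np.zeros((N + 1, N + 1))
        rhs = np.zeros(N + 1)
        A[0, 0] = A[N, N] = 1.0
        for i in range(1, N):
            A[i, i - 1] = 1.0 / h**2 + p[i] / (2.0 * h)
            A[i, i] = -2.0 / h**2 - q[i]
            A[i, i + 1] = 1.0 / h**2 - p[i] / (2.0 * h)
            rhs[i] = phi[i]
        v = np.linalg.solve(A, rhs)
        y = y + v
        if np.max(np.abs(v)) < tol:        # consecutive iterates close: converged
            break
    return x, y
```

For a linear test problem such as y'' = y the loop converges in one correction; for a genuinely nonlinear f it typically takes a handful of iterations, provided the initial guess is close enough.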
Then, the procedure cyclically repeats itself, and a sequence of iterations y_1, y_2, y_3, ... is computed until the difference between two consecutive iterations becomes smaller than some initially prescribed tolerance, which means convergence of Newton's method. Note that the foregoing procedure can also be applied directly to the nonlinear finite-difference problem that arises when the differential problem (9.71) is approximated on the grid, see Exercise 5 after Section 8.3 of Chapter 8.

Exercises
1. Show that the coefficients in front of Y₁ and Y₀ in formula (9.75) are indeed bounded by 1 for all x ∈ [0, 1] and all a > 0.
2. Use the shooting method to solve the boundary value problem (9.74) for a "moderate" value of a = 1; also take Y₀ = 1, Y₁ = -1.
a) Approximate the original differential equation with the second order central differences on a uniform grid: x_n = nh, n = 0, 1, ..., N, h = 1/N. Set y(0) = y₀ = 0, and use y(h) = y₁ = a as the parameter to be determined by shooting.
b) Reduce the original second order differential equation to a system of two first order equations and employ the Runge-Kutta scheme (9.68). The unknown parameter for shooting will then be y'(0) = a, exactly as in problem (9.76).
Numerical Solution of Ordinary Differential Equations
293
In both cases, conduct the computations on a sequence of successively finer grids (reduce the size h by a factor of two several times). Verify experimentally that the numerical solution converges to the exact solution (9.75) with the second order with respect to h.

3.* Investigate the applicability of the shooting method to solving the boundary value problem:

y'' + a²y = 0,  0 ≤ x ≤ 1,
y(0) = Y₀,  y(1) = Y₁,

which has a "+" sign instead of the "-" in the governing differential equation, but otherwise is identical to problem (9.74).
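A sketch that can serve as a starting point for these exercises; classical fourth-order Runge-Kutta is used below as a stand-in for scheme (9.68), whose exact form is not reproduced here, and the secant method solves the scalar equation F(a) = y(1; a) - Y₁ = 0 of the kind discussed around (9.73):

```python
def rk4_shoot(rhs, y0, alpha, N):
    """Integrate y'' = rhs(x, y, y') on [0, 1] with y(0) = y0, y'(0) = alpha
    as the first order system (y, v), v = y', using classical RK4; return y(1)."""
    h = 1.0 / N
    y, v = y0, alpha
    F = lambda x, y, v: (v, rhs(x, y, v))
    for n in range(N):
        x = n * h
        k1 = F(x, y, v)
        k2 = F(x + h / 2, y + h / 2 * k1[0], v + h / 2 * k1[1])
        k3 = F(x + h / 2, y + h / 2 * k2[0], v + h / 2 * k2[1])
        k4 = F(x + h, y + h * k3[0], v + h * k3[1])
        y += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        v += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return y

def shoot(rhs, y0, y1, N=200, a0=0.0, a1=1.0, tol=1e-12):
    """Secant iteration on F(alpha) = y(1; alpha) - y1."""
    F0 = rk4_shoot(rhs, y0, a0, N) - y1
    F1 = rk4_shoot(rhs, y0, a1, N) - y1
    for _ in range(50):
        if abs(F1) < tol or F1 == F0:
            break
        a0, a1 = a1, a1 - F1 * (a1 - a0) / (F1 - F0)
        F0, F1 = F1, rk4_shoot(rhs, y0, a1, N) - y1
    return a1

# Illustrative run: y'' = y, y(0) = 1, y(1) = -1
alpha = shoot(lambda x, y, v: y, 1.0, -1.0)
```

For a linear equation F(a) is affine in a, so the secant method lands on the answer in a single step; for nonlinear problems a few iterations are needed, and, as discussed in Section 9.5.1, a large parameter a in (9.74) can make the whole approach ill-conditioned regardless.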
9.6 Saturation of Finite-Difference Methods by Smoothness
Previously, we explored the saturation of numerical methods by smoothness in the context of algebraic interpolation (piecewise polynomials, see Section 2.2.5, and splines, see Section 2.3.2). Very briefly, the idea is to see whether or not a given method of approximation fully utilizes all the information available, and thus attains the optimal accuracy limited only by the threshold of the unavoidable error. When the method introduces its own error threshold, which may only be larger than that of the unavoidable error and shall be attributed to the specific design, we say that the phenomenon of saturation takes place. For example, we have seen that the interpolation by means of algebraic polynomials on uniform grids saturates, whereas the interpolation by means of trigonometric or Chebyshev polynomials does not, see Sections 3.1.3, 3.2.4, and 3.2.7. In the current section, we will use a number of very simple examples to demonstrate that the approximations by means of finite-difference schemes are, generally speaking, also prone to saturation by smoothness.

Before we continue, let us note that in the context of finite differences the term "saturation" may sometimes acquire an alternative meaning. Namely, saturation theorems are proven as to what maximum order of accuracy may a (stable) scheme have if it approximates a particular equation or class of equations on a given fixed stencil.⁴ Results of this type are typically established for partial differential equations, see, e.g., [EO80, Ise82]. Some very simple conclusions can already be drawn based on the method of undetermined coefficients (see Section 10.2.2). Note that this alternative notion of saturation is similar to the one we have previously introduced in this book (see Section 2.2.5) in the sense that it also discusses certain accuracy limits. The difference is, however, that these accuracy limits pertain to a particular class of discretization methods (schemes on a given stencil), whereas previously

⁴The stencil of a difference scheme is a set of grid nodes, on which the finite-difference operator is built that approximates the original differential operator.
we discussed accuracy limits that are not related to specific methods and are rather accounted for by the loss of information in the course of discretization. Let us now consider the following simple boundary value problem for a second order ordinary differential equation:
u'' = f(x),  0 ≤ x ≤ 1,  u(0) = 0,  u(1) = 0,   (9.81)
where the right-hand side f(x) is assumed given. We introduce a uniform grid on the interval [0, 1]:

x_n = nh,  n = 0, 1, ..., N,  Nh = 1,

and approximate problem (9.81) using central differences:
(u_{n+1} - 2u_n + u_{n-1})/h² = f_n = f(x_n),  n = 1, 2, ..., N-1,
u₀ = 0,  u_N = 0.   (9.82)
Provided that the solution u = u(x) of problem (9.81) is sufficiently smooth, or more precisely, provided that its fourth derivative u⁽⁴⁾(x) is bounded for 0 ≤ x ≤ 1, the approximation (9.82) is second-order accurate, see formula (9.20a). In this case, if the scheme (9.82) is stable, then it will converge with the rate O(h²). However, for such a simple difference scheme as (9.82) one can easily study the convergence directly, i.e., without using Theorem 9.1. A study of that type will be particularly instrumental because on one hand the regularity of the solution may not always be sufficient to guarantee consistency, and on the other hand, it will allow one to see whether or not the convergence accelerates for the functions that are smoother than those minimally required for obtaining O(h²).

Note that the degree of regularity of the solution u(x) to problem (9.81) is immediately determined by that of the right-hand side f(x). Namely, the solution u(x) will always have two additional derivatives. It will therefore be convenient to use different right-hand sides f(x) with different degrees of regularity, and to investigate directly the convergence properties of scheme (9.82). In doing so, we will analyze both the case when the regularity is formally insufficient to guarantee the second order convergence, and the opposite case when the regularity is "excessive" for that purpose. In the latter case we will, in fact, see that the convergence still remains only second order with respect to h, which implies saturation.

Let us first consider a discontinuous right-hand side:

f(x) = { 0,  0 ≤ x ≤ 1/2,
         1,  1/2 < x ≤ 1.   (9.83)

On each of the two subintervals, [0, 1/2] and [1/2, 1], the solution can be found as a combination of the general solution to the homogeneous equation and a particular solution to the inhomogeneous equation. The latter is equal to zero on [0, 1/2] and on
[1/2, 1] it is easily obtained using undetermined coefficients. Therefore, the overall solution of problem (9.81), (9.83) can be found in the form:

u(x) = { c_1 + c_2 x,           0 ≤ x ≤ 1/2,
         c_3 + c_4 x + x²/2,   1/2 < x ≤ 1,   (9.84)

where the constants c_1, c_2, c_3, and c_4 are to be chosen so as to satisfy the boundary conditions u(0) = u(1) = 0 and the continuity requirements:

u(1/2 - 0) = u(1/2 + 0),  u'(1/2 - 0) = u'(1/2 + 0).   (9.85)

Altogether this yields:

c_1 = 0,    c_3 + c_4 + 1/2 = 0,
c_1 + c_2/2 - c_3 - c_4/2 - 1/8 = 0,    c_2 - c_4 - 1/2 = 0.   (9.86)

Solving equations (9.86) we find:

c_1 = 0,  c_2 = -1/8,  c_3 = 1/8,  c_4 = -5/8,   (9.87)

so that

u(x) = { -x/8,                 0 ≤ x ≤ 1/2,
         1/8 - 5x/8 + x²/2,   1/2 < x ≤ 1.   (9.88)

In the finite-difference case, instead of (9.83) we have:

f_n = { 0,  n = 0, 1, ..., N/2,
        1,  n = N/2 + 1, N/2 + 2, ..., N.   (9.89)
Accordingly, the solution is to be sought for in the form:

u_n = { c_1 + c_2 (nh),              n = 0, 1, ..., N/2 + 1,
        c_3 + c_4 (nh) + (nh)²/2,   n = N/2, N/2 + 1, ..., N,   (9.90)

where on each subinterval we have a combination of the general solution to the homogeneous difference equation and a particular solution of the inhomogeneous difference equation (obtained by the method of undetermined coefficients). Notice that unlike in the continuous case (9.84), the two grid subintervals in formula (9.90) overlap across the entire cell [N/2, N/2 + 1] (we are assuming that N is even). Therefore, the constants c_1, c_2, c_3, and c_4 in (9.90) are to be determined from the boundary conditions at the endpoints of the interval [0, 1]: u_0 = u_N = 0, and from the matching conditions in the middle that are given simply as [cf. formula (9.85)]:

c_1 + c_2 (Nh/2) = c_3 + c_4 (Nh/2) + (1/2)(Nh/2)²,
c_1 + c_2 (Nh/2 + h) = c_3 + c_4 (Nh/2 + h) + (1/2)(Nh/2 + h)².   (9.91)
Altogether this yields:

c_1 = 0,    c_3 + c_4 + 1/2 = 0,
c_1 + c_2/2 - c_3 - c_4/2 - 1/8 = 0,    c_2 - c_4 - 1/2 - h/2 = 0,   (9.92)

where the last equation of system (9.92) was obtained by subtracting the first equation of (9.91) from the second equation of (9.91) and subsequently dividing by h. Notice that system (9.92), which characterizes the finite-difference case, is almost identical to system (9.86), which characterizes the continuous case, except that there is an O(h) discrepancy in the fourth equation. Accordingly, there is also an O(h) difference in the values of the constants [cf. formula (9.87)]:

c_1 = 0,  c_2 = -1/8 + h/4,  c_3 = 1/8 + h/4,  c_4 = -5/8 - h/4,

so that the solution to problem (9.82), (9.89) is given by:

u_n = { (-1/8 + h/4)(nh),                          n = 0, 1, ..., N/2 + 1,
        1/8 + h/4 + (-5/8 - h/4)(nh) + (nh)²/2,   n = N/2, N/2 + 1, ..., N.   (9.93)

By comparing formulae (9.88) and (9.93), where nh = x_n, we conclude that

max_{0 ≤ n ≤ N} |u(x_n) - u_n| = O(h),
i.e., that the solution of the finite-difference problem (9.82), (9.89) converges to the solution of the differential problem (9.81), (9.83) with the first order with respect to h. Note that scheme (9.82), (9.89) falls short of second order convergence because the solution of the differential problem (9.81), (9.83) is not sufficiently smooth. Instead of the discontinuous right-hand side (9.83) let us now consider a continuous function with a discontinuous first derivative:
f(x) = { -x,      0 ≤ x ≤ 1/2,
         x - 1,   1/2 < x ≤ 1.   (9.94)
Solution to problem (9.81), (9.94) can be found in the form:

u(x) = { c_1 + c_2 x - x³/6,           0 ≤ x ≤ 1/2,
         c_3 + c_4 x + x³/6 - x²/2,   1/2 < x ≤ 1,

where the constants c_1, c_2, c_3, and c_4 are again to be chosen so as to satisfy the boundary conditions u(0) = u(1) = 0 and the continuity requirements (9.85):

c_1 = 0,    c_3 + c_4 - 1/3 = 0,
c_1 + c_2/2 - 1/48 - c_3 - c_4/2 + 5/48 = 0,    c_2 - 1/8 - c_4 + 3/8 = 0.   (9.95)

Solving equations (9.95) we find:

c_1 = 0,  c_2 = 1/8,  c_3 = -1/24,  c_4 = 3/8,   (9.96)

so that

u(x) = { x/8 - x³/6,                    0 ≤ x ≤ 1/2,
         -1/24 + 3x/8 + x³/6 - x²/2,   1/2 < x ≤ 1.   (9.97)
In the discrete case, instead of (9.94) we write:

f_n = { -(nh),     n = 0, 1, ..., N/2,
        (nh) - 1,  n = N/2 + 1, N/2 + 2, ..., N,   (9.98)

and then look for the solution u_n to problem (9.82), (9.98) in the form:

u_n = { c_1 + c_2 (nh) - (1/6)(nh)³,                n = 0, 1, ..., N/2 + 1,
        c_3 + c_4 (nh) + (1/6)(nh)³ - (1/2)(nh)²,   n = N/2, N/2 + 1, ..., N.
For the matching conditions in the middle we now have [cf. formulae (9.91)]:

c_1 + c_2 (Nh/2) - (1/6)(Nh/2)³ = c_3 + c_4 (Nh/2) + (1/6)(Nh/2)³ - (1/2)(Nh/2)²,
c_1 + c_2 (Nh/2 + h) - (1/6)(Nh/2 + h)³
    = c_3 + c_4 (Nh/2 + h) + (1/6)(Nh/2 + h)³ - (1/2)(Nh/2 + h)²,   (9.99)

and consequently:

c_1 = 0,    c_3 + c_4 - 1/3 = 0,
c_1 + c_2/2 - 1/48 - c_3 - c_4/2 + 5/48 = 0,    c_2 - 1/8 - c_4 + 3/8 - h²/3 = 0,   (9.100)

where the last equation of (9.100) was obtained by subtracting the first equation of (9.99) from the second equation of (9.99) and subsequently dividing by h. Solving equations (9.100) we obtain [cf. formula (9.96)]:

c_1 = 0,  c_2 = 1/8 + h²/6,  c_3 = -1/24 + h²/6,  c_4 = 3/8 - h²/6,

and

u_n = { (1/8 + h²/6)(nh) - (1/6)(nh)³,                                n = 0, 1, ..., N/2 + 1,
        -1/24 + 3(nh)/8 + (1/6)(nh)³ - (1/2)(nh)² + (h²/6)(1 - nh),   n = N/2, N/2 + 1, ..., N.   (9.101)
It is clear that the error between the continuous solution (9.97) and the discrete solution (9.101) is estimated as

max_{0 ≤ n ≤ N} |u(x_n) - u_n| ≤ h²/6 = O(h²),

which means that the solution of the finite-difference problem (9.82), (9.98) converges to the solution of the differential problem (9.81), (9.94) with the second order with respect to h. Note that second order convergence is attained here even though the degree of regularity of the solution (its third derivative is discontinuous) is formally insufficient to guarantee second order accuracy (consistency).

In much the same way one can analyze the case when the right-hand side has one continuous derivative (the so-called C¹ space of functions), for example:

f(x) = { (x - 1/2)²,    0 ≤ x ≤ 1/2,
         -(x - 1/2)²,   1/2 < x ≤ 1.   (9.102)

For problem (9.81), (9.102), it is also possible to prove the second order convergence, which is the subject of Exercise 1 at the end of the section.

The foregoing examples demonstrate that the rate of finite-difference convergence depends on the regularity of the solution to the underlying continuous problem. It is therefore interesting to see what happens when the regularity increases beyond C¹. Consider the right-hand side in the form of a quadratic polynomial:

f(x) = x(x - 1).   (9.103)
This function is obviously infinitely differentiable (the C^∞ space), and so is the solution u = u(x) of problem (9.81), (9.103), which is given by:

u(x) = x⁴/12 - x³/6 + x/12.   (9.104)
Scheme (9.82) with the right-hand side

f_n = nh(nh - 1),  n = 0, 1, ..., N,   (9.105)

approximates problem (9.81), (9.103) with second order accuracy. The solution of the finite-difference problem (9.82), (9.105) can be found in the form:

u_n = c_1 + c_2 (nh) + (nh)² (A(nh)² + B(nh) + C),   (9.106)

where u_n^(g) = c_1 + c_2 (nh) is the general solution to the homogeneous equation and u_n^(p) = (nh)²(A(nh)² + B(nh) + C) is a particular solution to the inhomogeneous equation. The values of A, B, and C are to be found
using the method of undetermined coefficients:

A(12(nh)² + 2h²) + B(6nh) + 2C = (nh)² - (nh),

which yields:

A = 1/12,  B = -1/6,

and accordingly,

C = -Ah² = -h²/12.   (9.107)
For the constants c_1 and c_2 we substitute the expression (9.106) and the already available coefficients A, B, and C of (9.107) into the boundary conditions of (9.82) and write:

u_0 = c_1 = 0,  u_N = c_2 + A + B + C = 0.

Consequently,

c_1 = 0  and  c_2 = 1/12 + h²/12,

so that for the overall solution u_n of problem (9.82), (9.105) we obtain:

u_n = (1/12 + h²/12)(nh) + (nh)² ((nh)²/12 - (nh)/6 - h²/12).   (9.108)

Comparing the continuous solution u(x) given by (9.104) with the discrete solution u_n given by (9.108) we conclude that

max_{0 ≤ n ≤ N} |u(x_n) - u_n| = (h²/12) max_n (nh)(1 - nh) = h²/48 = O(h²),

which implies that notwithstanding the infinite smoothness of the right-hand side f(x) of (9.103) and that of the solution u(x), scheme (9.82), (9.105) still shows only second order convergence. This is a manifestation of the phenomenon of saturation by smoothness. The rate of decay of the approximation error is determined by the specific approximation method employed on a given grid, and does not reach the level of the pertinent unavoidable error.

To demonstrate that the previous observation is not accidental, let us consider another example of an infinitely differentiable (C^∞) right-hand side:

f(x) = sin(πx).   (9.109)
The solution of problem (9.81), (9.109) is given by:

u(x) = -(1/π²) sin(πx).   (9.110)

The discrete right-hand side that corresponds to (9.109) is:

f_n = sin(πnh),  n = 0, 1, ..., N,   (9.111)

and the solution to the finite-difference problem (9.82), (9.111) is to be sought for in the form u_n = A sin(πnh) + B cos(πnh) with the undetermined coefficients A and B, which eventually yields:

u_n = -(h²/(4 sin²(πh/2))) sin(πnh).   (9.112)
The error between the continuous solution given by (9.110) and the discrete solution given by (9.112) is easy to estimate provided that the grid size is small, h ≪ 1:

|1/π² - h²/(4 sin²(πh/2))| ≈ |1/π² - (h²/4)[πh/2 - (1/6)(πh/2)³]⁻²|
= (1/π²) |1 - [1 - (1/6)(πh/2)²]⁻²| ≈ (2/π²)(1/6)(πh/2)² = h²/12 = O(h²).
This, again, corroborates the effect of saturation, as the convergence of the scheme (9.82), (9.111) is only second order in spite of the infinite smoothness of the data.

In general, all finite-difference methods are prone to saturation. This includes the methods for solving ordinary differential equations described in this chapter, as well as the methods for partial differential equations described in Chapter 10. There are, however, other methods for the numerical solution of differential equations. For example, the so-called spectral methods described briefly in Section 9.7 do not saturate and exhibit convergence rates that self-adjust to the regularity of the corresponding solution (similarly to how the error of the trigonometric interpolation adjusts to the smoothness of the interpolated function, see Theorem 3.5 on page 68). The literature on the subject of spectral methods is vast, and we can refer the reader, e.g., to the monographs [G077, CHQZ88, CHQZ06], and textbooks [Boy01, HGG06].

Exercise

1. Consider scheme (9.82) with the right-hand side:
f_n = { (nh - 1/2)²,    n = 0, 1, 2, ..., N/2,
        -(nh - 1/2)²,   n = N/2, N/2 + 1, ..., N.
This scheme approximates problem (9.81), (9.102). Obtain the finite-difference solution in closed form and prove the second order convergence.
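The saturation discussed in this section is easy to observe numerically. The following experiment (assuming NumPy) solves scheme (9.82) with the infinitely smooth right-hand side f(x) = sin(πx) of (9.109) on successively finer grids; the error decreases by a factor of about four per refinement, i.e., remains O(h²) despite the smoothness of the data:

```python
import numpy as np

def solve_scheme_982(f, N):
    """Central-difference scheme (9.82): u'' = f on (0,1), u(0) = u(1) = 0."""
    h = 1.0 / N
    x = np.linspace(0.0, 1.0, N + 1)
    A = np.zeros((N + 1, N + 1))
    rhs = np.zeros(N + 1)
    A[0, 0] = A[N, N] = 1.0                        # boundary rows u_0 = u_N = 0
    for n in range(1, N):
        A[n, n - 1] = A[n, n + 1] = 1.0 / h**2
        A[n, n] = -2.0 / h**2
        rhs[n] = f(x[n])
    return x, np.linalg.solve(A, rhs)

errors = []
for N in (8, 16, 32, 64):
    x, u = solve_scheme_982(lambda t: np.sin(np.pi * t), N)
    exact = -np.sin(np.pi * x) / np.pi**2          # solution (9.110)
    errors.append(float(np.max(np.abs(u - exact))))
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
# ratios stay close to 4: the error is O(h^2) no matter how smooth f is
```

The measured ratios agree with the closed-form error |1/π² - h²/(4 sin²(πh/2))| derived above.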
9.7 The Notion of Spectral Methods
In this section, we only provide one particular example of a spectral method. Namely, we solve a simple boundary value problem using a Fourier-based technique. Our specific goal is to demonstrate that alternative discrete approximations to differential equations can be obtained that, unlike finite-difference methods, will not suffer from the saturation by smoothness (see Section 9.6). A comprehensive account of spectral methods can be found, e.g., in the monographs [G077, CHQZ88, CHQZ06], as well as in the textbooks [Boy01, HGG06]. The material of this section is based on the analysis of Chapter 3 and can be skipped during the first reading.

Consider the same boundary value problem as in Section 9.6:

u'' = f(x),  0 ≤ x ≤ 1,  u(0) = 0,  u(1) = 0,   (9.113)

where the right-hand side f(x) is assumed given. In this section, we will not approximate problem (9.113) on the grid using finite differences. We will rather look for an
approximate solution to problem (9.113) in the form of a trigonometric polynomial. Trigonometric polynomials were introduced and studied in Chapter 3. Let us formally extend both the unknown solution u = u(x) and the right-hand side f = f(x) to the interval [-1, 1] antisymmetrically, i.e., u(-x) = -u(x) and f(-x) = -f(x), so that the resulting functions are odd. We can then represent the solution u(x) of problem (9.113) approximately as a trigonometric polynomial:
u^(n)(x) = Σ_{k=1}^{n+1} B_k sin(πkx)   (9.114)
with the coefficients B_k to be determined. Note that according to Theorem 3.3 (see page 66), the polynomial (9.114), which is a linear combination of the sine functions only, is suited specifically for representing the odd functions. Note also that for any choice of the coefficients B_k the polynomial u^(n)(x) of (9.114) satisfies the boundary conditions of problem (9.113) exactly. Let us now introduce the same grid (of dimension n + 1) as we used in Section 3.1:
x_m = m/(n+1) + 1/(2(n+1)),  m = 0, 1, ..., n,   (9.115)
and interpolate the given function f(x) on this grid by means of the trigonometric polynomial with n + 1 terms:
f^(n)(x) = Σ_{k=1}^{n+1} b_k sin(πkx).   (9.116)
The coefficients of the polynomial (9.116) are given by:

b_k = (2/(n+1)) Σ_{m=0}^{n} f(x_m) sin(πk x_m),  k = 1, 2, ..., n,
b_{n+1} = (1/(n+1)) Σ_{m=0}^{n} f(x_m) sin(π(n+1) x_m).   (9.117)

To approximate the differential equation u'' = f of (9.113), we require that the second derivative of the approximate solution u^(n)(x):

(d²/dx²) u^(n)(x) = -Σ_{k=1}^{n+1} (πk)² B_k sin(πkx)   (9.118)

coincide with the interpolant of the right-hand side f^(n)(x) at every node x_m of the grid (9.115):

(d²/dx²) u^(n)(x_m) = f^(n)(x_m),  m = 0, 1, ..., n.   (9.119)

Note that both the interpolant f^(n)(x) given by formula (9.116) and the derivative (d²/dx²)u^(n)(x) given by formula (9.118) are sine trigonometric polynomials of the same order n + 1. According to formula (9.119), they coincide at x_m for all m = 0, 1, ..., n. Therefore, due to the uniqueness of the trigonometric interpolating polynomial (see Theorem 3.1 on page 62), these two polynomials are, in fact, the same everywhere on the interval 0 ≤ x ≤ 1. Consequently, their coefficients are identically equal:

-(πk)² B_k = b_k,  k = 1, 2, ..., n + 1.   (9.120)

Equalities (9.120) allow one to find B_k provided that b_k are known. Consider a particular example analyzed in the end of Section 9.6:
f(x) = sin(πx).   (9.121)

The exact solution of problem (9.113), (9.121) is given by:

u(x) = -(1/π²) sin(πx).   (9.122)

According to formulae (9.117), the coefficients b_k that correspond to the right-hand side f(x) given by (9.121) are:

b_1 = 1  and  b_k = 0,  k = 2, 3, ..., n + 1.

Consequently, relations (9.120) imply that

B_1 = -1/π²  and  B_k = 0,  k = 2, 3, ..., n + 1.

Therefore,

u^(n)(x) = -(1/π²) sin(πx).   (9.123)
By comparing formulae (9.123) and (9.122), we conclude that the approximate method based on enforcing the differential equation u'' = f via the finite system of equalities (9.119) reconstructs the exact solution of problem (9.113), (9.121). The error is therefore equal to zero. Of course, one should not expect that this ideal behavior of the error will hold in general. The foregoing particular result only takes place because of the specific choice of the right-hand side (9.121). However, in a variety of other cases one can obtain a rapid decay of the error as n increases. Consider the odd function f(-x) = -f(x) obtained on the interval [-1, 1] by extending the right-hand side of problem (9.113) antisymmetrically from the interval [0, 1]. Assume that this function can also be translated along the entire real axis:
∀x ∈ [2l + 1, 2(l+1) + 1]:  f(x) = f(x - 2(l+1)),  l = 0, ±1, ±2, ±3, ...,

so that the resulting periodic function with the period L = 2 is smooth. More precisely, we require that the function f(x) constructed this way possess continuous derivatives of order up to r > 0 everywhere, and a square integrable derivative of order r + 1:

∫₀¹ [f^(r+1)(x)]² dx < ∞.

Clearly the function f(x) = sin(πx), see formula (9.121), satisfies these requirements. Another example which, unlike (9.121), leads to a full infinite Fourier expansion is f(x) = sin(π sin(πx)). Both functions are periodic with the period L = 2 and infinitely smooth everywhere (r = ∞). Let us represent f(x) as the sum of its sine Fourier series:

f(x) = Σ_{k=1}^{∞} β_k sin(kπx),   (9.124)

where the coefficients β_k are defined as:

β_k = 2 ∫₀¹ f(x) sin(kπx) dx.   (9.125)

The series (9.124) converges to the function f(x) uniformly and absolutely. The rate of convergence was obtained when proving Theorem 3.5, see pages 68-71. Namely, if we define the partial sum S_n(x) and the remainder δS_n(x) of the series (9.124) as done in Section 3.1.3:

S_n(x) = Σ_{k=1}^{n+1} β_k sin(kπx),    δS_n(x) = Σ_{k=n+2}^{∞} β_k sin(kπx),   (9.126)

then

|f(x) - S_n(x)| = |δS_n(x)| ≤ ζ_n / n^(r+1/2),   (9.127)

where ζ_n is a numerical sequence such that ζ_n = o(1), i.e., lim_{n→∞} ζ_n = 0. Substituting the expressions:

f(x_m) = S_n(x_m) + δS_n(x_m),  m = 0, 1, ..., n,
into the definition (9.117) of the coefficients b_k we obtain:

b_k = (2/(n+1)) Σ_{m=0}^{n} S_n(x_m) sin(πk x_m) + (2/(n+1)) Σ_{m=0}^{n} δS_n(x_m) sin(πk x_m)
    = β_k + δβ_k,  k = 1, 2, ..., n,

b_{n+1} = (1/(n+1)) Σ_{m=0}^{n} S_n(x_m)(-1)^m + (1/(n+1)) Σ_{m=0}^{n} δS_n(x_m)(-1)^m
        = β_{n+1} + δβ_{n+1}.   (9.128)

The first sum on the right-hand side of each equality (9.128) is indeed equal to the genuine Fourier coefficient β_k of (9.125), k = 1, 2, ..., n + 1, because the partial sum S_n(x) given by (9.126) coincides with its own trigonometric interpolating polynomial⁵ for all 0 ≤ x ≤ 1. As for the "corrections" to the coefficients, δβ_k, they come from the remainder δS_n(x) and their magnitudes can be easily estimated using inequality (9.127) and formulae (9.117):
|δβ_k| ≤ 2 ζ_n / n^(r+1/2),  k = 1, 2, ..., n + 1.   (9.129)

Let us now consider the exact solution u = u(x) of problem (9.113). Given the assumptions made regarding the right-hand side f = f(x), the solution u is also a smooth odd periodic function with the period L = 2. It can be represented as its own
Fourier series:

u(x) = Σ_{k=1}^{∞} γ_k sin(kπx),   (9.130)

where the coefficients γ_k are given by:

γ_k = 2 ∫₀¹ u(x) sin(kπx) dx.   (9.131)
Series (9.130) converges uniformly. Moreover, the same argument based on the periodicity and smoothness implies that the Fourier series for the derivatives u'(x) and u''(x) also converge uniformly.⁶ Consequently, series (9.130) can be differentiated (at least) twice termwise:

u''(x) = -π² Σ_{k=1}^{∞} k² γ_k sin(kπx).   (9.132)
Recall, we must enforce the equality u'' = f. Then by comparing the expansions (9.132) and (9.124) and using the orthogonality of the trigonometric system, we have:

-π² k² γ_k = β_k,  k = 1, 2, ... .   (9.133)
⁵Due to the uniqueness of the trigonometric interpolating polynomial, see Theorem 3.1.
⁶The first derivative u'(x) will be an even function rather than odd.
Next, recall that the coefficients B_k of the approximate solution u^(n)(x) defined by (9.114) are given by formula (9.120). Using the representation b_k = β_k + δβ_k, see formula (9.128), and also employing relations (9.133), we obtain:

B_k = -b_k/(π²k²) = -β_k/(π²k²) - δβ_k/(π²k²) = γ_k - δβ_k/(π²k²),  k = 1, 2, ..., n + 1.   (9.134)
Formula (9.134) will allow us to obtain an error estimate for the approximate solution u^(n)(x). To do so, we first rewrite the Fourier series (9.130) for the exact solution u(x) as its partial sum plus the remainder [cf. formula (9.126)]:

u(x) = S_n(x) + δS_n(x) = Σ_{k=1}^{n+1} γ_k sin(kπx) + Σ_{k=n+2}^{∞} γ_k sin(kπx),   (9.135)
and obtain an estimate for the convergence rate [cf. formula (9.127)]:

|u(x) - S_n(x)| = |δS_n(x)| ≤ η_n / n^(r+5/2),   (9.136)
where η_n = o(1) as n → ∞. Note that according to the formulae (9.136) and (9.127), the series (9.130) converges faster than the series (9.124), with the rates O(n^(-(r+5/2))) and O(n^(-(r+1/2))), respectively. The reason is that if the right-hand side f = f(x) of problem (9.113) has r continuous derivatives and a square integrable derivative of order r + 1, then the solution u = u(x) to this problem would normally have r + 2 continuous derivatives and a square integrable derivative of order r + 3. Next, using equalities (9.114), (9.134), and (9.135) and estimates (9.127) and (9.136), we can write ∀x ∈ [0, 1]:

|u(x) - u^(n)(x)| = |S_n(x) + δS_n(x) - Σ_{k=1}^{n+1} B_k sin(πkx)|
= |S_n(x) + δS_n(x) - Σ_{k=1}^{n+1} γ_k sin(πkx) + Σ_{k=1}^{n+1} (δβ_k/(π²k²)) sin(πkx)|
= |δS_n(x) + Σ_{k=1}^{n+1} (δβ_k/(π²k²)) sin(πkx)|
≤ |δS_n(x)| + Σ_{k=1}^{n+1} |δβ_k|/(π²k²)
≤ η_n/n^(r+5/2) + (2ζ_n/n^(r+1/2)) (1/π²) Σ_{k=1}^{∞} 1/k²
= η_n/n^(r+5/2) + ζ_n/(3 n^(r+1/2)) ≤ ξ_n / n^(r+1/2),   (9.137)

where ξ_n = o(1) as n → ∞. The key distinctive feature of error estimate (9.137) is that it provides for a more rapid convergence in the case when the right-hand side f(x) that drives the problem has higher
regularity. In other words, similarly to the original trigonometric interpolation (see Section 3.1.3), the foregoing method of obtaining an approximate solution to problem (9.113) does not get saturated by smoothness. Indeed, the approximation error self-adjusts to the regularity of the data without us having to change anything in the algorithm. Moreover, if the right-hand side of problem (9.113) has continuous periodic derivatives of all orders, then according to estimate (9.137) the method will converge with a spectral rate, i.e., faster than any inverse power of n. For that reason, methods of this type are referred to as spectral methods in the literature.

Note that the simple Fourier-based spectral method that we have outlined in this section will only work for smooth periodic functions, i.e., for the functions that withstand smooth periodic extensions. There are many examples of smooth right-hand sides that do not satisfy this constraint, for example the quadratic function f(x) = x(x - 1) used in Section 9.6, see formula (9.103). However, a spectral method can be built for problem (9.113) with this right-hand side as well. In this case, it will be convenient to look for a solution as a linear combination of Chebyshev polynomials, rather than in the form of a trigonometric polynomial (9.114). This approach is similar to Chebyshev-based interpolations discussed in Section 3.2.

Note also that in this section we enforced the differential equation of (9.113) by requiring that the two trigonometric polynomials, (d²/dx²)u^(n)(x) and f^(n)(x), coincide at the nodes of the grid (9.115), see equalities (9.119). In the context of spectral methods, the points x_m given by (9.115) are often referred to as the collocation points, and the corresponding methods are known as the spectral collocation methods. Alternatively, one can use Galerkin approximations for building spectral methods. The Galerkin method is a very useful and general technique that has many applications in numerical analysis and beyond; we briefly describe it in Section 12.2.3 when discussing finite elements.

Similarly to any other method of approximation, one generally needs to analyze accuracy and stability when designing spectral methods. Over the recent years, a number of efficient spectral methods have been developed for solving a wide variety of initial and boundary value problems for ordinary and partial differential equations. For further detail, we refer the reader to [G077, CHQZ88, Boy01, CHQZ06, HGG06].

Exercise
1. Solve problem (9.113) with the right-hand side f(x) = sin(π sin(πx)) on a computer using the Fourier collocation method described in this section. Alternatively, apply the second order difference method of Section 9.6. Demonstrate experimentally the difference in convergence rates.
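A compact sketch of the Fourier collocation method of this section, usable for this exercise (assuming NumPy; the coefficient formulas implement (9.117), with the k = n + 1 mode carrying half the weight of the others):

```python
import numpy as np

def fourier_collocation(f, n):
    """Approximate solution u(x) = sum_{k=1}^{n+1} B_k sin(pi k x) of
    u'' = f, u(0) = u(1) = 0, from the collocation conditions (9.119)
    on the grid (9.115); returns u as a callable."""
    m = np.arange(n + 1)
    x = m / (n + 1.0) + 1.0 / (2.0 * (n + 1))      # nodes (9.115)
    k = np.arange(1, n + 2)
    S = np.sin(np.pi * np.outer(k, x))             # S[k-1, m] = sin(pi k x_m)
    b = (2.0 / (n + 1)) * (S @ f(x))               # interpolation coefficients (9.117)
    b[-1] *= 0.5                                   # the k = n+1 mode has half weight
    B = -b / (np.pi * k) ** 2                      # relations (9.120)
    return lambda xx: np.sin(np.pi * np.outer(np.atleast_1d(xx), k)) @ B

u = fourier_collocation(lambda x: np.sin(np.pi * np.sin(np.pi * x)), 32)
```

For f(x) = sin(πx) the method reproduces the exact solution -sin(πx)/π² to machine precision, in agreement with (9.123); for f(x) = sin(π sin(πx)) the coefficients decay faster than any power of n, so a couple of dozen modes already give an essentially exact answer, while the difference scheme of Section 9.6 remains stuck at O(h²).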
Chapter 10
Finite-Difference Schemes for Partial Differential Equations
In Chapter 9, we defined the concepts of convergence, consistency, and stability in the context of finite-difference schemes for ordinary differential equations. We have also proved that if the scheme is consistent and stable, then the discrete solution converges to the corresponding differential solution as the grid is refined. This result provides a recipe for constructing convergent schemes. One should start with building consistent schemes and subsequently select stable schemes among them. We emphasize that the definitions of convergence, consistency, and stability, as well as the theorem that establishes the relation between them, are quite general. In fact, these three concepts can be introduced and studied for arbitrary functional equations. In Chapter 9, we illustrated them with examples of the schemes for ordinary differential equations and for an integral equation. In the current chapter, we will discuss the construction of finite-difference schemes for partial differential equations, and consider approaches to the analysis of their stability. Moreover, we will prove the theorem that consistent and stable schemes converge.

In the context of partial differential equations, we will come across a number of important and essentially new circumstances compared to the case of ordinary differential equations. First and foremost, stability analysis becomes rather elaborate and nontrivial. It is typically restricted to the linear case, and even in the linear case, very considerable difficulties are encountered, e.g., when analyzing stability of initial-boundary value problems (as opposed to the Cauchy problem). At the same time, a consistent scheme picked arbitrarily most often turns out unstable. The "pool" of grids and approximation techniques that can be used is very wide; and often, special methods are required for solving the resulting system of finite-difference equations.
10.1 Key Definitions and Illustrating Examples

10.1.1 Definition of Convergence
Assume that a continuous (initial-)boundary value problem

    Lu = f                                                   (10.1)
is to be solved on some domain D with the boundary Γ = ∂D. To approximately compute the solution u of problem (10.1) given the data f, one first needs to specify a discrete set of points D_h ⊂ {D ∪ Γ} that is called the grid. Then, one should introduce a linear normed space U_h of all discrete functions defined on the grid D_h, and subsequently identify the discrete exact solution [u]_h of problem (10.1) in the space U_h. As [u]_h is a grid function and not a continuous function, it shall rather be regarded as a table of values for the continuous solution u. The most straightforward way to obtain this table is by merely sampling the values of u at the nodes of D_h; in this case [u]_h is said to be the trace (or projection) of the continuous solution u on the grid.¹

Since, generally speaking, neither the continuous exact solution u nor its discrete counterpart [u]_h is known, the key objective is to be able to compute [u]_h approximately. For that purpose one needs to construct a system of equations

    L_h u^(h) = f^(h)                                        (10.2)

with respect to the unknown function u^(h) ∈ U_h, such that the approximate solution u^(h) converges to the exact solution [u]_h as the grid is refined:

    ‖[u]_h − u^(h)‖_{U_h} → 0  as  h → 0.                    (10.3)

As has been mentioned, the most intuitive way of building the discrete operator L_h in (10.2) consists of replacing the continuous derivatives contained in L by the appropriate difference quotients, see Section 10.2.1. In this case the discrete system (10.2) is referred to as a finite-difference scheme. Regarding the right-hand side f^(h) of (10.2), again, the simplest way of obtaining it is to take the trace of the continuous right-hand side f of (10.1) on the grid D_h.

The notion of convergence (10.3) of the difference solution u^(h) to the exact solution [u]_h can be quantified.
Namely, if k > 0 is the largest integer such that the following inequality holds for the solution u^(h) of the discrete (initial-)boundary value problem (10.2):

    ‖[u]_h − u^(h)‖_{U_h} ≤ c h^k,   c = const,              (10.4)

then we say that the convergence rate is O(h^k), or alternatively, that the magnitude of the error, i.e., that of the discrepancy between the approximate solution and the exact solution, has order k with respect to the grid size h in the chosen norm ‖·‖_{U_h}. Note that the foregoing definitions of convergence (10.3) and of its rate, or order (10.4), are basically the same as those for ordinary differential equations, see Section 9.1.2.

Let us also emphasize that the way we define convergence for finite-difference schemes, see formulae (10.3) and (10.4), differs from the traditional definition of convergence in a vector space. Namely, when h → 0 the overall number of nodes in the grid D_h will increase, and so will the dimension of the space U_h. As such, U_h shall, in fact, be interpreted as a sequence of spaces of increasing dimension parameterized by the grid size h, rather than a single vector space with a fixed dimension.

¹There are many other ways to define [u]_h, e.g., those based on integration over the grid cells.
Accordingly, the limit in (10.3) shall be interpreted as a limit of the sequence of norms in vector spaces that have higher and higher dimension.

The problem of constructing a convergent scheme (10.2) is usually split into two subproblems. First, one obtains a scheme that is consistent, i.e., that approximates problem (10.1) on its solution u in some specially defined sense. Then, one should verify that the chosen scheme (10.2) is stable.

10.1.2 Definition of Consistency
To assign a tangible meaning to the notion of consistency, one should first introduce a norm in the linear space F_h that contains the right-hand side f^(h) of equation (10.2). Similarly to U_h, F_h should rather be interpreted as a sequence of spaces of increasing dimension parameterized by the grid size h, see Section 10.1.1. We say that the finite-difference scheme (10.2) is consistent, or in other words, that the discrete problem (10.2) approximates the differential problem (10.1) on the solution u of the latter, if the residual δf^(h) defined by the equality

    L_h[u]_h = f^(h) + δf^(h),                               (10.5)

i.e., the residual that arises when the exact solution [u]_h is substituted into the left-hand side of (10.2), vanishes as the grid is refined:

    ‖δf^(h)‖_{F_h} → 0  as  h → 0.                           (10.6)

If, in addition, k > 0 happens to be the largest integer such that
    ‖δf^(h)‖_{F_h} ≤ c₁ h^k,                                 (10.7)
where c₁ is a constant that does not depend on h, then the approximation is said to have order k with respect to the grid size h. In the literature, the residual δf^(h) of the exact solution [u]_h, see formula (10.5), is referred to as the truncation error. Accordingly, if inequality (10.7) holds, we say that the scheme (10.2) has order of accuracy k with respect to h (cf. Definition 9.1 of Chapter 9). Let us, for example, construct a consistent finite-difference scheme for the following Cauchy problem:
    ∂u/∂t − ∂u/∂x = φ(x,t),   −∞ < x < ∞,  0 < t ≤ T,        (10.8)
    u(x, 0) = ψ(x),           −∞ < x < ∞.

Problem (10.8) can be recast in the general form (10.1) if we define

    Lu = { ∂u/∂t − ∂u/∂x,  −∞ < x < ∞,  0 < t ≤ T,
         { u(x, 0),        −∞ < x < ∞,

    f = { φ(x,t),  −∞ < x < ∞,  0 < t ≤ T,
        { ψ(x),    −∞ < x < ∞.
The grid D_h will consist of the nodes

    (x_m, t_p) = (mh, pτ),   m = 0, ±1, ±2, …,  p = 0, 1, …, [T/τ],     (10.9)

where h > 0 and τ > 0 are fixed real numbers, and [T/τ] denotes the integer part of T/τ. We will also assume that the temporal grid size τ is proportional to the spatial size h: τ = rh, where r = const, so that in essence the grid D_h is parameterized by only one quantity h. The discrete exact solution [u]_h on the grid D_h will be defined in the simplest possible way outlined in Section 10.1.1, i.e., as a trace, by sampling the values of the continuous solution u(x,t) of problem (10.8) at the nodes (10.9): [u]_h = {u(mh, pτ)}.

We proceed now to building a consistent scheme (10.2) for problem (10.8). The value of the grid function u^(h) at the node (x_m, t_p) = (mh, pτ) of the grid D_h will hereafter be denoted by u_m^p. To actually obtain the scheme, we replace the partial derivatives ∂u/∂t and ∂u/∂x by the first order difference quotients:

    ∂u/∂t |_(x,t) ≈ [u(x, t + τ) − u(x, t)]/τ,
    ∂u/∂x |_(x,t) ≈ [u(x + h, t) − u(x, t)]/h.

Then, we can write down the following system of equations:

    [u_m^{p+1} − u_m^p]/τ − [u_{m+1}^p − u_m^p]/h = φ_m^p,
        m = 0, ±1, ±2, …,  p = 0, 1, …, [T/τ] − 1,                       (10.10)
    u_m^0 = ψ_m,   m = 0, ±1, ±2, …,

where the right-hand sides φ_m^p and ψ_m are obtained, again, by sampling the values of the continuous right-hand sides φ(x,t) and ψ(x) of (10.8) at the nodes of D_h:

    φ_m^p = φ(mh, pτ),   m = 0, ±1, ±2, …,  p = 0, 1, …, [T/τ] − 1,     (10.11)
    ψ_m = ψ(mh),         m = 0, ±1, ±2, ….

The scheme (10.10) can be recast in the universal form (10.2) if the operator L_h and the right-hand side f^(h) are defined as follows:

    L_h u^(h) = { [u_m^{p+1} − u_m^p]/τ − [u_{m+1}^p − u_m^p]/h,   m = 0, ±1, ±2, …,  p = 0, 1, …, [T/τ] − 1,
                { u_m^0,   m = 0, ±1, ±2, …,

    f^(h) = { φ_m^p,   m = 0, ±1, ±2, …,  p = 0, 1, …, [T/τ] − 1,
            { ψ_m,     m = 0, ±1, ±2, …,

where φ_m^p and ψ_m are given by (10.11). Thus, the right-hand side f^(h) is basically a pair of grid functions φ_m^p and ψ_m such that the first one is defined on the two-dimensional grid (10.9) and the second one is defined on the one-dimensional grid: (x_m, 0) = (mh, 0), m = 0, ±1, ±2, ….
The difference equation from (10.10) can be easily solved with respect to u_m^{p+1}:

    u_m^{p+1} = (1 − r) u_m^p + r u_{m+1}^p + τ φ_m^p.       (10.12)

Thus, if we know the values u_m^p, m = 0, ±1, ±2, …, of the approximate solution u^(h) at the grid nodes that correspond to the time level t = pτ, then we can use (10.12) and compute the values u_m^{p+1} at the grid nodes that correspond to the next time level t = (p+1)τ. As the values u_m^0 at t = 0 are given by the equalities u_m^0 = ψ_m, see (10.10), we can successively compute the discrete solution u_m^p one time level after another for t = τ, t = 2τ, t = 3τ, etc., i.e., everywhere on the grid D_h. The schemes, for which the solution on the upper time level can be obtained as a closed form expression that contains only the values of the solution on the lower time level(s), such as in formula (10.12), are called explicit. The foregoing process of computing the solution one time level after another is known as time marching.

Let us now determine what order of accuracy the scheme (10.10) has. The linear space F_h will consist of all pairs of bounded grid functions f^(h) = [φ_m^p, ψ_m]^T with the norm:²

    ‖f^(h)‖_{F_h} = max_{m,p} |φ_m^p| + max_m |ψ_m|.         (10.13)
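As an illustration, the explicit time marching (10.12) is easy to try on a computer. The sketch below (Python, with illustrative names of our own; a periodic interval of length one stands in for the infinite line, which is a simplification of the book's setting) advances the scheme and measures the error against the exact solution u(x,t) = ψ(x + t) of the homogeneous problem:

```python
import numpy as np

# Explicit scheme (10.12) for u_t - u_x = phi; a periodic interval of length 1
# replaces the infinite line, so psi must be 1-periodic.
def march(h, r, T, phi, psi):
    tau = r * h
    x = np.arange(0.0, 1.0, h)                  # periodic grid on [0, 1)
    u = psi(x)
    t = 0.0
    for _ in range(int(round(T / tau))):
        # u_m^{p+1} = (1 - r) u_m^p + r u_{m+1}^p + tau * phi_m^p
        u = (1.0 - r) * u + r * np.roll(u, -1) + tau * phi(x, t)
        t += tau
    return x, u

psi = lambda x: np.sin(2.0 * np.pi * x)
phi = lambda x, t: np.zeros_like(x)
errors = []
for h in (1 / 64, 1 / 128, 1 / 256):
    x, u = march(h, r=0.5, T=1.0, phi=phi, psi=psi)
    errors.append(np.max(np.abs(u - psi(x + 1.0))))
# each halving of h roughly halves the error (first order accuracy)
```

With r = 0.5 the errors shrink roughly by a factor of two per grid refinement, in agreement with the first order accuracy established below; with r > 1 the computed errors no longer decay, cf. Sections 10.1.4 and 10.1.5.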
We note that one can use various norms for the analysis of consistency, and the choice of the norm can, in fact, make a lot of difference. Hereafter in this section, we will only be using the maximum norm defined by (10.13).

Assume that the solution u(x,t) of problem (10.8) has bounded second derivatives. Then, the Taylor formula yields:

    [u(x_m, t_p + τ) − u(x_m, t_p)]/τ = ∂u/∂t (x_m, t_p) + (τ/2) ∂²u(x_m, t_p + η)/∂t²,
    [u(x_m + h, t_p) − u(x_m, t_p)]/h = ∂u/∂x (x_m, t_p) + (h/2) ∂²u(x_m + ξ, t_p)/∂x²,

where ξ and η are some numbers that may depend on m, p, and h, and that satisfy the inequalities 0 ≤ ξ ≤ h, 0 ≤ η ≤ τ. These formulae allow us to recast the expression

    L_h[u]_h = { [u(x_m, t_p + τ) − u(x_m, t_p)]/τ − [u(x_m + h, t_p) − u(x_m, t_p)]/h,
               { u(x_m, 0),

in the following form:

    L_h[u]_h = { ∂u/∂t (x_m, t_p) − ∂u/∂x (x_m, t_p) + (τ/2) ∂²u(x_m, t_p + η)/∂t² − (h/2) ∂²u(x_m + ξ, t_p)/∂x²,
               { u(x_m, 0) + 0,

or alternatively:

    L_h[u]_h = f^(h) + δf^(h),

where

    δf^(h) = { (τ/2) ∂²u(x_m, t_p + η)/∂t² − (h/2) ∂²u(x_m + ξ, t_p)/∂x²,
             { 0.

Consequently [see (10.13)],

    ‖δf^(h)‖_{F_h} ≤ (τ/2) sup |∂²u/∂t²| + (h/2) sup |∂²u/∂x²| = O(h),   (10.14)

since τ = rh. We can therefore conclude that the finite-difference scheme (10.10) renders first order accuracy with respect to h on the solution u(x,t) of problem (10.8) that has bounded second partial derivatives.

²If either max|φ_m^p| or max|ψ_m| is not reached, then the least upper bound, i.e., supremum, sup|φ_m^p| or sup|ψ_m|, should be used instead in formula (10.13).
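The first order estimate (10.14) can also be checked numerically: substitute a known smooth function into the left-hand side of the scheme and watch the residual decrease linearly with h. A small sketch (Python; the function names and the particular test solution are our own choices, not from the book):

```python
import numpy as np

# Truncation error of scheme (10.10) on u(x, t) = sin(x + t), for which
# u_t - u_x = 0, so the continuous right-hand side phi vanishes.
u = lambda x, t: np.sin(x + t)

def residual(h, r=0.5, t=0.3):
    tau = r * h
    x = np.linspace(-1.0, 1.0, 201)
    # first component of L_h[u]_h - f^(h), cf. (10.5)
    res = (u(x, t + tau) - u(x, t)) / tau - (u(x + h, t) - u(x, t)) / h
    return np.max(np.abs(res))

e1, e2 = residual(0.01), residual(0.005)
# e1 / e2 is close to 2: the truncation error is O(h), cf. (10.14)
```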
10.1.3 Definition of Stability
DEFINITION 10.1  The scheme (10.2) is said to be stable if one can find δ > 0 and h₀ > 0 such that for any h < h₀ and any grid function ε^(h) ∈ F_h, ‖ε^(h)‖_{F_h} < δ, the finite-difference problem

    L_h z^(h) = f^(h) + ε^(h),

obtained from problem (10.2) by perturbing its right-hand side, is uniquely solvable, and its solution z^(h) satisfies ‖z^(h) − u^(h)‖_{U_h} ≤ c‖ε^(h)‖_{F_h}, where c is a constant that does not depend on h and ε^(h).
For linear schemes, stability can be stated equivalently (Definition 10.2): the scheme (10.2) is called stable if there exists h₀ > 0 such that for any h < h₀ and any f^(h) ∈ F_h it is uniquely solvable, and the solution u^(h) satisfies ‖u^(h)‖_{U_h} ≤ c‖f^(h)‖_{F_h}, with a constant c that does not depend on h and f^(h).
If we now take r = τ/h > 1, one can easily see that the scheme (10.10) will remain consistent, i.e., it will still provide the O(h) accuracy on the solution u(x,t) of the differential problem (10.8) that has bounded second derivatives. However, the foregoing stability proof will no longer work. Let us show that for r > 1 the solution u^(h) of the finite-difference problem (10.10) does not, in fact, converge to the exact solution [u]_h (the trace of the solution u(x,t) to problem (10.8) on the grid D_h of (10.9)). Therefore, there may be no stability either, as otherwise consistency and stability would have implied convergence by Theorem 10.1.
10.1.4 The Courant, Friedrichs, and Lewy Condition
In this section, we will prove that if r = τ/h > 1, then the finite-difference scheme (10.10) will not converge for a general function ψ(x). For simplicity, and with no loss of generality (as far as the no-convergence argument), we will assume that φ(x,t) ≡ 0, which implies φ_m^p = φ(mh, pτ) ≡ 0. Let us also set T = 1, and choose the spatial grid size h so that the point (0, 1) on the (x,t) plane be a grid node, or in other words, so that the number N = τ⁻¹ = (rh)⁻¹ be integer, see Figure 10.1.

[FIGURE 10.1: Continuous and discrete domain of dependence.]

Using the difference equation in the form (10.12):

    u_m^{p+1} = (1 − r) u_m^p + r u_{m+1}^p,

one can easily see that the value u_0^{p+1} of the discrete solution u^(h) at the grid node (0, 1) (in this case p + 1 = N) can be expressed through the values u_0^p and u_1^p of the solution at the nodes (0, 1 − τ) and (h, 1 − τ), respectively, see Figure 10.1. Likewise, the values of u_0^p and u_1^p are expressed through the values of u_0^{p−1}, u_1^{p−1}, and u_2^{p−1} at the three grid nodes: (0, 1 − 2τ), (h, 1 − 2τ), and (2h, 1 − 2τ). Similarly, the values of u_0^{p−1}, u_1^{p−1}, and u_2^{p−1} are, in turn, expressed through the values of the solution at the four grid nodes: (0, 1 − 3τ), (h, 1 − 3τ), (2h, 1 − 3τ), and (3h, 1 − 3τ). Eventually we obtain that u_0^{p+1} is expressed through the values of the solution u_m^0 = ψ_m at the nodes: (0, 0), (h, 0), (2h, 0), …, (hT/τ, 0). All these nodes belong to the interval

    0 ≤ x ≤ hT/τ = 1/r

of the horizontal axis t = 0, see Figure 10.1, where the initial condition u(x, 0) = ψ(x)
for the Cauchy problem (10.8) is specified. Thus, the solution u_m^p of the finite-difference problem (10.10) at the point (x,t) = (0, 1), i.e., (m,p) = (0, N), does not depend on the values of the function ψ(x) at the points that do not belong to the interval 0 ≤ x ≤ 1/r. It is, however, easy to see that the solution to the original continuous initial value problem:

    ∂u/∂t − ∂u/∂x = 0,   −∞ < x < ∞,  0 < t ≤ T,
    u(x, 0) = ψ(x),      −∞ < x < ∞,

is given by u(x,t) = ψ(x + t); in particular, u(0, 1) = ψ(1). Consequently, if r = τ/h > 1, one should not, generally speaking, expect convergence of the scheme (10.10). Indeed, in this case the interval 0 ≤ x ≤ 1/r < 1 of the horizontal axis does not contain the point (1, 0). Let us assume for a moment that for some particular ψ(x) the convergence, by accident, does take place. Then, while keeping the values of ψ(x) on the interval 0 ≤ x ≤ 1/r unaltered, we can disrupt this convergence by modifying ψ(x) at and around the point x = 1. This is easy to see, because the latter modification will obviously affect the value u(0, 1) of the continuous solution, whereas the value u_0^N of the discrete solution at (0, 1) will remain unchanged, as it is fully determined by ψ(x) on the interval 0 ≤ x ≤ 1/r. Moreover, one can modify the function ψ(x) at the point x = 1 and in its vicinity so that it will stay sufficiently smooth (twice differentiable). Then, the solution u(x,t) = ψ(x + t) will inherit this smoothness, and the scheme (10.10) will therefore remain consistent, see (10.14). Under these conditions, stability of the scheme (10.10) would have implied convergence. As, however, there is no convergence for r > 1, there may be no stability either.

The argument we have used to prove that the scheme (10.10) is divergent (and thus unstable) when r = τ/h > 1 is, in fact, quite general. It can be presented as follows. Assume that some function ψ is involved in the formulation of the original problem, i.e., that it provides all or part of the required input data. Let P be an arbitrary point from the domain of the solution u. Let also the value u(P) depend on the values of the function ψ at the points of some region G_ψ = G_ψ(P) that belongs to the domain of ψ. This means that by modifying ψ in a small neighborhood of any point Q ∈ G_ψ(P) one can basically alter the solution u(P). The set G_ψ(P) is referred to as the continuous domain of dependence for the solution u at the point P. In the previous example, G_ψ(P)|_{P=(0,1)} = {(1, 0)}, see Figure 10.1.

Suppose now that the solution u is computed by means of a scheme that we denote L_h u^(h) = f^(h), and that in so doing the value of the discrete solution u^(h) at the point P^(h) of the grid closest to P is fully determined by the values of the function ψ on some region G_ψ^(h)(P). This region is called the discrete domain of dependence, and in the previous example: G_ψ^(h)(P)|_{P=(0,1)} = {x | 0 ≤ x ≤ 1/r}, see Figure 10.1.

The Courant, Friedrichs, and Lewy condition says that for the convergence u^(h) → u to take place as h → 0, the scheme should necessarily be designed in such a way that in any neighborhood of an arbitrary point from G_ψ(P) there always be a point from G_ψ^(h)(P), provided that h is chosen sufficiently small. In the literature, this condition is commonly referred to as the CFL condition. The number r = τ/h that in the previous example distinguishes between convergence and divergence as r crosses the value 1 is known as the CFL number or the Courant number.
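The domain-of-dependence argument can be made concrete in a few lines of code. In the sketch below (Python, with function names of our own; an illustration, not the book's construction), we evaluate the discrete solution at (0, 1) with r = 2 and then add a smooth bump to ψ near x = 1: the exact value u(0, 1) = ψ(1) changes by one, while the discrete value, which only sees 0 ≤ x ≤ 1/2, barely moves:

```python
import numpy as np

# Discrete solution of (10.10) (phi == 0) at the node (x, t) = (0, 1),
# computed by shrinking the triangle of dependent nodes one level at a time.
def discrete_value_at_origin(psi, h, r):
    tau = r * h
    N = int(round(1.0 / tau))                # number of steps to reach t = 1
    x = h * np.arange(N + 1)                 # only nodes in [0, N*h] = [0, 1/r] matter
    u = psi(x)
    for _ in range(N):
        u = (1.0 - r) * u[:-1] + r * u[1:]   # formula (10.12) with phi = 0
    return u[0]

psi1 = lambda x: np.sin(np.pi * x)
psi2 = lambda x: np.sin(np.pi * x) + np.exp(-50.0 * (x - 1.0) ** 2)  # bump at x = 1
v1 = discrete_value_at_origin(psi1, h=0.1, r=2.0)
v2 = discrete_value_at_origin(psi2, h=0.1, r=2.0)
# psi2(1) - psi1(1) = 1, yet v2 - v1 is tiny: the scheme cannot "see" the bump
```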
Let us explain why in general there is no convergence if the foregoing CFL condition does not hold. Assume that it is not met, so that no matter how small the grid size h is, there are no points from the set G_ψ^(h)(P) in some fixed neighborhood of a particular point Q ∈ G_ψ(P). Even if for some ψ the convergence u^(h) → u does incidentally take place, we can modify ψ inside the aforementioned neighborhood of Q, while keeping it unchanged elsewhere. This modification will obviously cause a change in the value of u(P). At the same time, for all sufficiently small h the values of u^(h) at the respective grid nodes P^(h) closest to P will remain the same, because the function ψ has not changed at the points of the sets G_ψ^(h)(P). Therefore, for the new function ψ the scheme may no longer converge. Note that the CFL condition can be formulated as a theorem, while the foregoing arguments can be transformed into its proof. We also re-emphasize that for a consistent scheme the CFL condition is necessary not only for its convergence, but also for stability. If this condition does not hold, then there may be no stability, because otherwise consistency and stability would have implied convergence by Theorem 10.1.
10.1.5 The Mechanism of Instability
The proof of instability for the scheme (10.10) given in Section 10.1.4 for the case r = τ/h > 1 is based on using the CFL condition that is necessary for convergence and for stability. As such, this proof is non-constructive in nature. It would, however, be very helpful to actually see how the instability of the scheme (10.10) manifests itself when r > 1, i.e., how it affects the sensitivity of the solution u^(h) to the perturbations of the data f^(h). Recall that according to Section 10.1.3 the scheme is called stable if the finite-difference solution is weakly sensitive to the errors in f^(h), and the sensitivity is uniform with respect to h.

Assume for simplicity that for all h the right-hand sides of equations (10.10) are identically equal to zero: φ_m^p ≡ 0 and ψ_m ≡ 0, so that

    f^(h) = [φ_m^p, ψ_m]^T = 0,

and consequently, the solution u^(h) = {u_m^p} of problem (10.10) is also identically equal to zero: u_m^p ≡ 0. Suppose now that an error has been committed in the initial data, and instead of ψ_m = 0 a different function ψ̃_m = (−1)^m ε (ε = const) has been specified, which yields

    f̃^(h) = [φ_m^p, ψ̃_m]^T,   ‖f̃^(h)‖_{F_h} = ε,

on the right-hand side of (10.10) instead of f^(h) = 0. Let us denote the corresponding solution by ũ^(h). In accordance with the finite-difference equation

    u_m^{p+1} = (1 − r) u_m^p + r u_{m+1}^p,

for ũ_m^1 we obtain: ũ_m^1 = (1 − r)ψ̃_m + r ψ̃_{m+1} = (1 − 2r) ũ_m^0. We thus see that the error committed at p = 0 has been multiplied by the quantity 1 − 2r when advancing to
p = 1. For yet another time level p = 2 we have:

    ũ_m^2 = (1 − 2r)² ũ_m^0,

and in general,

    ũ_m^p = (1 − 2r)^p ũ_m^0.

When r > 1 we obviously have 1 − 2r < −1, so that each time step, i.e., each transition between the levels p and p + 1, implies yet another multiplication of the initial error ũ_m^0 = (−1)^m ε by a negative quantity with the absolute value greater than one. For p = [T/τ] we thus obtain:

    ũ_m^{[T/τ]} = (1 − 2r)^{[T/(rh)]} ũ_m^0.

Consequently,

    ‖ũ^(h)‖_{U_h} = max_p sup_m |ũ_m^p| = |1 − 2r|^{[T/(rh)]} sup_m |ψ̃_m| = |1 − 2r|^{[T/(rh)]} ‖f̃^(h)‖_{F_h}.
10. 1 .6
O.
The Kantorovich Theorem
For a wide class of linear operator equations, the theory of their approximate so lution (not necessarily numerical) can be developed, and the foregoing concepts of consistency, stability, and convergence can be introduced and studied in a unified general framework. In doing so, the exact equation and the approximating equation should basically be considered in different spaces of functions  the original space and the approximating space (like, for example, the space of continuous functions and the space of grid functions).4 However, it will also be very helpful to establish a fundamental relation between consistency, stability, and convergence for the most basic setup, when all the operators involved are assumed to act in the same space of functions. The corresponding result is known as the Kantorovich theorem; and a number of more specific results, such as that of Theorem 10. 1 , can be interpreted as its particular realizations. Of course, each individual realization will still require some nontrivial constructive steps pertaining primarily to the proper definition of the spaces and operators involved. A detailed description of the corresponding develop ments can be found in [KA82], including a comprehensive analysis of the case when the approximating equation is formulated on a different space (subspace) of func tions. Monograph [RM67] provides for an account oriented more toward computa tional methods. The material of this section is more advanced, and can be skipped during the first reading. 4 The approximating space often appears isomorphic to a subspace of the original space.
Let U and F be two Banach spaces, and let L be a linear operator, L : U → F, that has a bounded inverse, L⁻¹ : F → U, ‖L⁻¹‖ < ∞. In other words, we assume that the problem

    Lu = f                                                   (10.26)

is uniquely solvable for every f ∈ F and well-posed. Let L_h : U → F be a family of operators parameterized by some h (for example, we may have h = 1/n, n = 1, 2, 3, …). Along with the original problem (10.26), we introduce a series of its "discrete" counterparts:
    L_h u^(h) = f,                                           (10.27)

where u^(h) ∈ U and each L_h is also assumed to have a bounded inverse, L_h⁻¹ : F → U, ‖L_h⁻¹‖ < ∞. The operators L_h are referred to as approximating operators. We say that problem (10.27) is consistent, or in other words, that the operators L_h of (10.27) approximate the operator L of (10.26), if for any u ∈ U we have

    ‖L_h u − Lu‖_F → 0  as  h → 0.                           (10.28)

Note that any given u ∈ U can be interpreted as the solution to problem (10.26) with the right-hand side defined as F ∋ f = Lu. Then, the general notion of consistency (10.28) becomes similar to the concept of approximation on a solution introduced in Section 10.1.2, see formula (10.6). Problem (10.27) is said to be stable if all the inverse operators are bounded uniformly:

    ‖L_h⁻¹‖ ≤ C = const,                                     (10.29)

which means that C does not depend on h. This is obviously a stricter condition than simply having each L_h⁻¹ bounded; it is, again, similar to Definition 10.2 of Section 10.1.3.

THEOREM 10.2 (Kantorovich)
Provided that properties (10.28) and (10.29) hold, the solution u^(h) of the approximating problem (10.27) converges to the solution u of the original problem (10.26):

    ‖u − u^(h)‖_U → 0  as  h → 0.                            (10.30)

PROOF  Given (10.28) and (10.29), we have

    ‖u − u^(h)‖_U = ‖L_h⁻¹ L_h u − L_h⁻¹ f‖_U ≤ ‖L_h⁻¹‖ ‖L_h u − f‖_F ≤ C ‖L_h u − f‖_F
                  = C ‖L_h u − Lu + Lu − f‖_F = C ‖L_h u − Lu‖_F → 0  as  h → 0,

because Lu = f and L_h u^(h) = f.  □
Theorem 10.2 actually asserts only one implication of the complete Kantorovich theorem, namely, that consistency and stability imply convergence, cf. Theorem 10.1. The other implication says that if the approximate problem is consistent and convergent, then it is also stable. Altogether it means that, under the assumption of consistency, the notions of stability and convergence are equivalent. The proof of this second implication is based on the additional result known as the principle of uniform boundedness for families of operators, see [RM67]. This proof is generally not difficult, but we omit it here because otherwise it would require introducing additional notions and concepts from functional analysis.
10.1.7 On the Efficacy of Finite-Difference Schemes
Let us now briefly comment on the approach we have previously adopted for assessing the quality of approximation by a finite-difference scheme. In Section 10.1.2, it is characterized by the norm of the truncation error ‖δf^(h)‖_{F_h} measured as a power of the grid size h, see formula (10.7). As we have seen in Theorem 10.1, for stable schemes the foregoing characterization, known as the order of accuracy, coincides with the (asymptotic) order of the error ‖[u]_h − u^(h)‖_{U_h}, also referred to as the convergence rate, see formula (10.4). As the latter basically tells us how good the approximate solution is, it is natural to characterize the scheme by the amount of computations required for achieving a given order of error. In its own turn, the amount of computations is generally proportional to the overall number of grid nodes N. For ordinary differential equations, N is inversely proportional to the grid size h. Therefore, when we say that ε is the norm of the error, where ε = O(h^k), we actually imply that ε = O(N^{−k}), i.e., that driving the error down by a factor of two would require increasing the computational effort by roughly a factor of 2^{1/k}. Therefore, in the case of ordinary differential equations, the order of accuracy with respect to h adequately characterizes the required amount of computations.

For partial differential equations, however, this is no longer the case. In the previously analyzed example of a problem with two independent variables x and t, the grid is defined by the two sizes: h and τ. The number of nodes N that belong to a given bounded region of the (x,t) plane is obviously of order 1/(τh). Let τ = rh. In this case N = O(h⁻²), and saying that ε = O(h^k) is equivalent to saying that ε = O(N^{−k/2}). If τ = rh², then N = O(h⁻³), and saying that ε = O(h^k) is equivalent to saying that ε = O(N^{−k/3}).

We see that in the case of partial differential equations the magnitude of the error may be more natural to quantify using the powers of N⁻¹ rather than those of h. Indeed, this would allow one to immediately determine the amount of additional work (assumed proportional to N) needed to reduce the error by a prescribed factor. Hereafter, we will nonetheless adhere to the previous way of characterizing the accuracy by means of the powers of h, because it is more convenient for derivations, and it also has some "historical" reasons in the literature. The reader, however, should always keep in mind the foregoing considerations when comparing properties of the schemes and determining their suitability for solving a given problem.

We should also note that our previous statement on the proportionality of the computational work to the number of grid nodes N does not always hold either. There are, in fact, examples of finite-difference schemes that require O(N^{1+α}) arithmetic operations for finding the discrete solution, where α may be equal to 1/2, 1, or even 2. One may encounter situations like that when solving finite-difference problems that approximate elliptic equations, or when solving the problems with three or more independent variables [e.g., u(x,y,t)].

In the context of real computations, one normally takes the code execution time as a natural measure of the algorithm quality and uses it as a key criterion for the comparative assessment of numerical methods. However, the execution time is not necessarily proportional to the number of floating point operations. It may also be affected by the integer, symbolic, or logical operations. There are other factors that may play a role, for example, the so-called memory bandwidth that determines the rate of data exchange between the computer memory and the CPU. For multiprocessor computer platforms with distributed memory, the overall algorithm efficacy is to a large extent determined by how efficiently the data can be exchanged between different blocks of the computer memory. All these considerations need to be taken into account when choosing the method and when subsequently interpreting its results.
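The bookkeeping above reduces to a one-line formula: if the error behaves like O(N^{−k/d}), with d = 2 for τ = rh and d = 3 for τ = rh² in the two-variable example, then cutting the error by a factor f requires multiplying the work by f^{d/k}. A minimal sketch (Python; the function name is our own):

```python
def work_multiplier(f, k, d):
    # error ~ N**(-k/d)  =>  N must grow by f**(d/k) to reduce the error by f
    return f ** (d / k)

# halving the error with a first order scheme and tau = r*h costs 4x the work;
# a second order scheme only doubles it
first_order = work_multiplier(2.0, k=1, d=2)    # 4.0
second_order = work_multiplier(2.0, k=2, d=2)   # 2.0
```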
10.1.8 Bibliography Comments
The notion of stability for finite-difference schemes was first introduced by von Neumann and Richtmyer in a 1950 paper [VNR50] that discusses the computation of gas-dynamic shocks. In that paper, stability was treated as sensitivity of finite-difference solutions to perturbations of the input data, in the sense of whether or not the initial errors will get amplified as the time elapses. No attempt was made at proving convergence of the corresponding approximation.

In work [OHK51], O'Brien, Hyman, and Kaplan studied finite-difference schemes for the heat equation. They used the apparatus of finite Fourier series (see Section 5.7) to analyze the sensitivity of their finite-difference solutions to perturbations of the input data. In modern terminology, the analysis of [OHK51] can be qualified as a spectral study of stability for a particular case, see Section 10.3.

The first comprehensive systems of definitions of stability and consistency that enabled obtaining convergence as their implication were proposed by Ryaben'kii in 1952, see [Rya52], and by Lax in 1953, see [RM67, Chapter 3].

Ryaben'kii in work [Rya52] analyzed the Cauchy problem for linear partial differential equations with coefficients that may only depend on t. He derived necessary and sufficient conditions for stability of finite-difference approximations in the sense of the definition that he introduced in the same paper. He also built constructive examples of stable finite-difference schemes for the systems that are hyperbolic or p-parabolic in the sense of Petrowsky.

Lax, in his 1953 system of definitions, see [RM67, Chapter 3], considered finite-difference schemes for time-dependent operator equations. His assumption was that these schemes operate in the same Banach spaces of functions as the original differential operators do. The central result of the Lax theory is known as the equivalence theorem. It says that if the original continuous problem is uniquely solvable and
well-posed, and if its finite-difference approximation is consistent, then stability of the approximation is necessary and sufficient for its convergence.

The system of basic definitions adopted in this book, as well as the form of the key theorem that stability and consistency imply convergence, are close to those proposed by Filippov in 1955, see [Fil55] and also [RF56]. The most important difference is that we use a more universal definition of consistency compared to the one by Filippov, [Fil55]. We emphasize that, in the approach by Filippov, the analysis is conducted entirely in the space of discrete functions. It allows for all types of equations, not necessarily time-dependent, see [RF56]. In general, the framework of [Fil55], [RF56] can be termed as somewhat more universal and as such, less constructive, than the previous more narrow definitions by Lax and by Ryaben'kii. Continuing along the same lines, the Kantorovich theorem of Section 10.1.6 can, perhaps, be interpreted as the "ultimate" generalization of all the results of this type.

Let us also note that in the 1928 paper by Courant, Friedrichs, and Lewy [CFL28], as well as in many other papers that employ the method of finite differences to prove existence of solutions to differential equations, the authors establish inequalities that could nowadays be interpreted as stability with respect to particular norms. However, the specific notion of finite-difference stability has rather been introduced in the context of using the schemes for the approximate computation of solutions, under the assumption that those solutions already exist. Therefore, stability in the framework of approximate computations is usually studied in weaker norms than those that would have been needed for the existence proofs.
In the literature, the method of finite differences was apparently first used for proving existence of solutions to partial differential equations by Lyusternik in 1924, see his later publication [Lyu40]. He analyzed the Laplace equation.

Exercises

1. For the Cauchy problem (10.8), analyze the following finite-difference scheme:

(u_m^{p+1} − u_m^p)/τ − (u_m^p − u_{m−1}^p)/h = φ_m^p,  m = 0, ±1, ±2, ...,  p = 0, 1, ..., [T/τ] − 1,
u_m^0 = ψ_m,  m = 0, ±1, ±2, ...,

where τ = rh and r = const.

a) Explicitly write down the operator L_h and the right-hand side f^(h) that arise when recasting the foregoing scheme in the form L_h u^(h) = f^(h).
b) Depict graphically the stencil of the scheme, i.e., the locations of the three grid nodes for which the difference equation connects the values of the discrete solution u_m^p.
c) Show that the finite-difference scheme is consistent and has first order accuracy with respect to h on a solution u(x,t) of the original differential problem that has bounded second derivatives.
d) Find out whether the scheme is stable for any choice of r.
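A quick numerical probe of part d) — not part of the original exercise — substitutes the harmonic u_m^p = λ^p e^{iαm} into the homogeneous scheme, which gives the amplification factor λ(α) = 1 + r − re^{−iα} (this substitution is developed systematically in Section 10.3); the sampled values of r and α below are arbitrary choices:

```python
import numpy as np

# Amplification factor of u_m^{p+1} = (1 + r) u_m^p - r u_{m-1}^p (the scheme of
# Exercise 1 with phi = 0) on the harmonic u_m^p = lambda^p * exp(i*alpha*m).
alphas = np.linspace(0.0, 2.0 * np.pi, 2001)
maxima = {}
for r in (0.25, 0.5, 1.0):
    lam = 1.0 + r - r * np.exp(-1j * alphas)
    maxima[r] = np.abs(lam).max()
    print(f"r = {r}: max |lambda(alpha)| = {maxima[r]:.4f}")
# The maximum is attained at alpha = pi and equals 1 + 2r > 1 for every r > 0.
```

Since max_α |λ(α)| exceeds one for every r > 0, errors grow at every time step no matter how the time step is refined, which suggests the answer to part d).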
Finite-Difference Schemes for Partial Differential Equations
2. For the Cauchy problem:

∂u/∂t + ∂u/∂x = φ(x,t),  −∞ < x < ∞,  0 < t ≤ T,
u(x,0) = ψ(x),  −∞ < x < ∞,

analyze the following two difference schemes according to the plan of Exercise 1:
3. Consider a Cauchy problem for the heat equation:

∂u/∂t − ∂²u/∂x² = φ(x,t),  −∞ < x < ∞,  0 ≤ t ≤ T,      (10.31)
u(x,0) = ψ(x),  −∞ < x < ∞,

and introduce the following finite-difference scheme:

(u_m^{p+1} − u_m^p)/τ − (u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = φ_m^p,
m = 0, ±1, ±2, ...,  p = 0, 1, ..., [T/τ] − 1,      (10.32)
u_m^0 = ψ_m,  m = 0, ±1, ±2, ....

Assume that the norms in the spaces U_h and F_h are defined according to (10.21) and (10.22), respectively.

a) Verify that when τ/h² = r = const, scheme (10.32) approximates problem (10.31) with accuracy O(h²).
b)* Show that when r ≤ 1/2 and φ(x,t) ≡ 0, the following maximum principle holds for the finite-difference solution u_m^p:

sup_m |u_m^{p+1}| ≤ sup_m |u_m^p|.

Hint. See the analysis in Section 10.6.1.
c)* Using the same argument as in Question b)*, prove that the finite-difference scheme (10.32) is stable for r ≤ 1/2, including the general case φ(x,t) ≢ 0.
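The threshold r = 1/2 from parts b)* and c)* can be observed directly. The sketch below (grid size, step count, and the periodic wrap-around standing in for the infinite line are all arbitrary choices) advances the highest-frequency grid mode u_m^0 = (−1)^m, which one step of scheme (10.32) with φ ≡ 0 multiplies exactly by 1 − 4r:

```python
import numpy as np

def heat_step(u, r):
    # One step of scheme (10.32) with phi = 0, periodic in m:
    # u_m^{p+1} = u_m^p + r * (u_{m+1}^p - 2 u_m^p + u_{m-1}^p),  r = tau / h^2
    return u + r * (np.roll(u, -1) - 2.0 * u + np.roll(u, 1))

M, steps = 50, 20
u0 = (-1.0) ** np.arange(M)      # highest-frequency mode representable on the grid
sup_norm = {}
for r in (0.4, 0.6):
    u = u0.copy()
    for _ in range(steps):
        u = heat_step(u, r)
    sup_norm[r] = np.abs(u).max()
    print(f"r = {r}: sup |u^p| after {steps} steps = {sup_norm[r]:.3e}")
# r = 0.4 <= 1/2: the mode decays as |1 - 4r|^p;  r = 0.6 > 1/2: it grows as 1.4^p.
```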
4. The solution of the Cauchy problem (10.31) for the one-dimensional homogeneous heat equation, φ(x,t) ≡ 0, is given by the Poisson integral:

u(x,t) = 1/(2√(πt)) ∫_{−∞}^{+∞} e^{−(x−ξ)²/(4t)} ψ(ξ) dξ.

See whether or not it is possible that a consistent explicit scheme (10.32) would also be convergent and stable if τ = rh, r = const, as h → 0.
Hint. Compare the domains of dependence for the continuous and discrete problems, and use the Courant, Friedrichs, and Lewy condition.
5. The acoustics system of equations:

∂v/∂t = ∂w/∂x,  ∂w/∂t = ∂v/∂x,  −∞ < x < ∞,  0 ≤ t ≤ T,
v(x,0) = φ(x),  w(x,0) = ψ(x),  −∞ < x < ∞,

has solutions of the type:

v(x,t) = [φ(x − t) − ψ(x − t) + φ(x + t) + ψ(x + t)]/2,
w(x,t) = [−φ(x − t) + ψ(x − t) + φ(x + t) + ψ(x + t)]/2.

Can the following finite-difference scheme converge?

(v_m^{p+1} − v_m^p)/τ − (w_{m+1}^p − w_m^p)/h = 0,
(w_m^{p+1} − w_m^p)/τ − (v_{m+1}^p − v_m^p)/h = 0,
m = 0, ±1, ±2, ...,  p = 0, 1, ..., [T/τ] − 1,
v_m^0 = φ_m,  w_m^0 = ψ_m,  m = 0, ±1, ±2, ....

Hint. Compare the domains of dependence for the continuous and discrete problems.
6.* The Cauchy problem:

∂u/∂t − ∂u/∂x = 0,  −∞ < x < ∞,  t > 0,
u(x,0) = e^{iαx},  0 < α < 2π,  α = const,

has the following solution:

u(x,t) = e^{iα(x+t)}.

The solution of the corresponding finite-difference scheme:

(u_m^{p+1} − u_m^p)/τ − (u_{m+1}^p − u_m^p)/h = 0,  m = 0, ±1, ±2, ...,  p = 0, 1, ...,
u_m^0 = e^{iα(mh)},  m = 0, ±1, ±2, ...,

is given by (verify that):

u_m^p = (1 − r + re^{iαh})^p e^{iα(mh)},  r = τ/h.

For p = t/τ and m = x/h, this discrete solution converges to the foregoing continuous solution as h → 0 for any fixed value of r = τ/h. At the same time, when r > 1 the difference scheme does not satisfy the Courant, Friedrichs, and Lewy condition, which is necessary for convergence. Explain the apparent paradox.
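The first half of the paradox — convergence for one fixed smooth harmonic even though r > 1 — can be seen in a short experiment; all numbers below (α, r, the observation point, and the grid sizes) are arbitrary illustrative choices:

```python
import cmath

alpha, r = 1.0, 2.0                  # r = tau / h deliberately larger than 1
x = t = 0.5
errors = []
for h in (0.05, 0.005, 0.0005):
    tau = r * h
    p, m = round(t / tau), round(x / h)
    lam = 1.0 - r + r * cmath.exp(1j * alpha * h)
    u_discrete = lam ** p * cmath.exp(1j * alpha * m * h)
    errors.append(abs(u_discrete - cmath.exp(1j * alpha * (x + t))))
    print(f"h = {h}: error = {errors[-1]:.2e}")
# The error decreases as h -> 0 for this one analytic initial function,
# even though r = 2 > 1.
```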
10.2 Construction of Consistent Difference Schemes

10.2.1 Replacement of Derivatives by Difference Quotients
Perhaps the simplest approach to constructing finite-difference schemes that would approximate the corresponding differential problems is based on replacing the derivatives in the original equations by appropriate difference quotients. Below, we illustrate this approach with several examples. In these examples, we use the following approximation formulae, assuming that h is small:
df(x)/dx ≈ [f(x + h) − f(x)]/h,
df(x)/dx ≈ [f(x) − f(x − h)]/h,
df(x)/dx ≈ [f(x + h) − f(x − h)]/(2h),      (10.33)
d²f(x)/dx² ≈ [f(x + h) − 2f(x) + f(x − h)]/h².

Suppose that the function f = f(x) has sufficiently many bounded derivatives. Then the approximation error for the equalities (10.33) can be estimated similarly to how it has been done in Section 9.2.1 of Chapter 9. Using the Taylor formula, we can write:

f(x + h) = f(x) + hf′(x) + (h²/2!)f″(x) + (h³/3!)f‴(x) + (h⁴/4!)f⁽⁴⁾(x) + o(h⁴),
f(x − h) = f(x) − hf′(x) + (h²/2!)f″(x) − (h³/3!)f‴(x) + (h⁴/4!)f⁽⁴⁾(x) + o(h⁴),      (10.34)

where o(h^a) is a conventional generic notation for any quantity that satisfies lim_{h→0} o(h^a)/h^a = 0. Substituting expansions (10.34) into the right-hand sides of the approximate equalities (10.33), one can obtain the expressions for the error:

[f(x + h) − f(x)]/h = f′(x) + [(h/2!)f″(x) + o(h)],
[f(x) − f(x − h)]/h = f′(x) + [−(h/2!)f″(x) + o(h)],
[f(x + h) − f(x − h)]/(2h) = f′(x) + [(h²/3!)f‴(x) + o(h²)],      (10.35)
[f(x + h) − 2f(x) + f(x − h)]/h² = f″(x) + [(2h²/4!)f⁽⁴⁾(x) + o(h²)].

The error of each given approximation formula in (10.33) is the term in square brackets on the right-hand side of the corresponding equality in (10.35). It is clear that formulae of the type (10.33) and (10.35) can also be used for approximating partial derivatives with difference quotients. For example:
∂u(x,t)/∂t ≈ [u(x, t + τ) − u(x,t)]/τ,

where

[u(x, t + τ) − u(x,t)]/τ = ∂u(x,t)/∂t + [(τ/2) ∂²u(x,t)/∂t² + o(τ)].

Similarly, the following formula holds:

∂u(x,t)/∂x ≈ [u(x + h, t) − u(x,t)]/h,

and

[u(x + h, t) − u(x,t)]/h = ∂u(x,t)/∂x + [(h/2) ∂²u(x,t)/∂x² + o(h)].
Example 1
Let us consider again the Cauchy problem (10.8) of Section 10.1.2:

∂u(x,t)/∂t − ∂u(x,t)/∂x = φ(x,t),  −∞ < x < ∞,  0 < t ≤ T,
u(x,0) = ψ(x),  −∞ < x < ∞.
10. Prove that the monotone schemes defined the same way as in Exercise 9 for φ ≡ 0 and j = j_left, ..., j_right satisfy the maximum principle from page 315, i.e., that the maximum of the corresponding difference solution will not increase as the time elapses.
11.* (Godunov theorem) Prove that no one-step explicit monotone scheme may have accuracy higher than O(h) on smooth solutions of the differential equation u_t + cu_x = 0.
Hint. One version of the proof, due to Harten, Hyman, and Lax, can be found, e.g., in [Str04, pages 71-72]. Note also that there are exceptions, but they are all trivial. For example, the first order upwind scheme with r = 1 has zero error on the exact solution u = u(x − t) of u_t + u_x = 0.
12. Consider a Cauchy problem for the one-dimensional wave (d'Alembert) equation:

∂²u/∂t² − ∂²u/∂x² = φ(x,t),  −∞ < x < ∞,  0 < t ≤ T,
u(x,0) = ψ⁽⁰⁾(x),  ∂u(x,0)/∂t = ψ⁽¹⁾(x),  −∞ < x < ∞,

where φ(x,t), ψ⁽⁰⁾(x), and ψ⁽¹⁾(x) are given.

13. Consider the process of heat transfer on a finite interval, as opposed to the infinite line. It is governed by the heat equation subject to both initial and boundary conditions, which altogether yield the initial boundary value problem:
∂u/∂t − ∂²u/∂x² = φ(x,t),  0 < x < 1,  0 < t ≤ T.

For an arbitrary ε > 0, estimate (10.102) can always be guaranteed by selecting:

ψ(α) = 1,  if α ∈ [α* − δ, α* + δ],
ψ(α) = 0,  if α ∉ [α* − δ, α* + δ],

where α* = argmax_α |λ(α)| and δ > 0. Indeed, as the function |λ(α)|^{2p} is continuous, inequality (10.102) will hold for a sufficiently small δ = δ(ε).
If estimate (10.101) does not take place, then we can find a sequence h_k, k = 0, 1, 2, 3, ..., and the corresponding sequence τ_k = τ(h_k), such that

q_k = ( max_α |λ(α, h_k)| )^{[T/τ_k]} → +∞  as  k → ∞.

Let us set ε = 1 and choose ψ(α) to satisfy (10.102). Define ψ_m as the Fourier coefficients of the function ψ(α), according to formula (10.98). Then, inequality (10.102) for p_k = [T/τ_k] transforms into:

‖u^{p_k}‖² ≥ (q_k − 1)‖ψ‖²,

so that the ratio ‖u^{p_k}‖/‖ψ‖ grows with no bound as k → ∞; i.e., there is indeed no stability (10.96) with respect to the initial data. □
Theorem 10.3 establishes the equivalence between the von Neumann spectral condition and the l₂ stability of scheme (10.95) with respect to the initial data. In fact, one can go even further and prove that the von Neumann spectral condition is necessary and sufficient for the full-fledged l₂ stability of scheme (10.95) as well, i.e., when the right-hand side φ_m^p is not disregarded. One implication, the necessity, immediately follows from Theorem 10.3, because if the von Neumann condition does not hold, then the scheme is unstable even with respect to the initial data. The proof of the other implication, the sufficiency, can be found in [GR87, § 25]. This proof is based on using the discrete Green's functions. In general, once stability with respect to the initial data has been established, stability of the full inhomogeneous problem can be derived using the Duhamel principle. This principle basically says that the solution to the inhomogeneous problem can be obtained as a linear superposition of the solutions to some specially chosen homogeneous problems. Consequently, a stability estimate for the inhomogeneous problem can be obtained on the basis of stability estimates for a series of homogeneous problems, see [Str04, Chapter 9].

10.3.6 Scalar Equations vs. Systems
As of yet, our analysis of finite-difference stability has focused primarily on scalar equations; we have only considered a 2×2 system in Examples 10 and 11 of Section 10.3.3. Besides, in Examples 5 and 9 of Section 10.3.3 we have considered scalar difference equations that connect the values of the solution on more than two consecutive time levels of the grid; they can be reduced to systems of finite-difference equations on two consecutive time levels. In general, a constant-coefficient finite-difference Cauchy problem with vector unknowns (i.e., a system) can be written in a form similar to (10.95):

Σ_{j=j_left}^{j_right} B_j u_{m+j}^{p+1} − Σ_{j=j_left}^{j_right} A_j u_{m+j}^p = φ_m^p,
m = 0, ±1, ±2, ...,  p = 0, 1, ..., [T/τ] − 1,      (10.103)
u_m^0 = ψ_m,  m = 0, ±1, ±2, ...,

under the assumption that the matrices

Σ_{j=j_left}^{j_right} B_j e^{iαj},  0 ≤ α < 2π,

are non-singular. A family of matrices A is called stable if there is a constant K > 0 such that for any particular matrix A from the family, and any positive integer p, the following estimate holds: ‖A^p‖ ≤ K. Scheme (10.103) is stable in l₂ if and only if the family of amplification matrices A = A(α, h) given by (10.104) is stable in the sense of the previous definition (this family is parameterized by α ∈ [0, 2π) and h > 0). The following theorem, known as the Kreiss matrix theorem, provides some necessary and sufficient conditions for a family of matrices to be stable.
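Stability of a family of amplification matrices can also be probed numerically. The example family below is our own illustrative choice — the amplification matrix A(α) = [[cos α, ir sin α], [ir sin α, cos α]] of the Lax-Friedrichs scheme for the acoustics system v_t = w_x, w_t = v_x — and the sampling parameters are arbitrary:

```python
import numpy as np

def power_bound(A_of_alpha, alphas, p_max):
    # sup over sampled family members A(alpha) of ||A(alpha)^p||_2, 1 <= p <= p_max
    worst = 0.0
    for a in alphas:
        A = A_of_alpha(a)
        P = np.eye(A.shape[0], dtype=complex)
        for _ in range(p_max):
            P = P @ A
            worst = max(worst, np.linalg.norm(P, 2))
    return worst

def lax_friedrichs_acoustics(r):
    # r = tau/h; the eigenvalues of A(alpha) are cos(alpha) +- i*r*sin(alpha)
    return lambda a: np.array([[np.cos(a), 1j * r * np.sin(a)],
                               [1j * r * np.sin(a), np.cos(a)]])

alphas = np.linspace(0.0, 2.0 * np.pi, 65)
bound_ok = power_bound(lax_friedrichs_acoustics(0.8), alphas, 50)
bound_bad = power_bound(lax_friedrichs_acoustics(1.5), alphas, 50)
print(bound_ok, bound_bad)   # the first stays at 1, the second is astronomically large
```

Here |λ(α)|² = cos²α + r² sin²α, so the powers stay uniformly bounded exactly when r ≤ 1.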
THEOREM 10.4 (Kreiss)
Stability of a family of matrices A is equivalent to any of the following conditions:

1. There is a constant C₁ > 0 such that for any matrix A from the given family, and any complex number z, |z| > 1, the resolvent (A − zI)^{−1} exists and is bounded as:

‖(A − zI)^{−1}‖ ≤ C₁/(|z| − 1).

2. There are constants C₂ > 0 and C₃ > 0, and for any matrix A from the given family there is a non-singular matrix M such that ‖M‖ ≤ C₂, ‖M^{−1}‖ ≤ C₂, and the matrix D = MAM^{−1} is upper triangular, with the off-diagonal entries that satisfy:

|d_ij| ≤ C₃ min{1 − |κ_i|, 1 − |κ_j|},

where κ_i = d_ii and κ_j = d_jj are the corresponding diagonal entries of D, i.e., the eigenvalues of A.

3. There is a constant C₄ > 0, and for any matrix A from the given family there is a Hermitian positive definite matrix H, such that

C₄^{−1} I ≤ H ≤ C₄ I  and  A*HA ≤ H.

x > 0 (and in particular, to the right endpoint x = 1) measured in the number of grid cells will again increase with no bound as h → 0. Yet the number of grid cells to the left endpoint x = 0 will not change and will remain equal to zero. In other words, the point (x̄, t̄) will never be far from the left boundary, no matter how fine the grid may be. Consequently, we can no longer expect that perturbations of the solution to problem (10.124) near x = 0 will behave similarly to perturbations of the solution to equation (10.109), as the latter is formulated on the grid infinite in both directions. Instead, we shall rather expect that over short periods of time the perturbations of the solution to problem (10.124) near the left endpoint x = 0 will develop analogously to perturbations of the solution to the following constant-coefficient problem:
(u_m^{p+1} − u_m^p)/τ − a(0, t̄)(u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = 0,
m = 1, 2, ...,  p ≥ 0,      (10.125)
l₁u_0^{p+1} = 0.
Problem (10.125) is formulated on the semi-infinite grid m = 0, 1, 2, ... (i.e., the semi-infinite line x ≥ 0). It is obtained from the original problem (10.124) by freezing the coefficient a(x,t) at the left endpoint of the interval 0 ≤ x ≤ 1 and by simultaneously "pushing" the right boundary off all the way to +∞. Problem (10.125) shall be analyzed only for those grid functions u^p = {u_0^p, u_1^p, ...} that satisfy:

u_m^p → 0,  as  m → +∞.      (10.126)
Indeed, only in this case will the perturbation be concentrated near the left boundary x = 0, and only for perturbations of this type will the problems (10.124) and (10.125) be similar to one another in the vicinity of x = 0. Likewise, the behavior of perturbations to the solution of problem (10.124) near the right endpoint x = 1 should resemble that for the problem:
(u_m^{p+1} − u_m^p)/τ − a(1, t̄)(u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = 0,
m = ..., −2, −1, 0, 1, ..., M − 1,  p ≥ 0,      (10.127)
l₂u_M^{p+1} = 0,

that has only one boundary, at m = M. Problem (10.127) is derived from problem (10.124) by freezing the coefficient a(x,t) at the right endpoint of the interval 0 ≤ x ≤ 1 and by simultaneously pushing the left boundary off all the way to −∞. It should be considered only for the grid functions u^p = {..., u_{−1}^p, u_0^p, u_1^p, ..., u_M^p} that satisfy:

u_m^p → 0,  as  m → −∞.      (10.128)
The three problems (10.109), (10.125), and (10.127) are easier to investigate than the original problem (10.124), because they are all h-independent provided that r = τ/h² = const, and they all have constant coefficients. Thus, the issue of studying stability for the scheme (10.124), with the effect of the boundary conditions taken into account, can be addressed as follows. One needs to formulate the three auxiliary problems (10.109), (10.125), and (10.127). For each of these three h-independent problems, one needs to find all those numbers λ (eigenvalues of the transition operator from u^p to u^{p+1}) for which solutions of the type

u_m^p = λ^p u_m      (10.129)

exist. In doing so, for problem (10.109), the function u = {u_m}, m = 0, ±1, ±2, ..., has to be bounded on the grid. For problem (10.125), the grid function u = {u_m}, m ≥ 0, has to satisfy u_m → 0 as m → +∞; and for problem (10.127), the grid function u = {u_m}, m ≤ M, has to satisfy u_m → 0 as m → −∞. For scheme (10.124) to be stable, it is necessary that the overall spectrum of the difference initial boundary value problem, i.e., all eigenvalues of all three problems (10.109), (10.125), and (10.127), belong to the unit disk |λ| ≤ 1 on the complex plane. This is the Babenko-Gelfand stability criterion. Note that problem (10.109) has to be considered for every fixed x ∈ (0, 1) and all t.
REMARK 10.1 Before we continue to study problem (10.124), let us present an important intermediate conclusion that can already be drawn based on the foregoing qualitative analysis. For stability of the pure Cauchy problem (10.108) that has no boundary conditions, it is necessary that the finite-difference equations (10.109) be stable in the von Neumann sense for all (x,t). This requirement remains necessary for stability of the initial boundary value problem (10.124) as well. Moreover, when boundary conditions are present, two more auxiliary problems, (10.125) and (10.127), have to be stable in a similar sense. Therefore, adding boundary conditions to a finite-difference Cauchy problem will not, generally speaking, improve its stability. Boundary conditions may either remain neutral or hamper the overall stability if, for example, problem (10.109) appears stable but one of the problems (10.125) or (10.127) happens to be unstable. Later on, we will discuss this phenomenon in more detail. □
Let us now assume for simplicity that a(x,t) ≡ 1 in problem (10.124), and let us calculate the spectra of the three auxiliary problems (10.109), (10.125), and (10.127) for various boundary conditions l₁u_0^{p+1} = 0 and l₂u_M^{p+1} = 0. Substituting the solution in the form u_m^p = λ^p u_m into the finite-difference equation (10.109), we obtain:

λu_m = u_m + r(u_{m+1} − 2u_m + u_{m−1}),

which immediately yields:

u_{m+1} − ((λ − 1 + 2r)/r) u_m + u_{m−1} = 0.      (10.130)

This is a second order homogeneous ordinary difference equation. To find the general solution of equation (10.130), we write down its algebraic characteristic equation:

q² − ((λ − 1 + 2r)/r) q + 1 = 0.      (10.131)

If q is a root of the quadratic equation (10.131), then the grid function

u_m = q^m      (10.132)

solves the homogeneous finite-difference equation (10.130). If |q| = 1, i.e., if q = e^{iα}, then the grid function u_m = e^{iαm}, which is obviously bounded for m → ±∞, yields a solution of the form (10.132), provided that

λ = 1 − 4r sin²(α/2),  0 ≤ α < 2π.
For the remaining values of λ, the roots of equation (10.131) no longer have unit modulus, yet their product is still equal to one, see equation (10.131). Consequently, the absolute value of one root of equation (10.131) will be greater than one, while that of the other will be less than one. Let us label them so that |q₁(λ)| < 1 and |q₂(λ)| > 1. The general solution of equation (10.130) has the form:

u_m = c₁ q₁^m + c₂ q₂^m,

where c₁ and c₂ are arbitrary constants. Accordingly, the general solution that satisfies the additional constraint (10.126), i.e., that decays as m → +∞, is written as u_m = c₁ q₁^m, and the general solution that satisfies the additional constraint (10.128), i.e., that decays as m → −∞, is given by u_m = c₂ q₂^m.
To calculate the eigenvalues of problem (10.125), one needs to substitute u_m = q₁^m into the left boundary condition l₁u₀ = 0 and find those q₁ and λ for which it is satisfied. The corresponding λ will yield the spectrum of problem (10.125). If, for example, l₁u₀ ≡ u₀ = 0, then the equality c₁q₁⁰ = 0 will not hold for any c₁ ≠ 0, so that problem (10.125) has no eigenvalues. (Recall, the eigenfunction must be nontrivial.) If l₁u₀ ≡ u₁ − u₀ = 0, then the condition c₁(q₁ − q₁⁰) = c₁(q₁ − 1) = 0 again yields c₁ = 0 because q₁ ≠ 1, so that there are no eigenvalues either. If, however, l₁u₀ ≡ 2u₁ − u₀ = 0, then the condition c₁(2q₁ − q₁⁰) = c₁(2q₁ − 1) = 0 is satisfied for c₁ ≠ 0 and q₁ = 1/2 < 1. Substituting q₁ = 1/2 into the characteristic equation (10.131), we find that

λ = 1 + r(q₁ − 2 + 1/q₁) = 1 + r/2.

This is the only eigenvalue of problem (10.125). It does not belong to the unit disk on the complex plane, and therefore the necessary stability condition is violated. The eigenvalues of the auxiliary problem (10.127) are calculated analogously. They are found from the equation l₂u_M = 0 when u_m = c₂q₂^m.
A Theoretical Introduction to Numerical Analysis
382
functions, on which the instability may develop. In the pure von Neumann case, with no boundary conditions, we have only been monitoring the behavior of the harmon ics eiam that are bounded on the entire grid m = 0, ± 1 , ±2, . . .. With the boundary conditions present, we may need to include additional functions that are bounded on the semiinfinite grid, but unbounded if extended to the entire grid. These functions do not belong to the von Neumann category. If any of them brings along an unstable eigenvalue I I., / > 1, such as 1., = 1 + r/ 2 , then the overall scheme becomes unstable as well. We therefore reiterate that if the scheme that approximates some Cauchy problem is supplemented by boundary conditions and thus transformed into an initial boundary value problem, then its stability will not be improved. In other words, if the Cauchy problem was stable, then the initial boundary value problem may either remain stable or become unstable. If, however, the Cauchy problem is unstable, then the initial boundary value problem will not become stable. Our next example will be the familiar first order upwind scheme, but built on a finite grid: Xm = mh, m = 0 , 1 , 2, . . . ,M, Mh = 1 , rather than on the infinite grid
m = 0, ± I , ±2, . . .
:
ump +1 Ump = 0, U;:,+ I  ufn "'''r h ( 10. 133) m = 0, 1 , 2, . . . , M  1, p = 0 , 1 , 2 , . . . , [T / r]  1, U� = lJf(xm), u�+1 = O. Scheme ( 10.133) approximates the following first order hyperbolic initial boundary value problem:
du du = 0 0 :::; x :::; 1 , 0 < t :::; T, dt dx ' u(x, O) = lJf(x), u(l,t) = 0, on the interval 0 :::; x :::; 1. To investigate stability of scheme ( 10.133), we will em _
ploy the BabenkoGelfand criterion. In other words, we will need to analyze three auxiliary problems: A problem with no lateral boundaries:
Ump = 0, Ump'+ 1Ump+!  Ump '.:..: h',  m = 0, ± 1, ±2, . . . ,
( 10. 134)
a problem with only the left boundary:
ump''Ump = 0, ufn+ 1  ufn ':':"; +lh m = 0, 1 , 2 , . . . ,
(10. 135)
and a problem with only the right boundary:
Ump''Ump+ !  Ump � + !  Ump = 0, h . m = M  l ,M 2, . . , 1 ,0,  1 , . . . , upM+ !  0 . 
(10 .1 36)
FiniteDifference Schemesfor Partial Differential Equations
383
Note that we do not set any boundary condition at the left boundary in problem (10.135) as we did not have any in the original problem (10.133) either. We will need to find spectra of the three transition operators from uP to up+ 1 that correspond to the three auxiliary problems (10.134), (10.135), and (10.136), respectively. and determine under what conditions will all the eigenvalues belong to the unit disk 1).1 :5 I on the complex plane. Substituting a solution of the type:
into the finitedifference equation: ' +r ' p+I  (1  r)um Um+11
um
r
0:
T/h,
that corresponds to all three problems (10.134), (10.135), and (10.136), we obtain the following first order ordinary difference equation for the eigenfunction {um } : ()..  I + r}um  rUm+1
=
O.
(10.137)
Its characteristic equation: A  I + r  rq = O
(10.138)
yields the relation between ).. and q. so that the general solution of equation (10.137) can be written as "' um = cq = c
(1. _ I ) +,
m
r
m = 0, ± I , ±2, . . . ,
c = const.
When Iql = I, i.e., when q = eia, 0 :5 IX < 21[", we have: ).. = l  r + re'a. The point it = it(a) sweeps the circle ofradiusr centered atthe point (lr,O) on the complex plane. This circle gives the spectrum. i.e., the full set of eigenvalues, of the first auxiliary problem (10.134), see Figure 1O.ll(a). It is clearly the same spectrum as we have discussed in Section 10.3.2, see formula (10.77) and Figure 1O.5(a). 1m'
1m }.. '
Im
'l\_
o
(a) Problem (I0.134) FIGURE
(b) Problem (10.135)
I ReA
(c) Problem (10.136)
10.11: Spectra of auxiliary problems for the upwind scheme (10.133).
384
A Theorerical lnlroduction to Numerical Analysis
As far as the second aux.iliary problem (10.135), we need to look for its nontrivial + +00, see formula (10.126). Such a solution, cA.. pq'n, obviously exislS for any q: Iql < I . The corresponding eigenvalues It == It (q) I  r + rq fill the interior of the disk bounded by the circle ). I  r + reif]. on the complex; plane, see Figure 1O.lI(b). Solutions of the third auxiliary problem (10.136) chat would satisfy (10.128). i.e., that would decay as m + _00, must obviously have the fonn: uf" cAPq"', where )q) > I and the relation between .t and q is, again. given by formula (10.138). The homogeneous boundary condition u:t 0 of (10.136) implies that a nontrivial eigenfunction UIII cq'" may only exist when A A(q) == 0, i.e., when q == (r I )/r. The quantity q given by this expression may have its absolute value greater than one
solutions [hat would decrease as m
Ullt
=
=
=
=
=
=
=
if either of the two inequalities holds:
0'
,I
 < 1 . ,
r< < 1/2. problem (10.136) has the eigenvalue A == O. see
The first inequality has no solutions. The solution to the second inequality is
1/2.
Consequenl'Y. when r
10.11 (c). 10.12, we are schematically showing the combined sets of all eigenval ues, i.e., combined spectra, for problems (1O.134), (10.135), and (1O.136) for the three different cases: r < 1/2, 1/2 < r < 1, and r > I . Figure
In Figure
ImJ.. i
IA DiG, ,,· (a) r < lj2 FIGURE
Im''
Im J.. j
0·,


,
I
0
I ReA
:..
(b) lj2 < r < J
(c) r > 1
10.12: Combined spectra of auxiliary problems for scheme (10.133).
It is clear that the combined eigenvalues of all three auxiliary problems may only belong tothe unit disk IAI
It is clear that the combined eigenvalues of all three auxiliary problems may only belong to the unit disk |λ| ≤ 1 on the complex plane if r ≤ 1. Therefore, the condition r ≤ 1 is necessary for stability of the difference initial boundary value problem (10.133). Compared to the von Neumann stability condition of Section 10.3, the key distinction of the Babenko-Gelfand criterion is that it takes into account the boundary conditions for unsteady finite-difference equations on finite intervals. This criterion can also be generalized to systems of such equations. In this case, a scheme that may look perfectly natural and "benign" at first glance, and that may, in particular,
satisfy the von Neumann stability criterion, could still be unstable because of a poor approximation of the boundary conditions. Consequently, it is important to be able to build schemes that are free of this shortcoming. In [GR87], the spectral criterion of Babenko and Gelfand is discussed from a more general standpoint, using a special new concept of the spectrum of a family of operators introduced by Godunov and Ryaben'kii. In this framework, one can rigorously prove that the Babenko-Gelfand criterion is necessary for stability, and also that when it holds, stability cannot be disrupted too severely. In the next section, we reproduce key elements of this analysis, while referring the reader to [GR87, Chapters 13 & 14] and [RM67, § 6.6 & 6.7] for further detail.

10.5.2 Spectra of the Families of Operators. The Godunov-Ryaben'kii Criterion
In this section, we briefly describe a rigorous approach, due to Godunov and Ryaben'kii, for studying stability of evolution-type finite-difference schemes on finite intervals. In other words, we study stability of the discrete approximations to initial boundary value problems for hyperbolic and parabolic partial differential equations. This material is more advanced, and can be skipped during the first reading. As we have seen previously, for evolution finite-difference schemes the discrete solution u^(h) = {u_m^p}, which is defined on a two-dimensional space-time grid:
(x_m, t_p) = (mh, pτ),  m = 0, 1, ..., M,  p = 0, 1, ..., [T/τ],
gets naturally split, or "stratified," into a collection of one-dimensional grid functions {u^p} defined on the individual time layers t_p, p = 0, 1, ..., [T/τ]. For example, the first order upwind scheme:

(u_m^{p+1} − u_m^p)/τ − (u_{m+1}^p − u_m^p)/h = φ_m^p,
m = 0, 1, 2, ..., M − 1,  p = 0, 1, 2, ..., [T/τ] − 1,      (10.139)
u_m^0 = ψ_m,  m = 0, 1, ..., M,  u_M^{p+1} = χ^{p+1},
for the initial boundary value problem:

∂u/∂t − ∂u/∂x = φ(x,t),  0 ≤ x ≤ 1,  0 < t ≤ T,
u(x,0) = ψ(x),  u(1,t) = χ(t),

can be written as:

u_m^{p+1} = (1 − r) u_m^p + r u_{m+1}^p + τφ_m^p,  m = 0, 1, ..., M − 1,
u_M^{p+1} = χ^{p+1},      (10.140)
u_m^0 = ψ_m,  m = 0, 1, ..., M,
where r = τ/h. Form (10.140) suggests that the marching procedure for scheme (10.139) can be interpreted as consecutive computation of the grid functions:

u^0, u^1, ..., u^p, ..., u^{[T/τ]},

defined on identical one-dimensional grids m = 0, 1, ..., M that can all be identified with one and the same grid. Accordingly, the functions u^p, p = 0, 1, ..., [T/τ], can be considered elements of the linear space U'_h of functions u = {u_0, u_1, ..., u_M} defined on the grid m = 0, 1, ..., M. We will equip this linear space with a norm, e.g.,

‖u‖_{U'_h} = max_{0≤m≤M} |u_m|  or  ‖u‖_{U'_h} = (h Σ_{m=0}^{M} |u_m|²)^{1/2}.

We also recall that in the definitions of stability (Section 10.1.3) and convergence (Section 10.1.1) we employ the norm ‖u^(h)‖_{U_h} of the finite-difference solution u^(h) on the entire two-dimensional grid. Hereafter, we will only be using norms that explicitly take into account the layered structure of the solution, namely, those that satisfy the equality:

‖u^(h)‖_{U_h} = max_{0≤p≤[T/τ]} ‖u^p‖_{U'_h}.

Having introduced the linear normed space U'_h, we can represent any evolution scheme, in particular scheme (10.139), in the canonical form:

u^{p+1} = R_h u^p + τρ^p,  u^0 is given.      (10.141)
In formula (10.141), R_h : U'_h → U'_h is the transition operator between the consecutive time levels, and ρ^p ∈ U'_h. If we denote v^{p+1} = R_h u^p, then formula (10.140) yields:

v_m^{p+1} = (1 − r) u_m^p + r u_{m+1}^p,  m = 0, 1, ..., M − 1.      (10.142a)

As far as the last component m = M of the vector v^{p+1} is concerned, a certain flexibility exists in the definition of the operator R_h for scheme (10.139). For example, we can set:

v_M^{p+1} = u_M^p,      (10.142b)

which would also imply:

ρ_m^p = φ_m^p,  m = 0, 1, ..., M − 1,      (10.142c)

in order to satisfy the first equality of (10.141). In general, the canonical form (10.141) for a given evolution scheme is not unique. For scheme (10.139), we could have chosen v_M^{p+1} = 0 instead of v_M^{p+1} = u_M^p in formula (10.142b), which would have also implied ρ_M^p = χ^{p+1}/τ in formula (10.142c). However, when building the operator R_h, we need to make sure that the following
rather natural conditions hold that require a certain correlation between the norms in the spaces U'_h and F_h:

‖ρ^p‖_{U'_h} ≤ K₁ ‖f^(h)‖_{F_h},  p = 0, 1, ..., [T/τ],
‖u^0‖_{U'_h} ≤ K₂ ‖f^(h)‖_{F_h}.      (10.143)

The constants K₁ and K₂ in inequalities (10.143) do not depend on h or on f^(h). For scheme (10.139), if we define the norm in the space F_h as:

‖f^(h)‖_{F_h} = max_{m,p} |φ_m^p| + max_m |ψ_m| + max_p |(χ^{p+1} − χ^p)/τ|,

and the norms of ρ^p and u^0 as ‖ρ^p‖_{U'_h} = max_m |ρ_m^p| and ‖u^0‖_{U'_h} = max_m |u_m^0|, respectively, then conditions (10.143) obviously hold for the operator R_h and the source term ρ^p defined by formulae (10.142a)-(10.142c).

Let us now take an arbitrary u^0 ∈ U'_h and obtain u^1, u^2, ..., u^{[T/τ]} ∈ U'_h using the recurrence formula u^{p+1} = R_h u^p. Denote u^(h) = {u^p}, p = 0, 1, ..., [T/τ], and evaluate f^(h) = L_h u^(h). Along with conditions (10.143), we will also require that

‖f^(h)‖_{F_h} ≤ K₃ ‖u^0‖_{U'_h},      (10.144)

where the constant K₃ does not depend on u^0 ∈ U'_h or on h. In practice, inequalities (10.143) and (10.144) prove relatively non-restrictive.⁷ These inequalities allow one to establish the following important theorem that provides a necessary and sufficient condition for stability in terms of the uniform boundedness of the powers of R_h with respect to the grid size h.

THEOREM 10.6
Assume that when reducing a given evolution scheme to the canonical form (10.141) the additional conditions (10.143) are satisfied. Then, for stability of the scheme in the linear sense (Definition 10.2) it is sufficient that
$$\|R_h^p\| \le K, \quad p = 0,1,\ldots,[T/\tau], \tag{10.145}$$
where the constant $K$ in formula (10.145) does not depend on $h$. If the third additional condition (10.144) is met as well, then estimates (10.145) are also necessary for stability.
Theorem 10.6 is proven in [GR87, §41]. For scheme (10.139), estimates (10.145) can be established directly, provided that $r \le 1$. Indeed, according to formula (10.142a), we have for $m = 0,1,\ldots,M-1$:
$$|v_m^{p+1}| = |(1-r)u_m^p + r\,u_{m+1}^p| \le (1-r+r)\max_m|u_m^p| = \|u^p\|_{U_h'},$$

$^7$The first condition of (10.143) can, in fact, be further relaxed, see [GR87, §42].
A Theoretical Introduction to Numerical Analysis
and according to formula (10.142b), we have for $m = M$:
$$|v_M^{p+1}| = |u_M^p| \le \max_m|u_m^p| = \|u^p\|_{U_h'}.$$
Consequently, $\|R_h u^p\|_{U_h'} \le \|u^p\|_{U_h'}$, which means that $\|R_h\| \le 1$. Therefore, $\|R_h^p\| \le \|R_h\|^p \le 1$, and according to Theorem 10.6, scheme (10.139) is stable.

REMARK 10.2 We have already seen previously that the notion of stability for a finite-difference scheme can be reformulated as boundedness of powers for a family of matrices. Namely, in Section 10.3.6 we discussed stability of finite-difference Cauchy problems for systems of equations with constant coefficients (as opposed to scalar equations). We saw that finite-difference stability (Definition 10.2) was equivalent to stability of the corresponding family of amplification matrices. The latter, in turn, is defined as boundedness of their powers, and the Kreiss matrix theorem (Theorem 10.4) provides necessary and sufficient conditions for this property to hold. Transition operators $R_h$ can also be interpreted as matrices that operate on vectors from the space $U_h'$. In this perspective, inequality (10.145) implies uniform boundedness of all powers, or stability of this family of operators (matrices). There is, however, a fundamental difference between the considerations of this section and those of Section 10.3.6. The amplification matrices that appear in the context of the Kreiss matrix theorem (Theorem 10.4) are parameterized by the frequency $\alpha$ and possibly the grid size $h$. Yet the dimension of all these matrices remains fixed and equal to the dimension of the original system, regardless of the grid size. In contradistinction to that, the dimension of the matrices $R_h$ is inversely proportional to the grid size $h$, i.e., it grows with no bound as $h \to 0$. Therefore, estimate (10.145) actually goes beyond the notion of stability for families of matrices of a fixed dimension (Section 10.3.6), as it implies stability (uniform bound on powers) for a family of matrices of increasing dimension. $\Box$
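The bound $\|R_h\| \le 1$ established above for $r \le 1$ can also be checked numerically. The following sketch is not from the book; it assumes Python with NumPy, assembles the matrix of $R_h$ from formulae (10.142a), (10.142b), and evaluates the induced maximum norm, which for a matrix equals the largest absolute row sum:

```python
import numpy as np

def transition_matrix(M, r):
    """Matrix of R_h per (10.142a), (10.142b): grid of M + 1 points, h = 1/M."""
    R = np.zeros((M + 1, M + 1))
    for m in range(M):            # rows 0..M-1: (1 - r) u_m + r u_{m+1}
        R[m, m] = 1.0 - r
        R[m, m + 1] = r
    R[M, M] = 1.0                 # boundary row: v_M^{p+1} = u_M^p
    return R

def max_norm(A):
    """Induced maximum (C) norm of a matrix: largest absolute row sum."""
    return np.abs(A).sum(axis=1).max()

# For 0 <= r <= 1 every absolute row sum equals (1 - r) + r = 1, so
# ||R_h|| = 1 and hence ||R_h^p|| <= 1 for every p, uniformly in h.
R = transition_matrix(50, 0.75)
print(max_norm(R))                                  # ~1.0
print(max_norm(np.linalg.matrix_power(R, 40)))      # still ~1.0
```

For $r > 1$ the same row sums exceed one and the powers grow, in agreement with the instability discussed below.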
As condition (10.145) is equivalent to stability according to Theorem 10.6, to investigate stability we need to see whether inequalities (10.145) hold. Let $\lambda_h$ be an eigenvalue of the operator $R_h$, and let $v^{(h)}$ be the corresponding eigenvector, so that $R_h v^{(h)} = \lambda_h v^{(h)}$. Then $\|R_h^p v^{(h)}\| = |\lambda_h|^p\|v^{(h)}\|$, and consequently $\|R_h^p\| \ge |\lambda_h|^p$. Since $\lambda_h$ is an arbitrary eigenvalue, we have:
$$\|R_h^p\| \ge \left[\max|\lambda_h|\right]^p, \quad p = 0,1,\ldots,[T/\tau],$$
where $[\max|\lambda_h|]$ is the largest modulus of the eigenvalues of $R_h$. Hence, for the estimate (10.145) to hold, it is necessary that all eigenvalues $\lambda$ of the transition operator $R_h$ belong to the following disk on the complex plane:
$$|\lambda| \le 1 + c_1\tau, \tag{10.146}$$
where the constant $c_1$ does not depend on the grid size $h$ (or $\tau$). It means that inequality (10.146) must hold with one and the same constant $c_1$ for any given transition operator from the family $\{R_h\}$ parameterized by $h$.

Inequality (10.146) is known as the spectral necessary condition for the uniform boundedness of the powers $\|R_h^p\|$. It is called spectral because as long as the operators $R_h$ can be identified with matrices of finite dimension, the eigenvalues of those matrices yield the spectra of the operators. This spectral condition is also closely related to the von Neumann spectral stability criterion for finite-difference Cauchy problems on infinite grids that we have studied in Section 10.3, see formula (10.81). Indeed, instead of the finite-difference initial boundary value problem (10.139), consider a Cauchy problem on the grid that is infinite in space:
$$\frac{u_m^{p+1}-u_m^p}{\tau} - \frac{u_{m+1}^p-u_m^p}{h} = \varphi_m^p, \qquad u_m^0 = \psi_m, \qquad m = 0,\pm1,\pm2,\ldots, \quad p = 0,1,\ldots,[T/\tau]-1. \tag{10.147}$$
The von Neumann analysis of Section 10.3.2 has shown that for stability it is necessary that $r = \tau/h \le 1$. To apply the spectral criterion (10.146), we first reduce scheme (10.147) to the canonical form (10.141). The operator $R_h : U_h' \to U_h'$, $R_h u^p = v^{p+1}$, and the source term $\rho^p$ are then given by [cf. formulae (10.142)]:
$$v_m^{p+1} = (1-r)u_m^p + r\,u_{m+1}^p, \qquad \rho_m^p = \varphi_m^p, \qquad m = 0,\pm1,\pm2,\ldots.$$
The space $U_h'$ contains infinite sequences $u = \{\ldots,u_{-m},\ldots,u_{-1},u_0,u_1,\ldots,u_m,\ldots\}$. We can supplement this space with the $C$ norm: $\|u\| = \sup_m|u_m|$. The grid functions $u = \{u_m\} = \{e^{i\alpha m}\}$ then belong to the space $U_h'$ for all $\alpha \in [0,2\pi)$ and provide
eigenfunctions of the transition operator:
$$R_h u = \left\{(1-r)e^{i\alpha m} + re^{i\alpha(m+1)}\right\} = \left[(1-r)+re^{i\alpha}\right]e^{i\alpha m} = \lambda(\alpha)u,$$
where the eigenvalues are given by:
$$\lambda(\alpha) = (1-r)+re^{i\alpha}. \tag{10.148}$$
According to the spectral condition of stability (10.146), all eigenvalues must satisfy the inequality $|\lambda(\alpha)| \le 1+c_1\tau$, which is the same as the von Neumann condition (10.78). As the eigenvalues (10.148) do not explicitly depend on the grid size, the spectral condition (10.146) reduces here to $|\lambda(\alpha)| \le 1$, cf. formula (10.79).
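The symbol (10.148) is easy to probe numerically. The sketch below (assumed Python/NumPy, not part of the book; the values of $r$ are illustrative) samples $\lambda(\alpha)$ on a frequency grid; for this scheme one can check by hand that $\max_\alpha|\lambda(\alpha)| = \max(1, |1-2r|)$, so the condition $|\lambda(\alpha)| \le 1$ holds precisely when $r \le 1$:

```python
import numpy as np

def symbol_max(r, n=4096):
    """Max modulus of the symbol lambda(alpha) = (1 - r) + r e^{i alpha}."""
    alpha = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.abs((1.0 - r) + r * np.exp(1j * alpha)).max()

print(symbol_max(0.8))   # ~1.0 : von Neumann condition satisfied
print(symbol_max(1.5))   # ~2.0 : |1 - 2r| = 2, condition violated
```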
Let us also recall that, as shown in Section 10.3.5, the von Neumann condition is not only necessary, but also sufficient for the $l_2$ stability of the two-layer (one-step) scalar finite-difference Cauchy problems, see formula (10.95). If, however, the space $U_h'$ is equipped with the $l_2$ norm: $\|u\| = \left[h\sum_{m=-\infty}^{\infty}|u_m|^2\right]^{1/2}$ (as opposed to the $C$ norm), then the functions $\{e^{i\alpha m}\}$ no longer belong to this space, and therefore may no longer be the eigenfunctions of $R_h$. Nonetheless, we can show that the points $\lambda(\alpha)$ of (10.148) still belong to the spectrum of the operator $R_h$, provided that the latter is defined as traditionally done in functional analysis, see Definition 10.7 on page 393.$^8$ Consequently, if we interpret $\lambda$ in formula (10.146) as all points of the spectrum rather than just the eigenvalues of the operator $R_h$, then the spectral condition (10.146) also becomes sufficient for the $l_2$ stability of the Cauchy problems (10.95) on an infinite grid $m = 0,\pm1,\pm2,\ldots$.

Returning now to the difference equations on finite intervals and grids (as opposed to Cauchy problems), we first notice that one can most easily verify estimates (10.145) when the matrices of all operators $R_h$ happen to be normal: $R_hR_h^* = R_h^*R_h$. Indeed, in this case there is an orthonormal basis in the space $U_h'$ composed of the eigenvectors of the matrix $R_h$, see, e.g., [HJ85, Chapter 2]. Using an expansion with respect to this basis, one can show that the spectral condition (10.146) is necessary and sufficient for the $l_2$ stability of an evolution scheme with normal operators $R_h$ on a finite interval. More precisely, the following theorem holds.

THEOREM 10.7
Let the operators $R_h$ in the canonical form (10.141) be normal, and let them all be uniformly bounded with respect to the grid: $\|R_h\| \le c_2$, where $c_2$ does not depend on $h$. Let also all norms be chosen in the sense of $l_2$. Then, for the estimates (10.145) to hold, it is necessary and sufficient that the following inequalities be satisfied:
$$\max_n|\lambda_n| \le 1 + c_1\tau, \quad c_1 = \mathrm{const}, \tag{10.149}$$
where $\lambda_1, \lambda_2, \ldots, \lambda_N$ are the eigenvalues of the matrix $R_h$ and the constant $c_1$ in formula (10.149) does not depend on $h$.
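The mechanism behind Theorem 10.7 is that for a normal matrix the 2-norm of every power is governed exactly by the eigenvalues: $\|R^p\|_2 = (\max_n|\lambda_n|)^p$. A minimal sketch of this fact (assumed Python/NumPy, not from the book; a periodic, circulant variant of the transition operator is used here purely because circulant matrices are conveniently normal):

```python
import numpy as np

def circulant_R(N, r):
    """Periodic counterpart of the transition operator: circulant, hence normal."""
    R = np.zeros((N, N))
    for m in range(N):
        R[m, m] = 1.0 - r
        R[m, (m + 1) % N] = r
    return R

N, r, p = 16, 0.6, 7
R = circulant_R(N, r)
print(np.allclose(R @ R.T, R.T @ R))          # True: R commutes with its adjoint
rho = np.abs(np.linalg.eigvals(R)).max()      # spectral radius
norm_p = np.linalg.norm(np.linalg.matrix_power(R, p), 2)
print(abs(norm_p - rho ** p))                 # ~0: equality holds for normal R
```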
One implication of Theorem 10.7, the necessity, coincides with the previous necessary spectral condition for stability that we have justified on page 389. The other implication, the sufficiency, is to be proven in Exercise 5 of this section. A full proof of Theorem 10.7 can be found, e.g., in [GR87, §43].

$^8$In general, the points $\lambda = \lambda(\alpha,h)$ given by Definition 10.3 on page 351 will be a part of the spectrum in the sense of its classical definition, see Definition 10.7 on page 393.

Unfortunately, in many practical situations the operators (matrices) $R_h$ in the canonical form (10.141) are not normal. Then, the spectral condition (10.146) still remains necessary for stability. Moreover, we have just seen that in the special case of two-layer scalar constant-coefficient Cauchy problems it is also sufficient for stability, and that sufficiency takes place regardless of whether or not $R_h$ has a full system of orthonormal eigenfunctions. However, for general finite-difference problems on finite intervals the spectral condition (10.146) becomes pretty far detached from sufficiency and provides no adequate criterion for uniform boundedness of $\|R_h^p\|$. For instance, the matrix of the transition operator $R_h$ defined by formulae (10.142a) and (10.142b) is given by:
$$R_h = \begin{pmatrix}
1-r & r & 0 & \cdots & 0 & 0\\
0 & 1-r & r & \cdots & 0 & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & 0 & 0 & \cdots & 1-r & r\\
0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix} \tag{10.150}$$
Its spectrum consists of the eigenvalues $\lambda = 1$ and $\lambda = 1-r$ and, as such, does not depend on $h$ (or on $\tau$). Consequently, for any $h > 0$ the spectrum of the operator $R_h$ consists of only these two numbers: $\lambda = 1$ and $\lambda = 1-r$. This spectrum belongs to the unit disk $|\lambda| \le 1$ when $0 \le r \le 2$. However, for $1 < r \le 2$, scheme (10.139) violates the Courant, Friedrichs, and Lewy condition necessary for stability, and hence, there may be no stability $\|R_h^p\| \le K$ for any reasonable choice of norms.

Thus, we have seen that the spectral condition (10.146) that employs the eigenvalues of the operators $R_h$ and that is necessary for the uniform boundedness $\|R_h^p\| \le K$ appears too rough in the case of non-normal matrices. For example, it fails to detect the instability of scheme (10.139) for $1 < r \le 2$. To refine the spectral condition we will introduce a new concept. Assume, as before, that the operator $R_h$ is defined on a normed linear space $U_h'$. We will denote by $\{R_h\}$ the entire family of operators $R_h$ for all legitimate values of the parameter $h$ that characterizes the grid.$^9$
DEFINITION 10.6 A complex number $\lambda$ is said to belong to the spectrum of the family of operators $\{R_h\}$ if for any $h_0 > 0$ and $\varepsilon > 0$ one can always find such a value of $h$, $h < h_0$, that the inequality
$$\|R_h u - \lambda u\|_{U_h'} < \varepsilon\|u\|_{U_h'}$$
will have a solution $u \in U_h'$. The set of all such $\lambda$ will be called the spectrum of the family of operators $\{R_h\}$.
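The content of this definition is best appreciated by contrasting it with plain eigenvalue computations. In the numerical sketch below (assumed Python/NumPy; the parameters are illustrative, not from the book) the eigenvalues of the matrix (10.150) remain $\{1,\ 1-r\}$ on every grid, yet for $r > 1$ the powers $\|R_h^p\|$ grow without bound as $h \to 0$ — growth that only the family spectrum detects:

```python
import numpy as np

def R_matrix(M, r):
    """Transition matrix (10.150) on a grid with M + 1 points, h = 1/M."""
    R = np.diag(np.full(M + 1, 1.0 - r)) + np.diag(np.full(M, r), 1)
    R[M, M] = 1.0
    return R

def transient_growth(M, r, pmax):
    """max over 1 <= p <= pmax of the infinity norm of R_h^p."""
    R, P, worst = R_matrix(M, r), np.eye(M + 1), 0.0
    for _ in range(pmax):
        P = P @ R
        worst = max(worst, np.abs(P).sum(axis=1).max())
    return worst

r = 1.5
for M in (10, 20, 40):
    lam = np.linalg.eigvals(R_matrix(M, r))
    # eigenvalue moduli stay at {0.5, 1} for every M, yet the powers blow up
    print(np.abs(lam).max(), transient_growth(M, r, 3 * M))
```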
The following theorem employs the concept of the spectrum of a family of operators from Definition 10.6 and provides a key necessary condition for stability.

THEOREM 10.8 (Godunov-Ryaben'kii)
If even one point $\lambda_0$ of the spectrum of the family of operators $\{R_h\}$ lies outside the unit disk on the complex plane, i.e., $|\lambda_0| > 1$, then there is no

$^9$By the very nature of finite-difference schemes, $h$ may assume arbitrarily small positive values.
common constant $K$ such that the inequality
$$\|R_h^p\| \le K$$
will hold for all $h > 0$ and all integer values of $p$ from 0 till some $p = p_0(h)$, where $p_0(h) \to \infty$ as $h \to 0$.

PROOF Let us first assume that no such numbers $h_0 > 0$ and $c > 0$ exist that for all $h < h_0$ the following estimate holds:
$$\|R_h\| \le c. \tag{10.151}$$
This assumption means that there is no uniform bound on the norms of the operators $R_h$ themselves; as such, there may be no bound on the powers $R_h^p$ either, and the conclusion of the theorem holds trivially. Consequently, we only need to consider the case when there are such $h_0 > 0$ and $c > 0$ that for all $h < h_0$ inequality (10.151) is satisfied.

Let $|\lambda_0| = 1 + \delta$, $\delta > 0$, where $\lambda_0$ is the point of the spectrum for which $|\lambda_0| > 1$. Take an arbitrary $K > 0$ and choose $p$ and $\varepsilon$ so that:
$$(1+\delta)^p > 2K, \qquad 1 - (1 + c + c^2 + \cdots + c^{p-1})\varepsilon > \tfrac{1}{2}.$$
According to Definition 10.6, one can find arbitrarily small $h$, for which there is a vector $u \in U_h'$ that solves the inequality:
$$\|R_h u - \lambda_0 u\| < \varepsilon\|u\|.$$
Let $u$ be the solution, and denote:
$$z = R_h u - \lambda_0 u.$$
It is clear that $\|z\| < \varepsilon\|u\|$. Moreover, it is easy to see that
$$R_h^p u = \lambda_0^p u + \left(\lambda_0^{p-1}z + \lambda_0^{p-2}R_h z + \cdots + R_h^{p-1}z\right).$$
As $|\lambda_0| > 1$, we have:
$$\|\lambda_0^{p-1}z + \lambda_0^{p-2}R_h z + \cdots + R_h^{p-1}z\| < |\lambda_0|^p\left(1 + \|R_h\| + \|R_h^2\| + \cdots + \|R_h^{p-1}\|\right)\varepsilon\|u\| \le |\lambda_0|^p(1 + c + c^2 + \cdots + c^{p-1})\varepsilon\|u\|,$$
and consequently,
$$\|R_h^p u\| > |\lambda_0|^p\left[1 - \varepsilon(1 + c + c^2 + \cdots + c^{p-1})\right]\|u\| > (1+\delta)^p\,\tfrac{1}{2}\|u\| > 2K\cdot\tfrac{1}{2}\|u\| = K\|u\|.$$
In doing so, the value of $h$ can always be taken sufficiently small so as to ensure $p < p_0(h)$.
Since the value of $K$ has been chosen arbitrarily, we have essentially proven that for the estimate $\|R_h^p\| \le K$ to hold, it is necessary that all points of the spectrum of the family $\{R_h\}$ belong to the unit disk $|\lambda| \le 1$ on the complex plane. $\Box$

Next recall that the following definition of the spectrum of an operator $R_h$ (for a fixed $h$) is given in functional analysis.

DEFINITION 10.7 A complex number $\lambda$ is said to belong to the spectrum of the operator $R_h : U_h' \to U_h'$ if for any $\varepsilon > 0$ the inequality
$$\|R_h u - \lambda u\|_{U_h'} < \varepsilon\|u\|_{U_h'}$$
has a solution $u \in U_h'$. The set of all such $\lambda$ is called the spectrum of the operator $R_h$.
At first glance, comparison of Definitions 10.6 and 10.7 may lead one to thinking that the spectrum of the family of operators $\{R_h\}$ consists of all those and only those points on the complex plane that are obtained by passing to the limit $h \to 0$ from the points of the spectrum of $R_h$, when $h$ approaches zero along all possible subsequences. However, this assumption is, generally speaking, not correct. Consider, for example, the operator $R_h : U_h' \to U_h'$ defined by formulae (10.142a) and (10.142b). It is described by the matrix (10.150) and operates in the $(M+1)$-dimensional linear space $U_h'$, where $M = 1/h$. The spectrum of a matrix consists of its eigenvalues, and the eigenvalues of the matrix (10.150) are $\lambda = 1$ and $\lambda = 1-r$. These eigenvalues do not depend on $h$ (or on $\tau$) and consequently, the spectrum of the operator $R_h$ consists of only two points, $\lambda = 1$ and $\lambda = 1-r$, for any $h > 0$. As, however, we are going to see (pages 397-402), the spectrum of the family of operators $\{R_h\}$ contains not only these two points, but also all points of the disk $|\lambda - 1 + r| \le r$ of radius $r$ centered at the point $(1-r, 0)$ on the complex plane, see Figure 10.12 on page 384. When $r \le 1$, the spectrum of the family of operators $\{R_h\}$ belongs to the unit disk $|\lambda| \le 1$, see Figures 10.12(a) and 10.12(b). However, when $r > 1$, this necessary spectral condition of stability does not hold, see Figure 10.12(c), and the inequality $\|R_h^p\| \le K$ cannot be satisfied uniformly with respect to $h$.

Before we accurately compute the spectrum of the family of operators $\{R_h\}$ given by formula (10.150), let us qualitatively analyze the behavior of the powers $\|R_h^p\|$ for $r > 1$ and also show that the necessary stability criterion given by Theorem 10.8 is, in fact, rather close to sufficient. We first notice that for any $h > 0$ there is only one eigenvalue of the matrix $R_h$ that has unit modulus: $\lambda = 1$, and that the similarity transformation $S_h^{-1}R_hS_h$, where
=
=
1 0 0 ·· · 0 1 0 1 0 ··· 0 1
1 0 0 ··· 0 1 0 1 0 ··· 0 1 and 0 0 0 ··· 1 1 0 0 0 ··· 0 1
=
SI;'
=
0 0 0 ··· 1 1 0 0 0 ··· 0 1
reduces this matrix to the block-diagonal form:
$$S_h^{-1}R_hS_h = \begin{pmatrix}
1-r & r & 0 & \cdots & 0 & 0\\
0 & 1-r & r & \cdots & 0 & 0\\
\vdots & & \ddots & \ddots & & \vdots\\
0 & 0 & 0 & \cdots & 1-r & 0\\
0 & 0 & 0 & \cdots & 0 & 1
\end{pmatrix} = H_h.$$
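This similarity transformation can be confirmed numerically. In the sketch below (assumed Python/NumPy; the grid size and $r$ are illustrative), $S_h$ is the identity with its last column filled by ones (the eigenvector of $\lambda = 1$), and $S_h^{-1}$ is the identity with $-1$ entries above the last diagonal element; the product $S_h^{-1}R_hS_h$ then decouples the last row and column exactly:

```python
import numpy as np

M, r = 8, 1.5
R = np.diag(np.full(M + 1, 1.0 - r)) + np.diag(np.full(M, r), 1)
R[M, M] = 1.0                          # the matrix (10.150)

S = np.eye(M + 1);    S[:M, M] = 1.0   # columns e_0..e_{M-1} and (1,...,1)^T
Sinv = np.eye(M + 1); Sinv[:M, M] = -1.0

H = Sinv @ R @ S
print(np.allclose(S @ Sinv, np.eye(M + 1)))                     # True
print(np.allclose(H[:M, M], 0.0), np.allclose(H[M, :M], 0.0))   # decoupled
print(H[M, M])                         # the isolated eigenvalue: 1.0
```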
When $1 < r < 2$, we have $|1-r| < 1$ for the magnitude of the diagonal entry $1-r$. Then, it is possible to prove that $H_h^p \to \mathrm{diag}\{0,0,\ldots,0,1\}$ as $p \to \infty$ (see Theorem 6.2 on page 178). In other words, the limiting matrix of $H_h^p$ for the powers $p$ approaching infinity has only one nonzero entry, equal to one, at the lower right corner. Consequently,
$$R_h^p = S_hH_h^pS_h^{-1} \;\to\; \begin{pmatrix}
0 & 0 & \cdots & 0 & 1\\
0 & 0 & \cdots & 0 & 1\\
\vdots & & & & \vdots\\
0 & 0 & \cdots & 0 & 1\\
0 & 0 & \cdots & 0 & 1
\end{pmatrix} \quad\text{as } p \to \infty,$$
and as such, $\lim_{p\to\infty}\|R_h^p\| = 1$. We can therefore see that regardless of the value of $h$, the norms of the powers of the transition operator $R_h$ approach one and the same finite limit. In other words, we can write $\lim_{p\tau\to\infty}\|R_h^p\| = 1$, and this "benign" asymptotic behavior of $\|R_h^p\|$ for large $p\tau$ is indeed determined by the eigenvalues $\lambda = 1-r$ and $\lambda = 1$ that belong to the unit disk.

The fact that the spectrum of the family of operators $\{R_h\}$ does not belong to the unit disk for $r > 1$ manifests itself in the behavior of $\|R_h^p\|$ for $h \to 0$ and for moderate (not so large) values of $p\tau$. The maximum value of $\|R_h^p\|$ on the interval $0 < p\tau < T$, where $T$ is an arbitrary positive constant, will rapidly grow as $h$ decreases, see Figure 10.13. This is precisely what leads to the instability, whereas the behavior of $\|R_h^p\|$ as $p\tau \to \infty$, which is related to the spectrum of each individual operator $R_h$, is not important from the standpoint of stability.

FIGURE 10.13: Schematic behavior of the powers $\|R_h^p\|$ for $1 < r < 2$ and $h_3 > h_2 > h_1$.
Let us also emphasize that even though from a technical point of view Theorem 10.8 only provides a necessary condition for stability, this condition is, in fact, not so distant from sufficient. More precisely, the following theorem holds.

THEOREM 10.9
Let the operators $R_h$ be defined on a linear normed space $U_h'$ for each $h > 0$, and assume that they are uniformly bounded with respect to $h$:
$$\|R_h\| \le c. \tag{10.152}$$
Let also the spectrum of the family of operators $\{R_h\}$ belong to the unit disk on the complex plane: $|\lambda| \le 1$. Then for any $\eta > 0$, the norms of the powers of the operators $R_h$ satisfy the estimate:
$$\|R_h^p\| \le A(1+\eta)^p, \tag{10.153}$$
where $A = A(\eta)$ may depend on $\eta$, but does not depend on the grid size $h$.

Theorem 10.9 means that having the spectrum of the family of operators $\{R_h\}$ lie inside the unit disk is not only necessary for stability, but it also guarantees us from a catastrophic instability. Indeed, if the conditions of Theorem 10.9 hold, then the quantity $\max_{1\le p\le[T/\tau]}\|R_h^p\|$ either remains bounded as $h \to 0$ or increases, but slower than any exponential function, i.e., slower than any $(1+\eta)^{[T/\tau]}$, where $\eta > 0$ may be arbitrarily small.

PROOF Let us first show that if the spectrum of the family of operators $\{R_h\}$ belongs to the disk $|\lambda| \le \rho$, then for any given $\lambda$ that satisfies the inequality $|\lambda| \ge \rho + \eta$, $\eta > 0$, there are numbers $A = A(\eta)$ and $h_0 > 0$ such that $\forall h < h_0$ and $\forall u \in U_h'$, $u \ne 0$, the following estimate holds:
$$\|R_h u - \lambda u\|_{U_h'} > \frac{\rho+\eta}{A(\eta)}\|u\|_{U_h'}. \tag{10.154}$$
Assume the opposite. Then there exist: $\eta > 0$; a sequence of real numbers $h_k > 0$, $h_k \to 0$ as $k \to \infty$; a sequence of complex numbers $\lambda_k$, $|\lambda_k| > \rho + \eta$; and a sequence of vectors $u_{h_k} \in U_{h_k}'$ such that:
$$\|R_{h_k}u_{h_k} - \lambda_k u_{h_k}\|_{U_{h_k}'} < \varepsilon_k\|u_{h_k}\|_{U_{h_k}'}, \qquad \varepsilon_k \to 0 \text{ as } k \to \infty. \tag{10.155}$$
For sufficiently large values of $k$, for which $\varepsilon_k < 1$, the numbers $\lambda_k$ will not lie outside the disk $|\lambda| \le c + 1$, by virtue of estimate (10.152), because outside this disk we have:
$$\|R_{h_k}u_{h_k} - \lambda_k u_{h_k}\| \ge (|\lambda_k| - \|R_{h_k}\|)\|u_{h_k}\| > \|u_{h_k}\|,$$
which would contradict (10.155).
Therefore, the sequence of complex numbers $\lambda_k$ is bounded and, as such, has a limit point $\bar\lambda$, $|\bar\lambda| \ge \rho + \eta$. Using the triangle inequality, we can write:
$$\|R_{h_k}u_{h_k} - \lambda_k u_{h_k}\|_{U_{h_k}'} \ge \|R_{h_k}u_{h_k} - \bar\lambda u_{h_k}\|_{U_{h_k}'} - |\lambda_k - \bar\lambda|\,\|u_{h_k}\|_{U_{h_k}'}.$$
Substituting into inequality (10.155), we obtain:
$$\|R_{h_k}u_{h_k} - \bar\lambda u_{h_k}\|_{U_{h_k}'} < \left(\varepsilon_k + |\lambda_k - \bar\lambda|\right)\|u_{h_k}\|_{U_{h_k}'},$$
where the factor $\varepsilon_k + |\lambda_k - \bar\lambda|$ can be made arbitrarily small along a subsequence. Therefore, according to Definition 10.6, the point $\bar\lambda$ belongs to the spectrum of the family of operators $\{R_h\}$. This contradicts the previous assumption that the spectrum belongs to the disk $|\lambda| \le \rho$.

Now let $R$ be a linear operator on a finite-dimensional normed space $U$, $R : U \to U$. Assume that for any complex $\lambda$, $|\lambda| \ge \gamma > 0$, and any $u \in U$ the following inequality holds for some $a = \mathrm{const} > 0$:
$$\|Ru - \lambda u\| \ge a\|u\|. \tag{10.156}$$
Then,
$$\|R^p\| \le \frac{\gamma^{p+1}}{a}, \qquad p = 1, 2, \ldots. \tag{10.157}$$
Inequality (10.157) follows from the relation:
$$R^p = -\frac{1}{2\pi i}\oint_{|\lambda| = \gamma} \lambda^p (R - \lambda I)^{-1}\,d\lambda, \tag{10.158}$$
combined with estimate (10.156), because the latter implies that $\|(R - \lambda I)^{-1}\| \le \frac{1}{a}$. To prove estimate (10.153), we set $R = R_h$ and $\rho = 1$, so that $|\lambda| \ge 1 + \eta = \gamma$, and use (10.154) instead of (10.156). Then estimate (10.157) coincides with (10.153).

It only remains to justify equality (10.158). For that purpose, we will use an argument similar to that used when proving Theorem 6.2. Define
$$u^{p+1} = Ru^p \quad\text{and}\quad w(\lambda) = \sum_{p=0}^{\infty}\frac{u^p}{\lambda^p},$$
where the series converges uniformly at least for all $\lambda \in \mathbb{C}$, $|\lambda| > c$, see formula (10.152). Multiply the equality $u^{p+1} = Ru^p$ by $\lambda^{-p}$ and take the sum with respect to $p$ from $p = 0$ to $p = \infty$. This yields:
$$\lambda w(\lambda) - \lambda u^0 = Rw(\lambda),$$
or alternatively,
$$(R - \lambda I)w(\lambda) = -\lambda u^0.$$
From the definition of $w(\lambda)$ it is easy to see that $u^p$ is the residue of the vector-function $\lambda^{p-1}w(\lambda)$ at infinity:
$$u^p = \frac{1}{2\pi i}\oint_{|\lambda| = \gamma}\lambda^{p-1}w(\lambda)\,d\lambda.$$
As $u^p = R^pu^0$ and $\lambda^{p-1}w(\lambda) = -\lambda^p(R - \lambda I)^{-1}u^0$, the last equality is equivalent to (10.158). $\Box$
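The resolvent bound (10.157) can be probed numerically. In the sketch below (assumed Python/NumPy; the grid size, $r$, and $\gamma$ are illustrative choices, not from the book), the constant $a$ is approximated as the smallest singular value of $R - \lambda I$ sampled over the circle $|\lambda| = \gamma$, after which the estimate $\|R^p\|_2 \le \gamma^{p+1}/a$ is checked directly:

```python
import numpy as np

M, r = 8, 0.9
R = np.diag(np.full(M + 1, 1.0 - r)) + np.diag(np.full(M, r), 1)
R[M, M] = 1.0                 # the matrix (10.150); eigenvalues 1 and 1 - r

gamma = 1.3                   # a circle strictly outside the spectrum
ts = np.linspace(0.0, 2.0 * np.pi, 800, endpoint=False)
# a ~ min over |lambda| = gamma of sigma_min(R - lambda I)
a = min(np.linalg.svd(R - gamma * np.exp(1j * t) * np.eye(M + 1),
                      compute_uv=False).min() for t in ts)

for p in (1, 3, 6, 10):
    lhs = np.linalg.norm(np.linalg.matrix_power(R, p), 2)
    print(lhs <= gamma ** (p + 1) / a)    # estimate (10.157) holds
```

The bound is typically quite loose for stable operators, which is visible in how far the two sides are apart.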
Altogether, we have seen that the question of stability for evolution finite-difference schemes on finite intervals reduces to studying the spectra of the families of the corresponding transition operators $\{R_h\}$. More precisely, we need to find out whether the spectrum for a given family of operators $\{R_h\}$ belongs to the unit disk $|\lambda| \le 1$. If it does, then the scheme is either stable or, in the worst case scenario, it may only develop a mild instability.
Let us now show how we can actually calculate the spectrum of a family of operators. To demonstrate the approach, we will exploit the previously introduced example (10.142a), (10.142b). It turns out that the algorithm for computing the spectrum of the family of operators $\{R_h\}$ coincides with the Babenko-Gelfand procedure described in Section 10.5.1. Namely, we need to introduce three auxiliary operators: $\overline R$, $\overrightarrow R$, and $\overleftarrow R$.

The operator $\overline R$, $v = \overline R u$, is defined on the linear space of bounded grid functions $u = \{\ldots, u_{-1}, u_0, u_1, \ldots\}$ according to the formula:
$$v_m = (1-r)u_m + ru_{m+1}, \quad m = 0,\pm1,\pm2,\ldots, \tag{10.159}$$
which is obtained from (10.142a), (10.142b) by removing both boundaries. The operator $\overrightarrow R$ is defined on the linear space of functions $u = \{u_0, u_1, \ldots, u_m, \ldots\}$ that vanish at infinity: $|u_m| \to 0$ as $m \to +\infty$. It is given by the formula:
$$v_m = (1-r)u_m + ru_{m+1}, \quad m = 0,1,2,\ldots, \tag{10.160}$$
which is obtained from (10.142a), (10.142b) by removing the right boundary. Finally, the operator $\overleftarrow R$ is defined on the linear space of functions $\{\ldots, u_m, \ldots, u_0, \ldots, u_{M-1}, u_M\}$ that satisfy $|u_m| \to 0$ as $m \to -\infty$. It is given by the formula:
$$v_m = (1-r)u_m + ru_{m+1}, \quad m = \ldots,-1,0,1,\ldots,M-1, \qquad v_M = u_M, \tag{10.161}$$
which is obtained from (10.142a), (10.142b) by removing the left boundary. Note that the bounded eigenfunctions of the operator $\overline R$ are $u_m = e^{i\alpha m}$, $0 \le \alpha < 2\pi$, with the eigenvalues $\lambda(\alpha) = (1-r) + re^{i\alpha}$, cf. formula (10.148).
The curve $\lambda = \lambda(\alpha)$ is a circle of radius $r$ on the complex plane centered at the point $(1-r, 0)$, see Figure 10.11(a). We will denote this circle by $\overline\Lambda$.

The eigenvalues of the operator $\overrightarrow R$ are all those and only those complex numbers $\lambda$ for which the equation $\overrightarrow R u = \lambda u$ has a solution $u = \{u_0, u_1, \ldots, u_m, \ldots\}$ that satisfies $\lim_{m\to+\infty}|u_m| = 0$. Recasting this equation with the help of formula (10.160), we have:
$$(1-r-\lambda)u_m + ru_{m+1} = 0, \quad m = 0,1,2,\ldots.$$
The solution $u_m = cq^m$ may only decay as $m \to +\infty$ if $|q| < 1$. The corresponding eigenvalues $\lambda = 1-r+rq$ completely fill the interior of the disk of radius $r$ centered at the point $(1-r, 0)$, see Figure 10.11(b). We will denote this set by $\overrightarrow\Lambda$.

The eigenvalues of the operator $\overleftarrow R$ are computed similarly. Using formula (10.161), we can write the equation $\overleftarrow R u = \lambda u$ as follows:
$$(1-r-\lambda)u_m + ru_{m+1} = 0, \quad m = \ldots,-1,0,1,\ldots,M-1, \qquad u_M = \lambda u_M.$$
Solutions of the first equation have the form $u_m = cq^m$ with $\lambda = 1-r+rq$, and the decay $|u_m| \to 0$ as $m \to -\infty$ requires $|q| > 1$. The second equation provides an additional constraint $(1-\lambda)q^M = 0$, so that $\lambda = 1$. However, for this particular $\lambda$ we also have $q = 1$, which implies no decay as $m \to -\infty$. We therefore conclude that the equation $\overleftarrow R u = \lambda u$ has no solutions $u = \{u_m\}$ that satisfy $\lim_{m\to-\infty}|u_m| = 0$, i.e., there are no eigenvalues: $\overleftarrow\Lambda = \varnothing$.

The combination of all eigenvalues $\Lambda = \overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda$ is the disk $|\lambda - (1-r)| \le r$ on the complex plane; it is centered at $(1-r, 0)$ and has radius $r$. We will now show that the spectrum of the family of operators $\{R_h\}$ coincides with the set $\Lambda$. This is equivalent to showing that every point $\lambda_0 \in \Lambda$ belongs to the spectrum of $\{R_h\}$, and that no point $\lambda_0 \notin \Lambda$ does. According to Definition 10.6, $\lambda_0$ belongs to the spectrum of $\{R_h\}$ if for any $\varepsilon > 0$ one can find arbitrarily small $h$ for which the inequality
$$\|R_h u - \lambda_0 u\|_{U_h'} < \varepsilon\|u\|_{U_h'} \tag{10.162}$$
has a solution $u \in U_h'$. As $\lambda_0 \in \Lambda$, then $\lambda_0 \in \overline\Lambda$ or $\lambda_0 \in \overrightarrow\Lambda$, because $\overleftarrow\Lambda = \varnothing$. Note that when $\varepsilon$ is small one may call the solution $u$ of inequality (10.162) "almost an eigenvector" of the operator $R_h$, since a solution to the equation $R_h u - \lambda_0 u = 0$ is its genuine eigenvector.

Let us first assume that $\lambda_0 \in \overline\Lambda$. To construct a solution $u$ of inequality (10.162), we recall that by definition of the set $\overline\Lambda$ there exists $q_0$, $|q_0| = 1$, such that $\lambda_0 = 1-r+rq_0$, and the equation
$$(1-r-\lambda_0)v_m + rv_{m+1} = 0, \quad m = 0,\pm1,\pm2,\ldots,$$
has a bounded solution $v_m = q_0^m$, $m = 0,\pm1,\pm2,\ldots$. We will consider this solution only for $m = 0,1,2,\ldots,M$, while keeping the same notation $v$. It turns out that the vector:
$$v = [v_0, v_1, v_2, \ldots, v_M] = [1, q_0, q_0^2, \ldots, q_0^M]$$
almost satisfies the operator equation $R_h v - \lambda_0 v = 0$ that we write as:
$$(1-r-\lambda_0)v_m + rv_{m+1} = 0, \quad m = 0,1,2,\ldots,M-1, \qquad (1-\lambda_0)v_M = 0.$$
The vector $v$ would have completely satisfied the previous equation, which is an even stronger constraint than inequality (10.162), if it did not violate the last relation $(1-\lambda_0)v_M = 0$.$^{10}$ This relation can be interpreted as a boundary condition for the difference equation:
$$(1-r-\lambda_0)u_m + ru_{m+1} = 0, \quad m = 0,1,2,\ldots,M-1.$$
The boundary condition is specified at $m = M$, i.e., at the right endpoint of the interval $0 \le x \le 1$. To satisfy this boundary condition, let us "correct" the vector $v = [1, q_0, q_0^2, \ldots, q_0^M]$ by multiplying each of its components $v_m$ by the respective factor $(M-m)h$. The resulting vector will be denoted $u = [u_0, u_1, \ldots, u_M]$, $u_m = (M-m)hq_0^m$. Obviously, the vector $u$ has unit norm:
$$\|u\|_{U_h'} = \max_m|u_m| = \max_m|(M-m)hq_0^m| = Mh = 1.$$
We will now show that this vector $u$ furnishes a desired solution to the inequality (10.162). Define the vector $w \stackrel{\mathrm{def}}{=} R_h u - \lambda_0 u$, $w = [w_0, w_1, \ldots, w_M]$. We need to estimate its norm. For the individual components of $w$, we have:
$$|w_m| = |(1-r-\lambda_0)(M-m)hq_0^m + r(M-m-1)hq_0^{m+1}| = |[(1-r-\lambda_0)+rq_0](M-m)hq_0^m - rhq_0^{m+1}| = |0\cdot(M-m)hq_0^m - rhq_0^{m+1}| = rh, \quad m = 0,1,\ldots,M-1,$$
$$|w_M| = |u_M - \lambda_0 u_M| = |0 - \lambda_0\cdot 0| = 0.$$
Consequently, $\|w\|_{U_h'} = rh$, and for $h < \varepsilon/r$ we obtain:
$$\|w\|_{U_h'} = \|R_h u - \lambda_0 u\|_{U_h'} < \varepsilon = \varepsilon\|u\|_{U_h'},$$
i.e., inequality (10.162) is satisfied. Thus, we have shown that if $\lambda_0 \in \overline\Lambda$, then this point also belongs to the spectrum of the family of operators $\{R_h\}$.

$^{10}$Relation $(1-\lambda_0)v_M = 0$ is violated unless $\lambda_0 = 1$, i.e., unless $q_0 = 1$.
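This "almost eigenvector" is easy to reproduce numerically. The sketch below (assumed Python/NumPy; $r$ and the frequency $\alpha$ are illustrative) builds $u_m = (M-m)hq_0^m$ and confirms that the residual of the eigenvalue equation equals $rh$, so it can be driven below any $\varepsilon$ by refining the grid:

```python
import numpy as np

def boundary_mode_residual(M, r, alpha):
    """Residual of u_m = (M - m) h q0^m for lambda0 = 1 - r + r q0, |q0| = 1."""
    h = 1.0 / M
    q0 = np.exp(1j * alpha)                  # a point of the circle |q0| = 1
    lam0 = 1.0 - r + r * q0
    m = np.arange(M + 1)
    u = (M - m) * h * q0 ** m                # unit max norm, u_M = 0
    Ru = np.empty(M + 1, dtype=complex)
    Ru[:M] = (1.0 - r) * u[:M] + r * u[1:]   # interior rows, cf. (10.142a)
    Ru[M] = u[M]                             # boundary row, cf. (10.142b)
    return np.abs(Ru - lam0 * u).max()

r, alpha = 0.8, 2.0
for M in (10, 100, 1000):
    print(boundary_mode_residual(M, r, alpha))   # equals r * h = r / M
```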
Next, let us assume that $\lambda_0 \in \overrightarrow\Lambda$ and show that in this case $\lambda_0$ also belongs to the spectrum of the family of operators $\{R_h\}$. According to (10.160), the equation $\overrightarrow R v - \lambda_0 v = 0$ is written as:
$$(1-r-\lambda_0)v_m + rv_{m+1} = 0, \quad m = 0,1,2,\ldots.$$
Since $\lambda_0 \in \overrightarrow\Lambda$, this equation has a solution $v_m = q_0^m$, $m = 0,1,2,\ldots$, where $|q_0| < 1$. We will consider this solution only for $m = 0,1,2,\ldots,M$: $u = [u_0, u_1, \ldots, u_M]$, $u_m = q_0^m$, so that $\|u\|_{U_h'} = 1$. As before, define $w \stackrel{\mathrm{def}}{=} R_h u - \lambda_0 u$. For the components of the vector $w$ we have:
$$|w_m| = |(1-r-\lambda_0)q_0^m + rq_0^{m+1}| = 0, \quad m = 0,1,\ldots,M-1,$$
$$|w_M| = |1-\lambda_0|\cdot|q_0|^M.$$
Since $|q_0| < 1$ and $M = 1/h$, for any given $\varepsilon > 0$ we can always choose a sufficiently small $h$ so that $|1-\lambda_0|\cdot|q_0|^{1/h} < \varepsilon$. Then,
$$\|w\|_{U_h'} = \|R_h u - \lambda_0 u\|_{U_h'} < \varepsilon = \varepsilon\|u\|_{U_h'},$$
and the inequality (10.162) is satisfied.
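The decaying mode can be checked numerically as well (a sketch assuming Python/NumPy; $q_0$ is an arbitrary illustrative point with $|q_0| < 1$). The residual is confined to the single boundary component and decays like $|q_0|^{1/h}$:

```python
import numpy as np

def interior_mode_residual(M, r, q0):
    """Residual of the decaying mode u_m = q0^m for lambda0 = 1 - r + r q0."""
    lam0 = 1.0 - r + r * q0
    m = np.arange(M + 1)
    u = q0 ** m
    Ru = np.empty(M + 1, dtype=complex)
    Ru[:M] = (1.0 - r) * u[:M] + r * u[1:]
    Ru[M] = u[M]
    return np.abs(Ru - lam0 * u).max()

r, q0 = 0.8, 0.5 + 0.2j        # |q0| < 1: an interior point of the disk
for M in (5, 10, 20):
    # residual = |1 - lambda0| * |q0|^M, tending to 0 as h = 1/M -> 0
    print(interior_mode_residual(M, r, q0))
```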
Note that if the set $\overleftarrow\Lambda$ were not empty, then proving that each of its elements belongs to the spectrum of the family of operators $\{R_h\}$ would have been similar. Altogether, we have thus shown that in our specific example given by equations (10.142) every $\lambda_0 \in \{\overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda\}$ is also an element of the spectrum of $\{R_h\}$.

Now we need to prove that if $\lambda_0 \notin \{\overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda\}$, then it does not belong to the spectrum of the family of operators $\{R_h\}$ either. To that end, it will be sufficient to show that there is an $h$-independent constant $A > 0$, such that for any $u = [u_0, u_1, \ldots, u_M]$ the following inequality holds:
$$\|R_h u - \lambda_0 u\|_{U_h'} \ge A\|u\|_{U_h'}. \tag{10.163}$$
Then, for $\varepsilon < A$, inequality (10.162) will have no solutions, and therefore the point $\lambda_0$ will not belong to the spectrum. Denote $f = R_h u - \lambda_0 u$; then inequality (10.163) reduces to:
$$\|f\|_{U_h'} \ge A\|u\|_{U_h'}. \tag{10.164}$$
Our objective is to justify estimate (10.164). Rewrite the equation $R_h u - \lambda_0 u = f$ as:
$$(1-r-\lambda_0)u_m + ru_{m+1} = f_m, \quad m = 0,1,\ldots,M-1, \qquad (1-\lambda_0)u_M = f_M,$$
and interpret these relations as an equation with respect to the unknown $u = \{u_m\}$, whereas the right-hand side $f = \{f_m\}$ is assumed given. Let
$$u_m = v_m + w_m, \quad m = 0,1,\ldots,M, \tag{10.165}$$
where $v_m$ are components of the bounded solution $v = \{v_m\}$, $m = 0,\pm1,\pm2,\ldots$, to the following equation:
$$(1-r-\lambda_0)v_m + rv_{m+1} = \hat f_m, \qquad \hat f_m = \begin{cases} 0, & \text{if } m < 0,\\ f_m, & \text{if } m = 0,1,\ldots,M-1,\\ 0, & \text{if } m \ge M. \end{cases} \tag{10.166}$$
Then, because of the linearity, the grid function $w = \{w_m\}$ introduced by formula (10.165) solves the equation:
$$(1-r-\lambda_0)w_m + rw_{m+1} = 0, \quad m = 0,1,\ldots,M-1, \qquad (1-\lambda_0)w_M = f_M - (1-\lambda_0)v_M. \tag{10.167}$$
Let us now recast estimate (10.164) as $|u_m| \le \frac{1}{A}\max_m|f_m|$. According to (10.165), to prove this estimate it is sufficient to establish the individual inequalities:
$$|v_m| \le A_1\max_m|f_m|, \tag{10.168a}$$
$$|w_m| \le A_2\max_m|f_m|, \tag{10.168b}$$
where $A_1$ and $A_2$ are constants. We begin with inequality (10.168a). Notice that equation (10.166) is a first order constant-coefficient ordinary difference equation:
$$av_m + bv_{m+1} = \hat f_m, \quad m = 0,\pm1,\pm2,\ldots,$$
where $a = 1-r-\lambda_0$ and $b = r$. Its bounded fundamental solution is given by:
$$G_m = \begin{cases} \dfrac{1}{a}\left(-\dfrac{a}{b}\right)^m, & m \le 0,\\[2pt] 0, & m \ge 1, \end{cases}$$
and it is indeed bounded because $\lambda_0 \notin \{\overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda\}$, i.e., $|\lambda_0 - (1-r)| > r$, and consequently $|a/b| > 1$. Representing the solution $v_m$ in the form of a convolution:
$$v_m = \sum_{k=-\infty}^{\infty} G_{m-k}\hat f_k,$$
we obtain:
$$|v_m| \le \Big[\sum_{j=-\infty}^{0}|G_j|\Big]\max_m|f_m| = \frac{1}{|a|-|b|}\max_m|f_m|,$$
which makes the previous estimate equivalent to (10.168a). Estimate (10.168b) can be obtained by representing the solution of equation (10.167) in the form:
$$w_m = \frac{f_M - (1-\lambda_0)v_M}{1-\lambda_0}\,q_0^{m-M}, \quad m = 0,1,\ldots,M, \tag{10.169}$$
where $q_0 = -a/b$, so that $|q_0| > 1$ and hence $|q_0^{m-M}| \le 1$. Moreover, we can say that $|1-\lambda_0| = b_1 > 0$, because if $\lambda_0 = 1$, then $\lambda_0$ would have belonged to the set $\{\overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda\}$. As such, using formula (10.169) and taking into account estimate (10.168a) that we have already proved, we obtain the desired estimate (10.168b):
$$|w_m| = \left|\frac{f_M - (1-\lambda_0)v_M}{1-\lambda_0}\right|\cdot|q_0^{m-M}| \le \frac{|f_M|}{|1-\lambda_0|} + |v_M| \le \left(\frac{1}{b_1} + A_1\right)\max_m|f_m| = A_2\max_m|f_m|.$$
We have thus proven that the spectrum of the family of operators $\{R_h\}$ defined by formulae (10.142) coincides with the set $\{\overline\Lambda \cup \overrightarrow\Lambda \cup \overleftarrow\Lambda\}$ on the complex plane.

The foregoing algorithm for computing the spectrum of the family of operators $\{R_h\}$ is, in fact, quite general. We have illustrated it using a particular example of the operators defined by formulae (10.142). However, for all other scalar finite-difference schemes the spectrum of the family of operators $\{R_h\}$ can be obtained by performing the same Babenko-Gelfand analysis of Section 10.5.1. The key idea is to take into account other candidate modes that may be prone to developing the instability, besides the eigenmodes $\{e^{i\alpha m}\}$ of the pure Cauchy problem that are accounted for by the von Neumann analysis. When systems of finite-difference equations need to be addressed as opposed to the scalar equations, the technical side of the procedure becomes more elaborate. In this case, the computation of the spectrum of a family of operators can be reduced to studying uniform bounds for the solutions of certain ordinary difference equations with matrix coefficients. A necessary and sufficient condition has been obtained in [Rya64] for the existence of such uniform bounds. This condition is given in terms of the roots of the corresponding characteristic equation and also involves the analysis of some determinants originating from the matrix coefficients of the system. For further detail, we refer the reader to [GR87, §44 & §45] and [RM67, §6.6 & §6.7], as well as to the original journal publication by Ryaben'kii [Rya64].

10.5.3 The Energy Method
For some evolution finite-difference problems, one can obtain $l_2$ estimates of the solution directly, i.e., without employing any special stability criteria such as spectral ones. The corresponding technique is known as the method of energy estimates. It is useful for deriving sufficient conditions of stability, in particular because it can often be applied to problems with variable coefficients on finite intervals. We illustrate the energy method with several examples.

Finite-Difference Schemes for Partial Differential Equations

To begin, let us analyze the continuous case. Consider an initial boundary value problem for the first order constant-coefficient hyperbolic equation:

$$ \frac{\partial u}{\partial t} - \frac{\partial u}{\partial x} = 0, \qquad 0 \le x \le 1, \quad 0 < t \le T, $$
$$ u(x,0) = \psi(x), \qquad u(1,t) = 0. \tag{10.170} $$
Note that both the differential equation and the boundary condition at $x = 1$ in problem (10.170) are homogeneous. Multiply the differential equation of (10.170) by $u = u(x,t)$ and integrate over the entire interval $0 \le x \le 1$:

$$ \int_0^1 \frac{\partial}{\partial t}\frac{u^2(x,t)}{2}\,dx - \int_0^1 \frac{\partial}{\partial x}\frac{u^2(x,t)}{2}\,dx = \frac{d}{dt}\frac{\|u(\cdot,t)\|_2^2}{2} - \frac{u^2(1,t)}{2} + \frac{u^2(0,t)}{2} = 0, $$

where $\|u(\cdot,t)\|_2 \stackrel{\mathrm{def}}{=} \left( \int_0^1 u^2(x,t)\,dx \right)^{1/2}$ is the $L_2$ norm of the solution in space for a given moment of time $t$. According to formula (10.170), the solution at $x = 1$ vanishes: $u(1,t) = 0$, and we conclude that $\frac{d}{dt}\|u(\cdot,t)\|_2^2 \le 0$, which means that $\|u(\cdot,t)\|_2$ is a nonincreasing function of time. Consequently, the $L_2$ norm of the solution will never exceed that of the initial data:

$$ \|u(\cdot,t)\|_2 \le \|\psi\|_2. \tag{10.171} $$
Inequality (10.171) is the simplest energy estimate. It draws its name from the fact that quantities quadratic with respect to the solution are often interpreted as energy in the context of physics. Note that estimate (10.171) holds for all $t \ge 0$ rather than only $0 \le t \le T$.

Next, we consider a somewhat more general formulation than (10.170), namely, an initial boundary value problem for the hyperbolic equation with a variable coefficient:

$$ \frac{\partial u}{\partial t} - a(x,t)\frac{\partial u}{\partial x} = 0, \qquad 0 \le x \le 1, \quad 0 < t \le T, $$
$$ u(x,0) = \psi(x), \qquad u(1,t) = 0. \tag{10.172} $$

We are assuming that $\forall x \in [0,1]$ and $\forall t \ge 0$: $a(x,t) \ge a_0 > 0$, so that the characteristic speed is negative across the entire domain. Then, the differential equation renders transport from right to left. Consequently, setting the boundary condition $u(1,t) = 0$ at the right endpoint of the interval $0 \le x \le 1$ is legitimate.

Let us now multiply the differential equation of (10.172) by $u = u(x,t)$ and integrate over the entire interval $0 \le x \le 1$, while also applying integration by parts to
A Theoretical Introduction to Numerical Analysis
the spatial term:
$$ \frac{d}{dt}\int_0^1 \frac{u^2(x,t)}{2}\,dx - \int_0^1 a(x,t)\,\frac{\partial}{\partial x}\frac{u^2(x,t)}{2}\,dx $$
$$ {} = \frac{d}{dt}\frac{\|u(\cdot,t)\|_2^2}{2} - a(1,t)\frac{u^2(1,t)}{2} + a(0,t)\frac{u^2(0,t)}{2} + \int_0^1 a'_x(x,t)\,\frac{u^2(x,t)}{2}\,dx = 0. $$

Using the boundary condition $u(1,t) = 0$, we find:

$$ \frac{d}{dt}\frac{\|u(\cdot,t)\|_2^2}{2} = -a(0,t)\frac{u^2(0,t)}{2} - \int_0^1 a'_x(x,t)\,\frac{u^2(x,t)}{2}\,dx. $$

The first term on the right-hand side of the previous equality is always nonpositive. As for the second term, let us denote $A = \sup_{(x,t)} [-a'_x(x,t)]$. Then we have:

$$ \frac{d}{dt}\|u(\cdot,t)\|_2^2 \le A\,\|u(\cdot,t)\|_2^2, $$
which immediately yields:

$$ \|u(\cdot,t)\|_2 \le e^{At/2}\,\|\psi\|_2. $$

If $A < 0$, the previous inequality implies that the $L_2$ norm of the solution decays as $t \to +\infty$. If $A = 0$, then the norm of the solution stays bounded by the norm of the initial data. To obtain an overall uniform estimate of $\|u(\cdot,t)\|_2$ for $A \le 0$ and all $t \ge 0$, we need to select the maximum value of the constant: $\max_t e^{At/2} = 1$, and then the desired inequality will coincide with (10.171). For $A > 0$, a uniform estimate can only be obtained on a given fixed interval $0 \le t \le T$, so that altogether we can write:

$$ \|u(\cdot,t)\|_2 \le \begin{cases} \|\psi\|_2, & A \le 0, \ t \ge 0, \\ e^{AT/2}\,\|\psi\|_2, & A > 0, \ 0 \le t \le T. \end{cases} \tag{10.173} $$

Similarly to inequality (10.171), the energy estimate (10.173) also provides a bound for the $L_2$ norm of the solution in terms of the $L_2$ norm of the initial data. However, when $A > 0$, the constant in front of $\|\psi\|_2$ is no longer equal to one. Instead, it grows exponentially as the maximum time $T$ elapses, and therefore estimate (10.173) for $A > 0$ may only be considered on a finite interval $0 \le t \le T$ rather than for all $t \ge 0$.
In problems (10.170) and (10.172) the boundary condition at $x = 1$ was homogeneous. Let us now introduce yet another generalization and analyze the problem:

$$ \frac{\partial u}{\partial t} - a(x,t)\frac{\partial u}{\partial x} = 0, \qquad 0 \le x \le 1, \quad 0 < t \le T, $$
$$ u(x,0) = \psi(x), \qquad u(1,t) = \chi(t), \tag{10.174} $$
which differs from (10.172) by its inhomogeneous boundary condition: $u(1,t) = \chi(t)$. Otherwise everything is the same; in particular, we still assume that $\forall x \in [0,1]$ and $\forall t \ge 0$: $a(x,t) \ge a_0 > 0$, and denote $A = \sup_{(x,t)} [-a'_x(x,t)]$. Multiplying the differential equation of (10.174) by $u(x,t)$ and integrating by parts, we obtain:

$$ \frac{d}{dt}\frac{\|u(\cdot,t)\|_2^2}{2} = -a(0,t)\frac{u^2(0,t)}{2} + a(1,t)\frac{\chi^2(t)}{2} - \int_0^1 a'_x(x,t)\,\frac{u^2(x,t)}{2}\,dx. $$

Consequently,

$$ \frac{d}{dt}\|u(\cdot,t)\|_2^2 \le A\,\|u(\cdot,t)\|_2^2 + a(1,t)\,\chi^2(t). $$

Multiplying the previous inequality by $e^{-At}$, we have:

$$ \frac{d}{dt}\left( e^{-At}\,\|u(\cdot,t)\|_2^2 \right) \le e^{-At}\,a(1,t)\,\chi^2(t), $$

which, after integrating over the time interval $0 \le \theta \le t$, yields:

$$ \|u(\cdot,t)\|_2^2 \le e^{At}\,\|\psi\|_2^2 + e^{At}\int_0^t e^{-A\theta}\,a(1,\theta)\,\chi^2(\theta)\,d\theta. $$

As in the case of a homogeneous boundary condition, we would like to obtain a uniform energy estimate for a given interval of time. This can be done if we again distinguish between $A \le 0$ and $A > 0$. When $A \le 0$ we can consider all $t \ge 0$, and when $A > 0$ we can only have the estimate on some fixed interval $0 \le t \le T$:

$$ \|u(\cdot,t)\|_2^2 \le \begin{cases} \|\psi\|_2^2 + \displaystyle\int_0^t a(1,\theta)\,\chi^2(\theta)\,d\theta, & A \le 0, \ t \ge 0, \\[2mm] e^{AT}\left( \|\psi\|_2^2 + \displaystyle\int_0^T e^{-A\theta}\,a(1,\theta)\,\chi^2(\theta)\,d\theta \right), & A > 0, \ 0 \le t \le T. \end{cases} \tag{10.175} $$
When deriving inequalities (10.175), we obviously need to assume that the integrals on the right-hand side of (10.175) are bounded. These integrals can be interpreted as weighted $L_2$ norms of the boundary data $\chi(t)$. Clearly, energy estimate (10.175) includes the previous estimate (10.173) as a particular case for $\chi(t) \equiv 0$.

All three estimates (10.171), (10.173), and (10.175) indicate that the corresponding initial boundary value problem is well-posed in the sense of $L_2$. Qualitatively, well-posedness means that the solution to a given problem is only weakly sensitive to perturbations of the input data (such as initial and/or boundary data). In the case of linear evolution problems, well-posedness can, for example, be quantified by means of energy estimates. These estimates provide a bound for the norm of the solution in terms of the norms of the input data. In the finite-difference context, similar estimates would imply stability in the sense of $l_2$, provided that the corresponding constants on the right-hand side of each inequality can be chosen independent of the grid. We will now proceed to demonstrate how energy estimates can be obtained for finite-difference schemes.
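Before turning to the discrete case, the continuous estimate (10.171) is easy to check by direct quadrature. The sketch below (our illustration, not part of the original text) takes an assumed compactly supported initial profile $\psi$, uses the exact left-traveling solution $u(x,t) = \psi(x+t)$ of problem (10.170), and verifies that the spatial $L_2$ norm never increases in time:

```python
import numpy as np

# Exact solution of u_t - u_x = 0 with u(1,t) = 0 is the left-traveling wave
# u(x,t) = psi(x + t), provided psi vanishes outside [0, 1].
def psi(s):
    s = np.asarray(s, dtype=float)
    return np.where((s > 0.0) & (s < 1.0), np.sin(np.pi * s) ** 2, 0.0)

def l2_norm(t, n=4000):
    # composite trapezoidal approximation of ||u(., t)||_2 over [0, 1]
    x = np.linspace(0.0, 1.0, n + 1)
    w = psi(x + t) ** 2
    dx = 1.0 / n
    return np.sqrt(dx * (w[0] / 2 + w[1:-1].sum() + w[-1] / 2))

norms = [l2_norm(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
# The sequence is nonincreasing and bounded by ||psi||_2, cf. (10.171);
# by t = 1 the whole profile has exited through x = 0 and the norm is zero.
print(norms)
```

The specific profile $\psi(s) = \sin^2 \pi s$ is an arbitrary choice made only for this demonstration; any smooth $\psi$ supported in $(0,1)$ exhibits the same monotone decay.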
Consider the first order upwind scheme for problem (10.170):

$$ \frac{u_m^{p+1} - u_m^p}{\tau} - \frac{u_{m+1}^p - u_m^p}{h} = 0, \qquad m = 0,1,\ldots,M-1, \quad p = 0,1,\ldots,[T/\tau]-1, $$
$$ u_m^0 = \psi(x_m), \qquad u_M^p = 0. \tag{10.176} $$
To obtain an energy estimate for scheme (10.176), let us first consider two arbitrary functions $\{u_m\}$ and $\{v_m\}$ on the grid $m = 0,1,2,\ldots,M$. We will derive a formula that can be interpreted as a discrete analogue of the classical continuous integration by parts. In the literature, it is sometimes referred to as summation by parts:

$$ \sum_{m=0}^{M-1} u_m (v_{m+1} - v_m) = \sum_{m=0}^{M-1} u_m v_{m+1} - \sum_{m=0}^{M-1} u_m v_m = \sum_{m=1}^{M} u_{m-1} v_m - \sum_{m=0}^{M-1} u_m v_m $$
$$ {} = -\sum_{m=1}^{M} (u_m - u_{m-1})\,v_m + u_M v_M - u_0 v_0 = -\sum_{m=0}^{M-1} (u_{m+1} - u_m)\,v_{m+1} + u_M v_M - u_0 v_0 $$
$$ {} = -\sum_{m=0}^{M-1} (u_{m+1} - u_m)\,v_m + u_M v_M - u_0 v_0 - \sum_{m=0}^{M-1} (u_{m+1} - u_m)(v_{m+1} - v_m). \tag{10.177} $$
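The summation-by-parts identity (10.177) can be sanity-checked numerically on arbitrary grid functions; the short sketch below (our illustration) evaluates both sides for random data and confirms they agree to round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 17
u = rng.standard_normal(M + 1)
v = rng.standard_normal(M + 1)

# Left-hand side of the summation-by-parts identity (10.177)
lhs = sum(u[m] * (v[m + 1] - v[m]) for m in range(M))

# Right-hand side of (10.177)
rhs = (-sum((u[m + 1] - u[m]) * v[m] for m in range(M))
       + u[M] * v[M] - u[0] * v[0]
       - sum((u[m + 1] - u[m]) * (v[m + 1] - v[m]) for m in range(M)))

print(abs(lhs - rhs))  # difference is at the level of round-off
```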
Next, we rewrite the difference equation of (10.176) as

$$ u_m^{p+1} = u_m^p + r\,(u_{m+1}^p - u_m^p), \qquad r = \frac{\tau}{h} = \text{const}, \qquad m = 0,1,\ldots,M-1, $$

square both sides, and take the sum from $m = 0$ to $m = M-1$. This yields:

$$ \sum_{m=0}^{M-1} (u_m^{p+1})^2 = \sum_{m=0}^{M-1} (u_m^p)^2 + r^2 \sum_{m=0}^{M-1} (u_{m+1}^p - u_m^p)^2 + 2r \sum_{m=0}^{M-1} u_m^p\,(u_{m+1}^p - u_m^p). $$

To transform the last term on the right-hand side of the previous equality, we apply formula (10.177) with $v_m = u_m^p$, which gives

$$ 2\sum_{m=0}^{M-1} u_m^p\,(u_{m+1}^p - u_m^p) = (u_M^p)^2 - (u_0^p)^2 - \sum_{m=0}^{M-1} (u_{m+1}^p - u_m^p)^2, $$

so that

$$ \sum_{m=0}^{M-1} (u_m^{p+1})^2 = \sum_{m=0}^{M-1} (u_m^p)^2 + r(r-1) \sum_{m=0}^{M-1} (u_{m+1}^p - u_m^p)^2 + r\,(u_M^p)^2 - r\,(u_0^p)^2. $$

Let us now assume that $r \le 1$. Then, using the conventional definition of the $l_2$ norm, $\|u\|_2 = \left[ h \sum_{m=0}^{M} |u_m|^2 \right]^{1/2}$, and employing the homogeneous boundary condition $u_M^p = 0$ of (10.176), we obtain the inequality

$$ \|u^{p+1}\|_2^2 \le \|u^p\|_2^2, $$

which clearly implies the energy estimate:

$$ \|u^p\|_2 \le \|\psi\|_2. \tag{10.178} $$

The discrete estimate (10.178) is analogous to the continuous estimate (10.171). To approximate the variable-coefficient problem (10.172), we use the scheme:
$$ \frac{u_m^{p+1} - u_m^p}{\tau} - a_m^p\,\frac{u_{m+1}^p - u_m^p}{h} = 0, \qquad m = 0,1,\ldots,M-1, \quad p = 0,1,\ldots,[T/\tau]-1, $$
$$ u_m^0 = \psi_m, \qquad u_M^p = 0, \tag{10.179} $$

where $a_m^p \equiv a(x_m, t_p)$. Applying a similar approach, we obtain:

$$ \sum_{m=0}^{M-1} (u_m^{p+1})^2 = \sum_{m=0}^{M-1} (u_m^p)^2 + r^2 \sum_{m=0}^{M-1} (a_m^p)^2 (u_{m+1}^p - u_m^p)^2 + 2r \sum_{m=0}^{M-1} a_m^p\, u_m^p\, (u_{m+1}^p - u_m^p). $$
Next, we notice that $a_{m+1}^p u_{m+1}^p = a_m^p u_{m+1}^p + (a_{m+1}^p - a_m^p)\,u_{m+1}^p$. Substituting this expression into the previous formula, we again perform summation by parts, which is analogous to the continuous integration by parts, and which yields:

$$ \sum_{m=0}^{M-1}(u_m^{p+1})^2 = \sum_{m=0}^{M-1}(u_m^p)^2 + r^2\sum_{m=0}^{M-1}(a_m^p)^2(u_{m+1}^p - u_m^p)^2 + r\sum_{m=0}^{M-1}a_m^p\, u_m^p\,(u_{m+1}^p - u_m^p) $$
$$ {} + r\Big[ -\sum_{m=0}^{M-1}\big(a_m^p(u_{m+1}^p - u_m^p) + (a_{m+1}^p - a_m^p)\,u_{m+1}^p\big)\,u_m^p $$
$$ \quad {} - \sum_{m=0}^{M-1}\big(a_m^p(u_{m+1}^p - u_m^p) + (a_{m+1}^p - a_m^p)\,u_{m+1}^p\big)(u_{m+1}^p - u_m^p) + a_M^p(u_M^p)^2 - a_0^p(u_0^p)^2 \Big] $$
$$ {} = \sum_{m=0}^{M-1}(u_m^p)^2 + \sum_{m=0}^{M-1}\big(r^2(a_m^p)^2 - r\,a_m^p\big)(u_{m+1}^p - u_m^p)^2 - r\sum_{m=0}^{M-1}(a_{m+1}^p - a_m^p)(u_{m+1}^p)^2 + r\,a_M^p(u_M^p)^2 - r\,a_0^p(u_0^p)^2. $$

Let us now assume that for all $m$ and $p$ we have $r\,a_m^p \le 1$. Equivalently, we can require that $r \le [\sup_{(x,t)} a(x,t)]^{-1}$. Let us also introduce

$$ A = \sup_{(m,p)} \left[ -\frac{a_{m+1}^p - a_m^p}{h} \right]. $$

Then, using the homogeneous boundary condition $u_M^p = 0$ and dropping the a priori nonpositive term $\sum_{m=0}^{M-1} r\,a_m^p\,(r\,a_m^p - 1)(u_{m+1}^p - u_m^p)^2$, we obtain:

$$ \sum_{m=0}^{M-1}(u_m^{p+1})^2 \le \sum_{m=0}^{M-1}(u_m^p)^2 + rAh\sum_{m=0}^{M}(u_m^p)^2 - r\,(u_0^p)^2\,(a_0^p + Ah). $$

If $A > 0$, then for the last term on the right-hand side of the previous inequality we clearly have $-r\,(u_0^p)^2(a_0^p + Ah) \le 0$. Even if $A \le 0$, we can still claim that $-r\,(u_0^p)^2(a_0^p + Ah) \le 0$ for sufficiently small $h$, because $a_0^p \ge a_0 > 0$. Consequently, on fine grids the norm of the solution satisfies $\|u^{p+1}\|_2^2 \le (1 + \tau A)\,\|u^p\|_2^2$.
This implies the discrete analogue (10.180) of the continuous estimate (10.173). For the approximation of problem (10.174), the same scheme is used with the inhomogeneous boundary condition $u_M^p = \chi^p \equiv \chi(t_p)$; this is scheme (10.181), and a similar argument furnishes, for $p = 0,1,2,\ldots,[T/\tau]$:

$$ \|u^p\|_2' \le \begin{cases} \left[ \|\psi\|_2^2 + \displaystyle\sum_{k=1}^{p} \tau\, a_M^{k-1}\,(\chi^{k-1})^2 \right]^{1/2}, & A \le 0, \\[2mm] e^{AT/2} \left[ \|\psi\|_2^2 + \displaystyle\sum_{k=1}^{[T/\tau]} \tau\, a_M^{k-1}\,(\chi^{k-1})^2 \right]^{1/2}, & A > 0. \end{cases} \tag{10.182} $$
The discrete estimate (10.182) is analogous to the continuous estimate (10.175). To use the norms $\|\cdot\|_2$ instead of $\|\cdot\|_2'$ in (10.182), we only need to add the bounded quantity $h(\chi^0)^2 + h(\chi^p)^2$ on the right-hand side.

Energy estimates (10.178), (10.180), and (10.182) imply the $l_2$ stability of the schemes (10.176), (10.179), and (10.181), respectively, in the sense of Definition 10.2 from page 312. Note that since the foregoing schemes are explicit, stability is not unconditional: the Courant number $r = \tau/h$ has to satisfy $r \le 1$ for scheme (10.176) and $r \le [\sup_{(x,t)} a(x,t)]^{-1}$ for schemes (10.179) and (10.181).

In general, direct energy estimates appear helpful for studying stability of finite-difference schemes. Indeed, they may provide sufficient conditions for those difficult cases that involve variable coefficients, boundary conditions, and even multiple space dimensions. In addition to the scalar hyperbolic equations, energy estimates can be obtained for some hyperbolic systems, as well as for parabolic equations. For detail, we refer the reader to [GKO95, Chapters 9 & 11], and to some fairly recent journal publications [Str94, Ols95a, Ols95b]. However, there is a key nontrivial step in proving energy estimates for finite-difference initial boundary value problems, namely, obtaining the discrete summation-by-parts rules appropriate for a given discretization [see the example given by formula (10.177)]. Sometimes this step may not be obvious at all; in other cases it may require using alternative norms based on specially chosen inner products.
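The discrete energy estimates above can be exercised directly in a short computation. The sketch below (our illustration; the coefficient $a(x) = 2 - x$, the initial profile, and the grid parameters are assumed choices) advances the upwind scheme (10.176) and its variable-coefficient version (10.179), checking that the $l_2$ norm is monotone in the constant-coefficient case and obeys the per-step bound $\|u^{p+1}\|_2^2 \le (1 + \tau A)\|u^p\|_2^2$ in the variable-coefficient case:

```python
import numpy as np

def step_upwind(u, r, a=None):
    # One step of (10.176) (a is None) or (10.179):
    # u_m <- u_m + r * a_m * (u_{m+1} - u_m), with boundary value u_M = 0.
    coef = r if a is None else r * a[:-1]
    v = u.copy()
    v[:-1] = u[:-1] + coef * (u[1:] - u[:-1])
    v[-1] = 0.0
    return v

M, r, A = 200, 0.45, 1.0            # r * sup(a) = 0.9 <= 1 for a(x) = 2 - x
h = 1.0 / M
tau = r * h
x = np.linspace(0.0, 1.0, M + 1)
a = 2.0 - x                          # assumed coefficient, a >= a0 = 1 > 0, A = sup[-a'] = 1
psi = np.sin(np.pi * x) ** 2         # assumed initial data, psi(1) = 0
norm = lambda u: np.sqrt(h * np.sum(u ** 2))

u_const, u_var = psi.copy(), psi.copy()
growth = []                          # squared-norm growth ratio per step, scheme (10.179)
for _ in range(400):
    u_const = step_upwind(u_const, r)
    u_var_new = step_upwind(u_var, r, a)
    growth.append(norm(u_var_new) ** 2 / max(norm(u_var) ** 2, 1e-300))
    u_var = u_var_new

print(norm(u_const) <= norm(psi))            # estimate (10.178) holds
print(max(growth) <= 1.0 + tau * A + 1e-10)  # per-step bound behind (10.180) holds
```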
10.5.4
A Necessary and Sufficient Condition of Stability. The Kreiss Criterion
In Section 10.5.2, we have shown that for stability of a finite-difference initial boundary value problem it is necessary that the spectrum of the family of transition operators $R_h$ belong to the unit disk on the complex plane. We have also shown, see Theorem 10.9, that this condition is, in fact, not very far from a sufficient one, as it guarantees that the scheme will not develop a catastrophic exponential instability. However, it is not a fully sufficient condition, and there are examples of schemes that satisfy the Godunov and Ryaben'kii criterion of Section 10.5.2, i.e., that have the spectrum of $\{R_h\}$ inside the unit disk, yet are unstable.

A comprehensive analysis of necessary and sufficient conditions for the stability of evolution-type schemes on finite intervals is rather involved. In the literature, the corresponding series of results is commonly referred to as the Gustafsson, Kreiss, and Sundström (GKS) theory, and we refer the reader to the monograph [GKO95, Part II] for detail. A concise narrative of this theory can also be found in [Str04, Chapter 11]. As is often the case when analyzing sufficient conditions, all results of the GKS theory are formulated in terms of the $l_2$ norm. An important tool used for obtaining stability estimates is the Laplace transform in time.

Although a full account of (and even a self-contained introduction to) the GKS theory is beyond the scope of this text, its key ideas are easy to understand on the qualitative level and easy to illustrate with examples. The following material is essentially based on that of Section 10.5.2 and can be skipped during the first reading.

Let us consider an initial boundary value problem for the first order constant-coefficient hyperbolic equation:
$$ \frac{\partial u}{\partial t} - \frac{\partial u}{\partial x} = 0, \qquad 0 \le x \le 1, \quad 0 < t \le T, $$
$$ u(x,0) = \psi(x), \qquad u(1,t) = 0. \tag{10.183} $$

We introduce a uniform grid: $x_m = mh$, $m = 0,1,\ldots,M$, $h = 1/M$; $t_p = p\tau$, $p = 0,1,2,\ldots$, and approximate problem (10.183) with the leap-frog scheme:

$$ \frac{u_m^{p+1} - u_m^{p-1}}{2\tau} - \frac{u_{m+1}^p - u_{m-1}^p}{2h} = 0, \qquad m = 1,2,\ldots,M-1, \quad p = 1,2,\ldots,[T/\tau]-1, $$
$$ u_m^0 = \psi(x_m), \quad u_m^1 = \psi(x_m + \tau), \quad m = 0,1,\ldots,M, \tag{10.184} $$
$$ l u_0^{p+1} = 0, \qquad u_M^{p+1} = 0, \qquad p = 1,2,\ldots,[T/\tau]-1. $$

Notice that scheme (10.184) requires two initial conditions, and for simplicity we use the exact solution, which is readily available in this case, to specify $u_m^1$ for $m = 0,1,\ldots,M$. Also notice that the differential problem (10.183) does not require any boundary condition at the "outflow" boundary $x = 0$, but the discrete problem (10.184) requires an additional boundary condition that we symbolically denote $l u_0^{p+1} = 0$. We will investigate two different outflow conditions for scheme (10.184):

$$ u_0^{p+1} = u_1^{p+1} \tag{10.185a} $$

and

$$ u_0^{p+1} = u_0^p + r\,(u_1^p - u_0^p), \tag{10.185b} $$
where we have used our standard notation $r = \tau/h = \text{const}$.

Let us first note that scheme (10.184) is not a one-step scheme. Therefore, to reduce it to the canonical form (10.141), so as to be able to investigate the spectrum of the family of operators $\{R_h\}$, we would formally need to introduce additional variables (i.e., transform the scalar equation into a system) and then consider a one-step finite-difference equation, but with vector unknowns. It is, however, possible to show that even after a reduction of that type has been implemented, the Babenko-Gelfand procedure of Section 10.5.1 applied to the resulting vector scheme for calculating the spectrum of $\{R_h\}$ is equivalent to the Babenko-Gelfand procedure applied directly to the multistep scheme (10.184). As such, we will skip the formal reduction of scheme (10.184) to the canonical form (10.141) and proceed immediately to computing the spectrum of the corresponding family of transition operators.

We need to analyze three model problems that follow from (10.184). A problem with no lateral boundaries:
$$ \frac{u_m^{p+1} - u_m^{p-1}}{2\tau} - \frac{u_{m+1}^p - u_{m-1}^p}{2h} = 0, \qquad m = 0, \pm 1, \pm 2, \ldots, \tag{10.186} $$

a problem with only the left boundary:
$$ \frac{u_m^{p+1} - u_m^{p-1}}{2\tau} - \frac{u_{m+1}^p - u_{m-1}^p}{2h} = 0, \qquad m = 1,2,\ldots, \qquad l u_0^{p+1} = 0, \tag{10.187} $$

and a problem with only the right boundary:
$$ \frac{u_m^{p+1} - u_m^{p-1}}{2\tau} - \frac{u_{m+1}^p - u_{m-1}^p}{2h} = 0, \qquad m = M-1, M-2, \ldots, 1, 0, -1, \ldots, \qquad u_M^{p+1} = 0. \tag{10.188} $$
Substituting a solution of the type $u_m^p = \lambda^p u_m$ into the finite-difference equation

$$ \frac{u_m^{p+1} - u_m^{p-1}}{2\tau} - \frac{u_{m+1}^p - u_{m-1}^p}{2h} = 0 $$

that corresponds to all three problems (10.186), (10.187), and (10.188), we obtain the following second order ordinary difference equation for the eigenfunction $\{u_m\}$:

$$ \left( \lambda - \frac{1}{\lambda} \right) u_m - r\,(u_{m+1} - u_{m-1}) = 0, \qquad r = \frac{\tau}{h}. \tag{10.189} $$
Its characteristic equation,

$$ \left( \lambda - \frac{1}{\lambda} \right) q - r\,(q^2 - 1) = 0, \tag{10.190a} $$

has two roots, $q_1 = q_1(\lambda)$ and $q_2 = q_2(\lambda)$, so that the general solution of equation (10.189) can be written as

$$ u_m = c_1 q_1^m + c_2 q_2^m, \qquad m = 0, \pm 1, \pm 2, \ldots, \qquad c_1 = \text{const}, \quad c_2 = \text{const}. $$

It will also be convenient to recast the characteristic equation (10.190a) in the equivalent form:

$$ q^2 - \frac{\lambda - \lambda^{-1}}{r}\,q - 1 = 0. \tag{10.190b} $$
From equation (10.190b) one can easily see that $q_1 q_2 = -1$, and consequently, unless both roots have unit magnitude, we always have $|q_1(\lambda)| < 1$ and $|q_2(\lambda)| > 1$. The solution of problem (10.186) must be bounded: $|u_m| \le \text{const}$ for $m = 0, \pm 1, \pm 2, \ldots$. We therefore require that for this problem $|q_1| = |q_2| = 1$, which means $q_1 = e^{i\alpha}$, $0 \le \alpha < 2\pi$, and $q_2 = -e^{-i\alpha}$. The spectrum of this problem was calculated in Example 5 of Section 10.3.3:

$$ \Lambda = \left\{ \lambda = \lambda(\alpha) = ir\sin\alpha \pm \sqrt{1 - r^2\sin^2\alpha} \ \Big|\ 0 \le \alpha < 2\pi \right\}. \tag{10.191} $$
Provided that $r \le 1$, the spectrum $\Lambda$ given by formula (10.191) belongs to the unit circle on the complex plane.

For problem (10.188), we must have $u_m \to 0$ as $m \to -\infty$. Consequently, its general solution is given by:

$$ u_m^p = c_2\,\lambda^p q_2^m, \qquad m = M, M-1, \ldots, 1, 0, -1, \ldots. $$

The homogeneous boundary condition $u_M^{p+1} = 0$ of (10.184) implies that a nontrivial eigenfunction $u_m = c_2 q_2^m$ may only exist if $\lambda = 0$. From the characteristic equation (10.190a) in yet another equivalent form, $(\lambda^2 - 1)\,q - r\lambda\,(q^2 - 1) = 0$, we conclude that if $\lambda = 0$ then $q = 0$, which means that problem (10.188) has no eigenvalues:

$$ \Lambda^+ = \varnothing. \tag{10.192} $$

To study problem (10.187), we first consider boundary condition (10.185a), known as the extrapolation boundary condition. The solution of problem (10.187) must satisfy $u_m \to 0$ as $m \to \infty$. Consequently, its general form is $u_m^p = c_1\,\lambda^p q_1^m$, $m = 0,1,2,\ldots$. The extrapolation condition (10.185a) implies that a nontrivial eigenfunction $u_m = c_1 q_1^m$ may only exist if either $\lambda = 0$ or $c_1(1 - q_1) = 0$. However, we must have $|q_1| < 1$ for problem (10.187), and as such, we see that this problem has no eigenvalues either:

$$ \Lambda^- = \varnothing. \tag{10.193} $$
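The root structure just described is easy to confirm numerically. The sketch below (our illustration; the value $r = 0.95$ and the sample $\lambda$ are assumed choices) solves (10.190b) for the roots $q_{1,2}$, checks that their product is $-1$ and that $|\lambda| > 1$ forces one root inside and one outside the unit circle, and verifies that the spectrum (10.191) lies on the unit circle for $r \le 1$:

```python
import numpy as np

r = 0.95

def roots_q(lam):
    # Roots of the characteristic equation (10.190b):
    # q^2 - ((lam - 1/lam)/r) q - 1 = 0
    b = (lam - 1.0 / lam) / r
    return np.roots([1.0, -b, -1.0])

# Spectrum (10.191) of the problem without lateral boundaries lies on |lambda| = 1:
alpha = np.linspace(0.0, 2.0 * np.pi, 721)
lam_plus = 1j * r * np.sin(alpha) + np.sqrt(1.0 - (r * np.sin(alpha)) ** 2 + 0j)
print(np.max(np.abs(np.abs(lam_plus) - 1.0)))   # ~0 for r <= 1

# For a sample |lambda| > 1 the two roots split: |q1| < 1 < |q2|, q1*q2 = -1
q = sorted(roots_q(1.05 + 0.3j), key=abs)
print(abs(q[0]), abs(q[1]), q[0] * q[1])
```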
Combining formulae (10.191), (10.192), and (10.193), we obtain the spectrum of the family of operators:

$$ \bar\Lambda = \Lambda \cup \Lambda^- \cup \Lambda^+ = \Lambda. $$
We therefore see that according to formula (10.191), the necessary condition for stability (Theorem 10.8) of scheme (10.184), (10.185a) is satisfied when $r \le 1$. Moreover, according to Theorem 10.9, even if there is no uniform bound on the powers of the transition operators for scheme (10.184), (10.185a), their rate of growth will be slower than any exponential function. Unfortunately, this is precisely what happens. Even though the necessary condition of stability holds for scheme (10.184), (10.185a), the scheme still turns out to be unstable. The actual proof of instability can be found in [GKO95, Section 13.1] or in [Str04, Section 11.2]. We omit it here and only corroborate the instability with a numerical demonstration. In Figure 10.14 we show the results of numerical integration of problem (10.183) using scheme (10.184), (10.185a) with $r = 0.95$. We specify the exact solution of the problem as $u(x,t) = \cos 2\pi(x+t)$, and as such, supplement problem (10.183) with the inhomogeneous boundary condition $u(1,t) = \cos 2\pi(1+t)$.

In order to analyze what may have caused the instability of scheme (10.184), (10.185a), let us return to the proof of Theorem 10.9. If we were able to claim that the entire spectrum of the family of operators $\{R_h\}$ lies strictly inside the unit disk, then a straightforward modification of that proof would immediately yield a uniform bound on the powers $R_h^p$. This situation, however, is generally impossible. Indeed, in all our previous examples the spectrum has always contained at least one point on the unit circle: $\lambda = 1$. It is therefore natural to assume that since the points $\lambda$ inside the unit disk present no danger of instability according to Theorem 10.9, the potential "culprits" should be sought on the unit circle. Let us now get back to problem (10.187).

We have shown that this problem has no nontrivial eigenfunctions in the class $u_m \to 0$ as $m \to \infty$, and accordingly, it has no eigenvalues either, see formula (10.193). As such, it does not contribute to the overall spectrum of the family of operators. However, even though the boundary condition (10.185a) in the form $c_1(1 - q_1) = 0$ is not satisfied by any function $u_m = c_1 q_1^m$, where $|q_1| < 1$, we see that it is "almost satisfied" if the root $q_1$ is close to one. Therefore, the function $u_m = c_1 q_1^m$ is "almost an eigenfunction" of problem (10.187), and the smaller the quantity $|1 - q_1|$, the more of a genuine eigenfunction it becomes. To investigate stability, we need to determine whether or not the foregoing "almost an eigenfunction" can bring along an unstable eigenvalue, or rather "almost an eigenvalue," $|\lambda| > 1$. By passing to the limit $q_1 \to 1$, we find from equation (10.190a) that $\lambda = 1$ or $\lambda = -1$. We should therefore analyze the behavior of the quantities $\lambda$ and $q$ in a neighborhood of each of these two values of $\lambda$, when the relation between $\lambda$ and $q$ is given by equation (10.190a).

First recall that according to formula (10.191), if $|q| = 1$, then $|\lambda| = 1$ (provided that $r \le 1$). Consequently, if $|\lambda| > 1$, then $|q| \ne 1$, i.e., there are two distinct roots: $|q_1| < 1$ and $|q_2| > 1$. In particular, when $\lambda$ is near $(1,0)$, there are still two roots: one with magnitude greater than one and the other with magnitude less than one. When $|\lambda| > 1$ and $\lambda \to 1$, we will clearly have $|q_1| \to 1$ and $|q_2| \to 1$.
[Panels: M = 100 and M = 200, at T = 1.0, 2.0, 3.0, 4.0; each panel shows the computed solution.]
FIGURE 10.14: Solution of problem (10.183) with scheme (10.184), (10.185a).

We do not, however, know ahead of time which of the two possible scenarios actually takes place:

$$ \lim_{|\lambda|>1,\ \lambda\to 1} q_1(\lambda) = 1, \qquad \lim_{|\lambda|>1,\ \lambda\to 1} q_2(\lambda) = -1, \tag{10.194a} $$
or

$$ \lim_{|\lambda|>1,\ \lambda\to 1} q_1(\lambda) = -1, \qquad \lim_{|\lambda|>1,\ \lambda\to 1} q_2(\lambda) = 1. \tag{10.194b} $$
To find this out, let us notice that the roots $q_1(\lambda)$ and $q_2(\lambda)$ are continuous (in fact, analytic) functions of $\lambda$. Consequently, if we take $\lambda$ in the form $\lambda = 1 + \eta$, where $|\eta| \ll 1$, and if we want to investigate the root $q = q(\lambda)$ that is close to one, then we can write $q = 1 + \zeta$, where $|\zeta| \ll 1$. From equation (10.190a) we then obtain:

$$ 2\eta + O(\eta^2) = r\,\big( 2\zeta + O(\zeta^2) \big). \tag{10.195} $$

Consider a special case of real $\eta > 0$; then $\zeta$ must obviously be real as well. From the previous equality we find that $\zeta > 0$ (because $r > 0$), i.e., $q > 1$. As such, we see that if $|\lambda| > 1$ and $\lambda \to 1$, then
l
I
mine which of the two scenarios holds: lim
IAI> I , A>  I
ql (A) = l,
lim
IAI> I , A >  I
q2 (A) =  I
( 1O. I 96a)
or

$$ \lim_{|\lambda|>1,\ \lambda\to -1} q_1(\lambda) = -1, \qquad \lim_{|\lambda|>1,\ \lambda\to -1} q_2(\lambda) = 1. \tag{10.196b} $$

Similarly to the previous analysis, let $\lambda = -1 + \eta$, where $|\eta| \ll 1$; then also $q(\lambda) = 1 + \zeta$, where $|\zeta| \ll 1$ (recall, we are still interested in the root with $q \to 1$). Consider a particular
case of real $\eta < 0$; then equation (10.195) yields $\zeta < 0$, i.e., $|q| < 1$. Consequently, if $|\lambda| > 1$ and $\lambda \to -1$, then

$$ \{ q = q(\lambda) \to 1 \} \ \Longrightarrow\ \{ |q| < 1 \}. $$
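Both implications can be checked numerically by perturbing $\lambda$ slightly off $\pm 1$ (from outside the unit circle) and tracking the root of (10.190b) that lies near one. A quick sketch (our illustration; $r = 0.95$ and the perturbation sizes are assumed):

```python
import numpy as np

r = 0.95

def root_near_one(lam):
    # Of the two roots of q^2 - ((lam - 1/lam)/r) q - 1 = 0 (cf. (10.190b)),
    # return the one closest to q = 1.
    b = (lam - 1.0 / lam) / r
    roots = np.roots([1.0, -b, -1.0])
    return min(roots, key=lambda q: abs(q - 1.0))

# lambda -> 1 from outside the unit circle: the root near 1 satisfies |q| > 1
q_plus = root_near_one(1.001)
# lambda -> -1 from outside the unit circle: the root near 1 satisfies |q| < 1
q_minus = root_near_one(-1.001)
print(abs(q_plus), abs(q_minus))
```

The second case is the one prone to instability, in agreement with the discussion that follows.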
In other words, this time it is the root $q_1$ that approaches $(1,0)$ as $\lambda \to -1$, and the scenario that gets realized is (10.196a) rather than (10.196b). In contradistinction to the previous case, this presents a potential for instability. Indeed, the pair $(\lambda, q_1)$, where $|q_1| < 1$ and $|\lambda| > 1$, would have implied instability in the sense of Section 10.5.2 if $c_1 q_1^m$ were a genuine eigenfunction of problem (10.187) and if $\lambda$ were the corresponding genuine eigenvalue. As we know, this is not the case. However, according to the first formula of (10.196a), the actual setup appears to be a limit of the admissible yet unstable situation. In other words, the combination of "almost an eigenfunction" $u_m = c_1 q_1^m$, $|q_1| < 1$, that satisfies $u_m \to 0$ as $m \to \infty$, with "almost an eigenvalue" $\lambda = \lambda(q_1)$, $|\lambda| > 1$, is unstable. While remaining unstable, this combination becomes more of a genuine eigenpair of problem (10.187) as $\lambda \to -1$. Again, a rigorous proof of the instability is given in the framework of the GKS theory using a technique based on the Laplace transform.

Thus, we have seen that two scenarios are possible when $\lambda$ approaches the unit circle from the outside. In one case there may be an admissible root $q$ of the characteristic equation that almost satisfies the boundary condition, see formula (10.196a), and this situation is prone to instability. Otherwise, see formula (10.194b), there is no admissible root $q$ that would ultimately satisfy the boundary condition, and as such, no instability will be associated with this $\lambda$. In the unstable case exemplified by formula (10.196a), the corresponding limit value of $\lambda$ is called the generalized eigenvalue, see [GKO95, Chapter 13]. In particular, $\lambda = -1$ is a generalized eigenvalue of problem (10.187). We re-emphasize that it is not a genuine eigenvalue of problem (10.187), because when $\lambda = -1$ then $q_1 = 1$, and the eigenfunction $u_m = c q_1^m$ does not belong to the admissible class $u_m \to 0$ as $m \to \infty$. In fact, it is easy to see that $\|u\|_2 = \infty$. However, it is precisely the generalized eigenvalues that cause the instability of a scheme, even in the case when the entire spectrum of the family of operators $\{R_h\}$ belongs to the unit disk. Accordingly, the Kreiss necessary and sufficient condition of stability requires that the spectrum of the family of operators be confined to the unit disk as before, and additionally, that the scheme have no generalized eigenvalues $|\lambda| = 1$.

Scheme (10.184), (10.185a) violates the Kreiss condition at $\lambda = -1$ and as such, is unstable. The instability manifests itself via the computations presented in Figure 10.14. As, however, this instability is only due to a generalized eigenvalue with $|\lambda| = 1$, it is relatively mild, as expected. On the other hand, if we were to replace the marginally unstable boundary condition (10.185a) with a truly unstable one in the sense of Section 10.5.2, the effect on the stability of the scheme would be much more drastic. Instead of (10.185a), consider, for example:

$$ u_0^{p+1} = 1.05\,u_1^{p+1}. \tag{10.197} $$
This boundary condition generates an eigenfunction $u_m = c_1 q_1^m$ of problem (10.187) with $q_1 = \frac{1}{1.05} < 1$. The corresponding eigenvalues are given by:

$$ \lambda_{1,2} = \frac{r}{2}\left( q_1 - \frac{1}{q_1} \right) \pm \sqrt{1 + \frac{r^2}{4}\left( q_1 - \frac{1}{q_1} \right)^2}, $$
and for one of these eigenvalues we obviously have $|\lambda| > 1$. Therefore, the scheme is unstable according to Theorem 10.8.

[Panels: M = 100 at T = 0.2, 0.5, 1.0; M = 200 at T = 0.2, 0.35, 0.5; each panel shows the exact and computed solutions.]
FIGURE 10.15: Solution of problem (10.183) with scheme (10.184), (10.197).

In Figure 10.15 we show the results of the numerical solution of problem (10.183) using the unstable scheme (10.184), (10.197). Comparing the plots in Figure 10.15 with those in Figure 10.14, we see that in the case of boundary condition
(10.197) the instability develops much more rapidly in time. Moreover, comparing the left column in Figure 10.15, which corresponds to the grid with M = 100 cells, with the right column in the same figure, which corresponds to M = 200, we see that the instability develops more rapidly on a finer grid, which is characteristic of an exponential instability.

Let us now analyze the second outflow boundary condition (10.185b): $u_0^{p+1} = u_0^p + r\,(u_1^p - u_0^p)$. Unlike the extrapolation-type boundary condition (10.185a), which to some extent is arbitrary, boundary condition (10.185b) merely coincides with the first order upwind approximation of the differential equation itself that we have encountered previously on multiple occasions. To study stability, we again need to investigate three model problems: (10.186), (10.187), and (10.188). Obviously, only problem (10.187) changes due to the new boundary condition, whereas the other two stay the same. To find the corresponding $\lambda$ and $q$ we need to solve the characteristic equation (10.190a) along with a similar equation that stems from the boundary condition (10.185b):

$$ \lambda = 1 - r + r\,q. \tag{10.198} $$
Consider first the case $r = 1$; then from equation (10.190a) we find that $\lambda = q$ or $\lambda = -1/q$, and equation (10.198) says that $\lambda = q$. Consequently, any solution $u_m^p = c_1 \lambda^p q_1^m$, $m = 0,1,2,\ldots$, of problem (10.187) will have $|\lambda| < 1$ as long as $|q| < 1$, i.e., there are only stable eigenfunctions/eigenvalues in the class $u_m \to 0$ as $m \to \infty$. Moreover, as $|\lambda| > 1$ is always accompanied by $|q| > 1$, we conclude that there are no generalized eigenvalues with $|\lambda| = 1$ either. In the case $r < 1$, substituting $\lambda$ from equation (10.198) into equation (10.190a) and subsequently solving for $q$, we find that there is only one solution: $q = 1$. For the corresponding $\lambda$, we then have from equation (10.198): $\lambda = 1$. Consequently, for $r < 1$, problem (10.187) also has no proper eigenfunctions/eigenvalues, which means that we again have $\Lambda^- = \varnothing$. As far as the generalized eigenvalues are concerned, we only need to check one value of $\lambda$: $\lambda = 1$ (because $\lambda = -1$ does not satisfy equation (10.198) for $q = 1$). Let $\lambda = 1 + \eta$ and $q = 1 + \zeta$, where $|\eta| \ll 1$ and $|\zeta| \ll 1$. We then arrive at the same equation (10.195) that we obtained in the context of the previous analysis, and conclude that $\lambda = 1$ does not violate the Kreiss condition, because $|\lambda| > 1$ implies $|q| > 1$. As such, the scheme (10.184), (10.185b) is stable.
Exercises

1. For the scalar Lax-Wendroff scheme [cf. formula (10.83)]:

   (u_m^{p+1} − u_m^p)/τ − (u_{m+1}^p − u_{m−1}^p)/(2h) − (τ/2)·(u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = 0,
   p = 0, 1, ..., [T/τ] − 1,  m = 1, 2, ..., M − 1,  Mh = 1,
   u_m^0 = ψ(x_m),  m = 0, 1, 2, ..., M,
   (u_0^{p+1} − u_0^p)/τ − (u_1^p − u_0^p)/h = 0,  u_M^{p+1} = 0,  p = 0, 1, ..., [T/τ] − 1,

Finite-Difference Schemes for Partial Differential Equations    419

   that approximates the initial boundary value problem:

   ∂u/∂t − ∂u/∂x = 0,  0 ≤ x ≤ 1,  0 < t ≤ T,
   u(x,0) = ψ(x),  u(1,t) = 0,

   on the uniform rectangular grid: x_m = mh, m = 0, 1, ..., M, Mh = 1, t_p = pτ, p = 0, 1, ..., [T/τ], find out when the Babenko-Gelfand stability criterion holds.
   Answer. r = τ/h ≤ 1.

2.* Prove Theorem 10.6.
   a) Prove the sufficiency part.
   b) Prove the necessity part.

3.* Approximate the acoustics Cauchy problem:

   ∂u/∂t − A ∂u/∂x = φ(x,t),  −∞ < x < ∞,  0 < t ≤ T,
   u(x,0) = ψ(x),  −∞ < x < ∞,

   where

   u(x,t) = [v(x,t), w(x,t)]ᵀ,  φ(x,t) = [φ⁽¹⁾(x,t), φ⁽²⁾(x,t)]ᵀ,
   with the Lax-Wendroff scheme:

   (u_m^{p+1} − u_m^p)/τ − A(u_{m+1}^p − u_{m−1}^p)/(2h) − (τ/2)A²(u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = φ_m^p,
   p = 0, 1, ..., [T/τ] − 1,  m = 0, ±1, ±2, ...,
   u_m^0 = ψ(x_m),  m = 0, ±1, ±2, ....

   Define u^p = {u_m^p} and φ^p = {φ_m^p}, and introduce the norms as follows:

   ‖u^p‖² = Σ_m (|v_m^p|² + |w_m^p|²),
   ‖ψ‖² = Σ_m (|ψ⁽¹⁾(x_m)|² + |ψ⁽²⁾(x_m)|²),
   ‖φ^p‖² = Σ_m (|φ⁽¹⁾(x_m, t_p)|² + |φ⁽²⁾(x_m, t_p)|²).

   a) Show that when reducing the Lax-Wendroff scheme to the canonical form (10.141), inequalities (10.143) and (10.144) hold.
   b) Prove that when r = τ/h ≤ 1 the scheme is l₂ stable, and when r > 1 it is unstable.

A Theoretical Introduction to Numerical Analysis    420

   Hint. To prove estimate (10.145) for the norms ‖R_h^p‖, first introduce the new unknown variables (called the Riemann invariants):

   f_m⁽¹⁾ = v_m + w_m  and  f_m⁽²⁾ = v_m − w_m,

   and transform the discrete system accordingly, and then employ the spectral criterion of Section 10.3.

4. Let the norm in the space U_h be defined in the sense of l₂:

   ‖u‖ = [ h Σ_{m=−∞}^{∞} |u_m|² ]^{1/2}.

   Prove that in this case all complex numbers λ(α) = 1 − r + re^{iα}, 0 ≤ α < 2π [see formula (10.148)], belong to the spectrum of the transition operator R_h that corresponds to the difference Cauchy problem (10.147), where the spectrum is defined according to Definition 10.7.
   Hint. Construct the solution u = {u_m}, m = 0, ±1, ±2, ..., to the inequality that appears in Definition 10.7 in the form:

   u_m = q₁^m for m ≥ 0,  u_m = q₂^{−m} for m < 0,

   where q₁ = (1 − δ)e^{iα}, q₂ = (1 − δ)e^{−iα}, and δ > 0 is a small quantity.
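As a quick numerical illustration of Exercise 4 (not a substitute for the proof), the sketch below samples the curve λ(α) = 1 − r + re^{iα} on a fine grid of angles and confirms that it stays inside the closed unit disk for r ≤ 1 but leaves it for r > 1; all concrete choices (grid size, test values of r) are ours:

```python
import cmath
import math

def spectrum_radius(r, n=4096):
    """Largest |lambda(alpha)| over the sampled curve lambda = 1 - r + r*exp(i*alpha)."""
    return max(abs(1 - r + r * cmath.exp(1j * 2 * math.pi * k / n)) for k in range(n))

# r <= 1: the whole curve lies in the closed unit disk; r > 1: it sticks out
assert spectrum_radius(0.5) <= 1.0 + 1e-12
assert spectrum_radius(1.5) > 1.0
```

This matches the analytic identity |λ(α)|² = 1 − 2r(1 − r)(1 − cos α), which is at most 1 precisely when 0 ≤ r ≤ 1.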
5. Prove sufficiency in Theorem 10.7.
   Hint. Use expansion with respect to an orthonormal basis in U_h composed of the eigenvectors of R_h.

6. Compute the spectrum of the family of operators {R_h}, v = R_h u, given by the formulae:

   v_m = (1 − r)u_m + ru_{m+1},  m = 0, 1, ..., M − 1,
   v_M = 0.

   Assume that the norm is the maximum norm.

7. Prove that the spectrum of the family of operators {R_h}, v = R_h u, defined as:

   v_m = (1 − r + γh)u_m + ru_{m+1},  m = 0, 1, ..., M − 1,
   v_M = u_M,

   does not depend on the value of γ and coincides with the spectrum computed in Section 10.5.2 for the case γ = 0. Assume that the norm is the maximum norm.
   Hint. Notice that this operator is obtained by adding γhI′ to the operator R_h defined by formulae (10.142a) & (10.142b), and then use Definition 10.6 directly. Here I′ is a modification of the identity operator that leaves all components of the vector u intact except the last component u_M that is set to zero.

8. Compute the spectrum of the family of operators {R_h}, v = R_h u, given by the formulae:

   v_m = (1 − r)u_m + r(u_{m−1} + u_{m+1})/2,  m = 1, 2, ..., M − 1,
   v_M = 0,  av_0 + bv_1 = 0,

   where a ∈ ℝ and b ∈ ℝ are known and fixed. Consider the cases |a| > |b| and |a| < |b|.

9.* Prove that the spectrum of the family of operators {R_h}, v = R_h u, defined by formulae (10.142a) & (10.142b) and analyzed in Section 10.5.2:

   v_m = (1 − r)u_m + ru_{m+1},  m = 0, 1, ..., M − 1,
   v_M = u_M,

   will not change if the C norm: ‖u‖ = max_m |u_m| is replaced by the l₂ norm: ‖u‖ = [h Σ_m |u_m|²]^{1/2}.
10. For the first order ordinary difference equation:

   av_m + bv_{m+1} = f_m,  m = 0, ±1, ±2, ...,

   the fundamental solution G_m is defined as a bounded solution of the equation:

   aG_m + bG_{m+1} = δ_m ≡ { 1, m = 0;  0, m ≠ 0 }.

   a) Prove that if |a/b| < 1, then G_m = 0 for m ≤ 0 and G_m = (1/b)(−a/b)^{m−1} for m ≥ 1.
   b) Prove that if |a/b| > 1, then G_m = (1/a)(−b/a)^{−m} for m ≤ 0 and G_m = 0 for m ≥ 1.
   c) Prove that v_m = Σ_k G_{m−k} f_k.

11. Obtain energy estimates for the implicit first order upwind schemes that approximate problems (10.170), (10.172)*, and (10.174)*.

12.* Approximate problem (10.170) with the Crank-Nicolson scheme supplemented by one-sided differences at the left boundary x = 0:

   (u_m^{p+1} − u_m^p)/τ − (1/2)[(u_{m+1}^{p+1} − u_{m−1}^{p+1})/(2h) + (u_{m+1}^p − u_{m−1}^p)/(2h)] = 0,
   p = 0, 1, ..., [T/τ] − 1,  m = 1, 2, ..., M − 1,
   (u_0^{p+1} − u_0^p)/τ − (1/2)[(u_1^{p+1} − u_0^{p+1})/h + (u_1^p − u_0^p)/h] = 0,     (10.199)
   u_M^p = 0,  p = 0, 1, ..., [T/τ] − 1,
   u_m^0 = ψ_m,  m = 0, 1, 2, ..., M.

   a) Use an alternative definition of the l₂ norm:

   ‖u‖² = h(u_0² + u_M²)/2 + h Σ_{m=1}^{M−1} u_m²,

   and develop an energy estimate for scheme (10.199).
   Hint. Multiply the equation by u_m^{p+1} + u_m^p and sum over the entire range of m.
   b) Construct the schemes similar to (10.199) for the variable-coefficient problems (10.172) and (10.174) and obtain energy estimates.
13. Using the Kreiss condition, show that the leapfrog scheme (10.184) with the boundary condition:

   u_0^{p+1} = u_1^p     (10.200a)

   is stable, whereas with the boundary condition:

   (u_0^{p+1} − u_0^{p−1})/(2τ) − (u_1^p − u_0^p)/h = 0     (10.200b)

   it is unstable.

14. Reproduce on the computer the results shown in Figures 10.14 and 10.15. In addition, conduct the computations using the leapfrog scheme with the boundary conditions (10.185b), (10.200a), and (10.200b), and demonstrate experimentally the stability and instability in the respective cases.

15.* Using the Kreiss condition, investigate the stability of the Crank-Nicolson scheme applied to solving problem (10.183) and supplemented either with the boundary condition (10.185b) or with the boundary condition (10.200a).
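Exercise 14 calls for computer experiments. A minimal sketch of one such experiment is given below. All concrete choices are ours, not the book's: we assume that (10.184) denotes the standard leapfrog discretization u_m^{p+1} = u_m^{p−1} + r(u_{m+1}^p − u_{m−1}^p) of u_t − u_x = 0, we take an illustrative smooth initial profile vanishing at x = 1, and we generate the second starting level with a first-order upwind step:

```python
import math

def leapfrog_max(bc, M=100, r=0.8, steps=300):
    """Leapfrog marching of u_t - u_x = 0 on [0,1]; returns max |u| at the final level."""
    h = 1.0 / M
    u_old = [math.sin(math.pi * m * h) ** 2 for m in range(M + 1)]
    u_old[M] = 0.0
    # start-up step: first-order upwind (one possible choice)
    u_cur = u_old[:]
    for m in range(M):
        u_cur[m] = u_old[m] + r * (u_old[m + 1] - u_old[m])
    u_cur[M] = 0.0
    for _ in range(steps - 1):
        u_new = [0.0] * (M + 1)
        for m in range(1, M):
            u_new[m] = u_old[m] + r * (u_cur[m + 1] - u_cur[m - 1])
        if bc == "10.200a":
            u_new[0] = u_cur[1]                                 # stable closure (10.200a)
        else:
            u_new[0] = u_old[0] + 2.0 * r * (u_cur[1] - u_cur[0])  # closure (10.200b)
        u_new[M] = 0.0  # exact boundary condition u(1,t) = 0
        u_old, u_cur = u_cur, u_new
    return max(abs(v) for v in u_cur)
```

Running the two closures side by side (e.g., `leapfrog_max("10.200a")` versus `leapfrog_max("10.200b")` with growing `steps`) is one way to observe the stability contrast claimed in Exercise 13 experimentally.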
10.6  Maximum Principle for the Heat Equation

Consider the following initial boundary value problem for a variable-coefficient heat equation, a(x,t) > 0:

   ∂u/∂t − a(x,t) ∂²u/∂x² = φ(x,t),  0 ≤ x ≤ 1,  0 ≤ t ≤ T,
   u(x,0) = ψ(x),  0 ≤ x ≤ 1,     (10.201)
   u(0,t) = ϑ(t),  u(1,t) = χ(t),  0 ≤ t ≤ T.
To solve problem (10.201) numerically, we can use either an explicit or an implicit finite-difference scheme. We will analyze and compare both schemes. In doing so, we will see that quite often the implicit scheme has certain advantages compared to the explicit scheme, even though the algorithm of computing the solution with the help of an explicit scheme is simpler than that for the implicit scheme. The advantages of using an implicit scheme stem from its unconditional stability, i.e., stability that holds for any ratio between the spatial and temporal grid sizes.

10.6.1  An Explicit Scheme
We introduce a uniform grid on the interval [0,1]: x_m = mh, m = 0, 1, ..., M, Mh = 1, and build the scheme on the four-node stencil shown in Figure 10.3(left) (see page 331):

   (u_m^{p+1} − u_m^p)/τ_p − a(x_m, t_p)(u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = φ(x_m, t_p),
   m = 1, 2, ..., M − 1,     (10.202)
   u_m^0 = ψ(x_m) ≡ ψ_m,  m = 0, 1, ..., M,
   u_0^{p+1} = ϑ(t_{p+1}),  u_M^{p+1} = χ(t_{p+1}),  p ≥ 0,
   t_0 = 0,  t_p = τ_0 + τ_1 + ... + τ_{p−1},  p = 1, 2, ....

If the solution u_m^k, m = 0, 1, ..., M, is already known for k = 0, 1, ..., p, then, by virtue of (10.202), the values of u_m^{p+1} at the next time level t = t_{p+1} = t_p + τ_p can be computed with the help of an explicit formula:

   u_m^{p+1} = u_m^p + (τ_p/h²) a(x_m, t_p)(u_{m+1}^p − 2u_m^p + u_{m−1}^p) + τ_p φ(x_m, t_p),     (10.203)
   m = 1, 2, ..., M − 1.

This explains why the scheme (10.202) is called explicit. Formula (10.203) allows one to easily march the discrete solution u_m^p from one time level to another.

According to the principle of frozen coefficients (see Sections 10.4.1 and 10.5.1), when marching the solution by formula (10.203), one may only expect stability if the time step τ_p satisfies:

   2(τ_p/h²) max_m a(x_m, t_p) ≤ 1.     (10.204)

Let us prove that in this case scheme (10.202) is indeed stable, provided that the norms are defined as follows:
   ‖u^(h)‖_{U_h} = max_p max_m |u_m^p|,
   ‖f^(h)‖_{F_h} = max{ max_m |ψ(x_m)|,  max_{m,p} |φ(x_m, t_p)|,  max_p |ϑ(t_p)|,  max_p |χ(t_p)| },

where u^(h) and f^(h) are the unknown solution and the right-hand side in the canonical form L_h u^(h) = f^(h) of the scheme (10.202). First, we will demonstrate that the following inequality, known as the maximum principle, holds:

   max_m |u_m^{p+1}| ≤ max{ max_p |ϑ(t_p)|,  max_p |χ(t_p)|,  max_m |u_m^p| } + τ_p max_{m,p} |φ(x_m, t_p)|.     (10.205)
From formula (10.203) we have:

   u_m^{p+1} = (1 − 2(τ_p/h²) a(x_m, t_p)) u_m^p + (τ_p/h²) a(x_m, t_p)(u_{m+1}^p + u_{m−1}^p) + τ_p φ(x_m, t_p),
   m = 1, 2, ..., M − 1.

If the necessary stability condition (10.204) is satisfied, then 1 − 2(τ_p/h²) a(x_m, t_p) ≥ 0, and consequently,

   |u_m^{p+1}| ≤ (1 − 2(τ_p/h²) a(x_m, t_p)) max_k |u_k^p|
              + (τ_p/h²) a(x_m, t_p)(max_k |u_k^p| + max_k |u_k^p|) + τ_p max_{k,p} |φ(x_k, t_p)|
              = max_k |u_k^p| + τ_p max_{k,p} |φ(x_k, t_p)|,   m = 1, 2, ..., M − 1.
Then, taking into account that u_0^{p+1} = ϑ(t_{p+1}) and u_M^{p+1} = χ(t_{p+1}), we obtain the maximum principle (10.205).

Let us now split the solution u^(h) of the problem L_h u^(h) = f^(h) into two components: u^(h) = v^(h) + w^(h), where v^(h) and w^(h) satisfy the equations of scheme (10.202) with, respectively, the right-hand side φ set to zero (so that v^(h) carries the initial and boundary data) and the initial and boundary data set to zero (so that w^(h) carries the right-hand side φ). For the solution of the first subproblem, the maximum principle (10.205) yields:

   max_m |v_m^{p+1}| ≤ max{ max_p |ϑ(t_p)|,  max_p |χ(t_p)|,  max_m |v_m^p| },

and hence

   max_m |v_m^p| ≤ max{ max_p |ϑ(t_p)|,  max_p |χ(t_p)|,  max_m |ψ(x_m)| }.

For the solution of the second subproblem, we obtain by virtue of the same estimate (10.205):

   max_m |w_m^{p+1}| ≤ max_m |w_m^p| + τ_p max_{m,p} |φ(x_m, t_p)|
                    ≤ max_m |w_m^{p−1}| + (τ_p + τ_{p−1}) max_{m,p} |φ(x_m, t_p)|
                    ≤ ... ≤ max_m |w_m^0| + (τ_p + τ_{p−1} + ... + τ_0) max_{m,p} |φ(x_m, t_p)|
                    ≤ T max_{m,p} |φ(x_m, t_p)|,   since w_m^0 = 0,

where T is the terminal time, see formula (10.201). From the individual estimates established for v_m^{p+1} and w_m^{p+1} we derive:

   max_m |u_m^{p+1}| = max_m |v_m^{p+1} + w_m^{p+1}| ≤ max_m |v_m^{p+1}| + max_m |w_m^{p+1}| ≤ c ‖f^(h)‖_{F_h},     (10.206)

where c = 2 max{1, T}. Inequality (10.206) holds for all p. Therefore, we can write:

   ‖u^(h)‖_{U_h} ≤ c ‖f^(h)‖_{F_h},     (10.207)

which implies stability of scheme (10.202) in the sense of Definition 10.2. As such, we have shown that the necessary condition of stability (10.204) given by the principle of frozen coefficients is also sufficient for stability of the scheme (10.202).
Inequality (10.204) indicates that if the heat conduction coefficient a(x_m, t_p) assumes large values near some point (x, t), then computing the solution at time level t = t_{p+1} = t_p + τ_p will necessitate taking a very small time step τ_p. Therefore, advancing the solution until a prescribed value of T is reached may require an excessively large number of steps, which will make the computation impractical.

Let us also note that the foregoing restriction on the time step is of a purely numerical nature and has nothing to do with the physics behind problem (10.201). Indeed, this problem models the propagation of heat in a spatially one-dimensional structure, e.g., a rod, for which the heat conduction coefficient a = a(x,t) may vary along the rod and also in time. Large values of a(x,t) in a neighborhood of some point (x, t) merely imply that this neighborhood can be removed, i.e., "cut off," from the rod without changing the overall pattern of heat propagation. In other words, we may think that this part of the rod consists of a material with zero heat capacity.
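The explicit marching formula (10.203) with the step restriction (10.204) is easy to try out. In the sketch below the coefficient a(x,t) and the data are illustrative choices of ours (with φ = ϑ = χ = 0); each step picks τ_p from (10.204), and the inner assertion checks the maximum principle (10.205) along the way:

```python
import math

def explicit_heat(M=50, T=0.1):
    """March (10.203) with tau_p chosen from the stability bound (10.204)."""
    h = 1.0 / M
    a = lambda x, t: 1.0 + 10.0 * x * (1.0 - x)  # illustrative coefficient, a > 0
    u = [math.sin(math.pi * m * h) for m in range(M + 1)]  # psi; phi = theta = chi = 0
    t = 0.0
    while t < T:
        amax = max(a(m * h, t) for m in range(1, M))
        tau = min(0.5 * h * h / amax, T - t)  # condition (10.204)
        new = [0.0] * (M + 1)
        for m in range(1, M):
            new[m] = u[m] + tau / h**2 * a(m * h, t) * (u[m + 1] - 2 * u[m] + u[m - 1])
        # boundary conditions u(0,t) = u(1,t) = 0 are already built into new[0], new[M]
        assert max(abs(v) for v in new) <= max(abs(v) for v in u) + 1e-12  # (10.205)
        u, t = new, t + tau
    return max(abs(v) for v in u)
```

Since a ≥ 1 here, the computed profile must decay; the run also illustrates the cost issue discussed above: the larger max a, the smaller every τ_p.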
10.6.2  An Implicit Scheme

Instead of scheme (10.202), we can use the same grid and build the scheme on the stencil shown in Figure 10.3(right) (see page 331):

   (u_m^{p+1} − u_m^p)/τ_p − a(x_m, t_p)(u_{m+1}^{p+1} − 2u_m^{p+1} + u_{m−1}^{p+1})/h² = φ(x_m, t_{p+1}),
   m = 1, 2, ..., M − 1,     (10.208)
   u_m^0 = ψ_m,  m = 0, 1, ..., M,
   u_0^{p+1} = ϑ(t_{p+1}),  u_M^{p+1} = χ(t_{p+1}),  p ≥ 0,
   t_0 = 0,  t_p = τ_0 + τ_1 + ... + τ_{p−1},  p = 1, 2, ....

Assume that the solution u_m^p, m = 0, 1, ..., M, at the time level t = t_p is already known. According to formula (10.208), in order to compute the values of u_m^{p+1}, m = 0, 1, ..., M, at the next time level t = t_{p+1} = t_p + τ_p we need to solve the following system of linear algebraic equations with respect to u_m ≡ u_m^{p+1}:

   u_0 = ϑ(t_{p+1}),
   α_m u_{m−1} + β_m u_m + γ_m u_{m+1} = f_m,  m = 1, 2, ..., M − 1,     (10.209)
   u_M = χ(t_{p+1}),

where

   α_m = γ_m = −(τ_p/h²) a(x_m, t_p),  β_m = 1 + 2(τ_p/h²) a(x_m, t_p),
   f_m = u_m^p + τ_p φ(x_m, t_{p+1}),  m = 1, 2, ..., M − 1,
   γ_0 = α_M = 0,  β_0 = β_M = 1.

It is therefore clear that

   |β_m| > |α_m| + |γ_m|,  m = 0, 1, 2, ..., M,
and because of the diagonal dominance, system (10.209) can be solved by the algorithm of tridiagonal elimination described in Section 5.4.2.

Note that in the case of scheme (10.208) there are no explicit formulae, i.e., closed form expressions such as formula (10.203), that would allow one to obtain the solution u_m^{p+1} at the upper time level given the solution u_m^p at the lower time level. Instead, when marching the solution in time one needs to solve systems (10.209) repeatedly, i.e., on every step, and that is why the scheme is called implicit.

In Section 10.3.3 (see Example 7) we analyzed an implicit finite-difference scheme for the constant-coefficient heat equation and demonstrated that the von Neumann spectral stability condition holds for this scheme for any value of the ratio r = τ/h². By virtue of the principle of frozen coefficients (see Section 10.4.1), the spectral stability condition will not impose any constraints on the time step τ even when the heat conduction coefficient a(x,t) varies. This makes implicit scheme (10.208) unconditionally stable. It can be used efficiently even when the coefficient a(x,t) assumes large values for some (x,t). For convenience, when computing the solution of problem (10.201) with the help of scheme (10.208), one can choose a constant, rather than variable, time step τ_p = τ.

To conclude this section, let us note that unconditional stability of the implicit scheme (10.208) can be established rigorously. Namely, one can prove (see [GR87, §28]) that the solution u_m^p of system (10.208) satisfies the same maximum principle (10.205) as holds for the explicit scheme (10.202). Then, estimate (10.207) for scheme (10.208) can be derived the same way as in Section 10.6.1.
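One implicit step (10.208) amounts to solving the tridiagonal system (10.209). The sketch below implements the elimination in the spirit of Section 5.4.2 (the sweep-coefficient form used here, with φ = 0 and illustrative data, is our choice). Note that the demo time step exceeds the explicit bound (10.204) by several orders of magnitude, yet the step remains well behaved:

```python
import math

def implicit_heat_step(u, a_vals, tau, h, bc_left=0.0, bc_right=0.0):
    """One step of scheme (10.208): solve system (10.209) by tridiagonal elimination."""
    M = len(u) - 1
    alpha = [-tau / h**2 * a_vals[m] for m in range(M + 1)]       # alpha_m = gamma_m
    beta = [1.0 + 2.0 * tau / h**2 * a_vals[m] for m in range(M + 1)]
    f = u[:]                                                      # phi = 0 for simplicity
    # forward sweep: represent u_m = c[m] * u_{m+1} + d[m]
    c = [0.0] * (M + 1)
    d = [0.0] * (M + 1)
    c[0], d[0] = 0.0, bc_left
    for m in range(1, M):
        denom = beta[m] + alpha[m] * c[m - 1]                     # >= 1 by diagonal dominance
        c[m] = -alpha[m] / denom
        d[m] = (f[m] - alpha[m] * d[m - 1]) / denom
    # back substitution starting from the right boundary value
    new = [0.0] * (M + 1)
    new[M] = bc_right
    for m in range(M - 1, -1, -1):
        new[m] = c[m] * new[m + 1] + d[m]
    return new

# demo: a = 100 everywhere, time step far above the explicit bound h^2/(2a)
M = 50
h = 1.0 / M
u0 = [math.sin(math.pi * m * h) for m in range(M + 1)]
u1 = implicit_heat_step(u0, [100.0] * (M + 1), tau=0.01, h=h)
```

With these parameters the explicit bound (10.204) would force τ ≤ h²/(2·100) = 2·10⁻⁶, while the implicit step above uses τ = 10⁻² and still keeps the solution within the bounds dictated by the maximum principle.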
Exercise

1. Let the heat conduction coefficient in problem (10.201) be defined as a = 1 + u², so that problem (10.201) becomes nonlinear.

   a) Introduce an explicit scheme and an implicit scheme for this new problem.
   b) Consider the following explicit scheme:

      (u_m^{p+1} − u_m^p)/τ_p − [1 + (u_m^p)²](u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h² = 0,
      m = 1, 2, ..., M − 1,
      u_m^0 = ψ(x_m) ≡ ψ_m,  m = 0, 1, ..., M,
      u_0^{p+1} = ϑ(t_{p+1}),  u_M^{p+1} = χ(t_{p+1}),  p ≥ 0,
      t_0 = 0,  t_p = τ_0 + τ_1 + ... + τ_{p−1},  p = 1, 2, ....

      How should one choose τ_p, given the values of the solution u_m^p at the level p?

   c) Consider an implicit scheme based on the following finite-difference equation:

      (u_m^{p+1} − u_m^p)/τ_p − [1 + (u_m^{p+1})²](u_{m+1}^{p+1} − 2u_m^{p+1} + u_{m−1}^{p+1})/h² = 0.

      How should one modify this equation in order to employ the tridiagonal elimination of Section 5.4 for the transition from u_m^p, m = 0, 1, ..., M, to u_m^{p+1}, m = 0, 1, ..., M?
Chapter 11  Discontinuous Solutions and Methods of Their Computation
A number of frequently encountered partial differential equations, such as the equations of hydrodynamics, elasticity, diffusion, and others, are in fact derived from the conservation laws of certain physical quantities (e.g., mass, momentum, energy, etc.). The conservation laws are typically written in an integral form, and the foregoing differential equations are only equivalent to the corresponding integral relations if the unknown physical fields (i.e., the unknown functions) are sufficiently smooth. In every example considered previously (Chapters 9 and 10) we assumed that the differential initial (boundary) value problems had regular solutions. Accordingly, our key approach to building finite-difference schemes was based on approximating the derivatives using difference quotients. Subsequently, we could establish consistency with the help of Taylor expansions. However, the class of differentiable functions often appears too narrow for adequately describing many important physical phenomena and processes. For example, physical experiments demonstrate that the fields of density, pressure, and velocity in a supersonic flow of inviscid gas are described by functions with discontinuities (known as either shocks or contact discontinuities in the context of gas dynamics). It is important to emphasize that shocks may appear in the solution as time elapses, even in the case of smooth initial data.

From the standpoint of physical conservation laws, their integral form typically makes sense for both continuous and discontinuous solutions, because the corresponding discontinuous functions can typically be integrated. However, in the discontinuous case one can no longer obtain an equivalent differential formulation of the problem, because discontinuous functions cannot be differentiated. As a consequence, one can no longer enjoy the convenience of studying the properties of solutions to the differential equations, and can no longer use those for building finite-difference schemes. Therefore, prior to constructing the algorithms for computing discontinuous solutions of integral conservation laws, one will need to generalize the concept of solution to a differential initial (boundary) value problem. The objective is to make it meaningful and equivalent to the original integral conservation law even in the case of discontinuous solutions.

We will use a simple example below to illustrate the transition from the problem formulated in terms of an integral conservation law to an equivalent differential problem. The latter will only have a generalized (weak) solution, which is discontinuous. We will also demonstrate some approaches to computing this solution.

Assume that in the region (strip) 0 ≤ t ≤ T we need to find the function u = u(x,t)
that satisfies the integral equation:

   ∮_Γ (u^k/k) dx − (u^{k+1}/(k+1)) dt = 0     (11.1a)

for an arbitrary closed contour Γ. The quantity k in formula (11.1a) is a fixed positive integer. We also require that u = u(x,t) satisfies the initial condition:

   u(x,0) = ψ(x),  −∞ < x < ∞.     (11.1b)

For smooth solutions, problem (11.1a), (11.1b) turns out to be equivalent to the Cauchy problem for the Burgers equation:

   ∂u/∂t + u ∂u/∂x = 0,  u(x,0) = ψ(x),  −∞ < x < ∞,  0 < t ≤ T,     (11.2)

as we now demonstrate. Recall Green's formula for a pair of smooth functions φ₁, φ₂ on a domain Ω with boundary Γ = ∂Ω:

   ∮_Γ φ₁ dx − φ₂ dt = −∬_Ω (∂φ₁/∂t + ∂φ₂/∂x) dx dt.     (11.3)
Identity (11.3) means that the integral of the divergence ∂φ₁/∂t + ∂φ₂/∂x of the vector field φ = [φ₁, φ₂]ᵀ over the domain Ω is equal to the flux of this vector field through the boundary Γ = ∂Ω.

Using formula (11.3), we can write:

   ∮_Γ (u^k/k) dx − (u^{k+1}/(k+1)) dt = −∬_Ω [ ∂/∂t (u^k/k) + ∂/∂x (u^{k+1}/(k+1)) ] dx dt.     (11.4)

Equality (11.4) implies that if a smooth function u = u(x,t) satisfies the Burgers equation, see formula (11.2), then equation (11.1a) also holds. Indeed, if the Burgers equation is satisfied, then we also have:

   ∂/∂t (u^k/k) + ∂/∂x (u^{k+1}/(k+1)) ≡ u^{k−1} ( ∂u/∂t + u ∂u/∂x ) = 0.     (11.5)

Consequently, the right-hand side of equality (11.4) becomes zero.

The converse is also true: if a smooth function u = u(x,t) satisfies the integral conservation law (11.1a), then at every point (x̄, t̄) of the strip 0 < t < T equation (11.5) holds, and hence equation (11.2) is true as well. To justify that, let us assume the opposite, and let us, for definiteness, take some point (x̄, t̄) for which:

   [ ∂/∂t (u^k/k) + ∂/∂x (u^{k+1}/(k+1)) ] |_{(x̄,t̄)} > 0.

Then, by continuity, we can always find a sufficiently small disk Ω ⊂ {(x,t) | 0 < t < T} centered at (x̄, t̄) such that

   [ ∂/∂t (u^k/k) + ∂/∂x (u^{k+1}/(k+1)) ] |_{(x,t)∈Ω} > 0.

Hence, combining equations (11.1a) and (11.4) we obtain (recall, Γ = ∂Ω):

   0 = ∮_Γ (u^k/k) dx − (u^{k+1}/(k+1)) dt = −∬_Ω [ ∂/∂t (u^k/k) + ∂/∂x (u^{k+1}/(k+1)) ] dx dt < 0.

The contradiction we have just arrived at proves that for smooth functions u = u(x,t), problem (11.1a), (11.1b) and the Cauchy problem (11.2) are equivalent.
11.1.2  The Mechanism of Formation of Discontinuities

Let us first suppose that the solution u = u(x,t) of the Cauchy problem (11.2) is smooth. Then we introduce the curves x = x(t) defined by the following ordinary differential equation:

   dx/dt = u(x,t).     (11.6)

These curves are known as characteristics of the Burgers equation: u_t + uu_x = 0. Along every characteristic x = x(t), the solution u = u(x,t) can be considered as a function of the independent variable t only:

   u(x,t) = u(x(t), t) = u(t).

Therefore, using (11.6) and (11.2), we can write:

   du/dt = ∂u/∂t + (∂u/∂x)(dx/dt) = 0.

Consequently, the solution is constant along every characteristic x = x(t) defined by equation (11.6): u(x,t)|_{x=x(t)} = const. In turn, this implies that the characteristics of the Burgers equation are straight lines, because if u = const, then equation (11.6) yields:

   x = ut + x₀.     (11.7)

In formula (11.7), x₀ denotes the abscissa of the point (x₀, 0) on the (x,t) plane from which the characteristic originates, and u = ψ(x₀) = tan α denotes its slope with respect to the vertical axis t, see Figure 11.1.

FIGURE 11.1: Characteristic (11.7).

Thus, we see that by specifying the initial function u(x,0) = ψ(x) in the definition of problem (11.2), we fully determine the pattern of characteristics, as well as the values of the solution u = u(x,t), at every point of the semiplane t > 0.
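The straight-line characteristics (11.7) make the onset of intersection easy to estimate numerically: equating x₀ + ψ(x₀)t = x₁ + ψ(x₁)t for two starting points x₀ < x₁ gives the intersection time t = −(x₁ − x₀)/(ψ(x₁) − ψ(x₀)), which is positive exactly where ψ decreases. The sketch below scans neighboring pairs; the decreasing profile and the scanning interval are illustrative choices of ours:

```python
import math

def breaking_time(psi, x_left=-5.0, x_right=5.0, n=2000):
    """Earliest intersection time of neighboring characteristics x = psi(x0)*t + x0."""
    dx = (x_right - x_left) / n
    t_min = float("inf")
    for i in range(n):
        x0 = x_left + i * dx
        dpsi = psi(x0 + dx) - psi(x0)
        if dpsi < 0:  # characteristics converge only where psi decreases
            t_min = min(t_min, -dx / dpsi)
    return t_min

# for psi(x) = -tanh(x), the steepest descent is psi'(0) = -1,
# so the earliest intersection should occur near t = 1
t_bar = breaking_time(lambda x: -math.tanh(x))
```

As dx → 0 this procedure recovers the infimum of −1/ψ′(x₀) over the points where ψ′ < 0, i.e., the first moment at which the smooth solution can break down.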
FIGURE 11.2: Patterns of characteristics: (a) ψ′(x) > 0; (b) ψ′(x) < 0.

If the initial function ψ = ψ(x) is monotonically increasing, i.e., if ψ′(x) > 0, then the angle α (see Figure 11.1) also increases as a function of x₀, and the characteristics do not intersect, see Figure 11.2(a). However, if ψ = ψ(x) happens to be a monotonically decreasing function, i.e., if ψ′(x) < 0, then the characteristics converge toward one another as time elapses, and their intersections are unavoidable regardless of what regularity the function ψ(x) has, see Figure 11.2(b). In this case, the smooth solution of problem (11.2) ceases to exist and a discontinuity forms, starting at the moment of time t = t̄ when at least two characteristics intersect, see Figure 11.2(b). The corresponding graphs of the function u = u(x,t) at the moments of time t = 0, t = t̄/2, and t = t̄ are schematically shown in Figure 11.3.

11.1.3  Condition at the Discontinuity
Assume that there is a curve L = {(x,t) | x = x(t)} inside the domain of the solution u = u(x,t) of problem (11.1a), (11.1b), on which this solution undergoes a discontinuity of the first kind, alternatively called the jump. Assume also that when approaching this curve from the left and from the right we obtain the limit values:

   u(x,t) = u_left(x,t)  and  u(x,t) = u_right(x,t),

respectively, see Figure 11.4.

FIGURE 11.4: Contour of integration.

It turns out that the values of u_left(x,t) and u_right(x,t) are related to the velocity of the jump ẋ = dx/dt in a particular way, and altogether these quantities are not independent. Let us introduce a contour ABCD that straddles a part of the trajectory of the jump L, see Figure 11.4. The integral conservation law (11.1a) holds for any closed contour Γ and in particular for the contour ABCDA:

   ∮_{ABCDA} (u^k/k) dx − (u^{k+1}/(k+1)) dt = 0.     (11.8)

Next, we start contracting this contour toward the curve L, i.e., start making it narrower. In doing so, the intervals BC and DA will shrink toward the points E and F, respectively, and the corresponding contributions to the integral (11.8) will obviously approach zero, so that in the limit we obtain:

   ∫_{L′} [u^k/k] dx − [u^{k+1}/(k+1)] dt = 0,

or alternatively:

   ∫_{L′} ( [u^k/k](dx/dt) − [u^{k+1}/(k+1)] ) dt = 0.

Here the rectangular brackets: [z] ≡ z_right − z_left denote the magnitude of the jump of a given quantity z across the discontinuity, and L′ denotes an arbitrary stretch of the jump trajectory. Since L′ is arbitrary, the integrand in the previous equality must be equal to zero at every point:

   ( [u^k/k](dx/dt) − [u^{k+1}/(k+1)] ) |_{(x,t)∈L} = 0,

and consequently,

   dx/dt = [u^{k+1}/(k+1)] / [u^k/k].     (11.9)

Formula (11.9) indicates that for different values of k we can obtain different conditions at the trajectory of discontinuity L. For example, if k = 1 we have:

   dx/dt = (u_left + u_right)/2,     (11.10)

and if k = 2 we can write:

   dx/dt = (2/3) · (u²_left + u_left·u_right + u²_right)/(u_left + u_right).

We therefore conclude that the conditions that any discontinuous solution of problem (11.1a), (11.1b) must satisfy at the jump trajectory L depend on k.
11.1.4  Generalized Solution of a Differential Problem

Let us define a generalized solution to problem (11.2). This solution can be discontinuous, and we simply identify it with the solution to the integral conservation law (11.1a), (11.1b). Often the generalized solution is also called a weak solution.

In the case of a solution that has continuous derivatives everywhere, we have seen (Section 11.1.1) that the weak solution, i.e., solution to problem (11.1a), (11.1b), does not depend on k and coincides with the classical solution to the Cauchy problem (11.2). In other words, the solution in this case is a differentiable function u = u(x,t) that turns the Burgers equation u_t + uu_x = 0 into an identity and also satisfies the initial condition u(x,0) = ψ(x). We have also seen that even in the continuous case it is very helpful to consider both the integral formulation (11.1a), (11.1b) and the differential Cauchy problem (11.2). By staying only within the framework of problem (11.1a), (11.1b), we would make it more difficult to reveal the mechanism of the formation of discontinuity, as done in Section 11.1.2.

In the discontinuous case, the definition of a weak solution to problem (11.2) that we have just introduced does not enhance the formulation of problem (11.1a), (11.1b) yet; it merely renames it. Let us therefore provide an alternative definition of a generalized solution to problem (11.2). In doing so, we will only consider bounded solutions to problem (11.1a), (11.1b) that have continuous first partial derivatives everywhere on the strip 0 < t < T, except perhaps for a set of smooth curves x = x(t) along which the solution may undergo discontinuities of the first kind (jumps).

DEFINITION 11.1  The function u = u(x,t) is called a generalized (weak) solution to the Cauchy problem (11.2) that corresponds to the integral conservation law (11.1a) if:

1. The function u = u(x,t) satisfies the Burgers equation [see formula (11.2)] at every point of the strip 0 < t < T that does not belong to the curves x = x(t) which define the jump trajectories.

2. Condition (11.9) holds at the jump trajectory.

3. For every x for which the initial function ψ = ψ(x) is continuous, the solution u = u(x,t) is continuous at the point (x,0) and satisfies the initial condition u(x,0) = ψ(x).

The proof of the equivalence of Definition 11.1 and the definition of a generalized solution given in the beginning of this section is the subject of Exercise 1. Let us emphasize that in the discontinuous case, the generalized solution of the Cauchy problem (11.2) is not determined solely by equalities (11.2) themselves. It also requires that a particular conservation law, i.e., particular value of k, be specified that would relate the jump velocity with the magnitude of the jump across the discontinuity, see formula (11.9). Note that the general formulation of problem (11.1a), (11.1b) we have adopted provides no motivation for selecting any preferred value of k. However, in the problems that originate from real-world scientific applications, the integral conservation laws analogous to (11.1a) would normally express the conservation of some actual physical quantities. These conservation laws are, of course, well defined. In our subsequent considerations, we will assume for definiteness that the integral conservation law (11.1a) that corresponds to the value of k = 1 holds. Accordingly, condition (11.10) is satisfied at the jump trajectory. In the literature, pioneering work on weak solutions was done by Lax [Lax54].

11.1.5  The Riemann Problem
Having defined the weak solutions of problem ( 1 1 .2), see Definition 1 1 . 1 , we will now see how a given initial discontinuity evolves when governed by the Burgers equation. In the literature, the problem of evolution of a discontinuity specified in the initial data is known as the Riemann problem. Consider problem ( 1 1 .2) with the following discontin dx/dt=3/2 uous initial function:
Ijf(X)
u=2
=
{2, 1,
x < 0, x > O.
The corresponding solution is shown in Figure 1 1 .5. The evolution of the initial L:.�"'_""'____I"'�' _L___L discontinuity consists of its x o propagation with the speed i = (2 + 1 )/2 = 3/2. This FIGURE 1 1 .5: Shock. speed, which determines the slope of the jump trajectory in Figure 1 1 .5, is obtained according to formula ( 1 1 . 1 0) as the arithmetic mean of the slopes of characteristics to the left and to the right of the shock. As can be seen, the characteristics on either side of the discontinuity impinge on it. In this case the discontinuity is called a shock; it is similar to the shock waves in the flows of ideal compressible fluid. One can show that the solution from Figure 1 1 .5 is stable with respect to the small perturbations of initial data. Next consider a different type of initial discontinuity: u=]
_
___ _______
Ijf(X)
=
{ I, 2,
x < 0, x > O.
(1 1 . 1 1)
One can obtain two alternative solutions for the initial data ( 1 1 . 1 1). The solution shown in Figure 1 1 .6(a) has no discontinuities for t > O. It consists of two regions with u = 1 and u = 2 bounded by the straight lines i = 1 and i = 2, respectively, that originate from (0,0). These lines do not correspond to the trajectories of discon tinuities, as the solution is continuous across both of them. The region in between these two lines is characterized by a family of characteristics that all originate at the
Discontinuous Solutions and Methods of Their Computation
435
FIGURE 11.6: Solutions of the Burgers equation with initial data (11.11): (a) fan; (b) unstable discontinuity.
same point (0,0). This structure is often referred to as a fan (of characteristics); in the context of gas dynamics it is known as the rarefaction wave. The solution shown in Figure 11.6(b) is discontinuous; it consists of two regions u = 1 and u = 2 separated by the discontinuity with the same trajectory ẋ = (1 + 2)/2 = 3/2 as shown in Figure 11.5. However, unlike in the case of Figure 11.5, the characteristics in Figure 11.6(b) emanate from the discontinuity and veer away as the time elapses rather than impinge on it; this discontinuity is not a shock. To find out which of the two solutions is actually realized, we need to incorporate additional considerations. Let us perturb the initial function ψ(x) of (11.11) and consider:

    ψ(x) = { 1,        x < 0,
           { 1 + x/ε,  0 ≤ x ≤ ε,                          (11.12)
           { 2,        x > ε.
The function ψ(x) of (11.12) is continuous, and the corresponding solution u(x,t) of problem (11.2) is determined uniquely. It is shown in Figure 11.7.
FIGURE 11.7: Solution of the Burgers equation with initial data (11.12).
When ε tends to zero, this solution approaches the continuous fan-type solution of problem (11.2), (11.11) shown in Figure 11.6(a). At the same time, the discontinuous solution of problem (11.2), (11.11) shown in Figure 11.6(b) appears unstable with respect to small perturbations of the initial data. Hence, it is the continuous solution with the fan that should
A Theoretical Introduction to Numerical Analysis
436
be selected as the true solution of problem (11.2), (11.11), see Figure 11.6(a). As for the discontinuous solution from Figure 11.6(b), its exclusion due to the instability is similar to the exclusion of the so-called rarefaction shocks that appear as mathematical objects when analyzing the flows of ideal compressible fluid. Unstable solutions of this type are prohibited by the so-called entropy conditions that are introduced and analyzed in the theory of quasilinear hyperbolic equations, see, e.g., [RJ83].

Exercises
1. Prove that the definition of a weak solution to problem (11.2) given in the very beginning of Section 11.1.4 is equivalent to Definition 11.1.

2.* Consider the following auxiliary problem:

    ∂u/∂t + u ∂u/∂x = μ ∂²u/∂x²,   0 < t < T,  −∞ < x < ∞,
    u(x,0) = ψ(x),                 −∞ < x < ∞,             (11.13)
where μ > 0 is a parameter which is similar to viscosity in the context of fluid dynamics. The differential equation of (11.13) is parabolic rather than hyperbolic. It is known to have a smooth solution for any smooth initial function ψ(x). If the initial function is discontinuous, the solution is also known to become smoother as time elapses. Let ψ(x) = 2 for x < 0 and ψ(x) = 1 for x > 0. Prove that when μ → 0, the solution of problem (11.13) approaches the generalized solution of problem (11.2) (see Definition 11.1) that corresponds to the conservation law (11.1a) with k = 1.
Hint. The solution u = u(x,t) of problem (11.13) can be calculated explicitly with the help of the Hopf formula:

    u(x,t) = [ ∫_{−∞}^{∞} ((x − ξ)/t) e^{−J(x,ξ,t)/(2μ)} dξ ] / [ ∫_{−∞}^{∞} e^{−J(x,ξ,t)/(2μ)} dξ ],

where

    J(x,ξ,t) = (x − ξ)²/(2t) + ∫_0^ξ ψ(η) dη.
More detail can be found in the monograph [RJ83], as well as in the original research papers [Hop50] and [Col51].
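As a numerical illustration of this hint, the Hopf formula can be evaluated directly by quadrature. The sketch below is not from the book; the integration bounds and grid resolution are ad hoc assumptions, and both integrals are approximated by midpoint Riemann sums, with the exponent shifted by its minimum to avoid overflow.

```python
import math

def hopf_solution(x, t, mu, Psi, xi_min=-20.0, xi_max=20.0, n=4000):
    # Evaluate u(x,t) from the Hopf formula for u_t + u u_x = mu u_xx.
    # Psi must be the antiderivative of the initial data psi:
    # Psi(xi) = integral from 0 to xi of psi(eta) d eta.
    d = (xi_max - xi_min) / n
    xis = [xi_min + (i + 0.5) * d for i in range(n)]
    J = [(x - xi) ** 2 / (2.0 * t) + Psi(xi) for xi in xis]
    J0 = min(J)  # shift the exponent for numerical stability
    w = [math.exp(-(j - J0) / (2.0 * mu)) for j in J]
    num = sum(((x - xi) / t) * wi for xi, wi in zip(xis, w))
    return num / sum(w)

# Antiderivative of the step data psi = 2 for x < 0, psi = 1 for x > 0:
Psi_step = lambda xi: 2.0 * xi if xi < 0 else xi
```

For μ = 0.1 and t = 1 this gives u ≈ 2 well to the left of the shock (which travels with speed 3/2) and u ≈ 1 well to the right of it, consistent with the limit claimed in the exercise.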
11.2  Construction of Difference Schemes
In this section, we will provide examples of finite-difference schemes for computing the generalized solution (see Definition 11.1) of problem (11.2):

    ∂u/∂t + u ∂u/∂x = 0,   0 < t < T,  −∞ < x < ∞,
    u(x,0) = ψ(x),         −∞ < x < ∞,

that corresponds to the integral conservation law (11.1a) with k = 1.
Let us assume for definiteness that ψ(x) > 0. Then, u(x,t) > 0 for all x and all t > 0. Naturally, the first idea is to consider the simple upwind scheme:

    (u_m^{p+1} − u_m^p)/τ + u_m^p · (u_m^p − u_{m−1}^p)/h = 0,
    m = 0, ±1, ±2, …,  p = 0, 1, …,  u_m^0 = ψ(mh).        (11.14)

By freezing the coefficient u_m^p at a given grid location, we conclude that the resulting constant-coefficient difference equation satisfies the maximum principle (see the analysis on page 315) when going from time level t_p to time level t_{p+1}, provided that the time step τ_p = t_{p+1} − t_p is chosen to satisfy the inequality:

    τ_p ≤ h / sup_m |u_m^p|.
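Scheme (11.14) together with this time-step restriction can be sketched as follows. This is an illustrative fragment rather than code from the book: the finite domain, the frozen inflow boundary, and the safety factor 0.5 are ad hoc assumptions.

```python
def upwind_burgers(psi, h, x_left, x_right, T):
    # Upwind scheme (11.14) for u_t + u u_x = 0, assuming psi(x) > 0.
    n = int(round((x_right - x_left) / h))
    xs = [x_left + m * h for m in range(n + 1)]
    u = [psi(x) for x in xs]
    tau = 0.5 * h / max(u)  # satisfies tau <= h / sup_m |u_m^p|
    for _ in range(int(round(T / tau))):
        # u_m^{p+1} = u_m^p - (tau/h) u_m^p (u_m^p - u_{m-1}^p);
        # the leftmost (inflow) value is kept frozen
        u = [u[0]] + [u[m] - (tau / h) * u[m] * (u[m] - u[m - 1])
                      for m in range(1, len(u))]
    return xs, u
```

Since each new value is a convex combination of u_m^p and u_{m−1}^p under this restriction, the computed solution stays within the bounds of the initial data, which is the discrete maximum principle mentioned above.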
Then, stability of scheme (11.14) can be expected. Furthermore, if the solution of problem (11.2) is smooth, then scheme (11.14) is clearly consistent. Therefore, it should converge, and indeed, numerical computations of smooth test solutions corroborate the convergence.
However, if the solution of problem (11.2) is discontinuous, then no convergence of scheme (11.14) to the generalized solution of problem (11.2) is expected. The reason is that no information is built into the scheme (11.14) as to what specific conservation law is used for defining the generalized solution. The Courant, Friedrichs, and Lewy condition is violated here in the sense that the generalized solution given by Definition 11.1 depends on the specific value of k that determines the conservation law (11.1a), while the finite-difference solution does not depend on k. Therefore, we need to use special techniques for the computation of generalized solutions. One approach is based on the equation with artificial viscosity μ > 0:

    ∂u/∂t + u ∂u/∂x = μ ∂²u/∂x².
As indicated previously (see Exercise 2 of Section 11.1), when μ → 0 this equation renders a correct selection of the generalized solution to problem (11.2), i.e., it selects the solution that corresponds to the conservation law (11.1a) with k = 1. Moreover, it is also known to automatically filter out the unstable solutions of the type shown in Figure 11.6(b).
Alternatively, we can employ the appropriate conservation law explicitly. In Sections 11.2.2 and 11.2.3 we will describe two different techniques based on this approach. The main difference between the two is that one technique introduces special treatment of the discontinuities, as opposed to all other areas in the domain of the solution, whereas the other technique employs uniform formulae for computation at all grid nodes.
11.2.1  Artificial Viscosity
Consider the following finite-difference scheme that approximates problem (11.13) and that is characterized by the artificially introduced small viscosity μ > 0
[which is not present in the otherwise similar scheme (11.14)]:

    (u_m^{p+1} − u_m^p)/τ + u_m^p · (u_m^p − u_{m−1}^p)/h = μ (u_{m+1}^p − 2u_m^p + u_{m−1}^p)/h²,
    m = 0, ±1, ±2, …,  p = 0, 1, …,  u_m^0 = ψ(mh).        (11.15)
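A minimal sketch of scheme (11.15) follows. It is illustrative only: the stability safety factor and the frozen endpoints are ad hoc assumptions, and μ is kept fixed here rather than tied to h as the convergence statement below requires.

```python
def viscous_burgers(psi, h, mu, x_left, x_right, T):
    # Scheme (11.15): upwind transport plus artificial viscosity mu.
    n = int(round((x_right - x_left) / h))
    xs = [x_left + m * h for m in range(n + 1)]
    u = [psi(x) for x in xs]
    # explicit stability: tau * (sup|u|/h + 2*mu/h^2) < 1
    tau = 0.4 / (max(abs(v) for v in u) / h + 2.0 * mu / h ** 2)
    for _ in range(int(round(T / tau))):
        un = u[:]  # endpoints stay frozen
        for m in range(1, len(u) - 1):
            un[m] = (u[m]
                     - (tau / h) * u[m] * (u[m] - u[m - 1])
                     + mu * tau / h ** 2 * (u[m + 1] - 2.0 * u[m] + u[m - 1]))
        u = un
    return xs, u
```

For step initial data the computed profile stays within the bounds of the data while the viscosity smears the jump over a few cells, which is exactly the shortcoming of artificial dissipation noted below.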
Assume that h > 0, and also that the sufficiently small time step τ = τ(h, μ) is chosen accordingly to ensure stability. Then the solution u^(h) = {u_m^p} of problem (11.15) converges to the generalized solution of problem (11.2), (11.10). The convergence is uniform in space and takes place everywhere except on the arbitrarily small neighborhoods of the discontinuities of the generalized solution. To ensure the convergence, the viscosity μ = μ(h) must vanish as h → 0 with a certain (sufficiently slow) rate.
Various techniques based on the idea of artificial dissipation (artificial viscosity) have been successfully implemented for the computation of compressible fluid flows, see, e.g., [RM67, Chapters 12 & 13] or [Tho95, Chapter 7]. Their common shortcoming is that they tend to smooth out the shocks. As an alternative, one can explicitly build the desired conservation law into the structure of the scheme used for computing the generalized solutions to problem (11.2).
11.2.2  The Method of Characteristics
In this method, we use special formulae to describe the evolution of discontinuities that appear in the process of computation, i.e., as the time elapses. These formulae are based on the condition (11.10) that must hold at the location of discontinuity. At the same time, in the regions of smoothness we use the differential form of the conservation law, i.e., the Burgers equation itself: u_t + uu_x = 0.
The key components of the method of characteristics are the following. For simplicity, consider a uniform spatial grid x_m = mh, m = 0, ±1, ±2, …. Suppose that the function ψ(x) from the initial condition u(x,0) = ψ(x) is smooth. From every point (x_m, 0), we will "launch" a characteristic of the differential equation u_t + uu_x = 0. In doing so, we will assume that for the given function ψ(x), we can always choose a sufficiently small τ such that on any time interval of duration τ, every characteristic intersects with no more than one neighboring characteristic. Take this τ and draw the horizontal grid lines t = t_p = pτ, p = 0, 1, 2, …. Consider the intersection points of all the characteristics that emanate from the nodes (x_m, 0) with the straight line t = τ, and transport the respective values of the solution u(x_m, 0) = ψ(x_m) along the characteristics from the time level t = 0 to these intersection points. If no two characteristics intersect on the time interval 0 ≤ t ≤ τ, then we perform the next step, i.e., extend all the characteristics until the time level t = 2τ and again transport the values of the solution along the characteristics to the points of their intersection with the straight line t = 2τ. If there are still no intersections between the characteristics for τ ≤ t ≤ 2τ, then we perform yet another step and continue this way until on some interval t_p ≤ t ≤ t_{p+1} we find two characteristics that intersect.
Next assume that the two intersecting characteristics emanate from the nodes (x_m, 0) and (x_{m+1}, 0), see Figure 11.8.
FIGURE 11.8: The method of characteristics.
Then, we consider the midpoint of the interval Q_m^{p+1} Q_{m+1}^{p+1} as the point that the incipient shock originates from. Subsequently, we replace the two different points Q_m^{p+1} and Q_{m+1}^{p+1} by one point Q (the midpoint), and assign two different values of the solution u_left and u_right to this point:

    u_left(Q) = u(Q_m^{p+1}),   u_right(Q) = u(Q_{m+1}^{p+1}).

From the point Q, we start up the trajectory of the shock until it intersects with the horizontal line t = t_{p+2}, see Figure 11.8. The slope of this trajectory with respect to the time axis is determined from the condition at the discontinuity (11.10):

    tan α = (u_left + u_right)/2.

From the intersection point of the shock trajectory with the straight line t = t_{p+2}, we draw two characteristics backwards until they intersect with the straight line t = t_{p+1}. The slopes of these two characteristics are u_left and u_right from the previous time level, i.e., from the time level t = t_{p+1}. With the help of interpolation, we find the values of the solution u at the intersection points of the foregoing two characteristics with the line t = t_{p+1}. Subsequently, we use these values as the left and right values of the solution, respectively, at the location of the shock on the next time level t = t_{p+2}. This enables us to evaluate the new slope of the shock as the arithmetic mean of the new left and right values. Subsequently, the trajectory of the shock is extended one more time step τ, and the procedure repeats itself.
The advantage of the method of characteristics is that it allows us to track all of the discontinuities and to accurately compute the shocks starting from their inception. However, new shocks continually form in the course of the computation. In particular, the nonessential (low-intensity) shocks may start intersecting so that the overall solution pattern will become rather elaborate. Then, the computational logic becomes more complicated as well, and the requirements of the computer time and memory increase. This is a key disadvantage of the method of characteristics that singles out the discontinuities and computes those in a nonstandard way.
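The first two ingredients of this procedure, finding the earliest crossing of neighboring characteristics and the slope of the incipient shock, can be sketched as follows. This is an illustrative fragment, not the book's full tracking algorithm, and the function names are our own.

```python
def first_intersection(xs, psi_vals):
    # Characteristics x(t) = x_m + psi(x_m) * t; neighbors m, m+1 cross when
    # psi(x_m) > psi(x_{m+1}), at time t = (x_{m+1} - x_m) / (psi_m - psi_{m+1}).
    t_best, m_best = float("inf"), None
    for m in range(len(xs) - 1):
        gap = psi_vals[m] - psi_vals[m + 1]
        if gap > 0.0:  # converging characteristics
            t = (xs[m + 1] - xs[m]) / gap
            if t < t_best:
                t_best, m_best = t, m
    return t_best, m_best

def shock_slope(u_left, u_right):
    # Condition (11.10): tan(alpha) = (u_left + u_right) / 2
    return 0.5 * (u_left + u_right)
```

For the step data of Figure 11.5 (u = 2 on the left, u = 1 on the right) the two ingredients reproduce the crossing time of neighboring characteristics and the shock speed 3/2.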
11.2.3  Conservative Schemes. The Godunov Scheme
Finite-difference schemes that neither introduce the artificial dissipation (Section 11.2.1), nor explicitly use the condition at the discontinuity (11.10) (Section 11.2.2), must rely on the integral conservation law itself.
Consider two families of straight lines on the plane (x,t): the horizontal lines t = pτ, p = 0, 1, 2, …, and the vertical lines x = (m + 1/2)h, m = 0, ±1, ±2, …. These lines partition the plane into rectangular cells. On the sides of each cell we will mark the respective midpoints, see Figure 11.9, and compose the overall grid D_h of the resulting nodes (we are not showing the coordinate axes in Figure 11.9).
FIGURE 11.9: Grid cell.
The unknown function [u]_h will be defined on the grid D_h. Unlike in many previous examples, when [u]_h was introduced as a mere trace of the continuous exact solution u(x,t) on the grid, here we define [u]_h by averaging the solution u(x,t) over the side of the grid cell (see Figure 11.9) that the given node belongs to:

    [u]_h(x_m, t_p) ≡ u_m^p = (1/h) ∫_{x_{m−1/2}}^{x_{m+1/2}} u(x, t_p) dx,

    [u]_h(x_{m+1/2}, t_{p+1/2}) ≡ u_{m+1/2}^{p+1/2} = (1/τ) ∫_{t_p}^{t_{p+1}} u(x_{m+1/2}, t) dt.
The approximate solution u^(h) of our problem will be defined on the same grid D_h. The values of u^(h) at the nodes (x_m, t_p) of the grid that belong to the horizontal sides of the rectangles, see Figure 11.9, will be denoted by u_m^p, and the values of the solution at the nodes (x_{m+1/2}, t_{p+1/2}) that belong to the vertical sides of the rectangles will be denoted by u_{m+1/2}^{p+1/2}.
Instead of the discrete function u^(h) defined only at the grid nodes (m, p) and (m + 1/2, p + 1/2), let us consider its extension to the family of piecewise constant functions of a continuous argument defined on the horizontal and vertical lines of the grid. In other words, we will think of the value u_m^p as associated with the entire horizontal side {(x,t) | x_{m−1/2} < x < x_{m+1/2}, t = t_p} of the grid cell that the node (x_m, t_p) belongs to, see Figure 11.9. Likewise, the value u_{m+1/2}^{p+1/2} will be defined on the entire vertical grid interval {(x,t) | x = x_{m+1/2}, t_p < t < t_{p+1}}. The relation between the quantities u_m^p and u_{m+1/2}^{p+1/2}, where m = 0, ±1, ±2, … and p = 0, 1, 2, …, will be established based on the integral conservation law (11.1a) for k = 1:
    ∮_Γ [ u dx − (u²/2) dt ] = 0.
Assume that at the initial moment of time u_m^0 > 0, m = 0, ±1, ±2, …; then u_{m+1/2}^{p+1/2} = u_left = u_m^p for all m = 0, ±1, ±2, …, and scheme (11.17) becomes:

    (u_m^{p+1} − u_m^p)/τ + (1/h) [ (u_m^p)²/2 − (u_{m−1}^p)²/2 ] = 0,

where

    u_m^0 = (1/h) ∫_{x_{m−1/2}}^{x_{m+1/2}} ψ(x) dx,

or alternatively:

    (u_m^{p+1} − u_m^p)/τ + ((u_{m−1}^p + u_m^p)/2) · (u_m^p − u_{m−1}^p)/h = 0.
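Assuming positive data, the conservative scheme above (with the upwind flux u²/2 taken from the left) can be sketched as follows. The domain, boundary handling, and time-step factor are illustrative assumptions, and the cell averages of ψ are approximated here by midpoint values rather than exact integrals.

```python
def conservative_burgers(psi, h, x_left, x_right, T):
    # u_m^{p+1} = u_m^p - (tau/h) * [ (u_m^p)^2/2 - (u_{m-1}^p)^2/2 ],  psi > 0 assumed.
    n = int(round((x_right - x_left) / h))
    xs = [x_left + (m + 0.5) * h for m in range(n)]  # cell centers
    u = [psi(x) for x in xs]                         # midpoint approximation of cell averages
    tau = 0.5 * h / max(u)
    for _ in range(int(round(T / tau))):
        f = [0.5 * v * v for v in u]                 # flux u^2/2
        u = [u[0]] + [u[m] - (tau / h) * (f[m] - f[m - 1]) for m in range(1, len(u))]
    return xs, u
```

Because the conservation law with k = 1 is built into the difference equations, for the step data ψ = 2 (x < 0), 1 (x > 0) the computed discontinuity propagates with the correct speed (2 + 1)/2 = 3/2, up to numerical smearing.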
Consequently, linear system (12.30) always has a unique solution. Its solution {c₁, c₂, …, c_N} obtained for a given φ = φ(x,y) determines the approximate solution w_N(x,y) = Σ_{n=1}^N c_n w_n^{(N)} of problem (12.15) with ψ(s) ≡ 0. In Section 12.2.4, we provide a specific example of the basis (12.27) that yields a particular realization of the finite element method based on the approach by Ritz.
In general, it is clear that the quality of approximation by w_N(x,y) = Σ_{n=1}^N c_n w_n^{(N)} is determined by the properties of the approximating space W̊^{(N)} = span{w₁^{(N)}, w₂^{(N)}, …, w_N^{(N)}} spanned by the basis (12.27). The corresponding group of questions, which also leads to the analysis of convergence for finite element approximations, is discussed in Section 12.2.5.
12.2.3  The Galerkin Method
In Section 12.2.1 we saw that the ability to formulate a solvable minimization problem hinges on the positive definiteness of the operator Δ, see formula (12.19). This consideration gives rise to a concern: if the differential operator of the boundary value problem that we are solving is not positive definite, then the corresponding functionals may have no minima, which will render the variational approach of Section 12.2.1 invalid. A typical example when this indeed happens is provided by the Helmholtz equation:

    Δu + k²u = φ(x,y).                                     (12.32a)
This equation governs the propagation of time-harmonic waves that have the wavenumber k > 0 and are driven by the sources φ = φ(x,y). We will consider a Dirichlet boundary value problem for the Helmholtz equation (12.32a) on the domain Ω with the boundary Γ = ∂Ω, and for simplicity we will only analyze the
Discrete Methods for Elliptic Problems
461
homogeneous boundary condition:

    u|_Γ = 0.                                              (12.32b)

Problem (12.32) has a solution u = u(x,y) for any given right-hand side φ(x,y). This solution, however, is not always unique. Uniqueness is only guaranteed if the quantity −k², see equation (12.32a), is not an eigenvalue of the corresponding Dirichlet problem for the Laplace equation. Henceforth, we will assume that this is not the case and accordingly, that problem (12.32) is uniquely solvable. Regarding the eigenvalues of the Laplacian, it is known that all of them are real and negative and that the set of all eigenvalues is countable and has no finite limit points, see, e.g., [TS63]. It is also known that the eigenfunctions that correspond to different eigenvalues can be taken orthonormal.
Denote the Helmholtz operator by L: L = Δ + k²I, and consider the following expression:

    (Lw, w) ≡ (Δw, w) + k²(w, w),   w ∈ W̊.                 (12.33)
We will show that unlike in the case of a pure Laplacian, see formula (12.19), we can make this expression negative or positive by choosing different functions w. First, consider a collection of eigenfunctions² v_j = v_j(x,y) ∈ W̊ of the Laplacian:

    Δv_j = λ_j v_j,

such that for the corresponding eigenvalues we have: Σ_j λ_j + k² < 0. Then, for w = Σ_j v_j the quantity (Lw, w) of (12.33) becomes negative:

    (Lw, w) = (Δ Σ_j v_j, Σ_j v_j) + k² (Σ_j v_j, Σ_j v_j) = Σ_j λ_j (v_j, v_j) + k² Σ_j (v_j, v_j) < 0,
where we used orthonormality of the eigenfunctions: (v_i, v_j) = 0 for i ≠ j and (v_j, v_j) = 1.
On the other hand, among all the eigenvalues of the Laplacian there is clearly the largest (smallest in magnitude): λ₁ < 0, and for all other eigenvalues we have: λ_j < λ₁, j > 1. Suppose that the wavenumber k is sufficiently large so that Σ_{j≤j_k} λ_j + k² > 0, where j_k ≥ 1 depends on k. Then, by taking w as the sum of either a few or all of the corresponding eigenfunctions, we will clearly have (Lw, w) > 0.
We have thus proved that the Helmholtz operator L is neither negative definite nor positive definite on W̊. Moreover, once we have found a particular w ∈ W̊ for which (Lw, w) < 0 and another w ∈ W̊ for which (Lw, w) > 0, then by scaling these functions (multiplying them by appropriate constants) we can make the expression (Lw, w) of (12.33) arbitrarily large negative or arbitrarily large positive.
By analogy with (12.23), let us now introduce a new functional:
    J(w) = −(Lw, w) + 2(φ, w),   w ∈ W̊.                    (12.34)

²It may be just one eigenfunction or more than one.
Assume that u = u(x,y) is the solution of problem (12.32), and represent an arbitrary w ∈ W̊ in the form w = u + ζ, where ζ ∈ W̊. Then we can write:

    J(u + ζ) = −(Δ(u + ζ), u + ζ) − k²(u + ζ, u + ζ) + 2(φ, u + ζ)
             = −(Δu, u) − k²(u, u) + 2(φ, u)                       [= J(u) according to (12.33), (12.34)]
               − (Δu, ζ) − (Δζ, u) − (Δζ, ζ) − k²(u, ζ) − k²(ζ, u) − k²(ζ, ζ) + 2(φ, ζ)
             = J(u) − 2(Δu, ζ) − (Δζ, ζ) − 2k²(u, ζ) − k²(ζ, ζ) + 2(Δu + k²u, ζ)
             = J(u) − (Lζ, ζ),
where we used symmetry of the Laplacian on the space W̊: (Δu, ζ) = (Δζ, u), which immediately follows from the second Green's formula (12.17), and also substituted φ = Δu + k²u from (12.32a).
The equality J(u + ζ) = J(u) − (Lζ, ζ) allows us to conclude that the solution u = u(x,y) of problem (12.32) does not deliver an extreme value to the functional J(w) of (12.34), because the increment −(Lζ, ζ) = −(L(w − u), w − u) can assume both positive and negative values (arbitrarily large in magnitude). Therefore, the variational formulation of Section 12.2.1 does not apply to the Helmholtz problem (12.32), and one cannot use the Ritz method of Section 12.2.2 for approximating its solution.
The foregoing analysis does not imply, of course, that if one variational formulation does not work for a given problem, then others will not work either. The example of the Helmholtz equation does indicate, however, that it may be worth developing a more general method that would apply to problems for which no variational formulation (like the one from Section 12.2.1) is known. A method of this type was proposed by Galerkin in 1916. It is often referred to as the projection method.
We will illustrate the Galerkin method using our previous example of a Dirichlet problem for the Helmholtz equation (12.32a) with a homogeneous boundary condition (12.32b). Consider the same basis (12.27) as we used for the Ritz method. As before, we will be looking for an approximate solution to problem (12.32) in the form of a linear combination:

    w_N(x,y) ≡ w_N(x,y, c₁, c₂, …, c_N) = Σ_{n=1}^N c_n w_n^{(N)}.        (12.35)
where DN (X, y) == DN (X, y, cl , C2, ' " , CN) is the corresponding residual. This residual can only be equal to zero everywhere on Q if WN (X,y) were the exact solution. Oth erwise, the residual is not zero, and to obtain a good approximation to the exact solution we need to minimize DN (X, y) in some particular sense.
If, for example, we were able to claim that δ_N(x,y) was orthogonal to all the functions in W̊ in the sense of the standard scalar product (·,·), then the residual would vanish identically on Ω, and accordingly, w_N(x,y) would coincide with the exact solution. This, however, is an ideal situation that cannot be routinely realized in practice because there are too few free parameters c₁, c₂, …, c_N available in formula (12.35). Instead, we will require that the residual δ_N(x,y) be orthogonal not to the entire space but rather to the same basis functions w_m^{(N)} ∈ W̊, m = 1, 2, …, N, as used for building the linear combination (12.35):

    (δ_N, w_m^{(N)}) = 0,   m = 1, 2, …, N.
In other words, we require that the projection of the residual δ_N onto the space W^{(N)} be equal to zero. This yields:

    ∫∫_Ω (∂²w_N/∂x² + ∂²w_N/∂y²) w_m^{(N)} dxdy + k² ∫∫_Ω w_N w_m^{(N)} dxdy = ∫∫_Ω φ w_m^{(N)} dxdy,
    m = 1, 2, …, N.
With the help of the first Green's formula (12.16), and taking into account that all the functions are in W̊, the previous system can be rewritten as follows:

    −(w_N, w_m^{(N)})′ + k² (w_N, w_m^{(N)}) = (φ, w_m^{(N)}),   m = 1, 2, …, N.

We therefore arrive at the following system of N linear algebraic equations with respect to the unknowns c₁, c₂, …, c_N:

    Σ_{n=1}^N c_n [ −(w_n^{(N)}, w_m^{(N)})′ + k² (w_n^{(N)}, w_m^{(N)}) ] = (φ, w_m^{(N)}),
    m = 1, 2, …, N.                                        (12.36)
Substituting its solution into formula (12.35), we obtain an approximate solution to the Dirichlet problem (12.32). As in the case of the Ritz method, the quality of this approximation, i.e., the error as it depends on the dimension N, is determined by the properties of the approximating space W̊^{(N)} = span{w₁^{(N)}, w₂^{(N)}, …, w_N^{(N)}} spanned by the basis (12.27). In this regard, it is important to emphasize that different bases can be selected in one and the same space, and the accuracy of the approximation does not depend on the choice of the individual basis, see Section 12.2.5 for more detail.
Let us also note that when the wavenumber k = 0, the Helmholtz equation (12.32a) reduces to the Poisson equation of (12.15). In this case, the linear system of the Galerkin method (12.36) automatically reduces to the Ritz system (12.30).
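To make the mechanics of the Galerkin system concrete, here is a small illustration in one space dimension rather than on the domain Ω (an assumption made purely for brevity, not the book's setting): for u″ + k²u = φ on (0, π) with u(0) = u(π) = 0 and the sine basis w_n(x) = sin nx, the energy and standard products are diagonal, (w_n, w_n)′ = (π/2)n² and (w_n, w_n) = π/2, so the analogue of system (12.36) decouples:

```python
import math

def galerkin_helmholtz_1d(phi, k, N, quad=2000):
    # Solve sum_n c_n [ -(w_n, w_m)' + k^2 (w_n, w_m) ] = (phi, w_m)
    # for the sine basis w_n = sin(n x) on (0, pi); the system is diagonal here.
    h = math.pi / quad
    def dot(n):  # (phi, w_n) by midpoint quadrature
        return h * sum(phi((i + 0.5) * h) * math.sin(n * ((i + 0.5) * h))
                       for i in range(quad))
    return [dot(n) / ((math.pi / 2.0) * (k * k - n * n)) for n in range(1, N + 1)]
```

With k = 2.5 (so that k² is not one of the eigenvalues n², ensuring unique solvability) and φ = (k² − 9) sin 3x, whose exact solution is u = sin 3x, the method recovers c₃ = 1 and c_n = 0 otherwise.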
12.2.4  An Example of Finite Element Discretization
A frequently used approach to constructing finite element discretizations for two-dimensional problems involves unstructured triangular grids. An example of a grid of this type is shown in Figure 12.1.
FIGURE 12.1: Unstructured triangular grid.
All cells of the grid have triangular shape. The grid is obtained by first approximating the boundary Γ of the domain Ω by a closed polygonal line with the vertexes on Γ, and then by partitioning the resulting polygon into a collection of triangles. The triangles of the grid never overlap. Each pair of triangles either do not intersect at all, or have a common vertex, or have a common side. Suppose that altogether there are N vertexes inside the domain Ω but not on the boundary Γ, where a common vertex of several triangles is counted only once. Those vertexes will be the nodes of the grid; we will denote them z_m^{(N)}, m = 1, 2, …, N.
Once the triangular grid has been built, see Figure 12.1, we can specify the basis functions (12.27). For every function w_n^{(N)} = w_n^{(N)}(x,y), n = 1, 2, …, N, we first define its values at the grid nodes:

    w_n^{(N)}(z_m^{(N)}) = { 1,  n = m,
                           { 0,  n ≠ m,      n, m = 1, 2, …, N.
In addition, at every vertex that happens to lie on the boundary Γ we set w_n^{(N)} = 0. This way, the function w_n^{(N)} gets defined at all vertexes of all triangles of the grid. Next, inside each triangle we define it as a linear interpolant, i.e., as a linear function of two variables, x and y, that assumes the given values at all three vertexes of the triangle. Finally, outside of the entire system of triangles the function w_n^{(N)} is set to be equal to zero. Note that according to our construction, the function w_n^{(N)} is equal to zero on those triangles of the grid that do not have the given z_n^{(N)} as their vertex. On those triangles that do have z_n^{(N)} as one of their vertexes, the function w_n^{(N)} can be represented as a fragment of the plane that crosses through the point in the 3D space that has elevation 1 precisely above z_n^{(N)} and through the side of the triangle opposite to the vertex z_n^{(N)}. The finite elements obtained with the help of these basis
functions are often referred to as piecewise linear.
In either the Ritz or the Galerkin method, the approximate solution is to be sought in the form of a linear combination:

    w_N(x,y) = Σ_{n=1}^N c_n w_n^{(N)}(x,y).

In doing so, at a given grid node z_n^{(N)} only one basis function, w_n^{(N)}, is equal to one whereas all other basis functions are equal to zero. Therefore, we can immediately see that c_n = w_N(z_n^{(N)}). In other words, the unknowns to be determined when computing the solution by piecewise linear finite elements are the same as in the case of finite differences. They are the values of the approximate solution w_N(x,y) at the grid nodes z_n^{(N)}, n = 1, 2, …, N. Once the piecewise linear basis functions have been constructed, the Ritz system of equations (12.30) for solving the Poisson problem (12.15) can be written as:
    Σ_{n=1}^N w_N(z_n^{(N)}) (w_m^{(N)}, w_n^{(N)})′ = −(φ, w_m^{(N)}),   m = 1, 2, …, N,        (12.37)
and the Galerkin system of equations (12.36) for solving the Helmholtz problem (12.32) can be written as:
    Σ_{n=1}^N w_N(z_n^{(N)}) [ −(w_n^{(N)}, w_m^{(N)})′ + k² (w_n^{(N)}, w_m^{(N)}) ] = (φ, w_m^{(N)}),
    m = 1, 2, …, N.                                        (12.38)
Linear systems (12.37) and (12.38) provide specific examples of implementation of the finite element method for two-dimensional elliptic boundary value problems.
The coefficients of systems (12.37) and (12.38) are given by scalar products [energy product (·,·)′ and standard product (·,·)] of the basis functions. It is clear that only those numbers (w_n^{(N)}, w_m^{(N)})′ and (w_n^{(N)}, w_m^{(N)}) can differ from zero for which the grid nodes z_n^{(N)} and z_m^{(N)} happen to be vertexes of one and the same triangle. Indeed, if the nodes z_n^{(N)} and z_m^{(N)} are not neighboring in this sense, then the regions where w_n^{(N)} ≠ 0 and w_m^{(N)} ≠ 0 do not intersect, and consequently:
    (w_m^{(N)}, w_n^{(N)})′ = ∫∫_Ω [ ∂w_m^{(N)}/∂x · ∂w_n^{(N)}/∂x + ∂w_m^{(N)}/∂y · ∂w_n^{(N)}/∂y ] dxdy = 0        (12.39a)

and

    (w_m^{(N)}, w_n^{(N)}) = ∫∫_Ω w_m^{(N)} w_n^{(N)} dxdy = 0.        (12.39b)
In other words, equation number m of system (12.37) [or, likewise, of system (12.38)] connects the value of the unknown function w_N(z_m^{(N)}) with its values w_N(z_n^{(N)}) only at the neighboring nodes z_n^{(N)}. This is another similarity with conventional finite-difference schemes, for which every equation of the resulting system connects the values of the solution at the nodes of a given stencil.
The coefficients of systems (12.37) and (12.38) are rather easy to evaluate. Indeed, either integral (12.39a) or (12.39b) is, in fact, an integral over only a pair of grid triangles that have the interval [z_m^{(N)}, z_n^{(N)}] as their common side. Moreover, the integral over each triangle is completely determined by its own geometry, i.e., by the length of its sides, but is not affected by the orientation of the triangle. The integrand in (12.39a) is constant inside each triangle; it can be interpreted as the dot product of two gradient vectors (normals to the planes that represent the linear functions w_m^{(N)} and w_n^{(N)}):

    ∂w_m^{(N)}/∂x · ∂w_n^{(N)}/∂x + ∂w_m^{(N)}/∂y · ∂w_n^{(N)}/∂y = (grad w_m^{(N)}, grad w_n^{(N)}).
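For a single triangle, the gradients of the three local basis functions and the resulting contributions to the energy products (12.39a) can be computed in closed form. The sketch below is an illustration with hypothetical vertex coordinates, not code from the book; it returns the 3x3 element matrix area * (grad w_i, grad w_j):

```python
def element_stiffness(p1, p2, p3):
    # Piecewise linear basis on the triangle p1, p2, p3: w_i = 1 at vertex i,
    # 0 at the other two vertexes. grad w_i is constant inside the triangle.
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)  # 2 * signed area
    area = abs(d) / 2.0
    grads = [((y2 - y3) / d, (x3 - x2) / d),
             ((y3 - y1) / d, (x1 - x3) / d),
             ((y1 - y2) / d, (x2 - x1) / d)]
    # the integrand of (12.39a) is constant, so each integral is (grad . grad) * area
    return [[area * (gi[0] * gj[0] + gi[1] * gj[1]) for gj in grads] for gi in grads]
```

On the unit right triangle (0,0), (1,0), (0,1) this gives the familiar element matrix with diagonal (1, 1/2, 1/2); each row sums to zero because the three basis functions add up to the constant 1, whose gradient vanishes.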
The value of the integral (12.39a) is then obtained by multiplying the previous quantity by the area of the triangle. Integral (12.39b) is an integral of a quadratic function inside each triangle; it can also be evaluated with no difficulties. This completes the construction of a finite element discretization for the specific choice of the grid and the basis functions that we have made.

12.2.5  Convergence of Finite Element Approximations
Consider the sequence of approximations w_N(x,y) ≡ w_N(x,y, c₁, c₂, …, c_N) ∈ W̊, N = 1, 2, …, generated by the Ritz method. According to formula (12.26), this sequence will converge to the solution u = u(x,y) of problem (12.15) with ψ(s) ≡ 0 in the Sobolev norm (12.14) provided that J(w_N) → J(u) as N → ∞:

    ‖w_N − u‖²_W ≤ const · [J(w_N) − J(u)].                (12.40)

The minuend on the right-hand side of inequality (12.40) is defined as follows:

    J(w_N) = inf_{w∈W̊^{(N)}} J(w),
where W̊^{(N)} is the span of the basis functions (12.27). Consequently, we can write:

    ‖w_N − u‖²_W ≤ const · inf_{w∈W̊^{(N)}} [J(w) − J(u)] ≡ const · inf_{w∈W̊^{(N)}} ‖w − u‖′²,        (12.41)
where ‖·‖′ is the energy norm on W̊. In other words, we conclude that convergence of the Ritz method depends on what is the best approximation of the solution u by the elements of the subspace W̊^{(N)}. In doing so, the accuracy of approximation is to be measured in the energy norm. A similar result also holds for the Galerkin method.
Of course, the question of how large the right-hand side of inequality (12.41) may actually be does not have a direct answer, because the solution u = u(x,y) is not known ahead of time. Therefore, the best we can do for evaluating this right-hand side is first to assume that the solution u belongs to some class of functions U ⊂ W̊, and then try and narrow down this class as much as possible using all a priori information about the solution that is available. Often, the class U can be characterized in terms of smoothness, because the regularity of the data in the problem enables a certain degree of smoothness in the solution. For example, recall that the solution of problem (12.15) was assumed twice continuously differentiable. In this case, we can say that U is the class of all functions that are equal to zero at Γ and also have continuous second derivatives on Ω. Once we have identified the maximally narrow class of functions U that contains the solution u, we can write instead of estimate (12.41):
    ‖w_N − u‖²_W ≤ const · sup_{v∈U} inf_{w∈W̊^{(N)}} ‖w − v‖′².        (12.42)
Regarding this inequality, a natural expectation is that the narrower the class U, the closer the value on the right-hand side of (12.42) to that on the right-hand side of (12.41). Next, we realize that the value on the right-hand side of inequality (12.42) depends on the choice of the approximating space W̊^{(N)}, and the best possible value therefore corresponds to:
    κ_N(U, W̊) ≝ inf_{W̊^{(N)}⊂W̊} sup_{v∈U} inf_{w∈W̊^{(N)}} ‖w − v‖′.        (12.43)
The quantity K_N(U, W̊) is called the N-dimensional Kolmogorov diameter of the set U with respect to the space W̊ in the sense of the energy norm ‖·‖'. We have first encountered this concept in Section 2.2.4, see formula (2.39) on page 41. Note that the norm in the definition of the Kolmogorov diameter is not squared, unlike in formula (12.42). Note also that the outermost minimization in (12.43) is performed with respect to the N-dimensional space W̊(N) as a whole, and its result does not depend on the choice of a specific basis (12.27) in W̊(N). One and the same space
can be equipped with different bases.

Kolmogorov diameters and related concepts are widely used in the theory of approximation, as well as in many other branches of mathematics. In the context of finite elements they provide optimal, i.e., unimprovable, estimates for the convergence rate of the method. Once the space W̊, the norm ‖·‖', and the subspace U ⊂ W̊ have been fixed, the diameter (12.43) depends on the dimension N. Then, the optimal convergence rate of a finite element approximation is determined by how rapidly the diameter K_N(U, W̊) decays as N → ∞.
Of course, in every given case
there is no guarantee that a particular choice of W̊(N) will yield a rate that would come anywhere close to the theoretical optimum provided by the Kolmogorov diameter K_N(U, W̊). In other words, the rate of convergence for a specific Ritz or Galerkin approximation may be slower than optimal³ as N → ∞. Consequently, it is the skill and experience of a numerical analyst that are required for choosing the approximating space W̊(N) in such a way that the actual quantity:

$$\sup_{v\in U}\,\inf_{w\in \mathring{W}^{(N)}} \|w - v\|'$$

that controls the right-hand side of (12.42) would not be much larger than the Kolmogorov diameter K_N(U, W̊) of (12.43). In many particular situations, the Kolmogorov diameters have been computed.
For example, for the setup analyzed in this section, when the space W̊ contains all piecewise continuously differentiable functions on Ω ⊂ ℝ² equal to zero at Γ = ∂Ω, the class U contains all twice continuously differentiable functions from W̊, and the norm is the energy norm, we have:
$$K_N(U, \mathring{W}) = \mathcal{O}(N^{-1/2}). \qquad (12.44)$$

Under certain additional conditions (nonrestrictive), it is also possible to show that the piecewise linear elements constructed in Section 12.2.4 converge with the same asymptotic rate O(N^{-1/2}) when N increases, see, e.g., [GR87, § 39]. As such, the piecewise linear finite elements guarantee the optimal convergence rate of the Ritz method for problem (12.15), assuming that nothing else can be said about the solution u = u(x,y) except that it has continuous second derivatives on Ω.

Note that as we have a total of N grid nodes on a two-dimensional domain Ω, for a regular uniform grid it would have implied that the grid size is h = O(N^{-1/2}). In other words, the convergence rate of the piecewise linear finite elements appears to be O(h), and the optimal unimprovable rate (12.44) is also O(h). At first glance, this seems to be a deterioration of convergence compared, e.g., to the standard central difference scheme of Section 12.1, which converges with the rate O(h^2). This, however, is not the case. In Section 12.1, we measured the convergence in a norm that did not contain the derivatives of the solution, whereas in this section we are using the Sobolev norm (12.14) that contains the first derivatives.

Finally, recall that the convergence rates can be improved by selecting the approximating space that would be right for the problem, as well as by narrowing down the class of functions U that contains the solution u. These two strategies can be combined and implemented together in the framework of one adaptive procedure. Adaptive methods represent a class of rapidly developing, modern and efficient, approaches to finite element approximations. Both the size/shape of the elements, as

³Even though formula (12.42) is written as an inequality, the norms ‖·‖_W̊ and ‖·‖' are, in fact, equivalent.
Discrete Methods for Elliptic Problems
469
well as their order (beyond linear), can be controlled. Following a special (multigrid-type) algorithm, the elements are adapted dynamically, i.e., in the course of computation. For example, by refining the grid and/or increasing the order locally, these elements can very accurately approximate sharp variations in the solution. In other words, as soon as those particular areas of sharp variation are identified inside Ω, the class U becomes narrower, and at the same time, the elements provide a better, "fine-tuned," basis for approximation. Convergence of these adaptive finite element approximations may achieve spectral rates. We refer the reader to the recent monograph by Demkowicz for further detail [Dem06].

To conclude this chapter, we will compare the method of finite elements with the method of finite differences from the standpoint of how convergence of each method is established. Recall, the study of convergence for finite-difference approximations consists of analyzing two independent properties, consistency and stability, see Theorem 10.1 on page 314. In the context of finite elements, consistency as defined in Section 10.1.2 or 12.1.1 (small truncation error on smooth solutions) is no longer needed for proving convergence. Instead, we need approximation of the class U ∋ u by the functions from W̊(N), see estimate (12.42). Stability for finite elements shall be understood as good conditioning (uniform with respect to N) of the Ritz system matrix (12.31) or of a similar Galerkin matrix, see (12.36). The ideal case here, which cannot normally be realized in practice, is when the basis w_n^{(N)}, n = 1, 2, ..., N, is orthonormal. Then, the matrix A becomes an identity matrix. Stability still remains important when computing with finite elements, although not for justifying convergence, but for being able to disregard the small round-off errors.
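The cost of one order of accuracy when the error is measured in a norm containing first derivatives can be observed in a simple one-dimensional interpolation experiment; a minimal sketch (illustrative only, with u(x) = sin πx chosen for convenience):

```python
import numpy as np

def interp_errors(n):
    """Piecewise linear interpolation of u(x) = sin(pi x) on [0, 1] with n
    subintervals; returns the max-norm and H1-seminorm interpolation errors."""
    xf = np.linspace(0.0, 1.0, 40 * n + 1)   # fine grid aligned with the nodes
    xn = np.linspace(0.0, 1.0, n + 1)        # interpolation nodes
    err = np.sin(np.pi * xf) - np.interp(xf, xn, np.sin(np.pi * xn))
    emax = np.max(np.abs(err))
    d = np.diff(err) / np.diff(xf)           # derivative of the error on the fine grid
    eh1 = np.sqrt(np.sum(d ** 2 * np.diff(xf)))
    return emax, eh1

emax1, eh11 = interp_errors(32)
emax2, eh12 = interp_errors(64)
print(emax1 / emax2, eh11 / eh12)   # roughly 4 and roughly 2
```

Halving h gains two orders of two in the maximum norm but only one in the seminorm that involves the first derivative — the same bookkeeping that makes the O(h) energy-norm rate of piecewise linear elements unimprovable.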
Exercises
1. Prove that the quadratic form of the Ritz method, i.e., the first term on the right-hand side of (12.28), is indeed positive definite. Hint. Apply the Friedrichs inequality (12.25) to $w_N(x,y) = \sum_{n=1}^{N} c_n w_n^{(N)}$.

2. Consider a Dirichlet problem for the elliptic equation with variable coefficients:

$$\frac{\partial}{\partial x}\left(a(x,y)\frac{\partial u}{\partial x}\right) + \frac{\partial}{\partial y}\left(b(x,y)\frac{\partial u}{\partial y}\right) = \varphi(x,y), \quad (x,y)\in\Omega, \qquad u|_\Gamma = \psi(s), \quad \Gamma = \partial\Omega,$$

where a(x,y) ≥ a₀ > 0 and b(x,y) ≥ b₀ > 0. Prove that its solution minimizes the functional:

$$J(w) = \iint_\Omega \left[ a\left(\frac{\partial w}{\partial x}\right)^2 + b\left(\frac{\partial w}{\partial y}\right)^2 + 2\varphi w \right] dx\,dy$$

on the set of all functions w ∈ W that satisfy the boundary condition: $w|_\Gamma = \psi(s)$.

3. Let ψ(s) ≡ 0 in the boundary value problem of Exercise 2. Apply the Ritz method and obtain the corresponding system of linear algebraic equations.

4. Let ψ(s) ≡ 0 in the boundary value problem of Exercise 2. Apply the Galerkin method and obtain the corresponding system of linear algebraic equations.
470
A Theoretical Introduction to Numerical Analysis
5. Let the geometry of the two neighboring triangles that have the common side $[z_m^{(N)}, z_n^{(N)}]$ be known. Evaluate the coefficients (12.39a) and (12.39b) for the finite element systems (12.37) and (12.38).
6. Consider problem (12.15) with ψ(s) ≡ 0 on a square domain Ω. Introduce a uniform Cartesian grid with square cells on the domain Ω. Partition each cell into two right triangles by the diagonal; in doing so, use the same orientation of the diagonal for all cells. Apply the Ritz method on the resulting triangular grid and show that it is equivalent to the central-difference scheme (12.5), (12.6).
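The equivalence asserted in Exercise 6 can also be checked numerically. The sketch below (an illustration under the stated setup, not a worked solution from the text) assembles the piecewise linear (P1) stiffness matrix on a uniformly triangulated unit square with all diagonals oriented the same way, and inspects one interior row: the diagonal entry is 4, the four grid neighbors get −1, and the coupling to the neighbor across the cell diagonal vanishes — the five-point stencil pattern of scheme (12.5), (12.6) up to scaling.

```python
import numpy as np

def p1_stiffness(n):
    """P1 stiffness matrix of -Laplace on the unit square: n x n cells,
    each cell split into two right triangles by the same diagonal."""
    h = 1.0 / n
    idx = lambda i, j: j * (n + 1) + i

    def local_K(p):
        # Standard P1 element matrix: K_ab = (b_a*b_b + c_a*c_b) / (4*area).
        b = np.array([p[1, 1] - p[2, 1], p[2, 1] - p[0, 1], p[0, 1] - p[1, 1]])
        c = np.array([p[2, 0] - p[1, 0], p[0, 0] - p[2, 0], p[1, 0] - p[0, 0]])
        area = 0.5 * abs((p[1, 0] - p[0, 0]) * (p[2, 1] - p[0, 1])
                         - (p[2, 0] - p[0, 0]) * (p[1, 1] - p[0, 1]))
        return (np.outer(b, b) + np.outer(c, c)) / (4.0 * area)

    A = np.zeros(((n + 1) ** 2, (n + 1) ** 2))
    for j in range(n):
        for i in range(n):
            corners = [idx(i, j), idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)]
            pts = h * np.array([[i, j], [i + 1, j], [i + 1, j + 1], [i, j + 1]], float)
            for tri in ([0, 1, 2], [0, 2, 3]):        # same diagonal in every cell
                K = local_K(pts[list(tri)])
                for a in range(3):
                    for b in range(3):
                        A[corners[tri[a]], corners[tri[b]]] += K[a, b]
    return A

n = 4
A = p1_stiffness(n)
p = 2 * (n + 1) + 2                                   # interior node (2, 2)
print(A[p, p], A[p, p + 1], A[p, p + n + 1], A[p, p + n + 2])
# -> approximately 4, -1, -1, 0: the five-point stencil; diagonal coupling vanishes
```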
Part IV The Methods of Boundary Equations for the Numerical Solution of Boundary Value Problems
The finite-difference methods of Part III can be used for solving a wide variety of initial and boundary value problems for ordinary and partial differential equations. Still, in some cases these methods may encounter serious difficulties. In particular, finite differences may appear inconvenient for accommodating computational domains of an irregular shape. Moreover, finite differences may not always be easy to apply for the approximation of general boundary conditions, e.g., boundary conditions of different types on different portions of the boundary, or nonlocal boundary conditions. The reason is that guaranteeing stability in those cases may require a considerable additional effort, and also that the resulting system of finite-difference equations may be difficult to solve. Furthermore, another serious hurdle for the application of finite-difference methods is presented by the problems on unbounded domains, when the boundary conditions are specified at infinity.

In some cases the foregoing difficulties can be overcome by reducing the original problem formulated on a given domain to an equivalent problem formulated only on its boundary. In doing so, the original unknown function (or vector function) on the domain obviously gets replaced by some new unknown function(s) on the boundary. A very important additional benefit of adopting this methodology is that the geometric dimension of the problem decreases by one (the boundary of a two-dimensional domain is a one-dimensional curve, and the boundary of a three-dimensional domain is a two-dimensional surface).

In this part of the book, we will discuss two approaches to reducing the problem from its domain to the boundary. We will also study the techniques that can be used for the numerical solution of the resulting boundary equations. The first method of reducing a given problem from the domain to the boundary is the method of boundary integral equations of classical potential theory.
It dates back to the work of Fredholm done in the beginning of the twentieth century. Discretization and numerical solution of the resulting boundary integral equations are performed by means of special quadrature formulae in the framework of the method of boundary elements. We briefly discuss it in Chapter 13.

The second method of reducing a given problem to the boundary of its domain leads to boundary equations of a special structure. These equations contain projection operators first proposed in 1963 by Calderon [Cal63]. In Chapter 14, we describe a particular modification of Calderon's boundary equations with projections. It enables a straightforward discretization and numerical solution of these equations by the method of difference potentials. The latter was proposed by Ryaben'kii in 1969 and subsequently summarized in [Rya02]. The boundary equations with projections and the method of difference potentials can be applied to a number of cases when the classical boundary integral equations and the method of boundary elements will not work.

Hereafter, we will only describe the key ideas that provide a foundation for the method of boundary elements and for the difference potentials method. We will also identify their respective domains of applicability, and quote the literature where a more comprehensive description can be found.
Chapter 13

Boundary Integral Equations and the Method of Boundary Elements
In this chapter, we provide a very brief account of classical potential theory and show how it can help reduce a given boundary value problem to an equivalent integral equation at the boundary of the original domain. We also address the issue of discretization for the corresponding integral equations, and identify the difficulties that limit the class of problems solvable by the method of boundary elements.
13.1
Reduction of Boundary Value Problems to Integral Equations
To illustrate the key concepts, it will be sufficient to consider the interior and exterior Dirichlet and Neumann boundary value problems for the Laplace equation:

$$\Delta u \equiv \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2} + \frac{\partial^2 u}{\partial x_3^2} = 0.$$

Let Ω be a bounded domain of the three-dimensional space ℝ³, and assume that its boundary Γ = ∂Ω is sufficiently smooth. Let also Ω₁ be the complementary domain: Ω₁ = ℝ³\Ω̄. Consider the following four problems:

$$\Delta u = 0, \; x\in\Omega, \qquad u|_\Gamma = \varphi(x)|_{x\in\Gamma}, \qquad (13.1a)$$
$$\Delta u = 0, \; x\in\Omega, \qquad \left.\frac{\partial u}{\partial n}\right|_\Gamma = \varphi(x)|_{x\in\Gamma}, \qquad (13.1b)$$
$$\Delta u = 0, \; x\in\Omega_1, \qquad u|_\Gamma = \varphi(x)|_{x\in\Gamma}, \qquad (13.1c)$$
$$\Delta u = 0, \; x\in\Omega_1, \qquad \left.\frac{\partial u}{\partial n}\right|_\Gamma = \varphi(x)|_{x\in\Gamma}, \qquad (13.1d)$$
where x = (x₁, x₂, x₃) ∈ ℝ³, n is the outward normal to Γ, and φ(x) is a given function for x ∈ Γ. Problems (13.1a) and (13.1c) are the interior and exterior Dirichlet problems, respectively, and problems (13.1b) and (13.1d) are the interior and exterior Neumann problems, respectively. For the exterior problems (13.1c) and (13.1d), we also need to specify the desired behavior of the solution at infinity:

$$u(x) \to 0 \quad \text{as} \quad |x| \equiv (x_1^2 + x_2^2 + x_3^2)^{1/2} \to \infty. \qquad (13.2)$$
In the courses of partial differential equations (see, e.g., [TS63]), it is proven that the interior and exterior Dirichlet problems (13.1a) and (13.1c), as well as the exterior Neumann problem (13.1d), always have a unique solution. The interior Neumann problem (13.1b) is only solvable if

$$\int_\Gamma \varphi\,ds = 0, \qquad (13.3)$$
where ds is the area element on the surface Γ = ∂Ω. In case equality (13.3) is satisfied, the solution of problem (13.1b) is determined up to an arbitrary additive constant.

For the transition from boundary value problems (13.1) to integral equations, we use the fundamental solution of the Laplace operator, which is also referred to as the free-space Green's function:

$$G(x) = -\frac{1}{4\pi}\cdot\frac{1}{|x|}.$$

It is known that every solution of the Poisson equation:
$$\Delta u = f(x)$$

that vanishes at infinity, see (13.2), and is driven by a compactly supported right-hand side f(x), can be represented by the formula:

$$u(x) = \int_{\mathbb{R}^3} G(x-y)\,f(y)\,dy,$$

where the integration in this convolution integral is to be performed over the entire ℝ³. In reality, of course, it only needs to be done over the region where f(y) ≠ 0.
As boundary value problems (13.1) are driven by the data φ(x), x ∈ Γ, the corresponding boundary integral equations are convenient to obtain using the single-layer potential:

$$v(x) = \int_\Gamma G(x-y)\,\rho(y)\,ds_y \qquad (13.4)$$

and the double-layer potential:

$$w(x) = \int_\Gamma \frac{\partial G(x-y)}{\partial n_y}\,\sigma(y)\,ds_y. \qquad (13.5)$$

Both potentials are given by convolution-type integrals. The quantity ds_y in formulae (13.4) and (13.5) denotes the area element on the surface Γ = ∂Ω, while n_y is the normal to Γ that originates at y ∈ Γ and points toward the exterior, i.e., toward Ω₁. The functions ρ = ρ(y) and σ = σ(y), y ∈ Γ, in formulae (13.4) and (13.5) are called the densities of the single-layer potential and double-layer potential, respectively. It is easy to see that the fundamental solution G(x) satisfies the Laplace equation ΔG = 0 for x ≠ 0. This implies that the potentials v(x) and w(x) satisfy the Laplace equation for x ∉ Γ, i.e., that they are harmonic functions on Ω and on Ω₁. One can say that the families of harmonic functions given by formulae (13.4) and (13.5) are parameterized by the densities ρ = ρ(y) and σ = σ(y) specified for y ∈ Γ.
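As a quick numerical illustration of the single-layer potential (13.4), consider a constant density ρ ≡ 1 on the unit sphere and evaluate the surface integral by a midpoint quadrature in spherical angles. With the kernel written simply as 1/|x − y| (the sign and 1/4π conventions for G set aside), the result must coincide with the potential of a point charge of strength 4π placed at the origin: the constant 4π inside the sphere and 4π/|x| outside. A minimal sketch:

```python
import numpy as np

def single_layer_unit_sphere(x, n_theta=80, n_phi=160):
    """v(x) = integral over the unit sphere of 1/|x - y| ds_y (density = 1),
    approximated by the midpoint rule in colatitude and longitude."""
    th = (np.arange(n_theta) + 0.5) * np.pi / n_theta
    ph = (np.arange(n_phi) + 0.5) * 2.0 * np.pi / n_phi
    TH, PH = np.meshgrid(th, ph, indexing="ij")
    y = np.stack([np.sin(TH) * np.cos(PH),
                  np.sin(TH) * np.sin(PH),
                  np.cos(TH)], axis=-1)
    w = np.sin(TH) * (np.pi / n_theta) * (2.0 * np.pi / n_phi)  # area elements
    r = np.linalg.norm(y - np.asarray(x, dtype=float), axis=-1)
    return np.sum(w / r)

inside = single_layer_unit_sphere([0.0, 0.0, 0.3])    # constant 4*pi inside
outside = single_layer_unit_sphere([0.0, 0.0, 2.0])   # 4*pi / |x| = 2*pi outside
```

That the potential is constant throughout the interior is consistent with v being harmonic off the surface Γ.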
Solutions of the Dirichlet problems (13.1a) and (13.1c) are to be sought in the form of a double-layer potential (13.5), whereas solutions to the Neumann problems (13.1b) and (13.1d) are to be sought in the form of a single-layer potential (13.4). Then, the so-called Fredholm integral equations of the second kind can be obtained for the unknown densities of the potentials (see, e.g., [TS63]). These equations read as follows. For the interior Dirichlet problem (13.1a):

$$\sigma(x) - \frac{1}{2\pi}\int_\Gamma \sigma(y)\,\frac{\partial}{\partial n_y}\frac{1}{|x-y|}\,ds_y = -\frac{1}{2\pi}\,\varphi(x), \quad x\in\Gamma, \qquad (13.6a)$$

for the interior Neumann problem (13.1b):

$$\rho(x) + \frac{1}{2\pi}\int_\Gamma \rho(y)\,\frac{\partial}{\partial n_x}\frac{1}{|x-y|}\,ds_y = \frac{1}{2\pi}\,\varphi(x), \quad x\in\Gamma, \qquad (13.6b)$$

for the exterior Dirichlet problem (13.1c):

$$\sigma(x) + \frac{1}{2\pi}\int_\Gamma \sigma(y)\,\frac{\partial}{\partial n_y}\frac{1}{|x-y|}\,ds_y = \frac{1}{2\pi}\,\varphi(x), \quad x\in\Gamma, \qquad (13.6c)$$

and for the exterior Neumann problem (13.1d):

$$\rho(x) - \frac{1}{2\pi}\int_\Gamma \rho(y)\,\frac{\partial}{\partial n_x}\frac{1}{|x-y|}\,ds_y = -\frac{1}{2\pi}\,\varphi(x), \quad x\in\Gamma. \qquad (13.6d)$$

In the framework of the classical potential theory, integral equations (13.6a) and (13.6d) are shown to be uniquely solvable; their respective solutions σ(x) and ρ(x) exist for any given function φ(x), x ∈ Γ. The situation is different for equations (13.6b) and (13.6c). Equation (13.6b) is only solvable if equality (13.3) holds. The latter constraint reflects the nature of the interior Neumann problem (13.1b), which also has a solution only if the additional condition (13.3) is satisfied. It turns out, however, that integral equation (13.6c) is not solvable for an arbitrary φ(x) either, even though the exterior Dirichlet problem (13.1c) always has a unique solution. As such, transition from the boundary value problem (13.1c) to the integral equation (13.6c) appears inappropriate.

The loss of solvability for the integral equation (13.6c) can be explained. When we look for a solution to the exterior Dirichlet problem (13.1c) in the form of a double-layer potential (13.5), we essentially require that the solution u(x) decay at infinity as O(|x|^{-2}), because it is known that w(x) = O(|x|^{-2}) as |x| → ∞. However, the original formulation (13.1c), (13.2) only requires that the solution vanish at infinity. In other words, there may be solutions that satisfy (13.1c), (13.2), yet decay slower than O(|x|^{-2}) when |x| → ∞ and as such, are not captured by the integral equation (13.6c). This causes equation (13.6c) to lose solvability for some φ(x).
Using the language of classical potentials, the loss of solvability for the integral equation (13.6c) can also be related to the so-called resonance of the interior domain. The phenomenon of resonance manifests itself by the existence of a nontrivial (i.e., nonzero) solution to the interior Neumann problem (13.1b) driven by the zero boundary data: φ(x) ≡ 0, x ∈ Γ. Interpretation in terms of resonances is just another, equivalent, way of saying that the exterior Dirichlet problem (13.1c), (13.2) may have solutions that cannot be represented as a double-layer potential (13.5) and as such, will not satisfy equation (13.6c). Note that in this particular case equation (13.6c) can be easily "fixed," i.e., replaced by a slightly modified integral equation that will be fully equivalent to the original problem (13.1c), (13.2), see, e.g., [TS63].

Instead of the Laplace equation Δu = 0, let us now consider the Helmholtz equation on ℝ³:

$$\Delta u + k^2 u = 0$$

that governs the propagation of time-harmonic waves with the wavenumber k > 0. The Helmholtz equation needs to be supplemented by the Sommerfeld radiation boundary conditions at infinity that replace condition (13.2):

$$\frac{\partial u(x)}{\partial |x|} - iku(x) = o(|x|^{-1}), \quad |x| \to \infty. \qquad (13.7)$$

The fundamental solution of the Helmholtz operator is given by:

$$G(x) = -\frac{1}{4\pi}\,\frac{e^{ik|x|}}{|x|}.$$

The interior and exterior boundary value problems of either Dirichlet or Neumann type are formulated for the Helmholtz equation in the same way as they are set for the Laplace equation, see formulae (13.1). The only exception is that for the exterior problems condition (13.2) is replaced by (13.7). Both Dirichlet and Neumann exterior problems for the Helmholtz equation are uniquely solvable for any φ(x), x ∈ Γ, regardless of the value of k. The single-layer and double-layer potentials for the Helmholtz equation are defined as the same surface integrals (13.4) and (13.5), respectively, except that the fundamental solution G(x) changes. Integral equations (13.6) that correspond to the four boundary value problems for the Helmholtz equation also remain the same, except that the quantity 1/|x−y| in every integral is replaced by e^{ik|x−y|}/|x−y| (according to the change of G).

Still, even though, say, the exterior Dirichlet problem for the Helmholtz equation has a unique solution for any φ(x), the corresponding integral equation of type (13.6c) is not always solvable. More precisely, it will be solvable for an arbitrary φ(x) provided that the interior Neumann problem with φ(x) ≡ 0 and the same wavenumber k has no solutions besides trivial. This, however, is not always the case. The situation is similar regarding the solvability of the integral equation of type (13.6d) that corresponds to the exterior Neumann problem for the Helmholtz equation. It is solvable for any φ(x) only if the interior Dirichlet problem for the same k and φ(x) ≡ 0 has no solutions besides trivial, which may or may not be
true. We thus see that the difficulties in reducing boundary value problems for the Helmholtz equation to integral equations are again related to interior resonances, as in the case of the Laplace equation. These difficulties may translate into serious hurdles for computations.

Integral equations of classical potential theory have been explicitly constructed for the boundary value problems of elasticity, see, e.g., [MMP95], for the Stokes system of equations (low-speed flows of viscous fluid), see [Lad69], and for some other equations and systems, for which analytical forms of the fundamental solutions are available. Along with exploiting various types of surface potentials, one can build boundary integral equations using the relation between the values of the solution u|_Γ and its normal derivative ∂u/∂n|_Γ at the boundary Γ of the region Ω. This relation can be obtained with the help of the Green's formula:
$$u(x) = \int_\Gamma \left( G(x-y)\,\frac{\partial u(y)}{\partial n_y} - \frac{\partial G(x-y)}{\partial n_y}\,u(y) \right) ds_y \qquad (13.8)$$
that represents the solution u(x) of the Laplace or Helmholtz equation at some interior point x ∈ Ω via the boundary values u|_Γ and ∂u/∂n|_Γ. One gets the desired integral equation by passing to the limit x → x₀ ∈ Γ while taking into account the discontinuity of the double-layer potential, i.e., the jump of the second term on the right-hand side of (13.8), at the interface Γ. Then, in the case of a Neumann problem, one can substitute the known function ∂u/∂n|_Γ = φ(x) into the integral and arrive at a Fredholm integral equation of the second kind that can be solved with respect to u|_Γ. The advantage of doing so compared to solving equation (13.6b) is that the quantity to be computed is u|_Γ rather than the auxiliary density ρ(y).
13.2
Discretization of Integral Equations and Boundary Elements
Discretization of the boundary integral equations derived in Section 13.1 encounters difficulties caused by the singular behavior of the kernels ∂G/∂n_y and G. In this section, we will use equation (13.6a) as an example and illustrate the construction of special quadrature formulae that approximate the corresponding integrals.

First, we need to triangulate the surface Γ = ∂Ω, i.e., partition it into a finite number of non-overlapping curvilinear triangles. These triangles are called the boundary elements, and we denote them Γ_j, j = 1, 2, ..., J. They may only intersect by their sides or vertexes. Inside every boundary element the unknown density σ(y) is assumed constant, and we denote its value σ_j, j = 1, 2, ..., J. Next, we approximately replace the convolution integral in equation (13.6a) by
the following sum:
$$\int_\Gamma \sigma(y)\,\frac{\partial}{\partial n_y}\frac{1}{|x-y|}\,ds_y \approx \sum_{j=1}^{J} \sigma_j \int_{\Gamma_j} \frac{\partial}{\partial n_y}\frac{1}{|x-y|}\,ds_y.$$
Note that the integrals on the right-hand side of the previous equality depend only on Γ_j and on x as a parameter, but do not depend on σ_j. Let also x_k be some point from the boundary element Γ_k, and denote the values of the integrals as:
$$a_{kj} = \int_{\Gamma_j} \frac{\partial}{\partial n_y}\frac{1}{|x_k - y|}\,ds_y.$$
Then, integral equation (13.6a) yields the following system of J linear algebraic equations with respect to the J unknowns σ_j:

$$2\pi\,\sigma_k - \sum_{j=1}^{J} a_{kj}\,\sigma_j = -\varphi(x_k), \quad k = 1, 2, \ldots, J. \qquad (13.9)$$
When the number of boundary elements J tends to infinity and the maximum diameter of the boundary elements Γ_j simultaneously tends to zero, the solution of system (13.9) converges to the density σ(y) of the double-layer potential that solves equation (13.6a). The unknown solution u(x) of the corresponding interior Dirichlet problem (13.1a) is then computed in the form of a double-layer potential (13.5) given the approximate values of the density obtained by solving system (13.9).

Note that the method of boundary elements in fact comprises a wide variety of techniques and approaches. The computation of the coefficients a_kj can be conducted either exactly or approximately. The shape of the boundary elements Γ_j does not necessarily have to be triangular. Likewise, the unknown function σ(y) does not necessarily have to be replaced by a constant inside every element. Instead, it can be taken as a polynomial of a particular form with undetermined coefficients. In doing so, linear system (13.9) will also change accordingly.

A number of established techniques for constructing boundary elements are available in the literature, along with the developed methods for the discretization of the boundary integral equations, as well as for the direct or iterative solution of the resulting linear algebraic systems. We refer the reader to the specialized textbooks and monographs [BD92, Hal94, Poz02] for further detail.
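A minimal collocation solver in the spirit of system (13.9) fits in a few dozen lines. The sketch below makes several simplifying assumptions: flat one-point panels obtained from a latitude–longitude partition of the unit sphere are used instead of curvilinear triangles, a_kj is approximated by the kernel value at the panel centroid times the panel area, and the singular diagonal entries are assigned through the Gauss identity ∫_Γ ∂/∂n_y |x−y|⁻¹ ds_y = −2π for x ∈ Γ. With boundary data φ = y₃, the interior harmonic solution of (13.1a) is u(x) = x₃, which the computed double-layer potential should reproduce approximately.

```python
import numpy as np

# Flat panels on the unit sphere: centroids y, outward normals (= y), areas w.
n_th, n_ph = 16, 32
th = (np.arange(n_th) + 0.5) * np.pi / n_th
ph = (np.arange(n_ph) + 0.5) * 2.0 * np.pi / n_ph
TH, PH = np.meshgrid(th, ph, indexing="ij")
y = np.stack([np.sin(TH) * np.cos(PH), np.sin(TH) * np.sin(PH),
              np.cos(TH)], axis=-1).reshape(-1, 3)
w = (np.sin(TH) * (np.pi / n_th) * (2.0 * np.pi / n_ph)).reshape(-1)
nrm = y.copy()
J = len(w)

# a_kj: one-point quadrature of d/dn_y (1/|x_k - y|) = (x_k - y).n_y / |x_k - y|^3.
d = y[:, None, :] - y[None, :, :]
r = np.linalg.norm(d, axis=-1)
np.fill_diagonal(r, 1.0)                       # dummy; diagonal is reset below
A = np.einsum("kji,ji->kj", d, nrm) / r ** 3 * w
A[np.arange(J), np.arange(J)] = -2.0 * np.pi - (A.sum(axis=1) - np.diag(A))

# Solve system (13.9) for the boundary data phi = y_3.
phi = y[:, 2]
sigma = np.linalg.solve(2.0 * np.pi * np.eye(J) - A, -phi)

# Evaluate the double-layer potential at an interior point.
x = np.array([0.0, 0.0, 0.5])
dx = x - y
u = np.sum(sigma * np.einsum("ji,ji->j", dx, nrm)
           / np.linalg.norm(dx, axis=-1) ** 3 * w)
print(u)   # close to the exact value x_3 = 0.5
```

A convenient check on the diagonal treatment: with constant boundary data the discrete system is solved exactly by σ ≡ −1/(4π), because every row sum of A equals −2π by construction.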
13.3
The Range of Applicability for Boundary Elements
Computer codes implementing the method of boundary elements are available today for the numerical solution of the Poisson and Helmholtz equations, the Lamé system of equations that governs elastic deformations in materials, the Stokes system of
equations that governs low-speed flows of incompressible viscous fluid, the Maxwell system of equations for time-harmonic electromagnetic fields, the linearized Euler equations for time-harmonic acoustics, as well as for the solution of some other equations and systems, see, e.g., [Poz02]. An obvious advantage of using boundary integral equations and the method of boundary elements compared, say, with the method of finite differences, is the reduction of the geometric dimension by one. Another clear advantage of the boundary integral equations is that they apply to boundaries of irregular shape and automatically take into account the boundary conditions, as well as the conditions at infinity (if any). Moreover, the use of integral equations sometimes facilitates the construction of numerical algorithms that do not get saturated by smoothness, i.e., that automatically take into account the regularity of the data and of the solution and adjust their accuracy accordingly, see, e.g., [Bab86], [Bel89, BK01].

The principal limitation of the method of boundary elements is that in order to directly employ the apparatus of classical potentials for the numerical solution of boundary value problems, one needs a convenient representation for the kernels of the corresponding integral equations. Otherwise, no efficient discretization of these equations can be constructed. The kernels, in their own turn, are expressed through fundamental solutions, and the latter admit a simple closed-form representation only for some particular classes of equations (and systems) with constant coefficients. We should mention though that the method of boundary elements, if combined with some special iteration procedures, can even be used for solving certain types of nonlinear boundary value problems. Another limitation manifests itself even when the fundamental solution is known and can be represented by means of a simple formula.
In this case, reduction of a boundary value problem to an equivalent integral equation may still be problematic because of the interior resonances. We illustrated this phenomenon in Section 13.1 using the examples of a Dirichlet exterior problem and a Neumann exterior problem for the Helmholtz equation.
Chapter 14

Boundary Equations with Projections and the Method of Difference Potentials
The method of difference potentials is a technique for the numerical solution of interior and exterior boundary value problems for linear partial differential equations. It combines some important advantages of the method of finite differences (Part III) and the method of boundary elements (Chapter 13). At the same time, it allows one to avoid certain difficulties that pertain to these two methods.

The method of finite differences is most efficient when using regular grids for solving problems with simple boundary conditions on domains of simple shape (e.g., a square, circle, cube, ball, annulus, torus, etc.). For curvilinear domains, the method of finite differences may encounter difficulties, in particular, when approximating boundary conditions. The method of difference potentials uses those problems with simple geometry and simple boundary conditions as auxiliary tools for the numerical solution of more complicated interior and exterior boundary value problems on irregular domains. Moreover, the method of difference potentials offers an efficient alternative to the difference approximation of boundary conditions.

The main advantage of the boundary element method is that it reduces the spatial dimension of the problem by one, and also that the integral equations of the method (i.e., integral equations of the potential theory) automatically take into account the boundary conditions of the problem. Its main disadvantage is that to obtain those integral equations, one needs a closed-form fundamental solution. This requirement considerably narrows the range of applicability of boundary elements.

Instead of the integral equations of classical potential theory, the method of difference potentials exploits boundary equations with projections of Calderon's type. They are more universal and enable an equivalent reduction of the problem from the domain to the boundary, regardless of the type of boundary conditions.
Calderon's equations do not require fundamental solutions. At the same time, they contain no integrals, and cannot be discretized using quadrature formulae. However, they can be approximated by difference potentials, which, in turn, can be efficiently computed. A basic construction of difference potentials is given by difference potentials of the Cauchy type. It plays the same role for the solutions of linear difference equations and systems as the classical Cauchy-type integral:

$$F(z) = \frac{1}{2\pi i}\oint_\Gamma \frac{\rho(\zeta)}{\zeta - z}\,d\zeta$$
plays for the solutions of the Cauchy–Riemann system, i.e., for analytic functions.

In this chapter, we only provide a brief overview of the concepts and ideas related to the method of difference potentials. To illustrate the concepts, we analyze a number of model problems for the two-dimensional Poisson equation:

$$\Delta u \equiv \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = f(x,y). \qquad (14.1)$$

These problems are very elementary themselves, but they admit broad generalizations discussed in the recent specialized monograph by Ryaben'kii [Rya02]. Equation (14.1) will be discretized using standard central differences on a uniform Cartesian grid:

$$\frac{u_{m_1+1,m_2} - 2u_{m_1,m_2} + u_{m_1-1,m_2}}{h^2} + \frac{u_{m_1,m_2+1} - 2u_{m_1,m_2} + u_{m_1,m_2-1}}{h^2} = f_{m_1,m_2}. \qquad (14.2)$$

The discretization is symbolically written as (we have encountered it previously, see Section 12.1):

$$\sum_{n\in N_m} a_{mn}\,u_n = f_m, \qquad (14.3)$$
where N_m is the simplest five-node stencil:

$$N_m = \{ (m_1, m_2),\; (m_1 \pm 1, m_2),\; (m_1, m_2 \pm 1) \}. \qquad (14.4)$$
n = m, n = (m l ± 1 , m2 )
or
n = (111 1 , 1112 ± I ) .
The material in the chapter is organized as follows. In Section 1 4. 1 , we formulate the model problems to be analyzed. In Section 1 4.2, we construct difference poten tials and study their properties. In Section 1 4.3, we use the apparatus on Section 1 4.2 to solve model problems of Section 14.1 and thus demonstrate some capabilities of the method of difference potentials. Section 1 4.4 contains some general comments on the relation between the method of difference potentials and some other methods available in the literature; and Section 1 4.5 provides bibliographic information.
14.1
Formulation of Model Problems
In this section, we introduce five model problems, for which their respective approximate solutions will be obtained using the method of difference potentials.
14.1.1
Interior Boundary Value Problem
Find a solution u = u(x,y) of the Dirichlet problem for the Laplace equation:

$$\Delta u = 0, \quad (x,y)\in D, \qquad u|_\Gamma = \varphi(s), \qquad (14.5)$$

where D ⊂ ℝ² is a bounded domain with the boundary Γ = ∂D, and φ(s) is a given function of the argument s, which is the arc length along Γ.
14.1.2
485
( 1 4.5)
cp(s) is a given
Exterior Boundary Value P roblem
Find a bounded solution u = u(x,y) of the Dirichlet problem:
Δu = 0, (x,y) ∈ ℝ²\D̄,  u|_Γ = φ(s).  (14.6)
The solution of problem (14.6) will be of interest to us only on some bounded neighborhood of the boundary Γ. Let us immerse the domain D into a larger square D⁰, see Figure 14.1, and instead of problem (14.6) consider the following modified problem:

Δu = 0, (x,y) ∈ D⁰₋,  u|_Γ = φ(s),  u|_{∂D⁰} = 0,  (14.7)

where D⁰₋ ≡ D⁰\D̄ is the region between Γ = ∂D and the outer boundary ∂D⁰ of the square. The larger the size of the square D⁰ compared to the size of D ≡ D⁺, the better problem (14.7) approximates problem (14.6) on any finite fixed neighborhood of the domain D⁺. Hereafter we will study problem (14.7) instead of problem (14.6).

FIGURE 14.1: Schematic geometric setup.

14.1.3 Problem of Artificial Boundary Conditions
Consider the following boundary value problem on the square D⁰:

Lu = f(x,y), (x,y) ∈ D⁰,  (14.8a)
u|_{∂D⁰} = 0,  (14.8b)

and assume that it has a unique solution u = u(x,y) for any right-hand side f(x,y). Suppose that the solution u needs to be computed not everywhere in D⁰, but only on a given subdomain D⁺ ⊂ D⁰, see Figure 14.1. Moreover, suppose that outside D⁺ the operator L transforms into a Laplacian, so that equation (14.8a) becomes:

Δu = f(x,y), (x,y) ∈ D⁰₋.  (14.9)

Let us then define the artificial boundary as the boundary Γ = ∂D⁺ of the computational subdomain D⁺. In doing so, we note that the original problem (14.8a), (14.8b) does not contain any artificial boundary. Having introduced Γ, we formulate the problem of constructing a special relation lu_Γ = 0 on the artificial boundary Γ, such that for any f(x,y) the solution of the new problem:

Lu = f(x,y), (x,y) ∈ D⁺,  (14.10a)
lu_Γ = 0,  (14.10b)

will coincide on D⁺ with the solution of the original problem (14.8a), (14.8b), (14.9). Condition (14.10b) is called an artificial boundary condition (ABC). One can say that condition (14.10b) must equivalently replace the Laplace equation (14.9) outside the computational subdomain D⁺ along with the boundary condition (14.8b) at the (remote) boundary ∂D⁰ of the original domain. In our model problem, condition (14.8b), in turn, replaces the requirement that the solution of equation (14.8a) be bounded at infinity. One can also say that the ABC (14.10b) is obtained by transferring condition (14.8b) from the remote boundary of the domain D⁰ to the artificial boundary ∂D⁺ of the chosen computational subdomain D⁺.
14.1.4 Problem of Two Subdomains

Let us consider the following Dirichlet problem on D⁰:

Δu = f(x,y), (x,y) ∈ D⁰,  u|_{∂D⁰} = 0.  (14.11)

Suppose that the right-hand side f(x,y) is unknown, but the solution u(x,y) itself is known (say, it can be measured) on some neighborhood of the interface Γ between the subdomains D⁺ and D⁻, see Figure 14.1. For an arbitrary domain Q ⊂ ℝ², define its indicator function:

θ(Q) = { 1, if (x,y) ∈ Q;  0, if (x,y) ∉ Q }.

Then, the overall solution can be represented as the sum of two contributions: u = u⁺ + u⁻, where u⁺ = u⁺(x,y) is the solution due to the sources outside D⁺:

Δu⁺ = θ(D⁻)f(x,y), (x,y) ∈ D⁰,  u⁺|_{∂D⁰} = 0,  (14.12)

and u⁻ = u⁻(x,y) is the solution due to the sources inside D⁺:

Δu⁻ = θ(D⁺)f(x,y), (x,y) ∈ D⁰,  u⁻|_{∂D⁰} = 0.  (14.13)
The problem is to find each term u⁺ and u⁻ separately. More precisely, given the overall solution and its normal derivative at the interface Γ:

(u, ∂u/∂n)|_Γ = (u⁺, ∂u⁺/∂n)|_Γ + (u⁻, ∂u⁻/∂n)|_Γ,  (14.14)

find the individual terms on the right-hand side of (14.14). Then, with no prior knowledge of the source distribution f(x,y), also find the individual branches of the solution, namely: u⁺(x,y) for (x,y) ∈ D⁺, which can be interpreted as the effect of D⁻ on D⁺, see formula (14.12), and u⁻(x,y) for (x,y) ∈ D⁻, which can be interpreted as the effect of D⁺ on D⁻, see formula (14.13).

14.1.5 Problem of Active Shielding
As before, let the square D⁰ be split into two subdomains, D⁺ and D⁻, as shown in Figure 14.1. Suppose that the sources f(x,y) are not known, but the values of the solution u|_Γ and its normal derivative ∂u/∂n|_Γ at the interface Γ between D⁺ and D⁻ are explicitly specified (e.g., obtained by measurements). Along with the original problem (14.11), consider a modified problem:

Δw = f(x,y) + g(x,y), (x,y) ∈ D⁰,  w|_{∂D⁰} = 0,  (14.15)

where the additional term g = g(x,y) on the right-hand side of equation (14.15) represents the active control sources, or simply active controls. The role of these active controls is to protect, or shield, the region D⁺ ⊂ D⁰ from the influence of the sources located on the complementary region D⁻ = D⁰\D̄⁺. In other words, we require that the solution w = w(x,y) of problem (14.15) coincide on D⁺ with the solution v = v(x,y) of the problem:

Δv = { f(x,y), if (x,y) ∈ D⁺;  0, if (x,y) ∈ D⁻ },  v|_{∂D⁰} = 0,  (14.16)

which is obtained by keeping all the original sources on D⁺ and removing all the sources on D⁻. The problem of active shielding consists of finding all appropriate active controls g = g(x,y). This problem can clearly be interpreted as a particular inverse source problem for the given differential equation (the Poisson equation).

REMARK 14.1 Let us emphasize that we are not interested in finding the solution u(x,y) of problem (14.11), the solution w(x,y) of problem (14.15), or the solution v(x,y) of problem (14.16). In fact, these solutions cannot even be obtained on D⁰ because the right-hand side f(x,y) is not known. Our objective is rather to find all g(x,y) that would have a predetermined effect on the solution u(x,y), given only the trace of the solution and its normal derivative on Γ. The desired effect should be equivalent to that of the removal of all sources on the subdomain D⁻, i.e., outside D⁺. □
An obvious particular solution to the foregoing problem of active shielding is given by the function:

g(x,y) = { 0, if (x,y) ∈ D⁺;  −f(x,y), if (x,y) ∈ D⁻ }.  (14.17)

This solution, however, cannot be obtained explicitly, because the original sources f(x,y) are not known. Even if they were known, the active control (14.17) could still be difficult to implement in many applications.
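At the discrete level the effect of the particular control (14.17) is easy to see: f + g retains exactly the sources supported in D⁺, so the controlled solution coincides with the shielded one. The following small sketch illustrates this (our own setup, not the book's; a dense solve of the five-point system stands in for a fast solver, and D⁺ is taken to be a sub-square):

```python
import numpy as np

N, h = 15, 1.0 / 16   # N x N interior nodes of the square D0

def solve_poisson(f):
    """Solve the five-point discrete Dirichlet problem, u = 0 on the sides of D0."""
    n2 = N * N
    A = np.zeros((n2, n2))
    for i in range(N):
        for j in range(N):
            k = i * N + j
            A[k, k] = -4.0 / h**2
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < N and 0 <= jj < N:
                    A[k, ii * N + jj] = 1.0 / h**2
    return np.linalg.solve(A, f.ravel()).reshape(N, N)

# D+ is a centered sub-square; D- is its complement.
idx = np.arange(N)
in_Dplus = (abs(idx[:, None] - N // 2) < N // 4) & (abs(idx[None, :] - N // 2) < N // 4)

rng = np.random.default_rng(0)
f = rng.standard_normal((N, N))           # sources everywhere
g = np.where(in_Dplus, 0.0, -f)           # the particular control (14.17)

w = solve_poisson(f + g)                  # controlled solution, cf. (14.15)
v = solve_poisson(np.where(in_Dplus, f, 0.0))  # sources on D- removed, cf. (14.16)
print(np.allclose(w, v))                  # the control shields D+ exactly
```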
14.2 Difference Potentials

In this section, we construct difference potentials for the simplest five-node discretization (14.3) of the Poisson equation (14.1). All difference potentials are obtained as solutions to specially chosen auxiliary finite-difference problems.

14.2.1 Auxiliary Difference Problem
Let the domain D⁰ ⊂ ℝ² be given, and let M⁰ be the subset of all nodes m = (m₁h, m₂h) of the uniform Cartesian grid (14.2) that belong to D⁰. We consider difference equation (14.3) on the set M⁰:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁰.  (14.18)

The left-hand side of this equation is defined for all functions u_{N⁰} = {u_n}, n ∈ N⁰, where N⁰ = ∪N_m, m ∈ M⁰, and N_m is the stencil (14.4). We also supplement equation (14.18) by an additional linear homogeneous (boundary) condition that we write in the form of an inclusion:

u_{N⁰} ∈ U_{N⁰}.  (14.19)

In this formula, U_{N⁰} denotes a given linear space of functions on the grid N⁰. The only constraint to satisfy when selecting a specific U_{N⁰} is that problem (14.18), (14.19) must have a unique solution for an arbitrary right-hand side f_{M⁰} = {f_m}, m ∈ M⁰. If this constraint is satisfied, then problem (14.18), (14.19) is an appropriate auxiliary difference problem.
Note that a substantial flexibility exists in choosing the domain D⁰, the corresponding grid domains M⁰ and N⁰ = ∪N_m, m ∈ M⁰, and the space U_{N⁰}. Depending on the particular choices made, the resulting auxiliary difference problem (14.18), (14.19) may or may not be easy to solve, and accordingly, it may or may not appear well suited for addressing a given application.

To provide an example of an auxiliary problem, we let D⁰ be a square with its sides parallel to the Cartesian coordinate axes, and its width and height equal to kh, where k is an integer. The corresponding sets M⁰ and N⁰ are shown in Figure 14.2. The set M⁰ consists of the black bullets, and the set N⁰ = ∪N_m, m ∈ M⁰, additionally contains the hollow bullets. The four corner nodes do not belong to either set. Next, we define the space U_{N⁰} as the space of all functions u_{N⁰} = {u_n}, n ∈ N⁰, that vanish on the sides of the square D⁰, i.e., at the nodes n ∈ N⁰ denoted by the hollow bullets in Figure 14.2. Then, the auxiliary problem (14.18), (14.19) is a difference Dirichlet problem for the Poisson equation subject to the homogeneous boundary condition. This problem has a unique solution for an arbitrary right-hand side f_{M⁰} = {f_m}, see Section 12.1. The solution can be efficiently computed using separation of variables, i.e., the Fourier method of Section 5.7. In particular, if the dimension of the grid in one coordinate direction is a power of 2, then the computational complexity of the corresponding algorithm based on the fast Fourier transform (see Section 5.7.3) will only be O(h⁻² |ln h|) arithmetic operations.

FIGURE 14.2: Grid sets M⁰ and N⁰.
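For this particular auxiliary problem, the separation-of-variables solver admits a compact implementation. The sketch below (our own code, not the book's) diagonalizes the five-point operator in the discrete sine basis; for brevity it uses dense sine matrices rather than the FFT, but an FFT version would simply replace the matrix products with fast sine transforms:

```python
import numpy as np

def poisson_dirichlet(f, h):
    """Solve sum_n a_mn u_n = f_m on the N x N interior nodes, u = 0 on the sides."""
    N = f.shape[0]
    k = np.arange(1, N + 1)
    # Orthonormal sine eigenvectors of the 1-D second-difference operator; S @ S = I.
    S = np.sin(np.pi * np.outer(k, k) / (N + 1)) * np.sqrt(2.0 / (N + 1))
    lam = (2.0 * np.cos(np.pi * k / (N + 1)) - 2.0) / h**2   # 1-D eigenvalues
    fhat = S @ f @ S                     # forward transform in both directions
    return S @ (fhat / (lam[:, None] + lam[None, :])) @ S    # divide and invert

def apply_laplacian0(u, h):
    """Five-point operator applied to an interior grid function, zero boundary data."""
    up = np.pad(u, 1)
    return (up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2]
            - 4.0 * u) / h**2

# Round-trip check: apply the operator, then recover the function by the solver.
N, h = 63, 1.0 / 64
U = np.random.default_rng(0).standard_normal((N, N))
F = apply_laplacian0(U, h)
U2 = poisson_dirichlet(F, h)
print(np.allclose(U, U2))   # the sine transform inverts the operator exactly
```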
14.2.2 The Potential u⁺ = P⁺v_γ
Let D⁺ be a given bounded domain, and let M⁺ be the set of grid nodes m that belong to D⁺. Consider the restriction of system (14.3) onto the grid M⁺:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁺.  (14.20)

The left-hand side of (14.20) is defined for all functions u_{N⁺} = {u_n}, n ∈ N⁺, where

N⁺ = ∪N_m,  m ∈ M⁺.

We will build difference potentials for the solutions of system (14.20). Let D⁰ be a larger square, D⁺ ⊂ D⁰, and introduce an auxiliary difference problem of the type
(14.18), (14.19). Denote D⁻ = D⁰\D̄⁺, and let M⁻ be the set of grid nodes that belong to D⁻:

M⁻ = M⁰\M⁺.

Along with system (14.20), consider the restriction of (14.3) onto the set M⁻:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁻.  (14.21)

The left-hand side of (14.21) is defined for all functions u_{N⁻} = {u_n}, n ∈ N⁻, where

N⁻ = ∪N_m,  m ∈ M⁻.

Thus, system (14.18) gets split into two subsystems, (14.20) and (14.21), with their solutions defined on the sets N⁺ and N⁻, respectively. Let us now define the boundary γ between the grid domains N⁺ and N⁻ (see Figure 14.3):

γ = N⁺ ∩ N⁻.

It is a narrow fringe of nodes that straddles the continuous boundary Γ. We also introduce a linear space V_γ of all grid functions v_γ defined on γ. These functions can be considered restrictions of the functions u_{N⁰} ∈ U_{N⁰} to the subset γ ⊂ N⁰. The functions v_γ ∈ V_γ will be referred to as densities. We can now introduce the potential u⁺ = P⁺v_γ with the density v_γ.

FIGURE 14.3: Grid boundary γ.
DEFINITION 14.1 Consider the grid function:

v_n = { v_γ|_n, if n ∈ γ;  0, if n ∉ γ },  (14.22)

and define:

f_m = { 0, if m ∈ M⁺;  Σ_{n∈N_m} a_{mn} v_n, if m ∈ M⁻ }.  (14.23)

Let u_{N⁰} = {u_n}, n ∈ N⁰, be the solution of the auxiliary difference problem (14.18), (14.19) driven by the right-hand side f_m of (14.23), (14.22). The difference potential u⁺ = P⁺v_γ is the function u_{N⁺} = {u_n}, n ∈ N⁺, that coincides with u_{N⁰} on the grid domain N⁺.
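Definition 14.1 translates directly into code once the grid sets are represented as boolean masks. The following sketch (hypothetical helper names, not from the book; for simplicity D⁺ is taken to be a sub-square of the auxiliary square, so the sets are easy to enumerate) computes the potential and checks that it satisfies the homogeneous equation on M⁺, i.e., property 2 of Theorem 14.1 below:

```python
import numpy as np

N, h = 31, 1.0 / 32                        # interior grid of D0 is N x N

def solve_aux(f):
    """Auxiliary problem (14.18), (14.19): discrete Dirichlet Poisson solve."""
    k = np.arange(1, N + 1)
    S = np.sin(np.pi * np.outer(k, k) / (N + 1)) * np.sqrt(2.0 / (N + 1))
    lam = (2.0 * np.cos(np.pi * k / (N + 1)) - 2.0) / h**2
    return S @ ((S @ f @ S) / (lam[:, None] + lam[None, :])) @ S

def lap(u):
    """Five-point operator sum_n a_mn u_n, with zero data outside the grid."""
    up = np.pad(u, 1)
    return (up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2] - 4.0 * u) / h**2

# Grid subsets as boolean masks over the N x N interior nodes.
idx = np.arange(N)
Mplus = (abs(idx[:, None] - N // 2) < N // 4) & (abs(idx[None, :] - N // 2) < N // 4)
Mminus = ~Mplus

def stencil_union(M):                      # N+ or N- as the union of stencils N_m
    Np = M.copy()
    Np[1:, :] |= M[:-1, :]; Np[:-1, :] |= M[1:, :]
    Np[:, 1:] |= M[:, :-1]; Np[:, :-1] |= M[:, 1:]
    return Np

gamma = stencil_union(Mplus) & stencil_union(Mminus)   # the fringe gamma

# Take a density v_gamma, extend it by zero (14.22), form the right-hand
# side (14.23), and solve the auxiliary problem; P+ v_gamma is the trace on N+.
rng = np.random.default_rng(1)
v = np.where(gamma, rng.standard_normal((N, N)), 0.0)
f = np.where(Mminus, lap(v), 0.0)          # zero on M+, sum a_mn v_n on M-
u = solve_aux(f)
print(np.max(np.abs(lap(u)[Mplus])))       # residual on M+ is zero up to round-off
```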
Clearly, to compute the difference potential u⁺ = P⁺v_γ with a given density v_γ ∈ V_γ, one needs to solve the auxiliary difference problem (14.18), (14.19) with the right-hand side (14.23), (14.22).
THEOREM 14.1 The difference potential u⁺ = u_{N⁺} = P⁺v_γ has the following properties.

1. There is a function u_{N⁰} ∈ U_{N⁰} such that u_{N⁺} = P⁺v_γ = θ(N⁺)u_{N⁰}|_{N⁺}, where θ(·) denotes the indicator function of a set.

2. On the set M⁺, the difference potential u⁺ = P⁺v_γ satisfies the homogeneous equation:

Σ_{n∈N_m} a_{mn} u_n = 0,  m ∈ M⁺.  (14.24)

3. Let v_{N⁰} ∈ U_{N⁰}, and let v_{N⁺} = θ(N⁺)v_{N⁰}|_{N⁺} be a solution to the homogeneous equation (14.24). Let the density v_γ be given, and let it coincide with v_{N⁺} on γ: v_γ = v_{N⁺}|_γ. Then the solution v_{N⁺} = {v_n}, n ∈ N⁺, is unique and can be reconstructed from its values on the boundary γ by the formula:

v_{N⁺} = P⁺v_γ.  (14.25)

PROOF The first implication of the theorem follows immediately from Definition 14.1. The second implication is also true because the potential P⁺v_γ solves problem (14.18), (14.19), and according to formula (14.23), the right-hand side of this problem is equal to zero on M⁺. To prove the third implication, consider the function v_{N⁰} ∈ U_{N⁰} from the hypothesis of the theorem along with another function w_{N⁰} = θ(N⁺)v_{N⁰} ∈ U_{N⁰}. The function w_{N⁰} satisfies equation (14.18) with the right-hand side given by (14.23), (14.22). Since the solution of problem (14.18), (14.19) is unique, the solution w_{N⁰} driven by the right-hand side (14.23), (14.22) coincides on N⁺ with the difference potential v_{N⁺} = P⁺v_γ. Therefore, the solution v_{N⁺} ≡ w_{N⁰}|_{N⁺} is indeed unique and can be represented by formula (14.25). □
Let us now introduce another operator, P⁺_γ: V_γ → V_γ, that for a given density v_γ ∈ V_γ yields the trace of the difference potential P⁺v_γ on the grid boundary γ:

P⁺_γ v_γ = P⁺v_γ|_γ.  (14.26)

THEOREM 14.2 Let v_γ ∈ V_γ be given, and let v_γ = v_{N⁺}|_γ, where v_{N⁺} is a solution to the homogeneous equation (14.24) and v_{N⁺} = θ(N⁺)v_{N⁰}|_{N⁺}, v_{N⁰} ∈ U_{N⁰}. Then,

P⁺_γ v_γ = v_γ.  (14.27)
Conversely, if equation (14.27) holds for a given v_γ ∈ V_γ, then v_γ is the trace of some solution v_{N⁺} of equation (14.24), and v_{N⁺} = θ(N⁺)v_{N⁰}|_{N⁺}, v_{N⁰} ∈ U_{N⁰}.

PROOF The first (direct) implication of this theorem follows immediately from the third implication of Theorem 14.1. Indeed, the solution v_{N⁺} of the homogeneous equation (14.24) that coincides with the prescribed v_γ on γ is unique and is given by formula (14.25). Considering the restriction of both sides of (14.25) onto γ, and taking into account the definition (14.26), we arrive at equation (14.27).

Conversely, suppose that for some v_γ equation (14.27) holds. Let us consider the difference potential u⁺ = P⁺v_γ with the density v_γ. According to Theorem 14.1, u⁺ is a solution of equation (14.24), and this solution can be extended from N⁺ to N⁰ so that the extension belongs to U_{N⁰}. On the other hand, formulae (14.26) and (14.27) imply that

u⁺|_γ = P⁺v_γ|_γ = P⁺_γ v_γ = v_γ,

so that v_γ are the boundary values of this potential. As such, u⁺ can be taken in the capacity of the desired v_{N⁺}, which completes the proof. □

Theorems 14.1 and 14.2 imply that the operator P⁺_γ is a projection:

P⁺_γ P⁺_γ = P⁺_γ.

In [Rya85], [Rya02], it is shown that this operator can be considered a discrete counterpart of Calderon's boundary projection, see [Cal63]. Accordingly, equation (14.27) is referred to as the boundary equation with projection. A given density v_γ ∈ V_γ can be interpreted as the boundary trace of some solution to the homogeneous equation (14.24) if and only if it satisfies the boundary equation with projection (14.27), or alternatively, if and only if it belongs to the range of the projection P⁺_γ.

14.2.3 Difference Potential u⁻ = P⁻v_γ
The difference potential P⁻v_γ with the density v_γ ∈ V_γ is a grid function defined on N⁻. Its definition is completely similar and basically reciprocal to that of the potential P⁺v_γ. To build the potential P⁻v_γ and analyze its properties, one only needs to replace the superscript "+" by "−", and the superscript "−" by "+", on all occasions when these superscripts are encountered in Definition 14.1, as well as in Theorems 14.1 and 14.2 and their proofs.
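Since P⁻ mirrors P⁺, both boundary operators P±_γ can share one implementation, and the projection property P⁺_γP⁺_γ = P⁺_γ (and likewise for P⁻_γ) can then be confirmed numerically. The sketch below is ours, not the book's; it assumes a sub-square D⁺ inside the auxiliary square and hypothetical helper names:

```python
import numpy as np

N, h = 31, 1.0 / 32

def solve_aux(f):
    """Auxiliary problem (14.18), (14.19) via the discrete sine basis."""
    k = np.arange(1, N + 1)
    S = np.sin(np.pi * np.outer(k, k) / (N + 1)) * np.sqrt(2.0 / (N + 1))
    lam = (2.0 * np.cos(np.pi * k / (N + 1)) - 2.0) / h**2
    return S @ ((S @ f @ S) / (lam[:, None] + lam[None, :])) @ S

def lap(u):
    up = np.pad(u, 1)
    return (up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2] - 4.0 * u) / h**2

idx = np.arange(N)
Mplus = (abs(idx[:, None] - N // 2) < N // 4) & (abs(idx[None, :] - N // 2) < N // 4)

def fringe(M):
    Np = M.copy()
    Np[1:, :] |= M[:-1, :]; Np[:-1, :] |= M[1:, :]
    Np[:, 1:] |= M[:, :-1]; Np[:, :-1] |= M[:, 1:]
    return Np

gamma = fringe(Mplus) & fringe(~Mplus)

def P_gamma(vg, sign):
    """Trace on gamma of P+ (sign=+1) or P- (sign=-1) applied to the density vg."""
    v = np.where(gamma, vg, 0.0)
    mask = ~Mplus if sign > 0 else Mplus   # right-hand side lives on the other side
    u = solve_aux(np.where(mask, lap(v), 0.0))
    return np.where(gamma, u, 0.0)

vg = np.where(gamma, np.random.default_rng(2).standard_normal((N, N)), 0.0)
p_plus = P_gamma(vg, +1)
p_minus = P_gamma(vg, -1)
print(np.allclose(P_gamma(p_plus, +1), p_plus))    # (P+_gamma)^2 = P+_gamma
print(np.allclose(P_gamma(p_minus, -1), p_minus))  # (P-_gamma)^2 = P-_gamma
```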
14.2.4 Cauchy Type Difference Potential w± = P±v_γ

DEFINITION 14.2 The function w± = P±v_γ defined for n ∈ N⁰ as:

w±_n = { w⁺_n = P⁺v_γ|_n, if n ∈ N⁺;  w⁻_n = −P⁻v_γ|_n, if n ∈ N⁻ }  (14.28)

is called a Cauchy type difference potential w± = P±v_γ with the density v_γ.
This potential is, generally speaking, a two-valued function on γ, because each node n ∈ γ simultaneously belongs to both N⁺ and N⁻. Along with the Cauchy type difference potential, we will define an equivalent concept of the difference potential w± = P±v_γ with the jump v_γ. This new concept will be instrumental for studying properties of the potential w± = P±v_γ, as well as for understanding a deep analogy between the Cauchy type difference potential and the Cauchy type integral from the classical theory of analytic functions. Moreover, the new definition will allow us to calculate all three potentials:

u⁺ = P⁺v_γ,  u⁻ = P⁻v_γ,  and  w± = P±v_γ,

at once, by solving only one auxiliary problem (14.18), (14.19) for a special right-hand side f_m, m ∈ M⁰. On the other hand, formula (14.28) indicates that a straightforward computation of w± according to Definition 14.2 would require obtaining the potentials P⁺v_γ and P⁻v_γ independently, which, in turn, necessitates solving two problems of the type (14.18), (14.19).

To define the difference potential w± = P±v_γ with the jump v_γ, we first need to introduce piecewise regular functions. The functions from the space U_{N⁰} will be called regular. Given two arbitrary regular functions u⁺_{N⁰} and u⁻_{N⁰}, we define a piecewise regular function u±_n, n ∈ N⁰, as:

u±_n = { u⁺_n, if n ∈ N⁺;  u⁻_n, if n ∈ N⁻ }.  (14.29)

Let U± be the linear space of all piecewise regular functions (14.29). Any function (14.29) assumes two values, u⁺_n and u⁻_n, at every node n of the grid boundary γ. A single-valued function v_γ defined for n ∈ γ by the formula:

v_n = u⁺_n − u⁻_n,  n ∈ γ,  (14.30)

will be called a jump. Note that the space U_{N⁰} of regular functions can be considered a subspace of the space U±; this subspace consists of all functions u± ∈ U± with zero jump: v_γ = 0_γ. A piecewise regular function (14.29) will be called a piecewise regular solution of the problem:

Σ_{n∈N_m} a_{mn} u±_n = 0,  u± ∈ U±,  (14.31)
if the functions u⁺_n, n ∈ N⁺, and u⁻_n, n ∈ N⁻, satisfy the homogeneous equations:

Σ_{n∈N_m} a_{mn} u⁺_n = 0,  m ∈ M⁺,  (14.32a)

and

Σ_{n∈N_m} a_{mn} u⁻_n = 0,  m ∈ M⁻,  (14.32b)

respectively.
THEOREM 14.3 Problem (14.31) has a unique solution for any given jump v_γ ∈ V_γ. This solution is obtained by the formula:

u± = v± − u,  (14.33)

where the minuend on the right-hand side of (14.33) is an arbitrary piecewise regular function v± ∈ U± with the prescribed jump v_γ, and the subtrahend is a regular solution u of the auxiliary difference problem (14.18), (14.19) with the right-hand side:

f_m = { Σ_{n∈N_m} a_{mn} v⁺_n, if m ∈ M⁺;  Σ_{n∈N_m} a_{mn} v⁻_n, if m ∈ M⁻ }.  (14.34)

PROOF First, we notice that piecewise regular functions v± ∈ U± with a given jump v_γ exist. Obviously, one of these functions is given by the formula:
v±_n = { v⁺_n, if n ∈ N⁺;  v⁻_n, if n ∈ N⁻ },

where

v⁺_n = { v_γ|_n, if n ∈ γ ⊂ N⁺;  0, if n ∈ N⁺\γ },  and  v⁻_n = 0,  n ∈ N⁻.

The second term on the right-hand side of (14.33) also exists, since problem (14.18), (14.19) has one and only one solution for an arbitrary right-hand side f_m, m ∈ M⁰, in particular, for the one given by formula (14.34). Next, formula (14.33) yields:

[u±] = [v±] − [u] = v_γ − [u],

and since u is a regular function and has no jump:

[u] = 0,  n ∈ γ,

we conclude that the function u± of (14.33) is a piecewise regular function with the jump v_γ. Let us show that this function u± is, in fact, a piecewise regular solution. To this end, we need to show that the functions:
u⁺_n = v⁺_n − u_n,  n ∈ N⁺,  (14.35a)

and

u⁻_n = v⁻_n − u_n,  n ∈ N⁻,  (14.35b)

satisfy the homogeneous equations (14.32a) and (14.32b), respectively. First, substituting the function u⁺_n of (14.35a) into equation (14.32a), we obtain:

Σ_{n∈N_m} a_{mn} u⁺_n = Σ_{n∈N_m} a_{mn} v⁺_n − Σ_{n∈N_m} a_{mn} u_n,  m ∈ M⁺.  (14.36)

However, {u_n} is the solution of the auxiliary difference problem with the right-hand side (14.34). Hence, we have:

Σ_{n∈N_m} a_{mn} u_n = f_m = Σ_{n∈N_m} a_{mn} v⁺_n,  m ∈ M⁺.

Thus, the right-hand side of formula (14.36) cancels out, and therefore equation (14.32a) is satisfied. The proof that the function (14.35b) satisfies equation (14.32b) is similar.

It only remains to show that the solution to problem (14.31) with a given jump v_γ is unique. Assume that there are two such solutions. Then the difference between these solutions is a piecewise regular solution with zero jump. Hence it is a regular function from the space U_{N⁰}. In this case, problem (14.31) coincides with the auxiliary problem (14.18), (14.19) driven by the zero right-hand side. However, the solution of the auxiliary difference problem (14.18), (14.19) is unique, and consequently, it is equal to zero. Therefore, two piecewise regular solutions with the same jump coincide. □
THEOREM 14.4 Definition 14.2 of the Cauchy type difference potential w± = P+Vy with the density Vy E Vy and Definition 14.3 of the difference potential u± = P±vy with the jump Vy are equivalent, i. e., w± = u± .
496
A Theoretical Introduction to Numerical Analysis
PRO OF It is clear the formula ( 1 4.28) yields a piecewise regular solution of problem ( 14.3 1 ). Since we have proven that the solution of problem ( 14.3 1 ) with a given jump Vy is unique, it only remains t o show that the piecewise regular solution (1 4.28) has jump Vy:
[w±] = ll�  uy = P� Vy  (Py V y) = P� Vy + Py Vy = vy . In the previous chain of equalities only the last one needs to be justified, i.e., we need to prove that ( 14.37) Recall that PtVy coincides on y with the difference potential u + = P+ vy, which, in turn, coincides on N + with the solution of the auxiliary difference problem: ( 14.38) m E tvfJ ,
{ o,
driven by the righthand side fill of ( 1 4.23), ( 1 4.22):
fill+ = LnENm Qmn vll ,
if m E M+ , if m E M .
( 1 4.39)
Quite similarly, the expression Py Vy coincides on y with the difference po tential u = PVy, which, in turn, coincides on N with the solution of the auxiliary difference problem:
L
nENm
amnu;; = !,;; ,
m E tvfJ , U;, E UNo ,
( 14.40)
dri ven by the righthand side: ( 1 4.4 1 ) Similarly to formula (14.39), the function VII in formula ( 14.41) is determined by the given density Vy according to (1 4.22). Let us add equations ( 1 4.38) and ( 14.40). On the righthand side we obtain:
!,� + !,;;
=
L
nENm
am"vn,
m E tvfJ.
Consequently, we have:
L all!ll (u: + u;; ) = L al/lll vn ,
/l ENm
flENm
The function vND determined by Vy according to ( 1 4.22) belongs to the space UNO . Since the functions ll�o and u� are solutions of the auxiliary
Boundary Equations with Projections and the Method of Difference Potentials
497
difference problems ( 1 4.38), (14.39) and ( 14.40), ( 14.41 ) , respectively, each of them also belongs to UNO . Thus, their sum is in the same space: ZNO == lI�o + lI;0l E UNO , and it solves the following problem of the type ( 1 4. 1 8), ( 14. 1 9):
L. a/Ill,:" = L. am" Vn ,
llENm
nENm
111 E
1'vfl,
ZND E UNO .
Clearly, if we substitute VND for the unknown :No on the lefthand side, the equation will be satisfied. Then, because of the uniqueness of the solution, we have: zND = vND or z" = lit + lI;; = v,, , 11 E NJ. This implies: lit + lI/;
=
p + vy l " + p  vy l ll
= vy l n,
11 E y e NO ,
0
which coincides with the desired equality ( 1 4.37).
Suppose that the potential ll ± = P± Vy for a given jump Vy E Vy is introduced according to Definition 1 4.3. In other words, it is given by formula ( 1 4.33) that require::; solving problem (14. 1 8), ( 1 4. 1 9) with the right halld side (14.34). Thus, the functions lI+ and u  are determined: REMARK 14.2
lI"± _

{
lit , lin , _
Consequently, the individual potentials P+ vy and P vy are also determined: if
n E N+ ,
if 11 E N  . by virtue of Theorem 14.4 and Definition 14.2.
14.2.5
o
Analogy with C lassical Cauchy Type Integral
Suppose that r is a simple closed curve that partitions the complex plane : = x + iy into a bounded domain D + and the complementary unbounded domain D  . The clas sical Cauchy type integral (where the direction of integration is counterclockwise): ( 1 4.42) can be alternatively defined as a piecewise analytic function that has zero limit at infinity and undergoes jump Vr = [ lI ± ]r across the contour r. Here 1/+ (z) and u  (: ) arc the values of the integral ( 1 4.42) for z E D + and : E D , respectively. Cauchy type integrals can be interpreted as potentials for the solutions of the CauchyRiemann system:
da dx
iJb iJy '
498
A Theoretical Introduction to Numerical Analysis
that connects the real part a and the imaginary part b of an analytic function a + ib. Thus, the Cauchy type difference potential: w± = P±Vy
plays the same role for the solutions of system (14. 18) as the Cauchy type integral plays for the solutions of the CauchyRiemann system.
14.3
Solution of Model Problems
In this section, we use difference potentials of Section lems of Section 14. 1 .
14.3.1
14.2 to solve model prob
Interior Boundary Value Problem
Consider the interior Dirichlet problem (14.5):

Δu = 0, (x,y) ∈ D,  u|_Γ = φ(s),  Γ = ∂D.  (14.43)

Introduce a larger square D⁰ ⊃ D with its sides on the coordinate lines of the Cartesian grid (14.2). The auxiliary problem (14.18), (14.19) is formulated on D⁰:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁰,  u_n = 0 for n on the sides of D⁰.  (14.44)

We will construct two algorithms for solving problem (14.43) numerically. In either algorithm, we will replace the Laplace equation of (14.43) by the difference equation:

Σ_{n∈N_m} a_{mn} u_n = 0,  m ∈ M⁺ ⊂ D.  (14.45)

Following the considerations of Section 14.2, we introduce the sets N⁺, N⁻, and γ = N⁺ ∩ N⁻, the space V_γ of all functions defined on γ, and the potential u⁺ = P⁺v_γ.

Difference Approximation of the Boundary Condition
In the first algorithm, we approximate the Dirichlet boundary condition of (14.43) by some linear difference boundary condition that we symbolically write as:

lu_γ = φ^(h).  (14.46)

Theorem 14.2 implies that the boundary equation with projection:

u_γ = P⁺_γ u_γ  (14.47)
is necessary and sufficient for u_γ ∈ V_γ to be a trace of the solution to equation (14.45) on the grid boundary γ. Obviously, the trace u_γ of the solution to problem (14.45), (14.46) coincides with the solution of system (14.46), (14.47):

u_γ − P⁺_γ u_γ = 0,  lu_γ = φ^(h).  (14.48)

Once the solution u_γ of system (14.48) has been computed, the desired solution u_{N⁺} to equation (14.45) can be reconstructed from its trace u_γ according to the formula u_{N⁺} = P⁺u_γ, see Theorem 14.1. Regarding the solution of system (14.48), iterative methods (Chapter 6) can be applied, because the grid function v_γ − P⁺_γ v_γ can be easily computed for any v_γ ∈ V_γ by solving the auxiliary difference problem (14.44). Efficient algorithms for solving system (14.48) are discussed in [Rya02, Part I]. Let us also emphasize that the reduction of problem (14.45), (14.46) to problem (14.48) drastically decreases the number of unknowns: from u_{N⁺} to u_γ.

Spectral Approximation of the Boundary Condition
In the second algorithm for solving problem (14.43), we choose a system of basis functions ψ₁(s), ψ₂(s), … on the boundary Γ of the domain D, and assume that the normal derivative ∂u/∂n|_Γ of the desired solution can be approximated by the sum:

∂u/∂n|_Γ ≈ Σ_{k=1}^{K} c_k ψ_k(s)  (14.49)

with any given accuracy. In formula (14.49), c_k are the coefficients to be determined, and K is a sufficiently large integer. We emphasize that the number of terms K in the representation (14.49) that is required to meet a prescribed tolerance will, generally speaking, depend on the properties of the approximating space span{ψ₁(s), …, ψ_n(s), …}, as well as on the class of functions that the derivative ∂u/∂n|_Γ belongs to (i.e., on the regularity of the derivative, see Section 12.2.5). For a given function u|_Γ = φ(s) and for the derivative ∂u/∂n|_Γ taken in the form (14.49), we construct the function v_γ using the Taylor formula:

v_n = v_n(c₁, …, c_K) = φ(s_n) + ρ_n Σ_{k=1}^{K} c_k ψ_k(s_n),  n ∈ γ.  (14.50)

In formula (14.50), s_n denotes the arc length along Γ at the foot of the normal dropped from the node n ∈ γ to Γ. The number ρ_n is the distance from the point s_n ∈ Γ to the node n ∈ γ taken with the sign "+" if n is outside Γ and the sign "−" if n is inside Γ. We require that the function v_γ = v_γ(c₁, …, c_K) ≡ v_γ(c) satisfy the boundary equation with projection (14.47):

v_γ(c) − P⁺_γ v_γ(c) = 0.  (14.51)

The number of equations in the linear system (14.51), which has K unknowns c₁, c₂, …, c_K, is equal to the number of nodes |γ| of the boundary γ. For a fixed K and a sufficiently fine grid (i.e., if |γ| is large) this system is overdetermined. It can be solved by the method of least squares (Chapter 7). Once the coefficients c₁, …, c_K have been found, the function v_γ(c) is obtained by formula (14.50) and the solution u_{N⁺} is computed as:

u_{N⁺} = P⁺v_γ(c).  (14.52)

A strategy for choosing the right scalar product for the method of least squares is discussed in [Rya02, Part I]. Efficient algorithms for computing the generalized solution c₁, …, c_K of system (14.51), as well as the issue of convergence of the approximate solution (14.52) to the exact solution, are also addressed in [Rya02, Part I].
14.3.2 Exterior Boundary Value Problem

First, we replace the Laplace equation of (14.6) by the difference equation:

Σ_{n∈N_m} a_{mn} u_n = 0,  m ∈ M⁻.  (14.53)
Then, we introduce the auxiliary difference problem (14.44), the grid domains N⁺ and N⁻, the grid boundary γ, and the space V_γ, in the same way as we did for the interior problem in Section 14.3.1. For the difference approximation of the boundary condition u|_Γ = φ(s), we also use the same equation (14.46). The boundary equation with projection for the exterior problem takes the form [cf. formula (14.27)]:

u_γ = P⁻_γ u_γ.  (14.54)

The function u_γ satisfies equation (14.54) if and only if it is the trace of a solution to system (14.53) subject to the homogeneous Dirichlet boundary conditions at ∂D⁰, see (14.44). To actually find u_γ, one must solve equations (14.46) and (14.54) together. Then, one can reconstruct the solution u_{N⁻} of the exterior difference problem from its boundary trace u_γ as u_{N⁻} = P⁻u_γ.

Moreover, for the exterior problem (14.7) one can also build an algorithm that does not require difference approximation of the boundary condition. This is done similarly to how it was done in Section 14.3.1 for the interior problem. The only difference between the two algorithms is that instead of the overdetermined system (14.51) one will need to solve the system:

v_γ(c) − P⁻_γ v_γ(c) = 0,

and subsequently obtain the approximate solution by the formula:

u_{N⁻} = P⁻v_γ(c),  (14.55)

instead of formula (14.52).

14.3.3 Problem of Artificial Boundary Conditions

The problem of difference artificial boundary conditions is formulated as follows. We consider a discrete problem on D⁰:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁰,  u_{N⁰} ∈ U_{N⁰},  (14.56)

and use the same M⁰, M⁺, M⁻, N⁺, N⁻, γ, V_γ, and U_{N⁰} as in Sections 14.3.1 and 14.3.2. We need to construct a boundary condition of the type lv_γ = 0 on the grid boundary γ = N⁺ ∩ N⁻ of the computational subdomain N⁺, such that the solution {u_n} of the problem:

Σ_{n∈N_m} a_{mn} u_n = f_m,  m ∈ M⁺,  lu_γ = 0,  (14.57)

will coincide on N⁺ with the solution u_{N⁰} of problem (14.56). The grid set γ and the boundary condition lu_γ = 0 are called an artificial boundary and an artificial boundary condition, respectively, because they are not a part of the original problem (14.56). They appear only because the solution is to be computed on the subdomain N⁺ rather than on the entire N⁰, and accordingly, the boundary condition u_n = 0, n ∈ ∂D⁰, is to be transferred from ∂D⁰ to γ. One can use the exterior boundary equation with projection (14.54) in the capacity of the desired boundary condition lu_γ = 0. Indeed, relation (14.54) is necessary and sufficient for the solution u_{N⁺} of the resulting problem (14.57) to admit an extension to the set N⁻ that would satisfy:
L amnUfl = 0,
nENm
Un = o,
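A small numerical illustration of this idea, under the same hypothetical 1D model used throughout these sketches (second-difference operator, zero Dirichlet data at both ends, γ = {k, k+1}): we truncate the grid to N⁺ and close the interior equations with the projection-based artificial boundary condition (I − P_γ⁻) u_γ = 0. The assembly and all identifiers are our assumptions, not the book's formulas.

```python
import numpy as np

# Model assumption: (A u)_m = u_{m-1} - 2 u_m + u_{m+1}, zero Dirichlet data.
N, k = 12, 5
A = np.zeros((N - 1, N + 1))
for m in range(1, N):
    A[m - 1, m - 1:m + 2] = [1.0, -2.0, 1.0]

def solve(f):
    u = np.zeros(N + 1)
    u[1:N] = np.linalg.solve(A[:, 1:N], f)
    return u

def P_trace(vg, side):
    v = np.zeros(N + 1)
    v[k], v[k + 1] = vg
    mask = np.zeros(N - 1)
    if side == '+':
        mask[:k] = 1.0
    else:
        mask[k:] = 1.0
    return (v - solve(mask * (A @ v)))[[k, k + 1]]

Pm = np.column_stack([P_trace(e, '-') for e in np.eye(2)])

# Global problem with sources confined to M+ = {1..k}:
f = np.zeros(N - 1)
f[:k] = np.linspace(1.0, 2.0, k)
u_glob = solve(f)

# Truncated problem on N+ = {0..k+1}: equation u_0 = 0, the interior
# difference equations on M+, and the ABC (I - P_gamma^-) u_gamma = 0
# at gamma = {k, k+1} (an effectively rank-one constraint).
B = np.zeros((k + 3, k + 2))
b = np.zeros(k + 3)
B[0, 0] = 1.0                                  # u_0 = 0
for m in range(1, k + 1):
    B[m, m - 1:m + 2] = [1.0, -2.0, 1.0]
    b[m] = f[m - 1]
B[k + 1:, k:k + 2] = np.eye(2) - Pm            # artificial boundary condition
u_plus = np.linalg.lstsq(B, b, rcond=None)[0]  # consistent overdetermined system

# The truncated solution reproduces the global one on N+:
assert np.allclose(u_plus, u_glob[:k + 2])
```

The exterior grid and the far boundary condition at n = N never appear in the truncated system; their effect is carried entirely by the 2×2 matrix of P_γ⁻.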
14.3.4  Problem of Two Subdomains
A finite-difference counterpart of the continuous problem formulated in Section 14.1.4 is the following. Suppose that u_{N⁰} ≡ {u_n}, n ∈ N⁰, satisfies:

    Σ_{n∈N_m} a_mn u_n = f_m,   m ∈ M⁰,
    u_n = 0,   if n ∈ ∂D⁰.                                     (14.58)

The solution to (14.58) is the sum of the solutions u⁺_{N⁰} and u⁻_{N⁰} of the problems:

    Σ_{n∈N_m} a_mn u⁺_n = θ_{M⁰}(M⁻) f_m,   m ∈ M⁰,   u⁺_{N⁰} ∈ U_{N⁰},   (14.59)
and

    Σ_{n∈N_m} a_mn u⁻_n = θ_{M⁰}(M⁺) f_m,   m ∈ M⁰,   u⁻_{N⁰} ∈ U_{N⁰},   (14.60)
respectively, where θ(·) is the indicator function of a set. Assume that the solution to problem (14.58) is known on γ, i.e., that we know u_γ. We need to find the individual terms in the sum:

    u_γ = u⁺_γ + u⁻_γ.

In doing so, the values of f_m, m ∈ M⁰, are not known. The solution of this problem is given by the formulae:

    u⁺_γ = P⁺_γ u_γ,                                           (14.61)
    u⁻_γ = P⁻_γ u_γ,                                           (14.62)

where

    P⁺_γ u_γ = P⁺ u_γ |_γ   and   P⁻_γ u_γ = P⁻ u_γ |_γ.

Formulae (14.61) and (14.62) will be justified a little later. Let u⁺ = {u⁺_n}, n ∈ N⁺, and u⁻ = {u⁻_n}, n ∈ N⁻, be the restrictions of the solutions of problems (14.59) and (14.60) to the sets N⁺ and N⁻, respectively. Then,

    u⁺ = P⁺ u_γ,                                               (14.63)
    u⁻ = P⁻ u_γ.                                               (14.64)
To show that formulae (14.63) and (14.64) are true, we introduce the function:

    w±_n = {  u⁺_n,   n ∈ N⁺,
           { −u⁻_n,   n ∈ N⁻.                                  (14.65)

Clearly, the function w±_n, n ∈ N⁰, is a piecewise regular solution of problem (14.58) with the jump u_γ on γ:

    [w±]_n = w⁺_n − w⁻_n = u⁺_n + u⁻_n = u_n,   n ∈ γ.

Then, according to Definition 14.3, the function (14.65) is a difference potential with the jump u_γ. However, by virtue of Theorem 14.4, this difference potential coincides with the Cauchy-type difference potential that has the same density u_γ. Consequently,

    w⁺ = P⁺ u_γ,   w⁻ = −P⁻ u_γ.                               (14.66)

Comparing formula (14.66) with formula (14.65), we obtain formulae (14.63) and (14.64). Moreover, if we consider relations (14.63) and (14.64) only at the nodes n ∈ γ, we arrive at formulae (14.61) and (14.62).
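The splitting of the boundary trace can be demonstrated numerically in the same hypothetical 1D model used in the earlier sketches (second-difference operator, zero Dirichlet data, γ = {k, k+1}); the realization of the projections as v − G(θ(M±) A v) and all names are our assumptions. The point of the demo: from the total trace u_γ alone, the two projections recover the traces of the fields driven by the M⁻ and M⁺ sources separately.

```python
import numpy as np

# Model assumption: (A u)_m = u_{m-1} - 2 u_m + u_{m+1}, zero Dirichlet data.
N, k = 12, 5
A = np.zeros((N - 1, N + 1))
for m in range(1, N):
    A[m - 1, m - 1:m + 2] = [1.0, -2.0, 1.0]

def solve(f):
    u = np.zeros(N + 1)
    u[1:N] = np.linalg.solve(A[:, 1:N], f)
    return u

def P_trace(vg, side):
    v = np.zeros(N + 1)
    v[k], v[k + 1] = vg
    mask = np.zeros(N - 1)
    if side == '+':
        mask[:k] = 1.0
    else:
        mask[k:] = 1.0
    return (v - solve(mask * (A @ v)))[[k, k + 1]]

# Sources on BOTH sides of gamma; only the total trace u_gamma is "measured":
rng = np.random.default_rng(0)
f = rng.standard_normal(N - 1)
u_gamma = solve(f)[[k, k + 1]]

# Split the trace with the boundary projections:
Pp = np.column_stack([P_trace(e, '+') for e in np.eye(2)])
Pm = np.column_stack([P_trace(e, '-') for e in np.eye(2)])
u_g_plus, u_g_minus = Pp @ u_gamma, Pm @ u_gamma

# Compare with the traces of the fields driven by the exterior (M-) and
# interior (M+) sources separately; f itself is never shown to the projections:
f_minus = f.copy(); f_minus[:k] = 0.0          # sources on M- only
f_plus  = f.copy(); f_plus[k:]  = 0.0          # sources on M+ only
assert np.allclose(u_g_plus,  solve(f_minus)[[k, k + 1]])
assert np.allclose(u_g_minus, solve(f_plus)[[k, k + 1]])
assert np.allclose(u_g_plus + u_g_minus, u_gamma)
```

Note that the split uses only u_γ and the operator coefficients; the individual sources f_m remain unknown throughout, exactly as stipulated in the problem formulation.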
14.3.5  Problem of Active Shielding
Consider a difference boundary value problem on D⁰:

    Σ_{n∈N_m} a_mn u_n = f_m,   m ∈ M⁰,   u_{N⁰} ∈ U_{N⁰}.    (14.67)

For this problem, neither the right-hand side f_m, m ∈ M⁰, nor the solution u_n, n ∈ N⁰, are assumed to be given explicitly. The only available data are provided by the trace u_γ of the solution u_{N⁰} to problem (14.67) on the grid boundary γ. Let us introduce an additional term g_m on the right-hand side of the difference equation from (14.67). This term will be called an active control source, or simply an active control. Problem (14.67) then transforms into:

    Σ_{n∈N_m} a_mn w_n = f_m + g_m,   m ∈ M⁰,   w_{N⁰} ∈ U_{N⁰}.    (14.68)

We can now formulate a difference problem of active shielding of the grid subdomain N⁺ ⊂ N⁰ from the influence of the sources f_m located in the subdomain M⁻ ⊂ M⁰. We need to find all appropriate active controls g_m such that after adding them on the right-hand side of (14.67), the solution w_{N⁰} of the new problem (14.68) will coincide on the protected region N⁺ ⊂ N⁰ with the solution v_{N⁰} of the problem:

    Σ_{n∈N_m} a_mn v_n = θ_{M⁰}(M⁺) f_m,   m ∈ M⁰,   v_{N⁰} ∈ U_{N⁰}.    (14.69)

In other words, the effect of the controls on the solution on N⁺ should be equivalent to eliminating the sources f_m on the complementary domain M⁻. The following theorem presents the general solution for the control sources.

THEOREM 14.5
The solution w_{N⁰} of problem (14.68) coincides on N⁺ ⊂ N⁰ with the solution v_{N⁰} of problem (14.69) if and only if the controls g_m have the form:

    g_m = −θ_{M⁰}(M⁻) Σ_{n∈N_m} a_mn z_n,   m ∈ M⁰,           (14.70)

where {z_n} = z_{N⁰} ∈ U_{N⁰} is an arbitrary grid function with the only constraint that it has the same boundary trace as that of the original solution, i.e., z_γ = u_γ.
PROOF   First, we show that the solution u_{N⁰} ∈ U_{N⁰} of the problem:

    Σ_{n∈N_m} a_mn u_n = φ_m,   m ∈ M⁰,   u_{N⁰} ∈ U_{N⁰},    (14.71)
vanishes on N⁺ if and only if φ_m has the form:

    φ_m = θ_{M⁰}(M⁻) Σ_{n∈N_m} a_mn ξ_n,   m ∈ M⁰,            (14.72)

where ξ_{N⁻} ∈ U_{N⁻} is an arbitrary grid function that vanishes on γ: ξ_γ = 0. Indeed, let u_n = 0, n ∈ N⁺. The value of φ_m at the node m ∈ M⁺ and the values of u_n, n ∈ N_m ⊂ N⁺ = ∪_{m∈M⁺} N_m, are related via (14.71). Consequently, φ_m = 0, m ∈ M⁺. For the nodes m ∈ M⁻, relation (14.72) also holds, because we can simply substitute the solution u_{N⁰} of problem (14.71) for ξ_{N⁻}. Conversely, let φ_m be given by formula (14.72), where ξ_n = 0, n ∈ γ. Consider an extension ξ_{N⁰} ∈ U_{N⁰} of the grid function ξ_{N⁻} from N⁻ to N⁰, such that

    ξ_n = 0,   n ∈ N⁺.                                         (14.73)

Then, we can use a uniform expression for φ_m instead of (14.72):

    φ_m = Σ_{n∈N_m} a_mn ξ_n,   m ∈ M⁰.

Since the solution of problem (14.71) is unique, we have u_{N⁰} = ξ_{N⁰}. Therefore, according to (14.73), u_n = 0 for n ∈ N⁺. In other words, the solution of (14.71) driven by the right-hand side (14.72) with ξ_γ = 0 is equal to zero on N⁺.

Now we can show that formula (14.70) yields the general solution for active controls. More precisely, we will show that the solution w_{N⁰} of problem (14.68) coincides with the solution v_{N⁰} of problem (14.69) on the grid set N⁺ if and only if the controls g_m are given by (14.70). Let us subtract equation (14.69) from equation (14.68). Denoting z̃_{N⁰} = w_{N⁰} − v_{N⁰}, we obtain:

    Σ_{n∈N_m} a_mn z̃_n = θ_{M⁰}(M⁻) f_m + g_m,   m ∈ M⁰,   z̃_{N⁰} ∈ U_{N⁰}.    (14.74)

Moreover, according to (14.67), we can substitute f_m = Σ_{n∈N_m} a_mn u_n, m ∈ M⁻, into (14.74). Previously we have proven that z̃_n = 0 for all n ∈ N⁺ if and only if the right-hand side of (14.74) has the form (14.72), where ξ_γ = 0. Writing the right-hand side of (14.74) as:

    θ_{M⁰}(M⁻) Σ_{n∈N_m} a_mn u_n + g_m,

we conclude that this requirement is equivalent to the condition that g_m has the form (14.70), where z_γ = u_γ.   □
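Theorem 14.5 can be checked numerically in the same hypothetical 1D model as the earlier sketches (second-difference operator, zero Dirichlet data, γ = {k, k+1}); all identifiers are our assumptions. The control is supported on M⁻ and is built only from a function z that matches the measured trace on γ; here we take the simplest admissible choice, z equal to u on γ and zero elsewhere.

```python
import numpy as np

# Model assumption: (A u)_m = u_{m-1} - 2 u_m + u_{m+1}, zero Dirichlet data.
N, k = 12, 5                          # protected region N+ = {0..k+1}
A = np.zeros((N - 1, N + 1))
for m in range(1, N):
    A[m - 1, m - 1:m + 2] = [1.0, -2.0, 1.0]

def solve(f):
    u = np.zeros(N + 1)
    u[1:N] = np.linalg.solve(A[:, 1:N], f)
    return u

rng = np.random.default_rng(1)
f = rng.standard_normal(N - 1)        # sources everywhere, unknown to the control
u = solve(f)                          # original field; only its trace on gamma is used

# Simplest admissible z: equal to u on gamma = {k, k+1}, zero elsewhere.
z = np.zeros(N + 1)
z[[k, k + 1]] = u[[k, k + 1]]

# Control of the form (14.70): g_m = -sum_n a_mn z_n for m in M-, zero on M+.
g = -(A @ z)
g[:k] = 0.0

w = solve(f + g)                      # field with the active control switched on
f_plus = f.copy(); f_plus[k:] = 0.0
v = solve(f_plus)                     # field driven by the interior sources only

# On the protected region the control cancels the effect of the M- sources:
assert np.allclose(w[:k + 2], v[:k + 2])
# Away from the protected region the two fields differ in general:
assert not np.allclose(w[k + 2:], v[k + 2:])
```

Note how little data the control needs: only the coefficients a_mn near γ and the trace u_γ, not the sources f_m and not the solution away from the boundary.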
14.4  General Remarks
The method of difference potentials offers a number of unique capabilities that cannot be easily reproduced in the framework of other methods available in the literature. First and foremost, this pertains to the capability of taking only the boundary trace of a given solution and splitting it into two components, one due to the interior sources and the other one due to the complementary exterior sources with respect to a given region (see Sections 14.1.4 and 14.3.4). The resulting split is unambiguous; it can be realized with no constraints on the shape of the boundary and no need for a closed form fundamental solution of the governing equation or system. Moreover, it can be obtained directly for the discretization, and it enables efficient (numerical) solution of the problem of artificial boundary conditions (Sections 14.1.3 and 14.3.3) and the problem of active shielding (Sections 14.1.5 and 14.3.5).

The meaning, or physical interpretation, of the solutions that can be obtained and analyzed with the help of difference potentials, may vary. If, for example, the solution is interpreted as a wave field [governed, say, by the Helmholtz equation (12.32a)], then its component due to the interior sources represents outgoing waves with respect to a given region, and the component due to the exterior sources represents the incoming waves. Hence, the method of difference potentials provides a universal and robust procedure for splitting the overall wave field into the incoming and outgoing parts, and for doing that it only requires the knowledge of the field quantities on the interface between the interior and exterior domains. In other words, no knowledge of the actual waves' sources is needed, and no knowledge of the field beyond the interface is required either.

In the literature, the problem of identifying the outgoing and incoming waves has received substantial attention in the context of propagation of waves over unbounded domains. When the sources of waves are confined to a predetermined compact region in space, and the waves propagate away from this region toward infinity, numerical reconstruction of the wave field requires truncation of the overall unbounded domain and setting artificial boundary conditions that would facilitate reflectionless propagation of the outgoing waves through the external artificial boundary. Artificial boundary conditions of this type are often referred to as radiation or nonreflecting boundary conditions. Pioneering work on nonreflecting boundary conditions was done by Engquist and Majda in the late seventies and early eighties, see [EM77, EM79, EM81]. In these papers, boundary operators that enable one-way propagation of waves through a given interface (radiation toward the exterior) were constructed in the continuous formulation of the problem for particular classes of governing equations and particular (simple) shapes of the boundary. The approaches to their discrete approximation were discussed as well.
14.5  Bibliography Comments

The method of difference potentials was proposed by Ryaben'kii in his Habilitation thesis in 1969. The state of the art in the development of the method as of the year 2001 can be found in [Rya02]. Over the years, the method of difference potentials has benefited a great deal from fine contributions by many colleagues. The corresponding acknowledgments, along with a considerably more detailed bibliography, can also be found in [Rya02]. For further detail on artificial boundary conditions, we refer the reader to [Tsy98]. The problem of active shielding is discussed in [Rya95, LRT01]. Let us also mention the capacitance matrix method proposed by Proskurowski and Widlund [PW76, PW80] that has certain elements similar to those of the method of difference potentials.
List of Figures

1.1   Unavoidable error
2.1   Unstructured grid
3.1   Straightforward periodization
3.2   Periodization according to formula (3.55)
3.3   Chebyshev polynomials
3.4   Chebyshev interpolation nodes
4.1   Domain of integration
5.1   Finite-difference scheme for the Poisson equation
5.2   Sensitivity of the solution of system (5.30) to perturbations of the data
6.1   Eigenvalues of the matrix B = I − τA
6.2   |v_1| and |v_n| as functions of τ
6.3   Multigrid cycles
7.1   Experimental data
8.1   The method of bisection
8.2   The chord method
8.3   The secant method
8.4   Newton's method
8.5   Two possible scenarios of the behavior of iteration (8.5)
9.1   The shooting method
10.1  Continuous and discrete domain of dependence
10.2  Stencils of the upwind, downwind, and the leapfrog schemes
10.3  Stencils of the explicit and implicit schemes