Scientific Computing with Automatic Result Verification
This is volume 189 in MATHEMATICS IN SCIENCE AND ENGINEERING. Edited by William F. Ames, Georgia Institute of Technology. A list of recent titles in this series appears at the end of this volume.
SCIENTIFIC COMPUTING WITH AUTOMATIC RESULT VERIFICATION
Edited by
U. Kulisch
Institut für Angewandte Mathematik
Universität Karlsruhe
Karlsruhe, Germany
ACADEMIC PRESS, INC. Harcourt Brace Jovanovich, Publishers
Boston San Diego New York London Sydney Tokyo Toronto
This book is printed on acid-free paper. Copyright © 1993 by Academic Press, Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. ACADEMIC PRESS, INC., 1250 Sixth Avenue, San Diego, CA 92101-4311. United Kingdom Edition published by ACADEMIC PRESS LIMITED, 24-28 Oval Road, London NW1 7DX. ISBN 0-12-044210-8. Printed in the United States of America. 92 93 94 95 EB 9 8 7 6 5 4 3 2 1
Contents

Contributors ... vii
Preface ... ix
Acknowledgements ... x

E. Adams, U. Kulisch: Introduction ... 1

I. Language and Programming Support for Verified Scientific Computation

R. Hammer, M. Neaga, D. Ratz: PASCAL-XSC, New Concepts for Scientific Computation and Numerical Data Processing ... 15
Wolfgang V. Walter: ACRITH-XSC, A Fortran-like Language for Verified Scientific Computing ... 45
Christian Lawo: C-XSC, A Programming Environment for Verified Scientific Computing and Numerical Data Processing ... 71
G. Bohlender, D. Cordes, A. Knöfel, U. Kulisch, R. Lohner, W. V. Walter: Proposal for Accurate Floating-point Vector Arithmetic ... 87

II. Enclosure Methods and Algorithms with Automatic Result Verification

Hans-Christoph Fischer: Automatic Differentiation and Applications ... 105
Rainer Kelch: Numerical Quadrature by Extrapolation with Automatic Result Verification ... 143
Ulrike Storck: Numerical Integration in Two Dimensions with Automatic Result Verification ... 187
Hans-Jürgen Dobner: Verified Solution of Integral Equations with Applications ... 225
Wolfram Klein: Enclosure Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind ... 255
W. Rufeger and E. Adams: A Step Size Control for Lohner's Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions ... 283
Rudolf J. Lohner: Interval Arithmetic in Staggered Correction Format ... 301

III. Applications in the Engineering Sciences

Walter Krämer: Multiple Precision Computations with Result Verification ... 325
Beate Gross: Verification of Asymptotic Stability for Interval Matrices and Applications in Control Theory ... 357
Wera U. Klein: Numerical Reliability of MHD Flow Calculations ... 397
Ernst Adams: The Reliability Question for Discretizations of Evolution Problems. Part I: Theoretical Considerations on Failures ... 423; Part II: Practical Failures ... 465
R. Schütz, W. Winter, G. Ehret: KKR Bandstructure Calculations, A Challenge to Numerical Accuracy ... 527
Andreas Knöfel: A Hardware Kernel for Scientific/Engineering Computations ... 549
Gerd Bohlender: Bibliography on Enclosure Methods and Related Topics ... 571
Contributors

E. Adams, IAM
G. Bohlender, IAM
D. Cordes, Landstr. 104A, D-6905 Schriesheim, Germany
H. J. Dobner, Math. Inst. II, Univ. Karlsruhe, D-7500 Karlsruhe 1, Germany
G. Ehret, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany
H. C. Fischer, Tauberstr. 1, D-7500 Karlsruhe 1, Germany
B. Gross, Math. Inst. I, Univ. Karlsruhe, D-7500 Karlsruhe 1, Germany
R. Hammer, IAM
R. Kelch, Tilsiter Weg 2, D-6830 Schwetzingen, Germany
W. Klein, Kleinstr. 45, D-8000 München 70, Germany
W. U. Klein, Kleinstr. 45, D-8000 München 70, Germany
A. Knöfel, IAM
W. Krämer, IAM
U. Kulisch, IAM
Ch. Lawo, IAM
R. Lohner, IAM
M. Neaga, Numerik Software GmbH, Postf. 2232, D-7570 Baden-Baden, Germany
D. Ratz, IAM
W. Rufeger, School of Math., Georgia Inst. of Tech., Atlanta, GA 30332, USA
R. Schütz, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany
U. Storck, IAM
W. V. Walter, IAM
H. Winter, Kernforschungszentrum, Postfach 3640, D-7500 Karlsruhe 1, Germany

IAM is an abbreviation for: Institut für Angewandte Mathematik, Univ. Karlsruhe, Kaiserstr. 12, D-7500 Karlsruhe 1, Germany
Preface This book presents a collection of papers on recent progress in the development and applications of numerical algorithms with automatic result verification. We also speak of Enclosure Methods. An enclosure consists of upper and lower bounds of the solution. Their computability implies the existence of the unknown true solution which is being enclosed. The papers in this book address mainly the following areas:
I. the development of computer languages and programming environments supporting the totally error-controlled computational determination of enclosures;
II. corresponding software, predominantly for problems involving the differentiation or the integration of functions or for differential equations or integral equations; III. in the context of scientific computing, the mathematical simulation of selected major real world problems, in conjunction with parallel numerical treatments by means of software with or without total error control.
Concerning III, the practical importance of techniques with automatic result verification or Enclosure Methods is stressed by the surprisingly large and even qualitative differences of the corresponding results as compared with those of traditional techniques. These examples do not rest on "suitably chosen" ill-conditioned problems; rather, they have arisen naturally in research work in several areas of engineering or physics. The "surprise character" of these examples is due to the fact that computed enclosures guarantee the existence of the enclosed true solution, whereas there is no such implication in the case that a numerical method has been executed without total error control. The bulk of the papers collected in this book represents selected material taken from doctoral or diploma theses which were written at the Institute for Applied Mathematics at the University of Karlsruhe. Concerning Enclosure Methods, the level of development being addressed here rests on extensive research work. The essentially completed developmental stages comprise in particular large classes of finite-dimensional problems and ordinary differential equations with initial or boundary conditions. We refer to the list of literature at the end of the book. Every paper in this book contains a list of references, particularly with respect to hardware and software supporting Enclosure Methods. This book should be of interest to persons engaged in research and/or development work in the following domains:
(A) the reliable mathematical simulation of real world problems, (B) Computer Science, and (C) mathematical proofs involving, e.g., a quantitative verification of the mapping of a set into itself. The diagnostic power of numerical algorithms with automatic result verification or Enclosure Methods is of particular importance concerning (A). In fact, their total reliability removes the possibility of numerical errors as a cause of discrepancies between applications of a mathematical model and physical experiments. For a diagnostic application of this kind, it is irrelevant that the cost of Enclosure Methods occasionally exceeds that of corresponding numerical methods without a total error control. For the papers in this book, the background prerequisites are essentially three years of Calculus, including Numerical Analysis, leading to the equivalent of a B. S. degree, and the corresponding level of knowledge and experience in the employment of computer systems.
Acknowledgements It is a pleasure to acknowledge with gratitude the support received for parts of the presented research from IBM Corporation and Deutsche Forschungsgemeinschaft (DFG). We appreciate the enthusiastic encouragement for this compilation of papers which we have received from Academic Press and Professor W. F. Ames, the editor of the book series. We are grateful to our present and former students and coworkers for their contributions to this collection. Finally, we wish to thank our colleague Walter Krämer for taking over the responsibility for the final layout of the book.
Dedication This book is dedicated to Professor Dr. J. Heinhold and Professor Dr. J. Weissinger on the occasion of their 80th birthdays. We are particularly grateful for many years of a fruitful and enjoyable collaboration.
E. Adams and U. Kulisch
Introduction
E. Adams and U. Kulisch
1  On Scientific Computing with Automatic Result Verification
As stated in the Preface, the totally error-controlled computational determination of an enclosure or, synonymously, an inclusion or automatic result verification rests on contributions of the following four kinds: (A) suitable programming languages, software, and hardware for a reliable, fast, and skillful execution of all arithmetic operations;
(B) for problems in Euclidean spaces, algorithms translating into a set of machine numbers and, therefore, into (A); (C) mathematical methods and corresponding algorithms relating problems in function spaces to suitable problems in Euclidean spaces; (D) links relating (A), (B), and (C) such that the enclosure property and the guaranteed existence are valid for the unknown enclosed true solution. The computer basis (A) of numerical algorithms with automatic result verification or enclosure methods is addressed in Part I of this book and in Sections 2, 3, and 6 of this Introduction. As stated in the Preface, the development of algorithms concerning (B) has essentially been completed. Again we refer to the list of literature at the end of the book. Part II of this book is mainly concerned with the domains (C) and (D), for problems involving differentiations, or integrations, or differential equations. In a large number of case studies for problems in the domains (B) or (C), it has been shown that traditional "high-precision" numerical methods or computer systems may deliver quantitatively and even qualitatively incorrect results, unless there is a total error control. Since such failures are particularly important in the mathematical simulation of real world problems, Part III of this book presents case studies of this kind.
2  The Hardware-Basis for Automatic Result Verification
The speed of digital computers is ever increasing. Consequently, less and less can be said about the accuracy of the computed results. In each one of the preceding two or three decades, the computational speed increased by a factor of roughly 100. For the present decade, another growth factor of approximately 1000 is expected. The transfer from electro-mechanical to electronic digital computers also involved a factor of approximately 1000. Since then, the speed of the most advanced computational systems has grown by a factor of about 10^8. But it is not just the speed of the computers that has been tremendously increased. In quantity, too, the available computing power has grown to an extraordinarily large scale. In fact, millions of Personal Computers and Workstations have been sold during the last years. In view of this huge automatic computational basis, there arises the question of the reliability of the delivered results. Already 30 years ago, important computations were carried out by means of a floating-point arithmetic of about 17 decimal mantissa digits. At present, the situation is not much different. With speeds of up to 10^9 arithmetic operations per second, major computational projects are now carried out which need hours of CPU time. Obviously, a traditional estimate of the accumulated computational error is not possible any more. For this reason, the computational error is often totally ignored and not even mentioned in the bulk of the contemporary computer-oriented literature. Some users attempt a heuristic confirmation or try to justify the computed results by plausibility arguments. Other users are hardly aware of the fact that the computed results may be grossly incorrect. At present, the arithmetic of a digital computer is in general of "maximal accuracy". This means that the computed result of an individual arithmetic operation differs from the true one by at most one or one half unit of the last mantissa digit.
For sequences of such operations, however, the quality of the delivered result may fail to be valid fairly rapidly. As an example, consider the following simple algorithm: z := a + b - a.
If a and b are stochastically chosen from among the standard data formats single, double, or quadruple precision, the computed result for z is completely incorrect in 40% to 50% of all data choices. This means that the result of a computation consisting of only two arithmetic operations is incorrect with respect to all mantissa digits, and perhaps even the sign and the exponent. This kind of failure occurs for all usual arithmetic data formats! The usual arithmetic standards are not excluded. For the corresponding more general algorithm z := a * b + c, a completely incorrect result of the kind observed before is still obtained in an essential percentage of all cases. If this algorithm is executed for vectors a, b, c in R^n, then the percentage of incorrect results approaches 100 as n increases. Of course, more involved examples may be chosen where the reason for an incorrect result is not as obvious. As an example, the reader may try to compute the scalar product of the following vectors:

a1 =  2.718281828      b1 = 1486.2497
a2 = -3.141592654      b2 = 878366.9879
a3 =  1.414213562      b3 = -22.37492
a4 =  0.5772156649     b4 = 4773714.647
a5 =  0.3010299957     b5 = 0.000185049

The correct value of the scalar product is -1.00657107 x 10^-11. It is a matter of fact that errors of this kind could easily be avoided if accumulations were executed in fixed-point instead of floating-point arithmetic. Computers have been invented to take over complicated jobs from man. The obvious discrepancy between the computational performance and the mastering of the computational error suggests the transfer of the job of error estimation and its propagation to the computer. By now, this has been achieved for practically all standard problems in Numerical Analysis and numerous classical applications. To achieve this, it is primarily necessary to make the computer more powerful arithmetically as compared with the usual floating-point set-up. In addition to the four basic arithmetic operations, the following generalized arithmetic operations should be carried out with maximal accuracy: the operations for real and complex numbers, vectors, and matrices as well as for the corresponding real and complex intervals, interval vectors, and interval matrices. In this context, the scalar product of two vectors is of basic importance. It is necessary to execute this operation with maximum accuracy; i.e., the computer must be able to determine sums of products of floating-point numbers such that the computed result differs from the correct result by at most one or one half unit of the last mantissa digit. This is then said to be the optimal scalar or dot product. For this particular task, fast circuits have been developed in the immediate past for all kinds of computers: Personal Computers, Workstations, Mainframes, and supercomputers. It is a surprising fringe benefit that these circuits generally need less time for the evaluation of the optimal scalar product than traditional circuits, which may deliver a totally incorrect result. It is an interesting fact that already 100 years ago computers were on the market which were able to deliver the correct result for scalar products, i.e., for sums of products of numbers. The technology then used may be considered as a direct predecessor of the presently employed electronic circuits. However, since electronic realizations of the optimal dot product were not available in computers on the market during the last 40 years, algorithms using this quality have not been systematically developed during this period of time. At present, this causes certain difficulties concerning the acceptance of this new tool. In fact, computers govern
man and his reasoning: what is not known cannot be used with sophistication or even with virtuosity. It is hoped that this book provides a bridge which helps to remove these difficulties. Numerous examples in this book demonstrate that the new and more powerful computer arithmetic as outlined in this Introduction allows the computational assessment of correct results in cases where traditional floating-point arithmetic fails. However, the new and more powerful computer arithmetic has been developed for standard applications, not just for extreme problems. Traditionally, there is frequently a need for numerous test runs employing variations of the working precision or of the input data in order to estimate the reliability of the computed result. By means of the new technique, there is usually no such need. In fact, just one computer run in general yields a totally reliable computed result.
3  Connection with Programming Languages
In the translation of the new approach to computer arithmetic into practice, serious difficulties arise immediately. Only the four basic arithmetic operations are made available by the usual programming languages such as ALGOL, FORTRAN, BASIC, PASCAL, MODULA, or C. Using these languages, it is practically impossible to address a maximally accurate scalar product or a maximally accurate matrix product directly! The usual simulation of these operations by floating-point arithmetic is responsible for the kind of errors indicated by the examples above. Suitable extensions of these languages provide the only meaningful way out of these difficulties. These extensions should provide all operations in the commonly used real and complex vector spaces and their interval correspondents by the usual mathematical operator symbols. Corresponding data types should be predefined in these languages. As an example, let the variables a, b and c be real square matrices. Then a matrix product with maximal accuracy is simply addressed by c := a*b. An operator notation for practically all arithmetic operations simplifies programming in these language extensions significantly. Programs are much easier to read. They are much easier to debug and thus become much more reliable. But in particular, all these operations are maximally accurate. Already in the 1970s, a corresponding extension of the programming language PASCAL was developed and implemented in a cooperative project of the Universities of Karlsruhe and Kaiserslautern (U. Kulisch and H. W. Wippermann). The applicability of the new language, PASCAL-SC, however, was severely restricted by the fact that there were only two code-generating compilers available, for the Z-80 and MOTOROLA 68000 processors.
In another cooperation, between IBM and the Institute for Applied Mathematics at the University of Karlsruhe, a corresponding extension of the programming language FORTRAN was developed and implemented in the 1980s. The result is now available as an IBM program product for systems of the /370 architecture. It is called ACRITH-XSC. The programming language ACRITH-XSC is upwardly compatible with FORTRAN 77; concerning numerous details, ACRITH-XSC is comparable with the new language FORTRAN 90. With respect to its arithmetic features, however, ACRITH-XSC exceeds the new FORTRAN by far. Parallel to the development of ACRITH-XSC, the programming language PASCAL-SC has been further developed at the Institute for Applied Mathematics at the University of Karlsruhe. Analogously to the IBM program product ACRITH-XSC, the new language is being called PASCAL-XSC. For this language, there is a compiler available which translates into the programming language C; additionally, the extended run-time system for all arithmetic routines of PASCAL-XSC has been written in C. Consequently, PASCAL-XSC may be used on practically all computers possessing a C compiler. A Language Reference of PASCAL-XSC with numerous examples has been published by Springer-Verlag in German and in English. A translation into Russian is under preparation. Compilers for PASCAL-XSC are now available on the market. They can be purchased from Numerik Software GmbH, P. O. Box 2232, D-7570 Baden-Baden, Germany. The new computer language C++ possesses several new features such as an operator concept, overloading of operators, generic names for functions, etc., which are also available in the programming languages PASCAL-SC, ACRITH-XSC, and PASCAL-XSC. With these features of C++, it is possible to provide arithmetic operations with maximal accuracy, standard functions for various types of arguments, etc. in C++ without developing a new compiler.
Using these features of C++, a C++ module for numerical applications, as an extension of the programming language C, has been developed at the Institute for Applied Mathematics at the University of Karlsruhe. This programming environment is called C-XSC. In order to employ it, it is sufficient for a user to be familiar with the programming language C and this arithmetic-numerical module extension. A knowledge of C++ is not required. This module, however, can also be used in conjunction with C++ programs. A C-XSC program has to be translated by a C++ compiler. An identical run-time system is used by C-XSC and PASCAL-XSC. Therefore, identical results are obtained by corresponding programs written in these languages. In the present book, the contributions to Part I provide brief introductions to the programming languages PASCAL-XSC and ACRITH-XSC as well as to the programming environment C-XSC. The run-time system of PASCAL-XSC and C-XSC has been written in C. It provides the optimal real and complex arithmetic for vectors, matrices, and corresponding intervals. The execution of these routines would need significantly smaller computation
times if there were suitable support by the computer hardware. The fourth contribution in Part I of this book therefore addresses the computer manufacturers. It contains a proposal of what has to be done on the side of computer hardware in order to support a real and complex vector, matrix, and interval arithmetic! We would like to stress the fact that a realization of this proposal would result in significantly faster computer systems. This is a consequence of the fact that a hardware realization of the optimal scalar product allows a significantly simpler and faster execution of this operation as compared with the traditional, inaccurate execution of the dot product by means of the given floating-point arithmetic. By use of the programming languages PASCAL-XSC and ACRITH-XSC in the last few years, and recently also by use of C-XSC, programming packages have been developed for practically all standard problems of Numerical Analysis where the computer verifies automatically that the computed results are correct. So, today there are problem-solving routines with automatic result verification available for systems of linear algebraic equations and the inversion of matrices with coefficients of the types real, complex, interval, and complex interval. Corresponding routines are also available for the computation of eigenvalues and eigenvectors of matrices, for the evaluation of polynomials, for roots of polynomials, for the accurate evaluation of arithmetic expressions, for the solution of nonlinear systems of algebraic equations, for numerical quadrature, for the solution of large classes of systems of linear or nonlinear ordinary differential equations with initial or boundary conditions, or corresponding eigenvalue problems, for certain systems of linear or nonlinear integral equations, etc. In most cases, the code verifies automatically the existence and the local uniqueness of the enclosed solution.
If no solution is found, a message is given to the user that the employed algorithm could not solve the problem. In the case of differential or integral equations, the routines deliver continuous upper and lower bounds for the solution of the problem. Solutions obtained by means of these methods have the quality of a theorem in the sense of pure mathematics. The computer proves these theorems or statements by means of the clean arithmetic in a mathematically correct and reproducible manner, often by very many tiny steps. In order to obtain these results, numerous well-known and powerful methods of Numerical Analysis are employed. Additionally, entirely new constructive approaches and tools had to be developed. Examples for techniques of this kind are presented in Part II of this book.
4  Enclosure Methods, Predominantly for Problems in Function Spaces
For problems in Numerical Algebra or Optimization, residual correction or iterative refinement is the essential tool. Usually, these techniques are not applied to the
computed approximation but, rather, to its error. Because of cancellation, this well-known classical tool generally fails when a usual floating-point arithmetic is employed. In fact, these tools become practically useful in conjunction with an optimal scalar product and interval methods. An interval arithmetic with multiple precision is another very useful tool. The paper by R. Lohner in Part II of this book is concerned with a PASCAL-XSC routine for a version of this method that has been called Staggered Arithmetic by H. J. Stetter. By means of the optimal scalar product and PASCAL-XSC, this arithmetic can be implemented very elegantly by easily readable programs. Automatic differentiation is an important tool for the development of methods possessing a built-in verification property, mainly for problems relating to function spaces. The first paper in Part II of the book, by H. C. Fischer, presents an introduction to these techniques. For sufficiently smooth functions, derivatives and Taylor coefficients may hereby be determined very efficiently. They can be enclosed automatically by means of interval methods. Concerning the numerical integration of functions by quadrature formulas, the remainder terms are usually neglected in the execution of traditional numerical methods. These terms are fully taken into account by all methods with automatic result verification. In the remainder term, the unknown argument is replaced by an interval containing this argument. For this purpose, interval bounds for the derivatives occurring in the remainder term are computed by interval techniques. Because of the usual factors 1/n! and a high power of the step size, the remainder term, as the procedural error, can usually be made smaller than a required accuracy bound. By means of the remainder term and rounded interval arithmetic, both the procedural and the rounding errors are fully taken into account; this yields a mathematically guaranteed result.
The second paper in Part II of the book, by R. Kelch, treats the problem of numerical quadrature in the spirit of extrapolation methods. For every element of the usual extrapolation table, a representation of the remainder term is determined by means of the Euler-Maclaurin summation formula. The integration begins with the evaluation of this remainder term for a particular element of the tableau. If the result is less than a prescribed error bound, the value of the integral is then determined by means of an interval scalar product of the vector collecting the values of the function at the nodes and the vector of the coefficients of the quadrature formula, which are stored in the memory of the computer. The third paper in Part II, by U. Storck, presents an outline of this method for the case of multidimensional integrals. Integral equations are treated in the fourth and the fifth paper in Part II of the book. The methods presented by W. Klein in the fifth paper have been applied successfully in the case of large systems of nonlinear integral equations. The kernel is replaced by a two-dimensional finite Taylor expansion with remainder term, yielding the sum of a degenerate kernel and a contracting remainder kernel. Both parts can then be treated by standard techniques, and enclosures are obtained by
interval methods. The fourth paper of Part II, by H. J. Dobner, shows among other things that suitable problems in partial differential equations can be reduced to integral equations and then treated with Enclosure Methods. Suitable problems with integral equations can also be represented equivalently by differential equations and then treated with Enclosure Methods and automatic verification of the result. Lohner's enclosure algorithms for systems of ordinary differential equations are now the most widely used tool for determining verified enclosures of the solution of initial or boundary value problems as well as of corresponding eigenvalue problems. For the case of initial value problems (IVPs), Lohner has developed a program package called AWA, which is available by means of the computer languages PASCAL-XSC, ACRITH-XSC, and C-XSC. In the case of IVPs, a suitable a priori choice of the step size h is difficult. As a supplement to AWA, the sixth paper in Part II of the book, by W. Rufeger and E. Adams, presents an automatic control of the step size h and its application to the Restricted Three Body Problem in Celestial Mechanics.
5  On Applications, Predominantly in Mathematical Simulation
For at least 5000 years, computational tasks have arisen from practical needs. As concerned with the computational aspects of mathematical simulations of real world problems, "Scientific Computing" is now considered to be the third major domain in the Sciences, in addition to "Theory" and "Experiment". In the absence of a total error control, "Scientific Computing" may deliver quantitatively and even qualitatively incorrect results. "Automatic Result Verification", the topic of this monograph, implies reliability of the numerical computation. In case of errors of the hardware, the operating software, or the algorithm, the execution of a verification step usually fails. The development of methods, algorithms, etc. possessing this verifying property is irrelevant unless they are applicable to non-artificial mathematical simulations, i.e., models chosen not only by mathematicians but, rather, by physicists, engineers, etc. This applicability is demonstrated by the contributions in Part III of this book and by the extensive existing literature as outlined in an appendix. By means of the computer languages supporting the determination of verified enclosures, numerous mathematical models taken from outside of mathematics have been treated. In the majority of these case studies, failures of traditional numerical methods, which are not totally error-controlled, were observed. A large portion of these publications appeared in the proceedings of the annual conferences which, since 1980, have been jointly conducted by GAMM (Gesellschaft für Angewandte Mathematik und Mechanik) and IMACS (International Association for Mathematics and Computers in Simulation). These conferences were devoted to the areas of "Computer Arithmetic, Scientific Computation, and Automatic
Result Verification". Concerning mathematical simulations, the papers presented in the proceedings of these conferences belong to different domains. We mention a few of them. Mechanical Engineering: turbines at high numbers of revolutions, vibrations of gear drives or rotors, robotics, geometrical methods of CAD systems, geometrical modelling; Civil Engineering: nonlinear embedding of pillars, the centrally compressed buckling beam; Electrical Engineering: analysis of filters, optimization of VLSI circuits, simulation of semiconducting diodes; Fluid Mechanics: plasma flows in channels, infiltration of pollutants into groundwater, dimensioning of wastewater channels, analysis of magneto-hydrodynamic flows at large Hartmann numbers; Chemistry: the periodic solution of the Oregonator problem, numerical integration in chemical engineering; Physics: high-temperature superconductivity, optical properties of liquid crystals, expansions of solutions of the Schrödinger equation with respect to wave functions, rejection of certain computed approximations in Celestial Mechanics or concerning the Lorenz equations. Persons engaged in the mathematical simulation of a real-world problem usually are not computer specialists. Consequently, they expect software not requiring special knowledge or experience. The first paper in Part III, by W. Krämer, demonstrates the power and elegance of the programming tools under discussion in this book. In this paper, PASCAL-XSC codes are presented for validating computations in cases where single precision is not sufficient or appropriate. The programs use available PASCAL-XSC modules for a long real or a long interval arithmetic. Because of the employed operator notation, the codes can be read just like a technical report. Each one of the examples in this paper demonstrates the power of the available programming environment, particularly by means of a comparison with a corresponding PASCAL code.
Covering many pages, a code of this kind would be lengthy and almost unreadable; it therefore would be almost outside a user's control. The second paper in Part III, by B. Gross, addresses classical control theory on the basis of systems of linear ordinary differential equations (ODEs), y' = Ay, where A represents a constant matrix. In a realistic mathematical simulation, intervals must be admitted for the values of at least some of the elements of A. Concerning applications, it is then desirable to obtain verified results on the asymptotic stability of y' = Ay and the corresponding degree of stability. For this purpose, four constructive methods are developed such that the interval matrix admitted for A is addressed directly. Consequently, there is no need to employ the characteristic polynomial and its roots, i.e., quantities which would have to be computed prior to a stability analysis. As a major example, the automated electro-mechanical track control of a city bus is presented. In the third paper in Part III, by W. U. Klein, a discretization of a parameter-dependent boundary value problem with a nonlinear partial differential equation is investigated. For many years, solutions of the system of difference equations could not be approximated reliably for high values of this parameter when making use of standard numerical methods. As shown in the paper, an
employment of Enclosure Methods allows a reliable determination of the difference solutions, even in the case of high values of the parameter. The fourth and the fifth paper in Part III, by E. Adams, address mathematical simulations by means of ordinary differential equations (ODEs) and partial differential equations (PDEs). In particular, the following problem areas are discussed: (a) in conjunction with an automatic existence proof, a verified determination of true periodic solutions
• of systems of nonlinear autonomous ODEs, particularly the Lorenz equations in Dynamical Chaos, and
• of systems of linear ODEs with periodic coefficients, particularly the ones arising in the analysis of vibrations of gear drives;
(b) for discretizations of nonlinear ODEs, the existence and the actual occurrence of spurious difference solutions: while exactly satisfying the system of difference equations, they do not approximate any true solution of the ODEs, not even qualitatively; (c) for the Lorenz equations and the ones of the Restricted Three Body Problem in Celestial Mechanics, the occurrence of diverting difference approximations: in the course of time, these computed approximations of difference solutions are close to grossly different true solutions of the ODEs. In Adams' chapters in Part III, the overriding topic is the unreliability of difference methods, as has been shown by means of Enclosure Methods. The severity of this problem area can be characterized by the title of a paper which will appear in the Journal of Computational Physics: "Computational Chaos May be Due to a Single Local Error". The chapters by Adams highlight the need for further development
• of hardware and software supporting Enclosure Methods and
• of efficient mathematical methods for the determination of enclosures of true solutions of PDEs that cannot be enclosed through the available techniques or through a preliminary "approximation" of the PDEs by systems of ODEs, e.g., by means of finite elements.
The sixth paper in Part III, by Ehret, Schütz, and Winter, addresses a problem involving quantitative work concerning the Schrödinger equations. Just as in the third paper in this part of the book, applications of traditional numerical methods have failed conspicuously; however, reliable results were determined by means of verifying Enclosure Methods.
6 Concluding Remarks Concerning Computer Arithmetic Supporting Automatic Result Verification
In the opinion of the authors, the properties of computer arithmetic should be defined within the programming language. When addressing an arithmetic operation, the user should be fully aware of the computer's and the compiler's response. Only thus does the computer become a reliable mathematical tool. When the first computer languages were created in the 1950s, there were no sufficiently simple means available for the definition of a computer arithmetic. Consequently, this issue was ignored and the implementation of a computer arithmetic was left to the manufacturer. A jungle of realizations was the consequence. In fact, there appeared specialists and even schools of people ridiculing arithmetic shortcomings of individual computer systems. Basically, a search of this kind is irrelevant and idle. Rather, it should be attempted to find better and correct approaches to this problem area. This has been done by now, and computer arithmetic can be defined as follows: if the data format and a data type are given, then every arithmetic operation which is provided by the computer hardware or addressable in the programming language must fulfil the properties of a semimorphism. All arithmetic operations thus defined possess maximum accuracy and other desirable properties; i.e., the computed result differs from the correct one by at most one rounding. Usually, in the case of floating-point operations, there is only a marginal difference between a traditional implementation of the arithmetic and one governed by the rigorous mathematical principles of a semimorphism. Consequently, the implementation of the properties of a semimorphism is not much more complicated. Rather, if vector and matrix operations are appropriately taken into account during the design of the hardware arithmetic unit, the computer becomes considerably faster. Basically, the arithmetic standards 754 and 854 of IEEE, ANSI, and ISO are a step in the desired direction.
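Maximum accuracy in this sense, namely that the computed result is the exact result subjected to a single rounding, can be checked experimentally for IEEE double-precision addition. The following Python sketch is our illustration (not part of the book); it uses exact rational arithmetic to verify that no representable number is closer to the exact sum than the computed one:

```python
from fractions import Fraction
import math

def is_correctly_rounded(a: float, b: float) -> bool:
    """Check that the float sum a + b equals the exact sum of a and b
    subjected to a single round-to-nearest (maximum accuracy)."""
    computed = a + b
    exact = Fraction(a) + Fraction(b)      # exact rational sum
    # The two neighbouring representable numbers of the computed result:
    lo = math.nextafter(computed, -math.inf)
    hi = math.nextafter(computed, math.inf)
    err = abs(Fraction(computed) - exact)
    # Correct rounding: no neighbour is strictly closer to the exact value.
    return err <= abs(Fraction(lo) - exact) and err <= abs(Fraction(hi) - exact)

print(is_correctly_rounded(0.1, 0.2))   # True: at most one rounding occurred
```

The same test applied to a chain of operations would generally fail, which is exactly the point of the semimorphic dot product discussed later.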
The arithmetic operations thus provided realize a semimorphic floating-point and interval arithmetic. It is regrettable, however, that no language support has been made available allowing an easy use of interval arithmetic. This is all the more regrettable since prototypes concerning the arithmetic hardware as well as the language support were already available at Karlsruhe 25 years ago. In this context, essential progress is made available by the programming languages PASCAL-XSC, ACRITH-XSC, and C-XSC, which have been mentioned in Section 3 of this Introduction and which will be further characterized in Part I of the book. They provide a universal computer arithmetic. Additionally, they allow a simple handling of semimorphic operations by means of the usual operator symbols that are well known in mathematics. This is still true in product spaces like intervals and complex numbers, and for vectors and matrices of the types real, complex, interval, and complex interval. Regrettably, the IEEE standards 754 and
854 referred to before do not support the operators in the product spaces which have just been addressed. Therefore, it is very hard to convince manufacturers that more hardware support for arithmetic is needed than just the IEEE floating-point arithmetic. In particular, a vector processor should, of course, provide semimorphic vector and matrix operations of highest accuracy. A software simulation of semimorphic vector and matrix operations comes at the expense of speed, whereas a gain in computing speed is to be expected in the case of hardware support. This problem is aggravated by the fact that processors implementing the IEEE arithmetic standard 754 do not deliver products of double length; consequently, these have to be simulated by means of software, with a resulting considerable loss of speed. Products of double length are indispensable and essential for the semimorphic determination of products of vectors and matrices. The fourth contribution in Part I of the book presents a proposal concerning a supplement of existing computer arithmetics or arithmetic standards, supporting semimorphic computer arithmetic for vectors and matrices. Detailed investigations reveal that the additional costs are small, provided there is a homogeneous design of the overall arithmetic unit. The paper by A. Knöfel in Part III of the book studies more closely the hardware realization of such an arithmetic unit.
A hardware support of the optimal dot product allows a simple realization of a long real arithmetic for various basic types, as is shown by the seventh paper in Part II of the book. Numerous codes in Computer Algebra could be considerably accelerated if a hardware support of the optimal scalar product were available. The first paper in Part III demonstrates the then possible straightforward coding and execution, even of very complicated algorithms. Consequently, we strongly recommend a hardware support of the arithmetic proposal in the fourth paper in Part I as well as a revision or an extension of the existing standards. The authors hope that this book will help to convince users and manufacturers of the importance of progress needed in the domain of computer arithmetic, concerning both hardware and standards. Basically, it is not acceptable that maximum accuracy is required only in the case of the basic four arithmetic floating-point operations for operands of type real, while this is not so with respect to complex numbers or vectors or matrices. In the case of a homogeneous design of the arithmetic unit, additional hardware costs are small. But it makes an essential difference whether a correct result of an operation is always delivered or only frequently. In the latter case, a user has to think about and study the achieved accuracy for every individual operation, and this perhaps a million times every second! In the first case, this is wholly unnecessary. Karlsruhe, May 1992
E. Adams and U. Kulisch
I. Language and Programming Support for Verified Scientific Computation
PASCAL-XSC: New Concepts for Scientific Computation and Numerical Data Processing
R. Hammer, M. Neaga, and D. Ratz
The new programming language PASCAL-XSC is presented with an emphasis on its new concepts for scientific computation and numerical data processing. PASCAL-XSC is a universal PASCAL extension with extensive standard modules for scientific computation. It is available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. By using the mathematical modules of PASCAL-XSC, numerical algorithms which deliver highly accurate and automatically verified results can be programmed easily. PASCAL-XSC simplifies the design of programs in engineering and scientific computation by modular program structure, user-defined operators, overloading of functions, procedures, and operators, functions and operators with arbitrary result type, dynamic arrays, arithmetic standard modules for additional numerical data types with operators of highest accuracy, standard functions of high accuracy, and exact evaluation of expressions. The most important advantage of the new language is that programs written in
PASCAL-XSC are easily readable. This is due to the fact that all operations, even
those in the higher mathematical spaces, have been realized as operators and can be used in conventional mathematical notation.
In addition to PASCAL-XSC a large number of numerical problem-solving routines with automatic result verification are available. The language supports the development of such routines.
1 Introduction
These days, the elementary arithmetic operations on electronic computers are usually approximated by floating-point operations of highest accuracy. In particular, for any choice of operands this means that the computed result coincides with the rounded exact result of the operation. See the IEEE Arithmetic Standard [3] as an example. This arithmetic standard also requires the four basic arithmetic operations +, -, *, and / with directed roundings. A large number of processors already on the market provide these operations. So far, however, no common programming language allows access to them.
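A directed rounding delivers the largest floating-point number not above the exact result (downward) or the smallest not below it (upward). Python gives no direct access to the IEEE rounding modes, but for a single operation they can be emulated by correcting the round-to-nearest result with `math.nextafter`; the following sketch is our illustration under that assumption:

```python
import math
from fractions import Fraction

def add_down(a: float, b: float) -> float:
    """a + b rounded toward -infinity (downwardly directed rounding)."""
    r = a + b                         # round-to-nearest result
    # If rounding to nearest went up, step down one representable number.
    return math.nextafter(r, -math.inf) if Fraction(r) > Fraction(a) + Fraction(b) else r

def add_up(a: float, b: float) -> float:
    """a + b rounded toward +infinity (upwardly directed rounding)."""
    r = a + b
    return math.nextafter(r, math.inf) if Fraction(r) < Fraction(a) + Fraction(b) else r

lo, hi = add_down(0.1, 0.2), add_up(0.1, 0.2)
print(lo < hi)                    # True: the exact sum is not representable
print(lo <= 0.1 + 0.2 <= hi)      # True: the directed results bracket it
```

These two roundings are precisely what an interval addition needs for its lower and upper bounds.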
On the other hand, there has been a noticeable shift in scientific computation from general purpose computers to vector and parallel computers. These so-called
supercomputers provide additional arithmetic operations such as "multiply and add", "accumulate", or "multiply and accumulate" (see [10]). These hardware operations should always deliver a result of highest accuracy, but as yet, no processor which fulfills this requirement is available. In some cases, the results of numerical algorithms computed on vector computers are totally different from the results computed on a scalar processor (see [13], [31]). Continuous efforts have been made to enhance the power of programming languages. New powerful languages such as ADA have been designed, and enhancement of existing languages such as FORTRAN is in constant progress. However, since these languages still lack a precise definition of their arithmetic, the same program may produce different results on different processors. PASCAL-XSC is the result of a long-term venture by a team of scientists to produce a powerful tool for solving scientific problems. The mathematical definition of the arithmetic is an intrinsic part of the language, including optimal arithmetic operations with directed roundings which are directly accessible in the language. Further arithmetic operations for intervals and complex numbers and even vector/matrix operations provided by precompiled arithmetical modules are defined with maximum accuracy according to the rules of semimorphism (see [25]).
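Why vector and scalar machines can disagree is easy to demonstrate: floating-point addition is not associative, so an accumulation performed in a different order can yield a different result. A small Python illustration (ours, not from the paper):

```python
# Floating-point addition is not associative: a reordered (vector-style)
# accumulation can differ from sequential summation on a scalar processor.
xs = [1e16, 1.0, -1e16, 1.0]

sequential = 0.0
for x in xs:
    sequential += x          # left-to-right, as on a scalar processor

# A reordered sum, as a pairwise/vector accumulation might perform it:
pairwise = (xs[0] + xs[2]) + (xs[1] + xs[3])

print(sequential)   # 1.0  (the first 1.0 was absorbed by 1e16)
print(pairwise)     # 2.0  (the exact result)
```

An accumulation with maximum accuracy, as required above, would deliver 2.0 regardless of the order of the operands.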
2 The Language PASCAL-XSC
PASCAL-XSC is an extension of the programming language PASCAL for Scientific
Computation. A first approach to such an extension (PASCAL-SC) has been available since 1980. The specification of the extensions has been continuously improved in recent years by means of essential language concepts, and the new language PASCAL-XSC [20],[21] was developed. It is now available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. PASCAL-XSC contains the following features:
• Standard PASCAL
• Universal operator concept (user-defined operators)
• Functions and operators with arbitrary result type
• Overloading of procedures, functions, and operators
• Module concept
• Dynamic arrays
• Access to subarrays
• String concept
• Controlled rounding
• Optimal (exact) scalar product
• Standard type dotprecision (a fixed point format to cover the whole range of floating-point products)
• Additional arithmetic standard types such as complex, interval, rvector, rmatrix, etc.
• Highly accurate arithmetic for all standard types
• Highly accurate standard functions
• Exact evaluation of expressions (#-expressions)
The new language features, developed as an extension of PASCAL, will be discussed in the following sections.
2.1 Standard Data Types, Predefined Operators, and Functions
In addition to the data types of standard PASCAL, the following numerical data types are available in PASCAL-XSC:

  complex    interval    cinterval
  rvector    ivector     cvector    civector
  rmatrix    imatrix     cmatrix    cimatrix

where the prefix letters r, i, and c are abbreviations for real, interval, and complex. So cinterval means complex interval and, for example, cimatrix denotes complex interval matrices, whereas rvector specifies real vectors. The vector and matrix types are defined as dynamic arrays and can be used with arbitrary index ranges.
A large number of operators are predefined for these types in the arithmetic modules of PASCAL-XSC (see section 2.8). All of these operators deliver results with maximum accuracy. In Table 1 the 29 predefined standard operators of PASCAL-XSC are listed according to priority.

  Type            Operators                                  Priority
  monadic         monadic +, monadic -, not                  3 (highest)
  multiplicative  *, /, div, mod, and, *<, *>, /<, />, **    2
  additive        +, -, or, +<, +>, -<, ->, +*               1
  relational      =, <>, <, <=, >, >=, in, ><                0 (lowest)

Table 1: The predefined operators of PASCAL-XSC

Here the operators o< and o>, o ∈ {+, -, *, /}, denote operations with downwardly and upwardly directed rounding, and the operators **, +*, and >< are needed in interval computations for the intersection, the convex hull, and the disconnectivity test. Tables 2 and 3 show all predefined arithmetic and relational operators in connection with the possible combinations of operand types.
[Tables 2 and 3, which list the predefined arithmetic and relational operators for all combinations of the operand types integer, real, complex, interval, cinterval and the corresponding vector and matrix types, are not reproduced here.]
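The three interval-specific operators can be illustrated outside PASCAL-XSC as well. The following Python sketch (our illustration, with class and method names of our choosing) models intersection (**), convex hull (+*), and the disconnectivity test (><) on closed intervals:

```python
from dataclasses import dataclass

@dataclass
class Interval:
    inf: float
    sup: float

    def intersect(self, other):      # corresponds to PASCAL-XSC operator **
        lo, hi = max(self.inf, other.inf), min(self.sup, other.sup)
        if lo > hi:
            raise ValueError("empty intersection")
        return Interval(lo, hi)

    def hull(self, other):           # corresponds to +* (convex hull)
        return Interval(min(self.inf, other.inf), max(self.sup, other.sup))

    def disjoint(self, other):       # corresponds to >< (disconnectivity test)
        return self.sup < other.inf or other.sup < self.inf

a, b = Interval(1.0, 3.0), Interval(2.0, 5.0)
print(a.intersect(b))    # Interval(inf=2.0, sup=3.0)
print(a.hull(b))         # Interval(inf=1.0, sup=5.0)
print(a.disjoint(b))     # False
```

Note that, unlike these set operations, interval arithmetic operations such as + would additionally require the directed roundings discussed above.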
2.2 The Operator Concept

In a language without user-defined operators, a multiple addition of intervals, z = a + b + c + d, must be expressed via procedure calls, e.g. with the declaration

procedure intadd (a,b: interval; var z: interval);
begin
  z.inf := a.inf +< b.inf;
  z.sup := a.sup +> b.sup
end;

  mathematical notation:            z := a + b + c + d
  corresponding program statements: intadd(a,b,z); intadd(z,c,z); intadd(z,d,z);
or a function declaration (only possible in PASCAL-XSC, not in standard PASCAL):

function intadd (a,b: interval): interval;
begin
  intadd.inf := a.inf +< b.inf;
  intadd.sup := a.sup +> b.sup
end;
  mathematical notation:           z := a + b + c + d
  corresponding program statement: z := intadd(intadd(intadd(a,b),c),d);
In both cases the description of the mathematical formulas looks rather complicated. By comparison, if one implements an operator in PASCAL-XSC
operator + (a,b: interval) intadd: interval;
begin
  intadd.inf := a.inf +< b.inf;
  intadd.sup := a.sup +> b.sup
end;
  mathematical notation:           z := a + b + c + d
  corresponding program statement: z := a + b + c + d;
then a multiple addition of intervals is described in the traditional mathematical notation. Besides the possibility of overloading operator symbols, one may also use named operators. Such operators must be preceded by a priority declaration. There exist four different levels of priority, each represented by its own symbol:

  monadic        : ↑   level 3 (highest priority)
  multiplicative : *   level 2
  additive       : +   level 1
  relational     : =   level 0
For example, an operator for the calculation of the binomial coefficient (n choose k) can be defined in the following manner:

priority choose = *;   {priority declaration}

operator choose (n,k: integer) binomial: integer;
var i,r : integer;
begin
  if k > n div 2 then k := n - k;
  r := 1;
  for i := 1 to k do
    r := r * (n - i + 1) div i;
  binomial := r;
end;

  mathematical notation:           c := (n choose k)
  corresponding program statement: c := n choose k
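The integer algorithm used by choose can be transcribed and checked in Python; the division in each step is exact, since after step i the accumulator holds the integer binomial(n, i). This is our transcription, not code from the paper:

```python
def choose(n: int, k: int) -> int:
    """Binomial coefficient by the same scheme as the PASCAL-XSC operator:
    multiply by (n-i+1) and divide exactly by i at every step."""
    if k > n // 2:
        k = n - k                   # exploit the symmetry C(n,k) = C(n,n-k)
    r = 1
    for i in range(1, k + 1):
        r = r * (n - i + 1) // i    # exact: r*(n-i+1) is divisible by i
    return r

print(choose(6, 2))    # 15
print(choose(10, 5))   # 252
```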
The operator concept realized in PASCAL-XSC offers the possibilities of
• defining an arbitrary number of operators,
• overloading operator symbols or operator names arbitrarily many times, and
• implementing recursively defined operators.
The identification of the suitable operator depends on both the number and the type of the operands according to the following weighting-rule:
If the actual list of parameters matches the formal list of parameters of two different operators, then the operator chosen is the one with the first "better matching" parameter. "Better matching" means that the types of the operands must be consistent and not only conforming.
Example:

operator +* (a: integer; b: real) irres: real;
operator +* (a: real; b: integer) rires: real;

var x    : integer;
    y, z : real;

z := x +* y;    { 1. operator }
z := y +* x;    { 2. operator }
z := x +* x;    { 1. operator }
z := y +* y;    { impossible ! }
Also, PASCAL-XSC offers the possibility to overload the assignment operator :=. Due to this, the mathematical notation may also be used for assignments:
Example:

var c : complex;
    r : real;

operator := (var c: complex; r: real);
begin
  c.re := r;
  c.im := 0;
end;

r := 1.5;
c := r;   {complex number with real part 1.5 and imaginary part 0}
2.3 Overloading of Subroutines
Standard PASCAL provides the mathematical standard functions sin, cos, arctan, exp, ln, sqr, and sqrt
for numbers of type real only. In order to implement the sine function for interval arguments, a new function name like isin(...) must be used, because the overloading of the standard function name sin is not allowed in standard PASCAL. By contrast, PASCAL-XSC allows overloading of function and procedure names, whereby a generic symbol concept is introduced into the language. So the symbols sin, cos, arctan, exp, ln, sqr, and sqrt can be used not only for numbers of type real, but also for intervals, complex numbers, and other mathematical spaces. To distinguish between overloaded functions or procedures with the same name, the number, type, and weighting of their arguments are used, similar to the method for operators. The type of the result, however, is not used.
Example:

procedure rotate (var a,b: real);
procedure rotate (var a,b,c: complex);
procedure rotate (var a,b,c: interval);

The overloading concept also applies to the standard procedures read and write in a slightly modified way. The first parameter of a newly declared input/output procedure must be a var-parameter of file type, and the second parameter represents the quantity that is to be input or output. All following parameters are interpreted as format specifications.
Example:

procedure write (var f: text; c: complex; w: integer);
begin
  write (f, '(', c.re : w, ',', c.im : w, ')');
end;

When calling an overloaded input/output procedure, the file parameter may be omitted, corresponding to a call with the standard files input or output. The format parameters must be introduced and separated by colons. Moreover, several input or output statements can be combined into a single statement, just as in standard PASCAL.
Example:

var r: real;
    c: complex;

write (r : 10, c : 5, r/5);
2.4 The Module Concept
Standard PASCAL basically assumes that a program consists of a single program text which must be prepared completely before it can be compiled and executed. In many cases, it is more convenient to prepare a program in several parts, called modules, which can then be developed and compiled independently of each other. Moreover, several other programs may use the components of a module without their being copied into the source code and recompiled. For this purpose, a module concept has been introduced in PASCAL-XSC. This new concept offers the possibilities of
• modular programming,
• syntax check and semantic analysis beyond the bounds of modules, and
• implementation of arithmetic packages as standard modules.

Three new keywords have been added to the language:

  module : starts a new module
  global : indicates items to be passed to the outside
  use    : indicates imported modules
A module is introduced by the keyword module followed by a name and a semicolon. The body is built up quite similarly to that of a normal program with the exception that the word symbol global can be used directly in front of the keywords const, type, var, procedure, function, and operator and directly after use and the equality sign in type declarations. Thus it is possible to declare private types as well as non-private types. The structure of a private type is not known outside the declaration module and can only be influenced by subroutine calls. If, for example, the internal structure as well as the name of a type is to be made global, then the word symbol global must be repeated after the equality sign. By means of the declaration
global type complex = global record re, im : real end;

the type complex and its internal structure as a record with components re and im is made global.
A private type complex could be declared by

global type complex = record re, im: real end;

The user who has imported a module with this private definition cannot refer to the record components, because the structure of the type is hidden inside the module.
A module is built up according to the following pattern:

module m1;
use < other modules >;
< global and local declarations >
begin
  < initialization of the module >
end.

For importing modules with use or use global, the following transitivity rules hold:
  M1 use M2  and  M2 use global M3  ⇒  M1 use M3,

but

  M1 use M2  and  M2 use M3  ⇏  M1 use M3.
Example: Let a module hierarchy be built up by a main program importing the modules A, B, and C, where A imports X and Y, and B and C import STANDARDS. [The original diagram of this hierarchy is not reproduced here.] All global objects of the modules A, B, and C are visible in the main program unit, but there is no access to the global objects of X, Y, and STANDARDS. There are two possibilities to make them visible in the main program, too:
1. to write use X, Y, STANDARDS in the main program, or
2. to write use global X, Y in module A and use global STANDARDS in module B or C.
2.5 Dynamic Arrays
In standard PASCAL there is no way to declare dynamic types or variables. For instance, program packages with vector and matrix operations can be implemented with only fixed (maximum) dimension. For this reason, only a part of the allocated memory is used if the user wants to solve problems of lower dimension. The concept of dynamic arrays removes this limitation. In particular, the new concept can be described by the following characteristics:
• dynamics within procedures and functions,
• automatic allocation and deallocation of local dynamic variables,
• economical employment of storage space,
• row access and column access to dynamic arrays,
• compatibility of static and dynamic arrays.
Dynamic arrays must be marked with the word symbol dynamic. The great disadvantage of the conformant array schemes available in standard PASCAL is that they can only be used for parameters and not for variables or function results. So, this standard feature is not fully dynamic. In PASCAL-XSC, dynamic and static arrays can be used in the same manner. At the moment, dynamic arrays may not be components of other data structures. The syntactical meaning of this is that the word symbol dynamic may only be used directly following the equality sign in a type definition or directly following the colon in a variable declaration. For instance, dynamic arrays may not be record components.
A two-dimensional array type can be declared in the following manner:

type matrix = dynamic array[*,*] of real;

It is also possible to define different dynamic types with corresponding syntactical structures. For example, it might be useful in some situations to identify the coefficients of a polynomial with the components of a vector or vice versa. Since PASCAL is strictly a type-oriented language, such structurally equivalent arrays may only be combined if their types have been previously adapted. The following example shows the definition of a polynomial and of a vector type (note that the type adaptation functions polynomial(...) and vector(...) are defined implicitly):
type vector = dynamic array[*] of real;
type polynomial = dynamic array[*] of real;

operator + (a,b: vector) res: vector[lbound(a)..ubound(a)];

var v : vector[1..n];
    p : polynomial[0..n-1];

v := vector(p);
p := polynomial(v);
v := v + v;
v := vector(p) + v;   { but not v := p + v; }
Access to the lower and upper index limits is made possible by the new standard functions lbound(...) and ubound(...), which are available with an optional argument for the index field of the designated dynamic variable. Employing these functions, the operator mentioned above can be written as
operator + (a,b: vector) res: vector[lbound(a)..ubound(a)];
var i : integer;
begin
  for i := lbound(a) to ubound(a) do
    res[i] := a[i] + b[lbound(b) + i - lbound(a)]
end;

Introduction of dynamic types requires an extension of the compatibility prerequisites. Just as in standard PASCAL, two array types are not compatible unless they are of the same type. Consequently, a dynamic array type is not compatible with a static type. In PASCAL-XSC, value assignments are always possible in the cases listed in Table 5.
  Type of Left Side    Type of Right Side     Assignment Permitted
  anonymous dynamic    arbitrary array type   if structurally equivalent
  known dynamic        known dynamic          if types are the same
  anonymous static     arbitrary array type   if structurally equivalent
  known static         known static           if types are the same

Table 5: Assignment Compatibilities

In the remaining cases, an assignment is possible only for an equivalent qualification of the right side (see [20] or [21] for details). In addition to access to each component variable, PASCAL-XSC offers the possibility of access to entire subarrays. If a component variable contains an * instead of an index expression, it refers to the subarray with the entire index range in the corresponding dimension; e.g., via m[*,j] the j-th column of a two-dimensional array m is accessed. The following example demonstrates access to rows or columns of dynamic arrays:
type vector = dynamic array[*] of real;
type matrix = dynamic array[*] of vector;

var v : vector[1..n];
    m : matrix[1..n,1..n];

v := m[i];
m[i] := vector(m[*,j]);

In the first assignment it is not necessary to use a type adaptation function, since both the left and the right side are of known dynamic type. A different case is demonstrated in the second assignment. The left-hand side is of known dynamic type, but the right-hand side is of anonymous dynamic type, so it is necessary to use the intrinsic adaptation function vector(...).
A PASCAL-XSC program which uses dynamic arrays should be built up according to the following scheme:
program dynprog (input, output);
type vector = dynamic array[*] of real;
< different dynamic declarations >
var n : integer;

procedure main (dim: integer);
var a,b,c : vector[1..dim];
begin
  < I/O depending on the value of dim >
  c := a + b;
end;

begin
  read(n);
  main(n)
end.

It is necessary to frame only the original main program by a procedure (here: main), which is invoked with the dimension of the dynamic arrays as a transfer parameter.
2.6 Accurate Expressions

The implementation of enclosure algorithms with automatic result verification or validation (see [17], [24], [28], [33]) makes extensive use of the accurate evaluation of dot products with the property (see [25])

    x · y = ◽( x1*y1 + x2*y2 + ... + xn*yn ),

i.e. the dot product of two floating-point vectors x and y is computed exactly and subjected to only a single rounding ◽.
To evaluate this kind of expression, the new datatype dotprecision was introduced. This datatype accommodates the full floating-point range with double exponents (see [25], [24]). Based upon this type, so-called accurate expressions (#-expressions) can be formulated: an accurate symbol (#, #*, #<, #>, or ##) followed by an exact expression enclosed in parentheses. The exact expression must have the form of a dot product expression and is evaluated without any rounding error. The following standard operations are available for dotprecision:

- conversion of real and integer values to dotprecision (#)
- rounding of dotprecision values to real; in particular: rounding to the nearest (#*), downwardly directed rounding (#<), and upwardly directed rounding (#>)
- rounding of a dotprecision expression to the smallest enclosing interval (##)
- addition of a real number or the product of two real numbers to a variable of type dotprecision
- addition of a dot product to a variable of type dotprecision
- addition and subtraction of dotprecision numbers
- monadic minus of a dotprecision number
- the standard function sign, which returns -1, 0, or +1 depending on the sign of the dotprecision number
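Outside PASCAL-XSC, the semantics of dotprecision can be imitated with exact rational arithmetic. The following Python sketch is an illustrative model only (the real implementation uses a fixed-point long accumulator, not rationals): sums of products are accumulated without error and rounded just once at the end, including the ##-style rounding to the smallest enclosing interval.

```python
from fractions import Fraction
import math

class DotPrecision:
    """Illustrative model of dotprecision: sums of floating-point
    products are accumulated exactly and rounded only once."""
    def __init__(self, value=0.0):
        self.acc = Fraction(value)             # a float converts exactly
    def add(self, x, y=1.0):
        self.acc += Fraction(x) * Fraction(y)  # exact, no rounding error
        return self
    def add_dot(self, xs, ys):
        for x, y in zip(xs, ys):
            self.add(x, y)                     # accurate dot product
        return self
    def to_nearest(self):                      # rounding #*
        return float(self.acc)                 # correctly rounded
    def to_interval(self):                     # rounding ##
        r = float(self.acc)
        if Fraction(r) == self.acc:
            return (r, r)                      # exactly representable
        lo = r if Fraction(r) < self.acc else math.nextafter(r, -math.inf)
        return (lo, math.nextafter(lo, math.inf))  # one ulp wide

# 1e16 + 1 - 1e16 cancels to 0.0 in plain floats, but the exact
# accumulator keeps every bit and rounds only at the very end:
print(DotPrecision().add(1e16).add(1.0).add(-1e16).to_nearest())  # 1.0
```

The contrast with naive summation, where 1e16 + 1.0 already loses the 1.0, shows why a single final rounding is the essential property.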
To obtain the unrounded or correctly rounded result of a dot product expression, the user parenthesizes the expression and precedes it by the symbol #, which may optionally be followed by a symbol for the rounding mode. Table 6 shows the possible rounding modes with respect to the dot product expression form (see the appendix for details).
  Symbol | Expression Form           | Rounding Mode
  -------|---------------------------|----------------------------
  #*     | scalar, vector, or matrix | nearest
  #<     | scalar, vector, or matrix | downwards
  #>     | scalar, vector, or matrix | upwards
  ##     | scalar, vector, or matrix | smallest enclosing interval
  #      | scalar only               | exact, no rounding

Table 6: Rounding Modes for Accurate Expressions

In practice, dot product expressions may contain a large number of terms, which makes an explicit notation very cumbersome. To alleviate this difficulty in mathematics, the symbol Σ is used. If, for instance, A and B are n-dimensional matrices, then

    Σ (k = 1 to n) A[i,k] · B[k,j]

represents a dot product expression. PASCAL-XSC provides the equivalent shorthand notation sum for this purpose. The corresponding PASCAL-XSC statement for this expression is

    D := #(for k:=1 to n sum (A[i,k]*B[k,j]))

where D is a dotprecision variable.

Dot product expressions or accurate expressions are used mainly in computing a defect (or residual). In the case of a linear system Ax = b with A ∈ R^(n×n) and x, b ∈ R^n, an approximation y with Ay ≈ b is considered. Then an enclosure of the defect is given by ◇(b - Ay), which in PASCAL-XSC can be realized by means of

    ##(b - A*y);

then there is only one interval rounding operation per component. To get verified enclosures for linear systems of equations it is necessary to evaluate the defect expression

    ◇(E - RA)

where R ≈ A⁻¹ and E is the identity matrix. In PASCAL-XSC this expression can be programmed as

    ##(id(A) - R*A);

where an interval matrix is computed with only one rounding operation per component. The function id(...) is part of the module for real matrix/vector arithmetic; it generates an identity matrix of appropriate dimension according to the shape of A (see section 2.8).
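The point of ##(b - A*y) — exact componentwise accumulation followed by a single outward rounding — can be illustrated in Python. This is a model using rational arithmetic, not PASCAL-XSC, but the enclosure it produces per component is exactly one interval rounding wide:

```python
from fractions import Fraction
import math

def residual_enclosure(A, y, b):
    """Enclosure of the defect b - A*y: each component is accumulated
    exactly and rounded outward only once, mirroring ##(b - A*y)."""
    n = len(b)
    enc = []
    for i in range(n):
        exact = Fraction(b[i]) - sum(Fraction(A[i][k]) * Fraction(y[k])
                                     for k in range(n))
        r = float(exact)                       # nearest float
        if Fraction(r) == exact:
            enc.append((r, r))                 # defect exactly representable
        else:                                  # smallest enclosing interval:
            lo = r if Fraction(r) < exact else math.nextafter(r, -math.inf)
            enc.append((lo, math.nextafter(lo, math.inf)))
    return enc

# Residual of the approximation y = (2/3, 2/3) for A = [[1, 0.5], [0.5, 1]],
# b = (1, 1); every rounding error of y shows up exactly in the defect:
A = [[1.0, 0.5], [0.5, 1.0]]
y = [2.0 / 3.0, 2.0 / 3.0]
print(residual_enclosure(A, y, b=[1.0, 1.0]))
```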
2.7 The String Concept

The tools provided for handling strings in standard PASCAL do not enable convenient text processing. For this reason, a string concept was integrated into the language definition of PASCAL-XSC which admits comfortable handling of textual information and even symbolic computation. With this new data type string, the user can work with strings of up to 255 characters. In the declaration part, the user can specify a maximum string length less than 255. Thus a string s declared by

    var s : string[40];

can be up to 40 characters long. The following standard operations are available:

- concatenation:
    operator + (a, b : string) conc : string;
- actual length:
    function length(s : string) : integer;
- conversion string -> real:
    function rval(s : string) : real;
- conversion string -> integer:
    function ival(s : string) : integer;
- conversion real -> string:
    function image(r : real; width, fracs, round : integer) : string;
- conversion integer -> string:
    function image(i, len : integer) : string;
- extraction of substrings:
    function substring(s : string; i, j : integer) : string;
- position of first appearance:
    function pos(sub, s : string) : integer;
- relational operators: =, <>, <, <=, >, >=, and in
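For comparison, here are rough Python analogues of these string operations. They are illustrative only: the names simply follow the list above, image drops the rounding-mode parameter, and Python's own string type replaces the length-bounded PASCAL-XSC string.

```python
# Rough Python analogues of the PASCAL-XSC string operations listed above.
def rval(s):  return float(s)                 # string -> real
def ival(s):  return int(s)                   # string -> integer
def image(r, width, fracs):                   # real -> string (no round mode)
    return f"{r:{width}.{fracs}f}"
def substring(s, i, j):                       # PASCAL-XSC indexes from 1
    return s[i - 1:j]
def pos(sub, s):                              # 0 means "not found"
    return s.find(sub) + 1

s = "PASCAL" + "-XSC"                         # concatenation via +
print(len(s), pos("XSC", s), substring(s, 1, 6))
```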
2.8 Standard Modules
The following standard modules are available:

- interval arithmetic (I_ARI)
- complex arithmetic (C_ARI)
- complex interval arithmetic (CI_ARI)
- real matrix/vector arithmetic (MV_ARI)
- interval matrix/vector arithmetic (MVI_ARI)
- complex matrix/vector arithmetic (MVC_ARI)
- complex interval matrix/vector arithmetic (MVCI_ARI)
These modules may be incorporated via the use-statement described in section 2.4. As an example, Table 7 exhibits the operators provided by the module for interval matrix/vector arithmetic.

Table 7: Predefined Arithmetical and Relational Operators of the Module MVI_ARI

In addition to these operators, the module MVI_ARI provides the following generically named standard operators, functions, and procedures: intval, inf, sup, diam, mid, blow, transp, null, id, read, and write. The function intval is used to generate interval vectors and matrices, whereas inf and sup are selection functions for the infimum and supremum of an interval object. The diameter and the midpoint of interval vectors and matrices can be computed by diam and mid, blow yields an interval inflation, and transp delivers the transpose of a matrix.
Zero vectors and matrices are generated by the function null, while id returns an identity matrix of appropriate shape. Finally, there are the generic input/output procedures read and write, which may be used in connection with all matrix/vector data types defined in the modules mentioned above.
2.9 Problem-Solving Routines
PASCAL-XSC routines for solving common numerical problems have been implemented. The applied methods compute a highly accurate enclosure of the true solution of the problem and, at the same time, prove the existence and the uniqueness of the solution in the given interval. The advantages of these new routines are the following:

- The solution is computed with maximum or high, but always controlled, accuracy, even in many ill-conditioned cases.
- The correctness of the result is automatically verified, i.e. an enclosing set is computed which guarantees the existence and uniqueness of the exact solution contained in this set.
- In case no solution exists or the problem is extremely ill-conditioned, an error message is indicated.
In particular, the PASCAL-XSC routines cover the following subjects:

- linear systems of equations
  - full systems (real, complex, interval, cinterval)
  - matrix inversion (real, complex, interval, cinterval)
  - least squares problems (real, complex, interval, cinterval)
  - computation of pseudo inverses (real, complex, interval, cinterval)
  - band matrices (real)
  - sparse matrices (real)
- polynomial evaluation
  - in one variable (real, complex, interval, cinterval)
  - in several variables (real)
- zeros of polynomials (real, complex, interval, cinterval)
- eigenvalues and eigenvectors
  - symmetric matrices (real)
  - arbitrary matrices (real, complex, interval, cinterval)
- initial and boundary value problems of ordinary differential equations
  - linear
  - nonlinear
- evaluation of arithmetic expressions
- nonlinear systems of equations
- numerical quadrature
- integral equations
- automatic differentiation
- optimization

3 The Implementation of PASCAL-XSC
Since 1976, a PASCAL extension for scientific computation has been in the process of being defined and developed at the Institute for Applied Mathematics at the University of Karlsruhe. The PASCAL-SC compiler has been implemented on several computers (Z80, 8088, and 68000 processors) under various operating systems. This compiler has already been on the market for the IBM PC/AT and the ATARI ST (see [22], [23]). The new PASCAL-XSC compiler is now available for personal computers, workstations, mainframes, and supercomputers by means of an implementation in C. Via a PASCAL-XSC-to-C precompiler and a runtime system implemented in C, the language PASCAL-XSC may be used in an almost identical way on all UNIX systems, among others. Thus, the user can develop programs on a personal computer, for example, and afterwards get them running on a mainframe via the same compiler.
A complete description of the language PASCAL-XSC and the arithmetic modules as well as a collection of sample programs is given in [20] and [21].
4 PASCAL-XSC Sample Program
In the following, a complete PASCAL-XSC program is listed which demonstrates the use of some of the arithmetic modules. Employing the module LIN_SOLV, the solution of a system of linear equations is enclosed in an interval vector by successive interval iterations. The procedure main, which is called in the body of lin_sys, is used only for reading the dimension of the system and for allocating the dynamic variables. The numerical method itself is started by the call of the procedure linear_system_solver defined in the module LIN_SOLV. This procedure may be called with arbitrary dimensions of the employed arrays.
For detailed information on iteration methods with automatic result verification see [17], [24], [28], or [32], for example.
Main Program

program lin_sys (input, output);
  { Program for the verified solution of a linear system of equations.  }
  { The matrix A and the right-hand side b of the system are to be read }
  { in. The program delivers either a verified solution or a            }
  { corresponding failure message.                                      }

use
  lin_solv,    { lin_solv : linear system solver              }
  mv_ari,      { mv_ari   : matrix/vector arithmetic          }
  mvi_ari;     { mvi_ari  : matrix/vector interval arithmetic }

var
  n : integer;

procedure main (n : integer);
  { The matrix A and the vectors b, x are allocated dynamically when this }
  { subroutine is called. The matrix A and the right-hand side b are read }
  { in, and linear_system_solver is called.                               }
  var
    ok : boolean;
    b  : rvector[1..n];
    x  : ivector[1..n];
    A  : rmatrix[1..n,1..n];
  begin
    writeln('Please enter the matrix A:');
    read(A);
    writeln('Please enter the right-hand side b:');
    read(b);
    linear_system_solver(A, b, x, ok);
    if ok then
      begin
        writeln('The given matrix A is non-singular and the solution');
        writeln('of the linear system is contained in:');
        write(x);
      end
    else
      writeln('No solution found !');
  end; {procedure main}

begin
  write('Please enter the dimension n of the linear system: ');
  read(n);
  main(n);
end. {program lin_sys}
Module LIN_SOLV

module lin_solv;
  { Verified solution of the linear system of equations Ax = b. }

use
  i_ari,      { i_ari   : interval arithmetic               }
  mv_ari,     { mv_ari  : matrix/vector arithmetic          }
  mvi_ari;    { mvi_ari : matrix/vector interval arithmetic }

priority
  inflated = *;    { priority level 2 }

operator inflated (a : ivector; eps : real) infl : ivector[1..ubound(a)];
  { Computes the so-called epsilon inflation of an interval vector. }
  var
    i : integer;
    x : interval;
  begin
    for i := 1 to ubound(a) do
      begin
        x := a[i];
        if (diam(x) <> 0) then
          a[i] := (1+eps)*x - eps*x
        else
          a[i] := intval( pred(inf(x)), succ(sup(x)) );
      end; {for}
    infl := a;
  end; {operator inflated}
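The inflated operator translates almost line by line into other languages. In this Python sketch (an illustration, not PASCAL-XSC), intervals are (inf, sup) pairs and math.nextafter stands in for pred/succ:

```python
import math

def inflated(a, eps):
    """Epsilon inflation of an interval vector, given as a list of
    (inf, sup) pairs. A nondegenerate interval is widened by eps times
    its diameter on each side, via (1+eps)*x - eps*x evaluated in
    interval arithmetic; a point interval is widened by one ulp on
    each side, as pred/succ do in the PASCAL-XSC operator."""
    result = []
    for lo, hi in a:
        if hi - lo != 0:
            result.append(((1 + eps) * lo - eps * hi,
                           (1 + eps) * hi - eps * lo))
        else:
            result.append((math.nextafter(lo, -math.inf),
                           math.nextafter(hi, math.inf)))
    return result

print(inflated([(1.0, 2.0), (3.0, 3.0)], 0.25))
```

For [1, 2] with eps = 0.25 this yields [0.75, 2.25]: the diameter 1 is added on each side scaled by eps, exactly the interval evaluation of (1+eps)*x - eps*x.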
function approximate_inverse (A : rmatrix)
                  : rmatrix[1..ubound(A),1..ubound(A)];
  { Computation of an approximate inverse of the (n,n)-matrix A }
  { by application of the Gaussian elimination method.          }
  var
    i, j, k, n : integer;
    factor     : real;
    R, Inv, E  : rmatrix[1..ubound(A),1..ubound(A)];
  begin
    n := ubound(A);      { dimension of A }
    E := id(E);          { identity matrix }
    R := A;

    { Gaussian elimination step with unit vectors as     }
    { right-hand sides. Division by R[i,i]=0 indicates   }
    { a probably singular matrix A.                      }
    for i := 1 to n do
      for j := (i+1) to n do
        begin
          factor := R[j,i]/R[i,i];
          for k := i to n do
            R[j,k] := #*(R[j,k] - factor*R[i,k]);
          E[j] := E[j] - factor*E[i];
        end; {for j := ...}

    { Backward substitution delivers the rows of the inverse of A. }
    for i := n downto 1 do
      Inv[i] := #*(E[i] - for k := (i+1) to n sum(R[i,k]*Inv[k]))/R[i,i];

    approximate_inverse := Inv;
  end; {function approximate_inverse}
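The same elimination scheme can be sketched in plain Python. Note that ordinary floating-point arithmetic stands in for the #*-accumulated dot products here, so this is only an approximate inverse (which is all the verification algorithm requires):

```python
def approximate_inverse(A):
    """Approximate inverse of an n-by-n matrix by Gaussian elimination
    with unit vectors as right-hand sides (no pivoting, as in the
    listing above; R[i][i] == 0 signals a probably singular A)."""
    n = len(A)
    R = [row[:] for row in A]                                  # working copy
    E = [[float(i == j) for j in range(n)] for i in range(n)]  # identity

    # Forward elimination, applied to R and to all right-hand sides at once.
    for i in range(n):
        if R[i][i] == 0.0:
            raise ZeroDivisionError("matrix is probably singular")
        for j in range(i + 1, n):
            factor = R[j][i] / R[i][i]
            for k in range(i, n):
                R[j][k] -= factor * R[i][k]
            for k in range(n):
                E[j][k] -= factor * E[i][k]

    # Backward substitution delivers the rows of the inverse.
    Inv = [[0.0] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for c in range(n):
            s = E[i][c] - sum(R[i][k] * Inv[k][c] for k in range(i + 1, n))
            Inv[i][c] = s / R[i][i]
    return Inv

print(approximate_inverse([[4.0, 2.0], [1.0, 3.0]]))
```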
global procedure linear_system_solver (A : rmatrix; b : rvector;
                                       var x : ivector; var ok : boolean);
  { Computation of a verified enclosure vector for the solution of the }
  { linear system of equations. If an enclosure is not achieved after  }
  { a certain number of iteration steps, the algorithm is stopped and  }
  { the parameter ok is set to false.                                  }
  const
    epsilon   = 0.25;   { Constant for the epsilon inflation }
    max_steps = 10;     { Maximum number of iteration steps  }
  var
    i    : integer;
    y, z : ivector[1..ubound(A)];
    R    : rmatrix[1..ubound(A),1..ubound(A)];
    C    : imatrix[1..ubound(A),1..ubound(A)];
  begin
    R := approximate_inverse(A);

    { R*b is an approximate solution of the linear system, and z is an }
    { enclosure of this vector. However, it does not usually enclose   }
    { the true solution.                                               }
    z := ##(R*b);

    { An enclosure of I - R*A is computed with maximum accuracy. The   }
    { (n,n) identity matrix is generated by the function call id(A).   }
    C := ##(id(A) - R*A);

    x := z;
    i := 0;
    repeat
      i := i + 1;
      y := x inflated epsilon;  { To obtain a true enclosure, the      }
                                { interval vector x is slightly        }
                                { enlarged.                            }
      x := z + C*y;             { The new iterate is computed.         }
      ok := x in y;             { Is x contained in the interior of y? }
    until ok or (i = max_steps);
  end; {procedure linear_system_solver}

end. {module lin_solv}
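The verification loop can be imitated end to end for the one-dimensional case a*x = b. The following Python sketch illustrates only the logic of the iteration, not the PASCAL-XSC runtime: it uses a crude outward rounding (widening by one ulp via math.nextafter instead of true directed roundings), and verify_1d is a hypothetical name for this 1-D analogue of linear_system_solver.

```python
import math

def out(lo, hi):
    """Crude outward rounding: widen both bounds by one ulp. (A real
    implementation would switch the rounding direction instead.)"""
    return (math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

def iadd(x, y):                     # interval addition, rounded outward
    return out(x[0] + y[0], x[1] + y[1])

def imul(x, y):                     # interval multiplication, rounded outward
    p = (x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1])
    return out(min(p), max(p))

def inflate(x, eps):                # epsilon inflation, strictly widening
    lo = (1 + eps) * x[0] - eps * x[1]
    hi = (1 + eps) * x[1] - eps * x[0]
    if lo >= x[0]: lo = math.nextafter(x[0], -math.inf)
    if hi <= x[1]: hi = math.nextafter(x[1], math.inf)
    return (lo, hi)

def inside(x, y):                   # x contained in the interior of y?
    return y[0] < x[0] and x[1] < y[1]

def verify_1d(a, b, eps=0.25, max_steps=10):
    """1-D analogue of linear_system_solver for a*x = b: R plays the
    role of the approximate inverse, C encloses 1 - R*a, and the
    iterate x := z + C*y is repeated until x lies in the interior of
    the inflated iterate y."""
    R = 1.0 / a
    z = imul((R, R), (b, b))        # enclosure of the approximation R*b
    t = 1.0 - R * a
    C = out(*out(t, t))             # enclosure of 1 - R*a (crudely widened)
    x, ok, i = z, False, 0
    while not ok and i < max_steps:
        i += 1
        y = inflate(x, eps)
        x = iadd(z, imul(C, y))     # the new iterate
        ok = inside(x, y)           # verified if x is inside y
    return x, ok

enc, ok = verify_1d(3.0, 1.0)
print(enc, ok)
```

For a = 3, b = 1 the returned interval is a few ulps wide around 1/3 and ok is true: containment of the new iterate in the interior of the inflated one proves that a solution exists and lies in the enclosure.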
Appendix

Review of Real and Complex #-Expressions

Syntax: #-Symbol ( Exact Expression )

  #-Symbol   | Result Type  | Summands Permitted in the Exact Expression
  -----------|--------------|--------------------------------------------
  #          | dotprecision | variables, constants, and special function calls of type integer, real, or dotprecision; products of type integer or real; scalar products of type real
  #* #< #>   | real         | variables, constants, and special function calls of type integer, real, or dotprecision; products of type integer or real; scalar products of type real
             | complex      | variables, constants, and special function calls of type integer, real, complex, or dotprecision; products of type integer, real, or complex; scalar products of type real or complex
             | rvector      | variables and special function calls of type rvector; products of type rvector (e.g. rmatrix * rvector, real * rvector, etc.)
             | cvector      | variables and special function calls of type rvector or cvector; products of type rvector or cvector (e.g. cmatrix * rvector, real * cvector, etc.)
             | rmatrix      | variables and special function calls of type rmatrix; products of type rmatrix
             | cmatrix      | variables and special function calls of type rmatrix or cmatrix; products of type rmatrix or cmatrix
Review of Real and Complex Interval #-Expressions

Syntax: ## ( Exact Expression )

  #-Symbol | Result Type | Summands Permitted in the Exact Expression
  ---------|-------------|-------------------------------------------
  ##       | interval    | variables, constants, and special function calls of type integer, real, interval, or dotprecision; products of type integer, real, or interval; scalar products of type real or interval
           | cinterval   | variables, constants, and special function calls of type integer, real, complex, interval, cinterval, or dotprecision; products of type integer, real, complex, interval, or cinterval; scalar products of type real, complex, interval, or cinterval
           | ivector     | variables and special function calls of type rvector or ivector; products of type rvector or ivector
           | civector    | variables and special function calls of type rvector, cvector, ivector, or civector; products of type rvector, cvector, ivector, or civector
           | imatrix     | variables and special function calls of type rmatrix or imatrix; products of type rmatrix or imatrix
           | cimatrix    | variables and special function calls of type rmatrix, cmatrix, imatrix, or cimatrix; products of type rmatrix, cmatrix, imatrix, or cimatrix
References

[1] Allendorfer, U., Shiriaev, D.: PASCAL-XSC to C - A Portable PASCAL-XSC Compiler. In: [18], 91-104, 1991.
[2] Allendorfer, U., Shiriaev, D.: PASCAL-XSC - A Portable Development System. In: [9], 1992.
[3] American National Standards Institute / Institute of Electrical and Electronics Engineers: A Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std. 754-1985, New York, 1985.
[4] Bleher, J. H., Rump, S. M., Kulisch, U., Metzger, M., Ullrich, Ch., and Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, 93-110, 1987.
[5] Bohlender, G., Grüner, K., Kaucher, E., Klatte, R., Krämer, W., Kulisch, U., Rump, S., Ullrich, Ch., Wolff von Gudenberg, J., and Miranker, W.: PASCAL-SC: A PASCAL for Contemporary Scientific Computation. Research Report RC 9009, IBM Thomas J. Watson Research Center, Yorktown Heights, New York, 1981.
[6] Bohlender, G., Grüner, K., Kaucher, E., Klatte, R., Kulisch, U., Neaga, M., Ullrich, Ch., and Wolff von Gudenberg, J.: PASCAL-SC Language Definition. Internal Report of the Institute for Applied Mathematics, University of Karlsruhe, 1985.
[7] Bohlender, G., Rall, L., Ullrich, Ch., and Wolff von Gudenberg, J.: PASCAL-SC: A Computer Language for Scientific Computation. Academic Press, New York, 1987.
[8] Bohlender, G., Rall, L., Ullrich, Ch. und Wolff von Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibliographisches Institut, Mannheim, 1986.
[9] Brezinski, C. and Kulisch, U. (Eds.): Computational and Applied Mathematics - Algorithms and Theory. Proceedings of the 13th IMACS World Congress, Dublin, Ireland. Elsevier Science Publishers B.V. To be published in 1992.
[10] Buchholz, W.: The IBM System/370 Vector Architecture. IBM Systems Journal 25/1, 1986.
[11] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In: [18], 151-160, 1991.
[12] Däßler, K. und Sommer, M.: PASCAL, Einführung in die Sprache. Norm-Entwurf DIN 66256, Erläuterungen. Springer, Berlin, 1983.
[13] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In: [33], 1990.
[14] Hammer, R., Neaga, M., Ratz, D., Shiriaev, D.: PASCAL-XSC - A New Language for Scientific Computing. (In Russian), Interval Computations 2, St. Petersburg, 1991.
[15] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). General Information Manual, GC 33-6163-02, 3rd Edition, 1986.
[16] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH). Program Description and User's Guide, SC 33-6164-02, 3rd Edition, 1986.
[17] Kaucher, E., Kulisch, U., and Ullrich, Ch. (Eds.): Computer Arithmetic - Scientific Computation and Programming Languages. Teubner, Stuttgart, 1987.
[18] Kaucher, E., Markov, S. M., Mayer, G. (Eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling. IMACS Annals on Computing and Applied Mathematics 12, J. C. Baltzer, Basel, 1991.
[19] Kirchner, R. and Kulisch, U.: Accurate Arithmetic for Vector Processors. Journal of Parallel and Distributed Computing 5, 250-270, 1988.
[20] Klatte, R., Kulisch, U., Neaga, M., Ratz, D. und Ullrich, Ch.: PASCAL-XSC: Sprachbeschreibung mit Beispielen. Springer, Heidelberg, 1991.
[21] Klatte, R., Kulisch, U., Neaga, M., Ratz, D. und Ullrich, Ch.: PASCAL-XSC: Language Reference with Examples. Springer, Heidelberg, 1992.
[22] Kulisch, U. (Ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation. Information Manual and Floppy Disks, Version ATARI ST. Teubner, Stuttgart, 1987.
[23] Kulisch, U. (Ed.): PASCAL-SC: A PASCAL Extension for Scientific Computation. Information Manual and Floppy Disks, Version IBM PC/AT (DOS). Teubner, Stuttgart, 1987.
[24] Kulisch, U. (Hrsg.): Wissenschaftliches Rechnen mit Ergebnisverifikation - Eine Einführung. Akademie Verlag, Ost-Berlin, and Vieweg, Wiesbaden, 1989.
[25] Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.
[26] Kulisch, U. and Miranker, W. L.: The Arithmetic of the Digital Computer: A New Approach. SIAM Review, Vol. 28, No. 1, 1986.
[27] Kulisch, U. and Miranker, W. L. (Eds.): A New Approach to Scientific Computation. Academic Press, New York, 1983.
[28] Kulisch, U. and Stetter, H. J. (Eds.): Scientific Computation with Automatic Result Verification. Computing Suppl. 6, Springer, Wien, 1988.
[29] Neaga, M.: Erweiterungen von Programmiersprachen für wissenschaftliches Rechnen und Erörterung einer Implementierung. Dissertation, Universität Kaiserslautern, 1984.
[30] Neaga, M.: PASCAL-SC - Eine PASCAL-Erweiterung für wissenschaftliches Rechnen. In: [24], 1989.
[31] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In: [33], 1990.
[32] Rump, S. M.: Solving Algebraic Problems with High Accuracy. In: [27], 1983.
[33] Ullrich, Ch. (Ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. J. C. Baltzer AG, Scientific Publishing Co., IMACS, 1990.
[34] Wolff von Gudenberg, J.: Einbettung allgemeiner Rechnerarithmetik in PASCAL mittels eines Operatorkonzeptes und Implementierung der Standardfunktionen mit optimaler Genauigkeit. Dissertation, Universität Karlsruhe, 1980.
ACRITH-XSC: A Fortran-like Language for Verified Scientific Computing

Wolfgang V. Walter

ACRITH-XSC is a Fortran-like programming language designed for the development of self-validating numerical algorithms. Such algorithms deliver results of high accuracy which are verified to be correct by the computer, so there is no need to perform an error analysis by hand for these calculations. For example, self-validating numerical techniques have been successfully applied to a variety of engineering problems in soil mechanics, optics of liquid crystals, ground-water modelling, and vibrational mechanics where conventional floating-point methods have failed.

With few exceptions, ACRITH-XSC is an extension of FORTRAN 77 [1]. Various language concepts which are available in ACRITH-XSC can also be found in a more or less similar form in Fortran 90 [13]. Other ACRITH-XSC features have been specifically designed for numerical purposes: numeric constant and data conversion and arithmetic operators with rounding control, interval and complex interval arithmetic, accurate vector/matrix arithmetic, an enlarged set of mathematical standard functions for point and interval arguments, and more. For a restricted class of expressions called "dot product expressions", ACRITH-XSC provides a special notation which guarantees that expressions of this type are evaluated with least-bit accuracy, i.e. there is no machine number between the computed result and the exact solution. The exact dot product is essential in many algorithms to attain high accuracy.

The main language features and numerical tools of ACRITH-XSC are presented and illustrated by some typical examples. Differences to Fortran 90 are noted where appropriate. A complete sample program for computing continuous bounds on the solution of an initial value problem is given at the end.
1 Development of ACRITH-XSC
The expressive and functional power of algorithmic programming languages has been continually enhanced since the 1950s. New powerful languages such as Ada, C++, and Fortran 90 have evolved over the past decade or so. The common programming languages attempt to satisfy the needs of many diverse fields. While trying to cater to a large user community, these languages fail to provide specialized tools for specific areas of application. Thus the user is often left with ill-suited means to accomplish a task. This has become quite apparent in numerical programming and scientific computing. Even though programming has become more convenient through the use of more modern language concepts, numerical programs have not necessarily become more reliable.
The development of programming languages suited for the particular needs of numerical programming has been a long-term commitment of the Institute of Applied Mathematics at the University of Karlsruhe. With languages and tools such as PASCAL-XSC, C-XSC (see articles in this volume), ACRITH-XSC, and ACRITH, the emphasis is on accuracy and reliability in general and on automatic result verification in particular.

The first language extension for "Scientific Computation", PASCAL-SC [7, 19], was designed and implemented in the late 1970s and has been under continuous development since. The most recent implementation, called PASCAL-XSC, has been available for a wide variety of computers ranging from micros to mainframes since 1991 [15]. In order to reach a broader public, reports and proposals on how to incorporate similar concepts into Fortran 8x were published in the early 1980s [5, 6]. In the meantime, several of the proposed features have found their way into the Fortran 90 standard [13, 29, 30, 31, 32]. However, a rigorous mathematical definition of computer arithmetic and roundings (e.g. as defined by Kulisch and Miranker in [17, 18]) is still lacking in Fortran 90. The standard does not contain any accuracy requirements for arithmetic operators and mathematical intrinsic functions.

The programming language FORTRAN-SC [4, 22, 27, 28, 23] was designed as a Fortran-like language featuring specialized tools for reliable scientific computing. It was defined and implemented at the University of Karlsruhe in a joint project with IBM Germany and has been in use at a number of international universities and research institutions since 1988. The equivalent IBM program product High Accuracy Arithmetic - Extended Scientific Computation, called ACRITH-XSC for short, was released for world-wide distribution in 1990 [11]. Numerically, it is based on IBM's High-Accuracy Arithmetic Subroutine Library (ACRITH) [9, 10], a FORTRAN 77 library which was first released in 1984.

The use of ACRITH in FORTRAN 77 programs triggered the demand for a more convenient programming environment and resulted in the development of ACRITH-XSC. With the aid of these tools, numerical programming takes a major step from an approximative, often empirical science towards a true mathematical discipline.
2 Brief Comparison with Fortran 90

The new Fortran standard, developed under the name Fortran 8x and now known as Fortran 90 [13], was finally adopted and published as an international (ISO) standard in the summer of 1991. The new Fortran language offers a multitude of new features which the Fortran user community has been awaiting impatiently. Among the most prominent are extensive array handling facilities, a general type and operator concept, pointers, and modules. Also, many of the newly added intrinsic functions, especially the array functions, numeric inquiry functions, floating-point manipulation functions, and kind functions (for selecting one of the representation methods of a data type), can be quite useful for numerical purposes. Through their judicious use, the portability of Fortran 90 programs can be enhanced.
Unfortunately, however, portability of numerical results is still extremely difficult to achieve, since the mathematical properties of the arithmetic operators and mathematical standard functions, in particular strict accuracy requirements, remain unspecified in the Fortran 90 standard. Thus, computational results still cannot be expected to be compatible when using different computer systems with different floating-point units, compilers, and compiler options.

ACRITH-XSC contains a number of Fortran 90-like features such as array functions and expressions, user-defined operators and operator overloading, dynamic arrays and subarrays, and others. However, an attempt was also made to keep the language reasonably small compared with Fortran 90; thus other features present in Fortran 90, e.g. pointers and modules, were not included. On the other hand, the ACRITH-XSC language provides a number of specialized numerical tools which cannot be found in Fortran 90, such as complete rounding control, interval arithmetic, accurate dot products, and the exact evaluation of dot product expressions. These make ACRITH-XSC well-suited for the development of numerical algorithms which deliver highly accurate and automatically verified results. In contrast, the Fortran 90 standard does not specify any minimal accuracy requirements for intrinsic functions and operators or for the conversion of numerical data. For the user, their rounding behavior may vary from one machine to another and cannot be influenced by any portable or standard means.
3 Rounding Control and Interval Arithmetic

By controlling the rounding error at each step of a calculation, it is possible to compute guaranteed bounds on a solution and thus verify the computational results on the computer. Enclosures of a whole set or family of solutions can be computed using interval arithmetic, for example to treat problems involving imprecise data or other data with tolerances, or to study the influence of certain parameters. Interval analysis is particularly valuable for stability and sensitivity analysis. It provides one of the essential foundations for reliable numerical computation.
ACRITH-XSC provides complete rounding control for numeric constants, input and output of numeric data, and the arithmetic operators +, -, *, / (for real and
complex numbers, vectors, and matrices). This ensures that the user knows exactly what data enters the computational process and what data is produced as a result. Besides the default rounding, the monotonic downwardly and upwardly directed roundings, symbolized by < and >, respectively, are available to compute guaranteed lower and upper bounds on a solution. All three rounding modes deliver results of 1 ulp (one unit in the last place) accuracy in all cases. A special notation is available for rounded constants. It may be used anywhere a numeric constant is permitted in a program. The conversion from the decimal representation of a constant to the internal format always produces one of the two neighboring floating-point numbers. Rounding downwards produces the largest
floating-point number not greater than the given constant, while rounding upwards produces the smallest floating-point number not less than the constant. If no rounding is specified, the constant is converted to the nearest floating-point number. For example, the decimal constant 2.7182818284591 can thus be converted into a lower bound for e, the nearest floating-point approximation of e, or an upper bound for e, depending on the rounding symbol used.
The direction of rounding can also be prescribed in the I/O list of a READ or WRITE statement. In this way, for example, a guaranteed lower bound for the sum of two numbers given in decimal can be produced, again in decimal notation.
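The effect of the three conversion roundings can be reproduced in Python by comparing the exact decimal value with its round-to-nearest conversion. This is purely illustrative; convert and its mode strings are hypothetical names, not ACRITH-XSC syntax:

```python
from fractions import Fraction
import math

def convert(decimal_string, mode="nearest"):
    """Convert a decimal constant to a float under a chosen rounding:
    '<' yields the largest float not greater than the exact decimal
    value, '>' the smallest float not less than it."""
    exact = Fraction(decimal_string)   # exact decimal value
    r = float(exact)                   # correctly rounded to nearest
    if mode == "<" and Fraction(r) > exact:
        r = math.nextafter(r, -math.inf)
    elif mode == ">" and Fraction(r) < exact:
        r = math.nextafter(r, math.inf)
    return r

e = "2.7182818284590452"
print(convert(e, "<"), convert(e), convert(e, ">"))
```

For a constant that is exactly representable, all three modes agree; otherwise the downward and upward results are the two neighboring floating-point numbers, one ulp apart.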
  Operation           | Explanation
  --------------------|---------------------------------------------------------------
  ACC1 := x           | initialize an accumulator object with a floating-point value
  ACC1 := ACC1 + x    | add a floating-point value to an accumulator object
  ACC1 := ACC1 - x    | subtract a floating-point value from an accumulator object
  ACC1 := ACC1 - x*y  | subtract a product of floating-point operands from an accumulator object
  ACC1 := -ACC1       | invert the sign of an accumulator object
  ACC1 = ACC2         | compare two accumulator objects
  ACC1 := ACC1 + ACC2 | add two accumulator objects
  ACC1 := ACC1 - ACC2 | subtract two accumulator objects
  ACC1 := ACC2        | copy the contents of an accumulator object

Table 2: Additional Accumulator Operations
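A minimal model of such an accumulator object can be written in Python, with exact rational arithmetic standing in for the long fixed-point accumulator of the proposal (the method names are illustrative, chosen to match Table 2):

```python
from fractions import Fraction

class Accumulator:
    """Exact accumulator: holds sums of floating-point values and
    products without any rounding, supporting the operations of
    Table 2."""
    def __init__(self, value=0.0):
        self.v = Fraction(value)               # exact initialization
    def add(self, x):            self.v += Fraction(x); return self
    def sub(self, x):            self.v -= Fraction(x); return self
    def sub_product(self, x, y): self.v -= Fraction(x) * Fraction(y); return self
    def negate(self):            self.v = -self.v; return self
    def add_acc(self, other):    self.v += other.v; return self
    def sub_acc(self, other):    self.v -= other.v; return self
    def copy_from(self, other):  self.v = other.v; return self
    def __eq__(self, other):     return self.v == other.v
    def round(self):             return float(self.v)  # single final rounding

# 1 - 3*(1/3) cancels to 0.0 in plain floats; the accumulator keeps the
# exact residue of the rounded quotient 1/3 and rounds only at the end:
print(Accumulator(1.0).sub_product(3.0, 1.0 / 3.0).round())
```

The nonzero residue it reports is exactly the information an error-controlled algorithm needs and ordinary floating-point arithmetic throws away.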
7 Conclusion
The operations described in this proposal allow a natural extension of the ideas of existing standards for scalar floating-point arithmetic to vectors and matrices. The accurate dot product as defined in this proposal provides a flexible and application-oriented basis for the implementation of sophisticated, efficient, and reliable numerical software. Thus it becomes possible to program algorithms in such a way that they carry their own error control and produce results of high accuracy.
A new branch of numerical mathematics has evolved around the idea of automatically verifying computational results on the computer. Besides interval arithmetic, which is essential to compute guaranteed bounds on a solution, a general tool to improve the accuracy of numerical calculations is needed to obtain reliable results of high accuracy. The accurate dot product is well suited for this purpose.
Proposal for Accurate Floating-Point Vector Arithmetic
Various engineering problems in soil mechanics, optics of liquid crystals, groundwater modelling, magnetohydrodynamics and other fields have been successfully solved using automatic result verification techniques. In all of these problems, the known traditional floating-point methods had previously failed. Thus, the proposed vector arithmetic extension goes far beyond ordinary floating-point arithmetic, yet with reasonable extra effort.
A Case Study for IEEE Arithmetic
The terms and concepts defined in the IEEE Standard 754-1985 for Binary Floating-Point Arithmetic [1] and in the IEEE Standard 854-1987 for Radix-Independent Floating-Point Arithmetic [2] are consistent with the basic requirements for the natural extension of the scalar arithmetic operations to the dot product operation. The IEEE data formats, the definition of arithmetic operations, the rounding modes and the exception handling are well-defined in the IEEE standards and can be applied without changes to this proposal for accurate vector arithmetic.
A.1 Dot Product Operation
This section prescribes the structure of a possible new standard for an accurate dot product operation as a basis for a well-defined vector arithmetic. The terms and wording of the IEEE standards are used whenever appropriate. This section also proves that an extension of the IEEE concepts to an accurate dot product operation can be realized without any restrictions on the basic concepts.

Using the diction of the IEEE standards, the computation of the dot product of floating-point vectors must be considered as if performed in an extended data format, where all the required operations can be performed without error, and then rounded back to the destination's precision. This guarantees both the accuracy of the result and conforming results in case of special input operands such as infinities, signed zeros and NaNs. Also, the same exception and trap handling occurs when the rounding is applied.

The contents of sections 1.-4. Scope, Definitions, Precisions, Rounding of the IEEE standards can be adopted without changes. Only one definition, extracted from section 5. Operations of the IEEE Standard 854-1987, is quoted here to outline the main goal behind the new vector operations:

    Except for conversion between internal floating-point representations and decimal strings, each of the operations shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded exponent range, and then coerced this intermediate result to fit in the destination's precision.
G. Bohlender, D. Cordes, A. Knöfel, U. Kulisch, R. Lohner, W. V. Walter
Following the structure of the IEEE standards, the dot product operation is defined:
Dot Product Operation
An implementation shall provide the dot product operation Σ_{i=1}^{n} x_i · y_i for any number n of pairs of operands x_i, y_i, i = 1, ..., n, of the same basic format. The result shall be rounded to the destination's precision as specified in section 4 of the existing IEEE standards. The value of the natural number n shall be representable in a supported integer format.

The definitions in 6. Infinity, NaNs, and Signed Zero can be adopted, too.
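As a sketch of this "compute exactly, round once" semantics (not an implementation of the proposed standard), the following Python fragment uses exact rational arithmetic as the infinitely precise intermediate format; the test vectors are our own, chosen to provoke catastrophic cancellation.

```python
from fractions import Fraction

def exact_dot(xs, ys):
    """Dot product as if computed to infinite precision, with a single
    rounding (to nearest) into the destination format at the very end."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)   # products and sums are exact
    return float(acc)                      # the only rounding

# A conventional loop rounds after every operation and returns 32.0 here,
# while the once-rounded dot product returns the correct value 29.0.
xs = [1e17, 29.0, -1e17]
ys = [1.0,  1.0,  1.0]
naive = 0.0
for x, y in zip(xs, ys):
    naive += x * y
print(naive, exact_dot(xs, ys))   # 32.0 29.0
```

The naive result is off because the spacing of doubles near 10^17 is 16, so adding 29.0 already rounds to the wrong multiple; the exact accumulation is immune to the order and magnitude of the summands.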
For the exceptional behavior of the dot product operation, the definitions in 7. Exceptions of the IEEE standards are sufficient and can be easily interpreted. An invalid operation shall be signaled if one of the following cases occurs:

1. Any operation on a signaling NaN, magnitude subtraction of infinities (such as (+∞) − (+∞)), or invalid multiplication (0 · ∞).

2. The number of vector elements exceeds the limit imposed by the supported integer format.

The exceptions overflow, underflow, and inexact can only occur when the final rounding is applied. The trap handling in section 8. Traps can also be adopted.
A.2 Accumulator Operations
This section does not have any corresponding section in the IEEE standards. Essentially, the abstract definition of an accumulator object and of the related operations is given.
Accumulator Object
For any number n of pairs of operands x_i, y_i of the same basic format, an "accumulator object" shall be capable of holding the unrounded result of any dot product Σ_{i=1}^{n} x_i · y_i as computed with infinite precision and unbounded exponent range. The value of the natural number n shall be representable in a supported integer format.
Note that an accumulator object must be capable of holding any value representable in the data format of the operands, i. e. there must be representations for NaNs, infinity and signed zero. This is a consequence of the requirements imposed by the specification of the general behavior of vector operations.
Proposal for Accurate Floating-Point Vector Arithmetic
99
The following section describes a minimal set of "accumulator operations" for the accumulator object.
The fundamental operations which are needed for an accurate dot product are listed in Table 3. The operands x and y and the result z are numbers in one of the basic formats; ACC is a suitable accumulator object. Apart from the explicit final rounding operation ○, all operations are performed without rounding error and with unbounded exponent range.

operation               explanation
1) ACC := x * y         initialize accumulator with product of x and y (exact, without rounding error)
2) ACC := ACC + x * y   add product of x and y to ACC (exact, without rounding error)
3) z := ○(ACC)          round ACC to the destination's precision (according to the rounding mode ○)

Table 3: Fundamental Operations

The following exceptions can occur in accumulator operations:
Exceptions
In the rounding z := ○(ACC) of an accumulator object to the destination's precision, the exceptions overflow, underflow, and inexact can occur. In all other accumulator operations, only an invalid operation exception may be signaled, indicating an operation on a signaling NaN, a magnitude subtraction of infinities (∞ − ∞), an invalid multiplication (0 · ∞), or that the accumulator object is insufficient to hold the intermediate exact result. The last case shall never occur if the total number n of accumulation operations 2) performed on the same accumulator object is representable in a supported integer format.

In addition to the fundamental operations listed above, several other related operations are often useful. These additional operations are listed in Table 4.
operation                    explanation
 4) ACC := x                 initialize an accumulator object with a floating-point value
 5) ACC := ACC + x           add a floating-point value to an accumulator object
 6) ACC := ACC - x           subtract a floating-point value from an accumulator object
 7) ACC := ACC - x * y       subtract a product of floating-point operands from an accumulator object
 8) ACC := -ACC              invert the sign of an accumulator object
 9) ACC1 <, >, = ACC2        compare two accumulator objects
10) ACC1 := ACC1 + ACC2      add two accumulator objects
11) ACC1 := ACC1 - ACC2      subtract two accumulator objects
12) ACC1 := ACC2             copy the contents of an accumulator object

Table 4: Additional Operations
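A minimal model of the accumulator object and a few of the operations of Tables 3 and 4 can be sketched in Python, with `Fraction` standing in for the fixed-point long accumulator. The class and method names are ours, and special values such as NaN and infinity are omitted; only the exactness of the accumulation and the single final rounding are illustrated.

```python
from fractions import Fraction

class Accumulator:
    """Sketch of the accumulator object: all operations are exact, and a
    single rounding happens on read-out."""
    def __init__(self, value=0.0):
        self.acc = Fraction(value)             # op 4): exact initialization
    def add_product(self, x, y):               # op 2): ACC := ACC + x*y, exact
        self.acc += Fraction(x) * Fraction(y)
    def sub_product(self, x, y):               # op 7): ACC := ACC - x*y, exact
        self.acc -= Fraction(x) * Fraction(y)
    def negate(self):                          # op 8): ACC := -ACC
        self.acc = -self.acc
    def round(self):                           # op 3): round to destination
        return float(self.acc)                 # round-to-nearest

acc = Accumulator()
for x, y in [(1e16, 1.0), (3.0, 0.5), (-1e16, 1.0)]:
    acc.add_product(x, y)
print(acc.round())   # 1.5
```

Because the intermediate sum 10^16 + 1.5 is held exactly, the small contribution 1.5 survives the cancellation of the two large terms.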
References

[1] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985, New York, 1985.

[2] American National Standards Institute / Institute of Electrical and Electronics Engineers: IEEE Standard for Radix-Independent Floating-Point Arithmetic. ANSI/IEEE Std 854-1987, New York, 1987.

[3] Bleher, J. H.; Rump, S. M.; Kulisch, U.; Metzger, M.; Ullrich, Ch.; Walter, W.: FORTRAN-SC: A Study of a FORTRAN Extension for Engineering/Scientific Computation with Access to ACRITH. Computing 39, 93-110, Springer-Verlag, 1987.

[4] Bohlender, G.; Rall, L. B.; Ullrich, Ch.; Wolff von Gudenberg, J.: PASCAL-SC: Wirkungsvoll programmieren, kontrolliert rechnen. Bibl. Inst., Mannheim, 1986; ...: PASCAL-SC: A Computer Language for Scientific Computation. Perspectives in Computing 17, Academic Press, Orlando, 1987.

[5] Bohlender, G.: What Do We Need Beyond IEEE Arithmetic? In Ullrich, Ch. (ed.): Computer Arithmetic and Self-Validating Numerical Methods, Academic Press, 1990.

[6] Bohlender, G.: A Vector Extension of the IEEE Standard for Floating-Point Arithmetic. In [22], 3-12, 1991.

[7] Bohlender, G.; Knöfel, A.: A Survey of Pipelined Hardware Support for Accurate Scalar Products. In [22], 29-43, 1991.
[8] Bohlender, G.; Kornerup, P.; Matula, D. W.; Walter, W. V.: Semantics for Exact Floating Point Operations. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, 22-26, IEEE Comp. Soc., 1991.

[9] Cordes, D.: Runtime System for a PASCAL-XSC Compiler. In [22], 151-160, 1991.

[10] Dekker, T. J.: A Floating-Point Technique for Extending the Available Precision. Numerische Mathematik 18, 224-242, 1971.

[11] Dongarra, J. J.; Du Croz, J.; Hammarling, S.; Hanson, R.: An Extended Set of Fortran Basic Linear Algebra Subprograms. ACM Trans. on Math. Software 14, no. 1, 1988.

[12] Dongarra, J. J.; Du Croz, J.; Duff, I.; Hammarling, S.: A Set of Level 3 Basic Linear Algebra Subprograms. ACM Trans. on Math. Software 16, no. 1, 1990.

[13] Hahn, W.; Mohr, K.: APL/PCXA, Erweiterung der IEEE-Arithmetik für technisch-wissenschaftliches Rechnen. Hanser Verlag, München, 1989.

[14] Hammer, R.: How Reliable is the Arithmetic of Vector Computers? In [36], 467-482, 1990.

[15] IBM: System/370 RPQ, High-Accuracy Arithmetic. SA22-7093-0, IBM Corp., 1984.

[16] IBM: High-Accuracy Arithmetic Subroutine Library (ACRITH), General Information Manual. 3rd ed., GC33-6163-02, IBM Corp., 1986.

[17] IBM: High Accuracy Arithmetic - Extended Scientific Computation (ACRITH-XSC), General Information. GC33-6461-01, IBM Corp., 1990.

[18] IMACS, GAMM: Resolution on Computer Arithmetic. In Mathematics and Computers in Simulation 31, 297-298, 1989; in Zeitschrift für Angewandte Mathematik und Mechanik 70, no. 4, p. T5, 1990; in Ch. Ullrich (ed.): Computer Arithmetic and Self-Validating Numerical Methods, 301-302, Academic Press, San Diego, 1990; in [36], 523-524, 1990; in [22], 477-478, 1991.

[19] ISO: Language Compatible Arithmetic Standard (LCAS). Committee Draft (Version 3.1), ISO/IEC 10967, 1991.

[20] Kahan, W.: Further Remarks on Reducing Truncation Errors. Comm. ACM 8, no. 1, 40, 1965.

[21] Kahan, W.: Doubled Precision IEEE Standard 754 Floating-Point Arithmetic. Conf. on Computers and Mathematics, Mini-Course on "The Regrettable Failure of Automated Error Analysis", MIT, June 13, 1989.

[22] Kaucher, E.; Markov, S. M.; Mayer, G. (eds.): Computer Arithmetic, Scientific Computation and Mathematical Modelling. IMACS Annals on Computing and Applied Mathematics 12, J. C. Baltzer, Basel, 1991.

[23] Kießling, I.; Lowes, M.; Paulik, A.: Genaue Rechnerarithmetik, Intervallrechnung und Programmieren mit PASCAL-SC. Teubner Verlag, Stuttgart, 1988.

[24] Kirchner, R.; Kulisch, U.: Arithmetic for Vector Processors. Proc. of 8th IEEE Symp. on Computer Arithmetic (ARITH 8) in Como, IEEE Comp. Soc., 1987.

[25] Klatte, R.; Kulisch, U.; Neaga, M.; Ratz, D.; Ullrich, Ch.: PASCAL-XSC Sprachbeschreibung mit Beispielen. Springer-Verlag, Berlin, Heidelberg, 1991; ...: PASCAL-XSC Language Reference with Examples. Springer-Verlag, Berlin, Heidelberg, 1992.
[26] Knöfel, A.: Fast Hardware Units for the Computation of Accurate Dot Products. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, 70-74, IEEE Comp. Soc., 1991.

[27] Kulisch, U.: Grundlagen des numerischen Rechnens: Mathematische Begründung der Rechnerarithmetik. Reihe Informatik 19, Bibl. Inst., Mannheim, 1976.

[28] Kulisch, U.; Miranker, W. L.: Computer Arithmetic in Theory and Practice. Academic Press, New York, 1981.

[29] Lawson, C.; Hanson, R.; Kincaid, D.; Krogh, F.: Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. on Math. Software 5, 1979.

[30] Linnainmaa, S.: Analysis of Some Known Methods of Improving the Accuracy of Floating-Point Sums. BIT 14, 167-202, 1974.

[31] Møller, O.: Quasi Double-Precision in Floating Point Addition. BIT 5, 37-50, 1965.

[32] Müller, M.; Rüb, Ch.; Rülling, W.: Exact Addition of Floating Point Numbers. Sonderforschungsbereich 124, FB 14, Informatik, Univ. des Saarlandes, Saarbrücken, 1990.

[33] NAG: Basic Linear Algebra Subprograms (BLAS). The Numerical Algorithms Group Ltd, Oxford, 1990.

[34] Priest, D. M.: Algorithms for Arbitrary Precision Floating Point Arithmetic. Proc. of 10th IEEE Symp. on Computer Arithmetic (ARITH 10) in Grenoble, IEEE Comp. Soc., 1991.

[35] Ratz, D.: The Effects of the Arithmetic of Vector Computers on Basic Numerical Methods. In [36], 499-514, 1990.

[36] Ullrich, Ch. (ed.): Contributions to Computer Arithmetic and Self-Validating Numerical Methods. IMACS Annals on Computing and Applied Mathematics 7, J. C. Baltzer, Basel, 1990.

[37] Weeks, D.: Vectorizing a Robust Inner Product Algorithm. Proc. of Third Int. Conf. on Supercomputing, ACM, 1989.
II. Enclosure Methods and Algorithms with Automatic Result Verification
Automatic Differentiation and Applications’ Hans- Christoph Fischer
1 Introduction
The computation of derivatives is a problem which arises in many fields of Applied Mathematics: for a function one is interested not only in the function value but also in the values of certain derivatives. E. g., the Newton-Raphson method

    x^(ν+1) = x^ν − (f′(x^ν))^(−1) · f(x^ν)    (ν = 0, 1, 2, ...)

for the solution of a nonlinear equation f(x) = 0 uses the value of the first derivative f′(x^ν), or, in the case of a system of nonlinear equations, the Jacobian matrix.

The computation of derivatives can be done in an efficient way by means of the so-called automatic differentiation: formulas do not have to be manipulated symbolically, and derivatives are determined exactly rather than approximately. Furthermore, with the help of interval arithmetic it is possible to compute guaranteed bounds for the values of derivatives. This is the basis for many problem-solving algorithms with automatic result verification, e. g., numerical quadrature, the solution of ordinary differential equations (initial and boundary value problems), or systems of integral equations.

Before the method of automatic differentiation is explained in more detail, some words about symbolic and numerical differentiation are in place. In numerical differentiation, the values of derivatives are approximated by corresponding difference quotients. E. g., the approximation of the first derivative of a real function f(x) requires two evaluations of the function, one subtraction, and one division:

    f′(x) ≈ (f(x + h) − f(x)) / h.
The stepsize h has to be chosen such that the approximation error becomes small and the rounding errors are not amplified in an unacceptable way (e. g., due to cancellation in the numerator). Because of these difficulties, this method frequently leads to unsatisfactory results.

In contrast to numerical differentiation, the methods of symbolic and automatic differentiation are not approximative. The symbolic differentiation 'by hand' or by use of a computer-algebra system (e. g. MACSYMA [26], MAPLE [27] or REDUCE [33]) starts with a functional expression

¹ The author wants to express his appreciation to Prof. U. Kulisch and Prof. E. Kaucher, both of the University of Karlsruhe, for their constant encouragement when the thesis [8] was written, which is the basis of the present paper.
for f = f(x) and then applies, step by step, the well-known rules of differentiation to get an expression for the desired derivative. For a fixed value of x, the evaluation of this expression then results in the numerical value of the derivative. Even for simple functions, the automatically generated expressions may be large and unstructured, and their generation may involve considerable requirements concerning memory and CPU time. Therefore, in the case that one is not interested in expressions for the derivatives but rather in values only, this method may not be efficient.

Automatic differentiation avoids the problems of the symbolic method: the application of the rules of differentiation and the evaluation are carried out in parallel, i. e., during the whole process only numbers have to be manipulated. Therefore, the method of automatic differentiation can easily be coded in such programming languages as FORTRAN, PASCAL, etc. Provided the programming language supports overloading of operators, an expression may be entered in the usual mathematical notation: the evaluation of derivatives is executed by means of a suitable definition of the operators. In general, the method requires only small extra storage. Furthermore, the employment of fast floating-point hardware is possible, i. e., the running times are short.

The paper is organized as follows: in Section 2 we develop the basic principles of the (forward mode of) automatic differentiation: the computation of the first derivative, the computation of Taylor coefficients for functions f : R → R, and the evaluation of the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n) for functions f : R^n → R.
In Section 3 the method of reverse computation is presented. This method allows a considerable acceleration of many algorithms of automatic differentiation; this leads to the so-called fast automatic differentiation or reverse automatic differentiation. Especially for the computation of gradients, the following surprising estimate can be shown: A(f, ∇f) ≤ C · A(f), where A(f, ∇f) denotes the number of operations for the computation of the gradient (including the function evaluation) and A(f) the number of operations for the function evaluation, respectively.

Section 4 introduces interval slopes. They can be evaluated efficiently by the method of reverse computation. With interval slopes, it is possible to define centered forms which are the basis for the efficient calculation of the range of values.

In Section 5 we apply the reverse computation to the problem of evaluating functions in floating-point arithmetic. Guaranteed bounds for the rounding errors can be computed.

In the Appendix a program in PASCAL-XSC demonstrates an implementation of automatic differentiation. For the function f(x) = 25(x − 1)/(x^2 + 1), the second derivative f″(2) is computed.
2 Automatic Differentiation (Forward Mode)
First, we briefly summarize the basic results of automatic differentiation since they will be essential for the following sections. Some ideas concerning this method can be found already in the late 1950s (Beda et al. [5]). In the books of Moore [29] and Rall [30], automatic differentiation (forward mode) is used to solve various problems of Applied Mathematics. These authors have also examined the possibilities of implementing the method.
2.1 The Computation of First Derivatives
The principles of automatic differentiation may be explained best with the help of the evaluation of first derivatives. As functions we examine arithmetic expressions which can be formulated in typical programming languages (FORTRAN, PASCAL, ...), i. e. expressions consisting of a certain finite composition of constants, variables, the basic operations +, −, ·, /, and differentiable standard functions such as exp, sin, cos, ...
For a function f : R → R, the computation of the values of the function and its first derivative is carried out by means of a differentiation arithmetic. This is an arithmetic of ordered pairs, like complex arithmetic or interval arithmetic. The pairs look as follows: U := (u, u′) with u, u′ ∈ R, where the first component is the function value and the second component is the value of the first derivative.² The rules for addition, subtraction, etc. of these pairs are simple. In the first component, u, the function value is computed by addition, subtraction, etc.; in the second component, u′, the resulting value of the derivative is computed by the well-known rules for differentiation:

    U + V = (u, u′) + (v, v′) = (u + v, u′ + v′),
    U − V = (u, u′) − (v, v′) = (u − v, u′ − v′),
    U · V = (u, u′) · (v, v′) = (u · v, u′ · v + u · v′),
    U / V = (u, u′) / (v, v′) = (u/v, (u′ − (u/v) · v′)/v),   v ≠ 0.

Pairs involving standard functions are computed by use of the chain rule. E. g., there hold:

    exp(U) = exp(u, u′) = (exp(u), exp(u) · u′),
    sin(U) = sin(u, u′) = (sin(u), cos(u) · u′),
    cos(U) = cos(u, u′) = (cos(u), −sin(u) · u′).

For the independent variable x we use the pair X := (x_0, 1); a constant c is represented by C := (c, 0). Thus, for all admitted expressions the function value f(x_0)

² In the following, the existence of the derivatives is always assumed. For functions from the class of arithmetic expressions this assumption can be checked easily.
and the value of the first derivative f′(x_0) can be computed step by step, making use of the above rules.
A simple example shows the procedure. This demonstrates the fact that neither approximations nor symbolic manipulations are required.
Example 1  For the function f(x) = 25(x − 1)/(x^2 + 1), the function value and the value of the first derivative are to be computed for x_0 = 2. The employment of the differentiation arithmetic³ gives:

    X = (2, 1),
    X − 1 = (2, 1) − (1, 0) = (1, 1),
    25 · (X − 1) = (25, 0) · (1, 1) = (25, 25),
    X · X + 1 = (4, 4) + (1, 0) = (5, 4),
    f(X) = (25, 25) / (5, 4) = (25/5, (25 − (25/5) · 4)/5) = (5, 1),

i. e., f(2) = 5 and f′(2) = 1.
In the Appendix it is shown that the differentiation arithmetic can be implemented in a well-structured and efficient way provided the programming language allows operator overloading. In the language PASCAL-XSC [19] this is possible. In particular, its standard datatypes are designed for numerical applications. An analogous statement holds for Ada, C++ and the FORTRAN-based language ACRITH-XSC [14].
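The differentiation arithmetic above can also be sketched with operator overloading in Python (the class name `Dual` and the helpers `var` and `const` are ours, not from the paper); applied to Example 1 it reproduces f(2) = 5 and f′(2) = 1.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Ordered pair (u, u') of function value and first derivative,
    combined by the rules of the differentiation arithmetic."""
    u: float    # function value
    du: float   # derivative value
    def __add__(self, o): return Dual(self.u + o.u, self.du + o.du)
    def __sub__(self, o): return Dual(self.u - o.u, self.du - o.du)
    def __mul__(self, o): return Dual(self.u * o.u, self.du * o.u + self.u * o.du)
    def __truediv__(self, o):
        q = self.u / o.u
        return Dual(q, (self.du - q * o.du) / o.u)

def var(x0):  return Dual(x0, 1.0)   # independent variable: X = (x0, 1)
def const(c): return Dual(c, 0.0)    # constant: C = (c, 0)

# Example 1: f(x) = 25(x-1)/(x^2+1) at x0 = 2
X = var(2.0)
F = const(25.0) * (X - const(1.0)) / (X * X + const(1.0))
print(F.u, F.du)   # 5.0 1.0
```

Note that no expression for f′ is ever formed; only pairs of numbers are propagated.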
2.2 The Computation of Taylor Coefficients
The employment of automatic differentiation is not limited to the computation of first derivatives. By applying suitable recurrence relations, it is possible to compute derivatives of a higher order. Furthermore, no symbolic manipulations or numerical approximations are necessary. Just as in the previous subsection, the admitted functions are arithmetic expressions. To simplify the presentation, we use Taylor coefficients instead of derivatives. For the Taylor coefficients of f : R → R at x = x_0, we use the following notation:

    (f)_k := f^(k)(x_0) / k!   (k = 0, 1, 2, ...),

i. e., f(x_0) = (f)_0, f′(x_0) = (f)_1, and f^(k)(x_0) = k! · (f)_k. The special functions f(x) = x and f(x) = c (for constants c ∈ R) lead to

    (x)_0 = x_0,  (x)_1 = 1,  (x)_k = 0 for k ≥ 2

and

    (c)_0 = c,  (c)_k = 0 for k ≥ 1.

³ Of course, the usual priority rules are also applied concerning the operations of the differentiation arithmetic.
The rules for the differentiation of sums, products, or quotients lead (for k ≥ 0) to the following formulas for the operators +, −, ·, and / (see e. g. [29],[30]):

    (f + g)_k = (f)_k + (g)_k,
    (f − g)_k = (f)_k − (g)_k,
    (f · g)_k = Σ_{j=0}^{k} (f)_j · (g)_{k−j},
    (f / g)_k = ( (f)_k − Σ_{j=0}^{k−1} (f/g)_j · (g)_{k−j} ) / (g)_0,   (g)_0 ≠ 0.
For efficiency reasons, these cases have to be taken into consideration in the implementation.
With respect to the Taylor coefficients of h = f’ and the definition of (h)k, the relation
and the chain rule of differentiation make it possible to compute Taylor coefficients for standard functions.
For the exponential function w^e = exp(f), there holds (w^e)′ = exp(f) · f′ = w^e · f′, i. e. by use of (2),

    (w^e)_0 = exp((f)_0),   (w^e)_k = (1/k) Σ_{j=1}^{k} j · (f)_j · (w^e)_{k−j}   (k ≥ 1).   (3)
Analogously, for the functions w^s = sin(f) and w^c = cos(f), one gets (see e. g. [29],[30]):

    (w^s)_0 = sin((f)_0),   (w^s)_k = (1/k) Σ_{j=1}^{k} j · (f)_j · (w^c)_{k−j}   (k ≥ 1)   (4)

and

    (w^c)_0 = cos((f)_0),   (w^c)_k = −(1/k) Σ_{j=1}^{k} j · (f)_j · (w^s)_{k−j}   (k ≥ 1).   (5)
Thus, the computation of the Taylor coefficients of orders 0 (function value) to p ≥ 1 of a function can be realized by use of a Taylor arithmetic, i. e. an arithmetic of (p+1)-vectors. These vectors are defined as follows: U := (u_0, ..., u_p) with u_k = (u)_k (k = 0 ... p). Especially, the independent variable x is represented by the (p+1)-vector X := (x_0, 1, 0, ..., 0), a constant c by the (p+1)-vector C := (c, 0, ..., 0). E. g., a multiplication can be written in the form:

    U · V = (u_0, ..., u_p) · (v_0, ..., v_p) = (w_0, ..., w_p)
    with  w_k = Σ_{j=0}^{k} u_j · v_{k−j}   (k = 0 ... p).
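The convolution formula for w_k can be sketched as follows (Python; the helper names `taylor_mul` and `taylor_var` are ours):

```python
def taylor_mul(U, V):
    """Cauchy product: Taylor coefficients of u*v from those of u and v."""
    p = len(U) - 1
    return [sum(U[j] * V[k - j] for j in range(k + 1)) for k in range(p + 1)]

def taylor_var(x0, p):
    """The independent variable as a (p+1)-vector X = (x0, 1, 0, ..., 0)."""
    return [x0, 1.0] + [0.0] * (p - 1)

# Taylor coefficients of f(x) = x*x at x0 = 3:
# f = 9 + 6(x-3) + 1(x-3)^2, so the vector is (9, 6, 1, 0).
X = taylor_var(3.0, 3)
print(taylor_mul(X, X))   # [9.0, 6.0, 1.0, 0.0]
```

As the text notes, this single operation already costs on the order of (p+1)^2 scalar operations, which is where the overall cost estimate comes from.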
Additionally, from this formula, an estimate of the number of operations can be derived: computing the coefficients of the product U · V from the coefficients of U and V requires ½ p(p+1) additions and ½ (p+1)(p+2) multiplications, i. e. (p+1)^2 operations are necessary. For the ratio of the cost A(T_p f) of the computation of all Taylor coefficients of f (including order p) and the cost A(f) for the computation of the function value, the following estimate is valid⁴ (e. g. [29]):

    A(T_p f) / A(f) ≤ (p + 1)^2.
For p = 1, the Taylor arithmetic becomes the differentiation arithmetic of the preceding subsection. However, only the combination with interval arithmetic ([2],[29]) will allow the maximum utilization of the automatic generation of Taylor coefficients. Numerous algorithms with automatic result verification (see e. g. [18],[20]) require estimations of remainder terms.

⁴ Note that for the computation of the coefficients in (3), (4), and (5) only the Taylor coefficients of order 0 require a call of a standard function.
Interval arithmetic is a useful tool for this purpose. For two real intervals [a, b] and [c, d], the (interval) operations ∘ ∈ {+, −, ·, /} are defined by

    [a, b] ∘ [c, d] := { x ∘ y | x ∈ [a, b], y ∈ [c, d] }.

The result is always an interval (in the case of a division it is assumed that 0 ∉ [c, d]), and the bounds of the result can be computed by use of suitable operations on the bounds of the operands. E. g., for an addition there holds [a, b] + [c, d] = [a + c, b + d], and for a subtraction, [a, b] − [c, d] = [a − d, b − c]. For a (continuous) standard function sf ∈ {exp, sin, cos, ...}, the interval evaluation is defined by use of sf([a, b]) = { sf(x) | x ∈ [a, b] }.⁵

The following property (see e. g. [29]) is essential for the interval arithmetic evaluation of an arithmetic expression f(x): provided the value of x is replaced by the interval X and all operations and standard functions are replaced by the corresponding interval operations and interval standard functions (and there are no inadmissible operations), then one gets the so-called interval extension F of f; additionally, the following important inclusion is valid:

    f(X) := { f(x) | x ∈ X } ⊆ F(X).

Thus, the recurrence relations for the computation of Taylor coefficients can be evaluated in interval arithmetic: replacing x_0 by the interval X_0 and replacing operations by interval operations, the resulting intervals (F)_k will enclose the values of the Taylor coefficients (f)_k for all x_0 ∈ X_0.

By means of a suitable machine interval arithmetic ([23],[24]), all interval computations can be carried out on a computer without losing the enclosure property. In particular, all possible rounding errors are automatically taken into consideration; of course, this is also true for the case where X_0 is a point interval, i. e. X_0 = [x_0, x_0].
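A toy interval arithmetic illustrating the enclosure property can be written in a few lines (Python; the class name is ours, and the outward rounding of the bounds that a machine interval arithmetic must perform is omitted here for brevity):

```python
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, o): return Interval(self.lo + o.lo, self.hi + o.hi)
    def __sub__(self, o): return Interval(self.lo - o.hi, self.hi - o.lo)
    def __mul__(self, o):
        # min/max over the four endpoint products
        ps = [self.lo * o.lo, self.lo * o.hi, self.hi * o.lo, self.hi * o.hi]
        return Interval(min(ps), max(ps))

# Interval extension F of f(x) = x*x - x over X = [0, 1]:
X = Interval(0.0, 1.0)
F = X * X - X
# Enclosure property: every value f(x), x in X, lies in F(X).
for t in range(11):
    x = t / 10
    assert F.lo <= x * x - x <= F.hi
print(F.lo, F.hi)   # -1.0 1.0
```

The enclosure [−1, 1] overestimates the true range [−1/4, 0], which is exactly the kind of overestimation that centered forms (Section 4) are designed to reduce.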
2.3 The Computation of Gradients
In this subsection we consider the problem of computing the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n) of a function f(x), x = (x_1, ..., x_n), making use of automatic differentiation.
The procedure is analogous to that of the differentiation arithmetic for the computation of first derivatives: instead of computing by means of pairs of real numbers, we use vectors just as in the case of the Taylor arithmetic. The (n+1)-vectors are composed of the function value f(x^0) (x^0 = (x_1^0, ..., x_n^0)) and the n values of the partial derivatives with respect to the variables x_1, ..., x_n. The rules for these vectors are the well-known rules for the computation of gradients, e. g. in the case of multiplication: ∇(f · g) = f · ∇g + g · ∇f.

⁵ E. g., for the (monotonic) exponential function this is: exp([a, b]) = [exp(a), exp(b)].
Thus, the combination of two (n+1)-vectors U = (u_0, ..., u_n) and V = (v_0, ..., v_n) obeys the following laws of a gradient arithmetic (see e. g. [31]):

    U + V = (w_0, ..., w_n)   with w_0 = u_0 + v_0,  w_i = u_i + v_i             (i = 1 ... n),
    U − V = (w_0, ..., w_n)   with w_0 = u_0 − v_0,  w_i = u_i − v_i             (i = 1 ... n),
    U · V = (w_0, ..., w_n)   with w_0 = u_0 · v_0,  w_i = u_i · v_0 + u_0 · v_i (i = 1 ... n),
    U / V = (w_0, ..., w_n)   with w_0 = u_0 / v_0,  w_i = (u_i − w_0 · v_i)/v_0 (i = 1 ... n),  v_0 ≠ 0,
    exp(U) = (w_0, ..., w_n)  with w_0 = exp(u_0),   w_i = exp(u_0) · u_i        (i = 1 ... n),
    sin(U) = (w_0, ..., w_n)  with w_0 = sin(u_0),   w_i = cos(u_0) · u_i        (i = 1 ... n),
    cos(U) = (w_0, ..., w_n)  with w_0 = cos(u_0),   w_i = −sin(u_0) · u_i       (i = 1 ... n).
The independent variables x_1, ..., x_n are represented by the (n+1)-vectors X_1 := (x_1^0, 1, 0, ..., 0), ..., X_n := (x_n^0, 0, ..., 0, 1), and a constant c by C := (c, 0, ..., 0). Thus, the function value f(x^0) and the value of the gradient ∇f(x^0) can be computed for all arithmetic expressions in a step-by-step process.⁶

Example 2  For the function f(x_1, x_2, x_3, x_4) = x_1 · x_2 · x_3 · x_4, the function value and the gradient will be evaluated for x_1^0 = 1, x_2^0 = 2, x_3^0 = −1, x_4^0 = −2. The computation is carried out in the following steps:

    X_1 = (1, 1, 0, 0, 0),
    X_2 = (2, 0, 1, 0, 0),
    X_3 = (−1, 0, 0, 1, 0),
    X_4 = (−2, 0, 0, 0, 1),
    F_5 = X_1 · X_2 = (2, 2, 1, 0, 0),
    F_6 = F_5 · X_3 = (−2, −2, −1, 2, 0),
    F_7 = F_6 · X_4 = (4, 4, 2, −4, −2),

i. e., f(1, 2, −1, −2) = 4, ∂f/∂x_1 = 4, ∂f/∂x_2 = 2, ∂f/∂x_3 = −4, and ∂f/∂x_4 = −2.
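The multiplication rule of the gradient arithmetic, applied to Example 2, can be sketched in Python (class and function names are ours; only the operations needed for the example are implemented):

```python
class Grad:
    """(n+1)-vector (function value, partial derivatives) with the
    multiplication rule of the gradient arithmetic."""
    def __init__(self, val, grad):
        self.val, self.grad = val, list(grad)
    def __mul__(self, o):
        # w_i = u_i * v_0 + u_0 * v_i
        return Grad(self.val * o.val,
                    [gi * o.val + self.val * hi
                     for gi, hi in zip(self.grad, o.grad)])

def variables(x0):
    """Represent x_1 ... x_n as (n+1)-vectors with a 1 in their own slot."""
    n = len(x0)
    return [Grad(x0[k], [1.0 if i == k else 0.0 for i in range(n)])
            for k in range(n)]

# Example 2: f = x1*x2*x3*x4 at (1, 2, -1, -2)
X1, X2, X3, X4 = variables([1.0, 2.0, -1.0, -2.0])
F = X1 * X2 * X3 * X4
print(F.val, F.grad)   # 4.0 [4.0, 2.0, -4.0, -2.0]
```

Every intermediate carries n+1 numbers, which is the source of the linear growth in cost discussed next.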
The number of operations for the evaluation of the function and its gradient grows as a linear function of n: in every step the n+1 components of the vectors have to be computed⁷, i. e., for A(f, ∇f) (the cost for computing the function and its gradient) and A(f) (the cost for computing the function) there holds (see e. g. [30]):

    A(f, ∇f) ≤ c · (n + 1) · A(f)

with a small constant c independent of n.

⁶ In a proper implementation, the sparse structure of the vectors X and C can be used to reduce the number of operations.
⁷ Of course, the number of operations for the approximation of gradients by divided differences grows analogously.
The method requires only a small extra storage space since the intermediate results can be overwritten immediately after their processing. Additionally, by use of the method of automatic differentiation, other problems can be treated in an efficient way: the computation of Hessian and Jacobian matrices as discussed in [30], and the calculation of Taylor coefficients of multivariate functions in [8]. Algorithms for the evaluation of the product of a gradient and a given vector or the product of a Hessian matrix and a given vector can be found in [8]. Here, only the problem of the product p of a gradient and a given vector v = (v_1, ..., v_n) ∈ R^n will be treated in a more detailed way. By a modification of the algorithm for gradients, the product can be computed without an explicit evaluation of the gradient. For the ratio A(f, ∇f · v)/A(f), it can be shown that it is bounded by a constant, i. e., independent of n. We show the procedure for the case of the multiplication h = f · g:

    p^h := ∇h · v = (f · ∇g + g · ∇f) · v
         = f · (∇g · v) + g · (∇f · v)
         = f · p^g + g · p^f,

where p^f and p^g denote the already computed products ∇f · v and ∇g · v. For the other operations, there hold analogous formulas. The process starts with p^f := v_k for f = x_k, k = 1 ... n, and p^f := 0 for f = c ∈ R.
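This gradient-vector propagation can be sketched as follows; each intermediate carries only the pair (value, p), never the full gradient (Python; the class name `Tangent` is ours):

```python
class Tangent:
    """Pair (value, p) where p is the directional derivative grad(f)·v,
    propagated without forming the full gradient."""
    def __init__(self, val, p):
        self.val, self.p = val, p
    def __mul__(self, o):   # p^(f·g) = f·p^g + g·p^f
        return Tangent(self.val * o.val, self.val * o.p + o.val * self.p)

# f = x1*x2*x3*x4 at (1, 2, -1, -2), direction v = (1, 1, 1, 1):
# grad f = (4, 2, -4, -2), so grad(f)·v = 4 + 2 - 4 - 2 = 0.
x0, v = [1.0, 2.0, -1.0, -2.0], [1.0, 1.0, 1.0, 1.0]
xs = [Tangent(x0[k], v[k]) for k in range(4)]
F = xs[0] * xs[1] * xs[2] * xs[3]
print(F.val, F.p)   # 4.0 0.0
```

The cost per operation is constant (two extra multiplications and one addition here), regardless of n, which is the point of the bounded ratio A(f, ∇f · v)/A(f).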
3 Fast Automatic Differentiation
In this section, algorithms for fast automatic differentiation are presented. With these algorithms, it is possible to solve more efficiently some of the problems of the preceding section. The new method for the computation of gradients and Hessian matrices reduces the number of operations by one order of magnitude. In the first subsection, the basic algorithm of the fast method will be presented, i. e., the reverse computation. This reverse mode is also essential for the methods of Sections 4 and 5.
3.1
The Basic Algorithm of Reverse Computation
The following simple proposition for the solution of a linear system (with a triangular matrix) is the basis of the reverse method.
H.-C. Fischer
Proposition 1 Let Γ, A ∈ R^{s×s} (s ∈ N), where Γ = diag(γ_i^{-1}) with γ_i ≠ 0 (1 ≤ i ≤ s), A the strictly lower triangular matrix (a_ij) with a_ij = 0 for i ≤ j (1 ≤ i, j ≤ s), and let z = (z_i) ∈ R^s (1 ≤ i ≤ s). Then, for the solution h = (h_i) (1 ≤ i ≤ s) of (Γ − A)h = z there holds:

h_i = γ_i ( Σ_{j=1}^{i−1} a_ij h_j + z_i ),   (7)

h_i = Σ_{j=1}^{i} d^{ij} z_j   with d^{ij} = 0 for j > i, d^{ii} = γ_i,   (8)

and

d^{ij} = γ_j Σ_{k=j+1}^{i} d^{ik} a_kj   for j = i−1, i−2, ..., 1.   (9)

Proof From (Γ − A)h = z, there follows γ_i^{-1} h_i − Σ_{j=1}^{i−1} a_ij h_j = z_i (1 ≤ i ≤ s), i.e., h_i = γ_i (Σ_{j=1}^{i−1} a_ij h_j + z_i) (1 ≤ i ≤ s); thus, (7) has been proved.
Let L be given by L := Γ − A. Then, L = (l_ij) (1 ≤ i, j ≤ s) with l_ii = γ_i^{-1} and l_ij = 0 for i < j, i.e., det(L) = det(Γ), which is ≠ 0 by assumption. Let D be given by D = (d^{ij}) := L^{-1}; then h = Dz and h_i = Σ_{j=1}^{s} d^{ij} z_j (1 ≤ i ≤ s). With Γ^{-1} = diag(γ_i) and the identity matrix I, there holds:

D(Γ − A) = I,   DΓ = I + DA,   D = (I + DA) Γ^{-1},

i.e., for 1 ≤ i, j ≤ s:

d^{ij} = ( δ_ij + Σ_{k=j+1}^{s} d^{ik} a_kj ) γ_j   for j = s, s−1, ..., 1.

Thus, with d^{ij} = 0 (j > i) and d^{ii} = γ_i, equations (8) and (9) have been proved. □
Remark 1 The computation of h by means of (7) is called the forward mode; that by use of (8) and (9), the reverse mode. The name of the latter is due to the computation of (9) in descending order. The formulas (8) and (9) are used in [7] to estimate the temporal complexity of gradients for rational functions f : R^n → R.
Remark 2 Here and subsequently, the following expressions are used synonymously: (i) the cost, the temporal complexity, and the number of operations (of an algorithm); and (ii) the storage cost, the spatial complexity, and the number of storage units.
Since in the following we are interested only in the component h_s and the values d^{ss}, ..., d^{s1} as following from the solution of (Γ − A)h = z, a corollary summarizes the basic algorithm.

Corollary 1 For s ∈ N, let γ_i, z_i ∈ R, γ_i ≠ 0 (1 ≤ i ≤ s), a_ij ∈ R (1 ≤ i, j ≤ s) with a_ij = 0 for i ≤ j, and the sequence h^1, ..., h^s with

h^1 = γ_1 z_1,
h^i = γ_i ( Σ_{j=1}^{i−1} a_ij h^j + z_i )   for 1 < i ≤ s.   (10)

Then, there holds

h^s = Σ_{j=1}^{s} d^j z_j   (d^j ∈ R, 1 ≤ j ≤ s)   (11)

with

d^j = γ_j ( δ_sj + Σ_{k=j+1}^{s} d^k a_kj )   for j = s, s−1, ..., 1.   (12)

The computation of the values d^s, d^{s−1}, ..., d^1 by use of (12) can be carried out by means of the following algorithm:

Algorithm 1 (Basic Algorithm of Reverse Mode)

1. {Initialisation} d^s := γ_s and d^j := 0 for j < s.

2. {Computation}
FOR k := s DOWNTO 2 DO
  d^j := d^j + γ_j a_kj d^k   for all j < k.   (13)
Proof From Proposition 1, there follows h^s = Σ_{j=1}^{s} d^{sj} z_j with d^{sj} = γ_j (δ_sj + Σ_{k=j+1}^{s} d^{sk} a_kj) for j ≤ s and, thus, (12) with d^j := d^{sj}. Define now d^j_k := γ_j (δ_sj + Σ_{l=k+1}^{s} d^l a_lj) for 1 ≤ k ≤ s. Then, d^j_s = γ_j δ_sj is valid; this is step 1 of the algorithm. Furthermore, d^k = d^k_k = d^k_{k−1} = ... = d^k_1 because of a_lk = 0 for l ≤ k. From d^j_{k−1} = γ_j (δ_sj + Σ_{l=k}^{s} d^l a_lj) = γ_j a_kj d^k + d^j_k for k = s, s−1, ..., 2 and j < k, step 2 of the algorithm can be derived. □
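The forward recurrence (10) and the reverse computation of the coefficients d^j can be checked against each other in a few lines of Python (an illustrative sketch with made-up data, not the chapter's code):

```python
# Corollary 1 / Algorithm 1: forward recurrence
#   h_i = gamma_i * (sum_{j<i} a_ij h_j + z_i)
# versus the reverse computation of the d_j with h_s = sum_j d_j z_j.

def forward_h(gamma, a, z):
    s = len(z)
    h = [0.0] * s
    for i in range(s):
        h[i] = gamma[i] * (sum(a[i][j] * h[j] for j in range(i)) + z[i])
    return h[-1]

def reverse_d(gamma, a):
    s = len(gamma)
    d = [0.0] * s
    d[-1] = gamma[-1]                 # step 1: d_s = gamma_s, d_j = 0 otherwise
    for k in range(s - 1, 0, -1):     # step 2: k = s downto 2
        for j in range(k):            # all j < k
            d[j] += gamma[j] * a[k][j] * d[k]
    return d

gamma = [2.0, 1.0, 3.0, 0.5]
a = [[0, 0, 0, 0],
     [1, 0, 0, 0],
     [2, -1, 0, 0],
     [0, 4, 1, 0]]                   # strictly lower triangular
z = [1.0, -2.0, 0.5, 3.0]
d = reverse_d(gamma, a)
assert abs(forward_h(gamma, a, z) - sum(dj * zj for dj, zj in zip(d, z))) < 1e-12
```

Note that the reverse sweep computes the d^j without using z at all; this is exactly what makes one sweep sufficient for all components of a gradient.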
3.2 Fast Computation of Gradients

In this subsection we use the reverse mode of computation to get an algorithm for the fast evaluation of gradients. For an arithmetic expression f(x) of f : R^n → R, we define a decomposition f^1, ..., f^s of f as the sequence of intermediate results in the course of the evaluation of f. For the sake of simplicity, the sequence may start with the independent variables and the constants, i.e., f^1 = x_1, ..., f^n = x_n, and f^{n+1} = c_1, ..., f^m = c_{m−n}. For an expression, there may exist several decompositions.
Example 3 For the function f(x_1, x_2, x_3, x_4) = x_1·x_2·x_3·x_4, the following sequence is a decomposition:

f^1 = x_1,  f^2 = x_2,  f^3 = x_3,  f^4 = x_4,
f^5 = f^1 · f^2,  f^6 = f^5 · f^3,  f^7 = f^6 · f^4.
For the function f(x) with decomposition f^1...f^s, the component g_ν (ν = 1...n) of the gradient ∇f = (g_1, ..., g_n) = ∇f^s = (g^s_1, ..., g^s_n) can be computed by use of the following steps (cf. forward mode):

g^k_ν = δ_νk   for f^k = x_k, k = 1...n,
g^k_ν = 0   for f^k = c_{k−n}, k = n+1...m,
g^k_ν = g^l_ν ± g^r_ν   for f^k = f^l ± f^r, k > m,
g^k_ν = f^r × g^l_ν + f^l × g^r_ν   for f^k = f^l × f^r, k > m,
g^k_ν = (g^l_ν − f^k × g^r_ν)/f^r   for f^k = f^l/f^r, k > m,
g^k_ν = sf'(f^l) × g^l_ν   for f^k = sf(f^l), k > m.

The common argument x = (x_1, ..., x_n) for f^k and g^k_ν has been omitted for reasons of simplicity.
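The forward-mode steps above can be sketched in Python (an illustrative sketch restricted to + and ×; the node class and names are ours, not the chapter's):

```python
# Forward-mode gradient propagation: each node k carries its value f_k and
# its full gradient vector g_k, updated by the rules above.

class Node:
    def __init__(self, value, grad):
        self.value, self.grad = value, list(grad)
    def __add__(s, o):
        return Node(s.value + o.value, [a + b for a, b in zip(s.grad, o.grad)])
    def __mul__(s, o):
        # g_k = f_r * g_l + f_l * g_r
        return Node(s.value * o.value,
                    [o.value * a + s.value * b for a, b in zip(s.grad, o.grad)])

def var(x, k, n):
    return Node(x, [1.0 if i == k else 0.0 for i in range(n)])  # g = e_k

x = [1.0, 2.0, -1.0, -2.0]
n = len(x)
v = [var(xi, k, n) for k, xi in enumerate(x)]
f = v[0] * v[1] * v[2] * v[3]
print(f.value, f.grad)   # 4.0 [4.0, 2.0, -4.0, -2.0]
```

Each elementary operation updates n gradient components, which is the source of the factor-n cost of the forward mode that Section 3 removes.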
Only the values g^k_ν, k = 1...n, depend explicitly on ν; i.e., for k = n+1...s the expressions for all components of the gradient are the same. Thus, Corollary 1 can be applied. For k > m and l, r < k, we define a_kj = ∂f^k/∂f^j, i.e.:

a_kl = 1, a_kr = ±1   if f^k = f^l ± f^r,
a_kl = f^r, a_kr = f^l   if f^k = f^l × f^r,
a_kl = 1/f^r, a_kr = −f^k/f^r   if f^k = f^l/f^r,
a_kl = sf'(f^l)   if f^k = sf(f^l),

and otherwise a_kj = 0.

By setting γ_k = 1 and z_k = δ_kν for fixed ν (ν ∈ {1...n}) and k = 1...s, we get from h^k = γ_k (Σ_{j=1}^{k−1} a_kj h^j + z_k) exactly the recurrences for g^k_ν given above; e.g., for k > m,

h^k = h^l/f^r − f^k/f^r × h^r,   if f^k = f^l/f^r,

and thus h^s = g^s_ν. On the other hand, Corollary 1 shows that

h^s = Σ_{j=1}^{s} d^j z_j = d^ν,   since z_j = δ_jν,

i.e., ∇f = (d^1, ..., d^n), where the d-values are given by (12) or Algorithm 1. Thus, the computation of all components of the gradient of f is equivalent to one evaluation of the d-values (using the a-values defined above). This result can be summarized in the following algorithm:
Algorithm 2 (Evaluation of gradients (reverse mode))

{ in: f : R^n → R, x ∈ R^n, and a decomposition f^1...f^s of f;
  out: f = f(x), g = ∇f(x) = (d^i) (i = 1...n); }

1. {Forward step} Compute and store f^1(x)...f^s(x).

2. {Initialisation} d^s := 1 and d^j := 0 for j < s.

3. {Reverse computation}
FOR k := s DOWNTO m+1 DO
  d^l := d^l + d^k,  d^r := d^r ± d^k,   if f^k = f^l ± f^r
  d^l := d^l + f^r × d^k,  d^r := d^r + f^l × d^k,   if f^k = f^l × f^r
  d^l := d^l + [d^k/f^r],  d^r := d^r − f^k × [d^k/f^r],   if f^k = f^l/f^r
    { [d^k/f^r] has to be computed only once! }
  d^l := d^l + sf'(f^l) × d^k,   if f^k = sf(f^l)

4. {Output} f := f^s, g := (d^1, ..., d^n).
The number of operations for the computation of a gradient by means of the above algorithm can be estimated in the following way: depending on the elementary operation f^k in step 3 of Algorithm 2, the number of necessary operations is listed in the table below. An addition or a subtraction is counted as 1A, a multiplication or division as 1M, and a call of a standard function as 1SF. The ratio A(d^j)/A(f^k) is denoted by α_k. For the sake of simplicity, all operations are weighted with the same factor.

function f^k    A(f^k)   A(d^j)          α_k
f^l ± f^r       1A       2A              2
f^l × f^r       1M       2A + 2M         4
f^l / f^r       1M       2A + 2M         4
sf(f^l)         1SF      1A + 1M + 1SF'  ≈ 3

Thus, in Algorithm 2, the ratio of the costs A(f, ∇f) and A(f) can be estimated to be

A(f, ∇f)/A(f) ≤ 1 + max_k α_k ≤ 5.   (14)

In particular, inequality (14) shows that the temporal complexity of the computation of the gradient (including the function value) is of the same order as the complexity of the computation of the function value. The name 'fast automatic differentiation' is justified by a comparison with the estimation (6): the computation of gradients by use of the new algorithm has a complexity which is one order of magnitude less than that of the forward mode. Estimations analogous to (14) can be found in [11] and [15]. In these sources, the results are proved by use of graph theory. Just as in the previous section, interval arithmetic can be used to compute guaranteed enclosures of the results. If the bounds are not sufficiently tight, defect-correction methods for the evaluation of formulas may improve the results [9]. Algorithm 2 is now illustrated by an example.
Example 4 Choose f(x_1, x_2, x_3, x_4) = x_1·x_2·x_3·x_4 and x_1 = 1, x_2 = 2, x_3 = −1, x_4 = −2. By use of the intermediate results f^1, ..., f^7 one gets:

f^1 = x_1 = 1          d^1 = 0 + f^2·d^5 = 4
f^2 = x_2 = 2          d^2 = 0 + f^1·d^5 = 2
f^3 = x_3 = −1         d^3 = 0 + f^5·d^6 = −4
f^4 = x_4 = −2         d^4 = 0 + f^6·d^7 = −2
f^5 = f^1·f^2 = 2      d^5 = 0 + f^3·d^6 = 2
f^6 = f^5·f^3 = −2     d^6 = 0 + f^4·d^7 = −2
f = f^7 = f^6·f^4 = 4  d^7 = 1

i.e., f(1, 2, −1, −2) = 4 and ∇f = (d^1, d^2, d^3, d^4) = (4, 2, −4, −2).
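The forward and reverse sweeps of Example 4 can be reproduced in a few lines of Python (an illustrative sketch; the chapter's own implementation is in PASCAL-XSC):

```python
# Reverse-mode sketch of Algorithm 2 for f = x1*x2*x3*x4 of Example 4.

x = [1.0, 2.0, -1.0, -2.0]

# Forward step: store intermediates f^1..f^7 (0-indexed here).
f = x + [0.0, 0.0, 0.0]
mul = [(4, 0, 1), (5, 4, 2), (6, 5, 3)]   # f^5=f^1*f^2, f^6=f^5*f^3, f^7=f^6*f^4
for k, l, r in mul:
    f[k] = f[l] * f[r]

# Reverse computation: d^s := 1, then d^l += f^r*d^k, d^r += f^l*d^k.
d = [0.0] * 7
d[6] = 1.0
for k, l, r in reversed(mul):
    d[l] += f[r] * d[k]
    d[r] += f[l] * d[k]

print(f[6], d[:4])   # 4.0 [4.0, 2.0, -4.0, -2.0]
```

One reverse sweep over the three multiplications yields all four gradient components, in contrast to the forward mode, which would update a length-4 gradient at every step.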
The next example shows that the fast computation of gradients enables one to attack problems of considerable size, too.
Example 5 Let f be the Helmholtz energy function (see e.g. [11]):

f(x) = Σ_{i=1}^{n} x_i ln( x_i / (1 − b^T x) ) − (x^T A x)/(√8 b^T x) · ln( (1 + (1 + √2) b^T x) / (1 + (1 − √2) b^T x) ),

0 ≤ x, b ∈ R^n, A = A^T ∈ R^{n×n}.
Algorithm 2 was coded in PASCAL-XSC. The function value f(x) and the gradient ∇f(x) were evaluated for different values of x. The time for the evaluation of the function value was measured by use of a 'straightforward' program (using FOR-loops for the sum and the vector products). The following table shows the ratio V = RT(f, ∇f)/RT(f) of running times for different values of n:

n      V
10     2.35
20     2.85
50     3.59
100    9.99
A possible disadvantage of the fast automatic differentiation in its basic form of Algorithm 2 is the necessity of storing all the intermediate results f^1...f^s; they are required in the course of the computation of the d-values in step 3 of Algorithm 2. Instead of storing all intermediate results, they can be recomputed when needed. In many cases this recomputation can be carried out efficiently: the original steps of the evaluation are inverted ([10], [28]). E.g., the intermediate results f^{n+1} := f^1·f^2, f^{n+2} := f^{n+1}·f^3, ..., f^{2n−1} := f^{2n−2}·f^n, which occur in the course of the computation of f(x_1, ..., x_n) = ∏_{i=1}^{n} x_i (x_i ≠ 0), can be recomputed from f = f^{2n−1} by use of the inverted program f^{n+i−2} := f^{n+i−1}/f^i, i = n, ..., 3. The reduction of the required storage will increase the number of operations: about one additional evaluation of f will hereby be necessary. Another approach for balancing the temporal and spatial complexity in reverse automatic differentiation is presented in [12].
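The inversion idea for the product example can be sketched as follows (illustrative Python; function names are ours):

```python
# Recomputation by program inversion for f(x) = x1*...*xn (xi != 0):
# instead of storing the partial products f_{n+1}, ..., f_{2n-1}, they are
# recovered from the final value by the inverted program
#   f_{n+i-2} = f_{n+i-1} / f_i,  i = n, ..., 3.

def partial_products(x):
    acc, out = x[0], []
    for xi in x[1:]:
        acc *= xi
        out.append(acc)          # f_{n+1}, ..., f_{2n-1}
    return out

def recompute_from_result(x, f_final):
    # inverted program, run backwards; returns the same partial products
    out, acc = [f_final], f_final
    for xi in reversed(x[2:]):   # i = n, ..., 3
        acc /= xi
        out.append(acc)
    return list(reversed(out))

x = [2.0, 4.0, 0.5, -3.0]
stored = partial_products(x)
assert recompute_from_result(x, stored[-1]) == stored
```

Storage drops from n−1 intermediates to a single value, at the price of roughly one extra pass over the data, mirroring the trade-off described above. (In floating-point arithmetic the inverted divisions may of course differ from the stored values by rounding; the values here were chosen so that they agree exactly.)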
The method of fast automatic differentiation (just like automatic differentiation in the forward mode) may help to solve some other problems efficiently: the computation of a Hessian matrix and the evaluation of the product of a Hessian matrix and a given vector can be found in [8]. In both cases, the new algorithms reduce the temporal complexity by one order of magnitude as compared to the results of Section 2. Additionally, the method of reverse evaluation may be used for 'symbolic' computations. In [8], a method for the generation of an explicit code for gradients is discussed: a source program for the evaluation of a function f is transformed into an extended program which evaluates f and its gradient. For the extended program, the complexity estimation (14) is valid.
4 Fast Computation of Interval Slopes

The concepts of (interval) slopes and derivatives are closely connected. In [21] and [22], interval slopes for rational functions (depending on one or several variables) and their corresponding centered forms are discussed. Additionally, a procedure for the computation of an interval slope F_I[·,·] for a rational function f is given there; the complexity of this procedure grows proportionally to the number n of variables.

In the following, an algorithm is presented which computes an interval slope F_II[·,·] for an arithmetic expression with a complexity that is bounded by a constant multiple of A(f), i.e., the costs are reduced by one order of magnitude. With this interval slope, a quadratically convergent centered form can be defined. Thus, in combination with a subdivision strategy, it is possible to compute enclosures for ranges of values.
4.1 Slopes for Arithmetic Expressions

The following definition and proposition are based on results concerning rational functions in [22].

Definition 1 Let f : D → R, D ⊆ R^n, be given by an arithmetic expression. Then, a continuous function f[·,·] : D × D → R^n with

f(x) − f(z) = f[x, z] · (x − z)   for all x, z ∈ D

is called slope for f.
Remark 3 The vector f[x, z] ∈ R^n is a row vector, i.e., the product f[x, z]·(x − z) is the product of a (1 × n)- and an (n × 1)-matrix.

Remark 4 In general, there may be different slopes for a function; e.g.,

f[x, z] = (1, 1)  and  f[x, z] = (1 + x_2 − z_2, 1 − (x_1 − z_1))

are slopes for f(x_1, x_2) = x_1 + x_2.
The following proposition provides expressions for the computation of a slope for a function f (with decomposition f^1...f^s). In analogy with the evaluation of gradients, this method is called computation in forward mode.
Proposition 2 Let f : D → R, D ⊆ R^n, be an arithmetic expression, f^1...f^s a decomposition of f, and x, z ∈ D. Furthermore, choose

f^k[x, z] := e_k {k-th unity vector}   for k = 1...n,
f^k[x, z] := 0 {zero vector}   for k = n+1...m,

and, for k > m (l, r < k),

f^k[x, z] := −f^l[x, z],   if f^k = −f^l,
f^k[x, z] := f^l[x, z] ± f^r[x, z],   if f^k = f^l ± f^r,
f^k[x, z] := f^r(x)·f^l[x, z] + f^l(z)·f^r[x, z],   if f^k = f^l × f^r,
f^k[x, z] := (f^l[x, z] − f^k(z)·f^r[x, z])/f^r(x),   if f^k = f^l/f^r,
f^k[x, z] := (f^l(x) + f^l(z))·f^l[x, z],   if f^k = (f^l)²,
f^k[x, z] := f^l[x, z]/(f^k(x) + f^k(z)),   if f^k = √(f^l) and (f^k(x) + f^k(z)) > 0,

and, in the case of a standard function f^k = sf(f^l),

f^k[x, z] := sf[f^l(x), f^l(z)] · f^l[x, z],   where

sf[a, b] = (sf(a) − sf(b))/(a − b)   if a ≠ b,
sf[a, b] = sf'(a)   if a = b.

Then, f[x, z] := f^s[x, z] is a slope for f.
Proof We show by induction that f^1[x, z], ..., f^k[x, z], ..., f^s[x, z] are slopes for f^1, ..., f^k, ..., f^s. For k = 1...m, this is trivial. Now, for k > m, let f^l[x, z] and f^r[x, z] be slopes for f^l and f^r (l, r < k). If f^k = f^l ± f^r, then:

f^k(x) − f^k(z) = f^l(x) ± f^r(x) − (f^l(z) ± f^r(z))
                = f^l(x) − f^l(z) ± (f^r(x) − f^r(z))
                = f^l[x, z]·(x − z) ± f^r[x, z]·(x − z)
                = f^k[x, z]·(x − z).

Thus, f^k[x, z] is a slope for f^k; the other cases are treated analogously. For k = s this completes the proof. □

Remark 5 The additional assumption (f^k(x) + f^k(z)) > 0 for f^k = √(f^l) is necessary, since the square root can be differentiated in (0, ∞) only; for x, z with f^k(x) = f^k(z) = 0, a slope is not defined. For standard functions such as exp, sin, and cos, the derivative exists on the entire domain.
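The defining identity f(x) − f(z) = f[x, z]·(x − z) can be verified numerically for the recurrences of Proposition 2 (an illustrative Python sketch for one variable, covering only − and ×; class and names are ours):

```python
# Propagate (value at x, value at z, slope) through a decomposition and
# check f(x) - f(z) = f[x,z] * (x - z).

class Slope:
    def __init__(self, fx, fz, s):
        self.fx, self.fz, self.s = fx, fz, s
    def __sub__(a, b):
        return Slope(a.fx - b.fx, a.fz - b.fz, a.s - b.s)
    def __mul__(a, b):
        # product rule: f^k[x,z] = f^r(x)*f^l[x,z] + f^l(z)*f^r[x,z]
        return Slope(a.fx * b.fx, a.fz * b.fz, b.fx * a.s + a.fz * b.s)

def var(x, z):
    return Slope(x, z, 1.0)

x, z = 0.75, 0.25
X = var(x, z)
g = (X - X * X) * X          # f(t) = (t - t^2) * t
assert abs((g.fx - g.fz) - g.s * (x - z)) < 1e-12
```

Setting z = x in the same recurrences reproduces the ordinary derivative, as noted below for gradients.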
By setting x = z in the above proposition, one gets the recurrence relations for the computation of the gradient of f, i.e., ∇f(z) = f[z, z]. Therefore, the complexity of the computation of slopes by use of the above (forward) recurrences grows, as for gradients, proportionally to n. Analogously to the case of the fast computation of gradients, it is now possible to formulate a fast algorithm for the computation of slopes: the application of the corollary in Section 3 reduces the complexity ratio to a constant.
Algorithm 3 (Computation of slopes (reverse mode))

{ in: f : R^n → R, x, z ∈ R^n, and a decomposition f^1...f^s of f;
  out: f(x), f(z), and the slope f[x, z] by use of Proposition 2; }

1. {Forward step} Compute and store f^1(x)...f^s(x) and f^1(z)...f^s(z).

2. {Initialisation} d^s := 1 and d^j := 0 for j < s.

3. {Reverse computation}
FOR k := s DOWNTO m+1 DO
  d^l := d^l + d^k,  d^r := d^r ± d^k,   if f^k = f^l ± f^r,
  d^l := d^l + f^r(x) × d^k,  d^r := d^r + f^l(z) × d^k,   if f^k = f^l × f^r,
  d^l := d^l + d^k/f^r(x),  d^r := d^r − f^k(z) × d^k/f^r(x),   if f^k = f^l/f^r,
  d^l := d^l + (f^l(x) + f^l(z)) × d^k,   if f^k = (f^l)²,
  d^l := d^l + d^k/(f^k(x) + f^k(z)),   if f^k = √(f^l),
  d^l := d^l + sf[f^l(x), f^l(z)] × d^k,   if f^k = sf(f^l).

4. {Output} f(x) := f^s(x), f(z) := f^s(z), f[x, z] := (d^1, ..., d^n).
Only the combination of interval arithmetic with the concept of slopes makes it possible to compute enclosures for ranges. Therefore, interval slopes are defined as follows (cf. [21]):

Definition 2 Let f : D → R, D ⊆ R^n, be an arithmetic expression. Then, G ∈ IR^n is called an interval slope for f with respect to X ⊆ D (X ∈ IR^n) and z ∈ X provided

f(x) − f(z) ∈ G · (x − z)

is valid for all x ∈ X.

Remark 6 Interval slopes can also be defined for non-differentiable functions.
In the following, we discuss a different possibility to compute interval slopes: in the decomposition f^1...f^s of f, the value of x is replaced by the interval vector X, and the operations and functions are replaced by the corresponding interval operations and interval functions; in the absence of errors, this leads to the interval evaluation F^1(X)...F^s(X) = F(X) of f. The following example shows that the interval evaluation may not exist.
Example 6 The function f(x) = 1/(x·x + 1) is defined for all x ∈ R. However, for the decomposition

f^1 = x, f^2 = 1, f^3 = f^1·f^1, f^4 = f^3 + f^2, f^5 = f^2/f^4

and X = [−2, 2], the interval evaluation does not exist since F^4(X) = [−3, 5] ∋ 0; i.e., the final division is not possible. For the decomposition

f^1 = x, f^2 = 1, f^3 = (f^1)², f^4 = f^3 + f^2, f^5 = f^2/f^4,

one gets F^3 = [0, 4], F^4 = [1, 5], and F = [0.2, 1], provided the standard function for squaring is defined as X² := {x² | x ∈ X}; the property f^3 ≥ 0 is here still valid in interval arithmetic.

In the following, we assume that all relevant intervals exist. If the operations in Proposition 2 are carried out in interval arithmetic then, by inclusion isotony, we get f^k[x, z] ∈ F^k[X, z] for all k = 1...s and all x ∈ X (z fixed).
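The effect shown in Example 6 can be reproduced with a minimal interval arithmetic (an illustrative Python sketch, not the machine interval arithmetic of the chapter):

```python
# Example 6: with f^3 = f^1*f^1 the denominator interval contains 0,
# while f^3 = (f^1)^2 keeps it positive.

class I:
    def __init__(self, lo, hi): self.lo, self.hi = lo, hi
    def __add__(a, b): return I(a.lo + b.lo, a.hi + b.hi)
    def __mul__(a, b):
        ps = [a.lo*b.lo, a.lo*b.hi, a.hi*b.lo, a.hi*b.hi]
        return I(min(ps), max(ps))
    def sq(a):
        # X^2 := {x^2 | x in X} -- tighter than X*X
        hi = max(abs(a.lo), abs(a.hi))
        lo = 0.0 if a.lo <= 0.0 <= a.hi else min(abs(a.lo), abs(a.hi))
        return I(lo*lo, hi*hi)

X, one = I(-2.0, 2.0), I(1.0, 1.0)
F4_naive = X * X + one       # [-3, 5] -- contains 0, division impossible
F4_square = X.sq() + one     # [ 1, 5] -- division well defined
print((F4_naive.lo, F4_naive.hi), (F4_square.lo, F4_square.hi))
```

The choice of decomposition thus decides whether the interval evaluation exists at all, which is why the text assumes existence of all relevant intervals in what follows.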
In particular, F[X, z] = F^s[X, z] (z ∈ X, z fixed) is an interval slope for f in X since for all x, z:

f(x) − f(z) = f[x, z]·(x − z) ∈ F[X, z]·(X − z).

Furthermore, with f[z, z] = ∇f(z) one gets for x, z ∈ X:

∇f(x) ∈ F_I[X, X] =: ∇_I f(X),   (16)

where the subscript I indicates that the slope (gradient) was computed by use of Proposition 2 (i.e., in forward mode). The same convention will be used for F_I[X, z] and F_I[X, X], since the next section presents an algorithm for slopes which uses the reverse mode.
4.2 Computation of Interval Slopes in Reverse Mode

As in the preceding subsection, we assume the existence of all intervals in the expressions to be evaluated. Through a replacement of x by X, of sf[f^l(x), f^l(z)] by sf'(F^l(X)), of d^k by D^k, and of the operations of Algorithm 3 by the corresponding interval-arithmetic operations, there holds:

(d^1, ..., d^n) = f[x, z] ∈ (D^1, ..., D^n) =: F_II[X, z].

In particular, F_II[X, z] (z ∈ X, z fixed) is an interval slope for f in X; for x, z ∈ X, we get:

∇f(x) ∈ F_II[X, X] =: ∇_II f(X).   (17)

For the complexity of the computation of an interval slope by use of methods I and II, estimations analogous to those for point slopes can be derived. In particular, the new method II is faster (by a factor of n) than the procedure for rational functions as described in [22]. The following example shows that the results F_I[X, z] and F_II[X, z] (∇_I f(X) and ∇_II f(X), respectively) can be different. In an a priori fashion, it cannot be predicted which method computes tighter bounds. Therefore, in most cases method II will be preferable due to its lower cost. If both F_I (∇_I) and F_II (∇_II) are computed, then an intersection may be used to improve the enclosures.
Example 7 Choose f(x) = (x − x²)·x with decomposition f^1 = x, f^2 = x², f^3 = f^1 − f^2, f^4 = f^3·f^1. Then, we get:

F_I[X, z]  = (1 − (X + z))·X + (z − z²),
∇_I f(X)   = (1 − 2X)·X + (X − X²),
F_II[X, z] = (z − z²) + X − X·(X + z),
∇_II f(X)  = (X − X²) + X − X·(2X).

In interval arithmetic, the law of subdistributivity is valid; here it yields F_I[X, z] ⊆ F_II[X, z] and ∇_I f(X) ⊆ ∇_II f(X). For X = [0, 1], z = 0, the enclosures are proper (i.e., they hold without the admission of the equality sign).

Choose f(x) = 2x·x² − x² with decomposition f^1 = x, f^2 = 2x, f^3 = (f^1)², f^4 = f^2·f^3, f^5 = f^4 − f^3. For this function we get:

F_I[X, z]  = 2X² + 2z·(X + z) − (X + z),
∇_I f(X)   = 2X² + (2X)·(2X) − 2X,
F_II[X, z] = (2z − 1)·(X + z) + 2X²,
∇_II f(X)  = (2X − 1)·(2X) + 2X²;

i.e., F_II[X, z] ⊆ F_I[X, z] and ∇_II f(X) ⊆ ∇_I f(X). Here, the choice X = [0, 1], z = 1 will lead to a proper enclosure.
4.3 Interval Slopes and Centered Forms

In this subsection we use the interval slopes of the previous subsection to define quadratically convergent Krawczyk forms. In the paper of Krawczyk and Neumaier ([22]), interval slopes were first used to improve centered forms of rational functions.

Definition 3 Let f be an arithmetic expression and F_I[X, z], F_II[X, z] (X ∈ IR^n) the corresponding interval slopes as computed by methods I and II, respectively. Then, the intervals

F_{I/II,z}(X) := f(z) + F_{I/II}[X, z]·(X − z)

are called Krawczyk forms for f and z ∈ X.
Since F_I and F_II are interval slopes, the following is valid for all x, z ∈ X:

f(x) − f(z) ∈ F_{I/II}[X, z]·(x − z) ⊆ F_{I/II}[X, z]·(X − z),

i.e., f(x) ∈ F_{I/II,z}(X) and thus f̄(X) ⊆ F_{I/II,z}(X) for all z ∈ X, with f̄ denoting the range of values of f. Because of (16) and (17), normally the enclosures are tighter than those of the mean-value form [2, 32]: F_M(X) = f(z) + ∇_{I/II} f(X)·(X − z). The new centered forms are quadratically convergent. This is shown by the next proposition.
Proposition 3 Let f : D → R be an arithmetic expression, X₀ ⊆ D ⊆ R^n, and f^1...f^s a decomposition of f. For X₀ ∈ IR^n, the existence of the interval slopes F_I[X₀, X₀] and F_II[X₀, X₀] is assumed. Then, for all z ∈ X ⊆ X₀, there exist constants δ_I := δ_I(f, X₀) and δ_II := δ_II(f, X₀) (independent of X) such that the following inequalities are valid for the corresponding centered forms (where w([a, b]) := b − a denotes the width of an interval):

w(F_{I,z}(X)) − w(f̄(X)) ≤ δ_I ‖w(X)‖²_∞

and

w(F_{II,z}(X)) − w(f̄(X)) ≤ δ_II ‖w(X)‖²_∞.

Proof See [8]. □

The Krawczyk forms can be used to compute lower bounds (with arbitrary accuracy) for the global minimum f* := min_{x∈X} f(x) of a function f : D → R, D ⊆ R^n, where X ⊆ D is an interval vector, i.e., X ∈ IR^n, and f is an arithmetic expression. For this purpose, X is subdivided step by step into smaller intervals. The minima for the ranges of the subregions can be bounded by use of the Krawczyk forms. The subdivision is based on the strategy of Skelboe [35], which guarantees that the number of regions does not grow exponentially as the subdivision is refined. Monotonicity tests making use of the values of partial derivatives improve the performance of the method. For a detailed outline of the algorithm, see [8]. Additionally, there is the following comparison with numerical results from [4].
Example 8 For the function

f(x_1, x_2, x_3, x_4, x_5) = f_1(x_1)·f_2(x_2)·f_3(x_3)·f_4(x_4)·f_5(x_5)

the range of values has to be computed. For

x_1 ∈ [8.7, 8.8],  x_2 ∈ [−9.4, −9.3],  x_3 ∈ [3.5, 3.6],  x_4 ∈ [−4.6, −4.5],  x_5 ∈ [−2.9, −2.8],

the algorithm computes

f(x_1, x_2, x_3, x_4, x_5) ∈ [24315.0, 24513.7] and [24345.9, 24481.5],

respectively, using either no or between one and five subdivisions. In [4], the computation of the enclosure [24054.5, 24774.2] already requires four subdivisions; a naive interval evaluation yields [22283.5, 26731.4].
5 Formula Evaluation and Rounding-Error Analysis

In this section we consider the problem of evaluating functions in floating-point arithmetic. The method of reverse computation makes it possible to perform the error analysis in an automated way; for reasons of simplicity, only absolute errors are considered here. The problem of the accurate evaluation of a segment of numerical code will now be discussed briefly. Here, accurate refers to the evaluation in the set R, rather than in the set of machine numbers.
For the arithmetic expression f : R^n → R, the value of f(x) may be approximated by the floating-point result f̃(x̃). The following two problems will be treated:

1. What is the error |f̃(x̃) − f(x)| (or which bound can be given for it), provided the computation is carried out by use of a floating-point arithmetic with precision l? The input data may be given exactly by l digits or rounded to l digits. (It is assumed that the basic floating-point operations +, −, ×, and / compute the optimal l-digit result, see [24].)

2. Which maximal input errors can be allowed for the values of x_1...x_n, and which floating-point precision is necessary in the course of the computation of f, in order to guarantee the error bound |f̃(x̃) − f(x)| ≤ ε (for fixed ε ∈ R)?

The first problem can be solved easily by use of interval arithmetic. In general, the error bounds of this computation are too pessimistic. However, the combination
of interval arithmetic and reverse computation makes it possible to estimate the rounding errors in a satisfactory way.
A first algorithm (recomputation algorithm) for the solution of the second problem (without use of the reverse computation) has been given in [34]. In the execution of the first step of that algorithm, interval enclosures of all intermediate results are computed. With these enclosures it is possible to determine the precision of the input data and of the computation in order to guarantee the estimation |f̃(x̃) − f(x)| ≤ ε. Since the algorithm in [34] does not use methods of automatic differentiation (in [34], they are believed to be too inefficient), the errors of the individual intermediate results cannot be computed separately. Therefore, the algorithm must compute with precisions that in general are unnecessarily high. Here, a method is presented which makes it possible to express the total rounding error in terms of the input errors and the errors committed during the floating-point computation. The cost of the method is of the same order as the function evaluation itself.
5.1 Evaluation of Expressions and Reverse Computation

In this section, the following notations will be used: by x̃_k := rd_{p_k}(x_k) we denote the result of rounding x_k ∈ R to the set S_{p_k} of floating-point numbers of precision p_k; f̃^k = f̃^l ∘_{p_k} f̃^r and f̃^k = sf_{p_k}(f̃^l) denote the floating-point operations of precision p_k, where it is assumed that f̃^l, f̃^r ∈ S_{p_k}, i.e., no additional rounding of the input operands f̃^l and f̃^r is necessary. Similarly, F̃^k denotes the evaluation of F^k in machine interval-arithmetic.
The next proposition summarizes some well-known results for the propagation of absolute errors.
Proposition 4 Let x̃_i be an approximation of x_i and u_i = x̃_i − x_i (i = 1, 2). Then, for the propagated absolute errors u_∘ := (x̃_1 ∘ x̃_2) − (x_1 ∘ x_2) and u_sf := sf(x̃_1) − sf(x_1), the following is valid:

u_± = u_1 ± u_2   for ∘ = ±,
u_× = x̃_2 u_1 + x_1 u_2   for ∘ = ×,
u_/ = (u_1 − (x̃_1/x̃_2) u_2)/x_2   for ∘ = /,
u_sf = sf'(ξ) u_1   for a ξ ∈ ⟨x_1, x̃_1⟩,

with differentiable standard functions sf and ξ in the domain of sf'.
Proof E.g., division:

x̃_1/x̃_2 − x_1/x_2 = (x̃_1 x_2 − x_1 x̃_2)/(x̃_2 x_2) = ((x̃_1 − x_1) − (x̃_1/x̃_2)(x̃_2 − x_2))/x_2 = (u_1 − (x̃_1/x̃_2) u_2)/x_2.

The other cases are proved analogously. □
For the arithmetic expression f(x) and its decomposition f^1...f^s, the (local) rounding error z_k is given by:

z_k := x̃_k − x_k   for k = 1...n,
z_k := c̃_{k−n} − c_{k−n}   for k = n+1...m,
z_k := f̃^l ∘_{p_k} f̃^r − f̃^l ∘ f̃^r   for k > m and ∘ ∈ {+, −, ×, /},
z_k := sf_{p_k}(f̃^l) − sf(f̃^l)   for a standard function.
Thus, using Proposition 4, the total absolute error Δ^k := f̃^k(x̃) − f^k(x) is given by:

Δ^k = z_k   for k = 1...m,
Δ^k = Δ^l ± Δ^r + z_k,   if f^k = f^l ± f^r,
Δ^k = f̃^r Δ^l + f^l Δ^r + z_k,   if f^k = f^l × f^r,
Δ^k = (Δ^l − f̃^k Δ^r)/f^r + z_k,   if f^k = f^l/f^r,
Δ^k = sf'(ξ) Δ^l + z_k,   if f^k = sf(f^l)   (ξ ∈ ⟨f^l, f̃^l⟩).

Because of Corollary 1, the absolute error Δ = f̃(x̃) − f(x) can be written as Δ = Δ^s = Σ_{j=1}^{s} d^j z_j, where the values d^j can be computed analogously to the case of Algorithm 1.
However, some of the values are not known exactly, e.g., f^l in the case of f^k = f^l × f^r. If they are replaced by the corresponding approximations (e.g., f^l by f̃^l), then only approximations of the d^j can be calculated. By use of interval arithmetic, it is possible to compute guaranteed enclosures D^j and D̃^j of the values d^j. Choose enclosures X̃_i and C̃_i such that x_i, x̃_i ∈ X̃_i (i = 1...n) and c_i, c̃_i ∈ C̃_i (i = 1...m−n). The machine interval-arithmetic evaluations of F̃^1...F̃^s are assumed to exist for the computation with precision P.
Then, with the initializations D^s := 1, D^k := 0 (k < s), and

D^l := D^l + D^k,  D^r := D^r ± D^k,   if f^k = f^l ± f^r,
D^l := D^l + F̃^r × D^k,  D^r := D^r + F̃^l × D^k,   if f^k = f^l × f^r,
D^l := D^l + D^k/F̃^r,  D^r := D^r − F̃^k × D^k/F̃^r,   if f^k = f^l/f^r,
D^l := D^l + sf'(F̃^l) × D^k,   if f^k = sf(f^l)

(for k := s, s−1, ..., m+1), there holds:

Δ ∈ Σ_{j=1}^{s} D^j z_j.   (18)

The values D̃^j are computed by evaluating the preceding formulas making use of machine interval-arithmetic with precision P. By means of (18), it is possible to estimate separately the contribution of the error propagation (of input errors) and that of the rounding errors. Provided all operations are carried out exactly, i.e., z_k = 0 for k > m, then the total error is caused only by the propagation of input errors; they may be a consequence of imprecise or rounded data.
For the propagated error Δ_p := Σ_{j=1}^{m} d^j z_j, the following is valid because of (18):

Δ_p ∈ Σ_{j=1}^{n} D^j (x̃_j − x_j) + Σ_{j=n+1}^{m} D^j (c̃_{j−n} − c_{j−n}),

and thus

|Δ_p| ≤ Σ_{j=1}^{n} |D^j| |x̃_j − x_j| + Σ_{j=n+1}^{m} |D^j| |c̃_{j−n} − c_{j−n}|.   (19)
which, however, represents only an approximation. In contrast to this, the estimation (19) is guaranteed and it can be calculated by use of a computer with a cost that is proportional to the cost of an evaluation of f. Thus, the cost of a sensitivity analysis based on (19) or (20) is of the same order as the cost of the evaluation of the function itself. Additionally, the contribution of the rounding errors (and their propagation in the course of the computation) can be estimated by use of (18). Provided all input data
are exact (i.e., z_j = 0 for j = 1...m), then the rounding error is Δ_e := Σ_{j=m+1}^{s} d^j z_j and, by use of (18),

Δ_e ∈ Σ_{j=m+1}^{s} D^j z_j.   (21)
Provided the operation j is exact (e.g., in the case that terms cancel rigorously in a subtraction), then z_j = 0 and D^j z_j does not influence the value of the sum in (21). In general, the exact value of z_j in (21) will not be known. Since we use an optimal arithmetic and z_j is the local rounding error in step f̃^j, the absolute value of z_j can be bounded by

|z_j| ≤ |F̃^j| B^{1−p_j},   (22)

where B denotes the base of the floating-point system. For the local precision p_j, the inequality p_j ≥ P must be valid; otherwise, it cannot be guaranteed that the value (computed with precision p_j) is in the corresponding interval F̃^j. Thus, the absolute value of the computational error Δ_e can be estimated by use of (21) and (22):

|Δ_e| ≤ Σ_{j=m+1}^{s} |D^j| |F̃^j| B^{1−p_j}.   (23)

Now, using (19) and (23), the absolute value of the total error Δ = Δ_p + Δ_e is bounded by

|Δ| = |f̃(x̃) − f(x)| ≤ Σ_{j=1}^{n} |D^j| |x̃_j − x_j| + Σ_{j=n+1}^{m} |D^j| |c̃_{j−n} − c_{j−n}| + Σ_{j=m+1}^{s} |D^j| |F̃^j| B^{1−p_j}.   (24)

To get a guaranteed upper bound when (24) is calculated in floating-point arithmetic, the values D^j must be substituted by D̃^j, and the sums and products have to be computed by use of directed operations. The inequality (24) can be interpreted in the following way: the absolute value of D^j shows the maximum amplification of the error in step j; thus, the values D^j are the condition numbers for the total error with respect to the errors of all intermediate results.
If one is interested in these values only, or in an estimation of the form

|Δ| ≤ Σ_{j=1}^{s} d^j_δ |z_j|,   (25)

then the values d^j_δ may be computed directly by:

1. d^s_δ := 1, d^k_δ := 0 for k < s,

2. For k := s, s−1, ..., m+1:

d^l_δ := d^l_δ + d^k_δ,  d^r_δ := d^r_δ + d^k_δ,   if f^k = f^l ± f^r,   (26)
d^l_δ := d^l_δ + |F̃^r| × d^k_δ,  d^r_δ := d^r_δ + |F̃^l| × d^k_δ,   if f^k = f^l × f^r,   (27)
d^l_δ := d^l_δ + d^k_δ/⟨F̃^r⟩,  d^r_δ := d^r_δ + |F̃^k| × d^k_δ/⟨F̃^r⟩,   if f^k = f^l/f^r,   (28)
d^l_δ := d^l_δ + |sf'(F̃^l)| × d^k_δ,   if f^k = sf(f^l),   (29)

where |·| and ⟨·⟩ denote the maximal and the minimal absolute value of an interval, respectively. Thus, the need for interval arithmetic is limited here to the computation of the values F̃^1...F̃^s. The operations of steps (26) to (29) are carried out in a directed-rounding mode in order to guarantee the inequality (25).
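The direct recurrence for the condition values can be sketched as follows (illustrative Python; plain floats stand in for the interval enclosures |F̃^j|, and the example function is ours):

```python
# Direct d_delta recurrence, rules (26)-(27), for f = (x1*x2) + x3.
# In the chapter, directed roundings and interval enclosures make the
# resulting bound rigorous; here the point values merely illustrate the flow.

x = [3.0, -2.0, 5.0]
f4 = x[0] * x[1]                 # f^4 = f^1 * f^2
f5 = f4 + x[2]                   # f^5 = f^4 + f^3
absf = [abs(v) for v in x] + [abs(f4), abs(f5)]

d = [0.0] * 5
d[4] = 1.0                       # d_delta^s := 1
# f^5 = f^4 + f^3  ->  (26): pass d^k through unchanged
d[3] += d[4]; d[2] += d[4]
# f^4 = f^1 * f^2  ->  (27): amplification by |f^2| and |f^1|
d[0] += absf[1] * d[3]; d[1] += absf[0] * d[3]

# The input condition numbers equal |df/dx_i| here: (|x2|, |x1|, 1).
print(d[:3])   # [2.0, 3.0, 1.0]
```

Each d^j_δ bounds how strongly an error committed at step j can appear in the final result, which is exactly the condition-number reading of (24).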
5.2 Algorithms for Formula Evaluation

The above methods for rounding-error analysis are now used to develop algorithms for the evaluation of expressions in floating-point arithmetic. First, a procedure for the solution of Problem 1 (see the beginning of this section) will be given.

Algorithm 4 (Formula evaluation I (rounding-error estimation))

{ in: f : R^n → R, x ∈ R^n and decomposition f^1...f^s of f;
  out: Approximation f̃(x̃) (computed with floating-point arithmetic of precision l) and error bound ε > 0 with |f̃(x̃) − f(x)| ≤ ε; }

1. {Precomputation (in interval arithmetic); P ≤ l}
Set F̃^k := rd_P(x_k) for k = 1...n, F̃^k := rd_P(c_{k−n}) for k = n+1...m, and compute F̃^{m+1}...F̃^s in interval arithmetic of precision P.
{In particular, the values f̃(x̃) and f(x) are located in F̃^s, i.e., |f̃(x̃) − f(x)| ≤ w(F̃^s).}
2. {Reverse step} Let d^s_δ := 1 and d^j_δ := 0 (j < s), and compute the values d^{s−1}_δ, ..., d^1_δ according to (26)–(29).

3. {Floating-point evaluation} Evaluate f̃^1...f̃^s in floating-point arithmetic of precision l.

4. {Error estimation} Compute the error bound

ε' := Σ_{j=1}^{n} d^j_δ · |x̃_j − x_j| + Σ_{j=n+1}^{m} d^j_δ · |c̃_{j−n} − c_{j−n}| + Σ_{j=m+1}^{s} d^j_δ · |F̃^j| · B^{1−l}.

5. {Output} f̃(x̃) := f̃^s, ε := min(ε', w(F̃^s)).
Remark 7 If, in step 1 of the above algorithm, there is an inadmissible interval operation, then the step is repeated by use of a higher precision l ≥ P' > P. If even the precision P' = l is not sufficient, then the algorithm can be applied recursively with respect to the critical element: the interval F̃_crit = [f̃_crit − ε, f̃_crit + ε] is an enclosure for f_crit. If F̃_crit still causes problems, then the algorithm cannot evaluate the formula with precision l; it is then necessary to use a multiple-precision arithmetic.
The following example demonstrates the employment of the algorithm.
Example 9 The system of linear equations Ay = b is solved by use of an LU-decomposition (without pivoting) of A. The 'formula' f for the evaluation is the procedure for the first component y_1 of the result, which is computed by use of a program for the LU-decomposition supplemented by a forward-backward substitution; i.e., f = f(A, b) = y_1(A, b) is a function of n² + n variables.

Let A be the Boothroyd-Dekker matrix of dimension n = 7. The components of the right-hand side are all equated to 1. A program for Algorithm 4 (written in PASCAL-XSC and using a decimal arithmetic with 13 digits) computes f̃ = 1.000000003111 and the error bound e = 1.095322002366E-06; i.e., the interval [0.9999989077889, 1.000001098434] is a guaranteed enclosure for the result f = y_1 = 1. The naive interval computation in step 1 of the algorithm yields only the rough enclosure [0.9463683520819, 1.053631643420]. Of course, it is more favorable to compute all components of the result by means of a specialized algorithm for the verified solution of linear systems (see e.g. [25]).
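The naive precomputation used in step 1 can be imitated in a few lines. The following Python sketch is an illustration only, not the authors' PASCAL-XSC program: the Interval class and the one-ulp outward rounding via math.nextafter are simplifying assumptions standing in for true directed rounding. The final interval then encloses both the exact value and the floating-point evaluation of the decomposition, which is exactly the role of ρ^s in Algorithm 4.

```python
# Minimal sketch (assumed names, not the authors' code): naive interval
# precomputation for the decomposition of f(x1, x2) = x1*x2 - x2*x2.
import math

def widen(lo, hi):
    # crude outward rounding: step outward by one floating-point number
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

class Interval:
    def __init__(self, lo, hi=None):
        self.lo, self.hi = (lo, lo if hi is None else hi)
    def __add__(self, o):
        return Interval(*widen(self.lo + o.lo, self.hi + o.hi))
    def __sub__(self, o):
        return Interval(*widen(self.lo - o.hi, self.hi - o.lo))
    def __mul__(self, o):
        p = [self.lo*o.lo, self.lo*o.hi, self.hi*o.lo, self.hi*o.hi]
        return Interval(*widen(min(p), max(p)))
    def width(self):
        return self.hi - self.lo

# single-operation decomposition: rho^1, rho^2 are the rounded inputs
x1, x2 = Interval(3.0), Interval(1.0 / 3.0)
r3 = x1 * x2          # rho^3
r4 = x2 * x2          # rho^4
r5 = r3 - r4          # rho^s: encloses f(x~) together with its FP evaluation
print(r5.lo, r5.hi, r5.width())
```

For this well-conditioned expression the enclosure is a tight interval around 8/9; for cancellation-prone formulas the width w(ρ^s) would grow correspondingly, signalling that a higher precision P is needed.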
H.-C. Fischer
136
Problem 2 (see the beginning of the section) can also be solved by use of the methods developed so far. The propagated error and the rounding error can be estimated by use of (20) and (23). In practice, the precision of the computation will be chosen in such a way that the accumulated rounding error is of the same order of magnitude as the propagated error caused by uncertainties of the input data. Therefore |f̃(x̃) - f(x)| ≤ e is valid under the assumptions: (a) the choice of the precision p_j for each step results in an error d^j · |ρ^j| · B^{1-p_j} ≤ e/(2(s-m)), and (b) the input errors |x̃_j - x_j|, |c̃_{j-n} - c_{j-n}| in (25) satisfy the bound Σ_{j=1}^{n} d^j · |x̃_j - x_j| + Σ_{j=n+1}^{m} d^j · |c̃_{j-n} - c_{j-n}| ≤ e/2. Then the propagated error and the rounding error are approximately equal in magnitude. The procedure can be summarized by the following algorithm:
Algorithm 5 (Formula evaluation II (predefined error bound))
{ in: f : R^n → R, x ∈ R^n and decomposition f^1 ... f^s of f, 0 < e ∈ R;
out: precisions (p_k), k = 1...s, and approximation f̃(x̃) (computed by floating-point arithmetic of precision p_k in step k) which satisfies the error bound |f̃(x̃) - f(x)| ≤ e }

1. { Precomputation (in interval arithmetic) } Set ρ^k := ◊(x̃_k) for k = 1...n, ρ^k := ◊(c̃_{k-n}) for k = n+1...m, and compute ρ^{m+1} ... ρ^s by use of machine interval arithmetic with precision P. { In principle, the choice of P > 0 is arbitrary; however, if there is an undefined interval operation, this step must be repeated with an increased precision P' > P. } If w(ρ^s) ≤ e is valid, then the procedure has been completed and each computed value from ρ^s satisfies the required error bound.
2. { Reverse step } Set d^s := 1 and d^j := 0 (j < s), and compute the values d^{s-1} ... d^1 according to (26)-(29).

3. { Determination of precisions } For k = 1...m, the precisions p_k are chosen in such a way that the following is valid:
For k = m+1...s, the precisions p_k are chosen in such a way that the following is valid:

Set p_k := max(p_k, P) for k = 1...s. { The minimum precision P is necessary in order to validate the estimations as computed by use of machine interval arithmetic with precision P. }
4. { Floating-point computation } Compute (by use of the precisions p_k of the previous step) the floating-point approximations f̃^1 ... f̃^s.

5. { Output } f̃(x̃) := f̃^s, a vector of precisions (p_k). }
In step 3 of the preceding algorithm, all precisions are chosen in such a way that the error contributions (weighted by d^k · |ρ^k|) satisfy identical bounds. In cases where the weights d^k · |ρ^k| differ over a wide range, other error-distribution strategies may lead to better results. In practice, however, the availability of different hardware or software arithmetics has a significant influence on the optimal choice, too. The following example shows Algorithm 5 at work.
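The precision-determination step can be illustrated with a small sketch. The equal error share e/(2(s-m)) per operation, the decimal base B = 10, and the function name choose_precisions are assumptions made for this illustration; the authors' exact step-3 formulas are not reproduced in the text above.

```python
# Illustrative sketch (assumed bound-splitting): choose a decimal precision
# p_k for each operation so that its weighted rounding-error contribution
# d_k * |rho_k| * 10**(1 - p_k) stays below an equal share of the budget e.
import math

def choose_precisions(weights, e):
    """weights[k] = d_k * |rho_k| for the s - m operation steps."""
    share = e / (2 * len(weights))          # equal error share per step
    precisions = []
    for w in weights:
        if w == 0.0:
            precisions.append(1)
            continue
        # smallest integer p with w * 10**(1 - p) <= share
        p = math.ceil(1 + math.log10(w / share))
        precisions.append(max(p, 1))
    return precisions

# weights d^j * |rho^j| taken from Example 11 (operations f^6 .. f^11)
w = [0.000167, 0.00517, 0.00517, 0.106, 0.106, 1.11]
print(choose_precisions(w, 1e-5))
```

Operations with large weights receive more digits; a strategy with non-uniform shares would shift digits from the cheap early operations to the expensive final ones.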
Example 10 The function f(a, b) := 333.75b^6 + a²(11a²b² - b^6 - 121b^4 - 2) + 5.5b^8 + a/(2b) from [13] is to be evaluated for a = 77617.0 and b = 33096.0. Algorithm 5 was implemented in PASCAL-XSC. In the first two steps of the algorithm, the interval computations are carried out by use of a decimal arithmetic of 13 digits. The input data a, b can be represented without input errors.

If e is chosen in such a way that the relative error of the result is less than one percent, then the program computes precisions for the floating-point operations which are in a range from 13 to 42. Indeed, for precisions of less than 37 decimal digits (at critical places), not even the sign of the floating-point result is correct. For at least 37 digits, the accuracy increases suddenly, and the result differs from the solution -8.2739...E-1 only in the last place. Therefore, the result of the algorithm (i.e., a minimum precision of 42 instead of 37 digits) is a meaningful result, especially if one considers the condition of the problem: the absolute values of the partial derivatives ∂f/∂a, ∂f/∂b are both of the order 10^32.

For the input a = 3.333333333333E-1 and b = 4.444444444444E-1 (i.e., f = 2.2348...), the problem is not difficult: |∂f/∂a| ≈ 3, |∂f/∂b| ≈ 30. For an e-bound corresponding to a small relative error of the result, the computed precisions range from 13 to 15. For these values of a and b, the floating-point
evaluation by use of 13 digits already has a relative error of less than the required bound; even the bounds of the interval evaluation of f differ by less than 5 units in the last place. In this case too, the results of the algorithm (a minimum precision of 15 digits) are meaningful since, in an a priori fashion, it is not clear whether the problem is 'trivial' or not. The algorithm automatically chooses the precision in the appropriate way. The cost for this guarantee consists at most in the computation of some extra digits.

If, in the first step of the above algorithm, we set ρ^k := ◊(X_k), X_k ∈ IR for k = 1...n, then the computed precisions (step 3) are even valid for all x_k ∈ X_k. The following example shows how this can be used to solve a 'collection' of problems.
Example 11 For the reduced argument x ∈ [0, 0.1], the exponential function is approximated by the Taylor polynomial T_N = Σ_{i=0}^{N} c_i x^i, c_i := 1/i!, of degree N = 3. What accuracy for x and the coefficients c_i is necessary, and which precision for the computation is required, in order to get a computational error of less than 5·10^{-6}? Since the truncation error |e^x - T_N| is less than 5·10^{-6} for the interval under consideration, the following is valid for the total error: |e^x - T̃_N(x̃)| ≤ 5·10^{-6} + 5·10^{-6} = 10^{-5} for x, x̃ ∈ [0, 0.1]. The evaluation may be carried out as follows: f^1 = x, f^2 = c_0, f^3 = c_1, f^4 = c_2, f^5 = c_3, f^6 = f^5 × f^1, f^7 = f^6 + f^4, f^8 = f^7 × f^1, f^9 = f^8 + f^3, f^10 = f^9 × f^1, f^11 = f^10 + f^2.

In the above algorithm, we set ρ^1 := [0, 0.1], ρ^2 := ρ^3 := [1, 1], ρ^4 := [0.5, 0.5] and ρ^5 := [0.166, 0.167]; this leads to the following values (using 3-digit interval computation): ρ^6 = [0, 0.0167], ρ^7 = [0.5, 0.517], ρ^8 = [0, 0.0517], ρ^9 = [1, 1.06], ρ^10 = [0, 0.106], ρ^11 = [1, 1.11] and, thus, d^11 = d^10 = 1, d^9 = d^8 = 0.1, d^7 = d^6 = 0.01, d^5 = 0.001, d^4 = 0.01, d^3 = 0.1, d^2 = 1, d^1 = 1.13. From this, there follow the inequalities 10^{-p} · 2.7 ≤ 5·10^{-6} and p ≥ 6.5. Thus, it has been proved for c̃_0 = c̃_1 = 1, c̃_2 = 0.5, c̃_3 = 0.1666667, and a floating-point evaluation with 7 digits, that the computation satisfies the required error bound for all arguments from [0, 0.1] (rounded to 7 digits).
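The numbers of Example 11 can be checked mechanically. The sketch below uses ordinary Python floats in place of the 3-digit interval bounds (an assumption for illustration), so the reverse-mode weight of x comes out near 1.1 rather than exactly the outward-rounded 1.13; the variable names are ad hoc.

```python
# Forward values of the decomposition f^6 .. f^11 of the Taylor polynomial
# for exp on [0, 0.1], followed by the reverse step that yields the weights
# d^j. Floats stand in for the upper interval bounds of the text.
x_hi = 0.1                               # upper bound of rho^1 = [0, 0.1]
c0, c1, c2, c3 = 1.0, 1.0, 0.5, 0.167    # upper bounds of rho^2 .. rho^5

f6 = c3 * x_hi      # rho^6  ~ 0.0167
f7 = f6 + c2        # rho^7  ~ 0.517
f8 = f7 * x_hi      # rho^8  ~ 0.0517
f9 = f8 + c1        # rho^9  ~ 1.06
f10 = f9 * x_hi     # rho^10 ~ 0.106
f11 = f10 + c0      # rho^11 ~ 1.11

# reverse step: d^11 = 1, then propagate through the + and * nodes
d11 = d10 = 1.0
d9 = d8 = d10 * x_hi            # 0.1
d7 = d6 = d8 * x_hi             # 0.01
d5 = d6 * x_hi                  # 0.001
d4, d3, d2 = d7, d9, d11        # addition nodes copy the weight
d1 = d6 * c3 + d8 * f7 + d10 * f9   # weight of the variable x, about 1.1
print(f11, d1)
```

Each weight d^j measures how strongly a perturbation in step j moves the final value f^11, which is exactly the quantity the precision choice of Algorithm 5 is based on.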
5.3 Accurate Evaluation of Segments of Code
The applicability of the preceding algorithms for the purpose of an evaluation of segments of code has already been shown by means of the example making use of an LU-decomposition. In that case, a program for the solution of a linear system was used as the 'formula' to be evaluated. The LU-decomposition was carried out without pivoting since, otherwise, comparisons and conditional statements would be necessary. The comparison of intermediate values as computed by use of floating-point arithmetic must lead to the same logical result as the comparison of the exact values. Therefore, interval enclosures for the operands of a comparison are computed in the course of the precomputation. If the logical result of the interval comparison is unique, then the computation can be continued without problems. Otherwise, the enclosures of the operands have to be improved until a decision is possible. This improvement can be carried out either by means of an increase of the precision of the interval computation or by an application of Algorithm 5 for an evaluation of the operands. By means of this procedure, all comparisons which are meaningful for floating-point computation can be brought to a decision. E.g., the following statement (which seems to be trivial) is not meaningful in this sense: IF (1/3) × 3 = 1 THEN ... The logical result (TRUE or FALSE) of the IF-condition cannot be computed (without further transformations) by use of a floating-point arithmetic, provided the machine base is not divisible by 3 (e.g., 2, 10 or 16). Thus, the problem of evaluating segments of a code can be solved by use of Algorithm 5 (and the preceding extensions for the purpose of conditional statements). For algorithms with iterations, it may be difficult to store all intermediate results. Here, Algorithm 5 must be applied to the main part of the iteration only; alternatively, the possibilities for the reduction of the spatial complexity of the reverse computation (see Section 3) must be used.
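The comparison strategy described above amounts to a three-valued test. The following sketch is a hedged illustration (the function name and the tuple representation of enclosures are assumptions): a comparison is accepted only when the enclosures do not overlap, and otherwise the caller must refine them first.

```python
# Three-valued interval comparison: True / False when decidable,
# None when the enclosures overlap and must first be refined.
def decide_less(a, b):
    """a, b are (lo, hi) enclosures of the two operands."""
    if a[1] < b[0]:
        return True
    if b[1] <= a[0]:
        return False
    return None   # undecidable: increase precision or tighten enclosures

print(decide_less((1.0, 1.2), (1.3, 1.4)))    # decidable: True
print(decide_less((1.0, 1.35), (1.3, 1.4)))   # overlapping: None
```

A None result corresponds exactly to the situation in the text where either the interval precision is increased or Algorithm 5 is applied to the operands.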
A Implementation of a Differentiation Arithmetic
In the following, it is shown how a differentiation arithmetic can be implemented in the programming language PASCAL-XSC [19], a PASCAL extension for Scientific Computation. The computation is done in interval arithmetic; this makes it possible to compute bounds for derivatives. The type interval for intervals is a standard data type; the corresponding interval operations are defined in the module i_ari. The pairs of the differentiation arithmetic (for the first derivative of a function f : R → R) are represented by the type df_type. The function is f(x) := 25(x-1)/(x²+1) from Example 1 with the point interval X = [2, 2]. (References to FORTRAN implementations can be found, e.g., in [11] and [30].)

program ex_1 (input, output);
use i_ari;

type df_type = record f, df: interval; end;

operator + (u, v: df_type) res: df_type;
begin res.f := u.f + v.f; res.df := u.df + v.df; end;

operator - (u, v: df_type) res: df_type;
begin res.f := u.f - v.f; res.df := u.df - v.df; end;

operator * (u, v: df_type) res: df_type;
begin res.f := u.f*v.f; res.df := u.df*v.f + u.f*v.df; end;

operator / (u, v: df_type) res: df_type;
var h: interval;
begin h := u.f/v.f; res.f := h; res.df := (u.df - h*v.df)/v.f; end;

operator + (u: df_type; v: integer) res: df_type;
begin res.f := u.f + v; res.df := u.df; end;

operator - (u: df_type; v: integer) res: df_type;
begin res.f := u.f - v; res.df := u.df; end;

operator * (u: integer; v: df_type) res: df_type;
begin res.f := u*v.f; res.df := u*v.df; end;

(*** further operators for mixed types ***)

function sqr (u: df_type): df_type;
begin sqr.f := sqr(u.f); sqr.df := 2.0*u.f*u.df; end;

(*** further standard functions ***)

function df_var (h: interval): df_type;   (* definition of the independent variable *)
begin df_var.f := h; df_var.df := 1.0; end;

var x, f: df_type;
    h: interval;

begin
  h := 2.0;
  x := df_var(h);
  f := 25*(x-1)/(sqr(x)+1);   (* Example 1 *)
  writeln('f, df: ', f.f, f.df);
end.
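A reader without PASCAL-XSC can mirror the df_type pair arithmetic in any language with operator overloading. The Python sketch below uses plain floats in place of the intervals, so it produces approximations rather than verified enclosures; the class and helper names are ad hoc.

```python
# Forward-mode pair arithmetic (value, first derivative) with floats.
class DF:
    def __init__(self, f, df=0.0):
        self.f, self.df = f, df
    def __add__(s, o):
        o = o if isinstance(o, DF) else DF(float(o))
        return DF(s.f + o.f, s.df + o.df)
    def __sub__(s, o):
        o = o if isinstance(o, DF) else DF(float(o))
        return DF(s.f - o.f, s.df - o.df)
    def __rmul__(s, c):                  # constant * DF
        return DF(c * s.f, c * s.df)
    def __mul__(s, o):                   # product rule
        return DF(s.f * o.f, s.df * o.f + s.f * o.df)
    def __truediv__(s, o):               # quotient rule, as in df_type
        h = s.f / o.f
        return DF(h, (s.df - h * o.df) / o.f)

def sqr(u):
    return DF(u.f * u.f, 2.0 * u.f * u.df)

def var(x0):                             # independent variable: (x0, 1)
    return DF(x0, 1.0)

x = var(2.0)
y = 25 * (x - 1) / (sqr(x) + 1)          # Example 1
print(y.f, y.df)                         # 5.0 1.0
```

For Example 1 the pair arithmetic yields f(2) = 5 and f'(2) = 1 in a single forward sweep; replacing the floats by an interval type recovers the verified bounds of the PASCAL-XSC program.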
References

[1] Alefeld, G.: Bounding the Slope of Polynomial Operators and some Applications, Computing 26, pp. 227-237, 1981
[2] Alefeld, G., Herzberger, J.: Introduction to Interval Computation, Academic Press, New York, 1983
[3] ANSI/IEEE Standard 754-1985, Standard for Binary Floating-Point Arithmetic, New York, 1985
[4] Asaithambi, N.S., Shen, Z., Moore, R.E.: On Computing the Range of Values, Computing 28, pp. 225-237, 1982
[5] Beda, L.M. et al.: Programs for automatic differentiation for the machine BESM, Inst. Precise Mechanics and Computation Techniques, Academy of Science, Moscow, 1959
[6] Bauer, F.L.: Computational Graphs and Rounding Error, SIAM J. Numer. Anal., Vol. 11, No. 1, pp. 87-96, 1974
[7] Baur, W., Strassen, V.: The Complexity of Partial Derivatives, Theoretical Computer Science 22, pp. 317-330, 1983
[8] Fischer, H.-C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen, Dissertation, Universität Karlsruhe, 1990
[9] Fischer, H.-C., Haggenmüller, R. and Schumacher, G.: Evaluation of Arithmetic Expressions with Guaranteed High Accuracy, Computing Suppl. 6, Springer, pp. 149-158, 1988
[10] Gries, D.: The Science of Programming (Chapter 21: Inverting Programs), Springer, 1981
[11] Griewank, A.: On Automatic Differentiation, in Mathematical Programming: Recent Developments and Applications, eds. M. Iri and K. Tanabe, Kluwer Academic Publishers, pp. 83-103, 1989
[12] Griewank, A.: Achieving Logarithmic Growth of Temporal and Spatial Complexity in Reverse Automatic Differentiation, Preprint MCS-P228-0491, Argonne National Laboratory, 1991
[13] IBM High-Accuracy Arithmetic Subroutine Library (ACRITH), Program Description and User's Guide, SC 33-6164-02, 1986
[14] IBM High-Accuracy Arithmetic - Extended Scientific Computation, Reference, SC 33-6462-00, 1990
[15] Iri, M.: Simultaneous Computation of Functions, Partial Derivatives and Estimates of Rounding Errors - Complexity and Practicality -, Japan J. Appl. Math. 1, pp. 223-252, 1984
[16] Kaucher, E., Miranker, W.L.: Self-validating Numerics for Function Space Problems, Academic Press, New York, 1984
[17] Kedem, G.: Automatic Differentiation of Computer Programs, ACM TOMS, Vol. 6, No. 2, pp. 150-165, 1980
[18] Kelch, R.: Numerical Quadrature by Extrapolation with Automatic Result Verification, this volume
[19] Klatte, R. et al.: PASCAL-XSC, Sprachbeschreibung mit Beispielen, Springer, 1991
[20] Klein, W.: Enclosure Methods for Linear and Nonlinear Systems of Fredholm Integral Equations of the Second Kind, this volume
[21] Krawczyk, R.: Intervallsteigungen für rationale Funktionen und zugeordnete zentrische Formen, Freiburger Intervall-Berichte 83/2, pp. 1-30, 1983
[22] Krawczyk, R., Neumaier, A.: Interval Slopes for Rational Functions and Associated Centered Forms, SIAM J. Numer. Anal., Vol. 22, No. 3, pp. 604-616, 1985
[23] Kulisch, U.: Grundlagen des Numerischen Rechnens, Bibliographisches Institut, Mannheim, 1976
[24] Kulisch, U., Miranker, W.L.: Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981
[25] Kulisch, U., Miranker, W.L. (eds.): A New Approach to Scientific Computation, Academic Press, New York, 1983
[26] MACSYMA, Reference Manual, Symbolics Inc., Cambridge (Massachusetts, USA)
[27] MAPLE, Reference Manual, Symbolic Computation Group, University of Waterloo (Ontario, Canada), 1988
[28] Matijasevich, Y.V.: A posteriori interval analysis, Proceedings of EUROCAL 85, Springer Lecture Notes in Computer Science 204/2, pp. 328-334, 1985
[29] Moore, R.E.: Interval Analysis, Prentice-Hall, Englewood Cliffs, 1966
[30] Rall, L.B.: Automatic Differentiation, Springer Lecture Notes in Computer Science 120, 1981
[31] Rall, L.B.: Differentiation in PASCAL-SC: Type GRADIENT, ACM TOMS 10, pp. 161-184, 1984
[32] Ratschek, H., Rokne, J.: Computer Methods for the Range of Functions, Ellis Horwood, Chichester, 1984
[33] REDUCE, User's Manual V3.3, Rand Corporation, Santa Monica, 1987
[34] Richman, P.L.: Automatic Error Analysis for Determining Precision, Com. of the ACM, Vol. 15, No. 9, pp. 813-817, 1972
[35] Skelboe, S.: Computation of Rational Interval Functions, BIT 14, pp. 87-95, 1974
[36] Wengert, R.E.: A simple automatic derivative evaluation program, Com. ACM 7, pp. 463-464, 1964
Numerical Quadrature by Extrapolation with Automatic Result Verification

Rainer Kelch
In this paper, we will derive an adaptive algorithm for a verified computation of an enclosure of the values of definite integrals in tight bounds with automatic error control. The integral is considered as the sum of an approximation term and a remainder term. Starting from a Romberg extrapolation, the recursive computation of the T-table elements is replaced by one direct evaluation of an accurate scalar product. The remainder term is verified numerically via automatic differentiation algorithms. Concerning interval arithmetic, the disadvantages of the Bulirsch sequence are overcome by introducing a so-called decimal sequence. By choosing different stepsize sequences, we are able to generate a table of coefficients of remainder terms. Via this table, and depending on the required accuracy, a fast search algorithm determines that method which involves the least computational effort. A local adaptive refinement makes it possible to reduce the global error efficiently to the required size, since an additional computation is carried out only where necessary. In comparison to alternative enclosure algorithms, theoretical considerations and numerical results demonstrate the advantages of the new method. This algorithm provides guaranteed intervals with tight bounds even at points where the approximation method delivers numbers with an incorrect sign. An outlook on the application of this method to multi-dimensional problems is given.
1 Introduction

1.1 Motivation
Integrals appear frequently in scientific and engineering problems. Often they provide the input parameters in a chain of complex algorithms and therefore have a decisive influence on the quality of the final result. With approximate quadrature procedures, it is not possible to come up with a verified statement about the error tolerance. Indeed, there exist asymptotic enclosure methods (see e.g. [8, 35, 37]), but they cannot lead to a guaranteed solution (see Section 6).
For that reason, we turn to enclosure methods with absolute, exact, and safe error bounds. In [11, 12], methods for integral enclosures are derived employing Newton-Cotes formulae, a Gauss quadrature, and a Taylor-series ansatz. In [10, 42], the
problem of an optimal quadrature procedure is discussed. After a narrowing of the choices, the Gauss quadrature and the Romberg quadrature remain. Because of the necessary computation of the irrational nodes for the Gauss quadrature, this comparison indicates the Romberg integration as preferable and more elegant. The attempts to transfer these methods to interval arithmetic and the recursive computation of the elements of the T-table turn out to be detrimental. The observed growth of the intervals can accumulate quickly and destroy the advantages of the accelerated convergence. Under this aspect, it seems to be preferable to employ the Gauss quadrature or some other direct method. Through the idea of transforming the recursive computation of the T-table elements into a direct computation of one scalar product (see Section 4), the disadvantage of the computation of the Romberg extrapolation can be neutralized. Even in [4], Bauer pointed out this possibility; however, he preferred the recursive one because of its higher effectiveness. This assertion is not valid any more in view of the Kulisch-Miranker arithmetic, which admits a controlled calculation of verified enclosures of solutions by use of a computer, an interval arithmetic, and an extension of a programming language (see [2, 5, 19, 20, 24, 25, 26]). The direct calculation via a scalar product is more accurate than the recursive computation of the T-table elements; this admits a more rapid calculation of a solution possessing the required accuracy. Indeed, Bauer specifies formulae for computing the weights, but they are not practicable. Direct algorithms for a determination of the weights are derived in this article. Analogously to the Romberg sequence, an enclosure algorithm may be realized for the Bulirsch sequence F_B. In Theorem 7, we present the proof of an important property for the application of the h-sequence by Bulirsch, which was missing until now.
By introducing a new h-sequence F_D (decimal sequence), the set of possible procedures for the required accuracy can be enlarged. Of course, in any particular case, the search for the best (and therefore most rapid) method should not be compromised by a high computational effort in the preparatory work: the storage of the remainder-term factors in a table for a selection of effective methods enables us to choose the optimum before the start of the computation-intensive part of the algorithm.
1.2 Foundations of Numerical Computation
In the execution of numerical algorithms and formulae by use of a computer, there are rounding errors because of the transition from the continuum to a floating-point screen. Additionally, there are procedural errors, whose practical determination is not possible even though they can be expressed explicitly by means of the integrand or its derivatives. In order to deliver mathematically verified results, a specific algorithm has to be able to control both rounding and procedural errors. These basic requirements call for a certain standard of computer arithmetic and programming languages. They are met by the Kulisch-Miranker arithmetic and the programming language PASCAL-SC, which have been developed for scientific computations. Under the name PASCAL-XSC (see [23]), this extension of PASCAL is available not only for PCs but also for workstations and host computers. By means of the implementation of the optimal scalar product, we avoid the accumulation of rounding errors and therefore obtain an increase in accuracy (see [5, 24]). The foundations, which are the tools of modern numerics, are explained in detail in [2, 20, 25]. Important terms like screen, rounding, the rounding symbols (such as ◊o with o ∈ {+, -, ·, /}), optimal scalar product, long accumulator, and ulp (one unit in the last place) are defined in these sources.
1.3 Conventions

h-sequence : stepsize sequence F := {h_i} with h_i := h_0/n_i, where n_i + 1 is the number of nodes for the quadrature formula; i.e., we need n_i + 1 nodes in the i-th row of the T-table for the element in the 0-th column.
N, Z, R : sets of natural, integer, and real numbers, respectively.

IR : set of real intervals X := [a, b] := {y | y ∈ R ∧ a ≤ y ≤ b}, with a, b ∈ R.

## : in PASCAL-SC, this denotes the evaluation of the braced expression with maximum accuracy, rounded to the nearest enclosing interval possessing machine numbers as endpoints (see [5]).

a|b, a∤b : a divides b, a does not divide b.

ε_r / ρ_r : required absolute / relative error.

ε_e / ρ_e : estimated absolute / relative error.

ε_c / ρ_c : absolute / relative error as following from computed enclosures.

t_X : total execution time in seconds for the method X.

q_ρ : the quotient of the true (accurate) and the estimated error.

V(m, l, F) : quadrature method with the T-table row m, the bisection step l, and the h-sequence F as parameters (see Section 5).

vquad : optimal quadrature algorithm with verified integral enclosure.

approx : quadrature algorithm via an approximating Romberg integration.

asympt : quadrature algorithm with an asymptotic enclosure via a Romberg extrapolation.

itayl : verified quadrature algorithm by use of a Taylor-series ansatz.

Formulae and equations are numbered within each section. If they are used in another section, they are cited with the section number in first place.
2 Romberg Integration

The Romberg integration is essentially an application of the Richardson extrapolation (see [14]) to the Euler-Maclaurin summation formula (see [37]). For an approximate computation of the integral

(1) J := ∫_a^b f(t) dt,  a, b ∈ R,  f ∈ C^k[a, b],

we can use the trapezoidal sum

(2) T(h) := h · Σ''_{j=0}^{n} f(a + jh),  h := (b - a)/n,  n ∈ N,

where the double prime at the sum indicates that the first and the last summand are halved. Now we discuss the asymptotic behavior of T(h) as h goes to zero.
2.1 The Euler-Maclaurin Summation Formula

In [27, 37] we find the proof of

Theorem 1: For f ∈ C^{2m+2}[a, b], the trapezoidal sum (2) is given by

T(h) = J + Σ_{i=1}^{m} τ_i h^{2i} + α_{m+1}(h) · h^{2m+2}.

Remarks:

1. The τ_i are constants independent of h; the equation in Theorem 1 is said to be the Euler-Maclaurin summation formula.

2. In [1], the Bernoulli numbers B_k are stored for k ≤ 30.

3. The functions S_k : R → R defined by S_k(x) := B_k(x - [x]), with B_k(x) the Bernoulli polynomials, have the period 1. This implies:
Corollary 1: The remainder term in Theorem 1 can be expressed by use of

α_{m+1}(h) = (b - a) · (B_{2m+2} / (2m+2)!) · f^{(2m+2)}(ξ),  ξ ∈ [a, b].

2.2 The Remainder Term

Let us consider the definite integral J defined in (1) for the case of a sufficiently large k. Let the h-sequence

(3) F := {h_0, h_1, h_2, ...},  h_0 := b - a,  h_i := h_0/n_i,  n_i ∈ N

be given. By means of the appropriate trapezoidal sums T(h_i), an approximation P(0) of J is computed via a Lagrange interpolation making use of the well-known recursion formulae
(4) T_{i0} := T(h_i),  0 ≤ i ≤ m,
    T_{ik} := T_{i,k-1} + (T_{i,k-1} - T_{i-1,k-1}) / ((h_{i-k}/h_i)² - 1),  1 ≤ k ≤ i ≤ m;

the T_{ik} are the T-table elements, which may be grouped in the T-table

(5) T_00
    T_10  T_11
    T_20  T_21  T_22
    ...
    T_m0  T_m1  ...  T_mm
The procedural error

(6) R_{ik} := T_{ik} - J

may be computed directly, with m arbitrarily large.
Concerning special h-sequences, we demonstrate in [22] that the kernel K(x) does not change its sign in [a, b]. From this we obtain the simplified error representation (9). By use of the Euler-Maclaurin summation formula and with respect to (6) or (9), we deduce

Theorem 2: For the Romberg sequence F_R := {h_0/2^i}, K(x) in (7) does not change its sign in [a, b]; i.e., we obtain the following remainder-term formula for the T-table elements for F_R, with ξ ∈ [a, b]:

Proof: By use of a complete induction (see [4, 27]).
Remarks:

1. For the sequence with the smallest increase of the number of required nodes (relative to the 0-th column of the T-table), we obtain a remainder-term formula with an analogously simple structure.

2. The statement "F_R has the minimal increase of required nodes" is not necessarily valid for the diagonal elements (see Sections 4 and 5).

3. For arbitrary h-sequences, it is not possible to determine in an a priori fashion whether the sign changes. The possibility of the simplified representation (6) of the error depends on the choice of the h-sequence.

4. Subsequently, we often use the notation T_m := T_mm, R_m := R_mm.
2.3 Convergence

For i → ∞, we can show that the error of T_{ik} in the k-th column goes to zero proportionally to Π_{j=0}^{k} h²_{i-j}. Additionally, the following theorem is valid:

Theorem 3: If h_{i+1}/h_i ≤ a < 1 for all i ∈ N_0 and lim_{m→∞} h_m = 0, then there holds

lim_{m→∞} T_{m0} = lim_{m→∞} T_{mm} = T(0) = J.

In [7] we find the proof by use of the Theorem of Toeplitz.

For all elements T_{ik}, the proof of the following theorem is given in [4]:

Theorem 4: Provided the 0-th column of the T-table converges, all diagonal sequences converge.
2.4 The Classical Algorithm

The difference of two successive diagonal elements of the T-table serves as a truncation criterion (see [31]):

(10) |T_{m-1,m-1} - T_{mm}| ≤ ε_r,  J ≈ T_{mm}.

Therefore, it is necessary to compute all T-table elements which are positioned to the left of and above T_{mm}. In Section 6 we will use the algorithm approx for the purpose of comparisons.
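The classical approximative scheme of this section can be condensed into a short routine. The following Python sketch is an illustration in ordinary floating-point arithmetic (hence without any verification, like the algorithm approx): it combines the trapezoidal sums for the Romberg sequence h_0/2^i, the recursion (4), and the truncation criterion (10) in relative form.

```python
# Classical Romberg extrapolation with the stopping test (10).
import math

def romberg(f, a, b, eps=1e-10, m_max=15):
    T = [[0.0] * (m_max + 1) for _ in range(m_max + 1)]
    n, h = 1, b - a
    T[0][0] = h * (f(a) + f(b)) / 2.0
    for m in range(1, m_max + 1):
        h /= 2.0
        n *= 2
        # trapezoidal sum T(h_m), reusing the previous sum (only odd nodes are new)
        T[m][0] = T[m-1][0] / 2.0 + h * sum(f(a + j*h) for j in range(1, n, 2))
        for k in range(1, m + 1):       # extrapolation, recursion (4):
            T[m][k] = T[m][k-1] + (T[m][k-1] - T[m-1][k-1]) / (4.0**k - 1.0)
        if abs(T[m-1][m-1] - T[m][m]) <= eps * abs(T[m][m]):
            return T[m][m]              # criterion (10) satisfied
    return T[m_max][m_max]

print(romberg(math.exp, 0.0, 1.0))      # close to e - 1
```

For the Romberg sequence the divisor (h_{i-k}/h_i)² - 1 of (4) becomes 4^k - 1, which is what the inner loop uses.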
3 Verified Computation of the Procedural Error

Concerning a determination of an enclosure of the remainder term, the only difficulty is the computation of a higher-order derivative of f at an intermediate point ξ ∈ [a, b]. Because this ξ is unknown, we compute an enclosure of f^{(j)}(ξ) by replacing ξ with [a, b]. A numerical approximation for computing the j-th derivative is not suitable for a verified result. A symbolic differentiation requires too much effort. By means of recursive calculations, the method of "automatic differentiation" yields enclosures of Taylor coefficients, which immediately imply the required derivatives.
The algorithms are valid independently of the argument type; therefore, they hold for real-valued as well as for interval-valued nodes. The recursion formulae for the calculation of some standard functions, which are referred to in [28, 30, 33], are supplemented in [22] by means of additional formulae. By use of two examples, we then illustrate the direct transfer of mathematical algorithms, which is possible in PASCAL-SC. Finally, we outline algorithms for the automatic differentiation of functions in two variables with respect to cubature problems.
3.1 Automatic Differentiation Algorithms

Let u, v, w be real-valued functions which are sufficiently smooth in a neighborhood of t_0. We define the Taylor coefficients (u)_k of a function u by means of

(1) (u)_k := (1/k!) · u^{(k)}(t_0),  for k ≥ 0,   (a)
    or (u(T))_k := (1/k!) · u^{(k)}(T),  T ∈ IR,  (u)_k ∈ IR.   (b)

In this notation, the Taylor series of u around t_0 is

(2) u(t) = Σ_{k=0}^{∞} (u)_k · (t - t_0)^k.
For functions that are compositions of other functions by use of arithmetic operations, we can immediately give the following rules making use of the rules for computations with power series:

(3) (u ± v)_k = (u)_k ± (v)_k,  for k ≥ 0,   (a)
    (u · v)_k = Σ_{j=0}^{k} (u)_j · (v)_{k-j},   (b)
    (u/v)_k = (1/(v)_0) · { (u)_k - Σ_{j=1}^{k} (v)_j · (u/v)_{k-j} }.   (c)

By use of the trivial relations

(4) (c)_0 = c,  (c)_k = 0 for k ≥ 1,  c a constant,
    (t)_0 = t_0,  (t)_1 = 1,  (t)_k = 0 for k ≥ 2,  for the independent variable t,

we are able to compute the Taylor coefficients of each order k for arbitrary rational functions by first computing the coefficients for k = 0 for all partial expressions, then for k = 1, etc. We are able to derive calculation formulae for the Taylor coefficients for a larger
class of functions making use of the following relation, which can be derived immediately, with an additional application of the chain rule.

For the Taylor coefficients of the exponential function, we obtain:

(6) (e^u)_k = Σ_{j=0}^{k-1} (1 - j/k) · (e^u)_j · (u)_{k-j},  k ≥ 1.

Analogously, recursion formulae are derived for all other standard functions. Composite function expressions do not cause any problems since the arguments of standard functions in the recursion formulae are not the independent variable t but, rather, a suitable function u of t. These expressions are valid for arbitrary k ≥ 1; for k = 0 we take the function value. In [22], all these functions and proofs are listed in detail.
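The recursions (3b), (4), and (6) translate directly into code. The sketch below uses plain Python floats instead of intervals (an assumption for illustration; the helper names are ad hoc), and checks them by computing the coefficients of t·e^t around t_0 = 0, whose series is t + t² + t³/2 + t⁴/6 + ...

```python
# Taylor-coefficient arithmetic: variable rule (4), product rule (3b),
# and the exponential recursion (6), with floats standing in for intervals.
import math

def taylor_var(t0, K):
    return [t0, 1.0] + [0.0] * (K - 1)           # rule (4)

def tmul(u, v):                                   # rule (3b), Cauchy product
    return [sum(u[j] * v[k-j] for j in range(k + 1)) for k in range(len(u))]

def texp(u):                                      # rule (6)
    K = len(u) - 1
    e = [0.0] * (K + 1)
    e[0] = math.exp(u[0])                         # k = 0: the function value
    for k in range(1, K + 1):
        e[k] = sum((1 - j / k) * e[j] * u[k-j] for j in range(k))
    return e

t = taylor_var(0.0, 4)
w = tmul(texp(t), t)      # coefficients of t*e^t: 0, 1, 1, 1/2, 1/6
print(w)
```

As in the text, the coefficients of every partial expression are built up order by order, so arbitrary compositions of rational operations and standard functions are handled uniformly.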
3.2 Implementation in PASCAL-SC

All usual operators and functions are implemented as itaylor operators and functions. For linking, they are available in a package. As data type, we choose a dynamic array whose components represent the Taylor coefficients:

type itaylor = dynamic array [*] of interval;

By means of two examples, we illustrate the handling and the structure of the differentiation package:
Example 1: Let f(x) := x² - 3x + 1 be given. Let 3 be the maximum required order of the Taylor coefficients. Therefore, we declare the independent variable x by use of

var x, y : itaylor [0..3];

By use of (4), the variable x is represented by (x_0, 1, 0, 0) and the constants 3 and 1 by (3, 0, 0, 0) and (1, 0, 0, 0), respectively; (3b) as applied to x² (= x · x) yields (x_0², 2x_0, 1, 0). By use of (3a), we obtain for y := f(x_0):

y = (x_0² - 3x_0 + 1, 2x_0 - 3, 1, 0).

With f''(x_0) = 2! · (f(x_0))_2, we obtain: f''(x_0) = 2 · 1 = 2.

By initializing x, e.g., as x_0 := 1 or [0, 1], respectively, we obtain for the components of y:

x_0 = 1:      y = (-1, -1, 1, 0);
X = [0, 1]:   y = ([-2, 2], [-3, -1], [1, 1], [0, 0]).

Example 2: By use of the following program, the third-order derivative of the function f(x) = x · e^(1-x²) will be computed at nodes to be read in. It is observed that the coding is similar to the mathematical notation.
-
program example2(input,output);
use irari, itaylor;
var i, j : integer;
    h, x : itaylor[0..3];
    x0   : interval;

function f(x : itaylor) : itaylor[0..ubound(x)];
begin
  f := x * exp(1 - sqr(x));
end;

begin { main program }
  read(input, x0);
  expand(x, x0);            { initialization of x with (x0, 1, 0, ..., 0) }
  h := f(x);
  write(output, h[3] * 6);  { f'''(x0) = (f(x0))_3 * 3! }
end.
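The two examples above can be mimicked in Python with a minimal Taylor-coefficient class (real instead of interval coefficients; the class and method names are ours). Evaluating f(x) = x^2 − 3x + 1 at x_0 = 1 reproduces the coefficient vector (−1, −1, 1, 0) derived in Example 1:

```python
class Taylor:
    """Fixed-order Taylor-coefficient arithmetic (real, non-interval sketch)."""
    def __init__(self, coeffs):
        self.c = list(coeffs)

    @classmethod
    def variable(cls, x0, order):
        # independent variable: coefficients (x0, 1, 0, ..., 0), cf. (4)
        return cls([x0, 1.0] + [0.0] * (order - 1))

    @classmethod
    def const(cls, v, order):
        # constant: coefficients (v, 0, ..., 0)
        return cls([v] + [0.0] * order)

    def __add__(self, o):
        return Taylor([a + b for a, b in zip(self.c, o.c)])

    def __sub__(self, o):
        return Taylor([a - b for a, b in zip(self.c, o.c)])

    def __mul__(self, o):
        # Cauchy product of the coefficient sequences, cf. (3b)
        n = len(self.c)
        return Taylor([sum(self.c[j] * o.c[k - j] for j in range(k + 1))
                       for k in range(n)])

x = Taylor.variable(1.0, 3)
y = x * x - Taylor.const(3.0, 3) * x + Taylor.const(1.0, 3)
print(y.c)  # [-1.0, -1.0, 1.0, 0.0]
```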
The following multiplication operator and the exponential function from the numeric packages irari and itaylor are used:

global operator * (A, B : itaylor) res : itaylor[0..ubound(A)];
var k, j : integer;
begin
  for k := 0 to ubound(A) do
    res[k] := ## ( for j := 0 to k sum (A[j] * B[k-j]) );
end;
global function exp(x : itaylor) : itaylor[0..ubound(x)];
var k, j : integer;
    h : itaylor[0..ubound(x)];
begin
  h[0] := exp(x[0]);
  for k := 1 to ubound(x) do
  begin
    h[k] := zero;
    for j := 0 to k-1 do
      h[k] := h[k] + (k-j) * h[j] * x[k-j];
    h[k] := h[k] / k;
  end;
  exp := h;
end;

3.3
Automatic Differentiation in Two Dimensions
Concerning a function f : R^n → R, f ∈ C^k(G), G ⊆ R^n, the automatic computation of Taylor coefficients is possible by use of analogous recursion algorithms, just as in the one-dimensional case. The foundation is the Theorem of Taylor for the n-dimensional case (see [18]). The rule for the n-dimensional case can be outlined as follows:

1. For the first variable, the rule for the one-dimensional case is applied in such a way that the n−1 other variables are treated as constants.

2. In the expression generated in this way, the rule for the one-dimensional case is applied with respect to the second variable while the n−2 other variables are treated as constants, etc.

Since the functions under consideration are in the space C^k(G), the sequence of the differentiations with respect to the variables is irrelevant. In the following, we consider functions of two variables, which are denoted by t_1 and t_2. Let u, v, and w be real-valued functions in C^(m1+m2)(G) with G ⊂ R^2. By u^(k1,k2) we denote the function which is generated by differentiating u first k_1 times with respect to t_1 and then k_2 times with respect to t_2.
Analogously to the one-dimensional case, we define the Taylor coefficients of a function u by means of

(u)_{k1,k2} := (1 / (k_1! · k_2!)) · u^(k1,k2) .
In this notation, the Taylor series of u, making use of the point of expansion (t_1^0, t_2^0), is then as follows:

For functions which are arithmetic compositions of other functions, we can immediately indicate calculation rules for the Taylor coefficients that are analogous to (3a-c). For this purpose, we use the following identity and abbreviating notation:
(10)  ((u)_{j,0})_{0,k} = (u)_{j,k} = ((u)_{0,k})_{j,0} .
For the following rules, it is assumed that k_1, k_2 ≥ 1. For k_1 = 0 or k_2 = 0, respectively, the corresponding one-dimensional rule is valid. For the four basic operations, we obtain the corresponding recursion rules; proof: see [22, 30]. Applying (10) to the standard functions and by means of the one-dimensional rules as derived in Section 3.1, we obtain the desired two-dimensional recursion formulae. As an example, we obtain for the exponential function (14):
Reversing the ordering of the successive applications of the one-dimensional rule and because of (10), we obtain (15).

The expressions (14) and (15) yield identical values in the case of an exact computation since, for e^u ∈ C^k(G), the ordering is unimportant for computing the derivatives. When computing the derivatives numerically, however, we will obtain different results. Using interval arithmetic for the calculation, it becomes possible to obtain a tighter enclosure by generating the intersection of both results. This is
possible for all functions and operations. Analogously to the computation of the Taylor coefficients of the exponential function by use of (14), we are able to derive recursion formulae for other transcendental functions (see [22]). The implementation is executed analogously to the one-dimensional case.
4 Modified Romberg Extrapolation
Provided we replace ξ in the expression for the remainder term by the interval of integration [a,b], then, just as in (2.9) and on the basis of inclusion-isotony (see [24, 25]), we obtain the optimal approximating diagonal elements with

(1)  J = T_mm + R_mm ∈ T_mm + R_mm([a,b]) ,

where R_mm([a,b]) denotes the interval evaluation of the remainder term.
Let us now consider the operations in a floating-point screen instead of in the set of real numbers. The real operations ∘ are now replaced by the computer interval operations ◊ with the properties explained in [24]. We then obtain

(2)  J ∈ ◊J := ◊T_mm ⊕ ◊R_mm .
The outward directed roundings may accumulate to a considerable growth of the diameter of ◊J. As shown subsequently and in Section 5, it is always possible to decrease R_mm as compared with T_mm such that the value of the relative diameter is insignificant with respect to the accuracy of ◊J. Nevertheless, and because of the recursive definition of the T_mm, significant interval growths may occur by means of cancellations. This can be avoided by use of the new modified Romberg procedure. We replace the recursive calculation of all T_ik by one direct computation of an element T_mm by means of the call of one optimal scalar product. We thus can guarantee in (2) that the expression ◊T_mm is a result of operations with roundings as small as possible. The calculation of the procedural error is possible without computing T_mm a priori. Thereby the error is minimized efficiently, without the need for the execution of an unnecessary computation of T-table elements.
4.1 The Basic Enclosure Algorithm
From equations (2.6) and (2.9), a first method for the enclosure of values of integrals may be derived. The computation of the T-table element (in the following we only consider diagonal elements) via the recursive formula (2.4) is replaced by a direct computation by means of the accurate scalar product (see [21, 22]). Thus, a verified computation via simple interval arithmetic is replaced by a faster and more accurate enclosure method. A special T-table element may be computed directly without requiring explicit knowledge of other T-table elements!
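For orientation, the classical recursive computation that is being replaced here can be sketched in plain floating-point Python (no interval arithmetic; the recursion corresponds to (2.4) for the Romberg sequence n_i = 2^i):

```python
import math

def trapezoid(f, a, b, n):
    """Compound trapezoidal rule T_i0 with n panels."""
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + j * h) for j in range(1, n)) + f(b) / 2)

def romberg_diagonal(f, a, b, m):
    """Diagonal Romberg element T_mm via the usual Richardson recursion."""
    T = [[trapezoid(f, a, b, 2 ** i)] for i in range(m + 1)]
    for k in range(1, m + 1):
        for i in range(k, m + 1):
            T[i].append((4 ** k * T[i][k - 1] - T[i - 1][k - 1]) / (4 ** k - 1))
    return T[m][m]

# T_33 for the integral of exp over [0, 1]; exact value is e - 1
print(abs(romberg_diagonal(math.exp, 0.0, 1.0, 3) - (math.e - 1)))
```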
4.1.1 Optimal Computation of the Approximating Term
Definition 1: The Romberg sequence F_R := {h_0 / n_i} is given by

(3)  n_i := 2^i ,  i ≥ 0 .
For the Romberg sequence, there holds

Theorem 5: For the Romberg sequence F_R with T_ik according to (2.4), each element T_ik can be represented by

(5)  T_ik = h_0 · Σ_{j=0}^{n_i} w_ikj · f(x_ij) .

For k = 0, the weights w_i0j may be computed via

(6)  w_i0j = (2n_i)^(−1)  for j = 0 and j = n_i ,
     w_i0j = n_i^(−1)     for j = 1(1)(n_i − 1) ,

and, for k ≥ 1, by means of the recursion rule (7).
Proof: by complete induction (see [21, 22]).

The weights w_ikj may be represented precisely since they are rational. They may be computed a priori via a rational arithmetic and stored in a table, separately for the numerators and the denominators, for all relevant T-table elements. Instead of a recursive computation of the T_ik requiring numerous evaluations of functions, we only need to calculate rational numbers following from (6) and (7). Because the weights do not depend on f, we are able to generate fixed tables. A recursive calculation going beyond the fixed table is unnecessary in practice because we will not extend the maximum level (here: m = 7). Therefore we define a scalar product of the vector of the weights
(8)  w_ik := (w_ik0, ..., w_ik n_i)

and the vector of the function values

(9)  f_i := (f(x_i0), ..., f(x_i n_i))

with n_i := 2^i. T_ik is then computed by use of

(10)  T_ik = (h_0 / HN_ik) · Σ_{j=0}^{n_i} ŵ_ikj · f(x_ij) ,

where HN_ik is the main denominator of the weights w_ikj, and the ŵ_ikj are the corresponding numerators.
Examples:

w_00 = (1/2) (1, 1) ,  w_10 = (1/4) (1, 2, 1) ,
w_33 = (1/5670) (217, 1024, 352, 1024, 436, 1024, 352, 1024, 217) .
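These weight vectors can be regenerated exactly with rational arithmetic: the following Python sketch applies the Richardson recursion T_ik = (4^k T_{i,k−1} − T_{i−1,k−1})/(4^k − 1) directly to the weight vectors on the finest grid (an illustration of Theorem 5, not the tabulation procedure of [22]):

```python
from fractions import Fraction

def romberg_weights(m):
    """Weight vectors w[(i, k)] on the grid of 2^m + 1 nodes (Romberg sequence)."""
    N = 2 ** m
    w = {}
    for i in range(m + 1):                  # trapezoidal rows, k = 0, cf. (6)
        ni = 2 ** i
        step = N // ni
        row = [Fraction(0)] * (N + 1)
        for j in range(ni + 1):
            row[j * step] = Fraction(1, ni) if 0 < j < ni else Fraction(1, 2 * ni)
        w[(i, 0)] = row
    for k in range(1, m + 1):               # Richardson recursion on the weights
        for i in range(k, m + 1):
            q = Fraction(4 ** k)
            w[(i, k)] = [(q * a - b) / (q - 1)
                         for a, b in zip(w[(i, k - 1)], w[(i - 1, k - 1)])]
    return w

w33 = romberg_weights(3)[(3, 3)]
# numerators over the main denominator HN_33 = 5670
print([int(x * 5670) for x in w33])  # [217, 1024, 352, 1024, 436, 1024, 352, 1024, 217]
```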
The expression ◊T_mm appearing in (2) is evaluated exactly by use of Theorem 5 (ŵ_ikj ≥ 0) and (10), making use of the optimal scalar product

(11)  ◊T_mm = [ ∇( Σ_{j=0}^{n_m} ŵ_mmj · ∇f(x_mj) ) , Δ( Σ_{j=0}^{n_m} ŵ_mmj · Δf(x_mj) ) ] ◊· h_0 ◊/ HN_mm .

The interval rounding symbols Δ and ∇ in (11) in front of the summation signs make it clear that there is only one rounding operation each, instead of roundings in every summation and in every multiplication of ŵ_mmj with f(x_mj). When running the program, only the function values have to be computed.

4.1.2
Optimal Computation of the Remainder Term
Instead of an error estimation via the difference of two T-table elements, we are now able to compute an enclosure of the remainder term R due to equation (2.9). For the Romberg sequence, the following holds true, provided ξ is replaced by the interval [a,b]:
In general, the coefficient C_m of the remainder term may be stored. The Taylor coefficient (f([a,b]))_{2m+2} is verified numerically via automatic differentiation algorithms [21, 22, 32, 33, 34].
For a verified enclosure, we get equation (13)

with

(14)  ◊C_m := h_0^(2m+3) ◊· ◊B_{m+1} ◊/ 2^(m(m+1)) .
In this case, too, we are able to confine our attention to only a few interval rounding operations. The evaluation of ◊C_m can be carried out in a non-recurrent fashion in advance, just as in the case of the weights w_ikj. Therefore, during runtime, we only have to compute the Taylor coefficients and to multiply them by ◊C_m. With (11), (12) and (14), we change (2) to (15). Therefore the essential computational effort is the evaluation of the function values and of the Taylor coefficients. In this case, ŵ_mm is the vector of the weight numerators after conversion to the main denominator.
4.1.3 The Enclosure Algorithm Basic
The algorithm Basic is a direct interval algorithm which realizes the modified new Romberg procedure for a verified enclosure of the integral value. The interval diameter of ◊R_m serves as the truncation criterion

(16)  d(◊R_m) ≤ δ ,

where δ is the required absolute total error. Thus, the choice of an appropriate method via (16) does not require that any T-table element be known! If the real operations are replaced by the corresponding screen operations (see [25]), rounding errors in computer applications will not affect the result anymore.
Algorithm Basic:

1. input (f, a, b, eps)
2. i := −1
3. repeat
     i := i + 1
     ◊R_i := ◊C_i ◊· ◊(f([a,b]))_{2i+2}
   until d(◊R_i) ≤ eps
   ◊J_i := ## ( ◊(w_ii · f_i) · h_0 / HN_ii + ◊R_i )
4. output (◊J_i, i)
Remarks:

1. The variables C_i and w_i are tabulated; f_i represents the vector of the function values at the points x_i0 to x_i n_i; this vector is computed only for the last i. The Taylor coefficients (f([a,b]))_{2i+2} are verified in a separate step.

2. As compared to the algorithm approx, and in addition to the quality of a guaranteed enclosure which has now been achieved, the essential improvement is the direct and therefore exact evaluation of the T-table element, which is executed subsequently to the determination of an optimal remainder term.
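Using the numerators tabulated in the preceding subsection (main denominator HN_33 = 5670), the direct evaluation of a diagonal element as a single scalar product can be sketched as follows (plain floating point; the exact scalar product ## and the interval roundings of (11) are omitted in this illustration):

```python
import math

# numerators of the weights of T_33 over the main denominator 5670
NUM = [217, 1024, 352, 1024, 436, 1024, 352, 1024, 217]
HN = 5670

def t33_direct(f, a, b):
    """T_33 = h0 * (sum_j num_j * f(x_j)) / HN with x_j = a + j*(b-a)/8."""
    h = (b - a) / 8
    s = sum(w * f(a + j * h) for j, w in enumerate(NUM))
    return (b - a) * s / HN

# one scalar product, no T-table recursion; exact value is e - 1
print(abs(t33_direct(math.exp, 0.0, 1.0) - (math.e - 1)))
```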
4.2 Extension to Arbitrary h-Sequences

It is possible to extend the algorithm Basic to arbitrary h-sequences F. We state the following lemma for sequences F for which K(x) in the remainder term formula (2.6) changes its sign in [a,b].
Lemma 1: Let F := {h_i} be an h-sequence for which K(x) in (2.7) changes its sign in [a,b]. For the procedural error R_m in the Romberg extrapolation, there then holds (17)

with

(18)  D_m = (b − a) · B_{m+1} · Σ_{i=0}^{m} |c_{m,i}| · h_i^(2m+2) .
Proof: see [22].

If we compute the c_{m,i} to determine D_m by a single evaluation of (18), the algorithm Basic can be generalized to sequences F converging because of Theorem 3, even though there is a change of sign of K(x). In order to obtain the same optimal properties for an arbitrary h-sequence as in the algorithm Basic, we try to represent the T_ik as a scalar product. Starting from the h-sequence F := {h_0 / n_i}, we consider the set of nodes K_i relating to the stepsize h_i. Because the boundaries a and b belong to each node set, we simplify the representation as follows:

(19)  K_i := { a + j · h_i | j = 1(1)(n_i − 1) } .

In the case of the usual and the new evaluation of T_ii, we employ the function values precisely in those nodes where the trapezoidal sums T_00 to T_i0 are needed; i.e., by choice of the optimally approximating T-table elements T_mm, we use all function values in the trapezoidal sums T_00 to T_m0 which have already been computed. To obtain a recursion formula for the evaluation of the weights for the purpose of a direct representation of T_ik by means of a scalar product, it is necessary to know the number of preceding elements T_i0 with respect to all required function values. For this we consider two sequences which, in a sense, are extreme in this respect.
A) Maximal Utilizing Sequence

Concerning the newly required function values, the ones that have already been computed appear in the row immediately above. F_R is characterized by

(20)  n_i = 2 · n_{i−1} , for all i ≥ 1 ,

and

(21)  K_{i−1} ⊂ K_i , for all i ≥ 1 ;

i.e., all function values evaluated up to the (i−1)-th row appear precisely in the (i−1)-th row, and all of them are needed in the i-th row.
B) Minimal Utilizing Sequence

F_P := {h_0 / p_i} has the property

p_i = i-th prime number, i ≥ 1 , p_0 := 1 ,

K_i ∩ K_j = ∅ , for all i, j with i ≠ j ;

i.e., in order to evaluate the function values needed in the i-th row, we cannot use any individual function value which has been evaluated up to this point. These considerations are important for the determination of the procedure with the smallest number of employed nodes.
4.2.1 The Bulirsch Sequence
Now we discuss the Bulirsch sequence F_B.

Definition 2: The Bulirsch sequence F_B := {h_0 / n_i} is given by

(22)  n_i = 1 for i = 0 ,  n_i = 2^((i+1)/2) for odd i ,  n_i = 3 · 2^(i/2 − 1) for even i ≠ 0 .

For F_B there holds:

(23)  n_i = 2 · n_{i−2}  and  K_{i−2} ⊂ K_i , for all i ≥ 3 .
The recursion formula (2.4) may be generalized; we obtain (24) with (25). Concerning F_B, all nodes from T_00 to T_{i−1,0} are included in T_{i−1,0} and T_{i,0} (see (23)); for the construction of a recursion rule for the weights, therefore, only the two last-computed rows in the T-table have to be added. Thus, we proceed to
Theorem 6:
For the Bulirsch sequence F_B due to (22) with n_{−1} := 0, there holds:

For all i, k, the nodes are

x_ij := a + j · h_i ,  j = 0(1)n_i ,  0 ≤ k ≤ i ,

and T_ik can be represented as a scalar product of the function values with rational weights w_ikj and v_ikj, which may be computed as follows:

1. for k = 0, 0 ≤ i:

(29)  w_i0j = 1/(2n_i) for j = 0 and j = n_i ,  w_i0j = 1/n_i for j = 1(1)(n_i − 1) ,
      v_i0j = 0 for j = 0(1)n_{i−1} ;

2. for k ≥ 1, recursively with α_ik and β_ik due to (25).

Proof: see [22].
Proof: see [22] Analogously to the Romberg sequence, an enclosure algorithm may be realized for the Bulirsch sequence provided the weights have been stored in a T-table and
after it has been decided which one of the remainder term formulae is to be applied, i.e., whether or not K(x) in equation (2.7) changes its sign in [a,b]. This mathematical problem is discussed in [7]. For F_B, at certain intermediate points of the integration interval (the estimation has been carried out stepwise), Bulirsch emphasizes that the function K(x) possesses the same sign for all m with m ≤ 15; therefore he concludes that "with considerable certainty" K(x) does not change its sign for m ≤ 15 in the whole interval. This rather unsatisfactory statement may now be replaced by
Theorem 7:

T_mm − J = h_0^(2m+3) · c_m · B_{m+1} · (f(ξ))_{2m+2}

with c_m as in (33).
Proof: The proof of the absence of a change of sign of K(x) is not obvious and may be looked up in [22]. It was executed by means of the computer algebra system REDUCE and by means of a program computing tight enclosures of the range of values of polynomials [15].

4.2.2
Introduction of the Decimal Sequence
Concerning the Bulirsch sequence, problems with respect to the implementation may occur since, in general, the nodes cannot be represented exactly. This may have negative consequences for the interval diameters and for the execution times of the function evaluations; as a remedy, a so-called decimal sequence F_D may be introduced.

Definition 3: The decimal sequence F_D := {h_0 / n_i} is given by n_0 = 1, n_1 = 2, and corresponding rules for i ≥ 2 and i ≥ 3 (see [22]).

Now the nodes can always be represented precisely, provided a decimal arithmetic is
used. A clever implementation, however, allows F_B to be used in many cases. Since F_D shows a behavior similar to F_B, the screen in the coefficient table becomes more dense and thus the choice of an optimal method is improved. Theorems 6 and 7 are valid analogously for the decimal sequence (see [22]).
4.3 Comparison of Different h-Sequences
For this comparison, we consider the number of function values required for the evaluation of T_mm, and we establish their relationship to the corresponding procedural error R_m. Since we cannot specify a priori statements about the growth rates of the Taylor coefficients for an arbitrary function f, here we will relate only the remainder terms of the same column, i.e., with identical order of the derivative and the corresponding function values. Therefore, in (12) we consider only the expression C_m = c_m · h_0^(2m+3) · B_{m+1} with the remainder term factor

(33)  c_m = ( Π_{i=0}^{m} n_i )^(−2) .
In the determination of T_mm by means of a scalar product, generally the growth of the intervals is not related to m or to the choice of a sequence F. Therefore, in the determination of an optimal sequence F, we may confine our attention to the remainder term. We compare the sequences F_R, F_B, F_D, and F_P; for all m ≤ 7, we determine the number of the required function values and the quantities c_m. By use of the equations presented before, we conclude for m ≥ 1 that (34) holds; therefore, we obtain a table of remainder term factors (see Table 5, Section 4.4 in [22]). We will now compare the remainder term factors c_m for identical m and identical numbers of the required nodes. In order to achieve identical numbers of nodes for identical m, occasionally we have to carry out bisections or divisions into three parts. In such a case, there is a change of the factor h_0^(2m+3) in C_m. Table 1 lists the optimal sequences for all m ≤ 7.
Table 1: Optimal h-sequences

For a high precision in conjunction with a minimal number of nodes and the smallest error term, the best possible choices are the Bulirsch sequence F_B and, in the case of a decimal arithmetic, the then more meaningful sequence F_D. Provided only a low accuracy is required, it is possible to go only to the third row of the T-table and there to choose the optimal sequence (see Table 1).
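The remainder term factors (33) for the Romberg and Bulirsch sequences can be compared with a few lines of exact arithmetic (formula (33) as reconstructed above; the sequence helpers are ours):

```python
from fractions import Fraction

def romberg_n(i):
    return 2 ** i

def bulirsch_n(i):
    if i == 0:
        return 1
    return 2 ** ((i + 1) // 2) if i % 2 == 1 else 3 * 2 ** (i // 2 - 1)

def c_factor(nseq, m):
    """Remainder term factor c_m = (prod_{i=0}^m n_i)^(-2), formula (33)."""
    p = 1
    for i in range(m + 1):
        p *= nseq(i)
    return Fraction(1, p * p)

for m in range(1, 8):
    print(m, float(c_factor(romberg_n, m)), float(c_factor(bulirsch_n, m)))
```

For the Romberg sequence this reproduces c_m = 2^(−m(m+1)), the factor appearing in (14).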
5 The Optimal Enclosure Algorithm vquad

We will now derive the algorithm vquad for the determination of the optimal remainder term on the basis of Section 4.3. For the purpose of a reduction of the error, we combine bisection methods and a continuation in the diagonal of the T-table. After a first basic partition, there follows an adaptive refinement. This is controlled by the remainder term. Only if the remainder term satisfies the required error bounds does it become necessary to evaluate the T-table element. Thereby it is possible to achieve the required accuracy with a minimal total cost because an increase of the computing effort by means of bisections or higher T-table elements is used only where it is necessary. Locally, where the function is sufficiently simple, we achieve high accuracy at low cost; vquad delivers an optimal result fulfilling the required accuracy without large over- or underestimations. In the case of periodic functions, the trapezoidal rule yields the best approximation, as has already been mentioned in [4]. Therefore a continuation in, e.g., the diagonal of the T-table will not cause an improvement but only a growth of the remainder term. Also in this case, vquad selects the best remainder terms in the corresponding partitions such that both the error and the effort are small. Table 8 in [22] lists the optimal remainder term factors for all m ≤ 7. The underlying assumption is a lemma which is derived in Section 5.1. In dependence on the Taylor coefficients, the finally decisive comparison of the remainder terms is executed in vquad, with minimal cost and by using the table.
5.1 Criterion for the Selection of the Optimal Method V(m,l,F)

The expression c_m in the representation (1) of the remainder term can be computed and stored for each h-sequence and for all m ≤ m_max. Now we return to the bisections of c_m as discussed in Section 4.3, and we arrive at
Definition 4: V(m,l,F) is the basic method computing the remainder term with the remainder term factor c*_{m,l}: c*_{m,l} is the value of c_m for the h-sequence F after l bisections of I_i.
Remark: The values c_m and B_{m+1} are stored in a table. We also can store h_0^(2m+3) in a table (see Section 5.3.1 in [22]); this involves a minimal computational effort.

For the sub-domain I_i we require (see Section 5.3.1):

(3)  d(R_m) ≤ (ε_r)_i .

By use of (1), we transform (3) into (4) with the abbreviation κ_m, and we obtain
Lemma 2: Provided the inequality (5) is satisfied, making use of c*_{m,l} and κ_m in (4), then the originally required accuracy (3) is also satisfied.

Remarks:

1. Lemma 2 is the central point of the algorithm for the search for the best method V(m,l,F). This is the primary control of the whole adaptive algorithm vquad.

2. Instead of computing the corresponding remainder term for each table element (i.e., n times) for the purpose of a comparison with ε_r according to (3), we have to evaluate κ_m only once to be able to execute the comparisons!
The table of the coefficients c_m (see Section 4) is now extended by the parameters F (for the choice of the h-sequence) and l (bisection step). We give two examples concerning an interpretation.
a) Table Concerning c*_{m,l} as Sorted in View of Values Around 10^(−12):

From Table 2, we infer that c*_{7,0,D} yields the best relationship between computational effort and accuracy. Depending on the value of the diameter of the Taylor coefficient, the position of the optimal remainder term may change. So only by using the algorithm search as described in Subsection 5.3.2 can we arrive at a final decision.
(m,l,F)    ...  (4,2,D)   (3,3,R)   ...  (7,0,D)   (2,5,B)   ...
c*_{m,l}        2.3E−12   1.8E−12        9.5E−13   8.1E−13
#f              32        64             20        96

Table 2: Some c*_{m,l} for the value 10^(−12)
For all m 5 7, Table 8 in [22] contains all parameters of the optimal method (in this case a total of 61).
b) Table Concerning c*_{m,l} as Sorted in View of the Order m of the Taylor Coefficients:

For the case of m = 6, let us compare the different methods V(m,l,F); we thus obtain:

(m,l,F)    (6,0,R)   (6,2,B)   (6,1,B)
c*_{m,l}   2.3E−13   4.9E−18   1.6E−13
#f         64        64        32

Table 3: Some c*_{m,l} for m = 6
For F_B, the results imply a procedural error smaller by more than four orders of magnitude than the error for F_R, provided the same computational effort is being made. If F_B is bisected only once, the computational effort is only half as much as in the case of F_R, and the error still decreases by 30 %. In this case, F_B is to be preferred. The optimal method can be looked up in the c*_{m,l}-table in an a priori fashion. Subsequent to the computation of the appropriate Taylor coefficients, the best method (i.e., the method requiring the least computational effort while approximately satisfying the required accuracy) may be selected in an a posteriori fashion by applying equation (5) in Lemma 2.
5.2 Survey of the Principles of Generating vquad

By means of a comparison of the extended coefficient table c*_{m,l} with equation (5), the optimal method may be chosen. Concerning the required accuracy and the increase or decrease of the derivatives of the integrands, in the combinations of the three different h-sequences F with the T-table row and the bisection step, the
fastest method is realized via a search algorithm (see 5.3.2). The time spent on this search is irrelevant, since at most 10 values have to be compared in each row. Only afterwards are the function values computed which are necessary for the optimal method V(m,l,F); this is the main computational effort.
If the chosen method does not satisfy the required accuracy ε_r, we nevertheless obtain an integral enclosure with guaranteed bounds, even though the interval diameter is larger than required. For a further reduction, an adaptive refinement strategy is applied. To keep the errors small from the beginning, we start by initially segmenting the interval into subintervals allowing applications of the Bulirsch sequence. If no satisfactory method can be found for a certain m, then m will be increased or (in the case that the remainder term is augmented by the last increase) the interval is bisected, and the same method is applied recursively to both subintervals. Above all, bisections are preferable in the case of a more rapid increase of the higher order derivatives of the integrand as compared with the decrease of c*_{m,l}. We observe this situation in the cases of, e.g., strongly oscillating functions or close to poles. By deciding whether the new remainder term is larger than the old one, we can automatically prevent a growth of the procedural error. In this case, we choose a bisection step instead of a new T-table element. This is repeated as long as it is useful or until a maximal bisection or T-table step has been reached. The appropriate function values are computed after the optimal method has been determined for a certain subinterval; T_m and R_m are verified numerically by use of the above-mentioned theorems and stored in a linked list. Finally, these lists are scanned and the values are accumulated. Thus, we obtain an optimal method since additional computation is carried out only where necessary. The algorithm vquad realizes this method.
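The control flow just described — refine by bisection only where the local error bound fails — can be illustrated by a generic adaptive scheme (a simplified stand-in, not vquad itself: the error bound here comes from comparing two Simpson approximations rather than from verified remainder terms):

```python
import math

def simpson_step(f, a, b):
    """One Simpson approximation plus a crude error bound from the bisected rule."""
    def simpson(lo, hi):
        h = (hi - lo) / 2
        return h / 3 * (f(lo) + 4 * f(lo + h) + f(hi))
    mid = (a + b) / 2
    coarse = simpson(a, b)
    fine = simpson(a, mid) + simpson(mid, b)
    return fine, abs(fine - coarse) / 15

def adaptive(f, a, b, tol):
    """Accept the local result if its error bound meets the local tolerance,
    otherwise bisect and distribute the tolerance over both halves."""
    approx, err = simpson_step(f, a, b)
    if err <= tol:
        return approx
    mid = (a + b) / 2
    return adaptive(f, a, mid, tol / 2) + adaptive(f, mid, b, tol / 2)

val = adaptive(math.sin, 0.0, math.pi, 1e-10)
print(abs(val - 2.0))  # tiny: bounded by the accumulated local error bounds
```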
5.3 vquad in Detail
In [22] we find a complete outline of all details of the following flow-chart. An individual computation of the total number of 2 · (m_max + 3) coefficients for each sub-interval I_i of the basic partition is carried out.
Algorithm vquad (flow-chart): input (f, a, b, p_r); initial segmentation into subintervals I_i and computation of j_ges at the endpoints; computation of c_m over I_i; computation of (ε_r)_i from ε_r; recursive call of the procedure block(I_i, R, m, l, c_m, ...) with the test log1 := (d(R_i^opt) < (ε_r)_i); finally, accumulation of all T_m, R_m.
5.3.1 Global and Local Requirements of Accuracy
By way of the input of the required relative total error p_r, a global condition is considered,

(6)  p_r = ε_r / |J_ges|

with

(7)  d(J_ges) = sup(J_ges) − inf(J_ges) .
This means that our method, as applied to [a,b], should yield an integral enclosure ◊J with

(8)  ε := d(◊R) ≤ ε_r

or the corresponding relative condition (9). The reason for these choices is that only the procedural errors are taken into account for our automatic error control by use of an adaptive refinement strategy. We compute ε_r by use of p_r and

(10)  ε_r = p_r · |j_ges| .
The letter r in the index of ε and p refers to a required accuracy; j_ges is the notation for a good approximation of the exact value. We use (9) to allow a control of the adaptive refinement strategy. A small difference of j_ges as compared with the exact value of J_ges does not decisively affect the adaptive control (there is no influence on the verification, of course). Therefore, in most cases, a coarse approximation method is sufficient for the evaluation of j_ges and therefore also for step 4 in vquad corresponding to (10). We now have to consider the choice of the local requirements of accuracy with respect to the sub-intervals in order to satisfy the global requirement (8). By use of the definition of the achieved local absolute error

(11)  ε_i := d(R_i) ,

and by use of the definition of the required absolute error (ε_r)_i,
we immediately obtain the following lemma, which is proved in [22]:

Lemma 3: If the local errors satisfy ε_i ≤ (ε_r)_i for all subintervals I_i and the local requirements sum up to at most ε_r, then the global requirement (8) is satisfied.
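One simple way to choose local tolerances consistent with such a lemma is to distribute the global absolute tolerance over the subintervals proportionally to their lengths (an assumption for illustration; [22] gives the actual choice used in vquad):

```python
def local_tolerances(eps_r, intervals):
    """Split a global absolute tolerance eps_r over subintervals (a_i, b_i)
    proportionally to their lengths, so the local tolerances sum to eps_r."""
    total = sum(b - a for a, b in intervals)
    return [eps_r * (b - a) / total for a, b in intervals]

tols = local_tolerances(1e-8, [(0.0, 0.5), (0.5, 0.75), (0.75, 1.0)])
print(tols, sum(tols))
```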
5.3.2 The Algorithm search
By use of the table of the remainder term factors, the determination of the optimal remainder term R_i^opt employs a search algorithm search, which is called by a procedure block (see the flow-chart of vquad) that calls itself recursively. This perhaps will require a bisection of a subinterval I_i and the determination of an optimal remainder term by means of search for the thus generated partitioned subintervals. This is initiated by calling the procedure block for the new intervals. As the final condition, that R^opt is chosen which has the smallest number of nodes in the set of all R_i^opt for 0 ≤ i ≤ m_max. The search time is negligible since there are fewer than 10 candidates for applicable methods concerning any row in the T-table.

Remarks on the following algorithm search:

1. The ε-growth in search may lead to an acceleration of the method.

2. The external repeat-loop in search may be omitted; it serves as a refinement of the method, provided we use a more dense net of the values of the premultiplication factors.
Algorithm search (flow-chart; see [22] for the detailed decision structure).
5.3.3 Computation, Storage, and Accumulation of R_m and T_m with High Accuracy

Subsequent to a determination of m via search (and therefore a determination of R_m), there follows the evaluation of T_m by means of (4.10). The addition of R_m and T_m by use of (4.14) is carried out only at the end in order to avoid an unnecessary growth of the interval diameter. In this context, we use the identity
By use of a floating-point screen, there holds

J_ges ∈ ◊J .

The values R_m are determined via the procedures search and block. By use of the h-sequences and the thus generated m- and l-values, we are able to read the suitable weights from the tables in order to compute the corresponding function values and in
order to determine ◊T_m. The values (T_m)_i and (R_m)_i, which are stored in linked lists, are accumulated successively (without roundings) in the long accumulator, i.e., without a loss of information. For this we use the data type dotprecision (see [5, 24]). For each I_i in the subdivided basic interval, a linked list is generated. Only in one step at the end will there be one rounding operation with, therefore, a minimal interval growth.
5.4 Remarks on the Implementation
By an adaptive refinement, the recursive call in the algorithm block leads to a tree structure. In case of a bisection, the nodes of the tree point to a left and to a right branch. Otherwise, the evaluation takes place successfully and the obtained results are stored in a linked list. Figure 1 shows a simple example of a generated tree; a node consists of a data part D and three pointers. Provided the step l_max is reached, no further bisection is carried out, i.e., there is a maximum of (l_max − 1) bisections. At the end, J := ◊(Σ T_m + Σ R_m) is formed.

Figure 1: Example of a tree for an adaptive bisection
6 Numerical Results and Comparative Computations
The algorithm vquad is now to be compared with the approximative algorithm approx (see (2.10)), the asymptotic method asympt (see (2) in the following section), and an enclosure method itayl based on the Taylor-series ansatz. To guarantee a "fair" comparison, the comparable algorithms were also equipped with a similar adaptive refinement strategy. All methods were computed on an Atari Mega-ST. They were implemented in PASCAL-SC (see [5]). In general, the simple test examples provide close approximations or tight enclosures, even though occasionally there are considerable differences of the computational effort (see above). The asymptotic method generally requires the highest execution time. Since the ratio of these times is more than 100 as compared to vquad, this method soon becomes irrelevant, particularly as there is no verification and incorrect enclosures are occasionally obtained.
6.1 Methods for Comparison
In Section 6 in [22], there is a detailed discussion of the methods employed for the intended comparisons.
6.1.1 approx and asympt
As to the approximative method, it may happen that the true error is much higher than the estimated error; i.e., the program proclaims a significantly smaller error than it actually produces. Thus, for difficult integrals, the reliability of this method is doubtful. Additionally, in the case of a high requirement for accuracy, the computational effort is higher than that of the algorithm vquad. An asymptotic enclosure is obtained by introducing a subsequence by use of (1). Thus, due to [35], there also holds (2). In practice, however, it is not possible to determine this m* precisely. Just as in the case of approximative methods, estimations may only yield numbers which have nothing to do with the solution.
Quadrature by Extrapolation
175
6.1.2 A Taylor-Series Ansatz with Verified Enclosure
With f ∈ C^∞[a,b], we obtain for J in (2)

J = Q + R    (3)

with Q given by (4) and R given by (5). We are able to evaluate (4) and (5) in different ways by the choice of different values for x0. Assume that x0 = (a+b)/2 or x0 = a (or, analogously, x0 = b).
In order to obtain a high precision with a minimal cost of computing time, the diameter d(R) is of decisive importance. The d(R) due to the first version (x0 = (a+b)/2) is smaller by a factor of 2^n than the d(R) obtained by means of the second version. Therefore Theorem 8 as proved in [22] is valid. Another improvement is obtained via intersections.
Theorem 8: As an enclosure of the integral J := ∫_a^b f(x) dx, we obtain:

J ∈ Q_n + R_n    (6)

with R_n and the weights w_i and v_i, i = 0(1)n, as defined in (7) and (8) (see [22]).
6.2 Discussion of the Presented Methods
6.2.1 Theoretical Statements
As measures for the quality we choose, in the case of an approximation method, an error estimator (e.g. the difference of two succeeding approximations) and, in the case of verification methods, the diameter of a remainder term enclosure.
We test whether the "pseudo"-enclosures as generated by asympt are true enclosures. In a final comparison of all four procedures, we choose the total computing time for obtaining a required accuracy as the measure for the quality of a method. The reason is that in vquad and in itayl, Taylor coefficients are computed which contribute essentially to the total effort. The measure of the over- or underestimation, respectively, of the true error is of special importance. (1) Comparison of approx and asympt: Based on the numerical stability of the Romberg extrapolation (see [37]), we frequently get very good approximations. But we can demonstrate easily by use of examples that the rounding errors may quickly lead to an error estimator that is far away from the exact error. In the case of lower accuracy requirements ε_r, approx needs less computation time than asympt; for asympt, the validation of (2) is more costly than that of the error estimation. For large ε_r, the method asympt cannot work efficiently either. Because of the identical approximation terms for small ε_r, we obtain similarly good approximations. The "pseudo"-enclosure due to asympt may yield results that are as misleading as those due to approx. Therefore we expect that asympt will yield less acceptable results than approx.
(2) Comparison of vquad and itayl: Because of the convergence-accelerating effect of the extrapolation, it is to be expected that vquad terminates faster than itayl. This does not take into account the cost for evaluating the Taylor coefficients, both in the remainder term and in the approximation term.
a) We now compare the remainder term formulae for vquad (R_V) and for itayl (R_I). For the quality of the remainder term, the interval diameter is of decisive importance; it is caused by the true interval in f^(2m+2). Since this value is identical for both methods, the remaining factors govern the absolute magnitude of the procedural error. Inspection of Figure 2 reveals immediately that the new algorithm vquad is distinctly superior to itayl beginning with m = 3. b) We now compare the effort of evaluating the corresponding approximation terms T_m and Q_{2m+2}. The computational effort for T_m rises proportionally to the number of the required function values. Therefore, a bisection step implies a doubling of the computation time for the evaluation of the function values. For the evaluation of the Taylor coefficients, we cannot derive an analogous direct dependence
between the approximation degree and the runtime. Numerical results show a behavior that differs occasionally. Using low or medium accuracy or sufficiently simple functions, the Taylor method is faster and somewhat more precise than the extrapolation method. In difficult cases and/or with higher requirements for the accuracy, the algorithm vquad delivers distinctly better results (see Table 10 in [22]).
Figure 2: Comparison of the remainder terms of vquad and itayl

Figure 3 illustrates these effects distinctly for f1, f2 and f3; e.g. for f2, the absolute diameter increases while the absolute values of the Taylor coefficients decrease. For f3 and f1, the absolute diameters grow by one power of ten with every increase in the approximation degree m. The occurrence of this growth is remarkable in view of the fact that the argument x0 is a number that can be represented precisely. As compared with itayl, there is another advantage of vquad in the case of bisections. Using vquad, all function values computed up to that point can be re-used (see Section 4). As a consequence, the additional effort due to the iterative bisection is significantly reduced. Using itayl, on the other hand, it is not possible to re-use a Taylor coefficient or a corresponding vector since, because of Theorem 8, the Taylor coefficients are computed only at the midpoint of the interval currently under consideration. After a bisection, the old midpoints become endpoints and therefore cannot be used! In this case, the effort for itayl increases enormously. In view of the cost and the precision of the remainder term, therefore, vquad is to be preferred in the majority of the cases under consideration.
The functions f1, f2, f3 are evaluated at x0; they are defined as follows:

f1 = e^(x²) · sin(e^(x²)), x0 = 2.0 (see J1 in Section 6.2.2)

f2: the integrand of J2, x0 = 1.0 (see J2 in Section 6.2.2)

f3 = ∫_0^{2π} (v · re + w · im) dt, x0 = 0.5, with
re = a3 cos 3t + a2 cos 2t + a1 cos t + a0,
im = a3 sin 3t + a2 sin 2t + a1 sin t,
v = 3a3 cos 3t + 2a2 cos 2t + a1 cos t,
w = 3a3 sin 3t + 2a2 sin 2t + a1 sin t,
and a3 = 1, a2 = −5.5, a1 = 8.5, a0 = −3.
(3) Comparison of vquad and asympt: As has been observed in (1), asympt is more costly and not more accurate than approx, in spite of the declaration of an "enclosure". The extent to which approx still delivers useful error estimators for difficult integrals will be seen in Section 6.2.2.
Figure 3: Interval blow-up of the Taylor coefficients as a function of the approximation degree m.
6.2.2 Numerical Results
Table 4 and Figure 4 confirm the conjecture concerning a doubtful reliability of the classical error estimator (notice the logarithmic scale!). Just as in the case of
average requirements for accuracy, we frequently are able to find examples showing significant distances between the error estimator and the accurate value. Thus, this error estimation is worthless (see e.g. Table 8).

Table 4: The doubtful reliability of the classical error estimator
Figure 4: Some g_p-values of the integral examples

The method itayl yields excellent results. In the case of simple functions with low requirements for accuracy, this method is faster than vquad. In difficult cases with high requirements for accuracy, however, vquad is five times faster than itayl.
Integral-Example No. 1:
J1 = ∫_1^b e^(x²) · sin(e^(x²)) dx    (see [8])
The integrand is strongly oscillating. The magnitude of the derivatives grows rapidly as x increases, particularly in the case of large b. Thus, there are serious problems. For b = 2.1 we obtain the results listed in the subsequent Table 5. Computations marked by (*) in the column for approx in Table 5 yield approximations with a significant distance from the error estimator. Neither do the asymptotic enclosing intervals as obtained from asympt contain the integral value in the cases marked by (*). Figure 5 once more demonstrates distinctly the advantages of the verified algorithm vquad as compared with the comparison algorithms.
Figure 5: Computational effort in comparison with the achieved accuracy for J1 (computational effort plotted against the required relative accuracy p_r, for itayl, approx, and vquad)
Table 5: Comparison of the computational effort with the achieved accuracy for J1
Integral-Example No. 2:
The authors of [43] deal with this integral as a result of the integration of an equation of motion. If we analyse v(t) at the nodes t = 0(0.2)2.0, our enclosure algorithm vquad yields excellent results in the case of a required accuracy of p_r = 10^-'; see Table 6, which also lists the values as given in [43]. It is remarkable that in most cases only 2 or 3 digits are correct, whereas in [43] 3 additional decimal digits are considered to be accurate.
Table 6: Comparative computation
Integral-Example No. 3:

J3 = ∫_0^1 dx / (a^4 + (3x − 1)^4)
For the given parameter values a = 10^-1 and a = 10^-2, Tables 7 and 8 illustrate the numerical results in the case of the absolute accuracy requirements ε_r = 1 and 10^-3. The accurate value of J3 for a = 10^-1 is 740. If we apply approx, we obtain 86 as a first approximation, together with an absolute error estimator of 21. For a = 10^-2 there holds J3 ≈ 740480. Here, the approximation method provides an approximation differing even more from the solution. The error estimator is too small by a factor of 10^4: whereas approx provides the number 30 as an error estimator with respect to the approximated value 335289, itayl aborts with an exponent overflow. Explanations to Tables 7 and 8: Marking vquad by (**) denotes that no standard partitioning was chosen. Table 7 demonstrates quite clearly the advantages of a more favorable computation of the Taylor coefficients. The integral values marked by (*) are outside of the domain of the verified enclosure; they thus demonstrate the insufficiency of this method. The last column shows the obtained absolute error d(J) or the indicator g_p for the quality of the error estimation. If g_p > 1, this implies a significant underestimation of the error (see below).
Table 7: Comparative computation in the case of example J3, with a = 10^-1
Table 8: Comparative computation in the case of example J3, with a = 10^-2
7 Conclusions and Outlook
As in the case of almost all numerical problems, it is shown that an approximation method may provide results close to the solution only with a certain probability. If, however, guaranteed results as well as bounds are required (even for ill-conditioned cases), then verified enclosure algorithms - see vquad in Section 5 - should be used. Enclosure algorithms enable the user to compute pairs of bounds with small distances for the required integral; this can be achieved in a fast and uncomplicated manner via automatic differentiation algorithms and transformation of the Romberg extrapolation into a direct scalar product. Thus, a true error control has become possible via an adaptive step size control! In the case of ill-conditioned integrals, the classical error estimators may fail. Whether or not an integral is a critical case in this respect cannot be decided in an a priori fashion (see Example No. 2 in Section 6.2.2). Only subsequent to applications of enclosure methods are we able to ascertain whether an integral is ill-conditioned. Obviously, applications of inaccurate approximation methods are unnecessary since the method vquad yields guaranteed bounds which enclose the true values of the integral. An interesting possibility is the generalization of vquad to multi-dimensional integrals (see [13, 14, 29, 41]). In [38] we find an excellent continuation of the present work for the case of two-dimensional integrals.
References

[1] Abramowitz, M. and Stegun, I.A.: Handbook of Mathematical Functions, Dover Publications, New York, 1965
[2] Alefeld, G., Herzberger, J.: Introduction to Interval Analysis, Academic Press, New York, 1983
[3] Bauch, H., Jahn, K.-U., Oelschlägel, D., Süsse, H., Wiebigke, V.: Intervallmathematik, Theorie und Anwendungen, BSB B.G. Teubner Verlagsgesellschaft, Leipzig, 1987
[4] Bauer, F.L., Rutishauser, H. and Stiefel, E.: New Aspects in Numerical Quadrature, Proc. of SIAM, 15, AMS, 1963
[5] Bohlender, G., Rall, L.B., Ullrich, Ch., Wolff v. Gudenberg, J.: PASCAL-SC, Wirkungsvoll programmieren, kontrolliert rechnen, B.I., Mannheim, 1986
[6] Braune, K.: Hochgenaue Standardfunktionen für reelle und komplexe Punkte und Intervalle in beliebigen Gleitpunktrastern, Doctoral Dissertation, University of Karlsruhe, 1987
[7] Bulirsch, R.: Bemerkungen zur Romberg-Integration, Num. Math., 6, pp. 6-16, 1964
[8] Bulirsch, R. and Stoer, J.: Asymptotic Upper and Lower Bounds for Results of Extrapolation Methods, Num. Math., 8, pp. 93-104, 1966
[9] Bulirsch, R. and Stoer, J.: Numerical Quadrature by Extrapolation, Num. Math., 9, pp. 271-278, 1967
184
R. Kelch
[10] Bulirsch, R., Rutishauser, H.: Interpolation und genäherte Quadratur, in Sauer, R., Szabó, I. (Eds.): Mathematische Hilfsmittel des Ingenieurs, Springer, Berlin, 1968
[11] Corliss, G.F.: Computing Narrow Inclusions for Definite Integrals, in [20]
[12] Corliss, G.F. and Rall, L.B.: Adaptive, Self-validating Numerical Quadrature, MRC Technical Summary Report #2815, University of Wisconsin, 1985
[13] Davis, Ph.J., Rabinowitz, Ph.: Methods of Numerical Integration, Academic Press, San Diego, 1984
[14] Engels, H.: Numerical Quadrature and Cubature, Academic Press, New York, 1980
[15] Fischer, H.C.: Bounds for an Interval Polynomial, ESPRIT-DIAMOND-Report, Doc. No. 03/2b-3/1/K02.f, 1988
[16] Fischer, H.C.: Schnelle automatische Differentiation, Einschließungsmethoden und Anwendungen, Doctoral Dissertation, University of Karlsruhe, 1990
[17] Hearn, A.C.: REDUCE user's manual, The Rand Corporation, 1983
[18] Heuser, H.: Lehrbuch der Analysis, Teil 1 und 2, Teubner, Stuttgart, 1989
[19] Kaucher, E., Miranker, W.L.: Self-validating Numerics for Function Space Problems, Academic Press, New York, 1984
[20] Kaucher, E., Kulisch, U., Ullrich, Ch. (Eds.): Computerarithmetic, Scientific Computation and Programming Languages, B.G. Teubner, Stuttgart, 1987
[21] Kelch, R.: Quadrature, ESPRIT-DIAMOND-Report, Doc. No. 03/3-9/1/K1.f, 1988
[22] Kelch, R.: Ein adaptives Verfahren zur Numerischen Quadratur mit automatischer Ergebnisverifikation, Doctoral Dissertation, University of Karlsruhe, 1989
[23] Klatte, R., Kulisch, U., Neaga, M., Ratz, D., Ullrich, Ch.: PASCAL-XSC, Sprachbeschreibung mit Beispielen, Springer, Berlin, 1991
[24] Kulisch, U.W.: Grundlagen des Numerischen Rechnens, B.I., Mannheim, 1976
[25] Kulisch, U.W., Miranker, W.L.: Computer Arithmetic in Theory and Practice, Academic Press, New York, 1981
[26] Kulisch, U.W., Miranker, W.L. (Eds.): A New Approach to Scientific Computation, Academic Press, New York, 1983
[27] Locher, F.: Einführung in die numerische Mathematik, Wissensch. Buchges., Darmstadt, 1978
[28] Lohner, R.: Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Doctoral Dissertation, University of Karlsruhe, 1988
[29] Lyness, J.N. and McHugh, B.J.J.: On the Remainder Term in the N-Dimensional Euler Maclaurin Expansion, Num. Math., 15, pp. 333-344, 1970
[30] Moore, R.E.: Interval Analysis, Prentice Hall, Englewood Cliffs, New Jersey, 1966
[31] Neumann, H.: Über Fehlerabschätzungen zum Rombergverfahren, ZAMM, 46 (1966), pp. 152-153
[32] Rall, L.B.: Differentiation and Generation of Taylor Coefficients in PASCAL-SC, in [25]
[33] Rall, L.B.: Automatic Differentiation: Techniques and Applications, Lecture Notes in Computer Science, No. 120, Springer, Berlin, 1981
[34] Rall, L.B.: Optimal Implementation of Differentiation Arithmetic, in [20]
[35] Schmidt, J.W.: Asymptotische Einschließung bei konvergenzbeschleunigenden Verfahren, Num. Math., 8, pp. 105-113, 1966
[36] Stiefel, E.: Altes und Neues über numerische Quadratur, ZAMM 41, 1961
[37] Stoer, J.: Einführung in die Numerische Mathematik I, Springer, Berlin, 1979
[38] Storck, U.: Verifizierte Kubatur durch Extrapolation, Diploma Thesis, Institut für Angewandte Mathematik, University of Karlsruhe, 1990
[39] Stroud, A.H.: Error Estimates for Romberg Quadrature, SIAM, Vol. 2, No. 3, 1965
[40] Stroud, A.H.: Numerical Quadrature and Solution of Ordinary Differential Equations, Springer, New York, 1974
[41] Stroud, A.H.: Approximate Calculation of Multiple Integrals, Prentice Hall, New York, 1971
[42] Wilf, H.S.: Numerische Quadratur, in: Ralston, A., Wilf, H.S.: Mathematische Methoden für Digitalrechner, Bd. 2, Oldenbourg Verlag, München, 1969
[43] Wylie, C.R., Barrett, L.C.: Advanced Engineering Mathematics, McGraw-Hill, 1982, pp. 265-266
Numerical Integration in Two Dimensions with Automatic Result Verification

Ulrike Storck
For calculating an enclosure of two-dimensional integrals, two different methods with automatic result verification have been developed. Both procedures are based on Romberg extrapolation. They determine an enclosure of an approximation of the integral and an enclosure of the corresponding remainder term using interval arithmetic. In both algorithms, the quality of the remainder term chiefly determines the error of the result, i.e. the width of the enclosure of the integral. We therefore examine in detail the representations of the remainder terms in dependence on the chosen step size sequences.
1 Introduction

1.1 Motivation
In scientific and engineering problems, the values of multi-dimensional integrals are frequently needed. There are many different methods for numerical integration, especially in one and two dimensions (see [5], [6], [15]). Particularly in the two-dimensional case, however, the remainder term, assuming it is taken into account, is not given in a form suitable for numerical computation. In addition, the round-off errors are rarely taken into account. Therefore, a reliable statement about the accuracy is not possible in general, and the numerical results are often doubtful. In order to obtain an error estimate for a numerical result, two methods for calculating integrals of the form

J = ∫_a^b ∫_c^d f(x1, x2) dx2 dx1    (1)
with automatic result verification are presented in this paper. We will call these procedures the single Romberg extrapolation and the double Romberg extrapolation, respectively.
Scientific Computing with Automatic Result Verification
Copyright © 1993 by Academic Press, Inc. All rights of reproduction in any form reserved. ISBN 0-12-044210-8
U.Storck
188
1.2 Foundations of Numerical Computation
To introduce the two methods, we start with the representation of the integral J in (1) by
J = T + R

with T denoting the approximation and R the remainder term. In both procedures, the remainder terms depend on unknown points (ξ1, ξ2) ∈ [a,b] × [c,d]; consequently, a direct evaluation of R is impossible. Therefore, we replace (ξ1, ξ2) by [a,b] × [c,d] and obtain an enclosure R̄ of R, which yields
However, if the calculation is executed in a floating-point system, the round-off errors must be taken into account. Some important definitions follow (see [10]):
A floating-point system is defined by

S = S(b, l, e1, e2) := {0 := 0 · b^{e1}} ∪ {x = ∗m · b^e | ∗ ∈ {+,−}, b ∈ ℕ, b ≥ 2, e1, e2, e ∈ ℤ, e1 ≤ e ≤ e2, m = Σ_{i=1}^{l} x[i] · b^{−i}, x[i] ∈ {0, 1, ..., b−1} for i = 1(1)l, x[1] ≠ 0}.
Here b is called the base, l the length of the mantissa, e1 the minimal and e2 the maximal exponent. We have S(b, l, e1, e2) ⊂ ℝ. For calculations by means of a computer, a rounding ○ : ℝ → S with

○x = x for all x ∈ S

is required. The following roundings are very important in practice:

the monotone downwardly directed rounding ∇ with

∇x := max{y ∈ S | y ≤ x} for all x ∈ ℝ,

the monotone upwardly directed rounding Δ with

Δx := min{y ∈ S | y ≥ x} for all x ∈ ℝ,
and the interval rounding ◊ (where Iℝ denotes the set of intervals over ℝ) with

◊X := [∇(min_{x∈X} x), Δ(max_{x∈X} x)] for all X ∈ Iℝ.
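A toy model of these definitions (hypothetical Python; `grid`, `round_down` and `round_up` are made-up names standing for S(b, l, e1, e2), ∇ and Δ):

```python
from fractions import Fraction

def grid(b, l, e1, e2):
    """All elements m * b**e of S(b, l, e1, e2) with a normalized l-digit mantissa."""
    s = {Fraction(0)}
    for e in range(e1, e2 + 1):
        for m in range(b ** (l - 1), b ** l):   # mantissas with x[1] != 0
            v = Fraction(m, b ** l) * Fraction(b) ** e
            s.add(v)
            s.add(-v)
    return sorted(s)

def round_down(x, s):                # monotone downwardly directed rounding
    return max(y for y in s if y <= x)

def round_up(x, s):                  # monotone upwardly directed rounding
    return min(y for y in s if y >= x)

S = grid(b=2, l=3, e1=-2, e2=3)
lo, hi = round_down(Fraction(1, 3), S), round_up(Fraction(1, 3), S)
# lo <= 1/3 <= hi, and both roundings leave grid points unchanged
```

The pair (∇x, Δx) is exactly the one-point interval rounding ◊[x, x].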
For an arithmetic operation ○ ∈ {+, −, ·, /}, the floating-point operation ⊚ is defined by

x ⊚ y := ○(x ○ y),  x, y ∈ S,

and the interval operation ◊○ is defined correspondingly by means of the interval rounding ◊.
Furthermore, we assume that all floating-point operations +, −, ·, / are of maximum accuracy and that we can use an exact scalar product (see [3], [10]) which, using a long accumulator, determines error-free values of expressions of the form
A = Σ_{i=1}^{n} x_i · y_i  with x_i, y_i ∈ S.
The result of the exact scalar product is rounded by one of the roundings ∇, Δ, ◊, which is indicated by a rounding symbol ○ ∈ {∇, Δ, ◊} in front of the sum; the scalar product is then called an accurate scalar product. For the rounding ◊ we have
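The effect of an error-free scalar product can be illustrated with exact rational arithmetic standing in for the long accumulator of [3], [10] (a hypothetical Python sketch, not the fixed-point accumulator itself):

```python
from fractions import Fraction

# Hypothetical sketch: compute the dot product exactly and round only once,
# at the very end, instead of rounding after every partial sum.
def exact_dot(xs, ys):
    return sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))

xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
naive = sum(x * y for x, y in zip(xs, ys))   # plain float sum loses the 1.0
exact = exact_dot(xs, ys)                    # error-free value of the sum
```

The naive float summation cancels the small term completely, while the exact product recovers it; the final rounding then loses at most one unit in the last place.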
In both extrapolation methods, interval arithmetic and the accurate scalar product are used for calculating an enclosure of the approximations, denoted by ◊T, and for calculating an enclosure of the remainder terms, denoted by ◊R̄; we have

J ∈ ◊J := ◊T + ◊R̄.

1.3 Romberg Extrapolation
We will now discuss the principle of Romberg extrapolation. Assuming f is integrable in [a,b] × [c,d], we obtain an approximation of the integral J in (1) by applying the trapezoidal rule T(h1, h2); it follows that

lim_{h1,h2→0} T(h1, h2) = J.

Here, the double prime next to the summation symbol indicates that the first and the last summand are multiplied by the factor 1/2. In order to obtain an optimal
190
U. Storck
approximation of J, we have to choose h1 and h2 as close to 0 as possible. However, the round-off errors increase with decreasing h1, h2; moreover, very small h1, h2 yield a large number of function evaluations costing a lot of computing time. Therefore we cannot use very small h1, h2 for our computation. Instead, we obtain an approximation of T(0,0) by use of Romberg extrapolation. First of all, we introduce two arbitrary step size sequences
F1 = {h10, h11, h12, ...},  F2 = {h20, h21, h22, ...}  with
this yields the corresponding trapezoidal sum
Considering the T(h1i, h2j) as function values at the nodes h1i, h2j, by extrapolation for (h1, h2) = (0,0) we obtain an approximation of T(0,0). Now, in addition to the possibility of choosing among different step size sequences, we can distinguish between two different ways of using Romberg extrapolation. The first one is generated by choosing the same number of nodes of T(h1i, h2j) in the x1- and x2-direction, i.e. j = i and n1i = n2i. With ni := n1i, hi := 1/ni, and substitution of (3) into (4), there follows
Now the values T(hi) are used in the Neville-Aitken algorithm (see [14]) in order to extrapolate the value of T(h) for h = 0. This method is called the single Romberg extrapolation. The second method, denoted as the double Romberg extrapolation, is based on two different Romberg extrapolations for the x1- and x2-direction. This means that the two extrapolations are independent of each other.
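The classical one-dimensional version of this scheme can be sketched as follows (hypothetical Python with h_i = (b − a)/2^i; the paper itself allows arbitrary step size sequences):

```python
import math

# Hypothetical sketch of 1D Romberg extrapolation: trapezoidal values T(h_i)
# are extrapolated to h = 0 by the Neville-Aitken recursion.
def romberg(f, a, b, steps):
    T = [[0.0] * (steps + 1) for _ in range(steps + 1)]
    for i in range(steps + 1):
        n = 2 ** i
        h = (b - a) / n
        T[i][0] = h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n))
                       + 0.5 * f(b))
        for k in range(1, i + 1):            # Neville-Aitken extrapolation
            T[i][k] = T[i][k - 1] + (T[i][k - 1] - T[i - 1][k - 1]) / (4 ** k - 1)
    return T[steps][steps]

val = romberg(math.sin, 0.0, math.pi, 6)     # close to the exact value 2
```

In the verified algorithms, the same triangular scheme is evaluated with interval arithmetic and the accurate scalar product instead of plain floats.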
2 The Single Romberg Extrapolation

In accordance with the last section, we have
T(hi) = (b−a) · (d−c) · hi² · Σ″_{k=0}^{ni} Σ″_{l=0}^{ni} f(a + k·(b−a)·hi, c + l·(d−c)·hi)    (6)

with hi = 1/ni and ni ∈ ℕ. According to the Neville-Aitken algorithm, we obtain the following recursion:

Ti0 := T(hi)
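The trapezoidal sum (6) with its double-primed boundary weights can be sketched as follows (hypothetical Python in plain floating point, without the interval enclosures used in the paper):

```python
# Hypothetical sketch of the trapezoidal sum (6) for the single Romberg
# extrapolation: n+1 nodes in each direction, boundary weights 1/2 from the
# double-primed sums.
def trapezoid_2d(f, a, b, c, d, n):
    h = 1.0 / n
    total = 0.0
    for k in range(n + 1):
        wk = 0.5 if k in (0, n) else 1.0
        for l in range(n + 1):
            wl = 0.5 if l in (0, n) else 1.0
            total += wk * wl * f(a + k * (b - a) * h, c + l * (d - c) * h)
    return (b - a) * (d - c) * h * h * total

approx = trapezoid_2d(lambda x, y: x * y, 0.0, 1.0, 0.0, 1.0, 64)  # 1/4
```

Since the tensor-product trapezoidal rule is exact for bilinear integrands, the example reproduces the integral of x·y over the unit square up to rounding.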
For the constants c1, c2, we have:

and for the variables (x1, x2), with k1, k2 > 1, we have:
208
U. Storck
Note that by applying the preceding formulas, the Taylor coefficients of arbitrary functions of C^{k1,k2}(G) can be determined, provided these functions are explicitly given. Further differentiation formulas are given in [7]. Since these formulas are recursive, however, it is necessary to calculate all (f)_{u,v} with 0 ≤ u ≤ s, 0 ≤ v ≤ t in order to calculate (f)_{s,t}. Upon arranging the Taylor coefficients in a matrix M = (m_{ij}) with coefficients m_{ij} = (f)_{i,j}, we have to consider the left upper triangular matrix consisting of the first 2(k+2) diagonals of M in order to determine all (f)_{2j,2k+2−2j} with j = 0(1)k+1; these coefficients are required for the remainder term. Finally, note that the Taylor coefficients (f([a,b],[c,d]))_{2j,2k+2−2j} are needed for the determination of the remainder term and, therefore, the calculation of the derivatives f^{(2j,2k+2−2j)}([a,b],[c,d]) is not necessary. After having examined the approximations for the chosen step size sequences in the previous subsection, we will now deal with the remainder terms for these step size sequences:
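As an illustration of such recursive Taylor-coefficient formulas, here is a sketch of the standard two-dimensional rules for constants, the variable x1, and products (hypothetical Python; the original formulas, including those for further operations, are the ones given in [7]):

```python
# Hypothetical sketch of 2D Taylor-coefficient arithmetic: a function is
# represented by the matrix of its coefficients (f)_{u,v}.
def const_coeffs(c, s, t):
    m = [[0.0] * (t + 1) for _ in range(s + 1)]
    m[0][0] = c                       # a constant has only (c)_{0,0} = c
    return m

def var_x1_coeffs(x1, s, t):
    m = [[0.0] * (t + 1) for _ in range(s + 1)]
    m[0][0] = x1                      # value at the development point
    m[1][0] = 1.0                     # first partial derivative w.r.t. x1
    return m

def product(fc, gc):                  # 2D Cauchy product (Leibniz rule)
    s, t = len(fc) - 1, len(fc[0]) - 1
    return [[sum(fc[u][v] * gc[i - u][j - v]
                 for u in range(i + 1) for v in range(j + 1))
             for j in range(t + 1)] for i in range(s + 1)]

sq = product(var_x1_coeffs(2.0, 2, 2), var_x1_coeffs(2.0, 2, 2))
# Taylor coefficients of x1**2 about x1 = 2: value 4, slope 4, second term 1
```

Note how the product rule already exhibits the recursive structure described above: every (f·g)_{s,t} needs all lower-order coefficients of both factors.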
Theorem 2.5 For the step size sequences considered above and 1 ≤ m ≤ 5, the enclosure (52) holds, with the constants D_m, U_m, E_m and with h10 = b − a, h20 = d − c, t = ⌊m/2⌋.
Proof:

The inclusion (52) is obtained by means of an application of (27) and (28), substitution of ([a,b], [c,d]) for all K_{0,2m+2−2j}, K_{2m+2−2j,0} occurring in (25), and an employment of (34) for K_{m+1,m+1}, K_{m+2,m}. However, if we wish to employ (27) and (28), we have to show that the following terms do not change their sign in [a,b] × [c,d]:

(53)  for odd m:  K_{0,2m+2−2j}(x) and K_{2m+2−2j,0}(x) for the admissible indices j;
      for even m: K_{0,2m+2−2j}(x) and K_{2m+2−2j,0}(x) for the admissible indices j, as well as K_{0,m+2}(x).
In IS_n, the arithmetic operations ⊛ are defined in the following way:

F ⊛ G := Π_n(F ∗ G),  F, G ∈ IS_n,  ∗ ∈ Ω.

The space (IS_n, ◊Ω) is called an (interval) functoid, where ◊Ω = {⊕, ⊖, ⊗, ⊘}; it will be simply referred to as IM_n. Furthermore, the symbol ⊛, ∗ ∈ Ω, will be replaced by ∗, ∗ ∈ Ω, for the sake of clarity.
8,0, 0,8, }; it will be simply referred to as ZM,. Furthermore the symbol 8 ,* E R will be replaced by * ,* E R, for the sake of clarity. Remark 2.1 Every interval [u] E IR represents simultaneously a subset [ a ] C_~ M of M, that is [UIM
= {f E M
I f(s) E I.[
1
I s IP>
a
f
This ambiguity of the coefficients [a_i] in (2) as real intervals or as ranges of function sets makes it possible to deal with functional problems in Iℝ. In this sense we shall identify [a_i] = [a_i]_M. An interval function F ∈ IS_n contains all functions g of M with

g(s) ∈ F(s) = Σ_{i=1}^{n} [f_i] φ_i(s),  [f_i] ∈ Iℝ,  α ≤ s ≤ β.
Lemma 2.1 The elements of IM_n are closed, bounded, convex sets.

Proof F ∈ IM_n can be represented as an interval

[F]_M = {g ∈ M | inf F(s) ≤ g(s) ≤ sup F(s), α ≤ s ≤ β}

in the partially ordered Banach space M; hence F is closed, bounded and convex. □
Extensive use will be made of the important test F ⊆ G. For practical purposes, a sufficient condition, the coefficient enclosure ⊆_n, will be employed instead of ⊆.
Lemma 2.2 Let F, G ∈ IM_n with F = Σ_{i=1}^{n} [f_i] φ_i, G = Σ_{i=1}^{n} [g_i] φ_i, and let the coefficient enclosure ⊆_n be defined as follows:

(3)  F ⊆_n G :⇔ [f_i] ⊆ [g_i],  i = 1, ..., n.
H.-J. Dobner
228
Then the implication

F ⊆_n G ⇒ F(s) ⊆ G(s),  F, G ∈ IM_n,  α ≤ s ≤ β,

is valid and (IM_n, ⊆_n) is a partially ordered set.

Proof Follows by the rules of interval mathematics and (3). □
Definition 2.3 F ∈ (IS_n, ◊Ω) is called an enclosure of the real-valued function f ∈ M provided the following relation is valid:

(4)  f(s) ∈ F(s),  α ≤ s ≤ β.

Enclosures for operators are defined analogously.
Convention: Throughout this article, an enclosure of a real-valued quantity f will always be denoted by the corresponding capital letter F. Now, in view of computational purposes, we shall formulate a modification of Schauder's fixed point theorem.
Theorem 2.1 Let t : M → M be a compact operator and T : IM_n → IM_n an enclosure. If for an element X ∈ IM_n the condition

(5)  TX ⊆ X

is satisfied, then the operator t has a fixed point x̂, and moreover

(6)  x̂ ∈ X.
Proof From (4) we can derive tX ⊆ X. According to Lemma 2.1, X is closed, bounded and convex, so that the existence of a fixed point x̂ ∈ X follows by Schauder's fixed point theorem. □

An analogous theorem holds in the case of an α-condensing operator t. Even uniqueness statements can be validated with computer-adequate theorems; since this exceeds the scope of this paper, we refer to Kaucher / Miranker [11].
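As a toy instance of the test (5), one can check on a computer that cos has a fixed point in X = [0.73, 0.75]: an outward-rounded evaluation of cos over X lands inside X (hypothetical Python sketch; a real implementation would use directed roundings instead of this crude eps-inflation):

```python
import math

# cos is monotonically decreasing on [0, pi], so an enclosure of cos over
# an interval [lo, hi] there is obtained from the endpoint values,
# inflated outwards by eps to absorb rounding errors.
def cos_enclosure(lo, hi, eps=1e-12):
    return math.cos(hi) - eps, math.cos(lo) + eps

X = (0.73, 0.75)
T = cos_enclosure(*X)
inside = X[0] <= T[0] and T[1] <= X[1]   # T(X) contained in X
# If inside holds, cos has a verified fixed point in [0.73, 0.75].
```

The actual fixed point (approximately 0.7390851332) indeed lies in X; the point of the theorem is that the inclusion test alone already proves its existence.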
3 Enclosure methods for linear Volterra integral equations of the second kind
The equation being considered first has the form

(7)  x(s) = g(s) + ∫_α^s k(s,t) x(t) dt,  α ≤ s ≤ β,
Verified Solution of Integral Equations
229
with g ∈ M, k ∈ M × M. In operator notation, (7) is written as

x = g + kx,

where we use the same symbol for the kernel and the corresponding operator. Two different approaches for enclosing the solution of (7) will be discussed. The first method is based on an iterative process. Let z̃ be an initial guess for the solution of (7) and Z̃ the point interval extension of z̃. If I denotes the identity, we define

U := (I − K)Z̃ − G

and iterate by use of
Then, the enclosure K^(p+1) of the (p+1)-times iterated kernel k^(p+1) satisfies the recurrence formula

      { <, > denote directed rounded operations }
      stop:= (expo(errbound) < -32*resprec) or (errbound = 0.0);
    end
    else stop:= true;
    if not stop then begin
      writeln('***** SQRT routine: Error bound still too large! *****');
      y:= np5*(y + x/y);
    end;
  until stop;
  { Give back result up to the number of mantissa digits required
    by the calling program. }
  setprec(resprec);
  sqrt:= true;        { initialize the result of the function }
  sqrt:= y;
  sqrt:= false;       { set the temporary flag for the result }
  mpfree(y); mpfree(yy); mpfree(np5); mpfree(errbound); mpfree(sqrtx);
  { deallocate local mpreal variables }
end;
For the resulting value of the given routine there holds
The symbols ∇ and Δ stand for the directed roundings with respect to the precision setting when entering the routine sqrt(). Using the program given above in combination with formula (1), an interval enclosure of √x with one ulp (unit in the last place of the mantissa) accuracy can be determined easily.
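The self-correcting character of the Newton iteration y ← (y + x/y)/2 also allows the working precision to be doubled from step to step; a hypothetical Python sketch using the decimal module (not the PASCAL-SC routine above) is:

```python
from decimal import Decimal, getcontext

# Hypothetical sketch: Newton iteration for sqrt(x) with a working precision
# that doubles in each step, since each Newton step roughly doubles the
# number of correct digits.
def mp_sqrt(x, digits):
    getcontext().prec = 8
    x = Decimal(x)
    y = x.sqrt()                           # cheap low-precision start value
    prec = 8
    while prec < digits + 4:
        prec = min(2 * prec, digits + 4)   # grow the precision with the error
        getcontext().prec = prec
        y = (y + x / y) / 2                # self-correcting Newton step
    getcontext().prec = digits
    return +y                              # round to the requested precision

r = mp_sqrt(2, 50)                         # sqrt(2) to 50 significant digits
```

Because the method is self-correcting, early iterations may run cheaply at low precision; only the last steps pay the full multiple-precision cost, in contrast to the AGM iteration of the next section.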
Multiple-Precision Computations
3 Arithmetic-Geometric Mean Iteration
In Section 5, the evaluation of elliptic integrals to high accuracy is described. This may be carried out using the so-called arithmetic-geometric mean (AGM) iteration. Let g0 and a0 be two positive numbers with 0 < g0 < a0. The geometric mean of these numbers is g1 := √(g0·a0), while their arithmetic mean is a1 := (g0 + a0)/2. From the properties of means it follows that g0 < g1 < a1 < a0. For the sequences {g_j}, {a_j} with

g_{j+1} := √(g_j·a_j)  and  a_{j+1} := (g_j + a_j)/2,  j = 0, 1, 2, ...,

there holds the following monotonicity property:

g_j < g_{j+1} < a_{j+1} < a_j,  j = 0, 1, 2, ...    (3)

The sequences {g_j} and {a_j} converge to their common limit

lim_{j→∞} g_j = lim_{j→∞} a_j = AGM(g0, a0).

Moreover, the sequence of distances {d_j} with d_j := a_j − g_j shows the quadratic convergence rate of the AGM iteration. In general, relation (3) is not valid on a computer using rounded arithmetic. However, using directed rounded operations or interval arithmetic, a lower bound G_j for g_j and an upper bound A_j for a_j can be computed in each step of the iteration. Therefore, AGM(g0, a0) ∈ [G_j, A_j] for every j. The AGM iteration may be executed to arbitrary precision using only the arithmetical operations +, · and the square root function. Interval operations as well as evaluations of the interval square root function give verified bounds for the AGM. The method is not a self-correcting method like the Newton iteration. All iteration steps have to be performed with a full-length arithmetic. The following program is a straightforward implementation in order to get verified bounds for the arithmetic-geometric mean of two positive real numbers. Multiple-precision interval arithmetic is used.
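Before the verified program, the plain (unverified) iteration can be sketched in a few lines of hypothetical Python to observe the rapid convergence:

```python
import math

# Hypothetical sketch of the plain AGM iteration in ordinary floats; the
# verified program replaces this by multiple-precision interval arithmetic.
def agm(g, a, tol=1e-15):
    while abs(a - g) > tol * a:
        g, a = math.sqrt(g * a), (g + a) / 2   # geometric / arithmetic mean
    return (g + a) / 2

m = agm(1.0, math.sqrt(2.0))   # converges in a handful of steps
```

Starting from (1, √2), the gap a_j − g_j shrinks roughly quadratically, so only about five iterations are needed to reach double precision.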
W. Krämer
program agm;
use mp_ari, mpi_ari;   { multiple-precision real and interval modules }

function agm(x, y: mpinterval): mpinterval;
var
  a, b, bnew: mpinterval;
begin
  mpvlcp(x); mpvlcp(y);       { value copy for mpinterval value parameters }
  mpinit(a); mpinit(b); mpinit(bnew);
                              { initialization of local mpinterval numbers }
  a:= x;                      { starting values for the AGM iteration }
  b:= y;
  while a >< b do begin       { iterate as long as a and b are disjoint }
    bnew:= sqrt(a*b);         { geometric mean }
    a:= (a+b)/2;              { arithmetic mean }
    b:= bnew;
    { in each iteration step the convex hull of a and b is an }
    { enclosure of AGM(x,y) = AGM(y,x) }
  end;
  agm:= true;                 { initialize function result }
  agm:= a +* b;               { convex hull of a and b }
  agm:= false;                { result of function is only temporarily used }
  mpfree(a); mpfree(b); mpfree(bnew);  { free memory for local variables }
end;

var
  a, b, res: mpinterval;      { multiple-precision intervals }
  nn, relerr: integer;
begin
  mpinit(a); mpinit(b); mpinit(res);   { initialize mpinterval variables }
  writeln;
  writeln('*** Arithmetic-Geometric Mean Evaluation ***');
  writeln('*** to Arbitrary Verified Accuracy       ***');
  repeat
    writeln;
    write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
    setprec(nn);              { precision setting }
    write('a, b = ? '); read(a,b); writeln;
    res:= agm(a,b);
    if nn < 7 then writeln(res);
    setprec(2);
    relerr:= 1 + expo( (res.sup -> res.inf) /> res.inf );
    writeln('Rel. error of enclosure

    { new approximation using the circumscribed polygon }
    pi_old:= pi;
    pi:= n*(sl_in/2 +* hsl);  { convex hull gives new enclosure of pi }
    writeln(short(n):14:1, '-gon ', pi);
  until pi >= pi_old;         { check monotonicity }
end;
This program yields the following output:

          6-gon  [ 2.9E+00,                  3.5E+00                 ]
         12-gon  [ 3.1E+00,                  3.3E+00                 ]
         24-gon  [ 3.13E+00,                 3.16E+00                ]
         48-gon  [ 3.139E+00,                3.147E+00               ]
         96-gon  [ 3.141E+00,                3.143E+00               ]
        192-gon  [ 3.1414E+00,               3.1419E+00              ]
        384-gon  [ 3.1415E+00,               3.1417E+00              ]
        768-gon  [ 3.14158E+00,              3.14162E+00             ]
       1536-gon  [ 3.141590E+00,             3.141598E+00            ]
       3072-gon  [ 3.141592E+00,             3.141594E+00            ]
       6144-gon  [ 3.1415925E+00,            3.1415930E+00           ]
      12288-gon  [ 3.1415926E+00,            3.1415928E+00           ]
      24576-gon  [ 3.14159264E+00,           3.14159268E+00          ]
      49152-gon  [ 3.141592651E+00,          3.141592658E+00         ]
      98304-gon  [ 3.141592653E+00,          3.141592655E+00         ]
     196608-gon  [ 3.1415926534E+00,         3.1415926539E+00        ]
     393216-gon  [ 3.1415926535E+00,         3.1415926537E+00        ]
     786432-gon  [ 3.14159265358E+00,        3.14159265361E+00       ]
    1572864-gon  [ 3.141592653587E+00,       3.141592653593E+00      ]
    3145728-gon  [ 3.141592653589E+00,       3.141592653590E+00      ]
    6291456-gon  [ 3.1415926535896E+00,      3.1415926535900E+00     ]
   12582912-gon  [ 3.14159265358976E+00,     3.14159265358985E+00    ]
   25165824-gon  [ 3.14159265358978E+00,     3.14159265358980E+00    ]
   50331648-gon  [ 3.141592653589791E+00,    3.141592653589797E+00   ]
  100663296-gon  [ 3.141592653589792E+00,    3.141592653589794E+00   ]
  201326592-gon  [ 3.1415926535897931E+00,   3.1415926535897934E+00  ]
  402653184-gon  [ 3.1415926535897932E+00,   3.1415926535897933E+00  ]
  805306368-gon  [ 3.14159265358979322E+00,  3.14159265358979325E+00 ]
 1610612736-gon  [ 3.14159265358979323E+00,  3.14159265358979324E+00 ]
Now, the final enclosure is very satisfactory with respect to the 21 digit decimal interval arithmetic used. The numerical results show that the rate of convergence is rather poor. The following subsections also describe methods with a higher rate of convergence.
4.2 Pi with Order One
A linearly convergent method which is due to Ramanujan (see [3]) is considered. Here 1/π is given as an infinite sum:

    1/π = (2√2 / 9801) · Σ_{n=0}^{∞} (4n)! (1103 + 26390 n) / ( (n!)⁴ · 396^(4n) )
with 0! being defined as 1 as usual. Each additional summand increases the number of correct decimal digits of the approximation roughly by 8 (i.e., 26 bits). The following program shows an implementation of the preceding formula.
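Independently of the PASCAL-XSC implementation that follows, the series can be checked with a few lines of Python using the mpmath library (all names here are ad hoc):

```python
from mpmath import mp, mpf, factorial, sqrt, pi

mp.dps = 60
s = mpf(0)
for n in range(6):                     # six summands: roughly 48 correct digits
    s += factorial(4*n) * (1103 + 26390*n) / (factorial(n)**4 * mpf(396)**(4*n))
approx = 9801 / (2*sqrt(mpf(2)) * s)   # pi = 9801 / (2*sqrt(2) * sum)
err = abs(pi - approx)
```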
Multiple-Precision Computations
program ramapi;
use mp_ari;                   { module for multiple-precision real arithmetic }
var
  s, sn, app, pi: mpreal;
  nn, n, nmax: integer;
begin
  write('Number of digits (base=2**32)? '); read(nn); writeln;
  setprec(nn + 1);
  nmax:= nn-2;                { in each step roughly eight decimal places are gained, }
                              { i.e. roughly one mantissa digit to base 2**32 }
  mpinit(app); mpinit(s); mpinit(sn); mpinit(pi);
  pi:= 4*arctan(mpreal(1));   { correct reference value for pi }
  s:= 0;
  for n:= 0 to nmax do
  begin
    setprec(nn-n);            { n-th summand only to lower precision }
    sn:= fac(4*n)*(1103 + mpreal(26390)*n)
         / ( fac(n)**4 * (mpreal(396)**(4*n)) );
    setprec(nn);              { full precision for summation }
    s:= s + sn;
    app:= 9801 / ( sqrt(mpreal(8))*s );
    if n < 10 then writeln('approx: ', app);
  ...

function max(n, m: integer): integer;
begin
  if n > m then max:= n else max:= m;
end;

var
  xn, xnp1, pi2_ref: mpreal;
  nn, err: integer;
begin
  mpinit(xn); mpinit(xnp1); mpinit(pi2_ref);
  write('Number of mantissa digits (base=2**32) ? '); read(nn); writeln;
  setprec(nn+1);
  pi2_ref:= 2*atan(mpreal(1));  { accurate reference value for pi/2 }
  setprec(3);                 { default setting of the multiple-precision arithmetic }
  xn:= 1.5707963;             { pi/2 to normal floating-point accuracy }
                              { as initial approximation }
  setprec(1);
  repeat
    setprec( 3*getprec-1 );   { actual precision setting }
    if getprec > nn then setprec(nn+2);
    writeln('Actual precision setting: ', getprec);
    xnp1:= xn + cos(xn);
    xn:= xnp1;
    err:= 1 + expo( abs(pi2_ref - xn) );
    writeln('  Error: 2**(', err, ')');
  until getprec >= nn;
end.
Running this program with 120 mantissa digits gives:

Number of mantissa digits (base=2**32) ? 120
Actual precision setting: 2
  Error: 2**(-33)
Actual precision setting: 5
  Error: 2**(-104)
Actual precision setting: 14
  Error: 2**(-315)
Actual precision setting: 41
  Error: 2**(-949)
Actual precision setting: 122
  Error: 2**(-952)
In each iteration step, the number of correct bits is almost tripled. The actual precision of the employed arithmetic is altered in each iteration step.
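The precision-tripling scheme can be imitated as follows (a sketch in Python/mpmath, not the author's code; the start value and the precision schedule mirror the program above):

```python
from mpmath import mp, mpf, cos, pi, fabs

target = 120                      # decimal digits wanted
mp.dps = 15
x = mpf('1.5707963')              # pi/2 to ordinary floating-point accuracy
prec = 10
while prec < target:
    prec = min(3*prec, target + 10)
    mp.dps = prec                 # raise the working precision (roughly tripled)
    x = x + cos(x)                # self-correcting fixed-point step
mp.dps = target + 10
err = fabs(pi/2 - x)              # cubic convergence towards pi/2
```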
4.6 Pi with Order Four
The rate of convergence of the method discussed in this section is four. Again, it is not the value of π which is approximated; instead, a sequence of values a_n is computed which tends to the limit 1/π. The method used is based on a modular equation of order four (see [3]). Based on the starting values

    y_0 := √2 − 1,    a_0 := 6 − 4√2,

the following iteration is performed:

    y_{n+1} = (1 − (1 − y_n⁴)^(1/4)) / (1 + (1 − y_n⁴)^(1/4)),
    a_{n+1} = a_n (1 + y_{n+1})⁴ − 2^(2n+3) y_{n+1} (1 + y_{n+1} + y_{n+1}²).
The implementation in PASCAL-XSC uses some auxiliary variables for intermediate results. The fourth root is computed using the normal square root function twice. Of course, it would be faster to compute the fourth root by the Newton method directly, analogously to the processing of the normal square root. The method is not a self-correcting one; all operations and function evaluations have to be performed in full-length arithmetic. So the precision setting in the PASCAL-XSC program is executed only once at the beginning of the program part. In order to find an approximation for π, the reciprocal of the final value of the last iterate must be calculated.
program quadpi;
use mp_ari;                   { module for multiple-precision real arithmetic }
var
  a, y, h, t, s: mpreal;
  one, two, four, six, ref: mpreal;
  i, imax, n, nn, err: integer;
begin
  write('Number of mantissa digits (base=2**32)? '); read(nn); writeln;
  mpinit(a); mpinit(y); mpinit(h); mpinit(t); mpinit(s);
  mpinit(one); mpinit(two); mpinit(four); mpinit(six); mpinit(ref);
  setprec(nn+1);
  one:= 1; two:= 2; four:= 4; six:= 6;
  ref:= mpreal(0.25)/atan(one);  { accurate reference value for 1/pi }
  setprec(nn);
  h:= sqrt(two);
  y:= h - one;
  a:= six - four*h;
  t:= y*y;
  n:= 8;                      { 2**3 }
  imax:= 7;
  for i:= 1 to imax do
  begin
    h:= one - t*t;            { 1 - y**4 }
    t:= sqrt(h);              { sqrt( 1 - y**4 ) }
    h:= sqrt(t);              { sqrt4( 1 - y**4 ) }
    y:= (one - h)/(one + h);
    h:= one + y;
    s:= h*h;
    t:= y*y;
    a:= s*s*a - mpreal(n)*y*(h+t);
    err:= 1 + expo( ref - a );
    writeln('Number of correct bits are at least: ', abs(err));
    n:= 4*n;                  { 2**(2k+3) }
  end;
end.
Number of mantissa digits with respect to the employed base 2**32: 3200
Number of correct bits for the first 7 iteration steps:

>=     30
>=    137
>=    570
>=   2308
>=   9268
>=  37113
>= 102365
Only 7 iteration steps are necessary to compute an approximation with more than 100000 correct bits. For testing purposes, the first 10000 significant bits of π as well as the bits 90001 up to 100000 are given in Appendix A. They have been computed using the fourth order method described here.
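A sketch of the quartic iteration (Python/mpmath rather than PASCAL-XSC; the fixed precision setting reflects that the method is not self-correcting):

```python
from mpmath import mp, mpf, sqrt, pi

mp.dps = 1000                     # full-length arithmetic throughout
y = sqrt(mpf(2)) - 1              # y0
a = 6 - 4*sqrt(mpf(2))            # a0
n = mpf(8)                        # 2**(2k+3) for k = 0
for _ in range(5):
    r = (1 - y**4) ** mpf('0.25') # fourth root of 1 - y**4
    y = (1 - r) / (1 + r)
    a = a*(1 + y)**4 - n*y*(1 + y + y*y)
    n *= 4
approx = 1/a                      # the reciprocal of the last iterate
err = abs(pi - approx)
```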
There are many other methods to compute numerical values for π. See for example [3] and [4].
5 Elliptic Integrals
The computation of elliptic integrals is strongly related to the arithmetic-geometric mean iteration described in Section 3 as well as to the availability of sufficiently accurate values of π (see the previous section). To get verified bounds for the elliptic integral of the first kind

    F(k) := ∫₀^(π/2) dτ / √(1 − k² sin²τ),

the relationship

    F(k) = π / (2 · AGM(1, √(1 − k²)))

for moduli k ∈ (0, 1) is made use of. If {G_j} and {A_j} are the sequences of machine numbers associated with the process of forming arithmetic and geometric means in (2) with starting values g_0 := √(1 − k²) and a_0 := 1, then there holds
    F(k) ∈ [ ∇( π / (2 A_{j0}) ),  Δ( π / (2 G_{j0}) ) ].

The symbols ∇ and Δ denote rounding of constants as well as directed rounded results of floating-point operations towards −∞ and +∞, respectively. The AGM iteration is monotone with respect to both arguments. Thus, if g_0 := √(1 − k²) is not a machine number, the {G_j} sequence has to be computed using a floating-point number less than g_0 and the sequence {A_j} using a floating-point number exceeding g_0. The iteration stops for some j_0 if the term (A_{j0} − G_{j0})/G_{j0} is sufficiently small, and this then is true for the closely related bound of the relative error. With the same notation as above, the complete elliptic integral of the second kind
    E(k) := ∫₀^(π/2) √(1 − k² sin²τ) dτ

may be represented by the expression (see [15])

    E(k) = (π/8) · ( 4 − 2k² − Σ_{j=0}^{∞} 2^j (a_j − g_j)² ) / AGM(1, √(1 − k²)).
The term

    π / (8 · AGM(1, √(1 − k²)))

can easily be enclosed in an interval as described above (formula (14)). In practice, only a limited number N of terms of the infinite series are used. The associated remainder term

    R_N := Σ_{j=N+1}^{∞} 2^j (a_j − g_j)²

is positive and, since the differences a_j − g_j decrease quadratically, it is bounded by the last summand retained.
Thus, an enclosure of E(k) is given by

    E(k) ∈ (π/8) · ( 4 − 2k² − Σ_{j=0}^{N} 2^j (A_j − G_j)² − [0, R̄_N] ) / [G_N, A_N],    (16)

where R̄_N denotes an upper bound of the remainder term R_N.
All the operations in (16) are set operations, and [..., ...] stands for intervals with the given bounds. Again, g_j and a_j can be replaced by their floating-point bounds G_j and A_j as given by (6). The following PASCAL-XSC function makes use of the relationship in (16) in order to compute guaranteed bounds for complete elliptic integrals of the second kind.

function cel2( modulus: real ): interval;
{ Complete elliptic integral of the second kind }
var
  g, gnew, a, k, s, h, pio8: interval;
  j, jmax, twoj: integer;
  err: real;
  stop: boolean;
begin
  k:= modulus;
  if k = 1 then cel2:= 1
  else
  begin
    pio8:= arctan(intval(1))*0.5;  { interval enclosure of pi/8 }
    a:= 1;
    g:= sqrt( (1 - k)*(1 + k) );   { interval evaluation of g0 }
    j:= 0; jmax:= 8; err:= 1e-14;
    s:= (a - g)*(a - g);
    twoj:= 1;
    repeat
      writeln(twoj);
      j:= j+1; twoj:= 2*twoj;
      gnew:= sqrt(g*a);            { geometric mean }
      a:= 0.5*(g + a);             { arithmetic mean }
      g:= gnew;
      h:= twoj*(a - g)*(a - g);
      s:= s + h;
      { The range of the complete elliptic integral of the second kind  }
      { for arguments k in [0, 1] is [1, pi/2]. Thus, the following     }
      { error criterion leads to a relative error bound.                }
      stop:= (a.sup - g.inf < err) or (j > jmax);
    until stop;
    s:= 4 - 2*k*k - s - intval(0, sup(h));
    h:= s*pio8 / intval(g.inf, a.sup);
    if h.inf < 1 then h.inf:= 1;
    cel2:= h;
  end;
end;
-
The following table shows the results of each iteration step for the function cel2(k) with modulus k := predecessor(1) = 0.999999999999999999999. An interval arithmetic with 23 decimal mantissa digits has been used.
[ 1.0E+00,                     ...                      ]
[ 1.0E+00,                     ...                      ]
[ 1.0E+00,                     ...                      ]
[ 1.0E+00,                     ...                      ]
[ 1.000E+00,                   ...                      ]
[ 1.00000E+00,                 ...                      ]
[ 1.00000000000E+00,           1.00000000006E+00        ]
[ 1.000000000000000000E+00,    1.000000000000000001E+00 ]
The subsequent output demonstrates the fast quadratic convergence rate for E(0.5). The computation has been executed using a multiple-precision interval arithmetic. Only the correct digits are displayed which have been determined by means of a comparison of the lower and the upper bounds of the enclosure that is produced in each iteration step.
Methods for the computation of interval bounds for incomplete elliptic integrals of the first and second kinds are discussed in [13].
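The first-kind relationship F(k) = π/(2·AGM(1, √(1−k²))) and the monotonicity of the two mean sequences can be checked numerically. The following Python/mpmath sketch uses ordinary (not directed) rounding; mpmath's ellipk takes the parameter m = k²:

```python
from mpmath import mp, mpf, sqrt, pi, ellipk

mp.dps = 50
k = mpf('0.5')                    # modulus
g, a = sqrt(1 - k*k), mpf(1)      # g0 = sqrt(1 - k^2), a0 = 1
while a - g > mpf('1e-40'):
    g, a = sqrt(g*a), (g + a)/2   # g_j increases, a_j decreases towards the AGM
lo, hi = pi/(2*a), pi/(2*g)       # F(k) lies between these bounds
```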
6 Evaluation of the Natural Logarithm
The first method described here can be formulated easily; it is linearly convergent.
6.1 Natural Logarithm (First Method)
The method described in [7] will be used. Two sequences x_n and y_n with starting values x_0 > 0 and y_0 > 0 are defined:

    x_{n+1} := √( x_n (x_n + y_n)/2 ),    y_{n+1} := √( y_n (x_n + y_n)/2 ).
Using the common limit L := lim x_n = lim y_n of these sequences, log(x) can be expressed in the following way:

    log(x) = (x² − 1) / (2 L²).

In order to compute log(x), set x_0 := x and y_0 := 1. In each iteration step the common limit lies in the interval

    lim_{n→∞} x_n = lim_{n→∞} y_n ∈ [min{x_n, y_n}, max{x_n, y_n}].
The following program uses this observation to get bounds for log(x).

program logagm;
use i_ari;                    { interval module }
var
  arg, res: interval;

function m_ln( arg: interval ): interval;
{ computation of an enclosure of ln(arg) }
var
  x, y, xnew, ynew, m: interval;
begin
  x:= arg; y:= 1;
  repeat
    xnew:= sqrt( x*(x+y)/2 );
    ynew:= sqrt( y*(x+y)/2 );
    x:= xnew; y:= ynew;
    m:= x +* y;               { convex hull of x and y }
    m:= (arg-1)*(arg+1)/(2*m*m);
    writeln(m);               { enclosure of ln(arg) }
  until not(x >< y);          { the intersection of x and y is no longer empty }
  m_ln:= m;
end;

begin
  repeat
    write('Argument ? '); read(arg); writeln;
    res:= m_ln(arg);
    writeln('res      : ', res);
    writeln('intrinsic: ', ln(arg));
  until false;
end.
Enclosures computed for log(1.25) are given below.

Argument ? 1.25

[ 1.9E-001,              2.6E-001             ]
[ 2.1E-001,              2.4E-001             ]
[ 2.1E-001,              2.3E-001             ]
[ 2.2E-001,              2.3E-001             ]
[ 2.231435E-001,         2.231436E-001        ]
[ 2.231435E-001,         2.231436E-001        ]
[ 2.231435E-001,         2.231436E-001        ]
[ 2.2314354E-001,        2.2314356E-001       ]
[ 2.231435513E-001,      2.231435514E-001     ]
[ 2.2314355130E-001,     2.2314355133E-001    ]
[ 2.2314355131E-001,     2.2314355132E-001    ]
[ 2.2314355131E-001,     2.2314355132E-001    ]
[ 2.23143551313E-001,    2.23143551315E-001   ]
[ 2.23143551313E-001,    2.23143551315E-001   ]
[ 2.23143551314E-001,    2.23143551315E-001   ]
[ 2.231435513141E-001,   2.231435513144E-001  ]
[ 2.231435513141E-001,   2.231435513143E-001  ]
[ 2.231435513141E-001,   2.231435513143E-001  ]
[ 2.231435513141E-001,   2.231435513143E-001  ]
[ 2.231435513142E-001,   2.231435513143E-001  ]
[ 2.2314355131420E-001,  2.2314355131422E-001 ]
res      : [ 2.2314355131420E-001,   2.2314355131422E-001  ]
intrinsic: [ 2.231435513142097E-001, 2.231435513142098E-001]
The enclosure indicated by intrinsic is the resulting interval for ln(arg) using the intrinsic natural logarithm of the PASCAL-XSC interval module i_ari. Obviously, the rate of convergence is very poor.
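The method is easy to restate outside interval arithmetic; a plain-float sketch (Python; the function name is ad hoc) shows both the formula and the slow, linear convergence:

```python
import math

def log_by_means(x, tol=1e-15):
    """ln(x) via the common limit of the two mean sequences (x0 = x, y0 = 1)."""
    xn, yn = float(x), 1.0
    while abs(xn - yn) > tol:     # the gap roughly halves per step: linear rate
        xn, yn = math.sqrt(xn*(xn + yn)/2), math.sqrt(yn*(xn + yn)/2)
    m = (xn + yn)/2               # approximation of the common limit
    return (x*x - 1)/(2*m*m)

val = log_by_means(1.25)
```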
6.2 Natural Logarithm (Second Method)
The arithmetic-geometric mean iteration may also be used to compute guaranteed bounds for the natural logarithm. The following method is applicable for x ∈ (0, 1);
it uses a precomputed enclosure of π. There holds (see [4])

    ln(x) = π/(2 AGM(1, 10^(−n))) − π/(2 AGM(1, x·10^(−n))) + ε_n,    |ε_n| ≤ n / 10^(2(n−1)).
AGM(...) denotes the arithmetic-geometric mean iteration (2). A PASCAL-XSC program for the computation of logarithms may look as follows:

program logquad;
use mp_ari, mpi_ari;          { multiple-precision modules }

function log(n: integer; x: mpinterval): mpinterval;
{ Enclosure for log(x) using 10**(-n) for the AGM }
var
  u, v, one, t, err: mpinterval;
  pi2: mpinterval;
  i: integer;
begin
  mpvlcp(x);
  mpinit(u); mpinit(v); mpinit(one); mpinit(t); mpinit(err); mpinit(pi2);
  err.inf:= -n;               { err:= [-n, n] }
  err.sup:= n;
  for i:= 1 to n-1 do err:= err/100;
  one:= 1;
  pi2:= 2*atan(one);          { pi/2 }
  t:= one;
  for i:= 1 to n do t:= t/10; { t:= 10**(-n) }
  u:= agm( t, one );
  v:= agm( x*t, one );
  log:= true;
  log:= pi2*(1/u - 1/v) + err;  { enclosure of ln(x) }
  log:= false;
  mpfree(u); mpfree(v); mpfree(one); mpfree(t); mpfree(err);
end;

var
  x, res: mpinterval;
  k, n, kmax, abserr: integer;
begin
  setprec(3);
  mpinit(x); mpinit(res);
  kmax:= 4; n:= 8;
  repeat
    writeln;
    write('x = ? '); read(x); writeln;
    for k:= 1 to kmax do
    begin
      writeln('Actual precision setting: ', getprec);
      res:= log(n, x);        { enclosure of ln(x) }
      if getprec ...
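The same formula can be tried out directly with mpmath's built-in agm (an illustrative sketch; n = 8 matches the setting in the program above, and the computed error stays far below the bound n/10^(2(n−1))):

```python
from mpmath import mp, mpf, agm, pi, log

mp.dps = 30
n = 8
t = mpf(10) ** (-n)               # 10**(-n)
x = mpf('0.7')                    # some x in (0, 1)
approx = pi/(2*agm(1, t)) - pi/(2*agm(1, x*t))
err = abs(approx - log(x))        # well below n/10**(2*(n-1)) = 8e-14
```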
• the number m of iterations exceeds a prescribed bound; then a decision concerning the asymptotic stability of [C] is not possible.
In the execution of the multiplications of interval matrices, there are unavoidable overestimates. This may cause an unstable point matrix to be enclosed in [C^m] whereas this is not so for [C^(m−1)]. In this case, the property of asymptotic stability cannot be verified for [C] irrespective of its truth.
4 Partition of the Input Interval

4.1 Introduction of the Problem
In the context of the problem of asymptotic stability, systems of linear ODEs are considered in this paper,

    y' = Ay,    y : R → R^n,  A ∈ R^(n×n).
The following constructive sufficient criterion for this property is derived in Section 3:

• an interval matrix [(A − I)^(−1)] can be determined, i.e., the matrix A − I is invertible, and,

• by means of the Cordes-Algorithm, ρ(B) < 1 can be verified for all B ∈ [B] := I + 2[(A − I)^(−1)].
B. Gross
This algorithm is still applicable in the case that an interval matrix [A] is admitted, provided the width of [A] is sufficiently small. Frequently in applications, this condition is not satisfied because of relatively large variations of input data or uncertainties of measurements. Generally,

• an enclosure [(A − I)^(−1)] then cannot be determined under the simultaneous admission of all A ∈ [A], or

• the Cordes-Algorithm fails to verify that ρ(B) < 1 for all B ∈ [B].
As a remedy, a suitable sequence of partitions of the input interval [A] may be carried out. The presented constructive methods are then to be applied with respect to each generated subinterval of [A]. Provided the property of asymptotic stability has been verified for all subintervals of [A], this property is also true for the union of these intervals, i.e., for all A ∈ [A].
4.2 The Philosophy of Partitioning [A]
Given an interval matrix [A] ⊆ R^(n×n), there are k ∈ {1, ..., n²} elements which are represented by genuine intervals; the remaining n² − k elements are real numbers. The Cartesian product of the k interval-valued elements defines an interval in R^k, which is to be partitioned (subdivided). Generally, the k individual input intervals in R possess largely non-uniform widths. As an additional problem, a sequence of subdivisions yields a rapidly increasing number of subintervals. Consequently, in each step of this sequence it is advantageous to partition only one interval in R possessing maximal width. Each partition in R generates subintervals of equal widths. Therefore, in each step, two new interval matrices are generated in R^(n×n). They are identical except for the one element in R which has been subdivided in this step, i.e., with the exception of their location in R^k. The total ordering of the set R allows the notation left (or right) partial interval matrix for the interval matrix in R^(n×n) that contains the left or right subinterval in R. See Figure 4.1 for a sequence of sample partitions of an interval in R^k with k = 2. Occasionally, in the sequence depicted in this figure, two neighbouring intervals in R are simultaneously subdivided in an individual step of this sequence.
Verification of Asymptotic Stability
Figure 4.1: Example of a partitioning strategy for a matrix containing two interval-valued elements
The corresponding algorithm was arranged recursively such that the left partial interval matrix is investigated first in each step. If asymptotic stability cannot be verified for this interval matrix, it will be subdivided again. Otherwise, the right partial interval matrix is treated next. Provided the property of asymptotic stability has been verified for both partial interval matrices belonging to one step of this recursion, this property is also valid for their union. In that case, the algorithm returns to the preceding step of the recursion. If this is its beginning, the property of asymptotic stability has been verified for all A ∈ [A]. Otherwise, the right partial interval matrix of the preceding step is treated. Consequently, this generates the structure of a tree, see Figure 4.2 for an example.
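The recursive strategy can be summarized in a few lines of Python (purely illustrative, not the paper's implementation; `verify` stands in for the stability test of Section 3, and a box is a list of (lo, hi) pairs in R^k):

```python
def verified_on_box(box, verify, depth=0, max_depth=30):
    """True if `verify` succeeds on every part of a recursive partition of box."""
    if verify(box):
        return True               # property verified for this whole subinterval
    if depth >= max_depth:
        return False              # bounded recursion depth: abort this path
    i = max(range(len(box)), key=lambda j: box[j][1] - box[j][0])
    lo, hi = box[i]               # split the component of maximal width
    mid = (lo + hi) / 2           # into two subintervals of equal widths
    left  = box[:i] + [(lo, mid)] + box[i+1:]
    right = box[:i] + [(mid, hi)] + box[i+1:]
    # the left partial box is investigated first, then the right one
    return (verified_on_box(left,  verify, depth + 1, max_depth) and
            verified_on_box(right, verify, depth + 1, max_depth))

# toy property: x - x > -1 for all x in the box, evaluated naively in interval
# arithmetic as lo - hi > -1; the overestimation forces a few subdivisions
verify = lambda b: b[0][0] - b[0][1] > -1
ok = verified_on_box([(0.0, 2.0)], verify)
```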
Figure 4.2: Example of a recursive tree for the partitioning algorithm
This recursion tree is treated by going from left to right. Concerning each individual path in this tree, the total number of steps is said to be the recursion depth of this path. Generally, this depth is non-uniform for different paths of this tree since it is determined as the path is treated recursively. Even though it cannot be fixed in an a priori fashion, the depths of the various paths must be bounded; i.e., the recursive treatment of a particular path will perhaps be aborted somewhere. If this occurs without a verification of asymptotic stability at the (artificial) end point of a path, then there are the following possible causes:

(i) The two subintervals as generated at the end of this path are still "too large".

(ii) Because of reasons as stated in Subsection 3.5, the Cordes-Algorithm does not yield a verification of the desired property.

(iii) At least one subinterval of [A] is not asymptotically stable. Consequently, [A] then does not have this property.

The subinterval addressed in (iii) can be utilized as an indicator in order to yield perhaps important information concerning the input data of the real world problem under discussion.
4.3 Examples Employing Subdivisions of the Input Interval Matrix
The examples presented in this subsection have been chosen and treated (by means of other methods) in the International Journal of Control. In papers in this journal, generally only interval matrices in R^(2×2) have been adopted. For these examples,
the property of asymptotic stability can then be tested immediately. The examples illustrate the employed methods very well. The first example to be treated here was chosen by S. Bialas [3]; it has also been investigated in numerous subsequent papers in the International Journal of Control. The example rests on the choice of
For all A ∈ [A], it was possible to verify the property of asymptotic stability by means of the partitioning method as presented in Section 4.2. For this purpose, it was necessary to verify this property individually for 102 partial interval matrices. It was analogously possible to verify the property of asymptotic stability for the following example by Xu Daoyi [7]:

    ( [-3, -2]   [ 3,  4] )
    ( [-6, -5]   [-3, -2] ).
In the execution of the partitioning method, here only 22 partial interval matrices had to be investigated. The interval matrix adopted in the third example,

    ( [-7, -3]   [ 0,  2] )
    ( [ 3,  5]   [-8, -4] ),
was originally chosen in the paper by R. K. Yedavalli [16]. This interval matrix contains the one treated in the first example. For a verification of asymptotic stability by means of the partitioning method, here a total of 195 partial interval matrices were treated. The total computation time was insignificant in each one of these three examples. In fact, the cost of the Cordes-Algorithm is almost negligible in the case of (interval) matrices in R^(2×2).
4.4 Application Concerning the Bus with an Automatic Tracking System
At first, in this subsection, the data chosen for the bus with an automatic tracking system is presented. This is followed by tables exhibiting the computed results for the eigenvalues for the Operational Cases chosen in Subsection 2.2. The data and the tables have been taken from the book by G. Roppenecker [14]. The following system of linear ODEs of order five serves as a simulation of the operational characteristics of the bus with an automatic tracking system:

    x_a'(t) = A_a x_a(t) + b_a u_a(t) + e_a z_s(t),
    y_a(t)  = c_a^T x_a(t),
where
x_a ∈ R^5 denotes the state vector, b_a, e_a and c_a are fixed vectors in R^5, and the system matrix has the structure

    A_a = ( 0   1    0     0     0   )
          ( 0   0   a23   a24   a25  )
          ( 0   0   a33   a34   a35  )
          ( 0   0   a43   a44   a45  )
          ( 0   0    0     0     0   ).
The vanishing eigenvalue of the matrix A_a has multiplicity three. Since A_a is not stable, a control is required. According to G. Roppenecker [14], the following intervals have to be taken into account for the velocity v of the bus, its mass m, and the coefficient μ, respectively.
With reference to the metric units, nondimensional quantities are introduced. The elements of the system matrix A_a and the components of the perturbation vector e_a can then be determined as follows as functions of the input data:

    a23 = 6.12 a33 + v (a43 + 1),    a24 = 6.12 a34 + v a44,    a25 = 6.12 a35 + v a45,

    a33 = -(2μ/(Θv)) (3.67² δV + 1.93² δH),    a34 = -(2μ/Θ) (3.67 δV - 1.93 δH),    a35 = -(2μ/Θ) 3.67 δV,

    a43 = -(2μ/(mv²)) (3.67 δV - 1.93 δH) - 1,    a44 = -(2μ/(mv)) (δV + δH),    a45 = -(2μ/(mv)) δV,

    e2 = -v²,
where
Θ  = Θ0 + 11.174 (m - m0)  : moment of inertia of the bus (in kg m²),
δV = δV0 + 8.430 (m - m0)  : coefficient of lateral force of the front wheels (in N/rad),
δH = δH0 + 17.074 (m - m0) : coefficient of lateral force of the twin rear wheels (in N/rad),
and, in the case of the empty bus (m = 9950 kg),
Θ0  = 105700 : moment of inertia (in kg m²),
m0  = 9950   : mass (in kg),
δV0 = 195000 : coefficient of lateral force of the front wheels (in N/rad),
δH0 = 378000 : coefficient of lateral force of the rear wheels (in N/rad).
The matrix A_a is given as follows for the three operational cases as chosen in Subsection 2.2:

Operational Case 1 (v = 3 m/sec, m = 9950 kg, μ = 1):

    A_a = ( 0   1      0            0            0        )
          ( 0   0   -154.79826  -113.56743  -122.06784   )
          ( 0   0    -25.44590     0.26282   -13.54115   )
          ( 0   0     -0.68978   -38.39196   -13.06533   )
          ( 0   0      0            0            0        )

Operational Case 2 (v = 10 m/sec, m = 9950 kg, μ = 1):

    A_a = ( 0   1      0            0            0        )
          ( 0   0    -46.43948  -113.56743  -122.06784   )
          ( 0   0     -7.63377     0.26282   -13.54115   )
          ( 0   0     -0.97208   -11.51759    -3.91960   )
          ( 0   0      0            0            0        )

Operational Case 3 (v = 20 m/sec, m = 16000 kg, μ = 1):

    A_a = ( 0   1      0            0            0        )
          ( 0   0    -17.86862   -89.06990   -94.51381   )
          ( 0   0     -2.94635     0.30108   -10.41892   )
          ( 0   0     -0.99185    -4.54563    -1.53750   )
          ( 0   0      0            0            0        )
The design of the automatic control has been presented in Subsection 2.2. In the absence of the differential filter, this yields the following system of linear ODEs of order six:

    ( 0        1       0     0     0     0      )
    ( 0        0      a23   a24   a25    0      )
    ( 0        0      a33   a34   a35    0      )
    ( 0        0      a43   a44   a45    0      )        (4.1)
    ( -0.882   2.588   0     0     0     1      )
    ( 72.68  -53.55    0     0     0   -29.52   )
Now, Gerschgorin's Disk Theorem is employed to determine six disks, whose union contains the set of eigenvalues of the matrix in (4.1). The disk belonging to the second column in (4.1) is given by |λ| < 57.138, where all possible cases are taken into account. This disk contains the other five disks. An inequality |λ| < r does not allow a conclusion whether or not all eigenvalues are confined to the left half-plane. Consequently, Gerschgorin's Disk Theorem is not useful for a verification of asymptotic stability for the present problem.
By means of the partitioning method, it will now be tested whether or not the design of the automatic control for the sample problem in Subsection 2.2 guarantees a stable operational performance for all choices of the parameters v, m and μ of the problem. Under the admission of the total input interval for these parameters, the corresponding intervals of the elements of the input matrix [A] possess relatively large widths. This is in particular true for the interval [-959.47684, -8.93431] as the set of the admissible values of the element a23. In the interval matrix [A] ⊆ R^(6×6), k = 9 denotes the number of elements taking values in genuine intervals, because of their dependencies on the mass m and the coefficient μ of the interaction of tires and road. The interval-valued elements a24, a34, a25 and a35 do not depend on the velocity v. The other interval-valued elements are functions of v, with an increasing rate of change as v decreases. The parameter variations of the extent just outlined prevented the execution of the partitioning method under the simultaneous admission of the total input interval for (m, v, μ)^T ∈ R³, at least when making use of the employed SAM 68000 computer made by kws. With a total computation time of two weeks, only the following edge of the input interval in R³ could be treated by means of the partitioning method: μ = 1, m = 9950 kg and v/[m/sec] ∈ [1, 20]. This hardware problem is expected to be less serious in the case of an employment of more powerful computer systems supporting the execution of enclosure algorithms. For this purpose, the language PASCAL-XSC may be used in conjunction with any IBM-compatible PC.
5 Estimates for the Degree of Stability
Concerning the location in C of the eigenvalues of a matrix A ∈ R^(n×n), four constructive methods will be presented in this section. For this purpose, the verification of asymptotic stability as outlined in Subsection 3.4 will be employed. Consequently, an estimate by any of these four constructive methods is guaranteed to be true.
5.1 The Degree of Stability
For certain classes of applications, it is not sufficient to verify the property of asymptotic stability for a system of ODEs,

    y' = Ay,    y : R → R^n,  A ∈ R^(n×n).
In fact, frequently it is important to assess the time response of a dynamic process. Generally, it is desirable that all solutions of y' = Ay approach zero sufficiently rapidly. All solutions of nonhomogeneous systems y' = Ay + b will then correspondingly approach a steady-state provided there is one. If this approach is too
slow, the simulating process is almost unstable. Consequently, problems are to be expected for the corresponding real world process. The time response of an asymptotically stable system is governed by the location of its eigenvalues in the left half-plane. This will now be investigated. Provided λ ∈ R is a negative eigenvalue of A, the approach to zero of e^(λt) slows down as |λ| decreases. In the case that A possesses a pair of conjugate complex eigenvalues λ, λ̄ with

    λ = δ + iω,   λ̄ = δ − iω   and   δ < 0,

then the corresponding solutions of y' = Ay are represented by

    r₁ e^(δt) sin(ωt) + r₂ e^(δt) cos(ωt) = r e^(δt) sin(ωt + φ)

with r₁, r₂, r ∈ R^n. The time response is bounded by ±r e^(δt), i.e., the decay slows down as |δ| decreases. It is assumed that A either has a negative eigenvalue close to zero or an eigenvalue with negative real part close to zero. The corresponding process then possesses only a small rate of decay due to damping and a slow approach of any steady-state. Provided the real parts of all eigenvalues are negative, then generally the time response is governed by the eigenvalues that are closest to the imaginary axis. This induces the following definition:
Definition:
(i) An asymptotically stable matrix A ∈ R^(n×n) possesses the degree of stability σ, with σ ≥ 0, if −σ represents the maximal real part in the set of eigenvalues of A.
(ii) An asymptotically stable interval matrix [A] ⊆ R^(n×n) possesses the degree of stability σ, with σ ≥ 0, if every matrix A ∈ [A] has at least the degree of stability σ.
Figure 5.1: Definition of the degree of stability σ
This stability measure arises naturally in the representation of the sets of solutions of the system y' = Ay considered here. In fact, the fundamental matrix of a system of this kind is given by Φ(t) = e^(At). In the more general case of systems y' = A(t)y with A(t + T) = A(t) for all t ∈ R and a fixed period T ∈ R⁺, Φ(t) = F(t) e^(Kt) because of the Floquet theory. Since F(t + T) = F(t) for t ∈ R is true for the matrix function F, the stability of y' = A(t)y is governed by the eigenvalues of K ∈ R^(n×n).
The following subsections are devoted to a presentation of additional properties of the Möbius-transformation w = (z + 1)/(z − 1). This is the basis for the design of the four constructive methods, each of which yields a safe lower bound of the degree of stability σ. Here, too, the Möbius-transformation is chosen such that there are corresponding mappings S(A) of A ∈ R^(n×n) and S(z) for the points z ∈ C, compare Corollary 3.7.
5.2 Additional Mapping Properties of the Möbius-Transformation w = (z + 1)/(z − 1)
On the basis of Theorem 3.6, an approach for the verification of the property of asymptotic stability is outlined in Subsection 3.4. This method does not allow an estimate of the degree of stability. For this purpose, an additional consideration of the Möbius-transformation w = (z + 1)/(z − 1) is required; particularly, the images of straight lines parallel to the imaginary axis are of interest.
Theorem 5.1
1. The Möbius-transformation w = (z + 1)/(z − 1) maps straight lines parallel to the imaginary axis lying in the left half-plane onto circles with center on the real axis containing the point +1.

2. The half-plane to the left of any such straight line is mapped into the interior of the image circle, see Figure 5.2.
Figure 5.2: Mapping properties of the Möbius-transformation S(z) in the left half-plane
Proof:

1. Every straight line contains the point ∞, which is mapped into S(∞) = 1. Consequently, the images of all straight lines contain the point +1. The point t ∈ R will denote the intersection of an arbitrary straight line with the real axis. The image, w = S(t) = (t + 1)/(t − 1), of this point is real-valued. Since generalized circles are mapped onto generalized circles, the image of a straight line is either a straight line or a circle. Since angles are preserved by conformal mappings, the images of straight lines parallel to the imaginary axis lying in the left half-plane are circles intersecting the real axis orthogonally.

2. The Möbius-transformation w = (z + 1)/(z − 1) maps the real axis onto the real axis. Since ∞ is mapped into +1, the segment of the real axis to the left of the point t is mapped onto the segment of the real axis between the points w = S(t) = (t + 1)/(t − 1) and +1. This completes the proof. □
2. The Mtibius-transformation w = -maps the real axis onto the real axis. 2-1 Since 00 is mapped into +1, the segment of the real axis to the left of the point t is mapped onto the segment of the real axis between the points w = S(t) = + and +l. This completes the proof. t-1 0 +
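As a plausibility check of Theorem 5.1 (an illustration added here, not part of the original text), the following Python sketch samples points on a straight line Re z = -α and verifies that their images under w = (z+1)/(z-1) lie on the circle through +1 and S(-α) = (α-1)/(α+1), i.e. the circle with center α/(α+1) and radius 1/(α+1); the value α = 2 is an arbitrary choice.

```python
# Numerical plausibility check of Theorem 5.1 (illustration only):
# the line Re z = -alpha in the left half-plane is mapped by w = (z+1)/(z-1)
# onto the circle through +1 and S(-alpha) = (alpha-1)/(alpha+1), i.e. the
# circle with center alpha/(alpha+1) and radius 1/(alpha+1).
alpha = 2.0
center = alpha / (alpha + 1.0)
radius = 1.0 / (alpha + 1.0)
for t in (0.0, 0.5, -3.0, 100.0):
    z = complex(-alpha, t)            # a point on the straight line Re z = -alpha
    w = (z + 1.0) / (z - 1.0)
    assert abs(abs(w - center) - radius) < 1e-12
print(round(center, 6), round(radius, 6))   # -> 0.666667 0.333333
```

The same check with very large |t| illustrates that the images accumulate at the point +1 = S(∞).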
The constructive methods to be derived make additional use of the following theorem:
Theorem 5.2 The inverse Möbius-transformation z = (w+1)/(w-1) maps circles with the center at the origin and the radius r < 1 onto circles containing the points γ = (r+1)/(r-1) and 1/γ, with center M = (1/2)(γ + 1/γ). Concerning these circles in the w-plane, their interior is mapped onto the interior of the circles in the z-plane.
Figure 5.3: Mapping properties of the inverse Möbius-transformation S^{-1}(w) in the unit circle
Proof: Every circle contained in the unit circle is mapped into the left half-plane. For the circles in the w-plane with their center at the origin, their intersections ±r with the real axis are considered. These points possess the following images:

S(r) = (r+1)/(r-1) =: γ ∈ IR and S(-r) = (-r+1)/(-r-1) = (r-1)/(r+1) = 1/γ ∈ IR.
B. Gross
Because of the orthogonal intersections of these circles in the w-plane with the real axis, this kind of intersection must also be true for their images in the z-plane. Consequently, these images are circles, with the points γ and 1/γ on their circumference. The origin of the w-plane is in the interior of any circle under consideration in this plane. This origin is mapped into the point z = -1. Consequently, the interior of these circles in the w-plane is mapped onto the interior of their image-circles in the z-plane. This completes the proof. □
5.3 First Constructive Method for a Lower Bound of the Degree of Stability
For an asymptotically stable matrix A ∈ IR^{n×n}, by definition, all eigenvalues are located on or to the left of a straight line in the left half-plane which is parallel to the imaginary axis. A constructive method is desired which yields a close lower bound of σ, making use of the Cordes-Algorithm as presented in Subsection 3.5. A parallel to the imaginary axis is considered which intersects the real axis at -α with α ∈ IR+. By means of w = (z+1)/(z-1) and because of Theorem 5.1, this line is mapped onto the circle containing the points 1 and (α-1)/(α+1), with its center at M = α/(α+1) ∈ IR. The half-plane to the left of this line is mapped into the interior of the circle.
Figure 5.4: Transformation of the half-plane { z ∈ C : Re z < -α } into the unit circle
This circle is now shifted parallel to the real axis such that it takes up a position with the origin as its center. This shift can be expressed by means of the Möbius-transformation

v = w - α/(α+1)  and  S(z) = (z+1)/(z-1) - α/(α+1).
In the new position, the circle intersects the real axis at the points ±1/(α+1).
Figure 5.5: Centering of the disk by means of a shift parallel to the real axis
The corresponding Möbius-transformation for a matrix B is given by

C := B - (α/(α+1)) I,

with I the identity matrix. When B is replaced by I + 2(A - I)^{-1}, there follows

S(A) := C = (1/(α+1)) I + 2(A - I)^{-1}.
The following theorem shows the relationship between the eigenvalues of A and C.
Theorem 5.3 Provided the spectral radius of the matrix S(A) = C = (1/(α+1)) I + 2(A - I)^{-1} is less than 1/(α+1), then all eigenvalues of A ∈ IR^{n×n} possess real parts smaller than -α.
Proof: An arbitrary but fixed eigenvalue of S(A) = C = (1/(α+1)) I + 2(A - I)^{-1} is denoted by μ. Provided ρ(C) < 1/(α+1), then μ is confined to the interior of a circle in the v-plane with its center at the origin and its radius 1/(α+1). Because of Theorem 3.5, λ = S^{-1}(μ) is to the left of a straight line in the z-plane parallel to the imaginary axis, which intersects the real axis at -α. Since this is true for all eigenvalues μ of C, this completes the proof. □

The condition ρ(C) < 1/(α+1) may be replaced by the equivalent condition ρ((α+1)C) < 1, which can be directly tested by means of the Cordes-Algorithm. An α close to optimal can be determined by use of a bisection method.
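For orientation, the first method can be sketched in ordinary floating-point arithmetic as follows; numpy's eigvals is used here as a non-verified stand-in for the Cordes-Algorithm of the text, and the test matrix as well as the function names are illustrative assumptions.

```python
import numpy as np

def condition_holds(A, alpha):
    """Test of Theorem 5.3: rho((alpha+1)C) < 1 with
    C = (1/(alpha+1)) I + 2 (A - I)^{-1}; eigvals replaces the
    verified Cordes-Algorithm of the text."""
    n = A.shape[0]
    C = np.eye(n) / (alpha + 1.0) + 2.0 * np.linalg.inv(A - np.eye(n))
    return float(np.max(np.abs(np.linalg.eigvals((alpha + 1.0) * C)))) < 1.0

def sigma_lower_bound(A, iters=60):
    """Bisection for an alpha close to optimal; any certified alpha is a
    lower bound of the degree of stability sigma."""
    lo, hi = 0.0, 1.0
    while condition_holds(A, hi):            # enlarge the bracket first
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if condition_holds(A, mid) else (lo, mid)
    return lo

A = np.array([[-3.0, 1.0],
              [0.0, -2.0]])                  # eigenvalues -3 and -2, sigma = 2
print(round(sigma_lower_bound(A), 6))        # -> 2.0
```

In a verified implementation, the spectral-radius test would be carried out with the Cordes-Algorithm on interval data, so that a successful test is a mathematical proof rather than a numerical indication.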
5.4 Second Constructive Method for a Lower Bound of the Degree of Stability
A matrix A ∈ IR^{n×n} is asymptotically stable if and only if all eigenvalues of A are confined to the left half-plane. A Möbius-transformation of A into S(A) = B = I + 2(A - I)^{-1} is considered. Because of Corollary 3.7, the eigenvalues λ_1, ..., λ_n of A then are transformed into the eigenvalues μ_i = (λ_i + 1)/(λ_i - 1), i = 1, ..., n, of B.
The set of these eigenvalues μ_i is confined to the interior of the unit circle. It is assumed that the spectral radius of B is not only smaller than one but also smaller than 1/β with a β ∈ IR+ such that β > 1. It is then possible to derive a set in the left half-plane which contains all eigenvalues of A. In conjunction with a trial and error approach, a bisection method will be used to determine a suitable value of β. For this purpose, the circle with the center at the origin of the w-plane and the radius 1/β is transformed into the z-plane. Because of Theorem 5.2, the image is a circle intersecting the real axis at γ := (1+β)/(1-β) and 1/γ, with its center at M := (1/2)(γ + 1/γ) ∈ IR in the left half-plane.
Figure 5.6: Transformation of the disk { w ∈ C : |w| < 1/β } into the left half-plane
This circle in the z-plane contains all eigenvalues of A, including the one with the maximal real part. Consequently, 1/γ, as the larger one of these points of intersection, is a lower bound of the degree of stability. For this estimate to be fairly close, the eigenvalue of A with maximal real part must be real and closest to the circle with the center at M.
Figure 5.7: Dependency of the disk size on the position of the eigenvalues
This bound of the degree of stability may be rather coarse in the case that γ is determined by eigenvalues other than the one with the maximal real part. As an example, an eigenvalue λ = -10 is considered. Then γ < -10 and 1/γ > -1/10, even though λ = -10 is irrelevant concerning the stability of A. As another example, a pair of conjugate complex eigenvalues is considered, λ_{1/2} = -1 ± 10i. This pair is transformed into

μ_{1/2} = (25 ∓ 5i)/26  with  |μ_1| = |μ_2| ≈ 0.9806.

Provided it can be verified that the spectral radius of B is smaller than 1/β = 0.99, the inverse transformation yields a circle in the z-plane with its center at M = -99.5025, which intersects the real axis at γ = -199 and 1/γ = -1/199 ≈ -0.005025. The pair of eigenvalues λ_{1/2} is irrelevant concerning the stability of A; in fact, it generates a rapidly decaying oscillation that is governed by e^{-t} sin 10t.

In some cases, it is undesirable that the point -1 is always in the interior of the circle intersecting the real axis at γ and 1/γ. In fact, the lower bound 1/γ for the degree of stability is unfavorable in the case that the maximal real part of the eigenvalues is smaller than -1. Consequently, there are practical restrictions concerning the applicability of the second constructive method. Nevertheless, this method yields safe information with respect to the location of all eigenvalues in the left half-plane. This may be of interest for problems other than the one of the asymptotic stability of A.
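The numbers of this worked example can be reproduced directly; the following check (an added illustration, not from the text) evaluates the Möbius image of λ = -1 + 10i and the back-transformed circle for 1/β = 0.99.

```python
# Reproducing the worked example (illustrative check):
lam = -1.0 + 10.0j
mu = (lam + 1.0) / (lam - 1.0)           # = (25 - 5i)/26
print(round(abs(mu), 4))                 # -> 0.9806

r = 0.99                                 # assumed bound rho(B) < 1/beta = 0.99
gamma = (r + 1.0) / (r - 1.0)            # Theorem 5.2 with the radius r = 1/beta
M = 0.5 * (gamma + 1.0 / gamma)
print(round(gamma, 6), round(M, 4), round(1.0 / gamma, 6))
# -> -199.0 -99.5025 -0.005025
```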
5.5 Third Constructive Method for a Lower Bound of the Degree of Stability
Any suitable matrix norm provides a trivial upper bound of the spectral radius of a matrix A ∈ IR^{n×n} since ρ(A) ≤ ||A|| < ∞. Both the row sum norm ||A||_∞ and the column sum norm ||A||_1 of A depend only on the elements of A. If a := min{||A||_1, ||A||_∞}, then all eigenvalues of A are confined to a circle with the center at the origin and the radius a. Correspondingly, ρ(A) ≤ a.

If a > 1, A for the purpose of the subsequent analysis is multiplied by 1/a. If min{||A||_1, ||A||_∞} ≤ 1, a is equated to one. As compared with the eigenvalues of the matrix A, the ones of Ã := (1/a)·A have factors 1/a ≤ 1, and they are confined to the unit circle. Consequently, ρ(Ã) ≤ 1.
Figure 5.8: Multiplication by the factor 1/a
With z̃ := (1/a)z, the Möbius-transformation w̃ := (z̃+1)/(z̃-1) is now applied with respect to the unit circle. The image is the union of the left half-plane and the imaginary axis. For matrices, this corresponds to the Möbius-transformation B̃ := I + 2(Ã - I)^{-1}.

The multiplication of a matrix A with 1/a ∈ (0,1] does not affect the location of the eigenvalues either in the left half-plane or on the imaginary axis or in the right half-plane. If a > 1, the eigenvalues of Ã := (1/a)·A are closer to the imaginary axis than the ones of A. Consequently, a matrix A ∈ IR^{n×n} is asymptotically stable if and only if the matrix Ã := (1/a)A possesses this property. For this to be true, it is necessary and sufficient that the spectral radius of the correspondingly transformed matrix B̃ := I + 2(Ã - I)^{-1} is less than one. In the case that ρ(B̃) ≥ 1, there are eigenvalues of A either on the imaginary axis or in the right half-plane.
If ρ(B̃) < 1, then, analogously to the second constructive method, a β ≥ 1 as large as possible is determined such that ρ(B̃) < 1/β. Then all eigenvalues of B̃ are confined to the intersection of the following sets:
- the union of the left half-plane and the imaginary axis and
- the circle in the w̃-plane with the center at the origin and the radius 1/β.
With reference to the left part of Figure 5.9, this is the union of the left half-circle and the segment of the imaginary axis that is contained in this circle.
The circle in the w̃-plane with the center at the origin and the radius 1/β is now transformed into the z̃-plane. Just as in the case of the second constructive method, the image is a circle with the center at M̃ := (1/2)(γ̃ + 1/γ̃) ∈ IR, which intersects the real axis at γ̃ := (1+β)/(1-β) and 1/γ̃. The Möbius-transformation z̃ := (w̃+1)/(w̃-1) maps the union of the left half-plane and the imaginary axis onto the unit circle and its boundary. The intersection of these two images in the z̃-plane is lens-shaped. Under the transformation z̃ = (w̃+1)/(w̃-1), the points ±i/β are mapped into the points of intersection of the bounding circles,

-(β ± i)²/(1 + β²).
Figure 5.9: Transformation of the half-circle onto a lens-shaped domain
The inverse transformation from the z̃-plane into the z-plane corresponds to a dilatation by the factor a. Consequently, the eigenvalues of A ∈ IR^{n×n} are confined to the intersection of the following sets:
- a circle with the center at the origin and the radius a and
- a circle with the center at M := aM̃ which intersects the real axis at aγ̃ and a/γ̃.
Therefore, a/γ̃ is a lower bound of the degree of stability. Here, too, the condition ρ(βB̃) < 1 can be verified by means of the Cordes-Algorithm, and a close to optimal β can be determined by means of a bisection method.
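The scaling factor a of the third method is elementary to compute. The following sketch (an added illustration; the sample matrix is invented) checks the trivial bound ρ(A) ≤ a := min{||A||_1, ||A||_∞} and the convention a := 1 whenever the minimum is ≤ 1.

```python
import numpy as np

def scaling_factor(A):
    """a := min(||A||_1, ||A||_inf), equated to one if the minimum is <= 1;
    all eigenvalues of A lie in the disk |z| <= a."""
    a = min(np.linalg.norm(A, 1), np.linalg.norm(A, np.inf))
    return max(float(a), 1.0)

A = np.array([[-2.0, 1.0],
              [0.5, -3.0]])
a = scaling_factor(A)
rho = float(np.max(np.abs(np.linalg.eigvals(A))))
print(a, rho <= a)                        # -> 3.5 True
```

Both norms are cheap (sums of absolute entries), which is why they are attractive for obtaining the a priori enclosure disk before the Möbius transformation is applied.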
5.6 Fourth Constructive Method for a Lower Bound of the Degree of Stability
A matrix A ∈ IR^{n×n} is asymptotically stable if and only if all eigenvalues of A are confined to the left half-plane. Then there exists a minimal distance of the eigenvalues from the imaginary axis. By definition in Subsection 5.1, this is the degree of stability σ of A. Since σ is unknown, a lower bound will now be determined: a δ ∈ IR+ is to be determined which is as large as possible but still possesses the property of a lower bound of σ.

As applied to the eigenvalues of A, a Möbius-transformation z̃ := z + δ yields a shift to the right of the spectrum of A. For A, the corresponding shift is given by

Ã := A + δI.

If δ < σ, the eigenvalues λ̃ := λ + δ of Ã are confined to the left half-plane. Otherwise, Ã is not asymptotically stable and the degree of stability of A is not bounded by δ. The asymptotic stability of Ã can be verified by means of the constructive method as represented in Subsection 3.4. For this purpose, the left half-plane is mapped into the interior of the unit circle, making use of the Möbius-transformation w = (z̃+1)/(z̃-1). Subsequently, it is tested whether or not |μ̃| < 1 holds for the transformed eigenvalues μ̃ := (λ̃+1)/(λ̃-1). The following theorem establishes a relationship between the eigenvalues of A and B̃:
Theorem 5.4 For a matrix A ∈ IR^{n×n}, it is assumed that ρ(B̃) < 1 for the spectral radius of the transformed matrix B̃ := I + 2(A + (δ - 1)I)^{-1}. Then,
1. δ > 0 is a lower bound for the distance of the eigenvalues of A from the imaginary axis and
2. the degree of stability of A exceeds δ.

Proof:
1. If Ã := A + δI, then B̃ := I + 2(Ã - I)^{-1}. Provided ρ(B̃) < 1, then Ã is asymptotically stable because of Subsection 3.4. Consequently, all eigenvalues of Ã possess negative real parts. Additionally, δ > 0 is a lower bound of the distance of the eigenvalues of A from the imaginary axis.
2. The degree of stability has been defined by means of -σ := max_{i=1(1)n} Re λ_i. Therefore, δ is a lower bound of σ since all eigenvalues of A are located to the left of a straight line parallel to the imaginary axis intersecting the real axis at -δ. □
Here, too, ρ(B̃) < 1 can be verified by means of the Cordes-Algorithm. In many cases, a bisection method will yield an acceptable lower bound for the degree of stability. Just as in the case of the first constructive method, there is no further information on the location of the eigenvalues in the left half-plane.
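Algorithmically, the fourth method is the simplest of the four: shift by δ, transform, test the spectral radius, bisect on δ. A floating-point sketch following Theorem 5.4 (with eigvals as a non-verified stand-in for the Cordes-Algorithm; the matrix and the function names are illustrative):

```python
import numpy as np

def exceeds_degree(A, delta):
    """Theorem 5.4 test: rho(B~) < 1 for B~ = I + 2 (A + (delta-1) I)^{-1}
    certifies that the degree of stability of A exceeds delta."""
    n = A.shape[0]
    B = np.eye(n) + 2.0 * np.linalg.inv(A + (delta - 1.0) * np.eye(n))
    return float(np.max(np.abs(np.linalg.eigvals(B)))) < 1.0

def degree_lower_bound(A, iters=60):
    lo, hi = 0.0, 1.0
    while exceeds_degree(A, hi):             # enlarge the bracket first
        lo, hi = hi, 2.0 * hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if exceeds_degree(A, mid) else (lo, mid)
    return lo

A = np.array([[-1.0, 10.0],
              [-10.0, -1.0]])                # eigenvalues -1 +/- 10i, sigma = 1
print(round(degree_lower_bound(A), 6))       # -> 1.0
```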
5.7 Additional Applications of the Four Constructive Methods
The four constructive methods as presented in Subsections 5.3-5.6 have been developed in view of the degree of stability. As a supplement, the second and the third method yield sets in the left half-plane containing all eigenvalues. Additionally and with only slight modifications, in particular the fourth method admits further applications which may be of interest. This method will now be used for the determination of an upper bound of the maximal real part of the eigenvalues of an arbitrary matrix A ∈ IR^{n×n}. For this purpose, the shift δ is replaced by -δ. Correspondingly, the spectrum of the matrix A ∈ IR^{n×n} is moved such that all eigenvalues of A - δI are confined to the left half-plane. Then all eigenvalues of A possess real parts less than δ, as is shown in Figure 5.10.
Figure 5.10: Estimates of the maximal real part of the eigenvalues
Additionally and with only slight modifications, the fourth method admits the determination of a lower bound of the real parts of all eigenvalues of A. For this purpose, the matrix A ∈ IR^{n×n} is replaced by the matrix -A. Correspondingly,
- the eigenvalues λ_i of A are replaced by the eigenvalues -λ_i of -A and
- the eigenvalue of A with the minimal real part now becomes the one with the maximal real part of -A, which is bounded as has been outlined before.
In this way, a strip can be determined containing all eigenvalues of A and being bounded by straight lines parallel to the imaginary axis. In many cases, this strip represents an acceptable confinement for the eigenvalues of A. Figure 5.11 displays parameters δ' and δ to be determined by means of bisection methods.
For the construction of a strip of this kind, the other three methods may also be used. This only requires a shift of the spectra of A or -A, respectively, such that all eigenvalues are confined to the left half-plane. The corresponding shifts are A - δI or -A - δ'I, respectively. Consequently, a strip with the properties outlined before may be determined by means of each one of the four methods. Since each one yields a set containing all eigenvalues of A, the intersection of these four sets represents a final enclosure with a correspondingly smaller overestimate. These four sets are (i) strips parallel to the imaginary axis, as following from the first and the fourth method, (ii) a circle due to the second method, and (iii) a lens-shaped domain due to the third method. The sets addressed in (i)-(iii) yield upper and lower bounds for the real parts of all eigenvalues; the ones addressed in (ii) and (iii) yield corresponding bounds for the imaginary parts.
5.8 Examples for Bounds of the Eigenvalues
In this subsection, examples for applications of the four constructive methods are presented for the case of point matrices. Since these examples serve as test problems, the eigenvalues of the investigated matrices were initially chosen.
The examples were constructed as follows:
- a Real Canonical form Ã ∈ IR^{n×n} of a matrix was adopted by means of its chosen real or conjugate complex eigenvalues;
- these eigenvalues are simple or multiple ones; they are confined to the left half-plane; furthermore, the numbers chosen for the real or imaginary parts of these eigenvalues are integers or they possess only one additional non-zero digit following the decimal point;
- by use of an invertible elementary transformation with a matrix T, a non-Canonical matrix A was generated which is similar to Ã, A := T^{-1}ÃT;
- the elementary transformation by use of T induces a suitable linear combination of two rows of Ã; the transformation by means of T^{-1} then induces a corresponding linear combination of two columns of TÃ. Consequently, the elements of A are represented by means of numbers with only a few non-zero digits in a normalized floating point representation. A sequence of elementary transformations, T_1, T_2, ..., was carried out. This sequence was truncated when the majority of all elements of the finally generated matrix A were non-zero. The sequence of these transformations can then be represented by means of one matrix T.
In the first example, all chosen eigenvalues are real: λ_{1/2} = -1 possess the multiplicity two; λ_3 = -2 and λ_4 = -10 are simple. A sequence of elementary transformations generated the following matrix:

-5.5 4.5 -4.5 -4.5 4 -6 4 -9 9 -10 -9 8.5 -9.5 8.5 7.5

The following table lists the upper and the lower bounds which were determined for the real parts Re λ_i and the imaginary parts Im λ_i of the set of eigenvalues, making use of the four constructive methods.
Method    Bounds for the eigenvalues
1         -10.00003100932 < Re λ < -0.99999729704
2         -10.00103384201 < Re λ < -0.09998966264,  -4.95052208968 < Im λ < 4.95052208968
3         -29 < Re λ < -0.99993316106,  -28.93112558652 < Im λ < 28.93112558652
4         -10.00003020952 < Re λ < -0.99999697019
The first, third, and fourth method yield almost sharp lower bounds for the degree of stability of A. The circular enclosure of all eigenvalues as due to the second method is determined by the eigenvalue λ_4 = -10. For this eigenvalue, an almost sharp lower bound was calculated by use of the first, the second, and the fourth method.

The construction of the matrix A in the second example started from the following choices of its eigenvalues: λ_{1/2} = -1, λ_{3/4} = -2, λ_{5/6} = -10, λ_{7/8} = -0.1 + 10i, and λ_{9/10} = -0.1 - 10i, all with multiplicity two. Any numerical approximation of the spectrum of a matrix with these eigenvalues is ill-conditioned. The Real Canonical matrix concerning λ_{7/8} = -0.1 + 10i and λ_{9/10} = -0.1 - 10i is represented by

[ -0.1  -10     0     0 ]
[  10   -0.1    0     0 ]
[   0     0   -0.1  -10 ]
[   0     0    10   -0.1 ]
A sequence of elementary transformations of -1.9 30 0.9 -1 -0.9 -1 1.71 2 0 1.9 0 10 0 1 0 -0.1 0.5 0.45 0 10
58.5 1.5 0.5 -7.65 2.85 15 1.5 -0.15 -5.25 15
9 1 -1 -8.1 0 0 0 0 0.5 0
30 0 0 1 -0.1 10 1 -0.1 0 10
:I
-0.1
2

yielded the following matrix:
4.7 -1 1 -0.91 -10 -0.1 0 1 -0.5 1.9
0 0 0 0 0 0 0 0 0 0 0 0 -1.1 -10 0.9 10.1 0 0 0 0
39 1 7 -5.1 1.9 10 1 -0.1 -13.5 10
-0.2 -1.8 1.8 -3.42 0 0 0
0 -0.9 -2
The following table lists the upper and the lower bounds which were determined for the real and the imaginary parts by the four constructive methods:

Method    Bounds for the eigenvalues
1         -10.0000011231 < Re λ
2         -1010.1001809740 < Re λ,  -505.0495955362 < Im λ < 505.0495955362
3         -107.9 < Re λ,  -107.8998178053 < Im λ < 107.8998178053
4         -10.0000013608 < Re λ
0, the tangent directions of y* and of some of the solutions ŷ* may begin to be strongly different. This then may be followed by a considerable growth of the distance of y* and some of the solutions ŷ* just addressed. In a situation such as this one, for an interval of time beginning at t_e, ỹ may remain close to one of these solutions ŷ* rather than close to y*. In [5], this performance of a difference approximation has been denoted by the expression "diversion" of ỹ, provided the distance of y* and ỹ becomes large for some time t > t_e. The cause of this phenomenon is at any time t ∈ (0, t_e); however, the effect of moving away from y* takes place for t > t_e.

Remark: For f = f(y) and locally at y, the Lyapunov exponents may be used as a measure of the rate of expansion of the local tangent directions of neighboring true ODE-solutions, e.g., [16, p.283].

A pole of the ODEs (3.1) is considered together with the set of all solutions of (3.1) which do not enter this pole. In a small neighborhood of the pole, this set possesses a complicated topographical structure since, generally,
- there are large values of the local curvature of orbits coming close to the pole and
- the local tangent directions of neighboring solutions are strongly different.
This is one of the previously outlined situations favoring a diversion of a difference approximation; see Subsection 9.2 for an example. The presence and the location of a pole of the ODEs (3.1) can be determined in an a priori fashion by means of an inspection of (3.1) or the calculation of a root. This kind of determination is also possible concerning the real stationary points of (3.1), which satisfy f(y) = 0. Since f in (3.1) is sufficiently smooth, locally attractive stationary points are the only ones which, as t → ∞, may belong to more than one true solution of (3.1).
Discretizations: Theory of Failures
In addition to the influence of a pole or a stationary point, the topographical structures of a phase space are also affected by the existence and (if present) the location of
- attractors of any kind in the set of solutions or
- stable invariant manifolds, M^s, and unstable invariant manifolds, M^u, both belonging to a stationary point or a periodic solution (allowing a local linearization of the ODEs), see e.g. [16] and [7, p.594].
Subsequently, the shorter expressions stable (or unstable) manifold will be used. For n = 2 in (3.1), there are only two kinds of attractors,
- stationary points, with dimension zero, and
- limit cycles, with dimension one.
For n > 2, there may be additionally
- stationary points or periodic solutions, which are locally attractive tangent to M^s and locally repellent tangent to M^u, and
- strange attractors (e.g., [16, p.256]) which
  - cover a bounded subset in the phase space of solutions,
  - consist of a non-denumerable set of true ODE-solutions which remain in this set for all times after entering it, such that
  - the dimension of this attractor is not an integer number.

Remarks: 1) An attractor is said to be strange if it contains a transversal homoclinic orbit [16, p.256]. This is a property of the horseshoe map [16, p.319] referred to at the end of Section 2; see [17] and [46] for orbits of this kind. 2) An orbit going from one stationary point (or one periodic solution) to another is said to be heteroclinic. If the orbit returns to the same stationary point (or periodic solution), it is said to be homoclinic. In the phase plane of the pendulum equation φ'' + α sin φ = 0, there are heteroclinic orbits starting on the manifold M^u of one saddle point and going over to the manifold M^s of another saddle point [20]; see Example 5.22 for this ODE. 3) A strange attractor is topographically characterized by the predominance of locally divergent tangent directions of the orbits. 4) Inside a strange attractor, manifolds M^s and M^u (if there are any) possess complicated topographical shapes.

Manifolds M^s or M^u, if they exist, either
- start and terminate at stationary points y_i* or at periodic solutions y*_per, or
- M^u does not terminate as t increases to infinity and M^s has this property as t decreases to minus infinity.
E. Adams
Since M^s and M^u consist of true ODE-solutions, for finite t they cannot merge with another true solution. If M^s or M^u are (n-1)-dimensional hypersurfaces in the phase space IR^n, they separate adjacent sets of true ODE-solutions. For any true ODE-solution y* starting outside of M^u, now a difference approximation ỹ is considered in the case that M^u is an (n-1)-dimensional hypersurface. Because of numerical errors, ỹ may then penetrate M^u. Generally, this is practically irrelevant since all true ODE-solutions sufficiently close to M^u will subsequently remain in a small neighborhood of M^u, at least for a certain interval of time.

Now, any true ODE-solution starting outside of M^s is considered in the case that M^s is an (n-1)-dimensional hypersurface. A difference approximation ỹ then
(i) may penetrate M^s. Subsequently,
(ii) ỹ will approach the attractor belonging to M^s, however, on the "wrong side" of M^s, as compared with y*; then, close to the attractor, ỹ will be deflected to follow M^u, however,
(iii) in an incorrect direction on M^u with respect to M^s and y*.
Here, the penetration as the "cause" may take place considerably earlier than the "effect", the deflection of ỹ to follow M^u in a "wrong direction".

Concerning sets of true ODE-solutions, there is generally an increase of the complexities as the dimension n of the system (3.1) of ODEs increases, perhaps in the context of a mathematical simulation. Then, there is a correspondingly increased unreliability of computed difference approximations. Therefore, when a large number of degrees of freedom of a real world problem is to be simulated by means of a nonlinear mathematical model (3.1), this model is frequently of no practical significance.

Remarks: 1) Generally, the complexities of the sets of solutions increase strongly as nonlinear ODEs are replaced by nonlinear PDEs. 2) Subsection 9.4 is devoted to an introductory discussion of diverting difference approximations in the case of evolution-type PDEs.

Up to this point in the present section, the discussion was concentrated on topographical structures of sets of true ODE-solutions and their difference approximations. For the remainder of this section, individual true ODE-solutions will be considered in view of their complexities. In the subsequent discussion of this problem, only a few cases are chosen:
(i) when the period T of a true T-periodic ODE-solution y*_per is large, it is correspondingly difficult by means of traditional numerical methods to distinguish
between this true solution and a neighboring true aperiodic solution;
(ii) now y' = f(y) in (3.1) is replaced by y' = g(t,y), where the dependency of g(t,y) on t is governed by input functions h_1 and h_2 such that h_1 is T_1-periodic and h_2 is T_2-periodic; it is assumed that T_1/T_2 = p/q with numbers p, q ∈ IN; then y' = g(t,y) may possess a T-periodic solution if T := qT_1 = pT_2; very small changes of T_1 or T_2 may render p and q large such that T becomes correspondingly large;
(iii) provided (3.1) is nonlinear and possesses a T-periodic forcing term, the true solutions of (3.1) may consist of nT-periodic responses [26, p.179] which, for n ∈ IN, are said to be subharmonics;
(iv) a linear ODE may be on the boundary of stability and, therefore, highly sensitive with respect to perturbations; an example is the dependency of the solutions of y'' + ky' + y = 0 on the choice of k = 0 or k = +ε or k = -ε with ε ∈ IR+ arbitrarily small.

4. QUALITATIVE THEORY OF DIFFERENCE METHODS AND SHADOWING THEOREMS

Without loss of generality, the discussions in the present section are mainly confined to scalar IVPs

y' = f(t,y) for t ∈ [t_0, t_m]; y: IR → IR; f: IR×D → IR; D ⊂ IR; y(t_0) = y_0 ∈ D, (4.1)

where f is sufficiently smooth, and to applications of explicit one-step methods ([47], [48]),

y_{j+1} = F(y_j) := y_j + h φ(t_j, y_j, h, f); y_j := y(t_j), j = 0(1)n-1; h := (t_m - t_0)/n, n ∈ IN; y_0 = y(t_0). (4.2)

A family of true difference solutions of (4.2) will be denoted by

y(h) := { y_j | j = 0(1)n } for h = h(n) = (t_m - t_0)/n with n ∈ IN. (4.3)
The symbol ỹ(h) will be used in the case that there are numerical errors in the evaluation of F(y_j) in (4.2). Provided the true solution y* of (4.1) exists, the local discretization error of (4.2) is defined by

τ(t, y*, h) := (y*(t+h) - y*(t))/h - φ(t, y*(t), h, f) for h > 0. (4.4)

The asymptotic theory of difference methods rests on the following conditions:
(i) f and φ are sufficiently smooth; in particular, there exists a Lipschitz constant M_0 ∈ IR+ of φ with respect to its dependency on y_j;
(ii) (4.2) is consistent with order p ∈ IN; then there is an N_0 ∈ IR+ such that |τ(t, y*(t_j), h)| ≤ N_0 h^p for any fixed t_j ∈ [t_0, t_m] and all y_0 ∈ D; (4.5)
(iii) the solution y* of (4.1) to be approximated exists for t ∈ [t_0, t_m] such that y*(t) ∈ D.
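Condition (ii), consistency of order p, is what produces the factor h^p in the global error bound (4.7) below; for the explicit Euler method, p = 1, so halving h roughly halves the global error. A minimal illustration (added here; it uses the logistic ODE (4.8) with its exact solution (4.9), both quoted later in this section):

```python
import math

def euler(y0, h, n):
    # explicit Euler, the special case phi = f of (4.2), applied to y' = y(1 - y)
    y = y0
    for _ in range(n):
        y = y + h * y * (1.0 - y)
    return y

def exact(y0, t):
    # closed-form solution (4.9) of the logistic ODE (4.8)
    return y0 / (y0 + (1.0 - y0) * math.exp(-t))

t_end = 2.0
errors = []
for h in (0.1, 0.05, 0.025):
    n = round(t_end / h)
    errors.append(abs(euler(0.2, h, n) - exact(0.2, t_end)))
print([round(r, 2) for r in (errors[0] / errors[1], errors[1] / errors[2])])
# error ratios close to 2, as expected for consistency order p = 1
```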
such that y*(t) E D. The global discretization error is defined as follows: (4.6) e(t,h): = y(t) - y*(t) for any fixed t ~ [ t o , t ~ ] . Provided (i)-(iii) are satisfied and h is sufficiently small, then there holds [48] (4.7) le(t,h)l 5 hP(No/Mo)(eMo(t-to) - 1) for t~(t~,t,J. This implies the (pointwise) convergence to zero of the h-dependent sequence (4.6) as h + 0. Generally, the estimate (4.7) is not practically useful for any fixed and finite h. Property (4.5iii) makes use of y*. Therefore, (4.7) is meaningless in the case that there is 0 a family of spurious difference solutions, ysp(h), (which does not approach a true solution y* for h + 0, see Sections 2 and 8) or 0 for a fixed h>O, a diverting difference approximation, y(h), (which is close to different true solutions for different intervals of time, see Section 3). Remarks: J . l In the case that y' = f(y) represents a system of ODES, discretizations such as (4.2) are applied individually to the components y'i = fi(y) of the system. The estimate (4.7) then is essentially still valid. J . 2 In the case that (4.2) is replaced by a multi-step discretization, an estimate such as (4.7) requires additionally the property of stability for the discretization. This is automatically satisfied for one-step discretizations. J . 3 The property of stability refers to a linearized stability analysis of the discretization with evaluation "at the point" of the true solution y* of the ODE. This linear analysis supplies a partial information concerning the true nonlinear stability behavior of the discretization. Only a global stability analysis provides a complete knowledge of the bifurcation points of the discretization, which depend on h. This bifurcation analysis is essential in the contemporary analysis of spurious
difference solutions 7sp(h), see Section 8. Frequently Shadowing Theorems are believed to assert the practical reliability of discretizations of ODES. A theorem of this kind states for all j=O(l)N that a computed difference approximation y(h) is within an c-neighborhood of an unknown true difference solution "yh), provided that the numerical errors of Y(h) are smaller than a number 6. The initial values of and y are not necessarily identical. There holds c = c(6,N) where N represents the (finite) number of time steps, each of length h. If there is an c = c(6,j) for j = 1(1)N, c(6,j) can be evaluated quantitatively provided that 6 is confined to the level of the discretization, i.e., 6 then accounts for local rounding errors or errors in the evaluation of the function q5 in (4.2). The local discretization errors, however, are not quantitatively accessible on this level. The Shadowing Theorem to be presented will subsequently be applied to the (scalar) Logistic ODE y' = f(y): = y(1 - y) for ED: = [0,1] with yo = y(to)ED for to = 0. (4.8) The true solutions y* of (4.8) can be represented explicitly [23]: Yo (4.9) Y*O) = YO + (1 The stationary points y = 1 (or y = 0) of (4.8) are stable (or unstable). In the applied literature, discretizations of (4.8) are usually the starting point for discussions of Dynamical Chaos. In fact, this is Computational Chaos since all difference solutions y(h) are spurious unless they resembly y* at least qualitatively. For a scalar ODE (4.1) with fGC2(D), an explicit one-step method (4.2) yields 0 a true orbit (difference solution) v(h) satisfying (4.10) % j + l = F(yj): = y j + h)I (tj,ij,h,f) for j = O(1)N - 1, F: [0, lI-+[O,l],FEC~[O, 11 and, additionally, 0 a " pseudo-orbit" (difference approximation) Y(h) satisfying l$j+1 - F(yj)l < bF for j = O(1)N - 1 with a 6,dR,' (4.11)
where δ_F accounts for local numerical errors in the computational determination of F. For (4.8), an application of the explicit one-step Euler discretization yields the following special case of (4.10):
(4.12) ŷ_{j+1} = F(ŷ_j) := ŷ_j + hŷ_j(1 − ŷ_j), F'(ŷ_j) = 1 + h − 2hŷ_j.
Concerning (4.10), constants σ, τ, and M are defined as follows [10], making
E. Adams
use of DF(y) := F'(y): σ and τ in (4.13) and (4.14), and
(4.15) M := sup{ |D²F(y)| : y ∈ [0,1] }, with D²F(y) = d²F(y)/dy².
The following version of the Shadowing Theorem is stated and proved by Chow & Palmer [10].
Theorem 4.16: For F as specified in (4.10), it is assumed that N and δ_F are sufficiently small such that
(4.17) 2Mστ ≤ 1.
Then there exists an orbit ŷ(h) such that
(4.18) τ(1 + √(1 + 2Mστ))⁻¹ ≤ sup_{j=0(1)N} |ŷ_j − y̅_j| ≤ 2τ(1 + √(1 − 2Mστ))⁻¹ =: δ_y.
Concerning (4.12), y̅_j for j = 0(1)N is now equated to the (locally attractive) stationary point y = 1 of the ODE in (4.8). Then, for h ≠ 1, σ ≤ (N+1)/(1−h)^{N+1}, τ ≤ (N+1)δ_F/(1−h)^{N+1}, and M = |−2h| = 2h, so that
(4.17a) 2Mστ ≤ 4(N + 1)²hδ_F/(1 − h)^{2(N+1)} ≤ 1.
This condition is satisfied if δ_F = δ_F(h,N) is sufficiently small. An estimate of this kind is not meaningful in a neighborhood of the (locally repellent) stationary point y = 0 of (4.8).
Remark: In an example in [10], |ŷ_j − y̅_j| is evaluated quantitatively.
Concerning (4.12), let h_sp := 2. For h > h_sp, the existence of spurious difference solutions y_sp(h) has been shown by May [35], see Section 8.2. In particular, there are then oscillating and even 2^k h-periodic difference solutions y_sp(h) for all k ∈ N. The true solutions y* of (4.8) are non-oscillating. In analogy to (4.17a), the condition (4.17) can be satisfied for h > h_sp provided δ_F = δ_F(h,N) is sufficiently small. The Shadowing Theorem 4.16 then asserts the existence of a true difference solution ŷ(h) that is shadowed by a computed difference approximation y̅(h). Since ŷ(h) and y̅(h) are spurious, they are quantitatively and qualitatively different from all true solutions y* with
representation in (4.9). This relates to the observation made before that δ_F does not account for the local discretization errors. Generally, therefore, Theorem 4.16 is irrelevant with respect to the true solutions y* of the ODEs underlying a discretization (4.10).
For systems
(4.19) y' = f(y) with y: R → R², f: D → R², D ⊂ R², f ∈ C²(D),
Chow & Palmer [11] have proved a Shadowing Theorem concerning explicit one-step discretizations. Additionally in [11], they have presented a quantitative application with respect to the Hénon (difference) equations, whose difference solutions ŷ(h) possess chaotic properties, e.g. [7], [16]. The results by Chow & Palmer [11] confirm a conjecture in the classical paper on shadowing by Hammel et al. ([19], see also [18]):
"Conjecture: For a typical dissipative map F: R² → R² with a positive Lyapunov exponent and a small noise amplitude δ_F > 0, we expect to find a δ_y ≤ √δ_F for a (pseudo-) orbit of length N ≈ 1/√δ_F."
Therefore, the shadowing property may be true for a surprisingly large number of steps in the recursion yielding the sequence {y̅_j}.
Remark: In the literature (e.g. [18], [19]), shadowing has been considered for interval mappings Y_{k+1} = F(Y_k), with Y_{k+1}, Y_k ⊂ R^q intervals with q = 1 or 2. In this context, the validity of shadowing has been related to the property:
(4.20)
Y_0 := ∩_{k=0}^{N} F^{−k}(Y_k) ≠ ∅,
with ∅ denoting the empty set [19, p.467]. For the case of q = 1 and a rigorous application of interval methods, see [41].
Now, the actual computational determination of a pseudo-orbit inside a strange attractor is considered. Numerical results can be described by reference to Sparrow's monograph [43], which is quoted in Section 1. For a further characterization of this set, a fixed choice of y_0 inside a strange attractor is considered. According to [52], the employment of a Cray 1 and a Cyber 205 has yielded grossly different results for j < N with N as small as 50.
Remarks: 1.) Concerning the influence of the local discretization errors in the multi-dimensional case, the preceding discussions stress the fact that the validity of the shadowing property for a pair ŷ(h) and y̅(h) is irrelevant with respect to the true solutions y* of the underlying ODEs.
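The spurious oscillations of the Euler discretization (4.12) for h > h_sp = 2 can be reproduced with a few lines of code. This is an illustrative sketch, not the computation of [10] or [35]; the step sizes and the starting value 0.3 are chosen arbitrarily.

```python
def euler_logistic_orbit(y0, h, n):
    """Iterate the Euler map F(y) = y + h*y*(1 - y) for y' = y*(1 - y)."""
    ys = [y0]
    for _ in range(n):
        y = ys[-1]
        ys.append(y + h * y * (1.0 - y))
    return ys

# h < h_sp = 2: the iterates increase monotonically toward the stable
# stationary point y = 1, as every true solution with y0 in (0,1) does.
tame = euler_logistic_orbit(0.3, 0.5, 200)

# h > h_sp = 2: the iterates settle into a bounded, permanently
# oscillating (spurious) cycle around y = 1; no true solution of the
# ODE oscillates, so this difference solution is purely an artifact.
spurious = euler_logistic_orbit(0.3, 2.5, 2000)

print(tame[-1])          # close to 1.0
print(spurious[-4:])     # oscillating values around 1.0
```

For h = 2.5 the Euler map is conjugate to the logistic map x → 3.5x(1 − x), whose period-doubled cycles are exactly the spurious solutions described by May [35].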
2.) The shadowing property is valid for a finite number N of steps t_{j+1} − t_j = h. Therefore, it is not possible to investigate a sequence ŷ(h) of difference solutions as h = (t_N − t_0)/N → 0.

5. ON THE COMPUTATIONAL ASSESSMENT OF SOLUTIONS OF NONLINEAR ODEs WITH BOUNDARY CONDITIONS (OF PERIODICITY)
5.1 SURVEY OF THE PROBLEM
In view of the title of this paper, it is observed that
(A) the search for periodic solutions y*_per of systems (3.1) leads to boundary value problems (BVPs) with (two-point) boundary conditions of periodicity:
(5.1) y' = f(y) for t ∈ R; f: D → R^n, D ⊂ R^n is open and convex, f is a sufficiently smooth composition of rational and standard functions;
y(T) − y(0) = 0; the period T is either prescribed or to be determined;
(B) the subsequent discussions exhibit the serious potential unreliability of approximations of true solutions y*_per which are determined by means of discretizations;
(C) this therefore stresses the theoretical and practical importance of applications of Enclosure Methods.
Some of the examples to be presented concerning (B) and (C) employ two-point boundary conditions other than conditions of periodicity:
(5.2) y' = f(y) for t ∈ [a,b]; f: D → R^n, D ⊂ R^n is open and convex; g(y(a), y(b)) = 0, a, b ∈ R, b ≠ a; f and g are sufficiently smooth compositions of rational and standard functions.
A true solution of the BVP (5.2) is denoted by y*. Without a loss of generality, the independent variable t can be chosen such that a = 0. Generally in the case of BVPs, the existence of a (classical) solution y* must be verified. Unless a (totally error-controlled) Enclosure Method is employed, a single local numerical error may cause the computed approximation ŷ either
(a) deceptively to satisfy all equations of the discretization even though a true solution y* does not exist, or
(b) to fail to satisfy all these equations since the automatic "guidance and control" is absent which is built into Enclosure Methods.
Examples for case (a) or case (b) are presented in Sections 5.5, 7, and 9. Spurious difference solutions y_sp provide a different class of examples concerning case (a),
see Subsection 8.3. Because of (a) and (b), a reliable computational approximation of periodic solutions y*_per is difficult; this is enhanced by the following facts:
• their location in phase space is unknown;
• frequently, there is more than one solution y*_per; depending on their topographical (or dynamical) properties, perhaps only one of these solutions is of interest; a global search of the phase space may be necessary; in the present paper, this problem is reduced to a search for only one individual true solution y*_per;
• provided T is relatively large, it is generally difficult to distinguish between a T-periodic true solution y*_per and an aperiodic true solution y*; see Section 3.
For some classes of BVPs, existence theorems for solutions y* are proved in the literature, e.g., [9]. Generally, then
• additional qualitative and/or quantitative mathematical work is to be executed in order to verify that all conditions of a theorem of this kind are satisfied;
• only in exceptional cases can it be verified by inspection whether this is so.
Numerous theorems in the literature assert qualitatively the existence of a solution y* = y*(ε) of a perturbed BVP provided |ε| is sufficiently small and the unperturbed BVP is known to possess a solution y* = y*(0). Theorems of this kind are, e.g.,
• concerned with nonlinear autonomous or T-periodic ODEs, [9, p.137, p.139], or
• the KAM-theory ([7], [16], [20]) concerning periodic solutions of nonlinear perturbed autonomous ODEs.
As compared with applications of existence theorems, an employment of Enclosure Methods is advantageous because of their merged properties
• of a verification of the existence and
• of a totally reliable quantitative assessment.
The theoretical and practical importance of totally error-controlled methods is exhibited in Section 7 through discussions of systems of linear ODEs y' = A(t)y with a T-periodic matrix function A and boundary conditions of T-periodicity, y(T) = y(0).
Then there arises the matrix I − Φ*(T), where I is the identity matrix, Φ* represents the fundamental matrix satisfying Φ' = A(t)Φ and Φ(0) = I, and Φ*(T) is the monodromy matrix. The matrix I − Φ*(T) must be [9]
• non-invertible if the homogeneous ODEs are to have T-periodic solutions, or
• invertible provided the ODEs possess an additional forcing term and a unique T-periodic solution is to be determined.
Unless Φ* = Φ*(t) for t ∈ [0,T] is calculated with total error-control, a reliable distinction between the presence or the absence of the property of invertibility for I − Φ*(T) is not possible.
Additional to existence, the following qualitative questions are relevant for a periodic solution y*_per:
(a) whether y*_per is isolated,
(b) whether y*_per is asymptotically stable (with respect to perturbations of the initial data) and, then, locally attractive,
(c) in the case that y*_per is T-periodic and unstable, whether y*_per possesses a stable manifold M^s and an unstable manifold M^u, see Section 3 and [16] or [30].
Properties (b) and (c)
• may be present in the case of dissipative ODEs;
• they are absent in the case of conservative ODEs.
Property (c) is present in the case of the periodic solutions of the Lorenz equations which are discussed in Section 5.4. Concerning the success of applications of numerical methods,
• properties (a) and (b) are desirable and
• property (c) is undesirable,
as is shown for the Lorenz equations in Subsections 5.4 and 9.3. Properties (b) and (c) rest on the linear variational system of the BVP with respect to the solution y*_per under discussion. Consequently, a verification of (b) and (c) requires the reliable knowledge of a sufficiently close approximation of y*_per. By means of Enclosure Methods, property (b) can be verified simultaneously with the enclosure of y*_per.
Remark: A more general "structural stability" of a solution y* of a BVP may be defined with respect to selected perturbations of the ODEs; they may represent, e.g.,
• a simulation of procedural or rounding errors or,
• the influence of celestial bodies not taken into account in the computational determination of an approximation of a (hypothetical) periodic orbit of a planet, or
• the rotational attachment of a gear drive discussed in Section 7, etc.
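The invertibility alternative for I − Φ*(T) can be sketched numerically. The fragment below is an illustration, not a substitute for the totally error-controlled computation of Φ* demanded above: it integrates Φ' = A(t)Φ, Φ(0) = I, with a classical Runge-Kutta method for two trivially T-periodic coefficient matrices, an undamped oscillator (where I − Φ(T) is singular, since all solutions are 2π-periodic) and a damped one (where it is invertible).

```python
import math

def rk4_fundamental(A, T, n):
    """Integrate Phi' = A(t)*Phi, Phi(0) = I, for a 2x2 system with RK4."""
    h = T / n
    phi = [[1.0, 0.0], [0.0, 1.0]]          # columns are basis solutions
    def mul(M, V):                           # 2x2 matrix times 2-vector
        return [M[0][0]*V[0] + M[0][1]*V[1], M[1][0]*V[0] + M[1][1]*V[1]]
    for j in range(n):
        t = j * h
        cols = []
        for c in range(2):
            v = [phi[0][c], phi[1][c]]
            k1 = mul(A(t), v)
            k2 = mul(A(t + h/2), [v[i] + h/2*k1[i] for i in range(2)])
            k3 = mul(A(t + h/2), [v[i] + h/2*k2[i] for i in range(2)])
            k4 = mul(A(t + h), [v[i] + h*k3[i] for i in range(2)])
            cols.append([v[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i])
                         for i in range(2)])
        phi = [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]
    return phi

def det_i_minus(M):
    """det(I - M) for a 2x2 matrix M."""
    return (1.0 - M[0][0]) * (1.0 - M[1][1]) - M[0][1] * M[1][0]

T = 2.0 * math.pi
undamped = rk4_fundamental(lambda t: [[0.0, 1.0], [-1.0,  0.0]], T, 4000)
damped   = rk4_fundamental(lambda t: [[0.0, 1.0], [-1.0, -0.1]], T, 4000)

print(det_i_minus(undamped))  # ~0: I - Phi(T) singular, periodic solutions exist
print(det_i_minus(damped))    # clearly nonzero: a unique T-periodic response
```

In floating point, "singular" only means "small within the discretization and rounding errors"; exactly this is why the reliable distinction requires the interval computation of Φ* described in the text.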
5.2 ON TRADITIONAL COMPUTATIONAL METHODS FOR THE DETERMINATION OF PERIODIC SOLUTIONS
In physics and engineering, there are numerous methods for the determination of approximations of periodic solutions of nonlinear ODEs; particularly,
• averaging methods (see [9] and [26]) and
• perturbation methods (see [9] and [26]).
Generally, these methods yield continuously differentiable functions such that
(a) their residual is quantitatively accessible, which however is not a useful gauge with respect to the oscillatory solutions under discussion;
(b) there are no meaningful quantitative error estimates comparable, e.g., to the ones for the local discretization error.
Concerning periodic solutions y*_per, discretizations of ODEs are frequently employed, supplemented by a suitable approximation y(0) of the "initial" vector y*_per(0) of the desired solution. A determination of y*_per(0) may be carried out either
(α) by use of a shooting method (e.g. [48]) in conjunction with, e.g., a Newton iteration of y(0) in order to satisfy the boundary condition of periodicity or,
(β) starting with a suitably estimated vector y(0), a marching method is used, where it is hoped that the computed approximation ŷ = ŷ(t, y(0)) will gradually approach a periodic state, see Section 7.6.
For the execution of either (α) or (β), y(0) must be sufficiently close to the true vector y*_per(0) of the desired isolated periodic solution y*_per. Consequently, both (α) and (β) might have to be preceded by a search in phase space for almost closed orbits. This search is particularly difficult and costly in the case that either the period or the dimension of this space are large. Concerning (β), y(0) must be inside the domain (basin) of attraction of y*_per, whose existence and size are not known. A domain of attraction does not exist in the case of conservative ODEs.
Provided the ODEs are dissipative, a desired marching approximation of y*_per is generally slow in the case of meaningful mathematical simulations of engineering problems; in fact, the dissipation of energy then is relatively small. Consequently, in the course of the correspondingly slow approach to y*_per, there may be a considerable accumulation of the influences of numerical errors.
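The slow approach of a marching approximation in weakly dissipative systems can be made concrete with a small experiment. The van der Pol equation below is a stand-in chosen for illustration (it is not one of the engineering problems of Section 7); with the weak damping parameter mu = 0.05, many revolutions are needed before the computed orbit settles near the limit cycle of amplitude ≈ 2, which is exactly the situation in which numerical errors can accumulate.

```python
def vdp_rhs(y, mu=0.05):
    # van der Pol oscillator: y1' = y2, y2' = mu*(1 - y1^2)*y2 - y1
    return [y[1], mu * (1.0 - y[0] * y[0]) * y[1] - y[0]]

def rk4_march(y0, h, n, mu=0.05):
    ys = [list(y0)]
    y = list(y0)
    for _ in range(n):
        k1 = vdp_rhs(y, mu)
        k2 = vdp_rhs([y[i] + h/2*k1[i] for i in range(2)], mu)
        k3 = vdp_rhs([y[i] + h/2*k2[i] for i in range(2)], mu)
        k4 = vdp_rhs([y[i] + h*k3[i] for i in range(2)], mu)
        y = [y[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(2)]
        ys.append(list(y))
    return ys

h, n = 0.01, 20000                                 # march to t = 200
orbit = rk4_march([3.0, 0.0], h, n)
amp_early = max(abs(p[0]) for p in orbit[:5000])   # t in [0, 50]
amp_late  = max(abs(p[0]) for p in orbit[15000:])  # t in [150, 200]
print(amp_early, amp_late)  # the amplitude creeps from 3 toward ~2
```

Roughly 30 revolutions are consumed before the amplitude settles; with smaller mu the transient, and hence the accumulation window for numerical errors, grows like 1/mu.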
Therefore, the traditional numerical methods outlined in the present subsection suffer from significant reliability problems. This suggests applications of Enclosure Methods, at least on a spot check basis.
5.3 ON THE VERIFIED COMPUTATIONAL ASSESSMENT OF SOLUTIONS OF BVPs OR OF PERIODIC SOLUTIONS
For the purpose of a verifying enclosure of a solution y* of the BVPs (5.1) or (5.2), a simple shooting method will be used. Consequently, an auxiliary IVP is now assigned to either one of these BVPs:
(5.3) y' = f(y) for t ∈ R; f: D → R^n, D ⊂ R^n is open and convex; f is sufficiently smooth; y(0) = s ∈ D.
The solutions of (5.3) are represented by means of an operator Φ_0^t:
(5.4) y*(t,s) =: Φ_0^t s, provided y*(τ,s) ∈ D for τ ∈ [0,t].
The topic of this subsection will now be discussed for the case of the BVP (5.1). A vector (s,T) ∈ R^{n+1} is then to be determined such that
(5.5) y*_per(0,s) = y*_per(T,s).
Consequently, s is located on a T-periodic orbit y*_per. Suitable discretizations of (5.1) and (5.3) are considered which generate approximations ŷ of the true solutions y*_per = y*_per(t) and y* = y*(t,s) of (5.1) and (5.3), respectively; see Subsection 5.4 for examples. Because of (a) and (b) in Subsection 5.1, a totally error-controlled treatment of (5.1) and (5.3) is desirable. This is not possible on the level of a discretization. Concerning BVPs, the following Enclosure Methods are available:
(α) the one by Lohner ([31] or [32]) for general two-point BVPs (5.2) or the special BVP (5.1) of this kind, and
(β) the one by Kühn [29] for BVPs (5.1) with the boundary condition of T-periodicity.
Remark: By means of an ad hoc Enclosure Method for the phase plane, the periodic solution of the Oregonator problem has been verified in [1].
Since (β) has not been published in the generally accessible literature, an outline of this method is presented in Subsection 5.7. Concerning the presently available hardware and software supporting the construction of enclosures, applications of (α) or (β) to nonlinear BVPs are confined to a sufficiently small total order n of the system of ODEs. For examples which so far have been completed successfully, there holds n ≤ 4.
Remark: For linear BVPs, examples with n ≤ 22 have been completed successfully, see Subsection 7.5.
The practical confinement of n to values ≤ 4 is mainly due to the fact that (α) and (β) employ the determination of enclosures for the following extended systems of ODEs:
(i) the given nonlinear system y' = f(y) with order n; the enclosure of the solution y* = y*(t, y(0)) of this system serves as the (interval-valued) input for
(ii) n auxiliary systems of linear ODEs, each of order n.
Generally, the correspondingly high computational cost prohibits the employment of (α) or (β) as interval iteration methods; rather, (α) or (β) should be used as verification methods for candidate enclosures (with very small widths)
• that are nearly centered with respect to a highly accurate approximation ŷ of the desired true solutions,
• whose computational determination requires the execution of a perhaps costly preceding search for ŷ in the phase space.
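The structure of the extended system in (i) and (ii) can be sketched for the pendulum equation y'' = −sin y (n = 2), a hypothetical stand-in: the nonlinear system of order n is integrated together with the n auxiliary linear systems (the columns of the variational matrix), a combined system of order n + n². In plain floating point, without enclosures, this looks as follows; since the Jacobian here has trace zero, the determinant of the variational matrix must stay 1 (Liouville), which serves as a consistency check.

```python
import math

def extended_rhs(z):
    """z = (y1, y2, e11, e21, e12, e22): pendulum state plus the two
    columns of the variational matrix eta' = A(t)*eta, A = Jacobian f_y."""
    y1, y2, e11, e21, e12, e22 = z
    a21 = -math.cos(y1)         # Jacobian of (y2, -sin y1): [[0,1],[a21,0]]
    return [y2, -math.sin(y1),
            e21, a21 * e11,     # first auxiliary linear system
            e22, a21 * e12]     # second auxiliary linear system

def rk4(z, h, n):
    for _ in range(n):
        k1 = extended_rhs(z)
        k2 = extended_rhs([z[i] + h/2*k1[i] for i in range(6)])
        k3 = extended_rhs([z[i] + h/2*k2[i] for i in range(6)])
        k4 = extended_rhs([z[i] + h*k3[i] for i in range(6)])
        z = [z[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(6)]
    return z

# pendulum started at y = 1, y' = 0; variational matrix started at I
z = rk4([1.0, 0.0, 1.0, 0.0, 0.0, 1.0], 0.001, 10000)  # integrate to t = 10
wronskian = z[2] * z[5] - z[4] * z[3]
print(wronskian)  # should remain 1, since the trace of the Jacobian is 0
```

In the Enclosure Methods (α) and (β), every quantity above becomes an interval, which is what makes the cost grow so quickly with n.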
For a system of ODEs, y' = f(y), it is assumed that a periodic solution y*_per and its period T have been enclosed and verified by means of Kühn's enclosure method (β). In view of corresponding applications of discretizations, the stability properties of y*_per are of interest. For an investigation of this problem, it is at first assumed that a T-periodic solution y*_per = y*_per(t) and its period T are (explicitly) known. It is then possible to derive the linear variational system [9, p.10] of y' = f(y) with respect to y = y*_per:
(5.6) η' = A(t)η for t ∈ [0,T], with A(t) := f_y(y*_per(t))
and f_y the Jacobi matrix of f. That fundamental matrix Φ: R → L(R^n) of (5.6) is considered which satisfies Φ(0) = I, with I the identity matrix. Then there holds:
• the monodromy matrix Φ*(T) possesses an eigenvalue λ = 1 [9, p.98];
• therefore, (5.6) possesses a T-periodic solution η*_per = η*_per(t), [9, p.98].
According to [9, p.98-99], there holds:
Theorem 5.7: The system (5.6) is considered. If n − 1 of the eigenvalues of Φ*(T) satisfy |λ_j| < 1, then y*_per possesses the property of asymptotic orbital stability.
By means of the Floquet theory [9, p.58], it follows immediately that y*_per is unstable in the case that there is an eigenvalue λ_k of Φ*(T) with the property |λ_k| > 1. Obviously, y*_per and its period T are (explicitly) known only in trivial
cases. By means of Kühn's Enclosure Method (β), (tight) enclosures can be determined, both for T and for y*_per(t) for t ∈ [0,T]. Then
• there is an interval matrix [A(t)] in (5.6) for t ∈ [0,T], making use of the enclosures of y*_per(t) and T ∈ [T̲, T̄] =: [T], and
• an interval matrix [Φ(t)] has to be determined for t ∈ [0,T].
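Interval matrices such as [A(t)] and [Φ(t)] are built from machine interval arithmetic with directed (outward) rounding. The toy class below sketches only that basic ingredient in Python, using math.nextafter to widen each result by one ulp; this is a crude substitute for the directed roundings of PASCAL-XSC and in no way Lohner's or Kühn's method.

```python
import math

class Interval:
    """Closed interval [lo, hi] with outward rounding after each operation."""
    def __init__(self, lo, hi=None):
        self.lo = float(lo)
        self.hi = float(lo if hi is None else hi)
        assert self.lo <= self.hi

    @staticmethod
    def _out(lo, hi):
        # widen by one ulp on each side: a crude guaranteed enclosure
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._out(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval._out(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        ps = [self.lo * other.lo, self.lo * other.hi,
              self.hi * other.lo, self.hi * other.hi]
        return Interval._out(min(ps), max(ps))

    def contains(self, x):
        return self.lo <= x <= self.hi

# enclose f(y) = y*(1 - y) over y in [0.3, 0.4]
y = Interval(0.3, 0.4)
fy = y * (Interval(1.0) - y)
print(fy.lo, fy.hi)  # an enclosure of the true range [0.21, 0.24]
```

Note the overestimation: naive interval evaluation returns roughly [0.18, 0.28] for the true range [0.21, 0.24]. Controlling such overestimation over long integrations is precisely the achievement of the enclosure algorithms of [31], [32], and [29].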
5.4 ENCLOSURE AND VERIFICATION OF PERIODIC SOLUTIONS OF THE LORENZ EQUATIONS
Kühn [29] has enclosed and verified several periodic solutions of the Lorenz equations
(5.8) y' = f(y) := ( σ(y_2 − y_1), ry_1 − y_2 − y_1y_3, y_1y_2 − by_3 )ᵀ for t ≥ 0; y := (y_1, y_2, y_3)ᵀ,
where b, r, σ ∈ R+ are arbitrary but fixed. This system is a "simple" special case of (3.1). Starting from the PDEs of Fluid Mechanics, Lorenz [33] has derived (5.8) by means of a Fourier method, [8]. Concerning (5.8), the monograph [43] by Sparrow is devoted to discussions of true solutions y* and approximations ŷ. The ODEs (5.8) are dissipative [43]. Generally, they are considered to be the paradigm in the theory of Dynamical Chaos. There are three real-valued stationary points [43]:
(i) for all b, r, σ ∈ R+, the origin (0,0,0)ᵀ and
(ii) for all b, r − 1, σ ∈ R+, the points C_1 and C_2 with
(5.9) the positions (±√(b(r−1)), ±√(b(r−1)), r−1)ᵀ.
For r > 1, the stationary point (0,0,0)ᵀ is locally a saddle point. Attached to this point, there are a two-dimensional stable manifold M^s and a one-dimensional unstable manifold M^u, see [30], [43] and Section 3. For special choices of b, r, σ ∈ R+,
(a) all solutions y* = y*(t,b,r,σ,y_0) can be represented in an essentially explicit fashion or
(b) the set of solutions y* is known to possess simple ("non-chaotic") topographical properties.
For arbitrary b = 2σ ∈ R+ and r ∈ R+, W. F. Ames [5] has found the following equivalent representation of (5.8):
For arbitrary b = 2σ, r ∈ R+ and in the limit as t → ∞, all true solutions y* of (5.8) are confined to a fixed hypersurface in the phase space. Since y_1* and y_2* can be represented explicitly as functions of y_3*, the set of solutions y* of (5.8) then cannot possess "chaotic properties". In fact, this set then is topographically simple since (5.8) is autonomous and satisfies a Lipschitz condition. In particular, this excludes the existence of a strange attractor, see Section 3.
The system (5.8) is invariant under the transformation (y_1, y_2, y_3)ᵀ → (−y_1, −y_2, y_3)ᵀ defined by
(5.11) f(Sy) = Sf(y) with S := diag(−1, −1, 1); S^k = I with k = 2.
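The equivariance (5.11) is easy to verify numerically for the right-hand side of (5.8). The check below uses the parameter values b = 8/3, r = 28, σ = 6 employed by Kühn in this subsection (any b, r, σ would do); the equalities hold exactly in floating point because negation is an exact operation.

```python
def lorenz_f(y, sigma=6.0, r=28.0, b=8.0/3.0):
    """Right-hand side of the Lorenz equations (5.8)."""
    y1, y2, y3 = y
    return [sigma * (y2 - y1), r * y1 - y2 - y1 * y3, y1 * y2 - b * y3]

def S(y):
    # S = diag(-1, -1, 1)
    return [-y[0], -y[1], y[2]]

# S^2 = I and f(S y) = S f(y) hold exactly in IEEE arithmetic,
# since all sign changes commute exactly with +, -, *.
pts = [[1.0, 2.0, 3.0], [-4.5, 0.25, 17.0], [10.95, 21.72, 20.0]]
for y in pts:
    assert S(S(y)) == y
    assert lorenz_f(S(y)) == S(lorenz_f(y))
print("symmetry (5.11) verified on sample points")
```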
Provided there is a T-periodic solution y*_per of (5.8) with the point y_0 on its orbit, then there is also a T-periodic solution ŷ*_per with the point Sy_0 on its orbit. This solution is defined by
(5.12) ŷ*_per(t) := Sy*_per(t) for t ∈ [0,T].
Generally, the orbits of y*_per and ŷ*_per are different. It is assumed that there are a point y_0 and a time τ > 0 such that
(5.13) Φ_0^τ y_0 = Sy_0.
For systems y' = f(y) possessing the properties (5.11) and (5.13), with a k − 1 ∈ N, Kühn's Theorem 5.55 (see Subsection 5.7) asserts the existence of a T-periodic solution y*_per with T = kτ, which is invariant under S. Provided a transformation S possessing these properties is known for a system y' = f(y), the boundary condition y(T) = y(0) may then be replaced by y(τ) = Sy(0); this is advantageous since any numerical method then is to be used only for t ∈ [0, T/k].
For arbitrary b, r, σ ∈ R+, Kühn [29] has shown that all solutions y* = y*(t, y(0)) of (5.8) are contained in the following ellipsoid in the limit as t → ∞:
(5.14) V(y_1, y_2, y_3) := ry_1² + σy_2² + σ(y_3 − 2r)² ≤ α with α = α(b,r,σ); e.g., α < 20071 for b = 8/3, σ = 6, and r = 28; see also [30, p.26].
Consequently, the strange attractor, the stationary points, and all periodic solutions of (5.8) (if there are any) are confined to V. Kühn [29] has organized and applied
(a) an algorithm for the execution of the Enclosure Method (as outlined in Subsection 5.7) for the purpose of the enclosure and verification of periodic solutions y*_per and their periods;
(b) a C-compiler enabling him to run this algorithm on an HP-VECTRA computer, making use of codes available in the language PASCAL-SC;
(c) for the verified periodic solutions y*_per, an approximation of the eigenvalues addressed in Theorem 5.7 and subsequently.
Remark: The language PASCAL-XSC [27] was not yet available in the winter of 1989/1990 when Kühn carried out the numerical work addressed in the list (a) - (c).
For b = 8/3, r = 28, and σ = 6, Kühn [29] has proved the following theorems, which can be stated making use of the notation a_b^c for an enclosure of a number a ∈ R, with b and c as the last mantissa digits of a floating-point representation of the bounds of a.
Theorem 5.15: For (5.8), there exists an isolated T^(1)-periodic solution y*^(1)_per whose orbit intersects the interval (6.83111896933, 3.2213122112, 27)ᵀ ⊂ R³. The period T^(1) is in the interval 0.689918686827 ⊂ R. The fundamental matrix Φ*(T^(1)) of the linear variational system possesses the eigenvalues λ_1^(1) = 1, λ_2^(1) ≈ 1.05092, and λ_3^(1) ≈ 0.00120. There is a T^(1)-periodic orbit ŷ*^(1)_per ≠ y*^(1)_per as defined by (5.12).
Remark: Concerning λ = 1, see the reference to [9, p.98] in conjunction with Theorem 5.7.
Theorem 5.16: For (5.8), there exists an isolated T^(2)-periodic solution y*^(2)_per whose orbit intersects the interval (0.51219159, 2.24983738, 27)ᵀ. The period T^(2) is in the interval 1.75168488. The fundamental matrix Φ*(T^(2)) of the linear variational system possesses the eigenvalues λ_1^(2) = 1, λ_2^(2) ≈ 4.69500, and λ_3^(2) ≈ 9.4324·10⁻⁹. The orbit y*^(2)_per is invariant under the transformation S defined in (5.11).
Theorem 5.17: For (5.8), there exists an isolated T^(3)-periodic solution y*^(3)_per whose orbit intersects the interval (10.952283216, 21.71601368, 20)ᵀ. The period T^(3) is in the interval 2.59427762798. The fundamental matrix Φ*(T^(3)) of the linear variational system possesses the eigenvalues λ_1^(3) = 1, λ_2^(3) ≈ 91.4122, and λ_3^(3) ≈ 1.4052·10⁻¹³. There is a T^(3)-periodic orbit ŷ*^(3)_per ≠ y*^(3)_per as defined by (5.12).
Figure 5.18 exhibits the projections into the y_1-y_2-plane of the orbits of
γ_1 := y*^(1)_per and γ_2 := y*^(2)_per. Figure 5.19 shows this projection for the orbit of γ_3 := y*^(3)_per.
Remark: 1.) Since T^(3) is relatively large, Kühn [29] determined the verifying enclosures of y*^(3)_per and T^(3) by means of a multiple shooting method [48], which is not represented in Section 5.7. In the case of y*^(2)_per and T^(2), Kühn employed a simple shooting method for t ∈ [0, T^(2)/2].
2.) Interval methods have been used in [12] and [42] for the purpose of a verification of the existence of periodic solutions y*_per of the Lorenz equations (5.8). Since the treatment of this problem in the papers just referred to is incomplete, Kühn's verification [29] of the existence of solutions y*_per for (5.8) is the first complete proof.
Figure 5.18: [29] Projections of the T^(1)-periodic solution γ_1 := y*^(1)_per and the T^(2)-periodic solution γ_2 := y*^(2)_per of (5.8)
Figure 5.19: [29] Projection of the T^(3)-periodic solution γ_3 := y*^(3)_per of (5.8)
An arbitrary starting point ŷ^(i)(0) with i = 1 or 2 or 3 is chosen inside any one of the intervals I_i ⊂ R³ as given in Theorems 5.15, 5.16, or 5.17. Since the orbits y*^(i)_per with i = 1 or 2 or 3 are unstable, the orbits ŷ^(i) := y*(t, ŷ^(i)(0)) will generally move away from y*^(i)_per; see Example 5.20 for an approximation of y*^(3)_per. A corresponding growth of the widths of an enclosure of y*^(i)_per is to be expected as t increases when the total interval I_i ⊂ R³ is admitted at time t = 0. For an investigation of this issue, H. Spreuer used the enclosure algorithm with Rufeger's step size control ([38] and [39]) and, as an additional feature, the estimate by Chang and Corliss of the local radius of convergence of the employed Taylor polynomials, see [39]. Spreuer obtained the following results:
1.) After 66 revolutions past y*^(1)_per, the widths of the three computed scalar enclosing intervals are smaller than
2.) At t = 20.8, the width of the three scalar intervals enclosing y*^(3)_per can be represented by the following vector: (0.102, 0.106, 0.297)ᵀ. At this time, the 9th revolution past y*^(3)_per is almost complete.
By means of this application of [39] and its supplement, H. Spreuer additionally found out that there is a misprint in Kühn's original thesis [29] and in the representation of [29] in Theorem 5.17: T^(3) ≈ 2.62612378.
5.5 On Failures of Traditional Numerical Methods as Applied to BVPs
The present subsection is devoted to the presentation of examples for the unreliability of discretizations of nonlinear BVPs in the case that a total error-control is absent. The following two kinds of failures will be observed: the computational assessment of an approximation ŷ which either
(A) deceptively "determines" a non-existing true solution y* or
(B) fails to stay close to a true solution y*, even though the difference initially is zero or arbitrarily small.
Example 5.20: For the Lorenz equations (5.8), W. Espe [14] attempted the determination of an approximation ŷ of the T^(3)-periodic solution y*^(3)_per as asserted in Theorem 5.17. This solution is highly unstable because of the eigenvalue λ_2^(3). Espe
• employed a Runge-Kutta-Fehlberg method of order eight ([47],[48]) with
• its automatic control of the step size h and,
for the starting vector y(0) of a marching method, he chose the midpoint of the interval vector as stated in Theorem 5.17. The distance between this choice of y(0) and y*^(3)_per(0) is at most 10⁻¹⁰. Figures 5.21a - 5.21d present Espe's results:
• in the execution of the first revolution for t ∈ [0, T^(3)], ŷ moves gradually away from y*^(3)_per such that their distance can already be detected within graphical accuracy at t ≈ T^(3): Figure 5.21a;
• in the execution of the second revolution for t ∈ [T^(3), 2T^(3)] and as ŷ and y*^(3)_per simultaneously come close to the stationary point (0,0,0)ᵀ, they start to move apart rapidly such that their distance reaches relatively large values; in Sections 3 and 9, this is interpreted as a diversion of ŷ: Figures 5.21b and 5.21d;
• in the execution of the subsequent four revolutions, the aperiodic orbit shown in Figures 5.21c and 5.21e was generated; this orbit resembles those which, in the non-mathematical literature, are believed to indicate the presence of "chaos", e.g. [16, p.84] or [43, p.31, p.66].
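The rapid separation of nearby orbits that defeated Espe's computation can be illustrated with an elementary experiment. This is a sketch, not a reproduction of [14]: it uses a classical fourth-order Runge-Kutta method instead of the order-eight Runge-Kutta-Fehlberg method, the classical parameter choice σ = 10, b = 8/3, r = 28 instead of Kühn's σ = 6, and an arbitrary starting point.

```python
def lorenz_rhs(y, sigma=10.0, r=28.0, b=8.0/3.0):
    return [sigma * (y[1] - y[0]),
            r * y[0] - y[1] - y[0] * y[2],
            y[0] * y[1] - b * y[2]]

def rk4_orbit(y, h, n):
    for _ in range(n):
        k1 = lorenz_rhs(y)
        k2 = lorenz_rhs([y[i] + h/2*k1[i] for i in range(3)])
        k3 = lorenz_rhs([y[i] + h/2*k2[i] for i in range(3)])
        k4 = lorenz_rhs([y[i] + h*k3[i] for i in range(3)])
        y = [y[i] + h/6*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i]) for i in range(3)]
    return y

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

a0 = [1.0, 1.0, 1.0]
b0 = [1.0, 1.0, 1.0 + 1e-8]     # perturbation far below plotting accuracy
a = rk4_orbit(a0, 0.005, 5000)  # integrate both orbits to t = 25
b = rk4_orbit(b0, 0.005, 5000)
print(dist(a0, b0), dist(a, b))  # separation grows by many orders of magnitude
```

Without an enclosure, nothing in the computed output distinguishes this amplified perturbation from the influence of local rounding and discretization errors; this is the situation interpreted as a diversion above.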
Figure 5.21a: [14] Projection of the first revolution of the Runge-Kutta-Fehlberg approximation ŷ starting in the midpoint of the interval as stated in Theorem 5.17; ŷ is an approximation of the T^(3)-periodic solution y*^(3)_per of (5.8) exhibited in Figure 5.19
Figure 5.21b: [14] Projection of the first and the second revolution of the approximation ŷ; continuation of Figure 5.21a
Figure 5.21c: [14] Projection of the first to the fourth revolution of the approximation ŷ; continuation of Figure 5.21b
Figure 5.21d: [14] Component of the approximation ŷ as a function of t ∈ [0, 4.5], with T := T^(3) ≈ 2.59 demarcated; this relates to Figure 5.21b
Figure 5.21e: [14] Component of the approximation ŷ as a function of t ∈ [0,15], with T := T^(3) ≈ 2.59 demarcated; this relates to Figure 5.21c
This completes Example 5.20 and the demonstration of a failure of type (B). The subsequent Example 5.22 for a failure of type (A)
• is primarily concerned with a problem in elastostatics,
• however, it can be recast into an example in dynamics with periodic solutions [20].
Example 5.22 [24]: The BVP of the nonlinear buckling of a beam is considered such that one end (x = 0) is clamped and the other (x = 1) is free [51, p.70]:
(5.23) y'' + λ sin y = 0 for x ∈ [0,1]; y(0) = 0, y'(1) = 0; λ := P/EI ∈ R+,
with P the axial (non-follower) load and EI the flexural rigidity of the beam. Concerning every (classical) solution of (5.23), a continuation for x ∈ R yields a periodic solution. Under consideration of the boundary conditions, the integration of y'' + λ sin y = 0 for x ∈ [0,1] yields
(5.24) y'(0) =: s ≤ 2√λ.
The true solutions y* = y*(x,λ,s*) of the ODE in (5.23) can be represented by means of elliptic integrals [51, p.71]. Therefore, and by use of the linearized ODE y'' + λy = 0 with y(0) = y'(1) = 0, it is known that
• there is a sequence of bifurcation points λ_n := (2n − 1)²π²/4 for all n ∈ N;
• for λ ∈ (λ_n, λ_{n+1}), there are 2n + 1 real-valued solutions y* = y*(x,λ,±s_i*) with s_0* = 0 and s_i* > 0 for i = 1(1)n allowing the boundary conditions to be satisfied.
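The values λ_n and the solution count are easy to tabulate; the snippet below confirms, e.g., the value λ_7 ≈ 416.99 used in the Runge-Kutta comparison at the end of this example.

```python
import math

def lam(n):
    # bifurcation points of (5.23): lambda_n = (2n - 1)^2 * pi^2 / 4
    return (2 * n - 1) ** 2 * math.pi ** 2 / 4.0

def count_solutions(lmbda):
    # for lambda in (lambda_n, lambda_{n+1}) there are 2n + 1 solutions
    n = 0
    while lam(n + 1) < lmbda:
        n += 1
    return 2 * n + 1

print(lam(1), lam(7))          # 2.4674..., 416.9908...
print(count_solutions(417.0))  # 15 solutions for lambda = 417 > lambda_7
```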
The following auxiliary IVP is assigned to the BVP (5.23), making use of y_1 := y and y_2 := y':
(5.25) y_1' = y_2, y_2' = −λ sin y_1 for x ∈ [0,1]; y_1(0) = 0, y_2(0) = s; λ ∈ R+ fixed.
The shooting parameter s = s(λ) is to be determined such that the boundary condition
(5.26) F(s) := y_2*(1,λ,s) = 0
is satisfied by the solutions y* = y*(x,λ,s) of (5.25). If it exists, a root of (5.26) will be denoted by s* = s*(λ). An interval Newton method [6, p.75] will be used for an enclosure and a verification of a root s*. The correspondingly required derivative F' of F can be determined by means of enclosures of auxiliary functions η_i = η_i(x,λ,s) := ∂y_i*(x,λ,s)/∂s for i = 1 or 2. These functions satisfy the following linear auxiliary IVP, whose ODEs can be derived by means of a differentiation with respect to s of the equations in (5.25) with y_1, y_2 replaced by y_1*, y_2*:
(5.27) η_1' = η_2, η_2' = −(λ cos y_1*(x,λ,s))η_1 for x ∈ [0,1], with η_1(0,λ,s) = 0, η_2(0,λ,s) = 1; λ ∈ R+ fixed.
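In plain floating point, and therefore without the guarantees of the interval method, the shooting function F of (5.26) can be evaluated by integrating (5.25). The sketch below uses an illustrative moderate load λ = 5 ∈ (λ_1, λ_2), for which there is a single positive root, and a sign-change search (bisection) instead of the Newton iteration of the text.

```python
import math

def shoot(s, lam=5.0, n=400):
    """Integrate (5.25) by RK4 over x in [0,1]; return F(s) = y2(1, lam, s)."""
    h = 1.0 / n
    y1, y2 = 0.0, s
    def f(y1, y2):
        return y2, -lam * math.sin(y1)
    for _ in range(n):
        k1a, k1b = f(y1, y2)
        k2a, k2b = f(y1 + h/2*k1a, y2 + h/2*k1b)
        k3a, k3b = f(y1 + h/2*k2a, y2 + h/2*k2b)
        k4a, k4b = f(y1 + h*k3a, y2 + h*k3b)
        y1 += h/6*(k1a + 2*k2a + 2*k3a + k4a)
        y2 += h/6*(k1b + 2*k2b + 2*k3b + k4b)
    return y2

# F changes sign on [0.5, 4.4] (s is confined to (0, 2*sqrt(5)] by (5.24)),
# so bisection locates the positive root s1* of the buckled state.
lo, hi = 0.5, 4.4
assert shoot(lo) < 0.0 < shoot(hi)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if shoot(mid) < 0.0:
        lo = mid
    else:
        hi = mid
s_root = 0.5 * (lo + hi)
print(s_root, shoot(s_root))   # root near 3.9; F(s_root) ~ 0
```

A subsequent interval Newton step of the form (5.28), with (5.27) supplying F', is what turns such a floating-point candidate into a verified enclosure; the ill-conditioning described below shows why the candidate alone cannot be trusted.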
The interval Newton method then is given by
(5.28) [s_{k+1}] := m([s_k]) − [y_2(1,λ,m([s_k]))]/[η_2(1,λ,[s_k])] for k ∈ N; [s_0] ⊂ R is chosen and λ ∈ R+ is fixed; [s_k] = [s̲_k, s̄_k], m([s_k]) := (s̲_k + s̄_k)/2.
The external brackets [y_2(...)] and [η_2(...)] refer to an employment of Lohner's Enclosure Method ([31],[32]) with respect to the semi-coupled IVP (5.25),(5.27). For any fixed λ, the set of roots {s_i*(λ) | i = 1(1)n} ⊂ (0, 2√λ] of (5.26) can be characterized as follows:
(a) with s_0* = 0 and for i = 1(1)n, the distance s_i* − s_{i−1}* of the roots of F(s) = 0 decreases strongly as s ∈ (0, 2√λ] increases;
(b) this distance is very small, particularly for large λ;
(c) as λ − λ_n changes its sign from negative to positive for any fixed n ∈ N, a new root, s*_new, starts from s = 0;
(d) for all x ∈ [0,1], the solution y* = y*(x,λ,s*_new) possesses a magnitude that approaches zero as λ − λ_n → 0+;
(e) because of (b) and (d), a computational determination of a root s* is ill-conditioned close to either 2√λ or zero, respectively.
For λ = 417 > λ_7 ≈ 416.99, an approximation s̄_new of s*_new will now be
discussed for the case of a replacement of Lohner's Enclosure Method by a classical Runge-Kutta method with step size h. For h = 1/50 or 1/100 or 1/200, the shooting method yielded approximations y₂(1,417,s) which were negative for s = 10⁻¹² and positive for s = 5·10⁻¹². If correct, these values imply the existence of a root s*_new in the interval [10⁻¹², 5·10⁻¹²]. When this problem was reworked by means of Lohner's Enclosure Method, it turned out that there is no root s*_new in this interval [10⁻¹², 5·10⁻¹²]. As λ − λ₇ goes from negative to positive values, the new root s*_new is in the interval [0.27, 0.28]. □
The subsequent Example 5.31 demonstrates another failure of type (A). The following nonlinear Mathieu equation with period T_e = 1 is considered:
y'' + 0.1 y' + (3 + cos(2πt))·y + 0.1 y³ = −10.  (5.29)
If there are T-periodic solutions y*_per of (5.29), then T must satisfy T = k·T_e with any k ∈ ℕ. For a verification of this property, (5.29) is equivalently represented by means of the system y' = f(t,y) with y := (y₁,y₂)ᵀ. The validity of y*(t+T) = y*(t) for its solutions y*_per and all t ∈ ℝ implies that
y*_per'(t) = f(t, y*_per(t)) and y*_per'(t+T) = f(t+T, y*_per(t+T)) = f(t+T, y*_per(t)).  (5.30)
Consequently, f must be T-periodic or, more generally, T/k-periodic with any k ∈ ℕ.
Example 5.31 [21]: Concerning (5.29), y(0) and y'(0) were systematically varied in order to search for candidates for periodic solutions, which then were to be verified by use of Lohner's Enclosure Methods ([31],[32]). For this purpose, a classical Runge-Kutta method was used with the choice of a step size h = 2·10⁻⁴. A "candidate" ỹ was found for T = 0.3782 because of
(5.32)   −8.03…  and  2.3…   versus   −8.036518  and  2.300605,
the computed values of y and y' at the endpoints t = 0 and t = T of the candidate orbit.
The corresponding orbit is "closed" with more than graphical accuracy. Because of the discussion with respect to (5.30), T = 0.3782 is far away from the candidates T = k·T_e for periods of T-periodic solutions of (5.29). □
Additional examples for failures of discretizations are as follows:
- computed "approximations" ỹ of non-existing solutions y*_per of BVPs, in Subsections 7.3 and 7.4 of the present paper;
- for failures of computed difference approximations ỹ as caused by diversions,
see Subsection 9.2 in [4] and [3], [49], and the paper [39] by Rufeger & Adams in this volume.
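The failure mechanism in Example 5.22, a sign change of a floating-point shooting function near a bifurcation point, can be sketched as follows. The Python code below assumes the pendulum-type right-hand side y₂' = −λ sin y₁ inferred for the auxiliary IVP (5.25) and evaluates F(s) = y₂(1,λ,s) with a classical Runge-Kutta method; it is illustrative only and carries no error bounds, which is exactly the point of the example.

```python
# Classical (non-verified) shooting for an auxiliary IVP of the form (5.25):
#   y1' = y2,  y2' = -lam*sin(y1),  y1(0) = 0,  y2(0) = s,
# evaluating F(s) = y2(1, lam, s) by the classical Runge-Kutta method.
# This floating-point computation is precisely the kind whose sign pattern
# near a bifurcation point lambda_n is unreliable (Example 5.22); unlike
# Lohner's Enclosure Method, it provides no guarantees.
from math import sin

def rk4_shoot(lam, s, steps=200):
    h = 1.0 / steps
    y1, y2 = 0.0, s
    f = lambda u, v: (v, -lam * sin(u))
    for _ in range(steps):
        k1 = f(y1, y2)
        k2 = f(y1 + 0.5*h*k1[0], y2 + 0.5*h*k1[1])
        k3 = f(y1 + 0.5*h*k2[0], y2 + 0.5*h*k2[1])
        k4 = f(y1 + h*k3[0], y2 + h*k3[1])
        y1 += h * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0]) / 6.0
        y2 += h * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1]) / 6.0
    return y2          # F(s) = y2(1, lam, s)

# A sign change of F between two values of s suggests, but, as Example 5.22
# shows, does not prove, the existence of a root s* in between.
print(rk4_shoot(417.0, 1e-12), rk4_shoot(417.0, 0.275))
```

For s = 0 the computed F vanishes identically, and for small s the result is close to the linearized value s·cos(√λ); neither observation certifies anything about roots near the ill-conditioned range.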
5.6 CONCLUDING REMARKS
Generally, nonlinear ODEs with boundary conditions constitute a situation exhibiting unusual difficulties with respect to
(a) the verification of the existence of true solutions y* and
(b) the determination of reliable computational approximations of solutions y*.
Concerning both (a) and (b), these difficulties hold irrespective of the employment of traditional numerical methods or of Enclosure Methods. In view of the unreliability of discretizations, the presented Examples 5.22 and 5.31 are convincing since the investigated systems of ODEs
- are not "suitably chosen" for demonstration purposes, but rather have been taken from the physics literature;
- are of rather low complexity, since their total orders are small and their nonlinearities are simple.
This completes Part I, which is mainly devoted to qualitative discussions with respect to the reliability of discretizations of DEs. Concerning this problem,
- there are two quantitative examples in Subsection 5.5 of this Part I;
- a more general discussion of practical failures of discretizations is presented in Part II [4].
5.7 APPENDIX: KÜHN'S ENCLOSURE METHOD FOR PERIODIC SOLUTIONS OF NONLINEAR ODES
The nonlinear BVP (5.1) is considered for the case that the unknown endpoint T is to be determined simultaneously with a T-periodic true solution y* of this BVP. Since the ODEs y' = f(y) are autonomous, any solution y* of (5.1) can immediately be continued to all intervals [lT, (l+1)T] for all l ∈ ℕ; for this purpose, y*(t) for t ∈ [0,T] is replaced by y*(t + lT). The goal of Kühn's construction [29] is
- the computational determination of an enclosure of a solution y* of (5.1),
- merged with the automatic verification of the existence of y*.
For these purposes, a hyperplane V (with a representation in ℝⁿ) is chosen such that
(i) D ∩ V ≠ ∅,
(ii) s* = y*_per(0) ∈ D ∩ V, provided y*_per exists, and  (5.33)
(iii) in a neighborhood of s*, f(y) for y ∈ D ∩ V is not orthogonal to grad V(y);
here ∅ denotes the empty set and s represents the shooting vector introduced in (5.3). Obviously, s* = y*_per(0) can be satisfied only as the result of a convergent iteration for s as a root of (5.5). For y ∈ V, a new (n−1)-dimensional Cartesian basis is introduced by means of a bijective map ψ:
ψ: V → ℝⁿ⁻¹ with w := ψ(y) and y = ψ⁻¹(w).  (5.34)
Without a loss of generality, this basis is now chosen such that
V := {y ∈ ℝⁿ | y_n = c = const.}.  (5.35)
The following Poincaré map P is introduced:
P: D ∩ V → D ∩ V and Py := Φ^{τ(y)} y for y ∈ D ∩ V,  (5.36)
with t = τ(y) the first return time to D ∩ V of an orbit which, at t = 0, starts at y ∈ D ∩ V.
Remark: This map involves unknown true solutions y* of y' = f(y). Consequently, a Poincaré map is not finite-dimensional since, implicitly, the function space of the solutions y* is involved.
Provided there exists a periodic solution y*_per with y*_per(0) =: s* ∈ D ∩ V and y*_per(τ(y)) ∈ D ∩ V, then there is a smallest k ∈ ℕ such that
P^k s* := (P∘P∘…∘P) s* = s* ∈ D ∩ V with P⁰ := I and T := Σ_{j=0}^{k−1} τ(P^j s*),  (5.37)
where T represents the total time between the departure from D ∩ V at s* and the return to this point. If (5.37) is true for s*, then there holds P^{mk} s* = P^k s* = s* for all m ∈ ℕ, i.e., y*_per is T-periodic. Since s* ∈ D ∩ V, there are n unknown components of the vector (w,T) to be determined. Since y = ψ⁻¹(w) for the points on D ∩ V, (5.37) is equivalent with
G(w) := Φ^{τ(w)} ψ⁻¹(w) − ψ⁻¹(w) = 0,  (5.38)
where τ(w) is written for τ(ψ⁻¹(w)).
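The first return time τ(y) entering the Poincaré map (5.36) can be approximated, though not enclosed, by ordinary numerical integration with crossing detection. A minimal Python sketch, assuming the planar test field f(y) = (−y₂, y₁), whose orbits are circles with exact return time 2π:

```python
# Approximation of the Poincare return time tau(y) of (5.36): integrate
# y' = f(y) from a point on the section V = {y in R^2 | y2 = 0} and detect
# the first return to V with the same crossing direction.  Illustration:
# f(y) = (-y2, y1), circular orbits, exact return time 2*pi.  Plain RK4
# with a linear interpolation of the crossing: an approximation of tau,
# not an enclosure of it.
from math import pi

def f(y):
    return (-y[1], y[0])

def rk4_step(y, h):
    k1 = f(y)
    k2 = f((y[0] + 0.5*h*k1[0], y[1] + 0.5*h*k1[1]))
    k3 = f((y[0] + 0.5*h*k2[0], y[1] + 0.5*h*k2[1]))
    k4 = f((y[0] + h*k3[0], y[1] + h*k3[1]))
    return (y[0] + h*(k1[0] + 2*k2[0] + 2*k3[0] + k4[0])/6.0,
            y[1] + h*(k1[1] + 2*k2[1] + 2*k3[1] + k4[1])/6.0)

def return_time(y0, h=1e-3, t_max=100.0):
    y, t = y0, 0.0
    while t < t_max:
        y_new = rk4_step(y, h)
        # upward crossing of {y2 = 0}, the direction of departure at t = 0
        if y[1] < 0.0 <= y_new[1]:
            return t + h * (-y[1]) / (y_new[1] - y[1])  # linear interpolation
        y, t = y_new, t + h
    raise RuntimeError("no return to the section within t_max")

tau = return_time((1.0, 0.0))
print(tau)    # close to 2*pi = 6.28318...
```

The remark above applies verbatim: the computed τ involves the unknown true solution, so any rigorous use of the Poincaré map requires an enclosure of the flow, not this pointwise approximation.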
Theorem 5.39: If and only if
F(w,T) := Φ^T ψ⁻¹(w) − ψ⁻¹(w) = 0, with T := k·τ(w) and an arbitrary but fixed k ∈ ℕ,  (5.40)
then G as defined in (5.38) possesses a root w*.
Proof:
(i) If G(w) = 0 and T = τ(w), then F(w,T) = 0.
(ii) If F(w,T) = 0, then T must be an integer multiple of τ(w). Consequently, G(w) = 0.
For both (i) and (ii), the existence of a root w* follows from the existence of the solution y* with its representation by means of the operator Φ^t. □
Since T is unknown, F as defined in (5.40) corresponds to the boundary condition of a free BVP concerning y' = f(y). This BVP will now be represented equivalently by means of a BVP with the fixed endpoints 0 and 1, making use of a suitable transformation of the time scale. For this purpose, a function φ(t,y₀) := Φ^t y₀ is introduced, and the operator Φ^t is replaced by
Φ̂^t y₀ := φ(Tt, y₀).  (5.41)
This implies that
(d/dt) Φ̂^t y₀ = (d/dt) φ(Tt, y₀) = T·f(φ(Tt, y₀)) =: h(Φ̂^t y₀).  (5.42)
By means of y' = f(y), therefore,
h(y) = T·f(y) for y ∈ D and Φ̂¹ y₀ = Φ^T y₀,  (5.43)
and
(d/dt) Φ̂^t y₀ |_{t=1} = T·f(φ(T, y₀)).  (5.44)
This allows the following representation of the function F, which is defined in (5.40):
F(w,T) = Φ̂¹ ψ⁻¹(w) − ψ⁻¹(w).  (5.45)
Consequently, a root ŝ* of
F̂(ŝ) = 0 with ŝ := (w,T)ᵀ ∈ ℝⁿ  (5.46)
is to be determined, where T has been incorporated in the choice of the function h. The problem (5.46) is now equivalently represented through the classical mean value theorem,
0 = F̂(ŝ*) = F̂(ŝ₀) + F̂'(ŝ_im)·(ŝ* − ŝ₀), with ŝ₀ ∈ D ∩ V fixed and ŝ_im the intermediate argument. A rearrangement of terms yields the equivalent representation ŝ* = H(ŝ*) with
H(ŝ) := ŝ₀ − Γ·F̂(ŝ₀) + (I − Γ·F̂'(ŝ_im))·(ŝ − ŝ₀),
where ŝ₀ ∈ D ∩ V is fixed and Γ ∈ L(ℝⁿ) is arbitrary but fixed and invertible. According to Rump [40], this gives rise to the following interval extension, where Γ now is not assumed to be invertible:
H([ŝ]) := ŝ₀ − Γ·F̂(ŝ₀) + (I − Γ·F̂'([ŝ]))·([ŝ] − ŝ₀),  (5.47)
where both ŝ₀ ∈ [ŝ] and Γ ∈ L(ℝⁿ) are arbitrary but fixed.
This induces an interval iteration
[ŝ_{k+1}] := H([ŝ_k]) with [ŝ₀] ⊂ ℝⁿ suitably chosen.  (5.48)
The following theorem has essentially been proved by Rump [40]:
Theorem 5.49: Provided the enclosure condition
[ŝ_{k+1}] ⊂ [ŝ_k]  (5.50)
is satisfied, there hold:
(i) Γ is invertible and
(ii) there exists one and only one fixed point ŝ* ∈ [ŝ_{k+1}] of H.
Remark: For (5.50) to be satisfied, ŝ* must be isolated.
For applications of Theorem 5.49, the Jacobian of F is needed:
∂F(w,T)/∂w = Y(1)·∂ψ⁻¹(w)/∂w with Y(t) := (Y_ik(t)) := (∂(Φ̂^t y₀)_i/∂(y₀)_k)  (5.51)
and, because of (5.40)–(5.43),
∂F(w,T)/∂T = (1/T)·h(Φ̂¹ ψ⁻¹(w)) = f(Φ̂¹ ψ⁻¹(w)).  (5.52)
Because of (5.35) and making use of the Kronecker symbol δ_ij,
∂F_i(w,T)/∂w_j = Σ_k Y_ik(1)·δ_kj − δ_ij = Y_ij(1) − δ_ij.  (5.53)
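The verification step (5.47) together with the enclosure condition (5.50) can be illustrated in the scalar case n = 1, where interval arithmetic reduces to endpoint computations. The Python sketch below uses the toy problem F(s) = s² − 2 and plain floating point, so the outward rounding required for a rigorous proof is omitted; all function and variable names are illustrative.

```python
# Rump-style verification step (5.47)/(5.50) in the scalar case n = 1:
#   H([s]) := s0 - Gamma*F(s0) + (1 - Gamma*F'([s]))*([s] - s0);
# if H([s]) lies in the interior of [s], Theorem 5.49 yields the existence
# of exactly one root in H([s]).  Toy problem F(s) = s^2 - 2; ordinary
# floating point, i.e. outward rounding is omitted.

def imul(a, b):                      # product of intervals a = [a1,a2], b = [b1,b2]
    p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(p), max(p))

def H(interval, s0, gamma, F, dF):
    dlo, dhi = dF(interval)                       # enclosure of F' on [s]
    one_minus = (1.0 - gamma*dhi, 1.0 - gamma*dlo)  # valid since gamma > 0
    shift = (interval[0] - s0, interval[1] - s0)  # [s] - s0
    plo, phi = imul(one_minus, shift)
    c = s0 - gamma*F(s0)
    return (c + plo, c + phi)

F  = lambda s: s*s - 2.0
dF = lambda iv: (2.0*iv[0], 2.0*iv[1])            # F'([a,b]) = [2a,2b] for a > 0
s0, gamma = 1.4, 1.0/2.8                          # gamma approximates 1/F'(s0)
sk = (1.3, 1.5)
hk = H(sk, s0, gamma, F, dF)
verified = sk[0] < hk[0] and hk[1] < sk[1]        # enclosure condition (5.50)
print(hk, verified)
```

With these numbers the image interval is strictly interior to [1.3, 1.5], so, in the sense of Theorem 5.49 (up to the missing rounding control), the unique root √2 is verified to lie in it.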
Concerning Kühn's investigation of the symmetry properties (5.11)–(5.13), the details will now be discussed. Provided y* is a solution of y' = f(y) and (5.11) is valid, then, trivially, Sy* is also a solution:
(Sy*(t))' = Sy*'(t) = S·f(y*(t)) = f(S·y*(t)) ⟹ S·Φ^t y₀ = Φ^t S y₀,  (5.54)
since this implication is true at t = 0 and Φ^t is the operator representing the solutions y*. Whereas generally an orbit γ is not invariant with respect to a matrix S such that Sf(y) = f(Sy), the conditions of the following theorem may then be satisfied, which has been proved by W. Kühn [29]:
Theorem 5.55: Provided there are
(a) an invertible matrix S ∈ L(ℝⁿ) such that Sf(y) = f(Sy),
(b) a k ∈ ℕ such that S^k = I, and
(c) quantities y₀, τ with the properties as stated in (5.13),
then there hold:
(i) the orbit γ as determined by y₀ is invariant with respect to S and
(ii) Φ^t y₀ is periodic with period T = kτ.
Proof: (i) Because of (5.54), there holds
Sγ := {Sy | y ∈ γ} = {S Φ^t y₀ | t ∈ ℝ} = {Φ^t S y₀ | t ∈ ℝ} = γ, since Sy₀ ∈ γ.  (5.56)
(ii) Because of (5.13), (5.56), and Sy₀ ∈ γ, there holds
Φ^{kτ} y₀ = Φ^{(k−1)τ} S y₀ = Φ^{(k−2)τ} S² y₀ = … = S^k y₀ = I·y₀ = y₀. □  (5.57)
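Hypotheses (a) and (b) of Theorem 5.55 can be checked directly for Kühn's application [29], the Lorenz system, with the symmetry S(x,y,z) = (−x,−y,z), for which S² = I (hypothesis (b) with k = 2). A small Python sketch; the test points are arbitrary:

```python
# Check of hypothesis (a) of Theorem 5.55, S*f(y) = f(S*y), for the Lorenz
# system with the classical parameters and S(x,y,z) = (-x,-y,z), S^2 = I.
sigma, rho, beta = 10.0, 28.0, 8.0/3.0

def f(v):
    x, y, z = v
    return (sigma*(y - x), x*(rho - z) - y, x*y - beta*z)

def S(v):
    return (-v[0], -v[1], v[2])

for v in [(1.0, 2.0, 3.0), (-0.5, 4.0, 7.5), (3.2, -1.1, 0.4)]:
    # the symmetry condition holds exactly, even in floating point,
    # because IEEE rounding is sign-symmetric
    assert S(f(v)) == f(S(v))
```

Hypothesis (c), the existence of y₀ and τ as in (5.13), is the hard part: it is exactly what the enclosure machinery of this appendix is designed to verify.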
LIST OF REFERENCES
[1] E. Adams, A. Holzmüller, D. Straub, The Periodic Solutions of the Oregonator and Verification of Results, p. 111-121 in: Scientific Computation with Automatic Result Verification, eds.: U. Kulisch, H. J. Stetter, Springer-Verlag, Wien, 1988.
[2] E. Adams, Enclosure Methods and Scientific Computation, p. 3-31 in: Numerical and Applied Mathematics, ed.: W. F. Ames, J. C. Baltzer, Basel, 1989.
[3] E. Adams, Periodic Solutions: Enclosure, Verification, and Applications, p. 199-245 in: Computer Arithmetic and Self-validating Numerical Methods, ed.: Ch. Ullrich, Academic Press, Boston, 1990.
[4] E. Adams, The Reliability Question for Discretizations of Evolution Problems, II: Practical Failures, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[5] E. Adams, W. F. Ames, W. Kühn, W. Rufeger, H. Spreuer, Computational Chaos May be Due to a Single Local Error, will appear in J. Comp. Physics.
[6] G. Alefeld, J. Herzberger, Introduction to Interval Computations, Academic Press, New York, 1983.
[7] J. Argyris, H.-P. Mlejnek, Die Methode der finiten Elemente, III: Einführung in die Dynamik, F. Vieweg & Sohn, Braunschweig, 1988.
[8] C. Canuto, M. Y. Hussaini, A. Quarteroni, T. A. Zang, Spectral Methods in Fluid Dynamics, Springer-Verlag, New York, 1988.
[9] L. Cesari, Asymptotic Behavior and Stability Problems in Ordinary Differential Equations, 2nd edition, Springer-Verlag, Berlin, 1963.
[10] S.-N. Chow, K. J. Palmer, On the Numerical Computation of Orbits of Dynamical Systems: the One-Dimensional Case, will appear in J. of Complexity.
[11] S.-N. Chow, K. J. Palmer, On the Numerical Computation of Orbits of Dynamical Systems: the Higher Dimensional Case, will appear.
[12] S. De Gregorio, The Study of Periodic Orbits of Dynamical Systems, The Use of a Computer, J. of Statistical Physics 38, p. 947-972, 1985.
[13] H.-J. Dobner, Verified Solution of Integral Equations with Applications, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[14] W. Espe, Überarbeitung von Programmen zur numerischen Integration gewöhnlicher Differentialgleichungen, Diploma Thesis, Karlsruhe, 1991.
[15] R. Gaines, Difference Equations Associated With Boundary Value Problems for Second Order Nonlinear Ordinary Differential Equations, SIAM J. Numer. Anal. 11, p. 411-433, 1974.
[16] J. Guckenheimer, P. Holmes, Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields, 2nd printing, Springer-Verlag, New York, 1983.
[17] J. K. Hale, N. Sternberg, Onset of Chaos in Differential Delay Equations, J. of Comp. Physics 77, p. 221-239, 1988.
[18] S. M. Hammel, J. A. Yorke, C. Grebogi, Do Numerical Orbits of Chaotic Dynamical Processes Represent True Orbits?, J. of Complexity 3, p. 136-145, 1987.
[19] S. M. Hammel, J. A. Yorke, C. Grebogi, Numerical Orbits of Chaotic Processes Represent True Orbits, Bull. (New Series) of the American Math. Soc. 19, p. 465-469, 1988.
[20] Hao Bai-Lin, Chaos, World Scientific Publ. Co., Singapore, 1984.
[21] R. Heck, Lineare und nichtlineare gewöhnliche periodische Differentialgleichungen, Diploma Thesis, Karlsruhe, 1990.
[22] B. M. Herbst, M. J. Ablowitz, Numerically Induced Chaos in the Nonlinear Schrödinger Equation, Phys. Review Letters 62, p. 2065-2068, 1989.
[23] H. Heuser, Gewöhnliche Differentialgleichungen, B. G. Teubner, Stuttgart, 1989.
[24] A. Holzmüller, Einschließung der Lösung linearer oder nichtlinearer gewöhnlicher Randwertaufgaben, Diploma Thesis, Karlsruhe, 1984.
[25] A. Iserles, A. T. Peplow, A. M. Stuart, A Unified Approach to Spurious Solutions Introduced by Time Discretisation, Part I: Basic Theory, SIAM J. Numer. Anal. 28, p. 1723-1751, 1991.
[26] D. W. Jordan, P. Smith, Nonlinear Ordinary Differential Equations, Clarendon Press, Oxford, 1977.
[27] R. Klatte, U. Kulisch, M. Neaga, D. Ratz, Ch. Ullrich, PASCAL-XSC Language Reference with Examples, Springer-Verlag, Berlin, 1992.
[28] P. E. Kloeden, A. I. Mees, Chaotic Phenomena, Bull. of Math. Biology 47, p. 697-738, 1985.
[29] W. Kühn, Einschließung von periodischen Lösungen gewöhnlicher Differentialgleichungen und Anwendung auf das Lorenzsystem, Diploma Thesis, Karlsruhe, 1990.
[30] G. A. Leonov, V. Reitmann, Attraktoreingrenzung für nichtlineare Systeme, B. G. Teubner Verlagsgesellschaft, Leipzig, 1987.
[31] R. Lohner, Einschließung der Lösung gewöhnlicher Anfangs- und Randwertaufgaben und Anwendungen, Doctoral Dissertation, Karlsruhe, 1988.
[32] R. J. Lohner, Enclosing the Solutions of Ordinary Initial and Boundary Value Problems, p. 255-286 in: Computerarithmetic, eds.: E. Kaucher, U. Kulisch, Ch. Ullrich, B. G. Teubner, Stuttgart, 1987.
[33] E. N. Lorenz, Deterministic Nonperiodic Flow, J. of the Atmosph. Sc. 20, p. 130-141, 1963.
[34] E. N. Lorenz, Computational Chaos - A Prelude to Computational Instability, Physica D 35, p. 299-317, 1989.
[35] R. May, Simple Mathematical Models With Very Complicated Dynamics, Nature 261, p. 459-467, 1976.
[36] R. E. Moore, Methods and Applications of Interval Analysis, SIAM, Philadelphia, 1979.
[37] J. M. Ortega, W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Academic Press, New York, 1970.
[38] W. Rufeger, Numerische Ergebnisse der Himmelsmechanik und Entwicklung einer Schrittweitensteuerung des Lohnerschen Einschließungs-Algorithmus, Diploma Thesis, Karlsruhe, 1990.
[39] W. Rufeger, E. Adams, A Step Size Control for Lohner's Enclosure Algorithm for Ordinary Differential Equations with Initial Conditions, in: Scientific Computing with Automatic Result Verification, eds.: E. Adams, U. Kulisch, Academic Press, Boston, will appear (this volume).
[40] S. M. Rump, Solving Algebraic Problems with High Accuracy, p. 51-120 in: A New Approach to Scientific Calculation, eds.: U. Kulisch, W. L. Miranker, Academic Press, New York, 1983.
[41] P. Schramm, Einsatz von PASCAL-SC und Anwendungen genauer Arithmetik bei zeitdiskretisierten dynamischen Systemen, Diploma Thesis, Karlsruhe, 1990.
[42] Ya. G. Sinai, E. B. Vul, Discovery of Closed Orbits of Dynamical Systems with the Use of Computers, J. Statistical Physics 23, p. 27-47, 1980.
[43] C. Sparrow, The Lorenz Equations: Bifurcations, Chaos, and Strange Attractors, Springer-Verlag, New York, 1982.
[44] H. Spreuer, E. Adams, Pathologische Beispiele von Differenzenverfahren bei nichtlinearen gewöhnlichen Randwertaufgaben, ZAMM 57, p. T304-T305, 1977.
[45] H. Spreuer, E. Adams, On Extraneous Solutions With Uniformly Bounded Difference Quotients for a Discrete Analogy of a Nonlinear Ordinary Boundary Value Problem, J. Eng. Math. 19, p. 45-55, 1985.
[46] H. Steinlein, H.-O. Walther, Hyperbolic Sets, Transversal Homoclinic Trajectories, and Symbolic Dynamics for C¹-Maps in Banach Spaces, J. of Dynamics and Differential Equations 2, p. 325-365, 1990.
[47] H. J. Stetter, Analysis of Discretization Methods for Ordinary Differential Equations, Springer-Verlag, Berlin, 1973.
[48] J. Stoer, R. Bulirsch, Einführung in die Numerische Mathematik, II, Springer-Verlag, Berlin, 1973.
[49] D. Straub, Eine Geschichte des Glasperlenspiels - Irreversibilität in der Physik: Irritationen und Folgen, Birkhäuser-Verlag, Basel, 1990.
[50] A. Stuart, Nonlinear Instability in Dissipative Finite Difference Schemes, SIAM Review 31, p. 191-220, 1989.
[51] S. Timoshenko, Theory of Elastic Stability, McGraw-Hill Book Co., Inc., New York, 1936.
[52] G. Trageser, Beschattung beweist: Chaoten sind keine Computer-Chimären, Spektrum der Wissenschaft, p. 22-23, 1989.
[53] C. Truesdell, An Idiot's Fugitive Essays on Science - Methods, Criticism, Training, Circumstances, Springer-Verlag, New York, 1984.
[54] W. Walter, Differential and Integral Inequalities, Springer-Verlag, Berlin, 1970.
[55] W. Wedig, Vom Chaos zur Ordnung, GAMM-Mitteilungen, Heft 2, p. 3-31, 1989.
[56] H. C. Yee, P. K. Sweby, On Reliability of the Time-Dependent Approach to Obtaining Steady-State Numerical Solutions, Proc. of the 9th GAMM Conf. on Num. Methods in Fluid Mechanics, Lausanne, Sept. 1991.
[57] H. C. Yee, P. K. Sweby, D. F. Griffiths, Dynamical Approach Study of Spurious Steady-State Numerical Solutions for Nonlinear Differential Equations, J. of Comp. Physics 97, p. 249-310, 1991.
[58] A. Ženíšek, Nonlinear Elliptic and Evolution Problems and their Finite Element Approximations, Academic Press, London, 1990.
II. PRACTICAL FAILURES
Whereas failures of discretizations of differential equations are mainly discussed qualitatively in Part (I), [5], this continuation presents examples concerning spurious difference solutions (or diverting difference approximations) and a quantitative treatment of selected evolution problems of major contemporary interest. Examples are mathematical models for gear drive vibrations, the Lorenz equations, the Restricted Three Body Problem, and the Burgers equation. In each one of these examples, Enclosure Methods were instrumental in the discovery of large deviations of computed difference solutions or difference approximations from true solutions of the differential equations.
6. INTRODUCTION
The numbering of the sections in the present chapter is in continuation of the one in Part (I), [5]. A literature reference [I.n] points to reference [n] in Part (I).
Discretizations of nonlinear DEs (ODEs or PDEs) represent the most important practical approach for the determination of approximations of unknown true solutions y*. An approximation ỹ is said to fail practically
- if a suitably defined distance of ỹ and y* takes sufficiently large values or
- if there is no true solution y* which is approximated by a difference solution ỹ.
As is shown in [5], classes of examples for such failures are provided by "spurious difference solutions" ỹ, "diverting difference approximations" ỹ, or, more generally, any kind of computational chaos. There is a rapidly rising number of publications on the corresponding unreliability of discretizations of DEs. A qualitative discussion of this subject area is offered in [5]. For the purpose of a quantitative substantiation, only a few cases of practical failures of discretizations are treated in Subsection 5.5 in [5]. The present continuation of [5] is concerned with the following areas:
- for a class of linear ODEs with periodic coefficients, in Section 7 a case history of systematic practical failures of discretizations;
- for nonlinear ODEs or PDEs, in Section 8 a discussion of spurious difference solutions ỹ on the basis of literature and examples;
- for nonlinear ODEs or PDEs, in Section 9 the observation of diverting difference approximations ỹ as caused by properties of the underlying DEs or the phase space of their true solutions y*.
The areas covered in Sections 7-9 address problems of major current interest in the mathematical simulation of the real world. Consequently, the failures of discretizations as observed in these sections possess grave practical implications. Even though the problems considered in these sections are mainly concerned with ODEs, all their implications extend to PDEs. In fact,
- there are qualitatively identical influences of numerical errors in the cases of discretizations of ODEs and PDEs;
- frequently in the applied literature, evolution-type PDEs are approximated by systems of ODEs, as is discussed in Section 1 and in Subsection 9.4.
Generally in the areas covered in Sections 7-9, the practical failures of discretizations of DEs cannot be recognized on the level of these difference methods. In the absence of explicitly known true solutions y*, Enclosure Methods are the major practical approach for a reliable quantitative assessment of unknown true solutions y*. In fact, these methods were instrumental in the recognition of the majority of the practical failures as discussed in Sections 7-9.
7. APPLICATIONS CONCERNING PERIODIC VIBRATIONS OF GEAR DRIVES
7.1 INTRODUCTION OF THE PROBLEM
The case history presented in this section demonstrates the diagnostic power of Enclosure Methods, which were instrumental in the discovery that local numerical errors smaller than 10⁻⁸ were large enough to allow the practical computability of difference approximations ỹ of non-existing true periodic solutions y*_per. This case study relates to the simulation of vibrations of one-stage, spur-type industrial gear drives. Consequently, two mated gears are considered, each of which is mounted on a shaft that
- is carried by bearings and
- one end of which is attached to an external clutch.
The physical problem can be characterized by the following geometric or dynamic data of gears:
- a diameter of up to 5 m,
- a performance of up to 70 000 kW,
- a peripheral speed of up to 200 m/s,
- an unusually high geometric precision of the teeth, whose deviations must not exceed a few micrometers (μm), [16], [25], [34].
Vibrations of gear drives are mainly induced by
(A) the periodically varying number of mated pairs of teeth and
(B) geometric deviations of the manufactured teeth from their intended ideal configurations, e.g., pitch deviations.
Because of (A), the tooth stiffness function c_z of the mated pairs of teeth is a T-periodic function of time t, where T = 1/ω; the tooth engagement frequency ω takes values of up to 5000 1/s. In the subsequent analysis, only (A) is taken into account. The T-periodicity of c_z presupposes that small T-periodic vibrations are superimposed onto the constant (nominal) angular velocities ω₁ and ω₂ of the (two) shafts. Obviously, 2π·ω = ω₁z₁ = ω₂z₂, with z₁ and z₂ the numbers of teeth of the mated gears. Concerning a stochastic dynamical treatment of pitch deviations of teeth, see the paper [3] by D. Cordes, H. Keppler, and the author. The quantitative results in [3] demonstrate that pitch deviations of more than the few micrometers (μm) as admitted in the industrial quality standards ([16], [25], [34]) may cause unacceptably intense vibrations. Additionally to these physical sensitivities, there are mathematical and numerical difficulties, as will be shown. The case study to be presented concentrates on the closely interrelated aspects of
- a Physical Simulation of the real world,
- Mathematical Theory,
- Numerical Analysis, and
- Computer and Compiler performance.
The present section rests on
- the physical simulation in the doctoral dissertation [14] (1984) of H. Gerber,
- the mathematical and numerical analysis in the doctoral dissertations of D. Cordes [12] (1987) and U. Schulte [29] (1991), and
- H. Keppler's additional analysis [18], which was carried out after the completion of Schulte's dissertation.
Beginning in 1987, Schulte's and Keppler's work has been supported by the DFG in order to enable a theoretical analysis of real gear drive vibrations, which have been and are being investigated by use of a test rig in the Institute for Machine Elements of the Technical University of Munich. Concerning these experiments, the doctoral dissertations of H. Gerber [14] and R. Müller [23] are referred to. These authors have provided the experimental results to be used subsequently for a validation of the mathematical and physical simulation to be discussed. In fact, a simulation of a real world problem is meaningful only in the case that a satisfactory agreement with observed physical results can be reached. This then may be the basis for a confident employment of the model for purposes of predictions.
7.2 ON THE INVESTIGATED CLASSES OF MODELS
On the basis of Gerber's dissertation [14], a multibody approach has been used for the purpose of the following physical simulation:
(A) the gears are replaced by two solid bodies carrying elastic teeth to be treated by means of the tooth stiffness function c_z;
(B) each shaft is replaced by one or several solid bodies;
(C) the thus generated N solid bodies are coupled as follows:
(C1) in cross sections of their artificial separation, by springs and parallel dashpots, i.e., the T-periodic tooth spring with stiffness coefficient c_z and the tooth dashpot with a constant damping coefficient k_z;
(C2) springs and parallel dashpots correspondingly simulate the bearings, whose foundations are at rest;
(C3) it is optional whether or not the shafts are correspondingly coupled with the (heavy) external clutches, which are assumed to rotate uniformly, i.e., without vibrations. If present, a coupling of this kind will be called a "rotational attachment".
Figure 7.6 displays a simulation by means of N = 4 solid bodies. The dashpots and the optional rotational attachments are not shown in this figure. As a minimal interaction of the shafts and the clutches, there must be a transfer of the constant external torques, see Figure 7.6:
- M₁ to the shaft with the nominal angular velocity ω₁ and
- M₂ to the other shaft with ω₂.
In the absence of the rotational attachment (C3), M₁ and M₂ act as external loads
onto one end-face of each shaft, respectively. In the presence of (C3), M₁ and M₂ are still assumed to be constant, since a feedback of the vibrations onto the heavy clutches has been excluded by assumption. Gerber's multibody approach [14] employs a simulation by means of (A), (B), (C1), and (C2). The omission of (C3) is an essential property of
- the "original class of models" to be defined subsequently,
- which in 1987 was the starting point of Schulte's [29] work.
The absence of (C3) caused mathematical and numerical difficulties to be discussed in Subsections 7.3 and 7.4. These problems did not arise when (C3) was additionally taken into account in the case of a special model with N = 2:
- in the doctoral dissertations of Küçükay [20] (1981) and Naab [24] (1981) and
- in the one of D. Cordes [12] (1987).
Additionally to (A), (B), and (C), a physical simulation consists in the choice of a total of n (≥ N) degrees of freedom to be assigned to the N solid bodies. Each one of these degrees corresponds to one of the following kinematic coordinates (a) or (b), compare Figure 7.6:
(a) translational coordinates x_i of the motions relative to the bearings and/or
(b) rotational coordinates, which are either
(b1) the absolute coordinates φ_i describing the superposition of the vibrations and the nominal solid body rotation ω_j t of the shafts and the gears, with either the sign + or the sign − (depending on i and j) or
(b2) the relative coordinates φ_i ∓ ω_j t or
(b3) the relative coordinates Φ_i, which are defined as differences of coordinates φ_i belonging to neighboring solid bodies in the simulation.
Obviously, there is no contribution of ±ω_j t to the coordinates φ_i ∓ ω_j t or Φ_i. These contributions preclude the existence of a periodic state for the absolute coordinates (b1). Possibly the existence of a periodic state can be proved for the relative coordinates (b2) or (b3). Even though these kinematic facts are obvious, they were obscured by numerous quantitative results that were determined without a total error control, see Subsection 7.3. For a fixed tooth engagement frequency ω, the results of major engineering interest are
- the tooth deformation s, which is a linear combination of all coordinates belonging to the two solid bodies representing the mated gears, and
- the tooth force F.
470
Discretizataons: Practical Failures
Provided |s(t)| and |s'(t)| are sufficiently small for all t ∈ ℝ, F can be expressed as follows:
F(t) := c_z(t)·s(t) + k_z·s'(t).  (7.1)
Mathematically, both s and F consist of
- static contributions, due to the external torques M₁ and M₂, and, additionally,
- vibrational contributions that may still be present in the case of M₁ = M₂ = 0, compare (7.30).
Provided the vibrations are bounded for all t ∈ ℝ, it is physically obvious that |s(t)| and |s'(t)| also have this property. Correspondingly, in the dependency of s on rotational coordinates,
- there is no contribution of the uniform rotation ±ω_j t, as is seen by inspection of (7.7b) and (7.7d), and
- there is no influence of a replacement of the coordinates φ_i by either φ_i ∓ ω_j t or by Φ_i, as is seen by inspection of (7.8a).
The translational and/or rotational coordinates to be employed are now interpreted as the components of a vector (function) y: ℝ → ℝⁿ. Concerning their physical simulation, the dynamical interaction of the n corresponding degrees of freedom can be expressed by a system of n linear ODEs, each of the second order:
My'' + Dy' + (A + B(t))y = g for t ∈ ℝ with y: ℝ → ℝⁿ,
M := diag(m₁, ..., m_n) ∈ L(ℝⁿ) and m_i ∈ ℝ⁺ for i = 1(1)n;  (7.2)
B(t) := c_z(t)·C; A, C, D ∈ L(ℝⁿ); c_z(t+T) = c_z(t) for all t ∈ ℝ; c_z is (piecewise) continuous; T = 1/ω; g ∈ ℝⁿ.
Subsequently, M will be equated to the identity matrix I because, without a loss of generality, the system of ODEs can be premultiplied by M⁻¹. Since s is a linear function of certain components of y, the tooth force F is immediately known when a solution y* = y*(t,ω) of (7.2) has been determined. For engineering purposes, the most important final result is the function
F_dyn = F_dyn(ω) (7.3), pertaining to a T-periodic representation s_per of the tooth deformation s. For engineering purposes, the resonance excitations of F_dyn are to be determined, i.e., the frequencies ω belonging to the local maxima of F_dyn. A countably infinite set of candidates ω for these excitations is furnished by the Floquet theory, e.g., [9]. The goal of the engineering analysis under discussion is an
identification of those candidates actually yielding significant resonance excitations. For engineering purposes, additionally, the property of asymptotic stability of (7.2) is of interest for the case when g is equated to zero. If it exists, a periodic solution of the nonhomogeneous ODEs (7.2) then is globally attractive. A rotational attachment (C3) is physically realized by means of springs and parallel dashpots, with corresponding coefficients of stiffness and damping. It is convenient to express these coefficients as scalar multiples of corresponding coefficients appearing in the simulation. Denoting the scalar factor by α ≥ 0, the matrices in (7.2) can be expressed as follows, according to [29]:
A = A₀ + α·A₁ and D = D₀ + α·D₁,  (7.4)
since B = B(t) simulates the interaction of the gears. According to Diekhans [13] or Schulte [29], the influence of α on F_dyn is practically negligible for α ≤ 10⁻³. In the case of an employment of translational coordinates x_i and/or relative rotational coordinates Φ_i, Schulte [29] has shown the validity of the following decomposition of the system of ODEs: if α = 0,
(I) there are n−1 ODEs just like the ones in (7.2) (compare (7.8a)) and,
(II) by use of a suitable linear combination u = u(y₁, ..., y_n), the additional ODE
u'' = 0.  (7.5a)
The solutions of (7.5a),
u*(t) = a + βt with a, β ∈ ℝ,  (7.5b)
(i) can be determined without a discretization error, making use of any consistent difference method;
(ii) relate to the uniform rotation ±ω_j t; i.e., the solutions of the subsystem (I) represent the vibrational motions.
Because of the bijectivity of the transformation relating the coordinates φ_i and Φ_i, the composition of the subsystems (I) and (II) is equivalent to the original system (7.2).
Remark: For the case of n = 2 rotational degrees of freedom, Gerber [14] has derived a Mathieu-ODE, since then n−1 = 1.
This is the "simplest" one in the set of models which have been derived and employed by Gerber [14] and Schulte [29]. The following types of mathematical simulations will be distinguished:
(i) the "original class of models", with an employment of absolute rotational coordinates φᵢ and the choice of α = 0;
(ii) the "class of modified models", which coincides with (i) with the exception that α = 0 is replaced by α > 0; the corresponding rotational attachment to
Discretizations: Practical Failures
uniformly rotating external masses causes an automatic replacement of the coordinates φᵢ by the relative coordinates ψᵢ;
(iii) the "class of relative models", which, for α = 0, employs the relative rotational coordinates ψᵢ.
For the purpose of a vibrational analysis, the three classes of models are "physically equivalent" provided α ≤ 10⁻³. The three classes are mathematically equivalent provided α = 0 and the ODE u″ = 0 is added to the subsystem (I).
As an example for (7.2), now the special case of N = 4 bodies and n = 6 degrees of freedom is considered in the context of the original class of models. Figure 7.6 displays
- the translational coordinates x₁ and x₂,
- the absolute rotational coordinates φ₁, φ₂, φ₃, and φ₄,
- the stiffness coefficients c₁, c₂, c₃, and c₄ of coupling springs, with parallel dashpots not indicated in the figure,
- the tooth spring with time-dependent stiffness coefficient c_z(t) and the parallel tooth dashpot not indicated in the figure.
Figure 7.6 does not display the (optional) rotational attachments of the shafts.
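Property (i) of the solutions (7.5b) can be checked in a few lines: u″ = 0 is integrated without any discretization error even by the explicit Euler method. This is an illustrative sketch; the values of a, b, and h are chosen here for demonstration and are not taken from [29].

```python
# Check of property (i) for (7.5a): u'' = 0 is solved without discretization
# error by any consistent difference method. Even one explicit Euler step is
# exact, since the update (u, u') -> (u + h*u', u') reproduces u*(t) = a + b*t.
# (a, b, h are illustrative; h = 0.125 is a binary machine number, so there is
# no rounding error either.)
a, b, h = 2.0, 0.5, 0.125
u, v = a, b                          # u(0) = a, u'(0) = b
for n in range(1, 9):
    u, v = u + h * v, v              # one explicit Euler step for u'' = 0
    assert u == a + b * (n * h)      # exact equality at every grid point
print(u)                             # 2.5 = a + b*t at t = 8*h = 1
```

The same exactness holds for every consistent one-step or multistep method, since all of them reproduce linear functions of t exactly.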
Figure 7.6: (from [29]) Physical simulation of the gear drive
Under an additional consideration of the rotational attachments of the shafts with α > 0, the mathematical model is represented by the following system of ODEs:
(7.7a) [the six second-order ODEs for x₁, x₂, φ₁, φ₂, φ₃, φ₄; the displayed system is not recoverable from the scan], where
(7.7b) s := x₁ + x₂ + r₁φ₁ + r₂φ₂,
(7.7c) r₁ω₁ = r₂ω₂ because of the kinematic consistency, and r₂M₁ = r₁M₂ because of the external equilibrium;
m₁ and m₂ are masses, and J₁, J₂, J₃, J₄ are moments of inertia. For α = 0 [29, p. 55], this is a model in the original class of models, with then φ₁, φ₂, φ₃, φ₄ absolute rotational coordinates. In the vectorial representation (7.2) of (7.7), the matrices A, C, and D are not invertible for all fixed choices of t ∈ ℝ. For α > 0, this is a model in the modified class of models. The symbols φ₁, φ₂, φ₃, φ₄ in (7.7a) and (7.7b) then represent the relative coordinates
φⱼ − ωⱼt for j = 1 or 3 and φₖ + ωₖt for k = 2 or 4.
In both cases α = 0 or α > 0, (7.7a) is satisfied by the uniform rotation
(7.7d) x₁ = x₂ = 0, φⱼ = ωⱼt for j = 1 or 3, φₖ = −ωₖt for k = 2 or 4 ⟹ s = 0.
For α = 0, this is obvious by inspection of (7.7a) and (7.7b). For α > 0, this follows from the meaning of φ₁, ..., φ₄ just referred to. For all n > 1, the models in the original and in the modified classes can be characterized as follows:
(i) the coordinates φᵢ and their derivatives φᵢ′ occur in pairs, as sums or as differences with certain weights;
(ii) the tooth deformation s is expressed by means of (7.7b), with x₁, x₂, r₁φ₁, and r₂φ₂ the coordinates of the gears; (7.8)
(iii) for α = 0 and all t ∈ ℝ, Schulte [29] has shown that the matrices A, C, and D in (7.2) are not invertible, see Appendix 7.8;
(iv) for α > 0 and all t ∈ ℝ, the matrices A, C, and D generally are invertible.
Schulte [29, p. 68-69] has derived the model in the class of relative models which, for α = 0, belongs to (7.7). This model consists of the subsystem (I) with five ODEs, each of the second order, for the coordinates
(7.8a) x₁, x₂, ψ₁ := φ₁ − φ₃, ψ₂ := φ₁ + (r₂/r₁)φ₂, and ψ₃ := φ₂ − φ₄,
and the ODE u″ = r₂M₁ − r₁M₂ = 0 with u := r₂J₁φ₁ + r₂J₃φ₃ − r₁J₂φ₂ − r₁J₄φ₄, where u′ = const. represents the conservation of angular momentum.
In the engineering community, the original class of models (for α = 0) has been the standard mathematical approach to an analysis of vibrations of industrial gear drives. Because of the mathematical and numerical problems to be shown for all models in this class, the author and his coworkers have adopted the other two classes of models. Concerning the original class, a literature survey has shown that traditionally marching methods have been used for a computational approximation of the true T-periodic tooth deformation, s*_per. Prior to a discussion of this approximation in Subsection 7.6, a determination of s*_per by means of the corresponding BVP of periodicity will be carried out. For this purpose, the system of ODEs (7.2) is represented equivalently by means of the following system, making use of a redefinition of the symbol y:
(7.9) y′ = Â(t)y + ĝ with Â(t) := ( 0 , I ; −M⁻¹(A + B(t)) , −M⁻¹D ) (block rows separated by ";");
Â : ℝ → L(ℝ²ⁿ) is (piecewise) continuous with Â(t + T) = Â(t) for all t ∈ ℝ.
For ĝ = 0, the homogeneous system possesses the following representation of its general solution:
(7.10) y(t) = φ*(t)y(0).
The fundamental matrix φ : ℝ → L(ℝ²ⁿ) solves the matrix IVP
(7.11) φ′ = Â(t)φ for t ≥ 0, with φ(0) = I.
All solutions of (7.9) can be represented by means of
(7.12) y(t) = φ*(t)y(0) + y*_part(t).
The system of ODEs (7.9) is now supplemented by the boundary condition of T-periodicity,
(7.13) y(T) − y(0) = 0.
Remark: There is not necessarily a solution of the BVP (7.9), (7.13). If this solution exists, it can always be continued for all t ∈ ℝ to represent the desired T-periodic solution y*_per. Since y*_per does not exist when absolute rotational coordinates are employed, the BVP (7.9), (7.13) then cannot possess a solution.
Substitution of (7.12) into (7.13) yields the following system of linear algebraic equations for the initial vector y(0) of the desired T-periodic solution y*_per of (7.9):
(7.14) (I − φ*(T))y(0) = y*_part(T).
Provided λ = 1 is not an eigenvalue of the monodromy matrix φ*(T), (7.14) possesses one and only one solution:
(7.15) y*(0) = (I − φ*(T))⁻¹ y*_part(T).
The T-periodic solution y*_per of (7.9) then can be represented as follows [12]:
(7.16) y*_per(t) = φ*(t)y*(0) + y*_part(t), provided λ = 1 is not an eigenvalue of φ*(T).
This solution is unique and globally attractive if the homogeneous ODEs are asymptotically stable. Because of a theorem in [1.9, p. 58], this property holds true if and only if the spectral radius of φ*(T) satisfies the inequality
(7.17) ρ(φ*(T)) < 1.
This then yields
- the unique T-periodic tooth deformation, s*_per, making use of (7.7b) and those components of y*_per which represent x₁, x₂, r₁φ₁, and r₂φ₂;
- the unique T-periodic tooth force, F*_per, whose maximum value for t ∈ [0, T] is the desired dynamic tooth force, F*_dyn, as defined in (7.3).
Remarks:
1.) For the computational determination of y*_per for t ∈ [0, T], it is advantageous to use a marching method, starting with the true initial vector y*(0) of y*_per; compare Subsection 7.6.
2.) The homogeneous ODEs in (7.9) possess T-periodic solutions if and only if λ = 1 is an eigenvalue of φ*(T) [1.9, p. 59].
3.) In the case that λ = 1 is an eigenvalue of φ*(T), the system (7.14) possesses solutions if and only if an orthogonality condition is satisfied, see Appendix 7.8. Then there are infinitely many T-periodic solutions of (7.9). These solutions are not attractive, since the system of homogeneous ODEs in (7.9) then is not asymptotically stable.
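The chain (7.11), (7.14), followed by marching from the periodic initial vector, can be sketched for a scalar model problem. The damped, T-periodically forced oscillator y″ + 0.4y′ + 25y = cos 2πt with T = 1 is chosen purely for illustration (not the gear model); it is asymptotically stable, so ρ(φ(T)) < 1 and (7.14) is uniquely solvable. The columns of φ(T) are obtained by integrating the homogeneous system from the unit vectors.

```python
import math

# Sketch of (7.11)-(7.15) for y'' + 0.4*y' + 25*y = cos(2*pi*t), T = 1
# (an illustrative stable oscillator, not the gear model of Section 7).
T = 1.0

def f(t, y, forced):
    force = math.cos(2 * math.pi * t / T) if forced else 0.0
    return (y[1], -25.0 * y[0] - 0.4 * y[1] + force)

def rk4(y, forced, n=2000):
    # classical Runge-Kutta marching over one period [0, T]
    h, t = T / n, 0.0
    for _ in range(n):
        k1 = f(t, y, forced)
        k2 = f(t + h / 2, (y[0] + h / 2 * k1[0], y[1] + h / 2 * k1[1]), forced)
        k3 = f(t + h / 2, (y[0] + h / 2 * k2[0], y[1] + h / 2 * k2[1]), forced)
        k4 = f(t + h, (y[0] + h * k3[0], y[1] + h * k3[1]), forced)
        y = (y[0] + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
             y[1] + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]))
        t += h
    return y

# Columns of the fundamental matrix phi(T), cf. (7.11): phi(0) = I.
col1 = rk4((1.0, 0.0), forced=False)
col2 = rk4((0.0, 1.0), forced=False)
ypart = rk4((0.0, 0.0), forced=True)      # particular solution, y_part(0) = 0

# (7.14): (I - phi(T)) y(0) = y_part(T), solved here by Cramer's rule.
a, b = 1.0 - col1[0], -col2[0]
c, d = -col1[1], 1.0 - col2[1]
det = a * d - b * c                        # nonzero, since rho(phi(T)) < 1 here
y0 = ((ypart[0] * d - b * ypart[1]) / det,
      (a * ypart[1] - c * ypart[0]) / det)

# Marching once over [0, T] from y0 returns to y0: the orbit is T-periodic.
yT = rk4(y0, forced=True)
print(abs(yT[0] - y0[0]) + abs(yT[1] - y0[1]))   # ~0, rounding level
```

Since RK4 is affine in y for a linear ODE, the marched value y(T) equals φ(T)y(0) + y_part(T) in the discrete arithmetic as well, so the periodicity defect is at rounding level regardless of the discretization error.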
7.3 ON CONFUSING RESULTS FOR THE ORIGINAL CLASS OF MODELS
In his doctoral dissertation [12], D. Cordes applied Enclosure Methods for an investigation of a simulation with N = 2 bodies and n = 4 degrees of freedom. This simulation can be illustrated by means of Figure 7.6 provided rotational attachments replace the portions of the shafts which, in Figure 7.6, possess the coordinates φ₃ and φ₄. For a large set of values ω, Cordes
- enclosed the true dynamic force F*_dyn = F*_dyn(ω) and
- verified the condition (7.17) for asymptotic stability for almost all values of ω in the chosen set.
Cordes' quantitative results are practically identical with the ones in the doctoral dissertations of Küçükay [20] and Naab [24], where traditional numerical methods had been used earlier for this model with N = 2 and n = 4 as investigated by Cordes. For the author's research project with support by DFG (since 1987), it was planned
- to investigate models with n ∈ {2, 4, 6, 8, 10, 12} in the original class of models;
- to enclose or approximate F*_dyn(ω) by use of the representation (7.16) of y*_per;
- to compare results of Enclosure Methods and traditional numerical methods
as applied to these models;
- to use the codes which had been prepared and applied by D. Cordes [12].
In this context, M. Kölle [19] (1988) employed
- a classical Runge-Kutta method [1.48] for the determination of an approximation φ̂ of the fundamental matrix φ* for t ∈ [0, T] and
- a Gauss-Jordan method for an approximation of (I − φ̂(T))⁻¹.
Kölle computed results for approximately 10⁴ different choices of the vector v of the input data, with ω one of its components. His results can be summarized as follows:
(A) for roughly 50% of the chosen vectors v, ρ(φ̂(T,v)) < 1;
(B) for almost all choices of v, the matrix (I − φ̂(T,v))⁻¹ was computable, resulting in a meaningful dependency on v;
(C) the computed approximations F̂_dyn(v) agreed fairly well with the available physical observations for F_dyn(v), [14].
Simultaneously with Kölle's work, and for all chosen input vectors v, U. Schulte's [29] corresponding attempts concerning the original class of models failed
(a) to verify that ρ(φ*(T,v)) < 1,
(b) to enclose (I − φ*(T,v))⁻¹.
Therefore, in 1989, U. Schulte [29] investigated the properties of the true solutions of the systems of ODEs (7.2) under consideration. She arrived at and proved Theorem 7.28 (in Appendix 7.8). Consequently, for all choices of v and all models in the original class, λ = 1 is an eigenvalue of φ*(T). Therefore,
ρ(φ*(T,v)) = 1 and (I − φ*(T,v))⁻¹ does not exist.
Concerning (A) versus (a): since ρ(φ*(T,v)) = 1, an appeal to statistics suggests that the computed values of ρ(φ̂(T,v)) are expected to exceed one in approximately 50% of all cases and to be smaller than one in the remaining cases.
Concerning (B) versus (b): it is concluded
- for the approximately 10⁴ choices of the input vector v, that the classical Runge-Kutta method as used by M. Kölle almost always failed to recognize property (b) but, rather, yielded "approximations" (I − φ̂(T,v))⁻¹ of the non-existing true matrix (I − φ*(T,v))⁻¹;
- that this is in contrast to the failure message which was given in every attempt to enclose (I − φ*(T,v))⁻¹.
The systematically incorrect nature of the computed results for
(I − φ̂(T,v))⁻¹ was not suspected initially because of the property (C). In fact, (C) may be sufficient for an engineer to accept computed results for (I − φ̂(T,v))⁻¹ and
thus F̂_dyn(v), without asking any further questions. It cannot be ruled out, though,
- that property (C) happens to be present in the case of the available physical observations but
- that this property may be absent for different real-world problems leading to models in the original class.
In view of the reliability question of discretizations, it is mandatory to determine the reason for the almost consistent failure of the classical Runge-Kutta method to recognize property (b). This issue has recently been investigated by H. Keppler [18]. In an example for N = n = 2 belonging to the original class of models, he employed Enclosure Methods for the determination of columnwise enclosures [φ(t)] of the true fundamental matrix φ*(t) for t ∈ [0, T]. This model is illustrated by Figure 7.6 provided the attention is confined to the bodies representing the gears and, additionally, only to their absolute rotational coordinates φ₁ and φ₂. Concerning the subsequent discussion, it is assumed that r₁φ₁ is the ν-th component of y : ℝ → ℝ⁴ and r₂φ₂ is the μ-th component of the vector y. Keppler recognized that the ν-th and the μ-th column vectors of the computed interval matrix I − [φ(T)] ∈ L(𝕀ℝ⁴) differ by vectors comparable with the widths of the enclosures [φ(T)]. Now, (7.14) is replaced by the following system of linear algebraic equations for the determination of an approximation ỹ_per(0) of y*_per(0):
(7.18) (I − φ̂(T,v))y(0) = ỹ_part(T).
Since I − φ̂(T) is almost singular, generally there are large errors in the computed approximation ỹ(0) =: ỹ_per(0) of a solution of (7.18). This is in particular true concerning the components r₁φ₁(0) and r₂φ₂(0) of ỹ(0). In Keppler's example, the individual errors cancelled each other almost totally in the sum r₁φ₁(0) + r₂φ₂(0), which is part of the computed approximation s̃(0) for the tooth deformation s*(0), with s given in (7.7b).
Starting with the computed vectors ỹ_per(0), Keppler employed a marching method for the determination of approximations ỹ_per(t) of y*_per(t) for t ∈ [0, T]. The individual components of ỹ_per(t) then are as erroneous as the components of ỹ(0). Keppler recognized that the individual errors of the components of r₁φ₁(t) and r₂φ₂(t) cancel each other almost totally in the expression for their sum, which is part of the expression for s̃(t).
In additional examples for n > 2, Keppler reached identical conclusions. Consequently, Keppler's numerical experiments may serve as a heuristic explanation for the "success" (A), (B), and (C) of Kölle's numerical work [19] referred to before. This hereby completed subsection in the case history under discussion suggests the following conclusions:
- it highlights the unreliability of discretizations that are not totally error-controlled;
- the observed almost total cancellation of the influences of the individual errors in the terms contributing to the sum for s should not be interpreted as an indication that this "saving grace" is always to be expected in a "meaningful" mathematical simulation of real-world problems;
- the unreliability of the employed discretization would have been revealed in Kölle's work [19] if the computed individual components of ỹ(t) had been inspected rather than only s̃(t).
Keppler's numerical experiments [18] referred to before have been carried out by use of the classical Runge-Kutta method with a fixed step size and by use of a given computer and compiler. Concerning the sign of 1 − ρ(φ̂(T)) and the computability of (I − φ̂(T))⁻¹, the influence of these "tools" has been investigated by means of numerical experiments to be discussed in the next subsection.
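The cancellation mechanism behind (7.18) can be mimicked with a 2x2 model: when I − φ(T) is nearly singular, the error of the computed solution lies essentially along the near-null direction, so the individual components are badly determined while a particular linear combination (the analogue of r₁φ₁(0) + r₂φ₂(0) entering s) stays accurate. This is an illustrative sketch, not the gear matrices.

```python
# Nearly singular "I - phi(T)": eigenvector v = (c, -c) with eigenvalue eps,
# eigenvector w = (c, c) with eigenvalue 1, c = 1/sqrt(2). Errors along v have
# components that cancel exactly in the sum y[0] + y[1].
eps, cc = 1e-8, 2 ** -0.5
M = [[0.5 * (1 + eps), 0.5 * (1 - eps)],
     [0.5 * (1 - eps), 0.5 * (1 + eps)]]

def solve2(M, b):                      # Cramer's rule for a 2x2 system
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return ((b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det)

y = solve2(M, (0.3, 0.7))
d = 1e-6                               # tiny right-hand-side perturbation along v
yp = solve2(M, (0.3 + d * cc, 0.7 - d * cc))

print(abs(y[0] - yp[0]))               # large (~70): individual components unreliable
print(abs(sum(y) - sum(yp)))           # tiny: the sum survives the near-singularity
```

The perturbation of size 10⁻⁶ is amplified by 1/eps = 10⁸ in each component, yet the sum changes only at rounding level; exactly this kind of "saving grace" produced Kölle's plausible-looking values of s̃ despite grossly wrong components.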
7.4 FURTHER NUMERICAL EXPERIMENTS ON PRACTICAL FAILURES OF DISCRETIZATIONS
Concerning the numerical experiments referred to at the end of Subsection 7.3, a first series has been conducted by U. Schulte [29]. In the original class of models, she chose an (unstable) simulation with N = 2 bodies and n = 2 degrees of freedom. The input data of the differential operators have been chosen such that the corresponding model in the relative class (a Mathieu ODE) belongs to the domain of asymptotic stability, with even a considerable distance from the boundary of stability. For the numerical experiments, Schulte [29] used a classical Runge-Kutta method with various choices of the step size h and executions by means of the following compilers, languages, and computers:
Computer | Compiler/Language | Short Notation
HP Vectra with an 80386/87 processor | PASCAL-SC without a utilization of interval arithmetic or the optimal scalar product | PASCAL
IBM 4381 (VM) | WATFOR 87, Version 3, FORTRAN 77 | WATFOR
IBM 4381 (VM) | FORTVS, FORTRAN 77 | FORTVS

Table 7.19

By use of these three computer-based executions of the classical Runge-Kutta method, approximations φ̂ of the true fundamental matrix φ* : ℝ → L(ℝ⁴) were determined for t ∈ [0, T], with T scaled to one. The employed values of the step size h are listed in Table 7.20. The value h = 0.00390625 was chosen since this is a machine number in the hexadecimal system; i.e., then there is no error in the representation of the period T = 1. Enclosure methods were used as follows:
(a) the Cordes algorithm for the determination of an enclosure [ρ(φ(T))];
(b) ACRITH [17] or PASCAL-SC [11] routines for the enclosure [(I − φ(T))⁻¹] of (I − φ(T))⁻¹ if this was computationally possible;
(c) for the purpose of comparisons, the Lohner algorithm for IVPs ([1.31], [1.32]) in conjunction with ACRITH [17] and an execution by means of the IBM 4381 computer addressed in Table 7.19.
For i, j = 1(1)4, the widths of the computed enclosures [φᵢⱼ(T)] = [φ̲ᵢⱼ(T), φ̄ᵢⱼ(T)] are smaller than a bound which is illegible in the scan, with |φᵢⱼ(T)| exceeding one only insignificantly. In dependency on h and the executions addressed in Table 7.19, Table 7.20 lists the following computed results, which are represented by the symbols:
+ : if φ̂ᵢⱼ(T) ∈ [φᵢⱼ(T)] is true for the element φ̂ᵢⱼ of φ̂(T), with i, j = 1(1)4;
− : if φ̂ᵢⱼ(T) ∈ [φᵢⱼ(T)] is not true for φ̂ᵢⱼ;
S : if [ρ(φ(T))] < 1 is "true";
I : if [(I − φ(T))⁻¹] was computable.
Table 7.20 lists, for each of the executions PASCAL, WATFOR, and FORTVS and for the step sizes h ∈ {0.001, 0.00390625, 0.005, 0.01}, the 16 symbols + or − for the elements φ̂ᵢⱼ(T) together with the flags S and I where applicable (the row-by-row assignment of the symbols is not reliably recoverable from the scan).

Table 7.20

For the true fundamental matrix φ*(T), there hold
(7.21) ρ(φ*(T)) = 1 and (I − φ*(T))⁻¹ does not exist.
Consequently, a "computational determination" of the properties I or S is caused by the errors of φ̂(T). These errors are small (the printed bound is illegible in the scan) provided there is a row of 16 plus signs in Table 7.20. An inspection of this table indicates the following major results:
- in the case of "PASCAL", the accuracy of the approximations φ̂ᵢⱼ(T) increases as h decreases; this is not necessarily so in the case of "FORTVS";
- for all choices of h, the incorrect property I has been "shown" by use of "FORTVS"; the incorrect property S then was "shown" only in 50% of the four cases (compare a corresponding result in Kölle's work [19]);
- it is of particular interest that the incorrect property I was "shown" when "PASCAL" was used in conjunction with a relatively small step size.
Summarizing Schulte's numerical experiments [29], it is observed that the accuracy |φ̂ᵢⱼ(T) − φ*ᵢⱼ(T)| achieved for i, j = 1(1)4 was not small enough to avoid the computability of an enclosure [(I − φ̂(T))⁻¹], implying the existence of (I − φ̂(T))⁻¹, with the true matrix I − φ*(T) not being invertible.
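Schulte's choice h = 0.00390625 = 2⁻⁸ can be checked directly: it is a machine number (in binary as in IBM's hexadecimal floating-point format), so the grid hits T = 1 exactly, while a step such as h = 0.01 is not exactly representable. The check below runs on IEEE double precision and is illustrative of the same representation effect.

```python
from decimal import Decimal

# h = 2**-8 is exactly representable, so 256 accumulated steps reproduce T = 1
# without rounding error; h = 0.01 is not, and 100 accumulated steps miss T = 1.
assert 0.00390625 == 2.0 ** -8
assert Decimal(0.00390625) == Decimal("0.00390625")   # stored value is exact
assert Decimal(0.01) != Decimal("0.01")               # stored value is not 0.01

t = 0.0
for _ in range(256):
    t += 0.00390625
assert t == 1.0          # the grid hits the period T = 1 exactly

s = 0.0
for _ in range(100):
    s += 0.01
assert s != 1.0          # accumulated rounding: the grid misses T = 1
```

Thus, for the non-representable step sizes, even the location of the period T = 1 on the grid is already afflicted with rounding errors before any discretization error enters.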
Conclusion: The chance of a computational determination of the (invalid) property I increases as the accuracy of the computed approximations φ̂ᵢⱼ decreases.
Remark: Inaccurate computational tools may increase the chances of always producing "results"!
Subsequent to the completion of Schulte's dissertation [29], H. Keppler [18] carried out numerical experiments for
(α) an (unstable) model with N = 4 and n = 6 in the original class of models and
(β) the corresponding (asymptotically stable) model with N = 4 and n = 5 in the class of relative models.
In both cases, ω = 2400 was chosen. Keppler employed the following numerical methods:
(A) a classical Runge-Kutta method with step size h = T/100 and evaluations with double precision (REAL*8) for the determination of approximations (φ̂ᵢⱼ(T)) = φ̂(T) and (ỹ_part,i(T)) = ỹ_part(T) for t ∈ [0, T]; as compared to the corresponding enclosures, the errors of φ̂ᵢⱼ(T) and ỹ_part,i(T) are smaller than a bound which is illegible in the scan;
(B1) an approximation of (I − φ̂(T))⁻¹ by use of a Gauss-Jordan method and an evaluation in REAL*8; this yields ỹ(0) = (I − φ̂(T))⁻¹ ỹ_part(T);
(B2) an enclosure [y(0)] of the solution y(0) of the system (I − φ̂(T))y(0) = ỹ_part(T), making use of the subroutine library ACRITH [17]; the distance of the bounds for the components of [y(0)] is less than a bound which is illegible in the scan;
(C) starting from ỹ(0) or [y(0)], an approximation ỹ_per(t) of the true T-periodic solution y*_per(t) was determined for t ∈ [0, 2T], making use of a marching method with an execution just as in the case of (A);
(D) concerning ỹ(0) or [y(0)], the results of (C) were compared at times tⱼ ∈ [0, T] and tₖ ∈ [T, 2T] such that tₖ − tⱼ = T. For the components ỹ_per,i(t) of ỹ_per(t), this yields the differences Gᵢ(t) := ỹ_per,i(t + T) − ỹ_per,i(t), with t ∈ [0, T] and i = 1(1)6 (or i = 1(1)5) in the case of model (α) (or model (β)). These components Gᵢ define an error vector G = G(t) characterizing the deviation of ỹ_per from a true T-periodic state. An upper bound of the supremum norm ‖G(t)‖∞ of G for t ∈ [0, T] is presented in Table 7.22.
Concerning the last column, see Subsection 7.8.
Table 7.22 (only the header survives in the scan): for the determination of ỹ_per, employment of method (B1) or (B2); the column "For t ∈ [0, T]: ‖G(t)‖∞" lists the corresponding upper bounds.
non-spurious in this limit. In fact, there may be a bifurcation point h₁ > 0 such that
(i) a real-valued spurious difference solution ỹ_sp = ỹ_sp(h) bifurcates from a non-spurious solution ỹ = ỹ(h, y(0)), and
(ii) the limit as h→0 of this sequence ỹ_sp(h) perhaps is not in D.
Property (i) will now be demonstrated by means of an example, followed by a more general discussion of bifurcating h-dependent sequences of difference solutions.
Concerning bifurcating sequences of spurious difference solutions, a classical example rests on the Logistic ODE (4.8) with its set of monotonic true solutions y* = y*(t, y(0)) in (4.9). An application of the explicit Euler one-step discretization generates the famous Logistic (difference) Equation (4.12) in population dynamics [1.28], which is one of the traditional starting points in the theory of Dynamical Chaos. The standard form of this equation is [1.35]:
(8.2) ŷⱼ₊₁ = Aŷⱼ(1 − ŷⱼ) for j + 1 ∈ ℕ, where ŷⱼ := (h/(1+h))yⱼ and A(h) := 1 + h.
The h-dependent difference solutions ŷⱼ = ŷⱼ(h, ŷ(0)) possess a monotonically increasing sequence of bifurcation points Aₖ, with Aₖ := A(hₖ) and A₁ = 3, such that [1.35], omitting the subscript k of h,
- at Aₖ, a stable real-valued (spurious) 2ᵏh-periodic difference solution bifurcates from a 2ᵏ⁻¹h-periodic real-valued difference solution which is unstable for A ≥ Aₖ;
- there is a point of accumulation, A∞ ≈ 3.5700..., of {Aₖ};
- there is a point Ā ≈ 3.8284... such that there are (spurious) kh-periodic difference solutions for A > Ā and all k ∈ ℕ; additionally, there is then an uncountable number of initial points giving totally aperiodic trajectories [1.35].
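The first period-doubling at A₁ = 3, i.e., h = 2 in the Euler variable, can be observed directly; a minimal sketch of the Euler iteration for the Logistic ODE (the function name and the parameter values are illustrative):

```python
# Explicit Euler for y' = y*(1 - y); via yhat = (h/(1+h))*y this is the
# logistic map (8.2) with A = 1 + h. For h > 2 (A > A1 = 3), a stable spurious
# 2h-periodic difference solution replaces the true steady state y = 1.
def euler_logistic(h, y0, n):
    y = y0
    for _ in range(n):
        y = y + h * y * (1.0 - y)
    return y

# h = 0.1: the difference solution tends to the non-spurious steady state 1.
assert abs(euler_logistic(0.1, 0.5, 5000) - 1.0) < 1e-12

# h = 2.2 (A = 3.2): iterates settle on a 2-cycle; consecutive values differ
# markedly, while values two steps apart coincide: a spurious periodic solution.
a = euler_logistic(2.2, 0.5, 10000)
b = euler_logistic(2.2, 0.5, 10001)
c = euler_logistic(2.2, 0.5, 10002)
assert abs(a - c) < 1e-9 and abs(a - b) > 0.1
```

Note that the spurious 2-cycle is stable and attracts the computed trajectory, so a marching method run with too large a step size converges convincingly to a "solution" that solves no trajectory of the ODE.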
This point Ā is said to be the onset of chaos [21].
Remarks: 1.) In its parameter range of stability, a 2ᵏh-periodic difference solution, ŷ_per, is locally attractive. Concerning difference solutions ŷ asymptotically approaching ŷ_per, their set of initial values ŷ(0) shrinks as k increases. Consequently, it becomes more and more difficult to determine a 2ᵏh-periodic difference solution by means of a marching method (compare Subsection 7.6).
2.) Concerning the Logistic ODE (4.8) or some other scalar ODEs, applications of numerous discretizations have been investigated by Yee et al. (e.g., [1.56]). In all these cases, the authors have shown the existence (or occurrence) of spurious difference solutions ỹ_sp, either explicitly or by means of the computed difference approximations ŷ.
Since it is generic to the existence of real-valued spurious difference solutions, bifurcation theory will now be briefly reviewed, following Iserles et al. [1.25]. For this purpose, the special case of a stationary point y*_stat = y*_stat(0) of an IVP with ODEs and an unspecified initial vector y(0) is considered. Due to consistency, y*_stat is a root of the function F in the discretization
(8.3) yⱼ₊₁ = F(yⱼ, h).
The difference solution y*_stat is assumed to be locally stable for h ∈ (0, h_c). Bifurcation from y*_stat occurs (subject to various non-degeneracy conditions) when an eigenvalue of the Jacobian of F(y*_stat, h) passes through the unit circle in the complex plane ℂ as h ≤ h_c is replaced by h > h_c [1.25]. At h_c, a spurious difference solution then bifurcates from y*_stat. According to [1.25] and [1.16, p. 145-147], there are the following three possibilities:
Figure 8.4a: Transcritical bifurcation. Horizontal straight line: 1h-periodic difference solution y*_stat; curve: spurious difference solution ỹ_sp.
Figure 8.4b: Pitchfork bifurcation; see Figure 8.4a for the horizontal straight line and the curve.
(I) If an eigenvalue passes through +1, then a spurious fixed point (period 1h) of (8.3) bifurcates from y*_stat; this typically occurs as a transcritical bifurcation shown in Figure 8.4a;
(II) if an eigenvalue passes through −1, then a spurious solution of (8.3) with period 2h bifurcates from y*_stat; this is a period-doubling pitchfork bifurcation, see Figure 8.4b;
(III) if a pair of complex conjugate eigenvalues passes through the unit circle, then a spurious closed invariant curve ỹ_per for (8.3) bifurcates from y*_stat by means of a Hopf bifurcation.
Remark: The sequence {Aₖ} with Aₖ = A(hₖ) concerning (8.2) is an example of type (II) bifurcations.
In the case of a bifurcation of the types (I)-(III), a branch B bifurcating at a point h_c > 0 is of particular interest provided B is real-valued for h < h_c; the stationary point of the discretization, y*_stat = y*_stat(0), then is stable by assumption. The property of being real-valued may still be true for B as h→0; B then must be spurious. In fact, and by assumption, the "genuine" (non-spurious) difference solution ỹ(h, y_stat(0)) = y*_stat coincides with the stationary point y*_stat of the ODE. Generally, for a non-constant branch B concerning discretizations of explicit ODEs, the property of being spurious manifests itself in the property that the limit as h→0 of ỹ_sp(h) is not in D; see
- the subsequent Example 8.13 and
- Stuart's [1.50, p. 205] observations for IBVPs with parabolic PDEs (which also apply to ODEs): "Typically, as the mesh is refined, ... spurious solutions will either move off to infinity in the bifurcation diagram or ...".
Up to this point, the discussion has been confined to explicit discretizations. For implicit difference methods, there are additional ways to generate spurious difference solutions.
In fact, at any grid point tⱼ, the employed difference equations may possess more than one real-valued solution; consequently, each such point is a
(timewise) bifurcation point, see Examples 8.5 and 8.10. Another relevant observation is concerned with the distinction between true solutions ỹ of the discretization and their computed approximations ŷ. A (suitably defined) distance of ỹ and ŷ then may be large, particularly in the case that this discretization is unstable; see [15] for an example exhibiting the governing influence of the numerical precision. This situation gives rise to still another type of (pseudo-)spurious difference approximations.
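Case (II) above can be made concrete for the explicit Euler discretization of the Logistic ODE; a minimal sketch (the function name is illustrative):

```python
# For the explicit Euler map F(y, h) = y + h*y*(1 - y), i.e. f(y) = y*(1 - y),
# the Jacobian at the stationary point y* = 1 is F'(1, h) = 1 + h*f'(1) = 1 - h.
# It passes through -1 at h_c = 2: the period-doubling (type II) bifurcation
# behind the spurious 2h-periodic solutions of the logistic example (8.2).
def jacobian_at_fixed_point(h):
    return 1.0 - h

assert abs(jacobian_at_fixed_point(1.9)) < 1.0   # y* = 1 still locally stable
assert abs(jacobian_at_fixed_point(2.1)) > 1.0   # stability lost beyond h_c = 2
```

The bifurcation point h_c = 2 corresponds to A(h_c) = 1 + h_c = 3 = A₁ in the notation of (8.2).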
8.3 ON SPURIOUS DIFFERENCE SOLUTIONS OF ODEs WITH BOUNDARY CONDITIONS (OF PERIODICITY)
The consideration of a BVP may be motivated as follows: either by the search for a T-periodic solution of an ODE, making use of a boundary condition of periodicity (see (5.5) or (7.13)), or by the search for a time-independent (steady-state) solution of an IBVP; see Section 8.1 for practical consequences for the approximation of a non-spurious steady-state solution of an IBVP. The present subsection is concerned with BVPs consisting of nonlinear ODEs of the second order with either separated two-point Dirichlet boundary conditions (as in (5.2)) or boundary conditions of periodicity (as in (5.1)). The discretizations to be investigated employ
- equidistant grids,
- for the derivatives of the second order, the usual discretization possessing second order of consistency, and
- for derivatives of the first order, the usual forward, backward, or central difference quotients of first or second order of consistency.
Generally for BVPs, the verifications of the existence and the uniqueness of true solutions y* are major tasks. For the BVPs to be considered here, at least the existence of a (classical) true solution y* will be known. The problems to be investigated here are concerned with sequences of difference solutions ỹ = ỹ(h) which either
(a) as h→0, serve as a pointwise approximation of a true solution y*, or
(b) for all h ∈ (0, h̄), represent a sequence of spurious difference solutions, ỹ_sp(h).
Difference solutions of type (b) were first reported in the literature in 1974 in the paper [1.15] by Gaines. He
- employed an unstable discretization of a nonlinear BVP of the type under consideration here, and he
- discussed a sequence of spurious difference solutions which, as h→0, becomes more and more "pathological".
In 1977, Gaines' work [1.15] was followed by the one of Spreuer & Adams [1.44], which presents the subsequently treated three Examples 8.5, 8.10, and 8.13.
Example 8.5 ([1.45], see also [1.44]): The BVP
(8.6) −(y″)² + 12y′ = 0, y(0) = 0, y(1) = 7,
possesses only the following classical solutions:
(8.7) y*₁(x) := (x + 1)³ − 1
and
(8.8) y*₂(x) := (x − 2)³ + 8.
A consistent discretization, (8.9), was chosen.
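Both classical solutions can be verified against (8.6) directly; a small sanity check (not part of [1.45]):

```python
# Check of (8.7) and (8.8) against (8.6): for y = (x + a)**3 + const one has
# y'' = 6*(x + a) and 12*y' = 36*(x + a)**2, hence -(y'')**2 + 12*y' = 0.
def residual(x, a):
    return -(6.0 * (x + a)) ** 2 + 12.0 * (3.0 * (x + a) ** 2)

for k in range(5):
    x = k / 4.0                      # grid chosen so all values are exact binary
    assert residual(x, 1.0) == 0.0   # (8.7): y = (x + 1)**3 - 1, a = +1
    assert residual(x, -2.0) == 0.0  # (8.8): y = (x - 2)**3 + 8, a = -2

# Boundary values y(0) = 0 and y(1) = 7 hold for both solutions:
assert (0 + 1) ** 3 - 1 == 0 and (1 + 1) ** 3 - 1 == 7
assert (0 - 2) ** 3 + 8 == 0 and (1 - 2) ** 3 + 8 == 7
```

The additive constants −1 and +8 do not enter the residual, which is why exactly these two cubics can satisfy both the ODE and the boundary values.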
(8.9) [the displayed difference equations are not recoverable from the scan; the difference quotients named above are substituted into (8.6)].
In [1.45] it has been shown that
- (8.9) possesses a total of 2ⁿ⁻¹ difference solutions, where each one is determined by one of the 2ⁿ⁻¹ sign patterns in the solutions of the n − 1 quadratic equations for yⱼ₊₁ as following from (8.9);
- two of these solutions are non-spurious, and they approach the true solutions (8.7) and (8.8), respectively;
- 2ⁿ⁻¹ − 2 of these are spurious difference solutions;
- the alternating sign pattern +, −, +, −, ... yields a sequence approaching the limiting function 7x as h→0; this function does not satisfy the ODE in (8.6);
- for a sign pattern that is fixed independently of n ∈ ℕ, limiting functions y⁽ᴸ⁾ and y⁽ᴿ⁾ of a sequence of spurious difference solutions have been constructed, where
  - y⁽ᴸ⁾ is a polynomial which is valid for x ∈ [0, 1/2], and y⁽ᴿ⁾ is a polynomial valid for x ∈ [1/2, 1];
  - these polynomials and their one-sided derivatives of the first order possess coincident values at x = 1/2, respectively;
  - the one-sided derivatives of the second order possess a finite discontinuity at x = 1/2; consequently, y⁽ᴸ⁾ and y⁽ᴿ⁾ define a "nonclassical" solution of (8.6);
- in [1.45], there is also another sign pattern, again yielding functions y⁽ᴸ⁾ and y⁽ᴿ⁾ with these properties.
Remark: There are no spurious difference solutions provided discretizations identical with the ones in (8.9) are employed in the following equivalent explicit representation of the implicit ODE in (8.6): y″ ∓ √(12y′) = 0.
A sufficiently smooth system of ODEs y′ = f(y) is considered in conjunction with a two-point boundary condition such that there is a solution y*. Additionally, a sufficiently smooth discretization is considered that is consistent and stable in a neighborhood of y*. For this situation, a theorem in [28] asserts the pointwise convergence as h→0 of an h-dependent sequence of difference solutions. Therefore, the occurrence of 2ⁿ⁻¹ spurious difference solutions of (8.9) may be due either to
(A) the possibility that the discretization (8.9) is not stable, or
(B) in the case of the existence of a limiting function, to the possibility that this is not a true solution of the ODE.
Example 8.10 [1.44]: The BVP
(8.11) (y⁗)² − y‴ = 0, y‴(0) = y″(1) = 1, y(0) = y′(0) = 0,
does not possess a (classical) true solution. A consistent discretization is chosen employing the usual (central) difference quotients of the lowest order. This discretization possesses a non-denumerable infinity of difference solutions ỹ = ỹ(h, c) which are defined by
(8.12) ỹ₂ⱼ := g(2jh), ỹ₂ⱼ₊₁ := g((2j + 1)h) + h⁴/8, with g(x) := (x³ + cx² − h²x)/6, for all c ∈ ℝ.
As h→0, the pointwise convergence of ỹ = ỹ(h, c) to the function y(x) := (x³ + cx²)/6 is obvious. Consequently, (8.12) represents infinitely many sequences of spurious difference solutions. Here too, the spurious character may be due to either one of the reasons (A) or (B).
So far, the following two kinds of spurious difference solutions have been discussed here with respect to (consistent) discretizations of ODEs:
(a) either, as h→0, a sequence ỹ = ỹ(h), or a corresponding sequence concerning an employed difference quotient of ỹ, does not converge to a continuous limiting function such as
the ones discussed in the case of IVPs or the ones reported by Gaines [1.15] in the case of BVPs, or
(b) all the sequences referred to in (a) converge to continuous limiting functions and their respective continuous derivatives; however, these functions are not solutions of the ODEs.
Example 8.13 [1.44]: The BVP
(8.14) −2y″ + (y′)³ + y = g(x); y(0) = y(1) = 1; g(x) := 6 − 5x + 11x² − 8x³,
possesses the solution
(8.15) y*(x) := 1 + x(1 − x).
Since this BVP is inverse-monotone ([1.54] and [1]), y* is the only (classical) solution of (8.14). A consistent discretization of (8.14) is chosen in (8.16).
(8.16) -2[Yi'l
-zi
+
+
[TI +3 = g(jh) for
yi-l]
yj
j = l(1)n - 1; yo = 1, y n = 1; h = l/n; ncl is even. This system is not inverse-monotone for any ndN. The following candidate for a difference solution y is chosen as a function of free parameters Pj E IR: (8.17) yj: = 1 + ((-l)j - 1)h1I2 + ,djh for j = O(1)n; ncl is even; PO = Pn = 0; PjdR for j = l(1)n - 1 is suitably determined. The boundedness of the parameters Pj(h) as h+O is shown in Appendix 8.6. Therefore, as h+O, the sequence of difference solutions defined in (8.17) converges to the function y(x): = 1 which does not satisfy the ODE. The (employed) 0 difference quotients are not bounded as h+O. Remarks: 1.1 In the limit as h+O, (8.17) for all jdN defines the stationary point yj = 1 which satisfies (8.16) subsequent to its multiplication by h2. Provided, this version of (8.16) is employed, y j = 1 almost satisfies this system of equations for any sufEicient ly small h > 0. Depending on riel with n 2 20, H.Spreuer [30] has determined numerous additional spurious difference solutions of (8.16), each for a fixed value of h = l/n. For this purpose, Spreuer employed a shooting method which, starting at Yn = 1, satisfies the condition yo = yo(yn-l) = 1by means of an iterative determination of Yn-1. ExamDle 8.18: The function (8.19) y*(x): = 1 + ( - l ) k ( ~- k)(l - x + k) for x e k k + 11for all k d is a non-classical2-periodic solution of the ODE
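The exact cancellation behind the candidate (8.17) can be checked numerically. The sketch below is not from the paper; it assumes that (8.16) employs the central second difference quotient for y'' and the forward difference quotient for y' (a reading consistent with the structure of (8.31) in Appendix 8.6). At the candidate with all β_j = 0, the two O(h^{−3/2}) contributions cancel exactly, so the residual of (8.16) remains bounded as h → 0 even though the individual difference quotients blow up.

```python
# Sketch (assumption: forward difference quotient for y' in (8.16)).
# Candidate y_j = 1 + ((-1)^j - 1)*sqrt(h), i.e. (8.17) with all beta_j = 0.
import math

def g(x):                     # right-hand side from (8.14)
    return 6 - 5*x + 11*x**2 - 8*x**3

def residuals(n):
    h = 1.0 / n
    y = [1 + ((-1)**j - 1) * math.sqrt(h) for j in range(n + 1)]
    res, curv = [], []
    for j in range(1, n):
        second = (y[j+1] - 2*y[j] + y[j-1]) / h**2   # grows like h**(-3/2)
        forward = (y[j+1] - y[j]) / h                # grows like h**(-1/2)
        res.append(abs(-2*second + forward**3 + y[j] - g(j*h)))
        curv.append(abs(second))
    return max(res), max(curv)

for n in (20, 40, 80, 160):
    r, c = residuals(n)
    print(f"n = {n:3d}: max residual = {r:7.4f}, max second-difference quotient = {c:10.1f}")
```

The residual column stays of the order of max|1 + ((−1)^j − 1)h^{1/2} − g(jh)| while the quotient column grows like h^{−3/2}; the bounded corrections β_j h of (8.17) then only have to absorb an O(1) defect.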
Discretizations: Practical Failures
(8.20)  −2y'' + (y')³ + y = g(x) for x ∈ ℝ, with g(x) := −2y*''(x) + (y*'(x))³ + y*(x).
A corresponding generalization of (8.17) generates a sequence of spurious difference solutions. As h → 0, this sequence approaches the "steady-state difference solution" y(x) := 1, which does not satisfy the ODE. □
In both Example 8.13 and Example 8.18, the spurious nature of the difference solutions ỹ = ỹ(h) is suggested by means of the heuristic test (γ) referred to in Subsection 8.1.
Concerning (consistent) discretizations of nonlinear BVPs belonging to certain classes, the existence of spurious difference solutions can be excluded provided the sufficient conditions are satisfied in theorems which were proved in 1981 by Beyn & Doedel [9]. In 1981, Peitgen et al. [26] investigated BVPs
(8.21)  y'' + μf(y) = 0,  y(0) = y(π) = 0,  μ ∈ ℝ,
making use of the symmetric discretization of y'' of the lowest order. Provided there are n ∈ ℕ equidistant grid points in (0,π) and f possesses k zeros in this interval, one of the results in [26] asserts the existence of kⁿ difference solutions, almost all of which are "numerically irrelevant", i.e., spurious. In [26], they are "characterized in terms of singularities of certain embeddings of finite difference approximations with varying meshsize, where the meshsize is understood as a homotopy parameter."
8.4 ON SPURIOUS DIFFERENCE SOLUTIONS OF DISCRETIZATIONS OF IBVPS
The discussions of spurious difference solutions concerning ODEs carry over to the case of IBVPs, provided traditional numerical methods are used. In fact, (a) IVPs then arise by means of any one of the traditional approximation methods listed at the end of Section 1; (b) BVPs then arise either by means of a horizontal method of lines or as a vehicle for the determination of a steady-state solution, see Subsection 8.1. Concerning (a), see Subsection 9.4.
Example 8.22 (for the case (b)): The BVP (8.14) is now generalized to become a nonlinear hyperbolic IBVP with non-specified initial functions at t = 0:
(8.23)  y_tt + c·y_t − 2y_xx + (y_x)³ + y = g(x) for (x,t) ∈ D := [0,1] × [0,T], with c ∈ ℝ; y(0,t) = y(1,t) = 1; y(x,0) and y_t(x,0) are free, with the exception of y(0,0) = y(1,0) = 1 and y_t(0,0) = y_t(1,0) = 0; g(x) as defined in (8.14).
An explicit discretization of (8.23) is chosen with step sizes h = 1/n and k ∈ ℝ⁺ which, concerning y_x and y_xx, is identical with the one in (8.16) and which, concerning y_t and y_tt, employs difference quotients that are formally identical with the ones for y_x and y_xx, respectively. Obviously, (8.17) then represents an h-dependent steady-state sequence of spurious difference solutions, ỹ_sp, of this discretization. This sequence, {ỹ_sp}, is locally attractive provided c is sufficiently large. Correspondingly, there are initial sets {y(x_j,0) | j = 1(1)n−1} and {y_t(x_j,0) | j = 1(1)n−1} such that the difference solutions with these initial data approach the spurious difference solution ỹ_sp as defined in (8.17). Because of the increasingly "pathological" character of (8.17) as h → 0, the same must be true for the initial sets just referred to. This problem is of practical relevance in the case of fixed choices of h = 1/n and k. In fact, a time-dependent difference approximation then may approach the time-independent spurious difference solution ỹ_sp as defined in (8.17). □
Remark: Concerning the "pathological" character, compare Remark 1.) subsequent to (8.2). □
Stuart ([1.50] and [31]) has investigated discretizations of the following parabolic IBVP:
(8.24)  y_t − y_xx − cf(y,y_x) = 0 for c and D as defined in (8.23); f(0,0) = 0; y(0,t) = y(1,t) = 0; y(x,0) = y_0(x) is given.
The equidistant grid and the discretization are chosen just as the ones with respect to (8.23). For the case of y(x,0) = 0, Stuart considers the trivial steady-state solution y*(x,t) = 0 on D. In [31, p.473], he asserts the validity of the following implication:
(8.25) provided the discretization is linearly unstable at ỹ = 0 (for a choice of step sizes), then there exist spurious periodic difference solutions
possessing the following properties:
• they are real-valued even for arbitrarily small time steps k [31, p.483], and
• their norm tends to infinity as k → 0 [31, p.483].
In the conclusions of [31], Stuart raises the question: "What classes of initial data will be affected by the spurious periodic solutions... For general initial data, the question is open and indeed it is not well defined until a precise meaning is attached to the word 'affected'". According to [1.50, p.192], the performance of the discretizations under discussion is governed by "the nonlinear interaction of a high wave number mode, which is a product of the discretization, and a low wave number mode present in the governing differential equation", i.e., the PDE.
High-frequency oscillations of a computed approximation arise in numerous contexts, and they have to be removed by means of suitable smoothing operations. Examples are: (i) spurious pressure modes in applications of spectral methods in Computational Fluid Dynamics [1.8], or (ii) Gibbs oscillations due to the truncation of an Ansatz, or (iii) multigrid methods for systems of linear algebraic equations, or (iv) discretizations of nonlinear hyperbolic or parabolic IBVPs, etc.
Concerning (i), the following nonlinear parabolic IBVP is discussed in [2]:
(8.26)  y_t = (f(y)y_x)_x + g(y) + γ(x,t) on D as defined in (8.23); f: ℝ → ℝ⁺, g: ℝ → ℝ, γ: D → ℝ; f, g, y_0 are sufficiently smooth; boundary and initial conditions corresponding to the ones in (8.24).
In dependence on f and g, the functions y_0(x) = y(x,0) and γ = γ(x,t) are chosen such that
(8.27)  y*(x,t) := e^{−ct} sin(πx), with any c ∈ ℝ⁺ and (x,t) ∈ D,
is a solution of (8.26). An equidistant grid is chosen, making use of
• h := (n+1)⁻¹ and k := T/N for any n, N ∈ ℕ, T ∈ ℝ⁺, and
• grid points x_i = ih and t_j = jk for i = 0(1)n+1 and j = 0(1)N, respectively.
A consistent implicit discretization with two time levels t_j and t_{j−1} is chosen for the determination of an approximation ỹ_j := (y_{1j},...,y_{nj})ᵀ ∈ ℝⁿ of y*_j := (y*(x_1,t_j),...,y*(x_n,t_j))ᵀ:
(8.28)  F(ỹ_j, ỹ_{j−1}) = f̂_j + z_j for j ∈ ℕ.
The vector f̂_j follows from the data. Local errors of all kinds are represented qualitatively by the vectors z_j. The solution ỹ_j of (8.28) depends on r := k/h², f, g, and z := (z_1,...,z_j)ᵀ ∈ ℝ^{jn}. In [2], the function ỹ_j = ỹ_j(z) is approximated by means of a linear Taylor polynomial. The partial derivative Z_j^{(μ,ν)} := ∂ỹ_j/∂z_{μν} is the solution of a system of linear algebraic equations:
(8.29)  P_j Z_j^{(μ,ν)} = Z_{j−1}^{(μ,ν)} + δ_{jν} e^{(μ)}, with P_j ∈ L(ℝⁿ); δ_{iμ} the Kronecker symbol; e^{(μ)} := (δ_{iμ}) ∈ ℝⁿ; i, μ ∈ {1,...,n}; ν ∈ {1,...,j}; and P_j = P_j(r, f, g).
For "favorable choices" of f and g, P_j is an M-matrix [1.37]. This property is valid for all (meaningful) choices of h ∈ (0, 1/2] and k > 0 provided that f ∈ ℝ⁺ and g ∈ ℝ. This is well known, since (8.26) then is linear. □
Remark: Since (8.26) employs a PDE, it is not possible to enclose Z_j simultaneously with the solution of this IBVP. Such a simultaneous enclosure is the basis of Lohner's enclosure algorithm for ODEs ([1.31], [1.32]). □
By use of matrix theory and numerical experiments concerning (8.28) and (8.29), the author of the present paper observed in [2] that
(i) ‖y*_j − ỹ_j‖_∞ is a slowly increasing linear function of j if P_j is an M-matrix for j = 1(1)ĵ;
(ii) ‖y*_j − ỹ_j‖_∞ is a strongly growing linear or nonlinear function of j subsequent to the (always irreversible) transition at ĵ from the presence to the absence of this property;
(iii) prior to this transition, "spurious oscillations" of the computed sequences of the vectors ỹ_j and Z_j were observed, with respect to both x_i and t_j; the amplitudes of these oscillations were small or even decreasing when P_j remained an M-matrix as j increased; this can always be enforced by a sufficiently small local choice of the time step k.
The spurious character of oscillating computed vectors ỹ_j and Z_j follows from the facts that
• the true solution y*, to be approximated by the vectors ỹ_j, is monotone and
• this then is true for the corresponding vectors Z_j solving (8.29).
In [2], the empirically recognized importance of the M-matrix property of P_j has been theoretically substantiated by means of the theory of M-matrices. In [2], identical empirical conclusions concerning P_j were also drawn for a three-level implicit discretization of the hyperbolic IBVP simulating nonlinear vibrations of a string fixed at both ends [7, p.201].
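As a small illustration of this heuristic test (a hypothetical sketch, not the code of [2]): for the implicit Euler discretization of the linear model problem y_t = y_xx — i.e., (8.26) with constant f ≡ 1 and g ≡ 0, γ ≡ 0 — the two-level matrix P = I + rA with A = tridiag(−1, 2, −1) and r = k/h² is an M-matrix for every r > 0. The check below verifies the sign pattern and the nonnegativity of the inverse directly.

```python
# Sketch of the heuristic M-matrix test: implicit Euler for y_t = y_xx gives
# the two-level system matrix P = I + r*A, A = tridiag(-1, 2, -1), r = k/h^2.
def system_matrix(n, r):
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][i] = 1 + 2 * r
        if i > 0:
            P[i][i - 1] = -r
        if i < n - 1:
            P[i][i + 1] = -r
    return P

def inverse(P):
    # Gauss-Jordan inverse (adequate for a small demonstration)
    n = len(P)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(P)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        A[c] = [v / A[c][c] for v in A[c]]
        for i in range(n):
            if i != c:
                A[i] = [u - A[i][c] * v for u, v in zip(A[i], A[c])]
    return [row[n:] for row in A]

def is_m_matrix(P, tol=1e-12):
    n = len(P)
    signs = all(P[i][i] > 0 and all(P[i][j] <= 0 for j in range(n) if j != i)
                for i in range(n))
    inv_nonneg = all(v >= -tol for row in inverse(P) for v in row)
    return signs and inv_nonneg

print(is_m_matrix(system_matrix(8, 0.5)), is_m_matrix(system_matrix(8, 50.0)))
```

For this linear problem the property holds for all r > 0; in the nonlinear case (8.26), monitoring when the computed P_j loses this sign/inverse structure is exactly the transition referred to in (ii) above.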
8.5 CONCLUDING REMARKS
The existence of spurious difference solutions ỹ_sp has the following practical implications:
(a) they are true solutions of a chosen discretization; consequently, and on the level of the discretization, they cannot be distinguished from non-spurious difference solutions ỹ, not even when Enclosure Methods are employed on this level; in this context, see Subsection 7.4;
(b) for discretizations of ODEs with initial conditions and as h → 0, a sequence of spurious difference solutions ỹ_sp = ỹ_sp(h) approaches a limit outside of the domain D of f in the ODE y' = f(y); for a practically chosen h, the computed approximation ỹ_sp may still be in D; a spurious character perhaps then can be detected by means of the tests (α)–(γ) referred to in Subsection 8.1;
(c) for discretizations of ODEs with boundary conditions, there are the classes (a) and (b) of spurious difference solutions referred to in Subsection 8.3; class (b) cannot be detected on the level of the discretization, not even by means of the tests (α)–(γ) in Subsection 8.1;
(d) concerning spurious difference solutions in the case of IBVPs, the relatively small body of knowledge seems to indicate a situation comparable to the one in the case of ODEs; as a practical (heuristic) test in the case of implicit discretizations, the M-matrix property may be used, which is related to Enclosure Methods on the level of the discretization;
(e) the practically most important influence of spurious difference solutions is the one with respect to the size of the domain of attraction of a (difference) solution to be approximated;
(f) the situation addressed by (a)–(c) calls for a (spot-check) application of Enclosure Methods with respect to solutions of DEs to be approximated; concerning PDEs, the status of these methods is characterized in Section 1.
8.6 APPENDIX: A SUPPLEMENT TO EXAMPLE 8.13
The practical relevance of the sequence of spurious difference solutions in (8.17) suggests a presentation of the complete verification of the employed analysis. The proof [30] of the subsequent Lemma 8.30 has not been published before.
Lemma 8.30: The sequence of parameters β_j in (8.17) is uniformly bounded for j = 1(1)n−1 and all n ∈ ℕ.
Proof:
(I) The Boundedness of the Auxiliary Sequence {Δβ_j}
Making use of Δβ_j := β_j − β_{j−1} and β_j = Σ_{k=1}^{j} Δβ_k, (8.16) can be represented as follows:
(8.31)  10Δβ_{j+1} + 2Δβ_j + 6(−1)^{j+1} h^{1/2} (Δβ_{j+1})² + h(Δβ_{j+1})³ + h² Σ_{k=1}^{j} Δβ_k = hg(jh) − ((−1)^{j+1} + 1)h^{3/2} =: hĝ(jh) for j = 0(1)n−1,
and β_0 = 0 and β_n = Σ_{k=1}^{n} Δβ_k = 0.
Consequently, the roots Δβ_{j+1} of the following equation have to be determined:
(8.32)  G_j(Δβ_{j+1}) := Δβ_{j+1} − A_{j+1}/B_{j+1} = 0 for j = 1(1)n−1, where
A_{j+1} := hĝ(jh) − 2Δβ_j − h² Σ_{k=1}^{j} Δβ_k  and  B_{j+1} := 10 + 6(−1)^{j+1} h^{1/2} Δβ_{j+1} + h(Δβ_{j+1})².
Making use of the existence of an M ∈ ℝ such that |ĝ(jh)| ≤ M, it is assumed that there is a K ∈ ℝ⁺ independent of h such that |Δβ_1| ≤ K; e.g., M = 5.2. Without a loss of generality, h will be confined to the interval (0,h_0], where h_0 := Min{K/(M+K), (10K)⁻²}. The following assumption of an induction is chosen:
(8.33)  |Δβ_j| ≤ K for j ≥ 2.
There follow
(8.34)  G_j(K/3) > 0 and G_j(−K/3) < 0 for j ≥ 1.
In the general step of the induction, the existence of a root Δβ_{j+1} of (8.31) is shown such that |Δβ_{j+1}| ≤ K/3 for j ≥ 1.
(II) The Boundedness of the Sequence {β_j}
The estimate (8.33) will now be sharpened. For this purpose, (8.31) and the estimates leading to (8.33) are employed to yield
(8.35)  |Δβ_{j+1}| ≤ h(M + K)/7 + (2/9)^j K.
There holds
(8.36)  β_n = Σ_{k=1}^{n} Δβ_k ≥ |Δβ_1| − Σ_{k=2}^{n} |Δβ_k|.
The estimate
(8.37)  Σ_{k=2}^{n} |Δβ_k| ≤ ½K
is satisfied for K = M/5, which allows the choice of K = 1. Consequently, there follow β_n > 0 (or < 0) for Δβ_1 = 1 (or −1). Therefore, |β_j| is bounded because of
(8.38)  |β_j| = |Σ_{k=1}^{j} Δβ_k| ≤ Σ_{k=1}^{j} |Δβ_k| for j = 1(1)n,
and this sum is bounded since (8.37) has been satisfied by the choices Δβ_1 = 1 or −1.
(III) The Continuous Dependency of β_n on β_1
The sequences {Δβ̃_{j+1}} and {Δβ_{j+1}} are considered in their dependencies on β̃_1 = Δβ̃_1 and β_1 = Δβ_1, respectively. Making use of d_j := Δβ̃_j − Δβ_j, (8.31) implies that
(8.39)  d_{j+1} = C_{j+1}/D_{j+1}, where C_{j+1} := −h² Σ_{k=1}^{j} d_k − 2d_j and
D_{j+1} := 10 + 6(−1)^{j+1} h^{1/2} (Δβ̃_{j+1} + Δβ_{j+1}) + h·E_{j+1}, where E_{j+1} := (Δβ̃_{j+1})² + Δβ̃_{j+1}·Δβ_{j+1} + (Δβ_{j+1})².
Since |Δβ̃_{j+1}|, |Δβ_{j+1}| ≤ K/3, there follows D_{j+1} > 9 for j ≥ 1. If |d_k| < δ for k = 1(1)j, then (8.39) implies that
(8.40)  |d_{j+1}| < (hδ + 2δ)/9 < δ.
Therefore, |d_j| < δ for j = 2(1)n if |d_1| = δ and, because of (8.39),
(8.41)  |d_{j+1}| < (hδ + 2|d_j|)/9.
Consequently, in analogy to the estimates leading to (8.35),
(8.42)  |d_{j+1}| ≤ hδ/7 + (2/9)^j δ
and
(8.43)  |β̃_n − β_n| ≤ (17/7)δ.
Therefore,
(8.44)  |β̃_n − β_n| ≤ (17/7)|β̃_1 − β_1|.
(IV) Existence of a β_1 such that β_n = 0
There exists a β_1 = Δβ_1 such that β_n = Σ_{k=1}^{n} Δβ_k = 0 because of
• (8.38) and the implication as stated subsequent to (8.37), and
• the continuous dependency of β_n on β_1 as asserted by (8.44).
This β_1, such that β_n = 0, can be determined by means of
• a shooting method, starting with β_1 and, for j = 1(1)n−1,
• the determination of a root Δβ_{j+1} of (8.32).
This completes the proof of the existence and the uniform boundedness of the sequence of parameters β_1,...,β_{n−1} for all n ∈ ℕ, which are introduced in (8.17). □
9. ON DIVERTING DIFFERENCE APPROXIMATIONS CONCERNING ODEs OR PDEs
9.1 DIVERSIONS IN THE CASE OF DISCRETIZATIONS OF ODEs
Diversions of difference approximations ỹ have been defined as item (d) in the beginning of Section 2. If the condition starting with "such that ..." were absent in (d), the phenomenon would be irrelevant. In fact, as t_j increases or decreases, ỹ at t_j continuously coincides with locally different true solutions y* = y*(t_j). Diversions of difference approximations ỹ are caused by the interaction of
(a) topographically suitable situations and
(b) local discretization or local rounding errors acting as perturbations triggering a deviation that becomes a diversion, either immediately or at some later time.
As has been discussed in Section 3, the situations addressed in (a) are in particular
(a1) neighborhoods of poles of the ODEs, where the cause and the effect are almost coincident in this neighborhood, and
(a2) the presence of an (n−1)-dimensional stable manifold M_s (in the n-dimensional phase space of the ODEs) which is penetrated due to one or several consecutive local numerical errors, followed by the subsequent manifestation of the diversion in a neighborhood of the unstable manifold M_u.
The triggering phenomenon (b) therefore may be due to a single or a few consecutive local errors, which are unavoidable in the execution of the computational determination of ỹ. This phenomenon can be characterized as follows:
• the strong causality principle addressed in Section 1 is violated;
• diversions are possible for arbitrarily small h > 0 and an arbitrarily high numerical precision, provided a total error control is absent;
• the triggering process is uncontrollable and therefore random, since it depends on the interplay of the choices of the numerical method and its artificial parameters with the employed computer and compiler;
• the randomness of the phenomenon indicates the futility of a search for a mathematical theory for the (non-)occurrence of diversions.
Remarks: 1.) Concerning this randomness, see Subsections 7.3 and 7.4.
2.) Spurious difference solutions ỹ_sp of the following kinds have found particular attention in the literature ([1.50] or [1.57]): they are either stationary points or periodic solutions. It is then possible that there exist (n−1)-dimensional stable sets M̃_s and unstable sets M̃_u, both belonging to the difference solution ỹ_sp under consideration. Rounding errors then may trigger the diversion of a computed difference approximation ỹ.
9.2 EXAMPLES FOR THE OCCURRENCE OF DIVERTING DIFFERENCE APPROXIMATIONS IN THE CASE OF THE RESTRICTED THREE BODY PROBLEM
The following idealization of Celestial Mechanics is considered [33, p.5]:
(i) the orbits of the earth E and the moon M are confined to a plane in ℝ³;
(ii) in this plane, there is a suitably rotating Cartesian y_1-y_2-basis whose origin is attached to the center of gravity, C, of E and M; the points E, C, and M are on the y_1-axis; in Figure 9.4, C and E are almost coincident since
(iii) the position of C relative to E and M is determined by the ratio μ = 1/82.45 of the masses of M and E; consequently, −μ is the location of E and λ := 1 − μ is the location of M on the y_1-axis;
(iv) in the y_1-y_2-plane, trajectories of a small satellite S are to be determined;
(v) for these trajectories, the phase space possesses the Cartesian coordinates y_1, y_2, y_3 := y'_1, and y_4 := y'_2.
For the Restricted Three Body Problem defined by (i), (ii), and (iv), (v), the equations of motion are as follows [33, p.5] in the employed rotating basis:
(9.1)  y'_1 = y_3,  y'_2 = y_4,
y'_3 = y_1 + 2y_4 − λ(y_1 + μ)/r_1³ − μ(y_1 − λ)/r_2³,
y'_4 = y_2 − 2y_3 − λy_2/r_1³ − μy_2/r_2³,
where r_1 := ((y_1 + μ)² + y_2²)^{1/2} and r_2 := ((y_1 − λ)² + y_2²)^{1/2}.
For any true solution y* = (y*_1, y*_2, y*_3, y*_4)ᵀ of (9.1), the Jacobi integral, J, takes a fixed value:
(9.2)  J := (y_3)² + (y_4)² − (y_1)² − (y_2)² − 2λ/r_1 − 2μ/r_2.
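The setup is easy to reproduce in outline. The sketch below is not from the paper: it integrates (9.1) with a classical fourth-order Runge-Kutta method at the step size h = 10⁻³ used in the text, starting from the point of (9.3) below, and monitors J along a short initial segment of the orbit. Since the printed form of (9.2) is garbled in this copy, the standard rotating-frame expression J = y_3² + y_4² − y_1² − y_2² − 2λ/r_1 − 2μ/r_2 is assumed.

```python
# Sketch: classical RK4 on the restricted three-body equations (9.1),
# monitoring the (assumed) Jacobi integral J along a short orbit segment.
import math

MU = 1 / 82.45          # mass ratio moon/earth from the text
LAM = 1 - MU

def f(y):
    y1, y2, y3, y4 = y
    r1 = math.hypot(y1 + MU, y2)
    r2 = math.hypot(y1 - LAM, y2)
    return (y3,
            y4,
            y1 + 2*y4 - LAM*(y1 + MU)/r1**3 - MU*(y1 - LAM)/r2**3,
            y2 - 2*y3 - LAM*y2/r1**3 - MU*y2/r2**3)

def jacobi(y):
    y1, y2, y3, y4 = y
    r1 = math.hypot(y1 + MU, y2)
    r2 = math.hypot(y1 - LAM, y2)
    return y3**2 + y4**2 - y1**2 - y2**2 - 2*LAM/r1 - 2*MU/r2

def rk4_step(y, h):
    def ax(u, k, c):
        return tuple(ui + c*ki for ui, ki in zip(u, k))
    k1 = f(y)
    k2 = f(ax(y, k1, h/2))
    k3 = f(ax(y, k2, h/2))
    k4 = f(ax(y, k3, h))
    return tuple(yi + h/6*(a + 2*b + 2*c + d)
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

y = (1.2, 0.0, 0.0, -1.04935750983)   # starting point (9.3)
J0 = jacobi(y)
h = 1.0e-3
for _ in range(200):                   # short segment, away from the pole at E
    y = rk4_step(y, h)
print(f"J(0) = {J0:.9f}, drift after t = 0.2: {abs(jacobi(y) - J0):.3e}")
```

Away from the poles, J is conserved to roundoff accuracy; it is exactly the near-pole passages discussed below where a single step can change J substantially and divert the computed orbit.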
In agreement with numerous papers in the literature (e.g. [10]), the following starting point is chosen:
(9.3)  y_p(0) := (1.2, 0, 0, −1.04935750983)ᵀ.
Starting at y_p(0) and for the ODEs (9.1), a classical Runge-Kutta method with step size h = 10⁻³ yielded an orbit ỹ_p whose projection into the y_1-y_2-plane is displayed in Figure 9.4 (see Figure 1 of the paper [1.39] in this volume). It is noted that the superscript double tilde presently used corresponds to the superscript single tilde in [1.39]. With much more than graphical accuracy, applications of "high-precision" difference methods in the literature have yielded (almost) closed orbits with representation ỹ_p in Figure 9.4. These orbits (almost) return to y_p(0) at the time t = T̃ := 6.192169331396 [10]. Therefore, ỹ_p is believed to be an approximation of a hypothetical T-periodic solution, y*_per, of (9.1) with period T ≈ T̃.
Figure 9.4: Projection into the y_1-y_2-plane of the almost closed orbit ỹ_p with starting point y_p(0) in (9.3); determination by means of the classical Runge-Kutta method with step size h = 10⁻³.
In a numerical experiment referred to in [1.39], the classical Runge-Kutta method with h = 5·10⁻³ was used to approximate the true solution y* of (9.1) starting at (9.3) by means of a computed difference approximation ỹ_q, which is displayed in Figure 9.5A; see Figure 3 in [1.39]. The sharp tip of one of the loops in Figure 9.5A is a consequence of the projection of the orbit from ℝ⁴ into ℝ². For an investigation of the gross deviation of ỹ_q from ỹ_p with y_q(0) = y_p(0), the values J̃ of the Jacobi integral, J, were computed at the grid points employed in the
determination of ỹ_q. Figure 9.5B displays the results J̃ for the leading four loops depicted in Figure 9.5A. It is seen
Figure 9.5A: Projection into the y_1-y_2-plane of the orbit ỹ_q with starting point y_q(0) in (9.3); determination by means of the classical Runge-Kutta method with step size h = 5·10⁻³.
Figure 9.5B: Leading four loops of ỹ_q from Figure 9.5A, with the computed values of the Jacobi integral listed: J = −2.11403..., J = −2.114025..., J = −2.00720..., J = −1.788917... .
that the change ΔJ of J is
• negligibly small past any individual loop; however,
• there is a large decrease ΔJ in the transfer from the (i−1)-th loop to the i-th loop; this transfer takes place in a small neighborhood of the pole E of the ODEs (9.1).
This suggests that ỹ_q as displayed in Figures 9.5A and 9.5B has been generated by a sequence of diversions, each one taking place in a small neighborhood of E. Close to E, four initial vectors η^(1),...,η^(4) were chosen as follows:
• they coincide with values computed for ỹ_q at certain grid points such that, at η^(i), ỹ_q is beginning to traverse one of the leading loops shown in Figure 9.5B.
Figure 9.6: Projection into the y_1-y_2-plane of the enclosed true orbit y*^(3), starting at the point η^(3) of the orbit displayed in Figure 9.5A.
The true orbit y*^(i) starting at a fixed position η^(i) was enclosed, making use of the step size control developed in [1.38]; see [1.39]. Figure 9.6 depicts the enclosure of y*^(3). At η^(3), ỹ_q diverts from y*^(3) to the continuation of ỹ_q as shown in Figure 9.5A. Consequently, the occurrence of a diversion of the difference approximation ỹ_q has been verified and its properties have been demonstrated.
Remarks: 1.) Figure 2 in [1.39] presents the enclosure of the true orbit y*^(2) starting at η^(2), which is listed in (16) in [1.39].
2.) Concerning the (almost) closed orbit ỹ_p in Figure 9.4, the loops in Figure 9.5A are spurious.
3.) Swing-by maneuvers of space vehicles are executed in near neighborhoods of planets. Because of the proximity of a pole of the ODEs of Celestial Mechanics, only the employment of Enclosure Methods can reliably avoid diversions of computed orbits. For real space vehicles, in-flight trajectory corrections can be carried out by use of engines and a sufficient supply of fuel. Consequently, and at the expense of the payload, in-flight corrections of orbits are possible which have been computed incorrectly in advance.
Remarks: 1.) Concerning diversions of difference solutions for (9.1), there is a detailed discussion of the early stages of the investigations of this problem in [1.3]; additionally, in [1.3] there are numerous graphs presenting sequences of loops as caused by diversions.
2.) Figure 1 in [27, p.2] displays a pattern of loops that is closely related to the one in Figure 9.5A. As an explanation of the loops, the legend of this Figure 1 refers to "the chaotic motion of a small planet around two suns...". This is an example for the interpretation of Computational Chaos as Dynamical Chaos in the set of true solutions of the ODEs of this problem.
9.3 EXAMPLES FOR THE OCCURRENCE OF DIVERTING DIFFERENCE APPROXIMATIONS IN THE CASE OF THE LORENZ EQUATIONS
In Subsection 5.4, there is a review of Kühn's work [1.29] on enclosed (and verified) periodic solutions of the Lorenz equations (5.8); see also [1.3]. Additionally, in [1.29] Kühn has studied certain aperiodic solutions of the Lorenz equations, making use of the choices b = 8/3, r = 28, σ = 6, and employing
(a) Enclosure Methods for a determination of true, (presumably aperiodic) solutions y* and, simultaneously,
(b) Runge-Kutta methods for a determination of corresponding difference approximations ỹ.
Both (a) and (b) were executed by use of a C-compiler running on an HP Vectra. Figure 9.7 depicts the projection into the y_1-y_2-plane of an enclosed aperiodic true solution y* and its approximation ỹ, both starting at a point y_0 close to the one displayed in Figure 5.19. The width of the computed enclosure is smaller than ... . In Figure 9.7,
(A) the enclosed true solution y* is demarcated symbolically by boxes and
(B) the solid line represents a difference approximation ỹ which was determined by use of a classical Runge-Kutta method with a step size h = h_0 := 1/128.
As y* and ỹ have (almost) reached the stationary point (0,0,0)ᵀ, they begin to separate for the remainder of the interval [0,t_∞] for which they have been determined. This can be explained by a diversion of ỹ at the stable manifold M_s of (0,0,0)ᵀ, taking place before ỹ comes close to this point. This diversion occurs presumably as ỹ penetrates the stable manifold in one time step. It is remarkable
Figure 9.7: Projection into the y_1-y_2-plane of the enclosed true solution y* (boxes) and of its approximation ỹ by means of the classical Runge-Kutta method with step size h = h_0 = 1/128 (solid line); y* and ỹ start at a point y_0 close to the one displayed in Figure 5.19.
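Kühn's enclosure computations cannot be reproduced in a few lines, but the step-size dependence of plain Runge-Kutta approximations is easy to observe. The sketch below is hypothetical: it assumes the standard form of the Lorenz equations for (5.8) and an arbitrary starting point, since the point of Figure 5.19 is not reproduced here; the parameters b = 8/3, r = 28, σ = 6 are those of the text. It integrates the system at two step sizes and prints their Euclidean distance over time.

```python
# Sketch: two RK4 approximations of a Lorenz trajectory at different step
# sizes; their distance is tiny at first and can grow by orders of magnitude.
import math

B, R, SIGMA = 8 / 3, 28.0, 6.0   # parameters from the text

def lorenz(y):
    y1, y2, y3 = y
    return (SIGMA * (y2 - y1), R * y1 - y2 - y1 * y3, y1 * y2 - B * y3)

def rk4(y, h, steps):
    def ax(u, k, c):
        return tuple(ui + c * ki for ui, ki in zip(u, k))
    for _ in range(steps):
        k1 = lorenz(y)
        k2 = lorenz(ax(y, k1, h / 2))
        k3 = lorenz(ax(y, k2, h / 2))
        k4 = lorenz(ax(y, k3, h))
        y = tuple(yi + h / 6 * (a + 2 * b + 2 * c + d)
                  for yi, a, b, c, d in zip(y, k1, k2, k3, k4))
    return y

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

y0 = (1.0, 1.0, 1.0)             # hypothetical starting point
for t in (1, 5, 15, 25):
    a = rk4(y0, 1 / 128, 128 * t)    # h = h0
    b = rk4(y0, 1 / 256, 256 * t)    # h = h0/2
    print(f"t = {t:2d}: distance between the two approximations = {dist(a, b):.3e}")
```

Without an enclosure of the true solution, neither approximation can be preferred once they have separated; this is precisely the situation in which Kühn's verified enclosures decide which computed behavior is genuine.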
that y* and ỹ coincide with much more than graphical accuracy for all t ∈ [0,t_∞] in the case of the choices h = 1/256 < h_0 and h = 1/64 > h_0. This non-monotonic dependency of ‖y*(t) − ỹ(t)‖ on h is unpredictable.
For a starting vector y_0 close to the one in Figure 5.19, Kühn found cases where y* and ỹ are
(α) practically coincident for t ∈ [0,t_d] ⊂ [0,t_∞]; however,
(β) their Euclidean distance d = d(t) oscillates for t > t_d, reaching values comparable with the Euclidean distance of the stationary points C_1 and C_2 introduced in (5.9).
Property (β) can presumably be explained by a diversion of ỹ at t ≈ t_d, followed by (i) a certain winding pattern of y* about C_1 and (ii) a different corresponding pattern of ỹ about C_2. The time t_d depends unpredictably on the employed numerical method: in the four investigated cases, t_d increased from 13.5 to 20 as the step size h of a classical Runge-Kutta method was reduced or as the precision ε of a Runge-Kutta method with control of h was chosen closer to zero.
On the basis of a Taylor polynomial of (variable) order p, H. Spreuer [30] has developed an explicit one-step method with a near-optimal control of h and p by means of the following conditions: the moduli of each one of the three last terms of the polynomial are required to be smaller than an ε_0, which initially was chosen as ..., and the computational cost is to be as small as possible. The (uncontrolled) local rounding errors are characterized by the fixed double numerical precision, corresponding to 15 decimal mantissa digits, of the employed HP workstation. In applications concerning the Lorenz equations (5.8),
(I) Spreuer chose y_0 on a Poincaré map defined by y_2 − y_1 = 0, then
(II) he determined the first intersection, ỹ_int, of ỹ with y_2 − y_1 = 0, and then
(III) he returned from ỹ_int through replacing t by −t.
Spreuer observed cases where ‖ỹ − y_0‖ for the returning difference approximation reached a minimum of less than ..., and cases where this distance was large. The latter cases can presumably be explained by diversions of ỹ. Additionally, in conjunction with a suitably ..., Spreuer [30] employed values ... to