This page intentionally left blank
DISCRETE INVERSE AND STATE ESTIMATION PROBLEMS With Geophysical Fluid Applications...
133 downloads
1275 Views
6MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
This page intentionally left blank
DISCRETE INVERSE AND STATE ESTIMATION PROBLEMS With Geophysical Fluid Applications
The problems of making inferences about the natural world from noisy observations and imperfect theories occur in almost all scientific disciplines. This book addresses these problems using examples taken from geophysical fluid dynamics. It focuses on discrete formulations, both static and time-varying, known variously as inverse, state estimation or data assimilation problems. Starting with fundamental algebraic and statistical ideas, the book guides the reader through a range of inference tools including the singular value decomposition, Gauss–Markov and minimum variance estimates, Kalman filters and related smoothers, and adjoint (Lagrange multiplier) methods. The final chapters discuss a variety of practical applications to geophysical flow problems. Discrete Inverse and State Estimation Problems: With Geophysical Fluid Applications is an ideal introduction to the topic for graduate students and researchers in oceanography, meteorology, climate dynamics, geophysical fluid dynamics, and any field in which models are used to interpret observations. It is accessible to a wide scientific audience, as the only prerequisite is an understanding of linear algebra. Carl Wunsch is Cecil and Ida Green Professor of Physical Oceanography at the Department of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology. After gaining his Ph.D. in geophysics in 1966 at MIT, he has risen through the department, becoming its head for the period between 1977–81. He subsequently served as Secretary of the Navy Research Professor and has held senior visiting positions at many prestigious universities and institutes across the world. His previous books include Ocean Acoustic Tomography (Cambridge University Press, 1995) with W. Munk and P. Worcester, and The Ocean Circulation Inverse Problem (Cambridge University Press, 1996).
DISCRETE INVERSE AND STATE ESTIMATION PROBLEMS With Geophysical Fluid Applications CARL WUNSCH Department of Earth, Atmospheric and Planetary Sciences Massachusetts Institute of Technology
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo Cambridge University Press The Edinburgh Building, Cambridge , UK Published in the United States of America by Cambridge University Press, New York www.cambridge.org Information on this title: www.cambridge.org/9780521854245 © C. Wunsch 2006 This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published in print format 2006 - -
---- eBook (NetLibrary) --- eBook (NetLibrary)
- -
---- hardback --- hardback
Cambridge University Press has no responsibility for the persistence or accuracy of s for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
To Walter Munk for decades of friendship and exciting collaboration.
Contents
Preface page ix Acknowledgements xi Part I Fundamental machinery 1 1 Introduction 3 1.1 Differential equations 4 1.2 Partial differential equations 7 1.3 More examples 10 1.4 Importance of the forward model 17 2 Basic machinery 19 2.1 Background 19 2.2 Matrix and vector algebra 19 2.3 Simple statistics: regression 29 2.4 Least-squares 43 2.5 The singular vector expansion 69 2.6 Combined least-squares and adjoints 118 2.7 Minimum variance estimation and simultaneous equations 125 2.8 Improving recursively 136 2.9 Summary 143 Appendix 1. Maximum likelihood 145 Appendix 2. Differential operators and Green functions 146 Appendix 3. Recursive least-squares and Gauss–Markov solutions 148 3 Extensions of methods 152 3.1 The general eigenvector/eigenvalue problem 152 3.2 Sampling 155 3.3 Inequality constraints: non-negative least-squares 164 3.4 Linear programming 166 3.5 Empirical orthogonal functions 169 3.6 Kriging and other variants of Gauss–Markov estimation 170 vii
viii
Contents
3.7 Non-linear problems 4 The time-dependent inverse problem: state estimation 4.1 Background 4.2 Basic ideas and notation 4.3 Estimation 4.4 Control and estimation problems 4.5 Duality and simplification: the steady-state filter and adjoint 4.6 Controllability and observability 4.7 Non-linear models 4.8 Forward models 4.9 A summary Appendix. Automatic differentiation and adjoints 5 Time-dependent methods – 2 5.1 Monte Carlo/ensemble methods 5.2 Numerical engineering: the search for practicality 5.3 Uncertainty in Lagrange multiplier method 5.4 Non-normal systems 5.5 Adaptive problems Appendix. Doubling Part II Applications 6 Applications to steady problems 6.1 Steady-state tracer distributions 6.2 The steady ocean circulation inverse problem 6.3 Property fluxes 6.4 Application to real oceanographic problems 6.5 Linear programming solutions 6.6 The β-spiral and variant methods 6.7 Alleged failure of inverse methods 6.8 Applications of empirical orthogonal functions (EOFs) (singular vectors) 6.9 Non-linear problems 7 Applications to time-dependent fluid problems 7.1 Time-dependent tracers 7.2 Global ocean states by Lagrange multiplier methods 7.3 Global ocean states by sequential methods 7.4 Miscellaneous approximations and applications 7.5 Meteorological applications References Index Colour plates between pp. 182 and 183.
171 178 178 180 192 214 229 232 234 248 250 250 256 256 260 269 270 273 274 277 279 280 282 309 311 326 328 331 333 335 340 341 342 351 354 356 357 367
Preface
This book is to a large extent the second edition of The Ocean Circulation Inverse Problem, but it differs from the original version in a number of ways. While teaching the basic material at MIT and elsewhere over the past ten years, it became clear that it was of interest to many students outside of physical oceanography – the audience for whom the book had been written. The oceanographic material, instead of being a motivating factor, was in practice an obstacle to understanding for students with no oceanic background. In the revision, therefore, I have tried to make the examples more generic and understandable, I hope, to anyone with even rudimentary experience with simple fluid flows. Also many of the oceanographic applications of the methods, which were still novel and controversial at the time of writing, have become familiar and almost commonplace. The oceanography, now confined to the two last chapters, is thus focussed less on explaining why and how the calculations were done, and more on summarizing what has been accomplished. Furthermore, the time-dependent problem (here called “state estimation” to distinguish it from meteorological practice) has evolved rapidly in the oceanographic community from a hypothetical methodology to one that is clearly practical and in ever-growing use. The focus is, however, on the basic concepts and not on the practical numerical engineering required to use the ideas on the very large problems encountered with real fluids. Anyone attempting to model the global ocean or atmosphere or equivalent large scale system must confront issues of data storage, code parallelization, truncation errors, grid refinement, and the like. Almost none of these important problems are taken up here. Before constructive approaches to the practical problems can be found, one must understand the fundamental ideas. An analogy is the need to understand the implications of Maxwell’s equations for electromagnetic phenomena before one undertakes to build a high fidelity receiver. The effective engineering of an electronic instrument can only be helped by good understanding
ix
x
Preface
of how one works in principle, albeit the details of making one work in practice can be quite different. In the interests of keeping the book as short as possible, I have, however, omitted some of the more interesting theoretical material of the original version, but which readers can find in the wider literature on control theory. It is assumed that the reader has a familiarity at the introductory level with matrices and vectors, although everything is ultimately defined in Chapter 2. Finally, I have tried to correct the dismaying number of typographical and other errors in the previous book, but have surely introduced others. Reports of errors of any type will be gratefully received. I thank the students and colleagues who over the years have suggested corrections, modifications, and clarifications. My time and energies have been supported financially by the National Aeronautics and Space Administration, and the National Science Foundation through grants and contracts, as well as by the Massachusetts Institute of Technology through the Cecil and Ida Green Professorship.
Acknowledgements
The following figures are reproduced by permission of the American Geophysical Union: 4.8, 6.16–6.20, 6.23–6.26, 6.31, 7.3–7.5 and 7.7–7.10. Figures 2.16, 6.21, 6.27, 6.32, 7.1 and 7.11 are reproduced by permission of the American Meteorological Society. Figure 6.22 is reproduced by permission of Kluwer.
xi
Part I Fundamental machinery
1 Introduction
The most powerful insights into the behavior of the physical world are obtained when observations are well described by a theoretical framework that is then available for predicting new phenomena or new observations. An example is the observed behavior of radio signals and their extremely accurate description by the Maxwell equations of electromagnetic radiation. Other such examples include planetary motions through Newtonian mechanics, or the movement of the atmosphere and ocean as described by the equations of fluid mechanics, or the propagation of seismic waves as described by the elastic wave equations. To the degree that the theoretical framework supports, and is supported by, the observations one develops sufficient confidence to calculate similar phenomena in previously unexplored domains or to make predictions of future behavior (e.g., the position of the moon in 1000 years, or the climate state of the earth in 100 years). Developing a coherent view of the physical world requires some mastery, therefore, of both a framework, and of the meaning and interpretation of real data. Conventional scientific education, at least in the physical sciences, puts a heavy emphasis on learning how to solve appropriate differential and partial differential equations (Maxwell, Schr¨odinger, Navier–Stokes, etc.). One learns which problems are “well-posed,” how to construct solutions either exactly or approximately, and how to interpret the results. Much less emphasis is placed on the problems of understanding the implications of data, which are inevitably imperfect – containing noise of various types, often incomplete, and possibly inconsistent and thus considered mathematically “ill-posed” or “ill-conditioned.” When working with observations, ill-posedness is the norm, not the exception. Many interesting problems arise in using observations in conjunction with theory. In particular, one is driven to conclude that there are no well-posed problems outside of textbooks, that stochastic elements are inevitably present and must be confronted, and that more generally, one must make inferences about the world from data that are necessarily always incomplete. The main purpose of this introductory chapter 3
4
Introduction
is to provide some comparatively simple examples of the type of problems one confronts in practice, and for which many interesting and useful tools exist for their solution. In an older context, this subject was called the “calculus of observations.”1 Here we refer to “inverse methods,” although many different approaches are so labeled.
1.1 Differential equations Differential equations are often used to describe natural processes. Consider the elementary problem of finding the temperature in a bar where one end, at r = rA , is held at constant temperature TA , and at the other end, r = rB , it is held at temperature TB. The only mechanism for heat transfer within the bar is by molecular diffusion, so that the governing equation is κ
d2 T = 0, dr 2
(1.1)
subject to the boundary conditions T (rA ) = TA , T (rB ) = TB .
(1.2)
Equation (1.1) is so simple we can write its solution in a number of different ways. One form is T (r ) = a + br,
(1.3)
where a, b are unknown parameters, until some additional information is provided. Here the additional information is contained in the boundary conditions (1.2), and, with two parameters to be found, there is just sufficient information, and TB − TA rB TA − rA TB r, (1.4) + T (r ) = rB − rA rB − rA which is a straight line. Such problems, or analogues for much more complicated systems, are sometimes called “forward” or “direct” and they are “well-posed”: exactly enough information is available to produce a unique solution insensitive to perturbations in any element (easily proved here, not so easily in other cases). The solution is both stable and differentiable. This sort of problem and its solution is what is generally taught in elementary science courses. On the other hand, the problems one encounters in actually doing science differ significantly – both in the questions being asked, and in the information available.
1.1 Differential equations
5
For example: 1. One or both of the boundary values TA , TB is known from measurements; they are thus given as TA = TA(c) ±TA , TB = TB(c) ±TB , where the TA,B are an estimate of the possible inaccuracies in the theoretical values Ti(c) . (Exactly what that might mean is taken up later.) 2. One or both of the positions, rA,B is also the result of measurement and are of the form (c) rA,B ± rA,B . 3. TB is missing altogether, but is known to be positive, TB > 0. (c) 4. One of the boundary values, e.g., TB , is unknown, but an interior value Tint = Tint ± Tint is provided instead. Perhaps many interior values are known, but none of them perfectly.
Other possibilities exist. But even this short list raises a number of interesting, practical problems. One of the themes of this book is that almost nothing in reality is known perfectly. It is possible that TA , TB are very small; but as long as they are not actually zero, there is no longer any possibility of finding a unique solution. Many variations on this model and theme arise in practice. Suppose the problem is made slightly more interesting by introducing a “source” ST (r ), so that the temperature field is thought to satisfy the equation d2 T (r ) = ST (r ), dr 2
(1.5)
along with its boundary conditions, producing another conventional forward problem. One can convert (1.5) into a different problem by supposing that one knows T (r ), and seeks ST (r ). Such a problem is even easier to solve than the conventional one: differentiate T twice. Because convention dictates that the “forward” or “direct” problem involves the determination of T (r ) from a known ST (r ) and boundary data, this latter problem might be labeled as an “inverse” one – simply because it contrasts with the conventional formulation. In practice, a whole series of new problems can be raised: suppose ST (r ) is imperfectly known. How should one proceed? If one knows ST (r ) and T (r ) at a series of positions ri = rA , rB , could one nonetheless deduce the boundary conditions? Could one deduce ST (r ) if it were not known at these interior values? T (r ) has been supposed to satisfy the differential equation (1.1). For many purposes, it is helpful to reduce the problem to one that is intrinsically discrete. One way to do this would be to expand the solution in a system of polynomials, T (r ) = α 0r 0 + α 1r 1 + · · · + α m r m ,
(1.6)
6
Introduction
and ST (r ) = β 0r 0 + β 1r 1 + · · · + β n r n ,
(1.7)
where the β i would conventionally be known, and the problem has been reduced from the need to find a function T (r ) defined for all values of r, to one in which only the finite number of parameters α i , i = 0, 1, . . . , m, must be found. An alternative discretization is obtained by using the coordinate r. Divide the interval rA = 0 ≤ r ≤ rB into N − 1 intervals of length r, so that rB = (N − 1) r. Then, taking a simple one-sided difference: T (2r ) − 2T (r ) + T (0) = (r )2 ST (r ), T (3r ) − 2T (2r ) + T (1r ) = (r )2 ST (2r ), .. .
(1.8)
T ((N − 1) r ) − 2T ((N − 2) r ) + T ((N − 3)r ) = (r )2 ST ((N − 2) r ) . If one counts the number of equations in (1.8) it is readily found that there are N − 2, but with a total of N unknown T ( pr ). The two missing pieces of information are provided by the two boundary conditions T (0r ) = T0 , T ((N − 1) r ) = TN −1 . Thus the problem of solving the differential equation has been reduced to finding the solution of a set of ordinary linear simultaneous algebraic equations, which we will write, in the notation of Chapter 2, as Ax = b,
(1.9)
where A is a square matrix, x is the vector of unknowns T ( pr ), and b is the vector of values q( pt), and of boundary values. The list above, of variations, e.g., where a boundary condition is missing, or where interior values are provided instead of boundary conditions, then becomes statements about having too few, or possibly too many, equations for the number of unknowns. Uncertainties in the Ti or in the q( pr ) become statements about having to solve simultaneous equations with uncertainties in some elements. That models, even non-linear ones, can be reduced to sets of simultaneous equations, is the unifying theme of this book. One might need truly vast numbers of grid points, pr, or polynomial terms, and ingenuity in the formulation to obtain adequate accuracy, but as long as the number of parameters N < ∞, one has achieved a great, unifying simplification. Consider a little more interesting ordinary differential equation, that for the simple mass–spring oscillator: d2 ξ (t) dξ (t) +ε (1.10) + k0 ξ (t) = Sξ (t), 2 dt dt where m is mass, k0 is a spring constant, and ε is a dissipation parameter. Although m
1.2 Partial differential equations
7
the equation is slightly more complicated than (1.5), and we have relabeled the independent variable as t (to suggest time), rather than as r, there really is no fundamental difference. This differential equation can also be solved in any number of ways. As a second-order equation, it is well-known that one must provide two extra conditions to have enough information to have a unique solution. Typically, there are initial conditions, ξ (0), dξ (0)/dt – a position and velocity, but there is nothing to prevent us from assigning two end conditions, ξ (0), ξ (t = t f ), or even two velocity conditions dξ (0)/dt, dξ (t f )/dt, etc. If we naively discretize (1.10) as we did the straight-line equation, we have εt k0 (t)2 εt ξ ( pt + t) − 2 − − ξ ( pt) − − 1 ξ ( pt − t) m m m Sξ (( p − 1) t) = (t)2 , 2 ≤ p ≤ N − 1, (1.11) m which is another set of simultaneous equations as in (1.9) in the unknown ξ ( pt); an equation count again would show that there are two fewer equations than unknowns – corresponding to the two boundary or two initial conditions. In Chapter 2, several methods will be developed for solving sets of simultaneous linear equations, even when there are apparently too few or too many of them. In the present case, if one were given ξ (0), ξ (1t), Eq. (1.11) could be stepped forward in time, generating ξ (3t), ξ (4t), . . . , ξ ((N − 1)t). The result would be identical to the solution of the simultaneous equations – but with far less computation. But if one were given ξ ((N − 1)t) instead of ξ (1t), such a simple timestepping rule could no longer be used. A similar difficulty would arise if q( jt) were missing for some j, but instead one had knowledge of ξ ( pt), for some p. Looked at as a set of simultaneous equations, there is no conceptual problem: one simply solves it, all at once, by Gaussian elimination or equivalent. There is a problem only if one sought to time-step the equation forward, but without the required second condition at the starting point – there would be inadequate information to go forward in time. Many of the so-called inverse methods explored in this book are ways to solve simultaneous equations while avoiding the need for all-at-once brute-force solution. Nonetheless, one is urged to always recall that most of the interesting algorithms are just clever ways of solving large sets of such equations.
1.2 Partial differential equations Finding the solutions of linear differential equations is equivalent, when discretized, to solving sets of simultaneous linear algebraic equations. Unsurprisingly, the same is true of partial differential equations. As an example, consider a very familiar problem:
8
Introduction
Solve ∇ 2 φ = ρ,
(1.12)
for φ, given ρ, in the domain r ∈ D, subject to the boundary conditions φ = φ 0 on the boundary ∂ D, where r is a spatial coordinate of dimension greater than 1. This statement is the Dirichlet problem for the Laplace–Poisson equation, whose solution is well-behaved, unique, and stable to perturbations in the boundary data, φ 0 , and the source or forcing, ρ. Because it is the familiar boundary value problem, it is by convention again labeled a forward or direct problem. Now consider a different version of the above: Solve (1.12) for ρ given φ in the domain D. This latter problem is again easier to solve than the forward one: differentiate φ twice to obtain the Laplacian, and ρ is obtained from (1.12). Because the problem as stated is inverse to the conventional forward one, it is labeled, as with the ordinary differential equation, an inverse problem. It is inverse to a more familiar boundary value problem in the sense that the usual unknowns φ have been inverted or interchanged with (some of) the usual knowns ρ. Notice that both forward and inverse problems, as posed, are well-behaved and produce uniquely determined answers (ruling out mathematical pathologies in any of ρ, φ 0 , ∂ D, or φ). Again, there are many variations possible: one could, for example, demand computation of the boundary conditions, φ 0 , from given information about some or all of φ, ρ. Write the Laplace–Poisson equation in finite difference form for two Cartesian dimensions: φ i+1, j − 2φ i, j + φ i−1, j + φ i, j+1 − 2φ i, j + φ i, j−1 = (x)2 ρ i j ,
i, j ∈ D, (1.13) with square grid elements of dimension x. To make the bookkeeping as simple as possible, suppose the domain D is the square N × N grid displayed in Fig. 1.1, so that ∂ D is the four line segments shown. There are (N − 2) × (N − 2) interior grid points, and Eqs. (1.13) are then (N − 2) × (N − 2) equations in N 2 of the φ i j . If this is the forward problem with ρ i j specified, there are fewer equations than unknowns. But appending the set of boundary conditions to (1.13): φ i j = φ 0i j ,
i, j ∈ ∂ D,
(1.14)
there are precisely 4N − 4 of these conditions, and thus the combined set (1.13) plus (1.14), written as (1.9) with, x = vec{φ i j } = [ φ 11 φ 12 . φ N N ]T , b = vec{ρ i j , φ i0j } = [ ρ 22 ρ 23 . ρ N −1,N −1 φ 011 . φ 0N ,N ]T ,
9
1.2 Partial differential equations
Figure 1.1 Square, homogeneous grid used for discretizing the Laplacian, thus reducing the partial differential equation to a set of linear simultaneous equations.
which is a set of M = N 2 equations in M = N 2 unknowns. (The operator, vec, forms a column vector out of the two-dimensional array φ i j ; the superscript T is the vector transpose, defined in Chapter 2.) The nice properties of the Dirichlet problem can be deduced from the well-behaved character of the matrix A. Thus the forward problem corresponds directly with the solution of an ordinary set of simultaneous algebraic equations.2 One complementary inverse problem says: “Using (1.9) compute ρ i j and the boundary conditions, given φ i j ,” which is an even simpler computation – it involves just multiplying the known x by the known matrix A. But now let us make one small change in the forward problem, making it the Neumann one: Solve ∇ 2 φ = ρ,
(1.15)
for φ, given ρ, in the domain r ∈ D subject to the boundary conditions ∂φ/∂ m ˆ = φ 0 on the boundary ∂ D, where r is the spatial coordinate and m ˆ is the unit normal to the boundary. This new problem is another classical, much analyzed forward problem. It is, however, well-known that the solution is indeterminate up to an additive constant. This indeterminacy is clear in the discrete form: Eqs. (1.14) are now replaced by φ i+1, j − φ i, j = φ 0i j ,
i, j ∈ ∂ D
(1.16)
10
Introduction
etc., where ∂ D represents the set of boundary indices necessary to compute the local normal derivative. There is a new combined set: Ax = b1 , x = vec{φ i j }, b1 = vec{ρ i j , φ 0i j }.
(1.17)
Because only differences of the φ i j are specified, there is no information concerning the absolute value of x. When some machinery is obtained in Chapter 2, we will be able to demonstrate automatically that even though (1.17) appears to be M equations in M unknowns, in fact only M − 1 of the equations are independent, and thus the Neumann problem is an underdetermined one. This property of the Neumann problem is well-known, and there are many ways of handling it, either in the continuous or discrete forms. In the discrete form, a simple way is to add one equation setting the value at any point to zero (or anything else). A further complication with the Neumann problem is that it can be set up as a contradiction – even while underdetermined – if the flux boundary conditions do not balance the interior sources. The simultaneous presence of underdetermination and contradiction is commonplace in real problems.
1.3 More examples A tracer box model In scientific practice, one often has observations of elements of the solution of the differential system or other model. Such situations vary enormously in the complexity and sophistication of both the data and the model. A useful and interesting example of a simple system, with applications in many fields, is one in which there is a large reservoir (Fig. 1.2) connected to a number of source regions which provide fluid to the reservoir. One would like to determine the rate of mass transfer from each source region to the reservoir. Suppose that some chemical tracer or dye, C0 , is measured in the reservoir, and that the concentrations of the dye, Ci , in each source region are known. Let the unknown transfer rates be Ji0 (transfer from source i to reservoir 0). Then we must have C1 J10 + C2 J20 + · · · + C N JN 0 = C0 J0∞ ,
(1.18)
which says that, for a steady state, the rate of transfer in must equal the rate of transfer out (written J0∞ ). To conserve mass, J10 + J20 + · · · + JN 0 = J0∞ .
(1.19)
This model has produced two equations in N + 1 unknowns, [J10 , J20 , . . ., JN 0 , J0∞ ], which evidently is insufficient information if N > 1. The equations
11
1.3 More examples
J10, C1 J20, C2 J0∞ C0
JN0, CN Figure 1.2 A simple reservoir problem in which there are multiple sources of flow, at rates Ji0 , each carrying an identifiable property Ci , perhaps a chemical concentration. In the forward problem, given Ji0 , Ci one could calculate C0 . One form of inverse problem provides C0 and the Ci and seeks the values of Ji0 .
have also been written as though everything were perfect. If, for example, the tracer concentrations Ci were measured with finite precision and accuracy (they always are), the resulting inaccuracy might be accommodated as C1 J10 + C2 J20 + · · · + C N JN 0 + n = C0 J0∞ ,
(1.20)
where n represents the resulting error in the equation. Its introduction produces another unknown. If the reservoir were capable of some degree of storage or fluctuation in level, an error term could be introduced into (1.19) as well. One should also notice that, as formulated, one of the apparently infinite number of solutions to Eqs. (6.1, 1.19) includes Ji0 = J0∞ = 0 – no flow at all. More information is required if this null solution is to be excluded. To make the problem slightly more interesting, suppose that the tracer C is radioactive, and diminishes with a decay constant λ. Equation (6.1) becomes C1 J10 + C2 J20 + · · · + C N JN 0 − C0 J0∞ = −λC0 .
(1.21)
If C0 > 0, Ji j = 0 is no longer a possible solution, but there remain many more unknowns than equations. These equations are once again in the canonical linear form Ax = b.
12
Introduction
Figure 1.3 Generic tomographic problem in two dimensions. Measurements are made by integrating through an otherwise impenetrable solid between the transmitting sources and receivers using x-rays, sound, radio waves, etc. Properties can be anything measurable, including travel times, intensities, group velocities, etc., as long as they are functions of the parameters sought (such as the density or sound speed). The tomographic problem is to reconstruct the interior from these integrals. In the particular configuration shown, the source and receiver are supposed to revolve so that a very large number of paths can be built up. It is also supposed that the division into small rectangles is an adequate representation. In principle, one can have many more integrals than the number of squares defining the unknowns.
A tomographic problem So-called tomographic problems occur in many fields, most notably in medicine, but also in materials testing, oceanography, meteorology, and geophysics. Generically, they arise when one is faced with the problem of inferring the distribution of properties inside an area or volume based upon a series of integrals through the region. Consider Fig. 1.3, where, to be specific, suppose we are looking at the top of the head of a patient lying supine in a so-called CAT-scanner. The two external shell sectors represent a source of x-rays, and a set of x-ray detectors. X-rays are emitted from the source and travel through the patient along the indicated lines where the intensity of the received beam is measured. Let the absorptivity/unit length within the patient be a function, c(r), where r is the vector position within the patient’s head. Consider one source at rs and a receptor at re connected by the
13
1.3 More examples
Figure 1.4 Simplified geometry for defining a tomographic problem. Some squares may have no integrals passing through them; others may be multiply-covered. Boxes outside the physical body can be handled in a number of ways, including the addition of constraints setting the corresponding c j = 0.
path as indicated. Then the intensity measured at the receptor is re c (r (s)) ds, I (rs , re ) =
(1.22)
rs
where s is the arc length along the path. The basic tomographic problem is to determine c(r) for all r in the patient, from measurements of I. c can be a function of both position and the physical parameters of interest. In the medical problem, the shell sectors rotate around the patient, and an enormous number of integrals along (almost) all possible paths are obtained. An analytical solution to this problem, as the number of paths becomes infinite, is produced by the Radon transform.3 Given that tumors and the like have a different absorptivity to normal tissue, the reconstructed image of c(r) permits physicians to “see” inside the patient. In most other situations, however, the number of paths tends to be much smaller than the formal number of unknowns and other solution methods must be found. Note first, however, that Eq. (1.22) should be modified to reflect the inability of any system to produce a perfect measurement of the integral, and so, more realistically, re I (rs , re ) = c (r (s)) ds + n(rs , re ), (1.23) rs
where n is the measurement noise. To proceed, surround the patient with a bounding square (Fig. 1.4) – to produce a simple geometry – and divide the area into sub-squares as indicated, each numbered in sequence, j = 1, 2, . . . , N . These squares are supposed sufficiently small that c (r) is effectively constant within them. Also number the paths, i = 1, 2, . . . , M.
14
Introduction
Then Eq. (1.23) can be approximated with arbitrary accuracy (by letting the subsquare dimensions become arbitrarily small) as Ii =
N
c j ri j + n i .
(1.24)
j=1
Here ri j is the arc length of path i within square j (most of them will vanish for any particular path). These last equations are of the form Ex + n = y,
(1.25)
where E ={ri j }, x = [c j ], y = [Ii ], n = [n i ] . Quite commonly there are many more unknown c j than there are integrals Ii . (In the present context, there is no distinction made between using matrices A, E. E will generally be used where noise elements are present, and A where none are intended.) Tomographic measurements do not always consist of x-ray intensities. In seismology or oceanography, for example, c j is commonly 1/v j , where v j is the speed of sound or seismic waves within the area; I is then a travel time rather than an intensity. The equations remain the same, however. This methodology also works in three dimensions, the paths need not be straight lines, and there are many generalizations.4 A problem of great practical importance is determining what one can say about the solutions to Eqs. (1.25) even where many more unknowns exist than formal pieces of information yi . As with all these problems, many other forms of discretization are possible. For example, the continuous function c (r) can be expanded: c (r) = anm Tn (r x ) Tm (r y ), (1.26) n
m
where r =(r x , r y ), and the Tn are any suitable expansion functions (sines and cosines, Chebyschev polynomials, etc.). The linear equations (4.35) then represent constraints leading to the determination of the anm . A second tracer problem Consider the closed volume in Fig. 1.5 enclosed by four boundaries as shown. There are steady flows, vi (z) , i = 1, . . . , 4, either into or out of the volume, each carrying a corresponding fluid of constant density ρ 0 · z is the vertical coordinate. If the width of each boundary is li , the statement that mass is conserved within the volume is simply 0 r li ρ 0 vi (z) dz = 0, (1.27) i=1
−h
where the convention is made that flows into the box are positive, and flows out are negative. z = −h is the lower boundary of the volume and z = 0 is the top
15
1.3 More examples
v1
L1 L4
L2
L3
v3
v4
v2
h
Figure 1.5 Volume of fluid bounded on four open vertical and two horizontal sides across which fluid is supposed to flow. Mass is conserved, giving one relationship among the fluid transports vi ; conservation of one or more other tracers Ci leads to additional useful relationships.
one. If the vi are unknown, Eq. (1.27) represents one equation (constraint) in four unknowns:
0
−h
vi (z) dz,
1 ≤ i ≤ 4.
(1.28)
One possible, if boring, solution is vi (z) = 0. To make the problem somewhat more interesting, suppose that, for some mysterious reason, the vertical derivatives, vi (z) = dvi (z) /dz, are known so that vi (z) =
z z0
vi (z) dz + bi (z 0 ),
(1.29)
where z 0 is a convenient place to start the integration (but can be any value). bi are integration constants (bi = vi (z 0 )) that remain unknown. Constraint (1.27) becomes 4 i=1
li ρ 0
0
−h
z z0
vi (z )dz
+ bi (z 0 ) dz = 0,
(1.30)
16
Introduction
or 4
4 hli bi (z 0 ) = − li
i=1
i=1
0
−h
z
dz z0
vi (z )dz ,
(1.31)
where the right-hand side is known. Equation (1.31) is still one equation in four unknown bi , but the zero-solution is no longer possible, unless the right-hand side vanishes. Equation (1.31) is a statement that the weighted average of the bi on the left-hand side is known. If one seeks to obtain estimates of the bi separately, more information is required. Suppose that information pertains to a tracer, perhaps a red dye, known to be conservative, and that the box concentration of red dye, C, is known to be in a steady state. Then conservation of C becomes 0 0 z 4 4 hli Ci (z) dz bi = − li dz Ci (z )vi (z )dz , (1.32) i=1
−h
i=1
−h
−z 0
where Ci (z) is the concentration of red dye on each boundary. Equation (1.32) provides a second relationship for the four unknown bi . One might try to measure another dye concentration, perhaps green dye, and write an equation for this second tracer, exactly analogous to (1.32). With enough such dye measurements, there might be more constraint equations than unknown bi . In any case, no matter how many dyes are measured, the resulting equation set is of the form (1.9). The number of boundaries is not limited to four, but can be either fewer, or many more.5 Vibrating string Consider a uniform vibrating string anchored at its ends r x = 0, r x = L . The free motion of the string is governed by the wave equation ∂ 2η 1 ∂ 2η − = 0, ∂r x2 c2 ∂t 2
c2 = T /ρ,
(1.33)
where T is the tension and ρ the density. Free modes of vibration (eigen-frequencies) are found to exist at discrete frequencies, sq , qπc , q = 1, 2, 3, . . . , (1.34) L which is the solution to a classical forward problem. A number of interesting and useful inverse problems can be formulated. For example, given sq ± sq , q = 1, 2, . . . , M, to determine L or c. These are particularly simple problems, because there is only one parameter, either c or L, to determine. More generally, it is obvious from Eq. (1.34) that one has information only about the ratio c/L – they could not be determined separately. 2πsq =
1.4 Importance of the forward model
17
Suppose, however, that the density varies along the string, ρ = ρ(r x ), so that c = c (r x ). Then, it may be confirmed that the observed frequencies are no longer given by Eq. (1.34), but by expressions involving the integral of c over the length of the string. An important problem is then to infer c(r x ), and hence ρ(r x ). One might wonder whether, under these new circumstances, L can be determined independently of c? A host of such problems exist, in which the observed frequencies of free modes are used to infer properties of media in one to three dimensions. The most elaborate applications are in geophysics and solar physics, where the normal mode frequencies of the vibrating whole Earth or Sun are used to infer the interior properties (density, elastic parameters, magnetic field strength, etc.).6 A good exercise is to render the spatially variable string problem into discrete form.
1.4 Importance of the forward model Inference about the physical world from data requires assertions about the structure of the data and its internal relationships. One sometimes hears claims from people who are expert in measurements that “I don’t use models.” Such a claim is almost always vacuous. What the speaker usually means is that he doesn’t use equations, but is manipulating his data in some simple way (e.g., forming an average) that seems to be so unsophisticated that no model is present. Consider, however, a simple problem faced by someone trying to determine the average temperature in a room. A thermometer is successively placed at different three-dimensional locations, ri , at times ti . Let the measurements be yi and the value of interest be m˜ =
M 1 yi . M i=1
(1.35)
In deciding to compute, and use, m˜ the observer has probably made a long list of very sophisticated, but implicit, model assumptions. Among them we might suggest: (1) Thermometers actually measure the length of a fluid, or an oscillator frequency, or a voltage and require knowledge of the relation to temperature as well as potentially elaborate calibration methods. (2) That the temperature in the room is sufficiently slowly changing that all of the ti can be regarded as effectively identical. A different observer might suggest that the temperature in the room is governed by shock waves bouncing between the walls at intervals of seconds or less. Should that be true, m˜ constructed from the available samples might prove completely meaningless. It might be objected that such an hypothesis is far-fetched. But the assumption that the room temperature is governed, e.g., by a slowly evolving diffusion process, is a specific, and perhaps incorrect model. (3) That the errors in the thermometer are
18
Introduction
such that the best estimate of the room mean temperature is obtained by the simple sum in Eq. (1.35). There are many measurement devices for which this assumption is a very poor one (perhaps the instrument is drifting, or has a calibration that varies with temperature), and we will discuss how to determine averages in Chapter 2. But the assumption that property m˜ is useful, is a strong model assumption concerning both the instrument being used and the physical process it is measuring. This list can be extended, but more generally, the inverse problems listed earlier in this chapter only make sense to the degree that the underlying forward model is likely to be an adequate physical description of the observations. For example, if one is attempting to determine ρ in Eq. (1.15) by taking the Laplacian ∇ 2 φ, (analytically or numerically), the solution to the inverse problem is only sensible if this equation really represents the correct governing physics. If the correct equation to use were, instead, ∂ 2 φ 1 ∂φ + = ρ, ∂r x2 2 ∂r y
(1.36)
where r y is another coordinate, the calculated value of ρ would be incorrect. One might, however, have good reason to use Eq. (1.15) as the most likely hypothesis, but nonetheless remain open to the possibility that it is not an adequate descriptor of the required field, ρ. A good methodology, of the type to be developed in subsequent chapters, permits posing the question: is my model consistent with the data? If the answer to the question is “yes,” a careful investigator would never claim that the resulting answer is the correct one and that the model has been “validated” or “verified.” One claims only that the answer and the model are consistent with the observations, and remains open to the possibility that some new piece of information will be obtained that completely invalidates the model (e.g., some direct measurements of ρ showing that the inferred value is simply wrong). One can never validate or verify a model, one can only show consistency with existing observations.7 Notes 1 2 3 4 5
Whittaker and Robinson (1944). Lanczos (1961) has a much fuller discussion of this correspondence. Herman (1980). Herman (1980); Munk et al. (1995). Oceanographers will recognize this apparently highly artificial problem as being a slightly simplified version of the so-called geostrophic inverse problem, and which is of great practical importance. It is a central subject in Chapter 6. 6 Aki and Richards (1980). A famous two-dimensional version of the problem is described by Kaˇc (1966); see also Gordon and Webb (1996). 7 Oreskes et al. (1994).
2 Basic machinery
2.1 Background The purpose of this chapter is to record a number of results that are useful in finding and understanding the solutions to sets of usually noisy simultaneous linear equations and in which formally there may be too much or too little information. A lot of the material is elementary; good textbooks exist, to which the reader will be referred. Some of what follows is discussed primarily so as to produce a consistent notation for later use. But some topics are given what may be an unfamiliar interpretation, and I urge everyone to at least skim the chapter. Our basic tools are those of matrix and vector algebra as they relate to the solution of linear simultaneous equations, and some elementary statistical ideas – mainly concerning covariance, correlation, and dispersion. Least-squares is reviewed, with an emphasis placed upon the arbitrariness of the distinction between knowns, unknowns, and noise. The singular-value decomposition is a central building block, producing the clearest understanding of least-squares and related formulations. Minimum variance estimation is introduced through the Gauss–Markov theorem as an alternative method for obtaining solutions to simultaneous equations, and its relation to and distinction from least-squares is discussed. The chapter ends with a brief discussion of recursive least-squares and estimation; this part is essential background for the study of time-dependent problems in Chapter 4.
2.2 Matrix and vector algebra This subject is very large and well-developed and it is not my intention to repeat material better found elsewhere.1 Only a brief survey of essential results is provided. A matrix is an M × N array of elements of the form A = {Ai j },
i = 1, 2, . . . , M, 19
j = 1, 2, . . . , N .
Basic machinery
20
Normally a matrix is denoted by a bold-faced capital letter. A vector is a special case of an M × 1 matrix, written as a bold-face lower case letter, for example, q. Corresponding capital or lower case letters for Greek symbols are also indicated in bold-face. Unless otherwise stipulated, vectors are understood to be columnar. The transpose of a matrix A is written AT and is defined as {AT }i j = A ji , an interchange of the rows and columns of A. Thus (AT )T = A. Transposition applied to vectors is sometimes used to save space in printing, for example, q = [q1 , q2 , . . . , q N ]T is the same as ⎡ ⎤ q1 ⎢ q2 ⎥ ⎢ ⎥ q = ⎢ . ⎥. ⎣ .. ⎦ qN Matrices and vectors √ N 2 A conventional measure of length of a vector is aT a = i ai = a. The inner, L T ai bi or dot, product between two L × 1 vectors a, b is written a b ≡ a · b = i=1 and is a scalar. Such an inner product is the “projection” of a onto b (or vice versa). It is readily shown that |aT b| = a b | cos φ| ≤ a b, where the magnitude of cos φ ranges between zero, when the vectors are orthogonal, and one, when they are parallel. Suppose we have a collection of N vectors, ei , each of dimension N . If it is possible to represent perfectly an arbitrary N -dimensional vector f as the linear sum f=
N
α i ei ,
(2.1)
i=1
then ei are said to be a “basis.” A necessary and sufficient condition for them to have that property is that they should be “independent,” that is, no one of them should be perfectly representable by the others: ej −
N
β i ei = 0,
j = 1, 2, . . . , N .
(2.2)
i=1, i= j
A subset of the e j are said to span a subspace (all vectors perfectly representable by the subset). For example, [1, −1, 0]T , [1, 1, 0]T span the subspace of all vectors [v1 , v2 , 0]T . A “spanning set” completely describes the subspace too, but might have additional, redundant vectors. Thus the vectors [1, −1, 0]T , [1, 1, 0]T , [1, 1/2, 0] span the subspace but are not a basis for it.
2.2 Matrix and vector algebra
21
f e1 e2 h f Figure 2.1 Schematic of expansion of an arbitrary vector f in two vectors e1 , e2 which may nearly coincide in direction.
The expansion coefficients α i in (2.1) are obtained by taking the dot product of (2.1) with each of the vectors in turn: N
α i eTk ei = eTk f,
k = 1, 2, . . . , N ,
(2.3)
i=1
which is a system of N equations in N unknowns. The α i are most readily found if the ei are a mutually orthonormal set, that is, if eiT e j = δ i j , but this requirement is not a necessary one. With a basis, the information contained in the set of projections, eiT f = fT ei , is adequate then to determine the α i and thus all the information required to reconstruct f is contained in the dot products. The concept of “nearly dependent” vectors is helpful and can be understood heuristically. Consider Fig. 2.1, in which the space is two-dimensional. Then the two vectors e1 , e2 , as depicted there, are independent and can be used to expand an arbitrary two-dimensional vector f in the plane. The simultaneous equations become α 1 eT1 e1 + α 2 eT1 e2 = eT1 f, α 1 eT2 e1
+
α 2 eT2 e2
=
(2.4)
eT2 f.
The vectors become nearly parallel as the angle φ in Fig. 2.1 goes to zero; as long as they are not identically parallel, they can still be used mathematically to represent f perfectly. An important feature is that even if the lengths of e1, e2 , f are all order-one, the expansion coefficients α 1 , α 2 can have unbounded magnitudes when the angle φ becomes small and f is nearly orthogonal to both (measured by angle η).
Basic machinery
22
That is to say, we find readily from (2.4) that T T T T e f e2 e2 − e2 f e1 e2 α 1 = 1 2 , eT1 e1 eT2 e2 − eT1 e2 T T T T e f e1 e1 − e1 f e2 e1 α 2 = 2 2 . eT1 e1 eT2 e2 − eT1 e2
(2.5)
(2.6)
Suppose for simplicity that f has unit length, and that the ei have also been normalized to unit length as shown in Fig. 2.1. Then, cos (η − φ) − cos φ cos η sin η = , 2 1 − cos φ sin φ α 2 = cos η − sin η cot φ α1 =
(2.7) (2.8)
and whose magnitudes can become arbitrarily large as φ → 0. One can imagine a situation in which α 1 e1 and α 2 e2 were separately measured and found to be very large. One could then erroneously infer that the sum vector, f, was equally large. This property of the expansion in non-orthogonal vectors potentially producing large coefficients becomes important later (Chapter 5) as a way of gaining insight into the behavior of so-called non-normal operators. The generalization to higher dimensions is left to the reader’s intuition. One anticipates that as φ becomes very small, numerical problems can arise in using these “almost parallel” vectors. Gram–Schmidt process One often has a set of p independent, but non-orthonormal vectors, hi , and it is convenient to find a new set gi , which are orthonormal. The “Gram–Schmidt process” operates by induction. Suppose the first k of the hi have been orthonormalized to a new set, gi . To generate vector k + 1, let gk+1 = hk+1 −
k
γ jgj.
(2.9)
j
Because gk+1 must be orthogonal to the preceding gi , i = 1, . . . , k, take the dot products of (2.9) with each of these vectors, producing a set of simultaneous equations for determining the unknown γ j . The resulting gk+1 is easily given unit norm by dividing by its length. Given the first k of N necessary vectors, an additional N − k independent vectors, hi , are needed. There are several possibilities. The necessary extra vectors might be generated by filling their elements with random numbers. Or a very simple trial set like hk+1 = [1, 0, 0, . . . , 0]T , hk+2 = [0, 1, 0, . . . 0], . . . might be adequate. If one is unlucky, the set chosen might prove not to be independent of the existing gi .
2.2 Matrix and vector algebra
23
But a simple numerical perturbation usually suffices to render them so. In practice, the algorithm is changed to what is usually called the “modified Gram–Schmidt process” for purposes of numerical stability.2
2.2.1 Matrix multiplication and identities It has been found convenient and fruitful to usually define multiplication of two matrices A, B, written as C = AB, such that Ci j =
P
Ai p B pj .
(2.10)
p=1
For the definition (2.10) to make sense, A must be an M × P matrix and B must be P × N (including the special case of P × 1, a column vector). That is, the two matrices must be “conformable.” If two matrices are multiplied, or a matrix and a vector are multiplied, conformability is implied – otherwise one can be assured that an error has been made. Note that AB = BA even where both products exist, except under special circumstances. Define A2 = AA, etc. Other definitions of matrix multiplication exist, and are useful, but are not needed here. The mathematical operation in (2.10) may appear arbitrary, but a physical interpretation is available: Matrix multiplication is the dot product of all of the rows of A with all of the columns of B. Thus multiplication of a vector by a matrix represents the projections of the rows of the matrix onto the vector. Define a matrix, E, each of whose columns is the corresponding vector ei , and a vector, α = {α i }, in the same order. Then the expansion (2.1) can be written compactly as f = Eα.
(2.11)
A “symmetric matrix” is one for which AT = A. The product AT A represents the array of all the dot products of the columns of A with themselves, and similarly, AAT represents the set of all dot products of all the rows of A with themselves. It follows that (AB)T = BT AT . Because we have (AAT )T = AAT , (AT A)T = AT A, both of these matrices are symmetric. The “trace” of a square M × M matrix A is defined as trace(A) = iM Aii . A “diagonal matrix” is square and zero except for the terms along the main diagonal, although we will later generalize this definition. The operator diag(q) forms a square diagonal matrix with q along the main diagonal. The special L × L diagonal matrix I L , with Iii = 1, is the “identity.” Usually, when the dimension of I L is clear from the context, the subscript is omitted. IA = A, AI = I, for any A for which the products make sense. If there is a matrix B, such
24
Basic machinery
that BE = I, then B is the “left inverse” of E. If B is the left inverse of E and E is square, a standard result is that it must also be a right inverse: EB = I, B is then called “the inverse of E” and is usually written E−1 . Square matrices with inverses are “non-singular.” Analytical expressions exist for a few inverses; more generally, linear algebra books explain how to find them numerically when they exist. If E is not square, one must distinguish left and right inverses, sometimes written, E+ , and referred to as “generalized inverses.” Some of them will be encountered later. A useful result is that (AB)−1 = B−1 A−1 , if the inverses exist. A notational shorthand is (A−1 )T = (AT )−1 ≡ A−T . The “length,” or norm, of a vector has already been introduced. But several choices are possible; for present purposes, the conventional l2 norm already defined, 1/2
N
f2 ≡ (fT f)1/2 = f i2 , (2.12) i=1
is most useful; often the subscript is omitted. Equation (2.12) leads in turn to the measure of distance between two vectors, a, b, as (2.13) a − b2 = (a − b)T (a − b), which is the familiar Cartesian distance. Distances can also be measured in such a way that deviations of certain elements of c = a − b count for more than others – that is, a metric, or set of weights can be introduced with a definition,
cW = cW c, (2.14) i i ii i depending upon the importance to be attached to magnitudes of different elements, stretching and shrinking various coordinates. Finally, in the most general form, distance can be measured in a coordinate system both stretched and rotated relative to the original one √ cW = cT Wc, (2.15) where W is an arbitrary matrix (but usually, for physical reasons, symmetric and positive definite,3 implying that cT Wc ≥ 0). 2.2.2 Linear simultaneous equations Consider a set of M-linear equations in N -unknowns, Ex = y.
(2.16)
Because of the appearance of simultaneous equations in situations in which the yi are observed, and where x are parameters whose values are sought, it is often convenient
2.2 Matrix and vector algebra
25
to refer to (2.16) as a set of measurements of x that produced the observations or data, y. If M > N , the system is said to be “formally overdetermined.” If M < N , it is “underdetermined,” and if M = N , it is “formally just-determined.” The use of the word “formally” has a purpose we will come to. Knowledge of the matrix inverse to E would make it easy to solve a set of L equations in L unknowns, by left-multiplying (2.16) by E−1 : E−1 Ex = Ix = x = E−1 y. The reader is cautioned that although matrix inverses are a very powerful theoretical tool, one is usually ill-advised to solve large sets of simultaneous equations by employing E−1 ; better numerical methods are available for the purpose.4 There are several ways to view the meaning of any set of linear simultaneous equations. If the columns of E continue to be denoted ei , then (2.16) is x1 e1 + x2 e2 + · · · + x N e N = y.
(2.17)
The ability to so describe an arbitrary y, or to solve the equations, would thus depend upon whether the M × 1 vector y can be specified by a sum of N -column vectors, ei . That is, it would depend upon their being a spanning set. In this view, the elements of x are simply the corresponding expansion coefficients. Depending upon the ratio of M to N , that is, the number of equations compared to the number of unknown elements, one faces the possibility that there are fewer expansion vectors ei than elements of y (M > N ), or that there are more expansion vectors available than elements of y (M < N ). Thus the overdetermined case corresponds to having fewer expansion vectors, and the underdetermined case corresponds to having more expansion vectors, than the dimension of y. It is possible that in the overdetermined case, the too-few expansion vectors are not actually independent, so that there are even fewer vectors available than is first apparent. Similarly, in the underdetermined case, there is the possibility that although it appears we have more expansion vectors than required, fewer may be independent than the number of elements of y, and the consequences of that case need to be understood as well. An alternative interpretation of simultaneous linear equations denotes the rows of E as riT , i = 1, 2, . . . , M. Then Eq. (2.16) is a set of M-inner products, riT x = yi ,
i = 1, 2, . . . , M.
(2.18)
That is, the set of simultaneous equations is also equivalent to being provided with the value of M-dot products of the N -dimensional unknown vector, x, with M known vectors, ri . Whether that is sufficient information to determine x depends upon whether the ri are a spanning set. In this view, in the overdetermined case, one has more dot products available than unknown elements xi , and, in the underdetermined case, there are fewer such values than unknowns.
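The two interpretations just described are easy to check numerically. The following is a minimal sketch (not from the text), using NumPy with an illustrative 3 × 2 matrix: it forms y both as a weighted sum of the columns of E and as a set of row dot products, and confirms the two views agree.

```python
# Two views of Ex = y: column expansion and row dot products (illustrative numbers).
import numpy as np

E = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # M = 3 equations, N = 2 unknowns
x = np.array([2.0, -1.0])       # a hypothetical "true" solution
y = E @ x                       # the data these x would produce

# Column view: y is a weighted sum of the columns e_i of E, with weights x_i.
y_columns = x[0] * E[:, 0] + x[1] * E[:, 1]

# Row view: each y_i is the dot product of the i-th row of E with x.
y_rows = np.array([E[i, :] @ x for i in range(E.shape[0])])

print(np.allclose(y, y_columns), np.allclose(y, y_rows))  # True True
```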
A special set of simultaneous equations for square matrices, A, is labelled the “eigenvalue/eigenvector problem,” Ae =λe.
(2.19)
In this set of linear simultaneous equations one seeks a special vector, e, such that for some as yet unknown scalar eigenvalue, λ, there is a solution. An N × N matrix will have up to N solutions (λi , ei ), but the nature of these elements and their relations require considerable effort to deduce. We will look at this problem more later; for the moment, it again suffices to say that numerical methods for solving Eq. (2.19) are well-known.
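As a concrete illustration of such numerical methods, here is a short sketch (not from the text) using NumPy's standard eigenvalue routine on an illustrative 2 × 2 matrix, verifying that each returned pair satisfies Ae = λe.

```python
# Numerical solution of the eigenvalue/eigenvector problem Ae = lambda*e.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # illustrative square matrix
lam, V = np.linalg.eig(A)           # eigenvalues lam[i], eigenvectors V[:, i]

# Verify A e_i = lambda_i e_i for each eigenpair.
for i in range(len(lam)):
    print(np.allclose(A @ V[:, i], lam[i] * V[:, i]))   # True
```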
2.2.3 Matrix norms

A number of useful definitions of a matrix size, or norm, exist. The so-called "spectral norm" or "2-norm,"
\[
\|\mathbf{A}\|_2 = \big[\text{maximum eigenvalue of } \mathbf{A}^T\mathbf{A}\big]^{1/2}, \qquad (2.20)
\]
is usually adequate. Without difficulty, it may be seen that this definition is equivalent to
\[
\|\mathbf{A}\|_2 = \max_{\mathbf{x}} \frac{\|\mathbf{A}\mathbf{x}\|_2}{\|\mathbf{x}\|_2}
= \max_{\mathbf{x}} \left(\frac{\mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x}}{\mathbf{x}^T\mathbf{x}}\right)^{1/2}, \qquad (2.21)
\]
where the maximum is defined over all vectors x.5 Another useful measure is the “Frobenius norm,”
\[
\|\mathbf{A}\|_F = \Big(\sum_{i=1}^{M}\sum_{j=1}^{N} A_{ij}^2\Big)^{1/2}
= \big[\mathrm{trace}(\mathbf{A}^T\mathbf{A})\big]^{1/2}. \qquad (2.22)
\]
Neither norm requires A to be square. These norms permit one to derive various useful results. Consider the following illustration. Suppose Q is square, and ‖Q‖ < 1; then
\[
(\mathbf{I} + \mathbf{Q})^{-1} = \mathbf{I} - \mathbf{Q} + \mathbf{Q}^2 - \cdots, \qquad (2.23)
\]
which may be verified by multiplying both sides by I + Q, doing term-by-term multiplication and measuring the remainders with either norm.

Nothing has been said about actually finding the numerical values of either the matrix inverse or the eigenvectors and eigenvalues. Computational algorithms for obtaining them have been developed by experts, and are discussed in many good textbooks.6 Software systems like MATLAB, Maple, IDL, and Mathematica implement them in easy-to-use form. For purposes of this book, we assume the reader has at least a rudimentary knowledge of these techniques and access to a good software implementation.
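As a numerical check, the following hedged sketch (not from the text) evaluates the norms of Eqs. (2.20)–(2.22) and the series (2.23) for illustrative matrices, using NumPy.

```python
# Spectral and Frobenius norms, and the expansion (I + Q)^(-1) = I - Q + Q^2 - ...
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 3.0],
              [4.0, 1.0]])                                   # non-square is allowed
spectral = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))      # Eq. (2.20)
frobenius = np.sqrt(np.trace(A.T @ A))                       # Eq. (2.22)
print(np.isclose(spectral, np.linalg.norm(A, 2)),
      np.isclose(frobenius, np.linalg.norm(A, 'fro')))       # True True

# Eq. (2.23): valid because ||Q|| < 1 for this illustrative Q.
Q = 0.1 * np.array([[1.0, -2.0], [0.5, 1.0]])
series = sum((-1) ** k * np.linalg.matrix_power(Q, k) for k in range(30))
print(np.allclose(series, np.linalg.inv(np.eye(2) + Q)))     # True
```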
2.2.4 Identities: differentiation

There are some identities and matrix/vector definitions that prove useful. A square "positive definite" matrix A is one for which the scalar "quadratic form," J = xT Ax, is positive for all possible vectors x. (It suffices to consider only symmetric A because for a general matrix, xT Ax = xT [(A + AT)/2]x, which follows from the scalar property of the quadratic form.) If J ≥ 0 for all x, A is "positive semi-definite," or "non-negative definite." Linear algebra books show that a necessary and sufficient requirement for positive definiteness is that A has only positive eigenvalues (Eq. 2.19), and a semi-definite one must have all non-negative eigenvalues.

We end up doing a certain amount of differentiation and other operations with respect to matrices and vectors. A number of formulas are very helpful, and save a lot of writing. They are all demonstrated by doing the derivatives term-by-term. Let q, r be N × 1 column vectors, and A, B, C be matrices. The derivative of a matrix by a scalar is just the matrix of element by element derivatives. Alternatively, if s is any scalar, its derivative by a vector,
\[
\frac{\partial s}{\partial \mathbf{q}} = \left[\frac{\partial s}{\partial q_1}\;\cdots\;\frac{\partial s}{\partial q_N}\right]^T, \qquad (2.24)
\]
is a column vector (the gradient; some authors define it to be a row vector). The derivative of one vector by another is defined as a matrix:
\[
\frac{\partial \mathbf{r}}{\partial \mathbf{q}} = \left\{\frac{\partial r_i}{\partial q_j}\right\} =
\begin{Bmatrix}
\frac{\partial r_1}{\partial q_1} & \frac{\partial r_2}{\partial q_1} & \cdots & \frac{\partial r_M}{\partial q_1}\\[4pt]
\frac{\partial r_1}{\partial q_2} & \cdot & \cdots & \frac{\partial r_M}{\partial q_2}\\[2pt]
\vdots & & & \vdots\\[2pt]
\frac{\partial r_1}{\partial q_N} & \cdot & \cdots & \frac{\partial r_M}{\partial q_N}
\end{Bmatrix} \equiv \mathbf{B}. \qquad (2.25)
\]
If r, q are of the same dimension, the determinant of B = det(B) is the "Jacobian" of r.7 The second derivative of a scalar,
\[
\frac{\partial^2 s}{\partial \mathbf{q}^2} = \frac{\partial}{\partial \mathbf{q}}\left(\frac{\partial s}{\partial \mathbf{q}}\right)^T
= \left\{\frac{\partial^2 s}{\partial q_i\,\partial q_j}\right\} =
\begin{Bmatrix}
\frac{\partial^2 s}{\partial q_1^2} & \frac{\partial^2 s}{\partial q_1 \partial q_2} & \cdots & \frac{\partial^2 s}{\partial q_1 \partial q_N}\\[2pt]
\vdots & & & \vdots\\[2pt]
\frac{\partial^2 s}{\partial q_N \partial q_1} & \cdots & \cdots & \frac{\partial^2 s}{\partial q_N^2}
\end{Bmatrix}, \qquad (2.26)
\]
is the “Hessian” of s and is the derivative of the gradient of s.
Assuming conformability, the inner product, J = rT q = qT r, is a scalar. The differential of J is dJ = drT q + rT dq = dqT r + qT dr,
(2.27)
and hence the partial derivatives are
\[
\frac{\partial(\mathbf{q}^T\mathbf{r})}{\partial \mathbf{q}} = \frac{\partial(\mathbf{r}^T\mathbf{q})}{\partial \mathbf{q}} = \mathbf{r}, \qquad (2.28)
\]
\[
\frac{\partial(\mathbf{q}^T\mathbf{q})}{\partial \mathbf{q}} = 2\mathbf{q}. \qquad (2.29)
\]
It follows immediately that, for matrix/vector products,
\[
\frac{\partial}{\partial \mathbf{q}}(\mathbf{B}\mathbf{q}) = \mathbf{B}^T, \qquad
\frac{\partial}{\partial \mathbf{q}}(\mathbf{q}^T\mathbf{B}) = \mathbf{B}. \qquad (2.30)
\]
The first of these is used repeatedly, and attention is called to the apparently trivial fact that differentiation of Bq with respect to q produces the transpose of B – the origin, as seen later, of so-called adjoint models. For a quadratic form, J = qT Aq,
\[
\frac{\partial J}{\partial \mathbf{q}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{q}, \qquad (2.31)
\]
and the Hessian of the quadratic form is 2A if A = AT. Differentiation of a scalar function (e.g., J in Eq. 2.31) or a vector by a matrix, A, is readily defined.8 Differentiation of a matrix by another matrix results in a third, very large, matrix. One special case of the differential of a matrix function proves useful later on. It can be shown9 that
\[
\mathrm{d}\mathbf{A}^n = (\mathrm{d}\mathbf{A})\,\mathbf{A}^{n-1} + \mathbf{A}\,(\mathrm{d}\mathbf{A})\,\mathbf{A}^{n-2} + \cdots + \mathbf{A}^{n-1}(\mathrm{d}\mathbf{A}), \qquad (2.32)
\]
where A is square. Thus the derivative with respect to some scalar, k, is
\[
\frac{\mathrm{d}\mathbf{A}^n}{\mathrm{d}k}
= \frac{\mathrm{d}\mathbf{A}}{\mathrm{d}k}\,\mathbf{A}^{n-1}
+ \mathbf{A}\,\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}k}\,\mathbf{A}^{n-2}
+ \cdots + \mathbf{A}^{n-1}\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}k}. \qquad (2.33)
\]
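Because these differentiation rules recur constantly in what follows, a quick numerical confirmation can be reassuring. The sketch below (not from the text, with illustrative random matrices) checks Eqs. (2.28), (2.30), and (2.31) by finite differences.

```python
# Finite-difference check of the gradient identities (2.28), (2.30), (2.31).
import numpy as np

rng = np.random.default_rng(0)
N = 4
q = rng.standard_normal(N)
r = rng.standard_normal(N)
A = rng.standard_normal((N, N))
B = rng.standard_normal((3, N))
eps = 1e-6

def grad(f, q):
    # one-sided finite-difference gradient of a scalar function f at q
    g = np.zeros_like(q)
    for i in range(len(q)):
        dq = np.zeros_like(q); dq[i] = eps
        g[i] = (f(q + dq) - f(q)) / eps
    return g

print(np.allclose(grad(lambda q: q @ r, q), r, atol=1e-4))                  # (2.28)
print(np.allclose(grad(lambda q: q @ A @ q, q), (A + A.T) @ q, atol=1e-3))  # (2.31)

# (2.30): the array of derivatives of (Bq)_k with respect to q_i is B^T in the
# layout convention of Eq. (2.25).
J = np.array([[(B @ (q + eps * np.eye(N)[i]) - B @ q)[k] / eps
               for k in range(B.shape[0])] for i in range(N)])
print(np.allclose(J, B.T, atol=1e-4))
```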
There are a few, unfortunately unintuitive, matrix inversion identities that are essential later. They are derived by considering the square, partitioned matrix,
\[
\begin{Bmatrix}
\mathbf{A} & \mathbf{B}\\
\mathbf{B}^T & \mathbf{C}
\end{Bmatrix}, \qquad (2.34)
\]
where AT = A, CT = C, but B can be rectangular of conformable dimensions in (2.34).10 The most important of the identities, sometimes called the "matrix inversion lemma," is, in one form,
\[
\{\mathbf{C} - \mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}\}^{-1}
= \{\mathbf{I} - \mathbf{C}^{-1}\mathbf{B}^T\mathbf{A}^{-1}\mathbf{B}\}^{-1}\mathbf{C}^{-1}
= \mathbf{C}^{-1} - \mathbf{C}^{-1}\mathbf{B}^T(\mathbf{B}\mathbf{C}^{-1}\mathbf{B}^T - \mathbf{A})^{-1}\mathbf{B}\mathbf{C}^{-1}, \qquad (2.35)
\]
where it is assumed that the inverses exist.11 A variant is ABT (C + BABT )−1 = (A−1 + BT C−1 B)−1 BT C−1 .
(2.36)
Equation (2.36) is readily confirmed by left-multiplying both sides by (A−1 + BT C−1 B), and right-multiplying by (C + BABT ) and showing that the two sides of the resulting equation are equal. Another identity, found by “completing the square,” is demonstrated by directly multiplying it out, and requires C = CT (A is unrestricted, but the matrices must be conformable as shown): ACAT − BAT − ABT = (A − BC−1 )C(A − BC−1 )T − BC−1 BT .
(2.37)
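These identities are easily checked numerically. The following sketch (not from the text) verifies the form (2.35) of the matrix inversion lemma for illustrative symmetric positive definite A and C and a rectangular B, using NumPy.

```python
# Numerical verification of the matrix inversion lemma, Eq. (2.35).
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
G = rng.standard_normal((n, n)); A = G @ G.T + n * np.eye(n)   # symmetric, positive definite
H = rng.standard_normal((m, m)); C = H @ H.T + m * np.eye(m)   # symmetric, positive definite
B = 0.5 * rng.standard_normal((n, m))                          # rectangular, conformable
inv = np.linalg.inv

left = inv(C - B.T @ inv(A) @ B)
right = inv(C) - inv(C) @ B.T @ inv(B @ inv(C) @ B.T - A) @ B @ inv(C)
print(np.allclose(left, right))   # True
```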
2.3 Simple statistics: regression

2.3.1 Probability densities, moments

Some statistical ideas are required, but the discussion is confined to stating some basic notions and to developing a notation.12 We require the idea of a probability density for a random variable x. This subject is a very deep one, but our approach is heuristic.13 Suppose that an arbitrarily large number of experiments can be conducted for the determination of the values of x, denoted X_i, i = 1, 2, . . . , M, and a histogram of the experimental values found. The frequency function, or probability density, will be defined as the limit, supposing it exists, of the histogram of an arbitrarily large number of experiments, M → ∞, divided into bins of arbitrarily small value ranges, and normalized by M, to produce the fraction of the total appearing in the ranges. Let the corresponding limiting frequency function be denoted px(X) dX, interpreted as the fraction (probability) of values of x lying in the range, X ≤ x ≤ X + dX. As a consequence of the definition, px(X) ≥ 0 and
\[
\int_{\text{all }X} p_x(X)\,\mathrm{d}X = \int_{-\infty}^{\infty} p_x(X)\,\mathrm{d}X = 1. \qquad (2.38)
\]
The infinite integral is a convenient way of representing an integral over “all X ,” as px simply vanishes for impossible values of X. (It should be noted that this so-called frequentist approach has fallen out of favor, with Bayesian assumptions being regarded as ultimately more rigorous and fruitful. For introductory purposes,
however, empirical frequency functions appear to provide an adequate intuitive basis for proceeding.)

The "average," or "mean," or "expected value" is denoted ⟨x⟩ and defined as
\[
\langle x \rangle \equiv \int_{\text{all }X} X\,p_x(X)\,\mathrm{d}X = m_1. \qquad (2.39)
\]
The mean is the center of mass of the probability density. Knowledge of the true mean value of a random variable is commonly all that we are willing to assume known. If forced to "forecast" the numerical value of x under such circumstances, often the best we can do is to employ ⟨x⟩. If the deviation from the true mean is denoted x′ so that x = ⟨x⟩ + x′, such a forecast has the virtue that we are assured the average forecast error, ⟨x′⟩, would be zero if many such forecasts are made. The bracket operation is very important throughout this book; it has the property that if a is a non-random quantity, ⟨ax⟩ = a⟨x⟩ and ⟨ax + y⟩ = a⟨x⟩ + ⟨y⟩. Quantity ⟨x⟩ is the "first-moment" of the probability density. Higher order moments are defined as
\[
m_n = \langle x^n \rangle = \int_{-\infty}^{\infty} X^n p_x(X)\,\mathrm{d}X,
\]
where n are the non-negative integers. A useful theoretical result is that a knowledge of all the moments is usually enough to completely define the probability density itself. (There are troublesome situations with, e.g., non-existent moments, as with the so-called Cauchy distribution, px(X) = (2/π)(1/(1 + X²)), X ≥ 0, whose mean is infinite.) For many important probability densities, including the Gaussian, a knowledge of the first two moments n = 1, 2 is sufficient to define all the others, and hence the full probability density. It is common to define the moments for n > 1 about the mean, so that one has
\[
\mu_n = \big\langle (x - \langle x \rangle)^n \big\rangle = \int_{-\infty}^{\infty} (X - \langle X \rangle)^n p_x(X)\,\mathrm{d}X.
\]
μ₂ is the variance and often written μ₂ = σ², where σ is the "standard deviation."

2.3.2 Sample estimates: bias

In observational sciences, one normally must estimate the values defining the probability density from the data itself. Thus the first moment, the mean, is often computed as the "sample average,"
\[
\tilde{m}_1 = \langle x \rangle_M \equiv \frac{1}{M}\sum_{i=1}^{M} X_i. \qquad (2.40)
\]
The notation m̃₁ is used to distinguish the sample estimate from the true value, m₁. On the other hand, if the experiment of computing m̃₁ from M samples could be repeated many times, the mean of the sample estimates would be the true mean. This conclusion is readily seen by considering the expected value of the difference from the true mean:
\[
\Big\langle \langle x \rangle_M - \langle x \rangle \Big\rangle
= \Big\langle \frac{1}{M}\sum_{i=1}^{M} X_i - \langle x \rangle \Big\rangle
= \frac{1}{M}\sum_{i=1}^{M} \langle X_i \rangle - \langle x \rangle
= \frac{M}{M}\langle x \rangle - \langle x \rangle = 0.
\]
Such an estimate is said to be "unbiassed": its expected value is the quantity one seeks. The interpretation is that, for finite M, we do not expect that the sample mean will equal the true mean, but that if we could produce sample averages from distinct groups of observations, the sample averages would themselves have an average that will fluctuate about the true mean, with equal probability of being higher or lower. There are many sample estimates, however, some of which we encounter, where the expected value of the sample estimate is not equal to the true estimate. Such an estimator is said to be "biassed." A simple example of a biassed estimator is the "sample variance," defined as
\[
s^2 \equiv \frac{1}{M}\sum_{i=1}^{M} \big(X_i - \langle x \rangle_M\big)^2. \qquad (2.41)
\]
For reasons explained later in this chapter (p. 42), one finds that
\[
\langle s^2 \rangle = \frac{M-1}{M}\,\sigma^2 \neq \sigma^2,
\]
and thus the expected value is not the true variance. (This particular estimate is “asymptotically unbiassed,” as the bias vanishes as M → ∞.) We are assured that the sample mean is unbiassed. But the probability that
⟨x⟩_M = ⟨x⟩, that is that we obtain exactly the true value, is very small. It helps to have a measure of the extent to which ⟨x⟩_M is likely to be very far from ⟨x⟩. To do so, we need the idea of dispersion – the expected or average squared value of some quantity about some interesting value, like its mean. The most familiar measure of dispersion is the variance, already used above, the expected fluctuation of a random variable about its mean: σ² = ⟨(x − ⟨x⟩)²⟩.
More generally, define the dispersion of any random variable, z, as D²(z) = ⟨z²⟩. Thus, the variance of x is D²(x − ⟨x⟩). The variance of ⟨x⟩_M about the correct value is obtained by a little algebra using the bracket notation,
\[
D^2\big(\langle x \rangle_M - \langle x \rangle\big) = \frac{\sigma^2}{M}. \qquad (2.42)
\]
This expression shows the well-known result that as M becomes large, any tendency of the sample mean to lie far from the true value will diminish. It does not prove that some particular value will not, by accident, be far away, merely that it becomes increasingly unlikely as M grows. (In statistics textbooks, the Chebyschev inequality is used to formalize this statement.) An estimate that is unbiassed and whose expected dispersion about the true value goes to zero with M is evidently desirable. In more interesting estimators, a bias is often present. Then for a fixed number of samples, M, there would be two distinct sources of deviation (error) from the true value: (1) the bias – how far, on average, it is expected to be from the true value, and (2) the tendency – from purely random events – for the value to differ from the true value (the random error). In numerous cases, one discovers that tolerating a small bias error can greatly reduce the random error – and thus the bias may well be worth accepting for that reason. In some cases therefore, a bias is deliberately introduced.
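A brief Monte Carlo experiment makes these statements concrete. The sketch below (not from the text; the sample size, number of trials, and σ are illustrative) checks that the sample mean is unbiassed with dispersion σ²/M, and that the sample variance (2.41) has expected value (M − 1)σ²/M.

```python
# Monte Carlo illustration of Eqs. (2.40)-(2.42) and the sample-variance bias.
import numpy as np

rng = np.random.default_rng(2)
M, trials, sigma = 10, 200_000, 3.0
x = rng.normal(0.0, sigma, size=(trials, M))

sample_means = x.mean(axis=1)
sample_vars = ((x - sample_means[:, None]) ** 2).mean(axis=1)   # Eq. (2.41)

print(sample_means.mean())                           # approx. 0: unbiassed
print(sample_means.var(), sigma**2 / M)              # approx. sigma^2/M, Eq. (2.42)
print(sample_vars.mean(), (M - 1) / M * sigma**2)    # bias factor (M-1)/M
```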
2.3.3 Functions and sums of random variables

If the probability density of x is px(x), then the mean of a function of x, g(x), is just
\[
\langle g(x) \rangle = \int_{-\infty}^{\infty} g(X)\,p_x(X)\,\mathrm{d}X, \qquad (2.43)
\]
which follows from the definition of the probability density as the limit of the outcome of a number of trials. The probability density for g regarded as a new random variable is obtained from
\[
p_g(G) = p_x\big(X(G)\big)\,\frac{\mathrm{d}x}{\mathrm{d}g}, \qquad (2.44)
\]
where dx/dg is the ratio of the differential intervals occupied by x and g and can be understood by reverting to the original definition of probability densities from histograms. The Gaussian, or normal, probability density is one that is mathematically handy (but is potentially dangerous as a general model of the behavior of natural
processes – many geophysical and fluid processes are demonstrably non-Gaussian). For a single random variable x, it is defined as
\[
p_x(X) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\exp\left(-\frac{(X - m_x)^2}{2\sigma_x^2}\right)
\]
(sometimes abbreviated as G(m_x, σ_x)). It is readily confirmed that ⟨x⟩ = m_x, ⟨(x − ⟨x⟩)²⟩ = σ_x². One important special case is the transformation of the Gaussian to another Gaussian of zero-mean and unit standard deviation, G(0, 1),
\[
z = \frac{x - m_x}{\sigma_x},
\]
which can always be done, and thus,
\[
p_z(Z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{Z^2}{2}\right).
\]
A second important special case of a change of variable is g = z², where z is Gaussian of zero mean and unit variance. Then the probability density of g is
\[
p_g(G) = \frac{1}{\sqrt{2\pi}\,G^{1/2}}\exp(-G/2), \qquad (2.45)
\]
a special probability density usually denoted χ₁² ("chi-square-sub-1"), the probability density for the square of a Gaussian. One finds ⟨g⟩ = 1, ⟨(g − ⟨g⟩)²⟩ = 2.

2.3.4 Multivariable probability densities: correlation

The idea of a frequency function generalizes easily to two or more random variables, x, y. We can, in concept, do an arbitrarily large number of experiments in which we count the occurrences of differing pair values, (X_i, Y_i), of x, y and make a histogram normalized by the total number of samples, taking the limit as the number of samples goes to infinity, and the bin sizes go to zero, to produce a joint probability density p_{xy}(X, Y). p_{xy}(X, Y) dX dY is then the fraction of occurrences such that X ≤ x ≤ X + dX, Y ≤ y ≤ Y + dY. A simple example would be the probability density for the simultaneous measurement of the two components of horizontal velocity at a point in a fluid. Again, from the definition, p_{xy}(X, Y) ≥ 0 and
\[
\int_{-\infty}^{\infty} p_{xy}(X, Y)\,\mathrm{d}Y = p_x(X), \qquad (2.46)
\]
\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p_{xy}(X, Y)\,\mathrm{d}X\,\mathrm{d}Y = 1. \qquad (2.47)
\]
An important use of joint probability densities is in what is known as "conditional probability." Suppose that the joint probability density for x, y is known and, furthermore, y = Y, that is, information is available concerning the actual value of y. What then is the probability density for x given that a particular value for y is known to have occurred? This new frequency function is usually written as p_{x|y}(X|Y) and read as "the probability of x, given that y has occurred," or, "the probability of x conditioned on y." It follows immediately from the definition of the probability density that
\[
p_{x|y}(X|Y) = \frac{p_{xy}(X, Y)}{p_y(Y)}. \qquad (2.48)
\]
(This equation can be interpreted by going back to the original experimental concept, and understanding the restriction on x, given that y is known to lie within a strip paralleling the X axis). Using the joint frequency function, define the average product as
\[
\langle xy \rangle = \int\!\!\int_{\text{all }X,Y} X\,Y\,p_{xy}(X, Y)\,\mathrm{d}X\,\mathrm{d}Y. \qquad (2.49)
\]
Suppose that upon examining the joint frequency function, one finds that px y (X, Y ) = px (X ) p y (Y ), that is, it factors into two distinct functions. In that case, x, y are said to be “independent.” Many important results follow, including
⟨xy⟩ = ⟨x⟩⟨y⟩. Non-zero mean values are often primarily a nuisance. One can always define modified variables, e.g., x′ = x − ⟨x⟩, such that the new variables have zero mean. Alternatively, one computes statistics centered on the mean. Should the centered product ⟨(x − ⟨x⟩)(y − ⟨y⟩)⟩ be non-zero, x, y are said to "co-vary" or to be "correlated." If ⟨(x − ⟨x⟩)(y − ⟨y⟩)⟩ = 0, then the two variables are "uncorrelated." If x, y are independent, they are uncorrelated. Independence thus implies lack of correlation, but the reverse is not necessarily true. (These are theoretical relationships, and if x, y are determined from observation, as described below, one must carefully distinguish estimated behavior from that expected theoretically.) If the two variables are independent, then (2.48) is p_{x|y}(X|Y) = p_x(X),
(2.50)
that is, the probability of x given y does not depend upon Y, and thus px y (X, Y ) = px (X ) p y (Y ) – and there is then no predictive power for one variable given knowledge of the other.
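A small numerical experiment (not from the text; distributions and sample sizes are illustrative) shows the distinction in practice: for independent samples the estimated centered product is near zero, while for a dependent pair it is not.

```python
# Estimated covariance for an independent and a dependent pair of variables.
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x = rng.normal(1.0, 2.0, N)
y_indep = rng.normal(-2.0, 1.0, N)          # independent of x
y_dep = 0.5 * x + rng.normal(0.0, 1.0, N)   # depends on x

for y in (y_indep, y_dep):
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    print(cov)   # near 0 for the independent pair, near 0.5*var(x) = 2 otherwise
```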
Suppose there are two random variables x, y between which there is anticipated to be some linear relationship, x = ay + n,
(2.51)
where n represents any contributions to x that remain unknown despite knowledge of y, and a is a constant. Then,
\[
\langle x \rangle = a\langle y \rangle + \langle n \rangle, \qquad (2.52)
\]
and (2.51) can be re-written as
\[
x - \langle x \rangle = a(y - \langle y \rangle) + (n - \langle n \rangle), \quad\text{or}\quad x' = ay' + n',
\quad\text{where } x' = x - \langle x \rangle,\ \text{etc.} \qquad (2.53)
\]
From this last equation,
\[
a = \frac{\langle x'y' \rangle}{\langle y'^2 \rangle}
= \frac{\langle x'y' \rangle}{\big(\langle y'^2 \rangle \langle x'^2 \rangle\big)^{1/2}}\,
  \frac{\langle x'^2 \rangle^{1/2}}{\langle y'^2 \rangle^{1/2}}
= \rho\,\frac{\langle x'^2 \rangle^{1/2}}{\langle y'^2 \rangle^{1/2}}, \qquad (2.54)
\]
where it was supposed that ⟨y′n′⟩ = 0, thus defining n′. The quantity
\[
\rho \equiv \frac{\langle x'y' \rangle}{\langle y'^2 \rangle^{1/2}\,\langle x'^2 \rangle^{1/2}} \qquad (2.55)
\]
is the “correlation coefficient” and has the property,14 |ρ| ≤ 1. If ρ should vanish, then so does a. If a vanishes, then knowledge of y carries no information about the value of x . If ρ = ±1, then it follows from the definitions that n = 0 and knowledge of a permits perfect prediction of x from knowledge of y . (Because probabilities are being used, rigorous usage would state “perfect prediction almost always,” but this distinction will be ignored.) A measure of how well the prediction of x from y will work can be obtained in terms of the variance of x . We have
\[
\langle x'^2 \rangle = a^2\langle y'^2 \rangle + \langle n'^2 \rangle
= \rho^2\langle x'^2 \rangle + \langle n'^2 \rangle,
\quad\text{or}\quad (1 - \rho^2)\langle x'^2 \rangle = \langle n'^2 \rangle. \qquad (2.56)
\]
That is, (1 − ρ²)⟨x′²⟩ is the fraction of the variance in x′ that is unpredictable from knowledge of y′ and is the "unpredictable power." Conversely, ρ²⟨x′²⟩ is the "predictable" power in x′ given knowledge of y′. The limits as ρ → 0, 1 are readily apparent.
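The following hedged sketch (not from the text; a, the variances, and the sample size are illustrative) estimates a and ρ from synthetic data and checks the split of ⟨x′²⟩ into predictable and unpredictable parts.

```python
# Estimating a and rho, Eqs. (2.54)-(2.55), and checking Eq. (2.56).
import numpy as np

rng = np.random.default_rng(4)
N = 200_000
y = rng.normal(0.0, 2.0, N)
n = rng.normal(0.0, 1.0, N)       # "noise", independent of y
x = 1.5 * y + n                   # x = a*y + n with a = 1.5

xp, yp = x - x.mean(), y - y.mean()
a_hat = np.mean(xp * yp) / np.mean(yp**2)                          # Eq. (2.54)
rho = np.mean(xp * yp) / np.sqrt(np.mean(xp**2) * np.mean(yp**2))  # Eq. (2.55)

print(a_hat)                                          # approx. 1.5
print((1 - rho**2) * np.mean(xp**2), np.mean(n**2))   # Eq. (2.56): both approx. 1
```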
Thus we interpret the statement that two variables x′, y′ "are correlated" or "co-vary" to mean that knowledge of one permits at least a partial prediction of the other, the expected success of the prediction depending upon the magnitude of ρ. If ρ is not zero, the variables cannot be independent, and the conditional probability p_{x|y}(X|Y) ≠ p_x(X). This result represents an implementation of the statement that if two variables are not independent, then knowledge of one permits some skill in the prediction of the other. If two variables do not co-vary, but are known not to be independent, a linear model like (2.51) would not be useful – a non-linear one would be required. Such non-linear methods are possible, and are touched on briefly later. The idea that correlation or covariance between various physical quantities carries useful predictive skill between them is an essential ingredient of many of the methods taken up in this book. In most cases, quantities like ρ, ⟨x′²⟩ are determined from the available measurements, e.g., of the form
\[
a y_i + n_i = x_i, \qquad (2.57)
\]
and are not known exactly. They are thus sample values, are not equal to the true values, and must be interpreted carefully in terms of their inevitable biasses and variances. This large subject of regression analysis is left to the references.15

2.3.5 Change of variables

Suppose we have two random variables x, y with joint probability density p_{xy}(X, Y). They are known as functions of two new variables x = x(ξ₁, ξ₂), y = y(ξ₁, ξ₂) and an inverse mapping ξ₁ = ξ₁(x, y), ξ₂ = ξ₂(x, y). What is the probability density for these new variables? The general rule for changes of variable in probability densities follows from area conservation in mapping from the x, y space to the ξ₁, ξ₂ space, that is,
\[
p_{\xi_1\xi_2}(\Xi_1, \Xi_2) = p_{xy}\big(X(\Xi_1, \Xi_2),\,Y(\Xi_1, \Xi_2)\big)\,
\frac{\partial(X, Y)}{\partial(\Xi_1, \Xi_2)}, \qquad (2.58)
\]
where ∂(X, Y )/∂(1 , 2 ) is the Jacobian of the transformation between the two variable sets. As in any such transformation, one must be alert for zeros or infinities in the Jacobian, indicative of multiple valuedness in the mapping. Texts on multivariable calculus discuss such issues in detail. Example Suppose x1 , x2 are independent Gaussian random variables of zero mean and variance σ 2 . Then
\[
p_x(\mathbf{X}) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{X_1^2 + X_2^2}{2\sigma^2}\right).
\]
Define new random variables r = (x12 + x22 )1/2 ,
φ = tan−1 (x2 /x1 ),
(2.59)
whose mapping in the inverse direction is x1 = r cos φ,
x2 = r sin φ,
(2.60)
that is, the relationship between polar and Cartesian coordinates. The Jacobian of the transformation is Ja = r. Thus
\[
p_{r,\phi}(R, \Phi) = \frac{R}{2\pi\sigma^2}\exp\left(-\frac{R^2}{2\sigma^2}\right),
\qquad 0 \le r,\; -\pi \le \phi \le \pi. \qquad (2.61)
\]
(2.61)
(2.62)
which is known as a Rayleigh distribution. By inspection then, pφ () =
1 , 2π
which is the uniform distribution, independent of . (These results are very important in signal processing.) To generalize to n dimensions, let there be N variables, xi , i = 1, 2, . . . , N , with known joint probability density px1 ···x N . Let there be N new variables, ξ i , that are known functions of the xi , and an inverse mapping between them. Then the joint probability density for the new variables is just pξ 1 ···ξ N (1 , . . . , N ) = px1 ···x N (1 (X 1 , . . . , X N ) . . . , N (X 1 , . . . , X N ))
∂(X 1 , . . . , X N ) . ∂(1 , . . . , N )
(2.63)
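The polar-coordinates example above is easily checked by simulation. The sketch below (not from the text; σ, the sample size, and the bin count are illustrative) compares histograms of r and φ with the Rayleigh density (2.62) and the uniform density 1/(2π).

```python
# Monte Carlo check of the Rayleigh/uniform result of the polar-coordinate example.
import numpy as np

rng = np.random.default_rng(5)
sigma, N = 2.0, 500_000
x1 = rng.normal(0.0, sigma, N)
x2 = rng.normal(0.0, sigma, N)
r = np.hypot(x1, x2)
phi = np.arctan2(x2, x1)

# Compare a histogram of r with (R/sigma^2) exp(-R^2/(2 sigma^2)), Eq. (2.62).
counts, edges = np.histogram(r, bins=50, density=True)
R = 0.5 * (edges[:-1] + edges[1:])
rayleigh = (R / sigma**2) * np.exp(-R**2 / (2 * sigma**2))
print(np.max(np.abs(counts - rayleigh)))   # small

# phi should be uniform on (-pi, pi], i.e., density 1/(2*pi).
counts_phi, _ = np.histogram(phi, bins=50, density=True)
print(counts_phi.min(), counts_phi.max(), 1 / (2 * np.pi))
```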
Suppose that x, y are independent Gaussian variables G(m x , σ x ), G(m y , σ y ). Then their joint probability density is just the product of the two individual densities,
1 (X − m x )2 (Y − m y )2 px,y (X, Y ) = . (2.64) exp − − 2πσ x σ y 2σ 2x 2σ 2y Let two new random variables, ξ 1 , ξ 2 , be defined as a linear combination of x, y, ξ 1 = a11 (x − m x ) + a12 (y − m y ) + m ξ 1 ξ 2 = a21 (x − m x ) + a22 (y − m y ) + m ξ 2 ,
(2.65)
or, in vector form, ξ = A(x − mx ) + mξ , where x = [x, y]T , mx = [m x , m y ]T , m y = [m ξ 1 , m ξ 2 ]T , and the numerical values satisfy the corresponding functional relations, 1 = a11 (X − m x ) + a12 (Y − m y ) + m ξ 1 , etc. Suppose that the relationship (2.65) is invertible, that is, we can solve for x = b11 (ξ 1 − m ξ 1 ) + b12 (ξ 2 − m ξ 2 ) + m x y = b21 (ξ 1 − m ξ 1 ) + b22 (ξ 2 − m ξ 2 ) + m y , or x = B(ξ − mξ ) + mx .
(2.66)
Then the Jacobian of the transformation is ∂(X, Y ) = b11 b22 − b12 b21 = det(B) ∂(1 , 2 )
(2.67)
(det(B) is the determinant). Equation (2.65) produces
ξ 1 = m ξ 1
ξ 2 = m ξ 2 2 2 2 2
(ξ 1 − ξ 1 )2 = a11 σ x + a12 σ y,
2 2 2 2
(ξ 2 − ξ 2 )2 = a21 σ x + a22 σy
(ξ 1 − ξ 1 )(ξ 2 − ξ 2 ) = a11 a21 σ 2x + a12 a22 σ 2y = 0. In the special case, cos φ A= − sin φ
sin φ , cos φ
B=
cos φ sin φ
(2.68)
− sin φ , cos φ
(2.69)
the transformation (2.69) is a simple coordinate rotation through angle φ, and the Jacobian is 1. The new second-order moments are
(ξ 1 − ξ 1 )2 = σ 2ξ 1 = cos2 φ σ 2x + sin2 φ σ 2y ,
(ξ 2 − ξ 2 )2 = σ 2ξ 2 = sin2 φ σ 2x + cos2 φ σ 2y ,
(ξ 1 − ξ 1 )(ξ 2 − ξ 2 ) ≡ μξ 1 ξ 2 = σ 2y − σ 2x cos φ sin φ.
(2.70) (2.71) (2.72)
The new probability density is
\[
p_{\xi_1\xi_2}(\Xi_1, \Xi_2)
= \frac{1}{2\pi\sigma_{\xi_1}\sigma_{\xi_2}\sqrt{1 - \rho_\xi^2}}
\exp\left\{-\frac{1}{2\big(1 - \rho_\xi^2\big)}
\left[\frac{(\Xi_1 - m_{\xi_1})^2}{\sigma_{\xi_1}^2}
- \frac{2\rho_\xi(\Xi_1 - m_{\xi_1})(\Xi_2 - m_{\xi_2})}{\sigma_{\xi_1}\sigma_{\xi_2}}
+ \frac{(\Xi_2 - m_{\xi_2})^2}{\sigma_{\xi_2}^2}\right]\right\}, \qquad (2.73)
\]
where ρ ξ = (σ 2y − σ 2x ) sin φ cos φ/(σ 2x + σ 2y )1/2 = μξ 1 ξ 2 /σ ξ 1 σ ξ 2 is the correlation coefficient of the new variables. A probability density derived through a linear transformation from two independent variables that are Gaussian will be said to be “jointly Gaussian” and (2.73) is a canonical form. Because a coordinate rotation is invertible, it is important to note that if we had two random variables ξ 1 , ξ 2 that were jointly Gaussian with ρ = 1, then we could find a pure rotation (2.69), which produces two other variables x, y that are uncorrelated, and therefore independent. Two such uncorrelated variables x, y will necessarily have different variances, otherwise ξ 1 , ξ 2 would have zero correlation, too, by Eq. (2.72) . As an important by-product, it is concluded that two jointly Gaussian random variables that are uncorrelated, are also independent. This property is one of the reasons Gaussians are so nice to work with; but it is not generally true of uncorrelated variables. Vector random processes Simultaneous discussion of two random processes, x, y, can regarded as discussion of a vector random process [x, y]T , and suggests a generalization to N dimensions. Label N random processes as xi and define them as the elements of a vector x = [x1 , x2 , . . . , x N ]T . Then the mean is a vector: x = mx , and the covariance is a matrix: Cx x = D 2 (x − x) = (x − x)(x − x)T ,
(2.74)
which is necessarily symmetric and positive semi-definite. The cross-covariance of two vector processes x, y is Cx y = (x − x)(y − y)T ,
(2.75)
and Cx y = CTyx . It proves convenient to introduce two further moment matrices in addition to the covariance matrices. The “second moment” matrices will be defined as Rx x ≡ D 2 (x) = xxT ,
Rx y = xyT ,
that is, not taken about the means. Note Rx y = RTyx , etc. Let x˜ be an “estimate” of the true value, x. Then the dispersion of x˜ about the true value will be called the “uncertainty” (it is sometimes called the “error covariance”) and is P ≡ D 2 (˜x − x) = (˜x − x)(˜x − x)T . P is similar to C, but differs in being taken about the true value, rather than about the mean value; the distinction can be very important. If there are N variables, ξ i , i = 1, 2, . . . , N , they will be said to have an “N -dimensional jointly normal probability density” if it is of the form $ % exp − 12 ( − m)T C−1 ξ ξ ( − m) pξ 1 ,...,ξ N (1 , . . . , N ) = . (2.76) (2π ) N /2 det(Cξ ξ ) One finds ξ = m, (ξ − m)(ξ − m)T = Cξ ξ . Equation 2.73 is a special case for N = 2, and so the earlier forms are consistent with this general definition. Positive definite symmetric matrices can be factored as T/2
1/2
Cξ ξ = Cξ ξ Cξ ξ ,
(2.77) 1/2
which is called the “Cholesky decomposition,” where Cξ ξ is an upper triangular matrix (all zeros below the main diagonal) and non-singular.16 It follows that the transformation (a rotation and stretching) −T/2
x = Cξ ξ
(ξ − m)
(2.78)
produces new variables x of zero mean, and diagonal covariance, that is, a probability density exp − 12 X 12 + · · · + X 2N px1 ,...,x N (X 1 , . . . , X N ) = (2π ) N /2 1 2 exp − 2 X 1 exp − 12 X 2N = ··· , (2.79) (2π )1/2 (2π)1/2 which factors into N -independent, normal variates of zero mean and unit variance (Cx x = Rx x = I). Such a process is often called Gaussian “white noise,” and has the property xi x j = 0, i = j.17
2.3.6 Sums of random variables It is often helpful to be able to compute the probability density of sums of independent random variables. The procedure for doing so is based upon (2.43). Let x be
a random variable and consider the expected value of the function ei xt : ∞ i xt
e = px (X ) ei X t dX ≡ φ x (t), −∞
41
(2.80)
which is the Fourier transform of px (X ); φ x (t) is the “characteristic function” of x. Now consider the sum of two independent random variables x, y with probability densities px , p y , respectively, and define a new random variable z = x + y. What is the probability density of z? One starts by first determining the characteristic function, φ z (t), for z and then using the Fourier inversion theorem to obtain px (Z ). To obtain φ z , & ' φ z (t) = ei zt = ei(x+y)t = ei xt ei yt , where the last step depends upon the independence assumption. This last equation shows φ z (t) = φ x (t)φ y (t).
(2.81)
That is, the characteristic function for a sum of two independent variables is the product of the characteristic functions. The “convolution theorem”18 asserts that the Fourier transform (forward or inverse) of a product of two functions is the convolution of the Fourier transforms of the two functions. That is, ∞ pz (Z ) = (2.82) px (r ) p y (Z − r ) dr. −∞
We will not explore this relation in any detail, leaving the reader to pursue the subject in the references.19 But it follows immediately that the multiplication of the characteristic functions of a sum of independent Gaussian variables produces a new variable, which is also Gaussian, with a mean equal to the sum of the means and a variance that is the sum of the variances (“sums of Gaussians are Gaussian”). It also follows immediately from Eq. (2.81) that if a variable ξ is defined as ξ = x12 + x22 + · · · + xν2 ,
(2.83)
where each xi is Gaussian of zero mean and unit variance, then the probability density for ξ is pξ () =
ν/2−1 exp(−/2), 2ν/2 ν2
(2.84)
known as χ 2ν – “chi-square sub-ν.” The chi-square probability density is central to the discussion of the expected sizes of vectors, such as n, ˜ measured as n˜ T n˜ = 2 2 n ˜ = i n˜ i if the elements of n˜ can be assumed to be independent and Gaussian. One has ξ = ν, (ξ − ξ )2 = 2ν. Equation (2.45) is the special case ν = 1.
Basic machinery
42
Degrees-of-Freedom The number of independent variables described by a probability density is usually called the “number of degrees-of-freedom.” Thus the densities in (2.76) and (2.79) have N degrees-of-freedom and (2.84) has ν of them. If a sample average (2.40) is formed, it is said to have N degrees-of-freedom if each of the x j is independent. But what if the x j have a covariance Cx x that is non-diagonal? This question of how to interpret averages of correlated variables will be explicitly discussed later (p. 133). Consider the special case of the sample variance Eq. (2.41) – which we claimed was biassed. The reason is that even if the sample values, xi , are independent, the presence of the sample average in the sample variance means that there are only N − 1 independent terms in the sum. That this is so is most readily seen by examining the two-term case. Two samples produce a sample mean, x2 = (x1 + x2 )/2. The two-term sample variance is s 2 = 12 [(x1 − x2 )2 + (x2 − x2 )2 ], but knowledge of x1 , and of the sample average, permits perfect prediction of x2 = 2 x2 − x1 . The second term in the sample variance as written is not independent of the first term, and thus there is just one independent piece of information in the two-term sample variance. To show it in general, assume without loss of generality that x = 0, so that σ 2 = x 2 . The sample variance about the sample mean (which will not vanish) of independent samples is given by Eq. (2.41), and so ! M M M
1 1 1
s 2 = xi − xi − xj xp M i=1 M j=1 M p=1 ( ) M M M M M
& 2' 1
1
1
1
= xi −
xi x j −
xi x p + 2
x j x p M i=1 M j=1 M p=1 M j=1 p=1 ( ) M 2
2
2
1
σ σ σ = δi j − δi p + 2 δ jp σ2 − M i=1 M j M p M j p =
(M − 1) 2 σ = σ 2 . M Stationarity
Consider a vector random variable, with element xi where the subscript i denotes a position in time or space. Then xi , x j are two different random variables – for example, the temperature at two different positions in a moving fluid, or the temperature at two different times at the same position. If the physics governing these two different random variables are independent of the parameter i (i.e., independent of time or
space), then xi is said to be “stationary” – meaning that all the underlying statistics are independent of i.20 Specifically, xi = x j ≡ x, D 2 (xi ) = D 2 (x j ) = D 2 (x), etc. Furthermore, xi , x j have a covariance C x x (i, j) = (xi − xi )(x j − x j ),
(2.85)
that is, independent of i, j, and might as well be written C x x (|i − j|), depending only upon the difference |i − j|. The distance, |i − j|, is often called the “lag.” C x x (|i − j|) is called the “autocovariance” of x or just the covariance, because xi , x j are now regarded as intrinsically the same process.21 If C x x does not vanish, then by the discussion above, knowledge of the numerical value of xi implies some predictive skill for x j and vice-versa – a result of great importance for mapmaking and objective analysis. For stationary processes, all moments having the same |i − j| are identical; it is seen that all diagonals of such a matrix {C x x (i, j)}, are constant, for example, Cξ ξ in Eq. (2.76). Matrices with constant diagonals are thus defined by the vector C x x (|i − j|), and are said to have a “Toeplitz form.” 2.4 Least-squares Much of what follows in this book can be described using very elegant and powerful mathematical tools. On the other hand, by restricting the discussion to discrete models and finite numbers of measurements (all that ever goes into a digital computer), almost everything can also be viewed as a form of ordinary least-squares, providing a much more intuitive approach than one through functional analysis. It is thus useful to go back and review what “everyone knows” about this most-familiar of all approximation methods. 2.4.1 Basic formulation Consider the elementary problem motivated by the “data” shown in Fig. 2.2. t is supposed to be an independent variable, which could be time, a spatial coordinate, or just an index. Some physical variable, call it θ(t), perhaps temperature at a point in a laboratory tank, has been measured at coordinates t = ti , i = 1, 2, . . . , M, as depicted in the figure. We have reason to believe that there is a linear relationship between θ(t) and t in the form θ(t) = a + bt, so that the measurements are y(ti ) = θ (ti ) + n(ti ) = a + bti + n(ti ),
(2.86)
where n(t) is the inevitable measurement noise. The straight-line relationship might as well be referred to as a “model,” as it represents the present conception of the data structure. We want to determine a, b.
Figure 2.2 "Data" generated through the rule y = 1 + 2t + n_t, where ⟨n_t⟩ = 0, ⟨n_i n_j⟩ = 9δ_ij, shown as solid dots connected by the solid line. The dashed line is the simple least-squares fit, ỹ = (1.69 ± 0.83) + (1.98 ± 0.03)t. Residuals are plotted as open circles, and at least visually, show no obvious structure. Note that the fit is correct within its estimated standard errors. The sample variance of the estimated noise, not the theoretical value, was used for calculating the uncertainty.
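An experiment of the kind shown in Figure 2.2 is easy to reproduce. The sketch below (not from the text; the time grid, noise level, and seed are illustrative) builds the straight-line model, solves the normal equations derived later in this section, and reports the parameter estimates with standard errors based on the sample residual variance, as in the caption above.

```python
# Straight-line least-squares fit to synthetic data, y = 1 + 2t + noise.
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(50, dtype=float)
y = 1.0 + 2.0 * t + rng.normal(0.0, 3.0, t.size)   # <n> = 0, <n_i n_j> = 9 delta_ij

E = np.column_stack([np.ones_like(t), t])          # model matrix for a + b*t
xtilde = np.linalg.solve(E.T @ E, E.T @ y)         # normal equations
ntilde = y - E @ xtilde                            # residuals

# Standard errors using the sample residual variance (M - N divisor).
s2 = ntilde @ ntilde / (len(y) - 2)
P = s2 * np.linalg.inv(E.T @ E)
print(xtilde, np.sqrt(np.diag(P)))   # estimates near [1, 2], with their errors
```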
The set of observations can be written in the general standard form, Ex + n = y, where
⎧ 1 ⎪ ⎪ ⎪ ⎪1 ⎨ E= · ⎪ ⎪ ⎪ · ⎪ ⎩ 1
⎫ t1 ⎪ ⎪ ⎪ t2 ⎪ ⎬ · , ⎪ ⎪ ·⎪ ⎪ ⎭ tM
⎤ y(t1 ) ⎢ y(t2 ) ⎥ ⎥ ⎢ ⎥, y=⎢ · ⎥ ⎢ ⎣ · ⎦ y(t M ) ⎡
a x= , b
(2.87) ⎡
⎤ n(t1 ) ⎢ n(t2 ) ⎥ ⎢ ⎥ ⎥. n=⎢ · ⎢ ⎥ ⎣ · ⎦ n(t M )
(2.88)
Equation sets like (2.87) are seen in many practical situations, including the ones described in Chapter 1. The matrix E in general represents arbitrarily complicated linear relations between the parameters x, and the observations y. In some real cases, it has many thousands of rows and columns. Its construction involves specifying what those relations are, and, in a very general sense, it requires a “model” of the data set. Unfortunately, the term “model” is used in a variety of other ways in this context, including statistical assumptions, and often for auxiliary relationships among the elements of x that are independent of those contained in E. To separate these difference usages, we will sometimes append various adjectives to the use (“statistical model,” “exact relationships,” etc.).
One sometimes sees (2.87) written as Ex ∼ y, or even Ex = y. But Eq. (2.87) is preferable, because it explicitly recognizes that n = 0 is exceptional. Sometimes, by happenstance or arrangement, one finds M = N and that E has an inverse. But the obvious solution, x = E−1 y, leads to the conclusion that n = 0, which should be unacceptable if the y are the result of measurements. We will need to return to this case, but, for now, consider the commonplace problem where M > N . Then, one often sees a “best possible” solution – defined as producing the smallest possible value of nT n, that is, the minimum of J=
M
n i2 = nT n = (y − Ex)T (y − Ex).
(2.89)
i=1
(Whether the smallest noise solution really is the best one is considered later.) In the special case of the straight-line model, J=
M
(yi − a − bti )2 .
(2.90)
i=1
J is an example of what is called an “objective,” “cost” or “misfit” function.22 Taking the differential of (2.90) with respect to a, b or x (using (2.27)), and setting it to zero produces T
∂J ∂J dJ = dxi = dx ∂ xi ∂x i = 2dxT (ET y − ET Ex) = 0. This equation is of the form dJ =
ai dxi = 0.
(2.91)
(2.92)
It is an elementary result of multivariable calculus that an extreme value (here a minimum) of J is found where dJ = 0. Because the xi are free to vary independently, dJ will vanish only if the coefficients of the dxi are separately zero, or ET y − ET Ex = 0.
(2.93)
ET Ex = ET y,
(2.94)
That is,
46
Basic machinery
which are called the “normal equations.” Note that Eq. (2.93) asserts that the columns of E are orthogonal (that is “normal”) to n = y − Ex. Making the sometimes-valid assumption that (ET E)−1 exists, x˜ = (ET E)−1 ET y.
(2.95)
Note on notation: Solutions to equations involving data will be denoted x˜ , to show that they are an estimate of the solution and not necessarily identical to the “true” one in a mathematical sense. Second derivatives of J with respect to x, make clear that we have a minimum and not a maximum. The relationship between 2.95 and the “correct” value is obscure. x˜ can be substituted everywhere for x in Eq. (2.89), but usually the context makes clear the distinction between the calculated and true values. Figure 2.2 displays the fit along with the residuals, n˜ = y − E˜x = [I − E(ET E)−1 ET ]y.
(2.96)
That is, the M equations have been used to estimate N values, x˜ i , and M values n˜ i , or M + N altogether. The combination H = E(ET E)−1 ET
(2.97)
occurs sufficiently often that it is worth a special symbol. Note the “idempotent” property H2 = H. If the solution, x˜ , is substituted into the original equations, the result is E˜x = Hy = y˜ ,
(2.98)
n˜ T y˜ = [(I − H) y]T Hy = 0.
(2.99)
and
The residuals are orthogonal (normal) to the inferred noise-free “data” y˜ . All of this is easy and familiar and applies to any set of simultaneous linear equations, not just the straight-line example. Before proceeding, let us apply some of the statistical machinery to understanding (2.95). Notice that no statistics were used in obtaining (2.95), but we can nonetheless ask the extent to which this value for x˜ is affected by the random elements: the noise in y. Let y0 be the value of y that would be obtained in the hypothetical situation for which n = 0. Assume further that n = 0 and that Rnn = Cnn = nnT is known. Then the expected value of x˜ is
˜x = (ET E)−1 ET y0 .
(2.100)
If the matrix inverse exists, then in many situations, including the problem of fitting a straight line to data, perfect observations would produce the correct answer,
2.4 Least-squares
47
500 400
y
300 200 100 0
−100
0
2
4
6
8
10 Time
12
14
16
18
20
Figure 2.3 Here the “data” were generated from a quadratic rule, y = 1 + t 2 + n(t), n 2 = 900. Note that only the first 20 data points are used. An incorrect straight line fit was used resulting in y˜ = (−76.3 ± 17.3) + (20.98 ± 1.4)t, which is incorrect, but the residuals, at least visually, do not appear unacceptable. At this point some might be inclined to claim the model has been “verified,” or “validated.”
and Eq. (2.95) provides an unbiassed estimate of the true solution, ˜x = x. A more transparent demonstration of this result will be given later in this chapter (see p. 103). On the other hand, if the data were actually produced from physics governed, for example, by a quadratic rule, θ(t) = a + ct 2 , then fitting the linear rule to such observations, even if they are perfect, could never produce the right answer and the solution would be biassed. An example of such a fit is shown in Figs. 2.3 and 2.4. Such errors are conceptually distinguishable from the noise of observation, and are properly labeled “model errors.” Assume, however, that the correct model is being used, and therefore that ˜x = x. Then the uncertainty of the solution is P = Cx˜ x˜ = (˜x − x)(˜x − x)T = (ET E)−1 ET nnT E(ET E)−1 −1
= (E E) T
(2.101)
−1
E Rnn E(E E) . T
T
In the special case, Rnn = σ 2n I, that is, no correlation between the noise in different equations (white noise), Eq. (2.101) simplifies to P = σ 2n (ET E)−1 .
(2.102)
If we are not confident that ˜x = x, perhaps because of doubts about the straight-line model, Eqs. (2.101) and (2.102) are still interpretable, but as Cx˜ x˜ =
Basic machinery
48 3000
y
2000 1000 0 −500
0
10
20
30
40
50
Time
Figure 2.4 The same situation as in Fig. 2.3, except the series was extended to 50 points. Now the residuals (◦ ) are visually structured, and one would have a powerful suggestion that some hypothesis (something about the model or data) is not correct. This straight-line fit should be rejected as being inconsistent with the assumption that the residuals are unstructured: the model has been “invalidated.”
D 2 (˜x −
˜x), the covariance of x˜ . The “standard error” of x˜ i is usually defined to be ± Cx˜ x˜ ii and is used to understand the adequacy of data for distinguishing different possible estimates of x˜ . If applied to the straight-line fit of Fig. 2.2, ˜ = [1.69 ± 0.83, 1.98 ± 0.03], which are within one standard deviation ˜ b] x˜ T = [a, of the true values, [a, b] = [1, 2]. If the noise in y is Gaussian, it follows that the probability density of x˜ is also Gaussian, with mean ˜x and covariance Cx˜ x˜ . Of course, if n is not Gaussian, then the estimate won’t be either, and one must be wary of the utility of the standard errors. A Gaussian, or other, assumption should be regarded as part of the model definition. The uncertainty of the residuals is Cnn = (n˜ − n) ˜ (n˜ − n) ˜ T = (I − H) Rnn (I − H)T
(2.103)
= σ 2n (I − H)2 = σ 2n (I − H) , where zero-mean white noise was assumed, and H was defined in Eq. (2.97). The true noise, n, was assumed to be white, but the estimated noise, n, ˜ has a nondiagonal covariance and so in a formal sense does not have the expected structure. We return to this point below. The fit of a straight line to observations demonstrates many of the issues involved in making inferences from real, noisy data that appear in more complex situations. In Fig. 2.5, the correct model used to generate the data was the same as in Fig. 2.2, ˜ are numerically inexact, but ˜ b] but the noise level is very high. The parameters [a, consistent within one standard error with the correct values, which is all one can hope for. In Fig. 2.3, a quadratic model y = 1 + t 2 + n (t) was used to generate the numbers, with n 2 = 900. Using only the first 20 points, and fitting an incorrect model,
2.4 Least-squares
49
150 100
y
50 0
−50 −100
0
5
10
15
20
25 Time
30
35
40
45
50
Figure 2.5 The same situation as in Fig. 2.2, y = 1 + 2t, except n 2 = 900 to give very noisy data. Now the best-fitting straight line is y = (6.62 ± 6.50) + (1.85 ± 0.22) t, which includes the correct answer within one standard error. Note that the intercept value is indistinguishable from zero.
produces a reasonable straight-line fit to the data as shown. Modeling a quadratic field with a linear model produces a systematic or “model” error, which is not easy to detect here. One sometimes hears it said that “least-squares failed” in situations such as this one. But this conclusion shows a fundamental misunderstanding: leastsquares did exactly what it was asked to do – to produce the best-fitting straight line to the data. Here, one might conclude that “the straight-line fit is consistent with the data.” Such a conclusion is completely different from asserting that one has proven a straight-line fit correctly “explains” the data or, in modeler’s jargon, that the model has been “verified” or “validated.” If the outcome of the fit were sufficiently important, one might try more powerful tests on the n˜ i than a mere visual comparison. Such tests might lead to rejection of the straight-line hypothesis; but even if the tests are passed, the model has never been verified: it has only been shown to be consistent with the available data. If the situation remains unsatisfactory (perhaps one suspects the model is inadequate, but there are not enough data to produce sufficiently powerful tests), it can be very frustrating. But sometimes the only remedy is to obtain more data. So, in Fig. 2.4, the number of observations was extended to 50 points. Now, even visually, the n˜ i are obviously structured, and one would almost surely reject any hypothesis that a straight line was an adequate representation of the data. The model has been invalidated. A quadratic rule, y = a + bt + ct 2 , produces an acceptable solution (see Fig. 2.6). One must always confirm, after the fact, that J , which is a direct function of the residuals, behaves as expected when the solution is substituted. In particular, its
Basic machinery
50 3000 2500 2000
y
1500 1000 500 0
−500
0
5
10
15
20
25
30
35
40
45
50
Time
Figure 2.6 Same as Fig. 2.4, except a more complete model, y = a + bt + ct 2 , was used, and which gives acceptable residuals.
expected value,
J =
M
& 2' ni = M − N ,
(2.104)
i
assuming that the n i have been scaled so that each has an expected value n i2 = 1. That there are only M − N independent terms in (2.104) follows from the N supposed-independent constraints linking the variables. For any particular solution, x˜ , n, ˜ J will be a random variable, whose expectation is (2.104). Assuming the n i are at least approximately Gaussian, J itself is the sum of M − N independent χ 21 variables, and is therefore distributed in χ 2M−N . One can and should make histograms of the individual n i2 to check them against the expected χ 21 probability density. This type of argument leads to the large literature on hypothesis testing. As an illustration of the random behavior of residuals, 30 equations, Ex + n = y, in 15 unknowns were constructed, such that ET E was non-singular. Fifty different values of y were then constructed by generating 50 separate n using a pseudorandom number generator. An ensemble of 50 different solutions were calculated using (2.95), producing 50 × 30 = 1500 separate values of n˜ i2 . These are plotted in 2 ˜ i , was found Fig. 2.7 and compared to χ 21 . The corresponding value, J˜ ( p) = 30 1 n for each set of equations, and also plotted. A corresponding frequency function for J˜( p) is compared in Fig. 2.7 to χ 215 , with reasonably good results. The empirical mean value of all J˜i is 14.3. Any particular solution may, completely correctly, produce individual residuals n˜ i2 differing considerably from the mean of χ 21 = 1, and, similarly, their sums, J ( p) , may differ greatly from χ 215 = 15. But one can readily calculate the probability of finding a much larger or smaller value, and employ it to help evaluate the possibility that one has used an incorrect model.
2.4 Least-squares (b)
(c)
ni
χ215
Ji
pn
(a)
51
i
Ji
Figure 2.7 (a) χ 21 probability density and the empirical frequency function of all residuals, n˜ i2 , from 50 separate experiments for the simple least-squares solution of Ex + n = y. There is at least rough agreement between the theoretical and calculated frequency functions. (b) The 50 values of Ji computed from the same experiments in (a). (c) The empirical frequency function for Ji as compared to the theoretical value of χ 215 (dashed line). Tests exist (not discussed here) of the hypothesis that the calculated Ji are consistent with the theoretical distribution.
Visual tests for randomness of residuals have obvious limitations, and elaborate statistical tests in addition to the comparison with χ 2 exist to help determine objectively whether one should accept or reject the hypothesis that no significant structure remains in a sequence of numbers. Books on regression analysis23 should be consulted for general methodologies. As an indication of what can be done, Fig. 2.8 shows the “sample autocorrelation” M−|τ | 1/M i=1 n˜ i n˜ i+τ ˜ φ nn (τ ) = (2.105) M 2 1/M i=1 n˜ i for the residuals of the fits shown in Figs. 2.4 and 2.6. For white noise,
φ˜ (τ ) = δ 0τ ,
(2.106)
and deviations of the estimated φ˜ (t) from Eq. (2.106) can be used in simple tests. The adequate fit (Fig. 2.6) produces an autocorrelation of the residuals indistinguishable from a delta function at the origin, while the inadequate fit shows a great deal of structure that would lead to the conclusion that the residuals are too different from white noise to be acceptable. (Not all cases are this obvious.) As already pointed out, the residuals of the least-squares fit cannot be expected to be precisely white noise. Because there are M relationships among the parameters of the problem (M equations), and the number of x˜ elements determined is N , there are M − N degrees of freedom in the determination of n˜ and structures are imposed upon them. The failure, for this reason, of n˜ strictly to be white noise, is generally only an issue in practice when M − N becomes small compared to M.24
Basic machinery
52 1
φnn(τ)
0.5 0
−0.5 −50
0 50 t Figure 2.8 Autocorrelations of the estimated residuals in Figs. 2.4 (dashed line) and 2.6 (solid). The latter is indistinguishable, by statistical test, from a delta function at the origin, and so, with this test, the residuals are not distinguishable from white noise.
2.4.2 Weighted and tapered least-squares The least-squares solution (Eqs. (2.95) and (2.96)) was derived by minimizing the objective function (2.89), in which each residual element is given equal weight. An important feature of least-squares is that we can give whatever emphasis we please to minimizing individual equation residuals, for example, by introducing an objective function,
J= Wii−1 n i2 , (2.107) i
where Wii are any numbers desired. The choice Wii = 1, as used above, might be reasonable, but it is an arbitrary one that without further justification does not produce a solution with any special claim to significance. In the least-squares context, we are free to make any other reasonable choice, including demanding that some residuals should be much smaller than others – perhaps just to see if it is possible. A general formalism is obtained by defining a diagonal weight matrix, W = √ diag(Wii ). Dividing each equation by Wii ,
−T/2 −T/2 −T/2 Wii E i j x j + Wii n i = Wii yi , (2.108) i
or E x + n = y E = W−T/2 E,
n = W−T/2 n,
y = W−T/2 y
(2.109)
where we used the fact that the square root of a diagonal matrix is the diagonal matrix of element-by-element square roots. The operation in (2.108) or (2.109) is usually called “row scaling” because it operates on the rows of E (as well as on n, y).
2.4 Least-squares
53
For the new equations (2.109), the objective function, J = n T n = (y − E x)T (y − E x) −1
(2.110)
−1
= n W n = (y − Ex) W (y − Ex), T
T
weights the residuals as desired. If, for some reason, W is non-diagonal, but symmetric and positive-definite, then it has a Cholesky decomposition (see p. 40) and W = WT/2 W1/2 , and (2.109) remains valid more generally. ˜ minimizing (2.110), are The values x˜ , n, x˜ = (E T E )−1 E T y = (ET W−1 E)−1 ET W−1 y, n˜ = WT/2 n = [I − E(ET W−1 E)−1 ET W−1 ]y, Cx x = (ET W−1 E)−1 ET W−1 Rnn W−1 E(ET W−1 E)−1 .
(2.111) (2.112)
Uniform diagonal weights are a special case. The rationale for choosing differing diagonal weights, or a non-diagonal W, is probably not very obvious to the reader. Often one chooses W = Rnn = { n i n j }, that is, the weight matrix is chosen to be the expected second moment matrix of the residuals. Then
n n T = I, and Eq. (2.112) simplifies to −1 Cx x = ET R−1 . nn E
(2.113)
In this special case, the weighting (2.109) has a ready interpretation: The equations (and hence the residuals) are rotated and stretched so that in the new coordinate system of n i , the covariances are all diagonal and the variances are all unity. Under these circumstances, an objective function
n i 2 , J= i
as used in the original form of least-squares (Eq. (2.89)), is a reasonable choice. Example Consider the system x1 + x2 + n 1 = 1 x1 − x2 + n 2 = 2 x1 − 2x2 + n 3 = 4.
Basic machinery
54
Then if n i = 0, n i2 = σ 2 , the least-squares solution is x˜ =[2.0, 0.5]T . Now suppose that ⎫ ⎧ 0.99 0.98⎬ ⎨ 1
n i n j = 0.99 1 0.99 . ⎭ ⎩ 0.98 0.99 4 Then from Eq. (2.112), x˜ = [1.51, −0.48]T . Calculation of the two different solution uncertainties is left to the reader. We emphasize that this choice of W is a very special one and has confused many users of inverse methods. To emphasize again: Least-squares is an approximation procedure in which W is a set of weights wholly at the disposal of the investigator; setting W = Rnn is a special case whose significance is best understood after we examine a different, statistical, estimation procedure. Whether the equations are scaled or not, the previous limitations of the simple least-squares solutions remain. In particular, we still have the problem that the solution may produce elements in x˜ , n˜ whose relative values are not in accord with expected or reasonable behavior, and the solution uncertainty or variances could be unusably large, as the solution is determined, mechanically, and automatically, from combinations such as (ET W−1 E)−1 . Operators like these are neither controllable nor very easy to understand; if any of the matrices are singular, they will not even exist. ˜ Cx x It was long ago recognized that some control over the magnitudes of x˜ , n, could be obtained in the simple least-squares context by modifying the objective function (2.107) to have an additional term: J = nT W−1 n + γ 2 xT x
(2.114)
−1
= (y − Ex) W (y − Ex) + γ x x, T
2 T
(2.115)
in which γ 2 is a positive constant. If the minimum of (2.114) is sought by setting the derivatives with respect to x to zero, then we arrive at the following: x˜ = (ET W−1 E + γ 2 I)−1 ET W−1 y
(2.116)
n˜ = y − E˜x −1
(2.117) −1
−1
−1
−1
−1
Cx x = (E W E + γ I) E W Rnn W E(E W E + γ I) . T
2
T
T
2
(2.118)
By letting γ 2 → 0, the solution to (2.111) and (2.112) is recovered, and if γ 2 → ∞, ˜x2 → 0, n˜ → y; γ 2 is called a “ trade-off parameter,” because it trades the
2.4 Least-squares
55
˜ By varying the size of γ 2 we gain some influence magnitude of x˜ against that of n. over the norm of the residuals relative to that of x˜ . The expected value of x˜ is now,
˜x = [ET W−1 E + γ 2 I]−1 ET W−1 y0 .
(2.119)
If the true solution is believed to be (2.100), then this new solution is biassed. But the variance of x˜ has been reduced, (2.118), by introduction of γ 2 > 0 – that is, the acceptance of a bias reduces the variance, possibly very greatly. Equations (2.116) and (2.117) are sometimes known as the “tapered least-squares” solution, a label whose implication becomes clear later. Cnn , which is not displayed, is readily found by direct computation as in Eq. (2.103). The most basic, and commonly seen, form of this solution assumes that W = Rnn = I, so that x˜ = (ET E + γ 2 I)−1 ET y, −1
(2.120) −1
Cx x = (E E + γ I) E E(E E + γ I) . T
2
T
T
2
(2.121)
A physical motivation for the modified objective function (2.114) is obtained by noticing that a preference for a bounded x is easily produced by adding an equation set, x + n_1 = 0, so that the combined set is

Ex + n = y,    (2.122)
x + n_1 = 0,    (2.123)

or

E_1 x + n_2 = y_2,    E_1 = { E ; γI },    n_2^T = [n^T   γ n_1^T],    y_2^T = [y^T   0^T],    (2.124)

in which γ > 0 expresses a preference for fitting the first or second sets more closely. Then J in Eq. (2.114) becomes the natural objective function to use. A preference that x ≈ x_0 is readily imposed instead, with an obvious change in (2.114) or (2.123).
Note the important points, to be shown later, that the matrix inverses in Eqs. (2.116) and (2.117) will always exist, as long as γ^2 > 0, and that the expressions remain valid even if M < N. Tapered least-squares produces some control over the sum of squares of the relative norms of x̃, ñ, but still does not produce control over the individual elements x̃_i.
To gain some of that control, we can further generalize the objective function by introducing another non-singular N × N weight matrix, S (which is usually
symmetric), and

J = n^T W^{-1} n + x^T S^{-1} x    (2.125)
  = (y − Ex)^T W^{-1} (y − Ex) + x^T S^{-1} x,    (2.126)

for which Eq. (2.114) is a special case. Setting the derivatives with respect to x to zero results in the following:

x̃ = (E^T W^{-1} E + S^{-1})^{-1} E^T W^{-1} y,    (2.127)
ñ = y − E x̃,    (2.128)
C_xx = (E^T W^{-1} E + S^{-1})^{-1} E^T W^{-1} R_nn W^{-1} E (E^T W^{-1} E + S^{-1})^{-1},    (2.129)
(2.129)
which are identical to Eqs. (2.116)–(2.118) with S^{-1} = γ^2 I. C_xx simplifies if R_nn = W. Suppose that S, W are positive definite and symmetric and thus have Cholesky decompositions. Then, employing both matrices directly on the equations Ex + n = y,

W^{-T/2} E S^{T/2} S^{-T/2} x + W^{-T/2} n = W^{-T/2} y,    (2.130)

or

E′ x′ + n′ = y′,    (2.131)
E′ = W^{-T/2} E S^{T/2},   x′ = S^{-T/2} x,   n′ = W^{-T/2} n,   y′ = W^{-T/2} y.    (2.132)
The use of S in this way is called "column scaling" because it weights the columns of E. With Eqs. (2.131) the obvious objective function is

J = n′^T n′ + x′^T x′,    (2.133)
which is identical to Eq. (2.125) in the original variables, and the solution must be that in Eqs. (2.127)–(2.129). Like W, one is completely free to choose S as one pleases. A common example is to write, where F is N × N,

S = F^T F,
F = γ^2 { 1   −1    0   · · ·   0
          0    1   −1   · · ·   0
          ·         ·    ·      ·
          0   · · ·          0   1 },    (2.134)
(2.134)
The effect of (2.134) is to minimize a term γ^2 Σ_i (x_i − x_{i+1})^2, which can be regarded as a "smoothest" solution, and, using γ^2, to trade smoothness against the size of ‖ñ‖^2. F is obtained from the Cholesky decomposition of S.
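As an illustration (Python/NumPy, with γ set to one and an arbitrary random x, values chosen only for the example), the following sketch builds a first-difference matrix of the kind indicated in (2.134) and verifies that the associated quadratic form penalizes roughness; the lone 1 in the last row adds an x_N^2 term that keeps F, and hence S, non-singular.

import numpy as np

Nx = 5
# upper-bidiagonal first-difference matrix in the spirit of (2.134), gamma = 1;
# rows 1..N-1 give x_i - x_{i+1}, the last row leaves x_N alone
F = np.eye(Nx) - np.diag(np.ones(Nx - 1), k=1)

x = np.random.randn(Nx)
q = x @ (F.T @ F) @ x
print(np.isclose(q, np.sum((x[:-1] - x[1:])**2) + x[-1]**2))   # True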
By invoking the matrix inversion lemma, an alternate form for Eqs. (2.127)–(2.129) is found:

x̃ = S E^T (E S E^T + W)^{-1} y,    (2.135)
ñ = y − E x̃,    (2.136)
C_xx = S E^T (E S E^T + W)^{-1} R_nn (E S E^T + W)^{-1} E S.    (2.137)
(2.137)
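The equivalence of the two forms is easily checked numerically. The following illustrative sketch (Python/NumPy, arbitrary random E, y and positive definite S, W, none of them drawn from any particular application) evaluates both Eq. (2.127) and Eq. (2.135) and confirms that they agree; the practical choice between them rests on whether the N × N or the M × M matrix is cheaper to invert.

import numpy as np

np.random.seed(1)
M, N = 5, 8                                  # an underdetermined example
E = np.random.randn(M, N)
y = np.random.randn(M)

# arbitrary symmetric positive definite weight matrices S (N x N) and W (M x M)
A = np.random.randn(N, N); S = A @ A.T + N * np.eye(N)
B = np.random.randn(M, M); W = B @ B.T + M * np.eye(M)

# Eq. (2.127): requires an N x N inverse
x1 = np.linalg.solve(E.T @ np.linalg.solve(W, E) + np.linalg.inv(S),
                     E.T @ np.linalg.solve(W, y))
# Eq. (2.135): requires an M x M inverse
x2 = S @ E.T @ np.linalg.solve(E @ S @ E.T + W, y)

print(np.allclose(x1, x2))                   # True: the two forms agree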
A choice of which form to use is sometimes made on the basis of the dimensions of the matrices being inverted. Note again that W = R_nn is a special case.
So far, all of this is conventional. But we have made a special point of displaying explicitly not only the elements x̃, but those of the residuals, ñ. Notice that although we have considered only the formally over determined system, M > N, we always determine not only the N elements of x̃, but also the M elements of ñ, for a total of M + N values – extracted from the M equations. It is apparent that any change in any element ñ_i forces changes in x̃. In this view, to which we adhere, systems of equations involving observations always contain more unknowns than equations. Another way to make the point is to re-write Eq. (2.87) without distinction between x, n as

E_1 ξ = y,    (2.138)
E_1 = {E,  I_M},    ξ = [x^T, n^T]^T.    (2.139)

A combined weight matrix,

S_1 = { S   0
        0   W },    (2.140)
(2.140)
would be used, and any distinction between the x, n solution elements is suppressed. Equation (2.138) describes a formally underdetermined system, derived from the formally over determined observed one. This identity leads us to the problem of formal underdetermination in the next section.
In general with least-squares problems, the solution sought can be regarded as any of the following equivalents:
1. The x̃, ñ satisfying Ex + n = y.    (2.141)
2. x̃, ñ satisfying the normal equations arising from J (2.125).
3. x̃, ñ producing the minimum of J in (2.125).
The point of this list lies with item (3): algorithms exist to find minima of functions by deterministic methods (“go downhill” from an initial guess),25 or
stochastic search methods (Monte Carlo) or even, conceivably, through a shrewd guess by the investigator. If an acceptable minimum of J is found, by whatever means, it is an acceptable solution (subject to further testing, and the possibility that there is more than one such solution). Search methods become essential for the non-linear problems taken up later.
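As a concrete illustration of item (3), here is a minimal Python/NumPy sketch (with arbitrary, made-up E and y, and W = S = I for simplicity) that finds the minimum of J both from the closed-form normal equations and by a crude "go downhill" gradient descent; any method that reaches the minimum produces the same acceptable solution.

import numpy as np

np.random.seed(2)
M, N = 10, 3
E = np.random.randn(M, N)
y = np.random.randn(M)

def J(x):                        # objective: residual norm plus solution norm (W = S = I)
    n = y - E @ x
    return n @ n + x @ x

# closed-form minimizer from the normal equations
x_closed = np.linalg.solve(E.T @ E + np.eye(N), E.T @ y)

# crude steepest-descent "go downhill" search starting from x = 0
x = np.zeros(N)
step = 0.01
for _ in range(5000):
    grad = -2 * E.T @ (y - E @ x) + 2 * x    # gradient of J
    x = x - step * grad

print(J(x_closed), J(x))                     # nearly identical minima
print(np.linalg.norm(x - x_closed))          # small: same solution found by descent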
2.4.3 Underdetermined systems and Lagrange multipliers
What does one do when the number, M, of equations is less than the number, N, of unknowns and no more observations are possible? We have seen that the claim that a problem involving observations is ever over determined is misleading – because each equation or observation always has a noise unknown – but to motivate some of what follows, it is helpful to first pursue the conventional approach. One often attempts when M < N to reduce the number of unknowns so that the formal overdeterminism is restored. Such a parameter reduction procedure may be sensible, but there are pitfalls. For example, let p_i(t), i = 0, 1, . . . , be a set of polynomials, e.g., Chebyshev or Laguerre, etc. Consider data produced from the formula

y(t) = 1 + a_M p_M(t) + n(t),    (2.142)

which might be deduced by fitting a parameter set [a_0, . . . , a_M] and finding ã_M. If there are fewer than M observations, an attempt to fit with fewer parameters,

y = Σ_{j=0}^{Q} a_j p_j(t),    Q < M,    (2.143)
may give a good, even perfect fit; but it would be incorrect. The reduction in model parameters in such a case biasses the result, perhaps hopelessly so. One is better off retaining the underdetermined system and making inferences concerning the possible values of a_i rather than using the form (2.143), in which any possibility of learning something about a_M has been eliminated.
Example Consider a tracer problem, not unlike those encountered in medicine, hydrology, oceanography, etc. A box (Fig. 1.2) is observed to contain a steady tracer concentration C_0, and is believed to be fed at the rates J_1, J_2 from two reservoirs each with tracer concentration of C_1, C_2 respectively. One seeks to determine J_1, J_2. Tracer balance is

J_1 C_1 + J_2 C_2 − J_0 C_0 = 0,    (2.144)
where J_0 is the rate at which fluid is removed. Mass balance requires that

J_1 + J_2 − J_0 = 0.    (2.145)
Evidently, there are but two equations in three unknowns (and a perfectly good solution would be J_1 = J_2 = J_0 = 0); but, as many have noticed, we can nonetheless determine the relative fraction of the fluid coming from each reservoir. Divide both equations through by J_0:

(J_1/J_0) C_1 + (J_2/J_0) C_2 = C_0,
J_1/J_0 + J_2/J_0 = 1,

producing two equations in two unknowns, J_1/J_0, J_2/J_0, which has a unique stable solution (noise is being ignored). Many examples can be given of such calculations in the literature – determining the flux ratios – apparently definitively. But suppose the investigator is suspicious that there might be a third reservoir with tracer concentration C_3. Then the equations become

(J_1/J_0) C_1 + (J_2/J_0) C_2 + (J_3/J_0) C_3 = C_0,
J_1/J_0 + J_2/J_0 + J_3/J_0 = 1,

which are now underdetermined with two equations in three unknowns. If it is obvious that no such third reservoir exists, then the reduction to two equations in two unknowns is the right thing to do. But if there is even a suspicion of a third reservoir (or more), one should solve these equations with one of the methods we will develop – permitting construction and understanding of all possible solutions.
In general terms, parameter reduction can lead to model errors, that is, bias errors, which can produce wholly illusory results.26 A common situation, particularly in problems involving tracer movements in groundwater, ocean, or atmosphere, is fitting a one- or two-dimensional model to data that represent a fully three-dimensional field. The result may be apparently pleasing, but possibly completely erroneous.
A general approach to solving underdetermined problems is to render the answer apparently unique by minimizing an objective function, subject to satisfaction of the linear constraints. To see how this can work, suppose that Ax = b, exactly and formally underdetermined, M < N, and seek the solution that exactly satisfies the equations and simultaneously renders an objective function, J = x^T x, as small as
possible. Direct minimization of J leads to

dJ = dx^T ∂J/∂x = 2x^T dx = 0,    (2.146)
but, unlike the case in Eq. (2.91), the coefficients of the individual dx_i can no longer be separately set to zero (i.e., x = 0 is an incorrect solution) because the dx_i no longer vary independently, but are restricted to values satisfying Ax = b. One approach is to use the known dependencies to reduce the problem to a new one in which the differentials are independent. For example, suppose that there are general functional relationships

[x_1, . . . , x_M]^T = [ξ_1(x_{M+1}, . . . , x_N), . . . , ξ_M(x_{M+1}, . . . , x_N)]^T.

Then the first M elements of the x_i may be eliminated, and the objective function becomes

J = [ξ_1(x_{M+1}, . . . , x_N)^2 + · · · + ξ_M(x_{M+1}, . . . , x_N)^2] + [x_{M+1}^2 + · · · + x_N^2],

in which the remaining x_i, i = M + 1, . . . , N, are independently varying. In the present case, one can choose (arbitrarily) the first M unknowns, q = [x_i], i = 1, . . . , M, and define the last N − M unknowns r = [x_i], i = M + 1, . . . , N, and rewrite the equations as

{A_1   A_2} [q ; r] = b,    (2.147)

where A_1 is M × M, A_2 is M × (N − M). Then, solving the first set for q,

q = A_1^{-1}(b − A_2 r).    (2.148)
q can be eliminated from J leaving an unconstrained minimization problem in the independent variables, r. If A−1 1 does not exist, one can try any other subset of the xi to eliminate until a suitable group is found. This approach is completely correct, but finding an explicit solution for L elements of x in terms of the remaining ones may be difficult or inconvenient. Example Solve x1 − x2 + x3 = 1, for the solution of minimum norm. The objective function is J = x12 + x22 + x32 . With one equation, one variable can be eliminated. Arbitrarily choosing x1 = 1 +
x_2 − x_3, J = (1 + x_2 − x_3)^2 + x_2^2 + x_3^2. x_2, x_3 are now independent variables, and the corresponding derivatives of J can be independently set to zero.
Example A somewhat more interesting example involves two equations in three unknowns: x_1 + x_2 + x_3 = 1, x_1 − x_2 + x_3 = 2, and we choose to find a solution minimizing J = x_1^2 + x_2^2 + x_3^2. Solving for the two unknowns x_1, x_2 from x_1 + x_2 = 1 − x_3, x_1 − x_2 = 2 − x_3 produces x_2 = −1/2, x_1 = 3/2 − x_3, and then J = (3/2 − x_3)^2 + 1/4 + x_3^2, whose minimum with respect to x_3 (the only remaining variable) is x_3 = 3/4, and the full solution is

x_1 = 3/4,   x_2 = −1/2,   x_3 = 3/4.
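The same answer can be checked numerically. In the following illustrative Python/NumPy sketch (nothing here beyond the numbers of the example just worked), the minimum-norm solution of this underdetermined pair is computed with the pseudoinverse and compared with the elimination result above.

import numpy as np

A = np.array([[1., 1., 1.],
              [1., -1., 1.]])
b = np.array([1., 2.])

# elimination result derived in the text
x_elim = np.array([3/4, -1/2, 3/4])

# minimum-norm solution via the pseudoinverse (equivalently A^T (A A^T)^{-1} b)
x_min = np.linalg.pinv(A) @ b

print(x_min)                         # [ 0.75 -0.5   0.75]
print(np.allclose(x_min, x_elim))    # True
print(np.allclose(A @ x_min, b))     # the constraints are satisfied exactly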
Lagrange multipliers and adjoints
When it is inconvenient to find such an explicit representation by eliminating some variables in favor of others, a standard procedure for finding the constrained minimum is to introduce a new vector "Lagrange multiplier," μ, of M unknown elements, to make a new objective function

J′ = J − 2μ^T (Ax − b)    (2.149)
   = x^T x − 2μ^T (Ax − b),
and ask for its stationary point – treating both μ and x as independently varying unknowns. The numerical 2 is introduced solely for notational tidiness. The rationale for this procedure is straightforward.27 Consider first a very simple example of one equation in two unknowns,

x_1 − x_2 = 1.    (2.150)
We seek the minimum norm solution,

J = x_1^2 + x_2^2,    (2.151)

subject to Eq. (2.150). The differential

dJ = 2x_1 dx_1 + 2x_2 dx_2 = 0    (2.152)

leads to the unacceptable solution x_1 = x_2 = 0 if we incorrectly set the coefficients of dx_1, dx_2 to zero. Consider instead a modified objective function

J′ = J − 2μ (x_1 − x_2 − 1),    (2.153)

where μ is unknown. The differential of J′ is

dJ′ = 2x_1 dx_1 + 2x_2 dx_2 − 2μ (dx_1 − dx_2) − 2 (x_1 − x_2 − 1) dμ = 0,    (2.154)

or

dJ′/2 = dx_1 (x_1 − μ) + dx_2 (x_2 + μ) − dμ (x_1 − x_2 − 1) = 0.    (2.155)

We are free to choose x_1 = μ, which kills off the differential involving dx_1. But then only the differentials dx_2, dμ remain; as they can vary independently, their coefficients must vanish separately, and we have

x_2 = −μ,    (2.156)
x_1 − x_2 = 1.    (2.157)
Note that the second of these recovers the original equation. Substituting x_1 = μ, we have 2μ = 1, or μ = 1/2, and x_1 = 1/2, x_2 = −1/2, J = 0.5, and one can confirm that this is indeed the "constrained" minimum. (A "stationary" value of J′ was found, not an absolute minimum value, because J′ is no longer necessarily positive; it has a saddle point, which we have found.)
Before writing out the general case, note the following question: Suppose the constraint equation was changed to

x_1 − x_2 = Δ.    (2.158)

How much would J change as Δ is varied? With variable Δ, (2.154) becomes

dJ′ = 2dx_1 (x_1 − μ) + 2dx_2 (x_2 + μ) − 2dμ (x_1 − x_2 − Δ) + 2μ dΔ.    (2.159)

But the first three terms on the right vanish, and hence

∂J′/∂Δ = ∂J/∂Δ = 2μ,    (2.160)

because J′ = J at the stationary point (from (2.158)). Thus 2μ is the sensitivity of the objective function J to perturbations in the right-hand side of the
constraint equation. If Δ is changed from 1 to 1.2, it can be confirmed that the approximate change in the value of J is 0.2, as one deduces immediately from Eq. (2.160). Keep in mind, however, that this sensitivity corresponds to infinitesimal perturbations.
We now develop this method generally. Reverting to Eq. (2.149),

dJ′ = dJ − 2μ^T A dx − 2(Ax − b)^T dμ
    = (∂J/∂x_1 − 2μ^T a_1) dx_1 + (∂J/∂x_2 − 2μ^T a_2) dx_2 + · · · + (∂J/∂x_N − 2μ^T a_N) dx_N − 2(Ax − b)^T dμ    (2.161)
    = (2x_1 − 2μ^T a_1) dx_1 + (2x_2 − 2μ^T a_2) dx_2 + · · · + (2x_N − 2μ^T a_N) dx_N − 2(Ax − b)^T dμ = 0.    (2.162)
Here the a_i are the corresponding columns of A. The coefficients of the first M differentials dx_i can be set to zero by assigning x_i = μ^T a_i, leaving N − M differentials dx_i whose coefficients must separately vanish (hence they all vanish, but for two separate reasons), plus the coefficients of the M dμ_i, which must also vanish separately. This recipe produces, from Eq. (2.162),

(1/2) ∂J′/∂x = x − A^T μ = 0,    (2.163)
−(1/2) ∂J′/∂μ = Ax − b = 0,    (2.164)
where the first equation set is the result of the vanishing of the coefficients of the dx_i, and the second, which is the original set of equations, arises from the vanishing of the coefficients of the dμ_i. The convenience of being able to treat all the x_i as independently varying is offset by the increase in problem dimensions by the introduction of the M unknown μ_i. The first set is N equations for μ in terms of x, and the second set is M equations in x in terms of b. Taken together, these are M + N equations in M + N unknowns, and hence just-determined no matter what the ratio of M to N. Equation (2.163) is

A^T μ = x,    (2.165)

and, substituting for x into (2.164),

AA^T μ = b,    μ̃ = (AA^T)^{-1} b,    (2.166)
assuming the inverse exists, and that

x̃ = A^T (AA^T)^{-1} b,    (2.167)
ñ = 0,    (2.168)
C_xx = 0.    (2.169)
(C_xx = 0 because formally ñ = 0.) Equations (2.167)–(2.169) are the classical solution, satisfying the constraints exactly while minimizing the solution length. That a minimum is achieved can be verified by evaluating the second derivatives of J at the solution point. The minimum occurs at a saddle point in x, μ space,28 and where the term proportional to μ necessarily vanishes. The operator A^T (AA^T)^{-1} is sometimes called a "Moore–Penrose inverse." Equation (2.165) for μ in terms of x involves the coefficient matrix A^T. An intimate connection exists between matrix transposes and adjoints of differential equations (see the appendix to this chapter), and thus μ is sometimes called the "adjoint solution," with A^T defining the "adjoint model"29 in Eq. (2.165), and x acting as a forcing term. The original equations Ax = b were assumed formally underdetermined, and thus the adjoint model equations in (2.165) are necessarily formally over determined.
Example The last example, now using matrix–vector notation, is
A = { 1   1   1
      1  −1   1 },    b = [1, 2]^T,

J′ = x^T x − 2μ^T (Ax − b),
d/dx [x^T x − 2μ^T (Ax − b)] = 2x − 2A^T μ = 0,
Ax = b,
x = A^T (AA^T)^{-1} b = [3/4, −1/2, 3/4]^T.

Example Write out J′:

J′ = x_1^2 + x_2^2 + x_3^2 − 2μ_1 (x_1 + x_2 + x_3 − 1) − 2μ_2 (x_1 − x_2 + x_3 − 2),
dJ′ = (2x_1 − 2μ_1 − 2μ_2) dx_1 + (2x_2 − 2μ_1 + 2μ_2) dx_2 + (2x_3 − 2μ_1 − 2μ_2) dx_3 + (−2x_1 − 2x_2 + 2 − 2x_3) dμ_1 + (−2x_1 + 2x_2 − 2x_3 + 4) dμ_2 = 0.
Set x_1 = μ_1 + μ_2, x_2 = μ_1 − μ_2 so that the first two terms vanish, and set the coefficients of the differentials of the remaining, independent terms to zero:

dJ′/dx_1 = 2x_1 − 2μ_1 − 2μ_2 = 0,
dJ′/dx_2 = 2x_2 − 2μ_1 + 2μ_2 = 0,
dJ′/dx_3 = 2x_3 − 2μ_1 − 2μ_2 = 0,
dJ′/dμ_1 = −2x_1 − 2x_2 + 2 − 2x_3 = 0,
dJ′/dμ_2 = −2x_1 + 2x_2 − 2x_3 + 4 = 0.

Then,

dJ′ = (2x_3 − 2μ_1 − 2μ_2) dx_3 + (−2x_1 − 2x_2 + 2 − 2x_3) dμ_1 + (−2x_1 + 2x_2 − 2x_3 + 4) dμ_2 = 0,

or

x_1 = μ_1 + μ_2,
x_2 = μ_1 − μ_2,
x_3 − μ_1 − μ_2 = 0,
−x_1 − x_2 + 1 − x_3 = 0,
−x_1 + x_2 − x_3 + 2 = 0.

That is,

x = A^T μ,
Ax = b,

or

{ I    −A^T } { x }   { 0 }
{ A      0  } { μ } = { b }.
In this particular case, the first set can be solved for x = A^T μ:

μ = (AA^T)^{-1} b = [1/8,  5/8]^T,
x = A^T [1/8,  5/8]^T = [3/4,  −1/2,  3/4]^T.
66
Example Suppose, instead, we wanted to minimize J = (x1 − x2 )2 + (x2 − x3 )2 = xT FT Fx, where
1 F= 0 FT F =
1 0
−1 1
T
0 −1
1 0
−1 1 −1 1
0 −1
⎧ ⎨1 0 = −1 −1 ⎩ 0
−1 2 −1
⎫ 0⎬ −1 . ⎭ 1
Such an objective function might be used to find a “smooth” solution. One confirms that ⎫⎡ ⎤ ⎧ $ % ⎨ 1 −1 0 ⎬ x1 x1 x2 x3 −1 2 −1 ⎣ x2 ⎦ = x12 − 2x1 x2 + 2x22 − 2x2 x3 + x32 ⎭ ⎩ 0 −1 1 x3 = (x1 − x2 )2 + (x2 − x3 )2 . The stationary point of J = xT FT Fx − 2μT (Ax − b) leads to FT Fx = AT μ Ax = b. But, x = (FT F)−1 AT μ, because there is no inverse (guaranteed). But the coupled set T F F −AT x 0 = A 0 μ b does have a solution. The physical interpretation of μ can be obtained as above by considering the way in which J would vary with infinitesimal changes in b. As in the special case above, J = J at the stationary point. Hence, dJ = dJ − 2μT Adx − 2 (Ax − b)T dμ + 2μT db = 0,
(2.170)
2.4 Least-squares
67
or, since the first three terms on the right vanish at the stationary point, ∂J ∂ J = = 2μ. (2.171) ∂b ∂b Thus, as inferred previously, the Lagrange multipliers are the sensitivity of J, at the stationary point, to perturbations in the parameters b. This conclusion leads, in Chapter 4, to the scrutiny of the Lagrange multipliers as a means of understanding the sensitivity of models and the flow of information within them. Now revert to Ex + n = y, that is, equations containing noise. If these are first column scaled using S−T/2 , Eqs. (2.167)–(2.169) are in the primed variables, and the solution in the original variables is x˜ = SET (ESET )−1 y,
(2.172)
n˜ = 0,
(2.173)
Cx x = 0,
(2.174)
and the result depends directly upon S. If a row scaling with W−T/2 is used, it is readily shown that W disappears from the solution and has no effect on it (see p. 107). Equations (2.172)–(2.174) are a solution, but there is the same fatal defect as in Eq. (2.173) – n˜ = 0 is usually unacceptable when y are observations. Furthermore, ˜x is again uncontrolled, and ESET may not have an inverse. Noise vector n must be regarded as fully an element of the solution, as much as x. Equations representing observations can always be written as in (2.138), and can be solved exactly. Therefore, we now use a modified objective function, allowing for general S, W, J = xT S−1 x + nT W−1 n − 2μT (Ex + n − y),
(2.175)
with both x, n appearing in the objective function. Setting the derivatives of (2.175) with respect to x, n, μ to zero, and solving the resulting normal equations produces the following: x˜ = SET (ESET + W)−1 y,
(2.176)
n˜ = y − E˜x,
(2.177) −1
−1
Cx x = SE (ESE + I) Rnn (ESE + I) ES, T
μ˜ = W
−1
T
n, ˜
T
(2.178) (2.179)
Equations (2.176)–(2.179) are identical to Eqs. (2.135)–(2.137) or to the alternate form Eqs. (2.127) − (2.129) derived from an objective function without Lagrange multipliers.
68
Basic machinery
Equations (2.135)–(2.137) and (2.176)–(2.178) result from two very different appearing objective functions – one in which the equations are imposed in the meansquare, and one in which they are imposed exactly, using Lagrange multipliers. Constraints in the mean-square will be termed “soft,” and those imposed exactly are “hard.”30 The distinction is, however, largely illusory: although (2.87) are being imposed exactly, it is only the presence of the error term, n, which permits the equations to be written as equalities and thus as hard constraints. The hard and soft constraints here produce an identical solution. In some (rare) circumstances, which we will discuss briefly below, one may wish to impose exact constraints upon the elements of x˜ i . The solution in (2.167)–(2.169) was derived from the noise-free hard constraint, Ax = b, but we ended by rejecting it as generally inapplicable. Once again, n is only by convention discussed separately from x, and is fully a part of the solution. The combined form (2.138), which literally treats x, n as the solution, are imposed through a hard constraint on the objective function, J = ξ T ξ − 2μT (E1 ξ − y),
(2.180)
where ξ = [(S−T/2 x)T , (W−T/2 n)T ]T , which is Eq. (2.175). (There are numerical advantages, however, in working with objects in two spaces of dimensions M and N , rather than a single space of dimension M + N .) 2.4.4 Interpretation of discrete adjoints When the operators are matrices, as they are in discrete formulations, then the adjoint is just the transposed matrix. Sometimes the adjoint has a simple physical interpretation. Suppose, e.g., that scalar y was calculated from a sum, * + y = Ax, A = 1 1 . 1 1 . (2.181) Then the adjoint operator applied to y is evidently * +T r = AT y = 1 1 1 . 1 y = x.
(2.182)
Thus the adjoint operator “sprays” the average back out onto the originating vector, and might be thought of as an inverse operator. A more interesting case is a first-difference forward operator, ⎫ ⎧ −1 1 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −1 1 ⎪ ⎪ ⎪ ⎪ ⎬ ⎨ −1 1 , (2.183) A= . . . ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ −1 1 ⎪ ⎪ ⎪ ⎭ ⎩ −1
2.5 The singular vector expansion
69
that is, yi = xi+1 − xi , (with the exception of the last element, y N = −x N ). Then its adjoint is ⎧ −1 ⎪ ⎪ ⎪ ⎪ 1 −1 ⎪ ⎪ ⎨ 1 −1 AT = . . ⎪ ⎪ ⎪ ⎪ ⎪ 1 −1 ⎪ ⎩ 1
(2.184)
⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬
−1
⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭
,
(2.185)
which is a first-difference backward operator with z = AT y, producing z i = yi−1 − yi , again with the exception of the end point, which is now z 1 . In general, the transpose matrix, or adjoint operator is not simply interpretable as an inverse operation as the summation/spray-out case might have suggested.31 A more general understanding of the relationship between adjoints and inverses will be obtained in the next section.
2.5 The singular vector expansion Least-squares is a very powerful, very useful method for finding solutions of linear simultaneous equations of any dimensionality and one might wonder why it is necessary to discuss any other form of solution. But even in the simplest form of least-squares, the solution is dependent upon the inverses of ET E, or EET . In practice, their existence cannot be guaranteed, and we need to understand what that means, the extent to which solutions can be found when the inverses do not exist, and the effect of introducing weight matrices W, S. This problem is intimately related to the issue of controlling solution and residual norms. Second, the relationship between the equations and the solutions is somewhat impenetrable, in the sense that structures in the solutions are not easily relatable to particular elements of the data yi . For many purposes, particularly physical insight, understanding the structure of the solution is essential. We will return to examine the least-squares solutions using some extra machinery.
2.5.1 Simple vector expansions Consider again the elementary problem (2.1) of representing an L-dimensional vector f as a sum of a basis of L-orthonormal vectors gi , i = 1, 2, . . . , L, giT g j = δ i j .
Basic machinery
70
Without error, f=
L
ajgj,
a j = gTj f.
(2.186)
j=1
But if for some reason only the first K coefficients a j are known, we can only approximate f by its first K terms: ˜f =
K
bjgj
j=1
(2.187)
= f + δf1 , and there is an error, δf1 . From the orthogonality of the gi , it follows that δf1 will have minimum l2 norm only if it is orthogonal to the K vectors retained in the approximation, and then only if b j = a j as given by (2.186). The only way the error could be reduced further is by increasing K . Define an L × K matrix, G K , whose columns are the first K of the g j . Then b = a = GTK f is the vector of coefficients a j = gTj f, j = 1, 2, . . . , K , and the finite representation (2.187) is (one should write it out) ˜f = G K a = G K GT f = G K GT f, a = {ai }, (2.188) K K where the third equality follows from the associative properties of matrix multiplication. This expression shows that a representation of a vector in an incomplete orthonormal set produces a resulting approximation that is a simple linear combination of the elements of the correct values (i.e., a weighted average, or “filtered” version of them). Column i of G K GTK produces the weighted linear combination of the true elements of f that will appear as f˜ i . Because the columns of G K are orthonormal, GTK G K = I K , that is, the K × K identity matrix; but G K GTK = I L unless K = L . (That G L GTL = I L for K = L follows from the theorem for square matrices that shows a left inverse is also a right inverse.) If K < L, G K is “semi-orthogonal.” If K = L, it is “orthogonal”; in T T this case, G−1 L = G L . If it is only semi-orthogonal, G K is a left inverse, but not a right inverse. Any orthogonal matrix has the property that its transpose is identical to its inverse. The matrix G K GTK is known as a “resolution matrix,” with a simple interpretation. Suppose the true value of f were f j0 = [0 0 0 . . . 0 1 0 . 0 . . 0]T , that is, a Kronecker delta δ j j0 , with unity in element j0 and zero otherwise. Then the incomplete expansion (2.187) or (2.188) would not reproduce the delta function, but ˜f j0 = G K GT f j0 , K
(2.189)
2.5 The singular vector expansion
71
1
(a)
0.5 0 −0.5
1
2
3
4
5
6
7
8
9
10
1
(b)
0.5 0 −0.5
1
2
3
4
5 6 7 8 Element number
9
10
Figure 2.9 (a) Example of a row, j0 , of a 10×10 resolution matrix, perhaps the fourth one, showing widely distributed averaging in forming f j0 . (b) The so-called compact resolution, in which the solution is readily interpreted as representing a local average of the true solution. Such situations are not common.
which is column j0 of G K GTK . Each column (or row) of the resolution matrix tells one what the corresponding form of the approximating vector would be, if its true form were a Kronecker delta. To form a Kronecker delta requires a spanning set of vectors. An analogous elementary result of Fourier analysis shows that a Dirac delta function demands contributions from all frequencies to represent a narrow, very high pulse. Removal of some of the requisite vectors (sinusoids) produces peak broadening and sidelobes. Here, depending upon the precise structure of the gi , the broadening and sidelobes can be complicated. If one is lucky, the effect could be a simple broadening (schematically shown in Fig. 2.9) without distant sidelobes), leading to the tidy interpretation of the result as a local average of the true values, called “compact resolution.”32 A resolution matrix has the property trace G K GTK = K , which follows from noting that trace G K GTK = trace GTK G K = trace(I K ) = K .
(2.190)
72
Basic machinery
2.5.2 Square-symmetric problem: eigenvalues/eigenvectors Orthogonal vector expansions are particularly simple to use and interpret, but might seem irrelevant when dealing with simultaneous equations where neither the row nor column vectors of the coefficient matrix are so simply related. What we will show, however, is that we can always find sets of orthonormal vectors to greatly simplify the job of solving simultaneous equations. To do so, we digress to recall the basic elements of the “eigenvector/eigenvalue problem” mentioned in passing on p. 26. Consider a square, M × M matrix E and the simultaneous equations Egi = λi gi ,
i = 1, 2, . . . , M,
(2.191)
that is, the problem of finding a set of vectors gi whose dot products with the rows of E are proportional to themselves. Such vectors are “eigenvectors,” and the constants of proportionality are the “eigenvalues.” Under special circumstances, the eigenvectors form an orthonormal spanning set: Textbooks show that if E is square and symmetric, such a result is guaranteed. It is easy to see that if two λ j , λk are distinct, then the corresponding eigenvectors are orthogonal: Eg j = λ j g j ,
(2.192)
Egk = λk gk .
(2.193)
Left multiply the first of these by gTk , and the second by gTj , and subtract: gTk Eg j − gTj Egk = (λ j − λk )gTk g j .
(2.194)
But because E = ET , the left-hand side vanishes, and hence gTk g j by the assumption λ j = λk . A similar construction proves that the λi are all real, and an elaboration shows that for coincident λi , the corresponding eigenvectors can always be made orthogonal. Example To contrast with the above matrix ⎧ ⎨1 0 ⎩ 0
result, consider the non-symmetric, square 2 1 0
⎫ 3⎬ 4 . ⎭ 1
Solution to the eigenvector/eigenvalue problem produces λi = 1, and ui = [1, 0, 0]T , i = 1, 2, 3. The eigenvectors are not orthogonal, and are certainly not a spanning set. On the other hand, the eigenvector/eigenvalues of ⎫ ⎧ ⎨ 1 −1 −2 ⎬ −1 2 −1 ⎭ ⎩ 1.5 2 −2.5
2.5 The singular vector expansion
are
73
⎤ ⎤ ⎤ ⎡ ⎡ −0.29 + 0.47i −0.29 − 0.47i −0.72 g1 = ⎣ −0.17 + 0.25i ⎦ , g2 = ⎣ −0.17 − 0.25i ⎦ , g3 = ⎣ 0.90 ⎦ , 0.19 + 0.61i 0.19 − 0.61i 0.14 ⎡
λ j = [−1.07 + 1.74i, −1.07 − 1.74 i, 2.64] and (rounded) are not orthogonal, but are a basis. The eigenvalues/eigenvectors appear in complex conjugate pairs and in some contexts are called “principal oscillation patterns” (POPs). Suppose for the moment that we have the square, symmetric, special case, and recall how eigenvectors can be used to solve (2.16). By convention, the pairs (λi , gi ) are ordered in the sense of decreasing λi . If some λi are repeated, an arbitrary order choice is made. With an orthonormal, spanning set, both the known y and the unknown x can be written as x=
M
α i gi ,
α i = giT x,
(2.195)
β i gi ,
β i = giT y.
(2.196)
i=1
y=
M
i=1
By convention, y is known, and therefore β i can be regarded as given. If the α i could be found, x would be known. Substitute (2.195) into (2.16), to give E
M
α i gi =
i=1
M
giT y gi ,
(2.197)
i=1
or, using the eigenvector property, M
α i λi gi =
M
giT y gi .
(2.198)
i
i=1
But the expansion vectors are orthonormal and so λi α i = giT y, αi = x=
(2.199)
giT y , λi M
gT y i
i=1
λi
(2.200) gi .
(2.201)
Basic machinery
74
Apart from the obvious difficulty if an eigenvalue vanishes, the problem is now completely solved. Define a diagonal matrix, Λ, with elements, λi , in descending numerical value, and the matrix G, whose columns are the corresponding gi in the same order, the solution to (2.16) can be written, from (2.195), (2.199)–(2.201) as α = Λ−1 GT y, −1
x = GΛ G y, T
(2.202) (2.203)
where Λ−1 = diag(1/λi ). Vanishing eigenvalues, i = i 0 , cause trouble and must be considered. Let the corresponding eigenvectors be gi0 . Then any part of the solution which is proportional to such an eigenvector is “annihilated” by E, that is, gi0 is orthogonal to all the rows of E. Such a result means that there is no possibility that anything in y could provide any information about the coefficient α i0 . If y corresponds to a set of observations (data), then E represents the connection (“mapping”) between system unknowns and observations. The existence of zero eigenvalues shows that the act of observation of x removes certain structures in the solution which are then indeterminate. Vectors gi0 (and there may be many of them) are said to lie in the “nullspace” of E. Eigenvectors corresponding to non-zero eigenvalues lie in its “range.” The simplest example is given by the “observations” x1 + x2 = 3, x1 + x2 = 3. Any structure in x such that x1 = −x2 is destroyed by√this observation, and, by inspection, the nullspace vector must be g2 = [1, −1]T / 2. (The purpose of showing the observation twice is to produce an E that is square.) Suppose there are K < M non-zero λi . Then for i > K , Eq. (2.199) is 0α i = giT y,
K + i = 1, 2, . . . , M,
(2.204)
and two cases must be distinguished. Case (1) giT y = 0,
K + i = 1, 2, . . . , M.
(2.205)
We could then put α i = 0, K + i = 1, 2, . . . , M, and the solution can be written x˜ =
K
gT y i
i=1
λi
gi ,
(2.206)
2.5 The singular vector expansion
75
and E˜x = y, exactly. Equation (2.205) is often known as a “solvability condition.” A tilde has been placed over x because a solution of the form x˜ =
K
gT y i
i=1
λi
M
gi +
α i gi ,
(2.207)
i=K +1
with the remaining α i taking on arbitrary values, also satisfies the equations exactly. That is, the true value of x could contain structures proportional to the nullspace vectors of E, but the equations (2.16) neither require their presence, nor provide information necessary to determine their amplitudes. We thus have a situation with a “solution nullspace.” Define the matrix G K to be M × K , carrying only the first K of the gi , that is, the range vectors, Λ K , to be K × K with only the first K , non-zero eigenvalues, and the columns of QG are the M − K nullspace vectors (it is M × (M − K )), then the solutions (2.206) and (2.207) are T x˜ = G K Λ−1 K G K y,
(2.208)
T x˜ = G K Λ−1 K G K y + QG αG ,
(2.209)
where αG is the vector of unknown nullspace coefficients. The solution in (2.208), with no nullspace contribution, will be called the “particular” solution. If y = 0, however, Eq. (2.16) is a homogeneous set of equations, then the nullspace represents the only possible solution. If G is written as a partitioned matrix, G = {G K
QG },
it follows from the column orthonormality that GGT = I = G K GTK + QG QTG ,
(2.210)
QG QTG = I − G K GTK .
(2.211)
or
Vectors QG span the nullspace of G. Case (2) giT y = 0,
i > K + 1,
(2.212)
for one or more of the nullspace vectors. In this case, Eq. (2.199) is the contradiction 0α i = 0,
Basic machinery
76
and Eq. (2.198) is actually K
λi α i gi =
i=1
M
giT y gi ,
K < M,
(2.213)
i=1
that is, with differing upper limits on the sums. Therefore, the solvability condition is not satisfied. Owing to the orthonormality of the gi , there is no choice of α i , i = 1, . . . , K on the left that can match the last M − K terms on the right. Evidently there is no solution in the conventional sense unless (2.205) is satisfied, hence the name “solvability condition.” What is the best we might do? Define “best” to mean that the solution x˜ should be chosen such that E˜x = y˜ , where the difference, n˜ = y − y˜ , which we call the “residual,” should be as small as possible (in the l2 norm). If this choice is made, then the orthogonality of the gi shows immediately that the best choice is still (2.200), i = 1, 2, . . . , K . No choice of nullspace vector coefficients, nor any other value of the coefficients of the range vectors, can reduce the norm of n. ˜ The best solution is then also (2.206) or (2.208). In this situation, we are no longer solving the equations (2.16), but rather are dealing with a set that could be written Ex ∼ y,
(2.214)
where the demand is for a solution that is the “best possible,” in the sense just defined. Such statements of approximation are awkward, so as before rewrite (2.214) as Ex + n = y,
(2.215)
where n is the residual. If x˜ is given by (2.207), then by (2.213), n˜ =
M
i=K +1
giT y gi .
(2.216)
Notice that n˜ T y˜ = 0 : y˜ is orthogonal to the residuals. Example Let x1 + x2 = 1, x1 + x2 = 3. √ √ Then using λ1 = 2, g1 = [1, 1]T / 2, λ2 = 0, g2 = [1, −1]T / 2, one has x˜ = [1/2, 1/2]T ∝ g1 , y˜ = [2, 2]T ∝ g1 , n˜ = [−1, 1]T ∝ g2 . This outcome, where M equations in M unknowns were found in practice not to be able to determine some solution structures, is labeled “formally
2.5 The singular vector expansion
77
just-determined.” The expression “formally” alludes to the fact that the appearance of a just-determined system did not mean that the characterization was true in practice. One or more vanishing eigenvalues mean that neither the rows nor columns of E are spanning sets. Some decision has to be made about the coefficients of the nullspace vectors in (2.209). The form could be used as it stands, regarding it as the “general solution.” The analogy with the solution of differential equations should be apparent – typically, there is a particular solution and a homogeneous solution – here the nullspace vectors. When solving a differential equation, determination of the magnitude of the homogeneous solution requires additional information, often provided by boundary or initial conditions; here additional information is also necessary, but missing. Despite the presence of indeterminate elements in the solution, a great deal is known about them: They are proportional to the nullspace vectors. Depending upon the specific situation, we might conceivably be in a position to obtain more observations, and would seriously consider observational strategies directed at observing these missing structures. The reader is also reminded of the discussion of the Neumann problem in Chapter 1. Another approach is to define a “simplest” solution, appealing to what is usually known as “Ockham’s razor,” or the “principle of parsimony,” that in choosing between multiple explanations of a given phenomenon, the simplest one is usually the best. What is “simplest” can be debated, but here there is a compelling choice: The solution (2.208), which is without any nullspace contributions, is less structured than any other solution. (It is often, but not always, true that the nullspace vectors are more “wiggly” than those in the range. The nullspace of the Neumann problem is a counter example. In any case, including any vector not required by the data is arguably producing more structure than is required.) Setting all the unknown α i to zero is thus one plausible choice. It follows from the orthogonality of the gi that this particular solution is also the one of minimum solution norm. Later, other choices for the nullspace vectors will be made. If y = 0, then the nullspace is the solution. If the nullspace vector contributions are set to zero, the true solution has been expanded in an incomplete set of orthonormal vectors. Thus, G K GTK is the resolution matrix, and the relationship between the true solution and the minimal one is just x˜ = G K GTK x = x − QG αG ,
y˜ = G K GTK y,
n˜ = QG QTG y.
(2.217)
These results are very important, so we recapitulate them: (2.207) or (2.209) is the general solution. There are three vectors involved, one of them, y, known, and two of them, x, n, unknown. Because of the assumption that E has an orthonormal
Basic machinery
78
basis of eigenvectors, all three of these vectors can be expanded exactly as x=
M
α i gi ,
n=
i=1
M
γ i gi ,
y=
i=1
M
(yT gi )gi .
(2.218)
i=1
Substituting into (2.215), K
λi α i gi +
i=1
M
M
γ i gi = (yT gi )gi .
i=1
i=1
From the orthogonality property, λi α i + γ i = yT gi , γ i = y gi , T
i = 1, 2, . . . , K ,
(2.219)
K + i = 1, 2, . . . , M.
(2.220)
In dealing with the first relationship, a choice is required. If we set γ i = giT n = 0,
i = 1, 2, . . . , K ,
(2.221)
the residual norm is made as small as possible, by completely eliminating the range vectors from the residual. This choice is motivated by the attempt to satisfy the equations as well as possible, but is seen to have elements of arbitrariness. A decision about other possibilities depends upon knowing more about the system and will be the focus of attention later. The relative contributions of any structure in y, determined by the projection, giT y, will depend upon the ratio giT y/λi . Comparatively weak values of giT y may well be amplified by small, but non-zero, elements of λi . One must keep track of both giT y, and giT y/λi . Before leaving this special case, note one more useful property of the eigenvector/ eigenvalues. For the moment, let G have all its columns, containing both the range and nullspace vectors, with the nullspace vectors being last in arbitrary order. It is thus an M × M matrix. Correspondingly, let Λ contain all the eigenvalues on its diagonal, including the zero ones; it too, is M × M. Then the eigenvector definition (2.191) produces EG = GΛ.
(2.222)
Multiply both sides of (2.222) by GT : GT EG = GT GΛ = Λ.
(2.223)
G is said to “diagonalize” E. Now multiply both sides of (2.223) on the left by G and on the right by GT : GGT EGGT = GΛGT .
(2.224)
2.5 The singular vector expansion
79
Using the orthogonality of G, E = GΛGT .
(2.225)
This is a useful representation of E, consistent with its symmetry. Recall that Λ has zeros on the diagonal corresponding to the zero eigenvalues, and the corresponding rows and columns are entirely zero. Writing out (2.225), these zero rows and columns multiply all the nullspace vector columns of G by zero, and it is found that the nullspace columns of G can be eliminated, Λ can be reduced to its K × K form, and the decomposition (2.225) is still exact – in the form E = G K Λ K GTK .
(2.226)
The representation (decomposition) in either Eq. (2.225) or (2.226) is identical to E =λ1 g1 gT1 + λ2 g2 gT2 + · · · + λ K g K gTK .
(2.227)
That is, a square symmetric matrix can be exactly represented by a sum of products of orthonormal vectors gi giT multiplied by a scalar, λi . Example Consider the matrix from the last example, 1 1 E= . 1 1 We have
2 1 $ E =√ 1 2 1
% 1 % 1 0 1 $ 1 √ +√ 1 −1 √ . 2 2 −1 2
The simultaneous equations (2.215) are G K Λ K GTK x + n = y.
(2.228)
T Left multiply both sides by Λ−1 K G K to get −1 T T GTK x + Λ−1 K G K n = Λ K G K y.
(2.229)
But GTK x are the projection of x onto the range vectors of E, and GTK n is the projection of the noise. We have agreed to set the latter to zero, and obtain T GTK x = Λ−1 K G K y,
the dot products of the range of E with the solution. Hence, it must be true, since the range vectors are orthonormal, that T x˜ ≡ G K GTK x ≡ G K Λ−1 K G K y,
(2.230)
y˜ = E˜x =
(2.231)
G K GTK y,
Basic machinery
80
which is identical to the particular solution (2.206). The residuals are n˜ = y − y˜ = y − E˜x = I M − G K GTK y = QG QTG y,
(2.232)
with n˜ T y˜ = 0. Notice that matrix H of Eq. (2.97) is just G K GTK , and hence (I − H) is the projector of y onto the nullspace vectors. The expected value of the solution (2.206) or (2.230) is T
˜x − x = G K Λ−1 K G K y −
N
α i gi = −QG αG ,
(2.233)
i=1
and so the solution is biassed unless αG = 0. The uncertainty is given by ' & −1 T T T P = D 2 (˜x − x) = G K Λ−1 K G K (y0 + n − y0 )(y0 + n − y0 ) G K Λ K G K ' & + QG αG αTG QTG & ' T −1 T T T T (2.234) = G K Λ−1 K G K nn G K Λ K G K + QG αG αG QG −1 T T T = G K Λ−1 K G K Rnn G K Λ K G K + QG Rαα QG
= Cx x + QG Rαα QTG , defining the second moments, Rαα , of the coefficients of the nullspace vectors. Under the special circumstances that the residuals, n, are white noise, with Rnn = σ 2n I, (2.234) reduces to T T P = σ 2n G K Λ−2 K G K + QG Rαα QG .
(2.235)
Either case shows that the uncertainty of the minimal solution is made up of two distinct parts. The first part, the solution covariance, Cx x , arises owing to the noise present in the observations, and generates uncertainty in the coefficients of the range vectors; the second contribution arises from the missing nullspace vector contribution. Either term can dominate. The magnitude of the noise term depends largely upon the ratio of the noise variance, σ 2n , to the smallest non-zero eigenvalue, λ2K . Example Suppose that
1 1
Ex = y, 1 1 x1 =y= , 3 x2 1
(2.236)
which is inconsistent and has no solution in the conventional sense. Solving, Egi = λi gi ,
(2.237)
2.5 The singular vector expansion
or
1−λ 1 1 1−λ
gi1 gi2
81
0 = . 0
(2.238)
This equation requires that 1−λ 1 0 gi1 + gi2 = , 1 1−λ 0 or
gi2 1 0 1−λ = , + 0 1 gi1 1 − λ
which is gi2 = −(1 − λ) gi1 gi2 1 . =− gi1 1−λ Both equations are satisfied only if λ = 2, 0. This method, which can be generalized, in effect derives the usual statement that for Eq. (2.238) to have a solution, the determinant, , , ,1 − λ 1 ,, , , , 1 1 − λ, must vanish. The first solution is labeled λ1 = 2, and substituting back in produces g1 = √12 [1, 1]T , when given unit length. Also g2 = √12 [−1, 1]T , λ2 = 0. Hence, T 1 1 1 1 1 $ E =√ = 2√ 1 1 2 1 2 1
% 1 .
(2.239)
The equations have no solution in the conventional sense. There is, however, a sensible “best” solution: gT1 y g1 + α 2 g2 , λ 1 1 −1 4 1 1 = + α2 √ √ √ 2 2 2 1 2 1 1 −1 1 . = + α2 √ 1 2 1
x˜ =
Notice that
2 1 E˜x = + 0 = . 2 3
(2.240) (2.241) (2.242)
(2.243)
Basic machinery
82
The solution has compromised the inconsistency. No choice of α 2 can reduce the residual norm. The equations would more sensibly have been written Ex + n = y, and the difference, n = y − E˜x is proportional to g2 . A system like (2.236) would most likely arise from measurements (if both equations are divided by 2, they represent two measurements of the average of (x1 , x2 ), and n would be best regarded as the noise of observation). Example Suppose the same problem as in the last example is solved using Lagrange multipliers, that is, minimizing J = nT n + γ 2 xT x − 2μT (y − Ex − n) . Then, the normal equations are 1∂J = γ 2 x + ET μ = 0, 2 ∂x 1∂J = n + μ = 0, 2 ∂n 1 ∂J = y − Ex − n = 0, 2 ∂μ which produces x˜ = ET (EET + γ 2 I)−1 y 1 1 2 2 2 1 = +γ 1 1 2 2 0
0 1
−1 1 . 3
The limit γ 2 → ∞ is readily evaluated. Letting γ 2 → 0 involves inverting a singular matrix. To understand what is going on, use T 1 1 1 1 T T +0 (2.244) E = GΛG = g1 λ1 g1 + 0 = √ 2√ 2 1 2 1 Hence, EET = GΛ2 GT . Note that the full G, Λ are being used, and I = GGT . Thus, (EET + γ 2 I) = (GΛ2 GT + G(γ 2 )GT ) = G(Λ2 + γ 2 I)GT . By inspection, the inverse of this last matrix is (EET + I/γ 2 )−1 = G(Λ2 + γ 2 I)−1 GT .
2.5 The singular vector expansion
83
But (Λ2 + γ 2 I)−1 is the inverse of a diagonal matrix, * - + (Λ2 + γ 2 I)−1 = diag 1 λi2 + γ 2 . Then * - + x˜ = ET (EET + γ 2 I)−1 y = GΛGT G diag 1 λi2 + γ 2 GT y + * - = G diag λi λi2 + γ 2 GT y T K
2 λi 1 1 1 1 1 T +0 gi 2 gi y = √ = √ 2 2 3 λi + γ 2 1 2+γ 2 1 i=1 4 1 . = 2 + γ2 1 The solution always exists as long as γ 2 > 0. It is a tapered-down form of the solution with γ 2 = 0 if all λi = 0. Therefore, 4 4 1 1 2 1 E = − , n= − 1 3 3 2 + γ2 2 + γ2 2 so that, as γ 2 → ∞, the solution x˜ is minimized, becoming 0, and the residual is equal to y.
2.5.3 Arbitrary systems The singular vector expansion and singular value decomposition It may be objected that this entire development is of little interest, because most problems, including those outlined in Chapter 1, produced E matrices that could not be guaranteed to have complete orthonormal sets of eigenvectors. Indeed, the problems considered produce matrices that are usually non-square, and for which the eigenvector problem is not even defined. For arbitrary square matrices, the question of when a complete orthonormal set of eigenvectors exists is not difficult to answer, but becomes somewhat elaborate.33 When a square matrix of dimension N is not symmetric, one must consider cases in which there are N distinct eigenvalues and where some are repeated, and the general approach requires the so-called Jordan form. But we will next find a way to avoid these intricacies, and yet deal with sets of simultaneous equations of arbitrary dimensions, not just square ones. The square, symmetric case nonetheless provides full analogues to all of the issues in the more general case, and the reader may find it helpful to refer back to this situation for insight. Consider the possibility, suggested by the eigenvector method, of expanding the solution x in a set of orthonormal vectors. Equation (2.87) involves one vector, x,
Basic machinery
84
of dimension N , and two vectors, y, n, of dimension M. We would like to use orthonormal basis vectors, but cannot expect, with two different vector dimensions involved, to use just one set: x can be expanded exactly in N , N -dimensional orthonormal vectors; and similarly, y and n can be exactly represented in M, Mdimensional orthonormal vectors. There are an infinite number of ways to select two such sets. But using the structure of E, a particularly useful pair can be identified. The simple development of the solutions in the square, symmetric case resulted from the theorem concerning the complete nature of the eigenvectors of such a matrix. So construct a new matrix, 0 ET B= , (2.245) E 0 which by definition is square (dimension M + N by M + N ) and symmetric. Thus, B satisfies the theorem just alluded to, and the eigenvalue problem, Bqi = λi qi ,
(2.246)
will give rise to M + N orthonormal eigenvectors qi (an orthonormal basis) whether or not the λi are distinct or non-zero. Writing out (2.246), ⎡
0 E
q1i · qN i
⎤
⎡
q1i · qN i
⎤
⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ E ⎢ ⎥ = λi ⎢ ⎥, ⎥ ⎢ ⎥ ⎢ 0 ⎢ q N +1,i ⎥ ⎢ q N +1,i ⎥ ⎣ · ⎦ ⎣ · ⎦ q N +M,i q N +M,i T
i = 1, 2, . . . , M + N ,
(2.247)
where q pi is the pth element of qi . Taking note of the zero matrices, (2.247) may be rewritten as ⎡ ⎤ ⎡ ⎤ q N +1,i q1i T⎣ ⎦ ⎣ E = λi (2.248) · · ⎦, q N +M,i qN i ⎡ ⎤ ⎡ ⎤ q1i q N +1,i E ⎣ · ⎦ = λi ⎣ · ⎦ , i = 1, 2, . . . , M + N . (2.249) qN i q N +M,i Define ui = [ q N +1,i
· q N +M,i ]T , vi = [ q1,i
· q N ,i ]T , qi = [ viT
uiT ]T , (2.250)
2.5 The singular vector expansion
85
that is, defining the first N elements of qi to be called vi and the last M to be called ui , the two sets together being the “singular vectors.” Then (2.248)–(2.249) are Evi = λi ui ,
(2.251)
E ui = λi vi .
(2.252)
T
If (2.251) is left multiplied by ET , and using (2.252), one has ET Evi = λi2 vi ,
i = 1, 2, . . . , N .
(2.253)
Similarly, left multiplying (2.252) by E and using (2.251) produces EET ui = λi2 ui
i = 1, 2, . . . , M.
(2.254)
These last two equations show that the ui , vi each separately satisfy two independent eigenvector/eigenvalue problems of the square symmetric matrices EET , ET E and they can be separately given unit norm. The λi come in pairs as ±λi and the convention is made that only the non-negative ones are retained, as the ui , vi corresponding to the singular values also differ at most by a minus sign, and hence are not independent of the ones retained.34 If one of M, N is much smaller than the other, only the smaller eigenvalue/eigenvector problem needs to be solved for either of ui , vi ; the other set is immediately calculated from (2.252) or (2.251). Evidently, in the limiting cases of either a single equation or a single unknown, the eigenvalue/eigenvector problem is purely scalar, no matter how large the other dimension. In going from (2.248), (2.249) to (2.253), (2.254), the range of the index i has dropped from M + N to M or N . The missing “extra” equations correspond to negative λi and carry no independent information. Example Consider the non-square, non-symmetric matrix ⎧ ⎫ 0 0 1 −1 2 ⎪ ⎪ ⎪ ⎪ ⎨ ⎬ 1 1 0 0 0 E= . 1 −1 0 0 0⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 1 2 0 0 0 Forming the larger produces ⎧ −0.31623 ⎪ ⎪ ⎪ ⎪ ⎨−0.63246 Q= 0.35857 ⎪ ⎪ ⎪ −0.11952 ⎪ ⎩ 0.59761
matrix B, solve the eigenvector/eigenvalue problem that 0.63246 −1.1796 × 10−16 −0.31623 −2.0817 × 10−16 −0.22361 0.80178 −0.67082 −0.26726 0.00000 −0.53452
⎫ −0.63246 0.31623 ⎪ ⎪ ⎪ 0.31623 0.63246 ⎪ ⎬ −0.22361 0.35857 , ⎪ ⎪ −0.67082 −0.11952 ⎪ ⎪ ⎭ 0.00000 0.59761
Basic machinery
86
⎧ −2.6458 ⎪ ⎪ ⎪ ⎪ 0 ⎨ S= 0 ⎪ ⎪ ⎪ 0 ⎪ ⎩ 0
0 −1.4142 0 0 0
0 0 0 0 0 0 0 1.4142 0 0
⎫ 0 ⎪ ⎪ ⎪ 0 ⎪ ⎬ , 0 ⎪ ⎪ 0 ⎪ ⎪ ⎭ 2.6458
where Q is the matrix whose columns are qi and S is the diagonal matrix whose values are the corresponding eigenvalues. Note that one of the eigenvalues vanishes identically, and that the others occur in positive and negative pairs. The corresponding qi differ only by sign changes in parts of the vectors, but they are all linearly independent. First define a V matrix from the first two rows of Q, −0.31623 0.63246 0 −0.63246 0.31623 V= . −0.63246 −0.31623 0 0.31623 0.63246 Only two of the vectors are linearly independent (the zero-vector is not physically realizable). Similarly, the last three rows of Q define a U matrix, ⎫ ⎧ ⎨ 0.35857 −0.22361 0.80178 −0.22361 0.35857 ⎬ U = −0.11952 −0.67082 −0.26726 −0.67082 −0.11952 , ⎭ ⎩ 0.59761 0.00000 −0.53452 0.00000 0.59761 in which only three columns are linearly independent. Retaining only the last two columns of V and the last three of U, and column normalizing each to unity, produces the singular vectors. By convention, the λi are ordered in decreasing numerical value. Equations (2.251)–(2.252) provide a relationship between each ui , vi pair. But because M = N , generally, there will be more of one set than the other. The only way Eqs. (2.251)–(2.252) can be consistent is if λi = 0, i > min(M, N ) (where min(M, N ) is read as “the minimum of M and N ”). Suppose M < N . Then (2.254) is solved for ui , i = 1, 2, . . . , M, and (2.252) is used to find the corresponding vi . There are N − M vi not generated this way, but which can be found using the Gram–Schmidt method described on p. 22. Let there be K non-zero λi ; then Evi = 0,
i = 1, 2, . . . , K .
(2.255)
These vi are known as the “range vectors of E” or the “solution range vectors.” For the remaining N − K vectors vi , Evi = 0,
i = K + 1, . . . N ,
(2.256)
2.5 The singular vector expansion
87
which are known as the “nullspace vectors of E” or the “nullspace of the solution.” If K < M, there will be K of the ui such that ET ui = 0,
i = 1, 2, . . . , K ,
(2.257)
which are the “range vectors of ET ,” and M − K of the ui such that ET ui = 0,
i = K + 1, . . . , M,
(2.258)
the “nullspace vectors of ET ” or the “data, or observation, nullspace vectors.” The “nullspace” of E is spanned by its nullspace vectors, and the “range” of E is spanned by the range vectors, etc., in the sense, for example, that an arbitrary vector lying in the range is perfectly described by a sum of the range vectors. We now have two complete orthonormal sets in the two different spaces. Note that Eqs. (2.256) and (2.258) imply that Evi = 0, uiT E = 0,
i = K + 1, . . . , N ,
(2.259)
which expresses hard relationships among the columns and rows of E. Because the ui , vi are complete in their corresponding spaces, x, y, n can be expanded without error: x=
N
α i vi ,
y=
i=1
M
β i ui ,
n=
M
j=1
γ i ui ,
(2.260)
i=1
where y has been measured, so that we know β j = uTj y. To find x, we need α i , and to find n, we need the γ i . Substitute (2.260) into (2.87), and using (2.251)–(2.252), N
i=1
α i Evi +
M
γ i ui =
i=1
K
α i λi ui +
i=1
=
M
γ i ui
(2.261)
i=1
M
uiT y ui .
i=1
Notice the differing upper limits on the summations. Because of the orthonormality of the singular vectors, (2.261) can be solved as α i λi + γ i = uiT y, i = 1, 2, . . . , M, α i = uiT y − γ i /λi , λi = 0, i = 1, 2, . . . , K .
(2.262) (2.263)
In these equations, if λi = 0, nothing prevents us from setting γ i = 0, that is, uiT n = 0,
i = 1, 2, . . . , K ,
(2.264)
should we wish, which will have the effect of making the noise norm as small as possible (there is arbitrariness in this choice, and later γ i will be chosen differently).
Basic machinery
88
Then (2.263) produces αi =
uiT y , λi
i = 1, 2, . . . , K .
(2.265)
But, because λi = 0, i > K , the only solution to (2.262) for these values of i is γ i = uiT y, and α i is indeterminate. These γ i are non-zero, except in the event (unlikely with real data) that uiT y = 0,
i = K + 1, . . . , N .
(2.266)
This last equation is a solvability condition – in direct analogy to (2.205). The solution obtained in this manner now has the following form: x˜ =
K
uT y i
i=1
λi
y˜ = E˜x =
N
vi +
α i vi ,
(2.267)
i=K +1
K
uiT y ui ,
(2.268)
i=1
n˜ =
M
i=K +1
uiT y ui .
(2.269)
The coefficients of the last N − K of the vi in Eq. (2.267), the solution nullspace vectors, are arbitrary, representing structures in the solution about which the equations provide no information. A nullspace is always present unless K = N . The solution residuals are directly proportional to the nullspace vectors of ET and will vanish only if K = M, or if the solvability conditions are met. Just as in the simpler square symmetric case, no choice of the coefficients of the solution nullspace vectors can have any effect on the size of the residuals. If we choose once again to exercise Occam’s razor, and regard the simplest solution as best, then setting the nullspace coefficients to zero gives x˜ =
K
uT y i
i=1
λi
vi .
(2.270)
Along with (2.269), this is the “particular-SVD solution.” It minimizes the residuals, and simultaneously produces the corresponding x˜ with the smallest norm. If n = 0, the bias of (2.270) is evidently
˜x − x = −
N
i=K +1
α i vi .
(2.271)
2.5 The singular vector expansion
89
The solution uncertainty is P=
K K
vi
i=1 j=1
N N
uiT Rnn u j T vi + vi α i α j vTj . λi λ j i=K +1 j=K +1
(2.272)
If the noise is white with variance σ 2n or, if a row scaling matrix W−T/2 has been applied to make it so, then (2.272) becomes P=
K
σ2
n v vT 2 i i λ i=1 i
+
N
& i=K +1
' α i2 vi viT ,
(2.273)
where it was also assumed that α i α j = α i2 δ i j in the nullspace. The influence of very small singular values on the uncertainty is plain: In the solution (2.267) or (2.270) there are error terms uiT y/λi that are greatly magnified by small or nearly vanishing singular values, introducing large terms proportional to σ 2n /λi2 into (2.273). The structures dominating x˜ are a competition between the magnitudes of uiT y and λi , given by the ratio, uiT y/λi . Large λi can suppress comparatively large projections onto ui , and similarly, small, but non-zero λi may greatly amplify comparatively modest projections. In practice,35 one is well-advised to study the behavior of both uiT y, uiT y/λi as a function of i to understand the nature of the solution. The decision to omit contributions to the residuals by the range vectors of ET , as we did in Eqs. (2.264) and (2.269), needs to be examined. Should some other choice be made, the x˜ norm would decrease, but the residual norm would increase. Determining the desirability of such a trade-off requires an understanding of the noise structure – in particular, (2.264) imposes rigid structures, and hence covariances, on the residuals. 2.5.4 The singular value decomposition The singular vectors and values have been used to provide a convenient pair of orthonormal bases to solve an arbitrary set of simultaneous equations. The vectors and values have another use, however, in providing a decomposition of E. Define Λ as the M × N matrix whose diagonal elements are the λi , in order of descending values in the same order, U as the M × M matrix whose columns are the ui , V as the N × N matrix whose columns are the vi . As an example, suppose M = 3, N = 4; then ⎧ ⎫ 0 0 0⎬ ⎨λi Λ = 0 λ2 0 0 . ⎩ ⎭ 0 0 λ3 0
Basic machinery
90
Alternatively, if M = 4, N = 3, then ⎧ λ1 ⎪ ⎪ ⎨ 0 Λ= 0 ⎪ ⎪ ⎩ 0
⎫ 0⎪ ⎪ ⎬ 0 , λ3 ⎪ ⎪ ⎭ 0
0 λ2 0 0
therefore extending the definition of a diagonal matrix to non-square ones. Precisely as with matrix G considered above (see p. 75), the column orthonormality of U, V implies that these matrices are orthogonal: UUT = I M ,
UT U = I M ,
(2.274)
VV = I N ,
V V = IN .
(2.275)
T
T
(It follows that U−1 = UT , etc.) As with G above, should one or more columns of U, V be deleted, the matrices will become semi-orthogonal. Equations (2.251)–(2.254) can be written compactly as: EV = UΛ, ET EV = VΛT Λ,
ET U = VΛT , EET U = UΛΛT .
(2.276) (2.277)
Left multiply the first relation of (2.276) by UT and right multiply it by VT , and, invoking Eq. (2.275), UT EV = Λ.
(2.278)
Therefore, U, V diagonalize E (with “diagonal” having the extended meaning for a rectangular matrix as defined above). Right multiplying the first relation of (2.276) by VT gives E = UΛVT .
(2.279)
This last equation represents a product, called the “singular value decomposition” (SVD), of an arbitrary matrix, of two orthogonal matrices, U, V, and a usually non-square diagonal matrix, Λ. There is one further step to take. Notice that for a rectangular Λ, as in the examples above, one or more rows or columns must be all zero, depending upon the shape of the matrix. In addition, if any of the λi = 0, i < min(M, N ), the corresponding rows or columns of Λ will be all zeros. Let K be the number of non-vanishing singular values (the “rank” of E). By inspection (multiplying it out), one finds that the last N − K columns of V and the last M − K columns of U are multiplied by zeros only. If these columns are dropped entirely from U, V so that U becomes M × K and V becomes N × K , and reducing Λ to a K × K square matrix, then
2.5 The singular vector expansion
91
the representation (2.279) remains exact, in the form E = U K Λ K VTK = λ1 u1 vT1 + λ2 u2 vT2 + · · · + λ K u K vTK ,
(2.280)
with the subscript indicating the number of columns, where U K , V K are then only semi-orthogonal, and Λ K is now square. Equation (2.280) should be compared to (2.226).36 The SVD solution can be obtained by matrix manipulation, rather than vector by vector. Consider once again finding the solution to the simultaneous equations (2.87), but first write E in its reduced SVD, U K Λ K VTK x + n = y.
(2.281)
Left multiplying by UTK and invoking the semi-orthogonality of U K produces Λ K VTK x + UTK n = UTK y.
(2.282)
The inverse of Λ K is easily computed, and −1 T T VTK x + Λ−1 K U K n = Λ K U K y.
(2.283)
But VTK x is the dot product of the first K of the vi with the unknown x. Equation (2.283) thus represents statements about the relationship between dot products of the unknown vector, x, with a set of orthonormal vectors, and therefore must represent the expansion coefficients of the solution in those vectors. If we set UTK n = 0,
(2.284)
T VTK x = Λ−1 K U K y,
(2.285)
T x˜ = V K Λ−1 K U K y.
(2.286)
then
and hence
Equation (2.286) is identical to the solution (2.270), and can be confirmed by writing it out explicitly. As with the square symmetric case, the contribution of any structure in y proportional to ui depends upon the ratio of the projection uiT y to λi . Substituting (2.286) into (2.281), T T U K Λ K VTK V K Λ−1 K U K y + n = U K U K y + n = y,
or n˜ = I − U K UTK y.
(2.287)
Basic machinery
92
Let the full U and V matrices be rewritten as U = {U K
Qu }, V = {V K
Qv },
(2.288)
where Qu , Qv are the matrices whose columns are the corresponding nullspace vectors. Then E˜x + n˜ = y, E˜x = y˜ ,
(2.289)
and y˜ =
U K UTK y,
n˜ =
Qu QTu y
=
N
j=K +1
uTj y u j ,
which is identical to (2.268). Note that Qu QTu = I − U K UTK , Qv QTv = I − V K VTK ,
(2.290)
(2.291)
which are idempotent (V K VTK is matrix H of Eq. (2.97)). The two vector sets Qu , Qv span the data and solution nullspaces respectively. The general solution is x˜ = V K Λ−1 K U K y + Qv α,
(2.292)
where α is now restricted to being the vector of coefficients of the nullspace vectors. The solution uncertainty (2.272) is thus −1 T T T P = V K Λ−1 K U K nn U K Λ K V K
+ Qv αα T QTG = Cx x + Qv ααT QTv ,
(2.293)
or T T T P = σ 2n V K Λ−2 K V K + Qv αα Qv ,
(2.294)
for white noise. Least-squares solution of simultaneous solutions by SVD has several important advantages. Among other features, we can write down within one algebraic formulation the solution to systems of equations that can be under-, over-, or justdetermined. Unlike the eigenvalue/eigenvector solution for an arbitrary square system, the singular values (eigenvalues) are always non-negative and real, and the singular vectors (eigenvectors) can always be made a complete orthonormal set. Furthermore, the relations (2.276) provide a specific, quantitative statement of the connection between a set of orthonormal structures in the data, and the corresponding presence of orthonormal structures in the solution. These relations provide a very powerful diagnostic method for understanding precisely why the solution takes on the form it does.
2.5 The singular vector expansion
93
2.5.5 Some simple examples: algebraic equations Example The simplest underdetermined system is 1 × 2. Suppose x1 − 2x2 = 3, so that * + 0.447 −0.894 E = 1 −2 , U = {1} , V = , λ1 = 2.23, −0.894 −0.447 where the second column of V is the nullspace of E. The general solution is x˜ = [0.6, −1.2]T + α 2 v2 . Because K = 1 is the only possible choice, this solution satisfies the equation exactly, and a data nullspace is not possible. Example The most elementary overdetermined problem is 2 × 1. Suppose that x1 = 1, x1 = 3. The appearance of two such equations is possible if there is noise in the observations, and they are properly written as x1 + n 1 = 1, x1 + n 2 = 3. E = {1, 1}T , ET E represents the eigenvalue problem of the smaller dimension, again 1 × 1, and √ 0.707 −0.707 U= , V = {1}, λ1 = 2, 0.707 0.707 where the second column of U lies in the data nullspace, there being no solution nullspace. The general solution is x = x1 = 2, which if substituted back into the original equations produces $ %T E˜x = 2 2 = y˜ . Hence there are residuals n˜ = y˜ − y = [1, −1]T that are necessarily proportional to u2 and thus orthogonal to y˜ . No other solution can produce a smaller l2 norm residual than this one. The SVD provides a solution that compromises the contradiction between the two original equations. Example The possibility of K < M, Consider the system ⎧ ⎨1 −2 3 2 ⎩ 4 0
K < N simultaneously is also easily seen. ⎫ ⎡ ⎤ 1 1⎬ 1 x = ⎣−1⎦ , ⎭ 2 2
94
Basic machinery
which appears superficially just-determined. But the singular values are λ1 = 5.67, λ2 = 2.80, λ3 = 0. The vanishing of the third singular value means that the row and column vectors are not linearly independent sets – indeed, the third row vector is just the sum of the first two (but the third element of y is not the sum of the first two – making the equations inconsistent). Thus there are both solution and data nullspaces, which the reader might wish to find. With a vanishing singular value, E can be written exactly using only two columns of U, V and the linear dependence of the equations is given explicitly as uT3 E = 0. Example Consider now the underdetermined system x1 + x2 − 2x3 = 1, x1 + x2 − 2x3 = 2, which has no conventional solution at all, being a contradiction, and is thus simultaneously underdetermined and incompatible. If one of the coefficients is modified by a very small quantity, || > 0, to produce x1 + x2 − (2 + )x3 = 1, x1 + x2 − 2x3 = 2,
(2.295)
then not only is there a solution, there is an infinite number of them, which can be shown by computing the particular SVD solution and the nullspace. Thus the slightest perturbation in the coefficients has made the system jump from one having no solution to one having an infinite number – a disconcerting situation. The label for such a system is “ill-conditioned.” How would we know the system is ill-conditioned? There are several indicators. First, the ratio of the two singular values is determined by . If we set = 10−10 , the two singular values are λ1 = 3.46, λ2 = 4.1 × 10−11 , an immediate warning that the two equations are nearly linearly dependent. (In a mathematical problem, the non-vanishing of the second singular value is enough to assure a solution. It is the inevitable slight errors in y that suggest sufficiently small singular values should be treated as though they were actually zero.) Example A similar problem exists with the system x1 + x2 − 2x3 = 1, x1 + x2 − 2x3 = 1, which has an infinite number of solutions. But the change to x1 + x2 − 2x3 = 1, x1 + x2 − 2x3 = 1 + ,
2.5 The singular vector expansion
95
T4
8
T1
T3
T5 T6
Figure 2.10 Tomographic problem with nine unknowns and only six integral constraints. Box numbers are in the upper right-corner of each, and the Ti are the measured integrals through various combinations of three boxes.
for arbitrarily small produces a system with no solutions in the conventional mathematical sense, although the SVD will handle the system in a sensible way, which the reader should confirm. Problems like these are simple examples of the practical issues that arise once one recognizes that, unlike textbook problems, observational ones always contain inaccuracies; any discussion of how to handle data in the presence of mathematical relations must account for these inaccuracies as intrinsic – not as something to be regarded as an afterthought. But the SVD itself is sufficiently powerful that it always contains the information to warn of ill-conditioning, and by determination of K to cope with it – producing useful solutions. Example (The tomographic problem from Chapter 1.) A square box is made up of 3 × 3 unit dimension sub-boxes (Fig. 2.10). All rays are in the r x or r y directions. Therefore, the equations are ⎧ 1 ⎪ ⎪ ⎪ ⎪0 ⎪ ⎪ ⎨ 0 1 ⎪ ⎪ ⎪ ⎪ ⎪ 0 ⎪ ⎩ 0
0 1 0 1 0 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
1 0 0 0 0 1
0 1 0 0 0 1
⎫⎡ ⎤ ⎡ ⎤ 0 0⎪ x1 ⎪ ⎪ ⎢ x2 ⎥ ⎢ 1 ⎥ 0⎪ ⎪ ⎪ ⎢ ⎥ ⎬⎢ ⎥ ⎥ ⎢0⎥ 1 ⎢ . ⎢ ⎥ = ⎢ ⎥, ⎢ . ⎥ ⎢0⎥ 0⎪ ⎪ ⎢ ⎥ ⎢ ⎥ ⎪ ⎣ . ⎦ ⎣1⎦ ⎪ 0⎪ ⎪ ⎭ 0 1 x9
that is, Ex = y. There are six integrals (rays) across the nine boxes in which one seeks the corresponding value of xi . y was calculated by assuming that the “true”
Basic machinery
96
value is x5 = 1, xi = 0, i = 5. The SVD produces ⎧ −0.408 0 0 0.816 0 ⎪ ⎪ ⎪−0.408 ⎪ 0.703 −0.0543 −0.408 −0.0549 ⎪ ⎪ ⎨ −0.408 −0.703 0.0543 −0.408 0.0549 U= −0.408 −0.0566 0.0858 0 −0.81 ⎪ ⎪ ⎪ ⎪ ⎪ −0.408 −0.0313 −0.744 0 0.335 ⎪ ⎩ −0.408 0.0879 0.658 0 0.475 Λ = diag([2.45 ⎧ ⎪ −0.333 ⎪ ⎪ ⎪ ⎪ −0.333 ⎪ ⎪ ⎪ ⎪ −0.333 ⎪ ⎪ ⎪ ⎪ ⎨ −0.333 V = −0.333 ⎪ ⎪ ⎪ −0.333 ⎪ ⎪ ⎪ ⎪ −0.333 ⎪ ⎪ ⎪ ⎪ −0.333 ⎪ ⎪ ⎩ −0.333
−0.0327 0.0495 0.373 0.0182 −0.438 0.0808 −0.0181 −0.43 0.388 −0.461 −0.424 −0.398 0.0507 0.38 0.457 0.349 −0.355 0.411
1.73
1.73
1.73
1.73
⎫ 0.408⎪ ⎪ ⎪ 0.408⎪ ⎪ ⎪ ⎬ 0.408 , −0.408⎪ ⎪ ⎪ ⎪ −0.408⎪ ⎪ ⎭ −0.408
0]),
0.471 −0.468 −0.38 −0.224 0.353 −0.236 −0.499 0.432 0.302 −0.275 −0.236 −0.436 −0.0515 −0.0781 −0.0781 0.471 0.193 0.519 −0.361 −0.15 −0.236 0.162 −0.59 −0.0791 −0.29 −0.236 0.225 0.0704 0.44 0.44 0.471 0.274 −0.139 0.585 −0.204 −0.236 0.243 0.158 −0.223 0.566 −0.236 0.306 −0.0189 −0.362 −0.362
⎫ 0.353 ⎪ ⎪ ⎪ ⎪ 0.302 ⎪ ⎪ ⎪ ⎪ −0.655 ⎪ ⎪ ⎪ ⎪ −0.15 ⎪ ⎬ −0.0791 . ⎪ ⎪ 0.229 ⎪ ⎪ ⎪ ⎪ −0.204 ⎪ ⎪ ⎪ ⎪ −0.223 ⎪ ⎪ ⎪ ⎭ 0.427
The zeros appearing in U, and in the last element of diag(Λ), are actually very small numbers (O(10−16 ) or less). Rank K = 5 despite there being six equations – a consequence of redundancy in the integrals. Notice that there are four repeated λi , and the lack of expected simple symmetries in the corresponding vi is a consequence of a random assignment in the eigenvectors. Singular vector u1 just averages the right-hand side values, and the corresponding solution is completely uniform, proportional to v1 . The average of y is often the most robust piece of information. The “right” answer is x = [0, 0, 0, 0, 1, 0, 0, 0, 0]T . The rank 5 answer by SVD is x˜ = [−0.1111, 0.2222, −0.1111, 0.2222, 0.5556, 0.2222, −0.1111, 0.2222, −0.1111]T , which exactly satisfies the same equations, with x˜ T x˜ = 0.556 < xT x. When mapped into two dimensions, x˜ at rank 5 is rx → ⎤ ⎡ −0.11 0.22 −0.11 (2.296) r y ↑ ⎣ 0.22 0.56 0.22 ⎦ , −0.11 0.22 −0.11 and is the minimum norm solution. The mapped v6 , which belongs in the null-space, is ⎡
−0.38 ⎣ r y ↑ 0.52 −0.14
rx → 0.43 −0.59 0.16
⎤ −0.05 0.07 ⎦, −0.02
2.5 The singular vector expansion
97
and along with any remaining nullspace vectors produces a zero sum along any of the ray paths. u6 is in the data nullspace. uT6 E = 0 shows that a (y1 + y2 + y3 ) − a (y4 + y5 + y6 ) = 0, if there is to be a solution without a residual, or alternatively, that no solution would permit this sum to be non-zero. This requirement is physically sensible, as it says that the vertical and horizontal rays cover the same territory and must therefore produce the same sum travel times. It shows why the rank is 5, and not 6. There is no noise in the problem as stated. The correct solution and the SVD solution differ by the nullspace vectors. One can easily confirm that x˜ is column 5 of V5 VT5 . Least-squares allows one to minimize (or maximize) anything one pleases. Suppose that for some reason we want the solution that minimizes the differences between the value in box 5 and its neighbors, perhaps as a way of finding a “smooth” solution. Let ⎫ ⎧ −1 0 0 0 1 0 0 0 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 −1 0 0 1 0 0 0 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪0 0 −1 0 1 0 0 0 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 −1 1 0 0 0 0 ⎬ ⎨ (2.297) W= 0 0 0 0 1 −1 0 0 0 . ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 0 1 0 −1 0 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 0 1 0 0 −1 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 0 1 0 0 0 −1 ⎪ ⎪ ⎪ ⎪ ⎭ ⎩ 0 0 0 0 1 0 0 0 0 The last row is included to render W a full-rank matrix. Then Wx =[x5 − x1
x5 − x2
...
x5 − x9
x5 ]T ,
(2.298)
and we can minimize J = xT WT Wx
(2.299)
subject to Ex = y by finding the stationary value of J = J − 2μT (y − Ex) .
(2.300)
The normal equations are then WT Wx = ET μ,
(2.301)
Ex = y,
(2.302)
and x˜ =(WT W)−1 ET μ.
Basic machinery
98
Then E(WT W)−1 ET μ = y. The rank of E(WT W)−1 ET is K = 5 < M = 6, and so we need a generalized inverse, μ ˜ =(E(WT W)−1 ET )+ y =
5
vi
j=1
viT y . λi
The nullspace of E(WT W)−1 ET is the vector [−0.408
−0.408
−0.408
0.408
0.408
0.408]T ,
(2.303)
which produces the solvability condition. Here, because E(WT W)−1 ET is symmetric, the SVD reduces to the symmetric decomposition. Finally, the mapped x˜ is rx → ⎤ −0.20 0.41 −0.20 r y ↑ ⎣ 0.41 0.18 0.41 ⎦ , −0.20 0.41 −0.21 ⎡
and one cannot further decrease the sum-squared differences of the solution elements. One can confirm that this solution satisfies the equations. Evidently, it produces a minimum, not a maximum (it suffices to show that the eigenvalues of WT W are all non-negative). The addition of any of the nullspace vectors of E to x˜ will necessarily increase the value of J and hence there is no bounded maximum. In real tomographic problems, the arc lengths making up matrix E are three-dimensional curves and depend upon the background index of refraction in the medium, which is usually itself determined from observations.37 There are thus errors in E itself, rendering the problem one of non-linear estimation. Approaches to solving such problems are described in Chapter 3. Example Consider the box reservoir problem with two sources (“end members”) described in Eqs. (2.144)–(2.145), which was reduced to two equations in two unknowns by dividing through by one of the unknown fluxes, J0 , and solving for the ratios J1 /J0 , J2 /J0 . Suppose they are solved instead in their original form as two equations in three unknowns: J1 + J2 − J0 = 0, C1 J1 + C2 J2 − C0 J0 = 0.
2.5 The singular vector expansion
99
To make it numerically definite, let C1 = 1, C2 = 2, C0 = 1.75. The SVD produces ⎫ ⎧ ⎨−0.41 −0.89 0.20⎬ −0.51 −0.86 U= , V = −0.67 0.44 0.59 , Λ = diag ([3.3, 0.39, 0]) −0.86 0.51 ⎭ ⎩ 0.61 −0.11 0.78 (rounded). As the right-hand side of the governing equations vanishes, the coefficients of the range vectors, v1,2 , must also vanish, and the only possible solution here is proportional to the nullspace vector, α 3 v3 , or [J1 , J2 , J0 ] = α 3 [0.20, 0.59, 0.78]T , and α 3 is arbitrary. Alternatively, J1 /J0 = 0.25, J2 /J0 = 0.75. Example Consider the flow into a four-sided box with missing integration constant as described in Chapter 1. Total mass conservation and conservation of dye is denoted by Ci . Let the relative areas of each interface be 1, 2, 3, 1 units respectively. Let the corresponding velocities on each side be 1, 1/2, −2/3, 0 respectively, with the minus sign indicating a flow out. That mass is conserved is confirmed by 1 −2 1 (1) + 2 +3 + 1 (0) = 0. 2 3 Now suppose that the total velocity is not in fact known, but that an integration constant is missing on each interface, so that 1 1 1 + b1 + 2 (1 + b2 ) + 3 + b3 + 1 (2 + b4 ) = 0, 2 3 where the bi = [1/2, −1/2, −1, −2], but are here treated as unknown. Then the above equation becomes b1 + 2b2 + 3b3 + b4 = −5.5, or one equation in four unknowns. One linear combination of the unknown bi can be determined. We would like more information. Suppose that a tracer of concentration Ci = [2, 1, 3/2, 0] is measured at each side, and is believed conserved. The governing equation is 1 3 1 + b1 2 + 2 (1 + b2 ) 1 + 3 + b3 + 1 (2 + b4 ) 0 = 0, 1 2 3 2 or 2b1 + 2b2 + 4.5b3 + 0b4 = −4.5,
100
Basic machinery
giving a system of two equations in four unknowns: ⎡ ⎤ b1 ⎢ 1 2 3 1 ⎢ b2 ⎥ ⎥ = −5.5 . −4.5 2 2 4.5 0 ⎣ b3 ⎦ b4 The SVD of the coefficient matrix, E, is −0.582 −0.813 6.50 0 0 0 E= 0.813 0.582 0 1.02 0 0 ⎧ ⎫ −0.801 0.179 −0.454 0.347 ⎪ ⎪ ⎪ ⎪ ⎨ ⎬ 0.009 0.832 0.429 0.340 × . −0.116 0.479 −0.243 −0.835⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 0.581 0.215 −0.742 0.259 The remainder of the solution is left to the reader.
2.5.6 Simple examples: differential and partial differential equations Example As an example of the use of this machinery with differential equations, consider d2 x (r ) − k 2 x (r ) = 0, dr 2
(2.304)
subject to initial and/or boundary conditions. Using one-sided, uniform discretization, x ((m + 1) r ) − (2 + k 2 (r )2 )x (mr ) + x ((m − 1) r ) = 0,
(2.305)
at all interior points. Take the specific case, with two end conditions, x (r ) = 10, x (51r ) = 1, r = 0.1. The numerical solution is depicted in Fig. 2.11 from the direct (conventional) solution to Ax = y. The first two rows of A were used to impose the boundary conditions on x(r ), x(51r ). The singular values of A are also plotted in Fig. 2.11. The range is over about two orders of magnitude, and there is no reason to suspect numerical difficulties. The first and last singular vectors u1 , v1 , u51 , v51 are also plotted. One infers (by plotting additional such vectors), that the large singular values correspond to singular vectors showing a great deal of small-scale structure, and the smallest singular values correspond to the least structured (largest spatial scales) in both the solution and in the specific corresponding weighted averages of the equations. This result may be counterintuitive. But note that, in this problem, all elements of y vanish except the first two, which are being used to set the boundary conditions. We know from the analytical
2.5 The singular vector expansion 2
10
(a)
li
x(r)
10 5 0 0
101 (b)
0
10
−2
r
0.2
5 (c)
10
0
i
0.2
50 (d)
0
0 −0.2 0
50 i
−0.2 0
50 i
Figure 2.11 (a) x˜ from Eq. (2.304) by brute force from the simultaneous equations; (b) displays the corresponding singular values; all are finite (there is no nullspace); (c) shows u1 (solid curve), and u51 (dashed); (d) shows the corresponding v1 , v51 . The most robust information corresponds to the absence of small scales in the solution.
solution that the true solution is large-scale; most of the information contained in the differential equation (2.304) or its numerical counterpart (2.305) is an assertion that all small scales are absent; this information is the most robust and corresponds to the largest singular values. The remaining information, on the exact nature of the largest scales, which is contained in only two of the 51 equations – given by the boundary conditions, is extremely important, but is less robust than that concerning the absence of small scales. (Less “robust” is being used in the sense that small changes in the boundary conditions will lead to relatively large changes in the large-scale structures in the solution because of the division by relatively small λi .) Example Consider now the classical Neumann problem described in Chapter 1. The problem is to be solved on a 10 × 10 grid as in Eq. (1.17), Ax = b. The singular values of A are plotted in Fig. 2.12; the largest one is λ1 = 7.8, and the smallest non-zero one is λ99 = 0.08. As expected, λ100 = 0. The singular vector v100 corresponding to the zero singular value is a constant; u100 , also shown in Fig. 2.12, is not a constant, and has considerable structure – which provides the solvability condition for the Neumann problem, uT100 y = 0. The physical origin of the solvability condition is readily understood: Neumann boundary conditions prescribe boundary flux rates, and the sum of the interior source strengths plus the boundary flux rates must sum to zero, otherwise no steady state is possible. If the boundary conditions are homogeneous, then no flow takes place through the boundary, and the interior sources must sum to zero. In particular, the value of u100
Basic machinery
102 (a)
(b)
10 2 4 6 8 10
5
0 0
50
0 −0.05 5
10
−0.1
100
(c)
2 4 6 8 10
0.05
(d)
0.8 0.6 0.4
0 5
−0.2
0.2 10 5
10
5
10
−0.4
Figure 2.12 (a) Graph of the singular values of the coefficient matrix A of the numerical Neumann problem on a 10 × 10 grid. All λi are non-zero except the last one. (b) shows u100, the nullspace vector of ET defining the solvability or consistency condition for a solution through uT100 y = 0. Plotted as mapped onto the two-dimensional spatial grid (r x , r y ) with x = y = 1. The interpretation is that the sum of the influx through the boundaries and from interior sources must √ vanish. Note that corner derivatives differ from other boundary derivatives by 1/ 2. The corresponding v100 is a constant, indeterminate with the information available, and not shown. (c) A source b (a numerical delta function) is present, not satisfying the solvability condition uT100 b = 0, because all boundary fluxes were set to vanish. (d) The particular SVD solution, x˜ , at rank K = 99. One confirms that A˜x − b is proportional to u100 as the source is otherwise inconsistent with no flux boundary conditions. With b a Kronecker delta function at one grid point, this solution is a numerical Green function for the Neumann problem and insulating boundary conditions. (See color figs.)
on the interior grid points is a constant. The Neumann problem is thus a forward one requiring coping with both a solution nullspace and a solvability condition.
2.5.7 Relation of least-squares to the SVD What is the relationship of the SVD solution to the least-squares solutions? To some extent, the answer is already obvious from the orthonormality of the two sets of singular vectors: they are the least-squares solution, where it exists. When does the simple least-squares solution exist? Consider first the formally overdetermined problem, M > N . The solution (2.95) is meaningful if and only if the matrix inverse exists. Substituting the SVD for E, one finds that −1 −1 (ET E)−1 = V N ΛTN UTN U N Λ N VTN = V N Λ2N VTN , (2.306)
2.5 The singular vector expansion
103
where the semi-orthogonality of U N has been used. Suppose that K = N , its maximum possible value; then Λ2N is N × N with all non-zero diagonal elements λi2 . The inverse in (2.306) may be found by inspection, using VTN V N = I N , T (ET E)−1 = V N Λ−2 N VN .
Then the solution (2.95) becomes −1 T T T x˜ = V N Λ−2 N V N V N Λ N U N y = V N Λ N U N y,
(2.307)
(2.308)
which is identical to the SVD solution (2.286). If K < N , Λ2N has at least one zero on the diagonal, there is no matrix inverse, and the conventional least-squares solution is not defined. The condition for its existence is thus K = N , the so-called “full rank overdetermined” case. The condition K < N is called “rank deficient.” The dependence of the least-squares solution magnitude upon the possible presence of very small, but non-vanishing, singular values is obvious. That the full-rank overdetermined case is unbiassed, as previously asserted (p. 47), can now be seen from T N N
ui y uiT y0
˜x − x = vi − x = vi − x = 0, λi λi i=1 i=1 with y = y0 + n, if n = 0, assuming that the correct E (model) is being used. Now consider another problem, the conventional purely underdetermined leastsquares one, whose solution is (2.167). When does that exist? Substituting the SVD, −1 x˜ = V M Λ M UTM U M Λ M VTM V M ΛTM UTM y (2.309) −1 = V M Λ M UTM U M Λ2M UTM y. Again, the matrix inverse exists if and only if Λ2M has all non-zero diagonal elements, which occurs only when K = M. Under that specific condition by inspection, −1 T T x˜ = V M Λ M UTM U M Λ−2 (2.310) M U M y = V M Λ M U M y, n˜ = 0,
(2.311)
which is once again the particular-SVD solution (2.286) – with the nullspace coefficients set to zero. This situation is usually referred to as the “full-rank underdetermined case.” Again, the possible influence of small singular values is apparent and an arbitrary sum of nullspace vectors can be added to (2.310). The bias of (2.309) is given by the nullspace elements, and its uncertainty arises only from their contribution, because with n˜ = 0, the noise variance vanishes, and the particular-SVD solution covariance Cx x would be zero. The particular-SVD solution thus coincides with the two simplest forms of leastsquares solution, and generalizes both of them to the case where the matrix inverses
Basic machinery
104
do not exist. All of the structure imposed by the SVD, in particular the restriction on the residuals in (2.264), is present in the least-squares solution. If the system is not of full rank, then the simple least-squares solutions do not exist. The SVD generalizes these results by determining what it can: the elements of the solution lying in the range of E, and an explicit structure for the resulting nullspace vectors. The SVD provides a lot of flexibility. For example, it permits one to modify the simplest underdetermined solution (2.167) to remove its greatest shortcoming, the necessities that n˜ = 0 and that the residuals be orthogonal to all range vectors. One simply truncates the solution (2.270) at K = K < M, thus assigning all vectors vi , K + i = 1, 2, . . . , K , to an “effective nullspace” (or substitutes K for K everywhere). The residual is then M
n˜ =
i=K +1
uiT y ui ,
(2.312)
with an uncertainty for x˜ given by (2.293), but with the upper limit being K rather than K . Such truncation has the effect of reducing the solution covariance contribution to the uncertainty, but increasing the contribution owing to the nullspace (and increasing the bias). In the presence of singular values small compared to σ n , the resulting overall reduction in uncertainty may be very great – at the expense of a solution bias. The general solution now consists of three parts, x˜ =
K
uT y i
i=1
λi
vi +
K
i=K +1
α i vi +
N
α i vi ,
(2.313)
i=K +1
where the middle sum contains the terms appearing with singular values too small to be employed – for the given noise – and the third sum is the strict nullspace. Usually, one lumps the two nullspace sums together. The first sum, by itself, represents the particular-SVD solution in the presence of noise. Resolution and covariance matrices are modified by the substitution of K for K . This consideration is extremely important – it says that despite the mathematical condition λi = 0, some structures in the solution cannot be estimated with sufficient reliability to be useful. The “effective rank” is then not the same as the mathematical rank. It was already noticed that the simplest form of least-squares does not provide a method to control the ratios of the solution and noise norms. Evidently, truncation of the SVD offers a simple way to do so – by reducing K . It follows that the solution norm necessarily is reduced, and that the residuals must grow, along with the size of
2.5 The singular vector expansion
105
the solution nullspace. The issue of how to choose K , that is, “rank determination,” in practice is an interesting one to which we will return (see p. 116).
2.5.8 Pseudo-inverses Consider an arbitrary M × N matrix E = U K Λ K VTK and Ex + n = y. Then if E is full-rank underdetermined, the minimum norm solution is T x˜ = ET (EET )−1 y = V K Λ−1 K U K y,
K = M,
and if it is full-rank overdetermined, the minimum noise solution is T x˜ = (ET E)−1 ET y = V K Λ−1 K U K y,
K = N.
T −1 T The first of these, the Moore–Penrose, or pseudo-inverse, E+ is 1 = E (EE ) sometimes also known as a “right-inverse,” because EE+ = I . The second pseudoM 1 + T −1 T inverse, E+ 2 = (E E) E , is a “left-inverse” as E2 E = I N . They can both be repT resented as V K Λ−1 K U K , but with differing values of K . If K < M, N neither of the T pseudo-inverses exists, but V K Λ−1 K U K y still provides the particular SVD solution. When K = M = N , one has a demonstration that the left and right inverses are identical; they are then written as E−1 .
2.5.9 Row and column scaling The effects on the least-squares solutions of the row and column scaling can now be understood. We discuss them in the context of noise covariances, but as always in least-squares, the weight matrices need no statistical interpretation, and can be chosen by the investigator to suit his or her convenience or taste. Suppose we have two equations written as ⎡ ⎤ x1 1 1 1 ⎣ ⎦ n y x2 + 1 = 1 , 1 1.01 1 n2 y2 x3 where Rnn = I2 , W = I3 . The SVD of E is ⎧ ⎨0.5764 0.7059 −0.7083 U= , V = 0.5793 0.7083 0.7059 ⎩ 0.5764 λ1 = 2.4536,
−0.4096 0.8151 −0.4096
λ2 = 0.0058.
⎫ 0.7071 ⎬ 0.0000 , ⎭ −0.7071
Basic machinery
106
The SVD solutions, choosing ranks K = 1, 2 in succession, are very nearly (the numbers having been rounded) y1 + y2 x˜ ≈ (2.314) [0.58 0.58 0.58]T , 2.45 y1 + y2 y1 − y2 T x˜ ≈ [0.58 0.58 0.58] + [−0.41 0.82 0.41]T , 2.45 0.0058 respectively, so that the first term simply averages the two measurements, yi , and the difference between them contributes – with great uncertainty – in the second term of the rank 2 solution owing to the very small singular value. The uncertainty is given by −1.50 × 104 1.51 × 104 T −1 (EE ) = . −1.50 × 104 1.51 × 104 Now suppose that the covariance matrix of the noise is known to be 1 0.999999 , Rnn = 0.999999 1 (an extreme case, chosen for illustrative purposes). Then, putting W = Rnn , 1.0000 1.0000 1.0000 0 W1/2 = , W−T/2 = . 0 0.0014 −707.1063 707.1070 The new system to be solved is ⎡ ⎤ x y1 1.0000 1.0000 1.0000 ⎣ 1 ⎦ . x2 = 707.1(−y1 + y2 ) 0.0007 7.0718 0.0007 x3
The SVD is 0.1456 U= 0.9893
0.9893 , −0.1456
⎧ ⎨0.0205 V = 0.9996 ⎩ 0.0205
λ1 = 7.1450,
0.7068 −0.0290 0.7068
⎫ 0.7071 ⎬ 0.0000 , ⎭ −0.7071
λ2 = 1.3996.
The second singular value is now much larger relative to the first one, and the two solutions are y2 − y1 x˜ ≈ [ 0 1 0 ]T , (2.315) 7.1 y2 − y1 y1 − 103 (y2 − y1 ) x˜ ≈ [ 0 1 0 ]T + [ 0.71 0 0.71 ]T , 7.1 1.4
2.5 The singular vector expansion
107
and the rank 1 solution is given by the difference of the observations, in contrast to the unscaled solution. The result is quite sensible – the noise in the two equations is so nearly perfectly correlated that it can be removed by subtraction; the difference y2 − y1 is a nearly noise-free piece of information and accurately defines the appropriate structure in x˜ . In effect, the information provided in the row scaling with R permits the SVD to nearly eliminate the noise at rank 1 by an effective subtraction, whereas without that information, the noise is reduced in the solution (2.314) at rank 1 only by averaging. At full rank, that is, K = 2, it can be confirmed that the solutions (2.314) and (2.315) are identical, as they must be. But the error covariances are quite different: 0.5001 −0.707 T −1 (E E ) = . (2.316) −0.707 0.5001 because the imposed covariance permits a large degree of noise suppression. It was previously asserted (p. 67) that in a full-rank formally underdetermined ˜ as may be seen as follows: system, row scaling is irrelevant to x˜ , n, x˜ = E T (E E T )−1 y −1 = ET W−1/2 W−T/2 EET W−1/2 W−T/2 y = ET W−1/2 W1/2 (EET )−1 WT/2 W−T/2 y
(2.317)
= ET (EET )−1 y, but which is true only in the full rank situation. There is a subtlety in row weighting. Suppose we have two equations of form 10x1 + 5x2 + x3 + n 1 = 1, 100x1 + 50x2 + 10x3 + n 2 = 2,
(2.318)
after row scaling to make the expected noise variance in each the same. A rank 1 solution to these equations by SVD is x˜ = [0.0165, 0.0083, 0.0017]T , which produces residuals y˜ − y = [−0.79, 0.079]T – much smaller in the second equation than in the first one. Consider that the second equation is ten times the first one – in effect saying that a measurement of ten times the values of 10x1 + 5x2 + x3 has the same noise in it as a measurement of one times this same linear combination. The second equation represents a much more accurate determination of this linear combination and the equation should be given much more weight in determining the unknowns – and both the SVD and ordinary least-squares does precisely that. To the extent that one finds this result undesirable (one should be careful about why it is so found), there is an easy remedy – divide the equations by their row norms ( j E i2j )1/2 . But there
108
Basic machinery
will be a contradiction with any assertion that the noise in all equations was the same to begin with. Such row scaling is best regarded as non-statistical in nature. An example of this situation is readily apparent in the box balances discussed in Chapter 1. Equations such as (1.32) could have row norms much larger than those (1.31) for the corresponding mass balance, simply because the tracer is measured by convention in its own units. If the tracer is, e.g., oceanic salt, values are, by convention, measured on the Practical Salinity Scale, and are near 35 (but are dimensionless). Because there is nothing fundamental about the choice of units, it seems unreasonable to infer that the requirement of tracer balance has an expected error 35 times smaller than for mass. One usually proceeds in the obvious way by dividing the tracer equations by their row norms as the first step. (This approach need have no underlying statistical validity, but is often done simply on the assumption that salt balance equations are unlikely to be 35 times more accurate than the mass ones.) The second step is to ask whether anything further can be said about the relative errors of mass and salt balance, which would introduce a second, purely statistical, row weight. Column scaling In the least-squares problem, we formally introduced a “column scaling” matrix S. Column scaling operates on the SVD solution exactly as it does in the least-squares solution, to which it reduces in the two special cases already described. That is, we should apply the SVD to sets of equations only where any knowledge of the solution element size has been removed first. If the SVD has been computed for such a column (and row) scaled system, the solution is for the scaled unknown x , and the physical solution is x˜ = ST/2 x˜ .
(2.319)
But there are occasions, with underdetermined systems, where a non-statistical scaling may also be called for, the analogue to the situation considered above where a row scaling was introduced on the basis of possible non-statistical considerations. Example Suppose we have one equation in two unknowns: 10x1 + 1x2 = 3.
(2.320)
The particular-SVD solution produces x˜ = [0.2970, 0.0297]T , in which the magnitude of x1 is much larger than that of x2 and the result is readily understood. As we have seen, the SVD automatically finds the exact solution, subject to making the solution norm as small as possible. Because the coefficient of x1 in (2.320) is ten times that of x2 , it is more efficient in minimizing the norm to give x1 a larger value than x2 . Although we have demonstrated this dependence for a trivial example,
2.5 The singular vector expansion
109
similar behavior occurs for underdetermined systems in general. In many cases, this distribution of the elements of the solution vector x is desirable, the numerical value 10 appearing for good physical reasons. In other problems, the numerical values appearing in the coefficient matrix E are an “accident.” In the box-balance example of Chapter 1, the distances defining the interfaces of the boxes are a consequence of the spatial distance between measurements. Unless one believed that velocities should be larger where the distances are greater or the fluid depth was greater, then the solutions may behave unphysically.38 Indeed, in some situations the velocities are expected to be inverse to the fluid depth and such a prior statistical hypothesis is best imposed after one has removed the structural accidents from the system. (The tendency for the solutions to be proportional to the column norms is not rigid. In particular, the equations themselves may preclude the proportionality.) Take a positive definite, diagonal matrix S, and rewrite (2.87) as EST/2 S−T/2 x + n = y. Then E x + n = y, E = EST/2 , x = S−T/2 x. Solving x˜ = E T (E E T )−1 y, x˜ = ST/2 x˜ .
(2.321)
How should S be chosen? Apply the recipe (2.321) for the one equation example of (2.320), with 1/a 2 0 S= , 0 1/b2 * + 100 1 E = 10/a 1/b , E E T = 2 + 2 , (2.322) a b a 2 b2 , (2.323) (E E T )−1 = 100b2 + a 2 a 2 b2 10/a 3, (2.324) x˜ = 1/b 100b2 + a 2 a 2 b2 10/a 2 T/2 3. (2.325) x˜ = S x˜ = 1/b2 100b2 + a 2 The relative magnitudes of the elements of x˜ are proportional to 10/a 2 , 1/b2 . To make the numerical values identical, choose a 2 = 10, b2 = 1, that is, divide √ √ the elements of the first column of E by 10 and the second column by 1. The apparent rule (which is general) is to divide each column of E by the square root of its length. The square root of the length may be surprising, but arises because of the
Basic machinery
110
second multiplication by the elements of ST/2 in (2.321). This form of column scaling should be regarded as “non-statistical,” in that it is based upon inferences about the numerical magnitudes of the columns of E and does not employ information about the statistics of the solution. Indeed, its purpose is to prevent the imposition of structure on the solution for which no statistical basis has been anticipated. In general, the elements of x˜ will not prove to be equal – because the equations themselves do not permit it. If the system is full-rank overdetermined, the column weights drop out, as claimed for least-squares above. To see this result, consider that, in the full-rank case, x˜ = (E T E )−1 E T y x˜ = ST/2 (S1/2 ET EST/2 )−1 S1/2 ET y T/2 −T/2
=S
S
T
−1 −1/2 1/2
(E E) S
S
(2.326) −1
E y = (E E) E y. T
T
T
Usually row scaling is done prior to column scaling so that the row norms have a simple physical interpretation, but one can row normalize in the column normalized space. 2.5.10 Solution and observation resolution: data ranking Typically, either or both of the set of vectors vi , ui used to present x, y will be deficient in the sense of the expansions in (2.187). It follows immediately from Eqs. (2.188) that the particular-SVD solution is x˜ = V K VTK x = Tv x,
(2.327)
and the data vector with which both it and the general solution are consistent is y˜ = U K UTK y = Tu y.
(2.328)
It is convenient therefore, to define the solution and observation resolution matrices, Tv = V K VTK ,
Tu = U K UTK .
(2.329)
The interpretation of the solution resolution matrix is identical to that in the squaresymmetric case (p. 77). Interpretation of the data resolution matrix is slightly subtle. Suppose an element of y was fully resolved, that is, some column, j0 , of U K UTK were all zeros except for diagonal element j0 , which is one. Then a change of unity in y j0 would produce a change in x˜ that would leave unchanged all other elements of y˜ . If element j0 is not fully resolved, then a change of unity in observation y j0 produces a solution that leads to changes in other elements of y˜ . Stated slightly differently, if yi is not fully
2.5 The singular vector expansion
111
Figure 2.13 Box model with mass fluxes across the bounding interfaces i = 1, . . . , 9. Mass conservation equations are written for the total volume, for specifying the flux across the southern and northern boundaries separately, and for the mass balance in the three layers shown.
resolved, the system lacks adequate information to distinguish equation i from a linear dependence on one or more other equations.39 One can use these ideas to construct quantitative statements of which observations are the most important (“data ranking”). From (2.190), trace(Tu ) = K and the relative contribution to the solution of any particular constraint is given by the corresponding diagonal element of Tu . Consider the example (2.318) without row weighting. At rank 1, 0.0099 0.099 , Tu = 0.099 0.9901 showing that the second equation has played a much more important role in the solution than the first one – despite the fact that we asserted the expected noise in both to be the same. The reason is that described above, the second equation in effect asserts that the measurement is 10 times more accurate than in the first equation – and the data resolution matrix informs us of that explicitly. The elements of Tu can be used to rank the data in order of importance to the final solution. All of the statements about the properties of resolution matrices made above apply to both Tu , Tv . Example A fluid flows into a box as depicted in Fig. 2.13, which is divided into three layers. The box is bounded on the left and right by solid walls. At its southern boundary, the inward (positive directed) mass flow has six unknown values, three each into the layers on the west (region A, qi , i = 1, 2, 3), and three each into the
112
Basic machinery
layers on the east (region B, qi , i = 4, 5, 6). At the northern boundary, there are only three unknown flow fields, for which a positive value denotes a flow outward (region C, i = 7, 8, 9). We write seven mass conservation equations. The first three represent mass balance in each of the three layers. The second three fix the mass transports across each of the sections A, B, C. The final equation asserts that the sum of the three mass transports across the sections must vanish (for overall mass balance). Note that the last equation is a linear combination of equations 4 to 6. Then the equations are ⎫ ⎧ ⎪ 1 0 0 1 0 0 −1 0 0⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 1 0 0 1 0 0 −1 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 −1⎪ ⎬ ⎨0 0 1 0 0 1 0 (2.330) 1 1 1 0 0 0 0 0 0 x + n = y, ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 1 1 1 0 0 0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 0 0 0 0 0 1 1 1⎪ ⎪ ⎪ ⎪ ⎭ ⎩1 1 1 1 1 1 −1 −1 −1⎪ which is 7 × 9. Inspection, or the singular values, demonstrate that the rank of the coefficient matrix, E, is K = 5 because of the linear dependencies noted. Figures 2.14 and 2.15 display the singular vectors and the rank 5 resolution matrices for Eq. (2.330). Notice that the first pair of singular vectors, u1 , v1 show that the mean mass flux (accounting for the sign convention making the northern flow positive outwards) is determined by a simple sum over all of the equations. Other linear combinations of the mass fluxes are determined from various sums and differences of the equations. It is left to the reader to determine whether they make physical sense. The resolution matrices show that none of the equations nor elements of the mass flux are fully resolved, and that all the equations contribute roughly equally. If row and column scaling have been applied to the equations prior to application of the SVD, the covariance, uncertainty, and resolution expressions apply in those new, scaled spaces. The resolution in the original spaces is Tv = ST/2 Tv S−T/2 , Tu = W
T/2
−T/2
Tu W
(2.331) ,
(2.332)
so that x˜ = Tv x,
y˜ = Tu y,
(2.333)
where Tv , Tu are the expressions Eq. (2.329) in the scaled space. Some insight into the effects of weighting can be obtained by applying row and column scaling
2.5 The singular vector expansion (a)
113
(b) (1)
(2)
(3)
(4)
(5)
(6)
(7) 1
2
3
4
5
6
7
1
2
3
4
5
6
7
8
9
Figure 2.14 Singular vectors of the box flow coefficient matrix. Column (a) shows the ui , the column (b) the vi . The last two null space vectors v8 , v9 are not shown, but u6 , u7 do lie in the data nullspace, and v6 , v7 are also in the solution nullspace. All plots have a full scale of ±1.
to the simple physical example of Eq. (2.330). The uncertainty in the new space is P = S1/2 P ST/2 where P is the uncertainty in the scaled space. We have seen an interpretation of three matrices obtained from the SVD: V K VTK , T U K UTK , V K Λ−2 K V K . The reader may well wonder, on the basis of the symmetries between solution and data spaces, whether there is an interpretation of the remaining T matrix U K Λ−2 K U K ? To understand its use, recall the normal equations (2.163, 2.164) that emerged from the constrained objective function (2.149). They become, using the SVD for E, VΛUT μ = x,
(2.334)
UΛV x = y.
(2.335)
T
The pair of equations is always square, of dimension M + N . These equations show that UΛ2 UT μ = y. The particular SVD solution is T μ = U K Λ−2 K U K y,
(2.336)
Basic machinery
114 (a)
(b) (1)
(2)
(3)
(4)
(5)
(6)
(7) 1
2
3
4
5
6
7
1
2
3
4
5
6
7
8
9
Figure 2.15 Rank K = 5 Tu (a) and Tv (b) for the box model coefficient matrix. Full scale is ±1 in all plots.
involving the “missing” fourth matrix. Thus, ∂J T = 2U K Λ−2 K U K y, ∂y and, taking the second derivative, ∂2 J T = 2U K Λ−2 K UK , ∂y2
(2.337)
which is the Hessian of J with respect to the data. If any of the λi become very small, the objective function will be extremely sensitive to small perturbations in y – producing an effective nullspace of the problem. Equation (2.337) supports the suggestion that perfect constraints can lead to difficulties.
2.5.11 Relation to tapered and weighted least-squares In using least-squares, a shift was made from the simple objective functions (2.89) and (2.149) to the more complicated ones in (2.114) or (2.125). The change was
2.5 The singular vector expansion
115
˜ and through the use made to permit a degree of control of the relative norms of x˜ , n, of W, S of the individual elements and the resulting uncertainties and covariances. Application of the weight matrices W, S through their Cholesky decompositions to the equations prior to the use of the SVD is equally valid – thus providing the same amount of influence over the solution elements. The SVD provides its control over the solution norms, uncertainties and covariances through choice of the effective rank K . This approach is different from the use of the extended objective functions (2.114), but the SVD is actually useful in understanding the effect of such functions. Assume that any necessary W, S have been applied. Then the full SVD, including zero singular values and corresponding singular vectors, is substituted into (2.116), x˜ = (γ 2 I N + VΛT ΛVT )−1 VΛT UT y, and x˜ = V(ΛT Λ + γ 2 I)−1 VT VΛT UT y −1 T T = V diag λi2 + γ 2 Λ U y, or x˜ =
N
λi uiT y i=1
λi2 + γ 2
vi .
(2.338)
(2.339)
It is now apparent what the effect of “tapering” has done in least-squares. The word refers to the tapering down of the coefficients of the vi by the presence of γ 2 from the values they would have in the “pure” SVD. In particular, the guarantee that matrices like (ET E + γ 2 I) always have an inverse despite vanishing singular values, is seen to follow because the presence of γ 2 > 0 assures that the inverse of the sum always exists, irrespective of the rank of E. The simple addition of a positive constant to the diagonal of a singular matrix is a well-known ad hoc method for giving it an approximate inverse. Such methods are a form of what is usually known as “regularization,” and are procedures for suppressing nullspaces. Note that the coefficients of vi vanish with λi and a solution nullspace still exists. The residuals of the tapered least-squares solution can be written in various forms. Eqs. (2.117) are n˜ = γ 2 U(γ 2 I + ΛΛT )−1 UT y T 2 M
ui y γ u , γ 2 > 0, = 2 2 i λ + γ i i=1
(2.340)
that is, the projection of the noise onto the range vectors ui no longer vanishes. Some of the structure of the range of ET is being attributed to noise and it is no
Basic machinery
116
longer true that the residuals are subject to the rigid requirement (2.264) of having zero contribution from the range vectors. An increased noise norm is also deemed acceptable, as the price of keeping the solution norm small, by assuring that none of the coefficients in the sum (2.339) becomes overly large – values we can control by varying γ 2 . The covariance of this solution about its mean (Eq. (2.118)) is readily rewritten as Cx x =
N N
λi λ j uiT Rnn uTj 2 vi vTj 2 2 2 λi + γ λj + γ
i=1 j=1
= σ 2n
N
i=1
λi2
λi2 + γ 2
T 2 vi vi
(2.341)
= σ 2n V(ΛT Λ + γ 2 I N )−1 ΛT Λ(ΛT Λ + γ 2 I N )−1 VT , where the second and third lines are again the special case of white noise. The role of γ 2 in controlling the solution variance, as well as the solution size, should be plain. The tapered least-squares solution is biassed – but the presence of the bias can greatly reduce the solution variance. Study of the solution as a function of γ 2 is known as “ridge regression.” Elaborate techniques have been developed for determining the “right” value of γ 2 .40 The uncertainty, P, is readily found as P = γ2
N
i=1
N
vi viT λi2 vi viT 2 + σ n 2 2 2 (λi2 + γ 2 )2 i=1 (λi + γ )
(2.342)
= γ 2 V(ΛT Λ + γ 2 I)−2 VT + σ 2n V(ΛT Λ + γ 2 I)−1 ΛT Λ(ΛT Λ + γ 2 I)−1 VT , showing the variance reduction possible for finite γ 2 (reduction of the second term), and the bias error incurred in compensation in the first term. The truncated SVD and the tapered SVD-tapered least-squares solutions produce the same qualitative effect – it is possible to increase the noise norm while decreasing the solution norm. Although the solutions differ somewhat, they both achieve the purpose stated above – to extend ordinary least-squares in such a way that one can control the relative noise and solution norms. The quantitative difference between them is readily stated – the truncated form makes a clear separation between range and nullspace in both solution and residual spaces: the basic SVD solution contains only range vectors and no nullspace vectors. The residual contains only nullspace vectors and no range vectors. The tapered form permits a merger of the two different sets of vectors: then both solution and residuals contain some contribution from both formal range and effective nullspaces (0 < γ 2 ). We have already seen several times that preventing n˜ from having any contribution from the range of ET introduces covariances into the residuals, with a
2.5 The singular vector expansion
117
consequent inability to produce values that are strictly white noise in character (although it is only a real issue as the number of degrees of freedom, M − K , goes toward zero). In the tapered form of least-squares, or the equivalent tapered SVD, contributions from the range vectors ui , i = 1, 2, . . . , K , are permitted, and a potentially more realistic residual estimate is obtained. (There is usually no good reason why n˜ is expected to be orthogonal to the range vectors.)
2.5.12 Resolution and variance of tapered solutions The tapered least-squares solutions have an implicit nullspace, arising both from the terms corresponding to zero singular values, or from values small compared to γ 2 . To obtain a measure of solution resolution when the vi vectors have not been computed, consider a situation in which the true solution were x j0 ≡ δ j, j0 , that is, unity in the j0 element and zero elsewhere. Then, in the absence of noise, the correct value of y would be Ex j0 = y j0 ,
(2.343)
defining y j0 . Suppose we actually knew (had measured) y j0 , what solution x j0 would be obtained? Assuming all covariance matrices have been applied and suppressing any primes, tapered least-squares (Eq. (2.120)) produces x˜ j0 = ET (EET + γ 2 I)−1 y j0 = ET (EET + γ 2 I)−1 Ex j0 ,
(2.344)
which is row (or column) j0 of Tv = ET (EET + γ 2 I)−1 E.
(2.345)
Thus we can interpret any row or column of Tv as the solution for one in which a Kronecker delta was the underlying correct one. It is an easy matter, using the SVD of E and letting γ 2 → 0, to show that (2.345) reduces to VVT , if K = M. These expressions apply in the row- and column-scaled space and are suitably modified to take account of any W, S which may have been applied, as in Eqs. (2.331) and (2.332). An obvious variant of (2.345) follows from the alternative least-squares solution (2.127), with W = γ 2 I, S = I, Tv = (ET E + γ 2 I)−1 ET E.
(2.346)
Data resolution matrices are obtained similarly. Let y j = δ j j1 . Equation (2.135) produces x˜ j1 = ET (EET + γ 2 I)−1 y j1 ,
(2.347)
Basic machinery
118
which if substituted into the original equations is E˜x j1 = EET (EET + γ 2 I)−1 y j1 .
(2.348)
Tu = EET (EET + γ 2 I)−1 .
(2.349)
Tu = E(ET E + γ 2 I)−1 ET .
(2.350)
Thus,
The alternate form is
All of the resolution matrices reduce properly to either UUT or VVT as γ 2 → 0 when the SVD for E is substituted, and K = M or N as necessary. 2.6 Combined least-squares and adjoints 2.6.1 Exact constraints Consider now a modest generalization of the constrained problem Eq. (2.87) in which the unknowns x are also meant to satisfy some constraints exactly, or nearly so, for example, Ax = b.
(2.351)
In some contexts, (2.351) is referred to as the “model,” a term also employed, confusingly, for the physics defining E and/or the statistics assumed to describe x, n. We will temporarily refer to Eq. (2.351) as “perfect constraints,” as opposed to those involving E, which generally always have a non-zero noise element. An example of a model in these terms occurs in acoustic tomography (Chapter 1), where measurements exist of both density and velocity fields, and they are connected by dynamical relations; the errors in the relations are believed to be so much smaller than those in the data, that for practical purposes, the constraints (2.351) might as well be treated as though they are perfect.41 But otherwise, the distinction between constraints (2.351) and the observations is an arbitrary one, and the introduction of an error term in the former, no matter how small, removes any particular reason to distinguish them: A may well be some subset of the rows of E. What follows can in fact be obtained by imposing the zero noise limit for some of the rows of E in the solutions already described. Furthermore, whether the model should be satisfied exactly, or should contain a noise element too, is situation dependent. One should be wary of introducing exact equalities into estimation problems, because they carry the strong possibility of introducing small eigenvalues, or near singular relationships, into the solution, and which may dominate the results. Nonetheless, carrying one or more perfect constraints does produce some insight into how the system is behaving.
2.6 Combined least-squares and adjoints
119
Several approaches are possible. Consider, for example, the objective function J = (Ex − y)T (Ex − y) + γ 2 (Ax − b)T (Ax − b),
(2.352)
where W, S have been previously applied if necessary, and γ 2 is retained as a tradeoff parameter. This objective function corresponds to the requirement of a solution of the combined equation sets, E n y x+ = , (2.353) A u b in which u is the model noise, and the weight given to the model is γ 2 I. For any finite γ 2 , the perfect constraints are formally “soft” because they are being applied only as a minimized sum of squares. The solution follows immediately from (2.95) with E y E −→ , y −→ , γA γb assuming the matrix inverse exists. As γ 2 → ∞, the second set of equations is being imposed with arbitrarily great accuracy, and, barring numerical issues, becomes as close to exactly satisfied as one wants. Alternatively, the model can be imposed as a hard constraint. All prior covariances and scalings having been applied, and Lagrange multipliers introduced, the problem is one with an objective function, J = nT n − 2μT (Ax − b) = (Ex − y)T (Ex − y) − 2μT (Ax − b),
(2.354)
which is a variant of (2.149). But now, Eq. (2.351) is to be exactly satisfied, and the observations only approximately so. Setting the derivatives of J with respect to x, μ to zero, gives the following normal equations: AT μ = ET (Ex − y) , Ax = b.
(2.355) (2.356)
Equation (2.355) represents the adjoint, or “dual” model, for the adjoint or dual solution μ, and the two equation sets are to be solved simultaneously for x, μ. They are again M + N equations in M + N unknowns (M of the μi , N of the xi ), but need not be full-rank. The first set, sometimes referred to as the “adjoint model,” determines μ from the difference between Ex and y. The last set is just the exact constraints. We can most easily solve two extreme cases in Eqs. (2.355) and (2.356) – one in which A is square, N × N , and of full-rank, and one in which E has this property.
Basic machinery
120
In the first case, x˜ = A−1 b,
(2.357)
μ ˜ = A−T (ET EA−1 − ET )b.
(2.358)
and
Here, the values of x˜ are completely determined by the full-rank, perfect constraints and the minimization of the deviation from the observations is passive. The Lagrange multipliers or adjoint solution, however, are useful in providing the sensitivity information, ∂ J/∂b = 2μ, as already discussed. The uncertainty of this solution is zero because of the full-rank perfect model assumption (2.356). In the second case, from (2.355), x˜ = (ET E)−1 [ET y + AT μ] ≡ x˜ u + (ET E)−1 AT μ, where x˜ u = (ET E)−1 ET y is the ordinary, unconstrained least-squares solution. Substituting into (2.356) produces μ ˜ = [A(ET E)−1 AT ]−1 (b − A˜xu ),
(2.359)
x˜ = x˜ u + (ET E)−1 AT [A(ET E)−1 AT ]−1 (b − A˜xu ),
(2.360)
and
assuming A is full-rank underdetermined. The perfect constraints are underdetermined; their range is being fit perfectly, with its nullspace being employed to reduce the misfit to the data as far as possible. The uncertainty of this solution may be written as42 P = D 2 (˜x − x) −1
= σ {(E E) 2
T
(2.361) −1
−1
T −1
−1
− (E E) A [A(E E) A ] A(E E) }, T
T
T
T
which represents a reduction in the uncertainty of the ordinary least-squares solution (first term on the right) by the information in the perfectly known constraints. The presence of A−1 in these solutions is a manifestation of the warning about the possible introduction of components dependent upon small eigenvalues of A. If neither ET E nor A is of full-rank one can use, e.g., the SVD with the above solution; the combined E, A may be rank deficient, or just-determined. Example Consider the least-squares problem of solving x1 + n 1 = 1, x2 + n 2 = 1, x1 + x2 + n 3 = 3,
2.6 Combined least-squares and adjoints
121
with uniform, uncorrelated noise of variance 1 in each of the equations. The leastsquares solution is then x˜ = [1.3333
1.3333]T ,
with uncertainty
0.6667 P= −0.333
−0.3333 . 0.6667
But suppose that it is known or desired that x1 − x2 = 1. Then (2.360) produces x˜ = [1.8333 0.8333]T , μ = 0.5, J = 0.8333, with uncertainty 0.1667 0.1667 P= . 0.1667 0.1667 If the constraint is shifted to x1 − x2 = 1.1, the new solution is x˜ = [1.8833 0.7833]T and the new objective function is J = 0.9383, consistent with the sensitivity deduced from μ. A more generally useful case occurs when the errors normally expected to be present in the supposedly exact constraints are explicitly acknowledged. If the exact constraints have errors either in the “forcing,” b, or in a mis-specification of A, then we write Ax = b + Γu,
(2.362)
assuming that u = 0, uuT = Q. Γ is a known coefficient matrix included for generality. If, for example, the errors were thought to be the same in all equations, we could write Γ = [1, 1, . . . 1]T , and then u would be just a scalar. Let the dimension of u be P × 1. Such representations are not unique and more will be said about them in Chapter 4. A hard constraint formulation can still be used, in which (2.362) is to be exactly satisfied, imposed through an objective function of form T −1 T J = (Ex − y)T R−1 nn (Ex − y) + u Q u − 2μ (Ax − b − Γu).
(2.363)
Here, the noise error covariance matrix has been explicitly included. Finding the normal equations by setting the derivatives with respect to (x, u, μ) to zero produces AT μ = ET R−1 nn (Ex − y) ,
(2.364)
ΓT μ = Q−1 u,
(2.365)
Ax + Γu = b.
(2.366)
Basic machinery
122 (a)
(b) p
p
(c) p
x +10
Latitude
x –20 2
2
2 x 1.6
x –0.02
1
1
x –100
x 0.2 1 x0
x –0.005
0
0
1
2
Longitude
p
0
0
2
1
Longitude
p
0
0
1
2
Longitude
p
Figure 2.16 Numerical solution of the partial differential equation, Eq. (2.370). Panel (a) shows the imposed symmetric forcing – sin x sin y. (b) Displays the solution φ, and (c) shows the Lagrange multipliers, or adjoint solution, μ, whose structure is a near mirror image of φ. (Source: Schr¨oter and Wunsch, 1986)
This system is (2N + P) equations in (2N + P) unknowns, where the first equation is again the adjoint system, and dependent upon Ex − y. Because u is a simple function of the Lagrange multipliers, the system is easily reduced to AT μ = ET R−1 nn (Ex − y), Ax + ΓQΓT μ = b,
(2.367) (2.368)
which is now 2N × 2N , the u having dropped out. If all matrices are full-rank, the solution is immediate; otherwise the SVD can be used. To use a soft constraint methodology, write T −1 T J = (Ex − y)T R−1 nn (Ex − y) + (Ax − b − Γu) Q (Ax − b − Γu) ,
(2.369)
and find the normal equations. It is again readily confirmed that the solutions using (2.352) or (2.363) are identical, and the hard/soft distinction is seen again to be artificial. The soft constraint method can deal with perfect constraints, by letting Q−1 → 0 but stopping when numerical instability sets in. The resulting numerical algorithms fall under the general subject of “penalty” and “barrier” methods.43 Objective functions like (2.363) and (2.369) will be used extensively in Chapter 4. Example Consider the partial differential equation ∇ 2 φ +
∂φ = − sin x sin y. ∂x
(2.370)
A code was written to solve it by finite differences for the case = 0.05 and φ = 0 on the boundaries, 0 ≤ x ≤ π, 0 ≤ y ≤ π, as depicted in Fig. 2.16. The discretized form of the model is then the perfect N × N constraint system Ax = b,
x = {φ i j },
(2.371)
2.6 Combined least-squares and adjoints
123
and b is equivalently discretized – sin x sin y. The theory of partial differential equations shows that this system is full-rank and generally well-behaved. But let us pretend that this information is unknown to us, and seek the values of x that make the objective function, J = xT x − 2μT (Ax − b),
(2.372)
stationary with respect to x, μ, that is the Eqs. (2.355) and (2.356) with E = I, y = 0. Physically, xT x is identified with the solution potential energy. The solution μ, corresponding to the solution of Fig. 2.16(b) is shown in Fig. 2.16(c). What is the interpretation? The Lagrange multipliers represent the sensitivity of the solution potential energy to perturbations in the forcing field. The sensitivity is greatest in the right-half of the domain, and indeed displays a boundary layer character. A physical interpretation of the Lagrange multipliers can be inferred, given the simple structure of the governing equation (2.370), and the Dirichlet boundary conditions. This equation is not self-adjoint; the adjoint partial differential equation is of form ∂ν = d, (2.373) ∂x where d is a forcing term, subject to mixed boundary conditions, and whose discrete form is obtained by taking the transpose of the A matrix of the discretization (see Appendix 2 to this chapter). The forward solution exhibits a boundary layer on the left-hand wall, while the adjoint solution has a corresponding behavior in the dual space on the right-hand wall. The structure of the μ would evidently change if J were changed.44 ∇ 2 ν −
The original objective function J is very closely analogous to the Lagrangian (not to be confused with the Lagrange multiplier) in classical mechanics. In mechanics, the gradients of the Lagrangian commonly are virtual forces (forces required to enforce the constraints). The modified Lagrangian, J , is used in mechanics to impose various physical constraints, and the virtual force required to impose the constraints, for example, the demand that a particle follow a particular path, is the Lagrange multiplier.45 In an economics/management context, the multipliers are usually called “shadow prices” as they are intimately related to the question of how much profit will change with a shift in the availability or cost of a product ingredient. The terminology “cost function” is a sensible substitute for what we call the “objective function.” More generally, there is a close connection between the stationarity requirements imposed upon various objective functions throughout this book, and the mathematics of classical mechanics. An elegant Hamiltonian formulation of the material is possible.
Basic machinery
124
2.6.2 Relation to Green functions Consider any linear set of simultaneous equations, involving an arbitrary matrix, A, Ax = b.
(2.374)
Write the adjoint equations for an arbitrary right-hand side, AT z = r.
(2.375)
zT Ax − xT AT z = 0
(2.376)
Then the scalar relation
(the “bilinear identity”) implies that zT b = xT r.
(2.377)
zT b = 0,
(2.378)
In the special case, r = 0, we have
that is, b, the right-hand side of the original equations (2.374), must be orthogonal to any solution of the homogeneous adjoint equations. (In SVD terms, this result is the solvability condition Eq. (2.266).) If A is of full rank, then there is no non-zero solution to the homogeneous adjoint equations. Now assume that A is N × N of full rank. Add a single equation to (2.374) of the form x p = α p,
(2.379)
eTp x = α p ,
(2.380)
or
where e p = δ i p and α p is unknown. We also demand that Eq. (2.374) should remain exactly satisfied. The combined system of (2.374) and (2.380), written as A1 x = b 1 ,
(2.381)
is overdetermined. If it is to have a solution without any residual, it must still be orthogonal to any solution of the homogeneous adjoint equations, AT1 z = 0.
(2.382)
There is only one such solution (because there is only one vector, z = u N +1 , in the null space of AT1 ). Write u N +1 = [g p ,γ ]T , separating out the first N elements of
2.7 Minimum variance estimation and simultaneous equations
125
u N +1 , calling them g p , and calling the one remaining element, γ . Thus Eq. (2.377) is uTN +1 b1 = gTp b + γ α p = 0.
(2.383)
Choose γ = −1 (any other choice can be absorbed into g p ). Then α p = gTp b.
(2.384)
If g p were known, then α p in (2.384) would be the only value consistent with the solutions to (2.374), and would be the correct value of x p. But (2.382) is the same as AT g p = e p .
(2.385)
Because all elements x p are needed, Eq. (2.385) is solved for all p = 1, 2, . . . , N , that is, AT G = I N ,
(2.386)
which is N separate problems, each for the corresponding column of G = {g1 , g2 , . . . , g N }. Here, G is the Green function. With G known, we have immediately that x = GT b,
(2.387)
(from Eq. (2.384)). The Green function is an inverse to the adjoint equations; its significance here is that it generalizes in the continuous case to an operator inverse.46
2.7 Minimum variance estimation and simultaneous equations The fundamental objective for least-squares is minimization of the noise norm (2.89), although we complicated the discussion somewhat by introducing trade-offs against ˜x, various weights in the norms, and even the restriction that x˜ should satisfy certain equations exactly. Least-squares methods, whether used directly as in (2.95) or indirectly through the vector representations of the SVD, are fundamentally deterministic. Statistics were used only to understand the sensitivity of the solutions to noise, and to obtain measures of the expected deviation of the solution from some supposed truth. But there is another, very different, approach to obtaining estimates of the solution to equation sets like (2.87), directed more clearly toward the physical goal: to find an estimate x˜ that deviates as little as possible in the mean-square from the true solution. That is, we wish to minimize the statistical quantities (x˜i − xi )2 ˜ for all i. The next section is devoted to finding such an x˜ (and the corresponding n),
Basic machinery
126
through an excursion into statistical estimation theory. It is far from obvious that this x˜ should bear any resemblance to one of the least-squares estimates; but as will be seen, under some circumstances the two are identical. Their possible identity is extremely useful, but has apparently led many investigators to seriously confuse the methodologies, and therefore the interpretation of the result.
2.7.1 The fundamental result Suppose we are interested in making an estimate of a physical variable, x, which might be a vector or a scalar, and is either constant or varying with space and time. To be definite, let x be a function of an independent variable r, written discretely as r j (it might be a vector of space coordinates, or a scalar time, or an accountant’s label). Let us make some suppositions about what is usually called “prior information.” In particular, suppose we have an estimate of the low-order statistics describing x, that is, specifying its mean and second moments,
x = 0,
x(ri )x(r j )T = Rx x (ri , r j ).
(2.388)
To make this problem concrete, one might think of x as being the temperature anomaly (about the mean) at a fixed depth in a fluid (a scalar) and r j a vector of horizontal positions; or conductivity in a well, where r j would be the depth coordinate, and x is the vector of scalars at any location, r p , x p = x(r p ). Alternatively, x might be the temperature at a fixed point, with r j being the scalar of time. But if the field of interest is the velocity vector, then each element of x is itself a vector, and one can extend the notation in a straightforward fashion. To keep the notation a little cleaner, however, all elements of x are written as scalars. Now suppose that we have some observations, yi , as a function of the same coordinate ri , with a known, zero mean, and second moments R yy (ri , r j ) = y(ri )y(r j )T ,
Rx y (ri , r j ) = x(ri )y(r j )T ,
i, j = 1, 2, . . . , M. (2.389) (The individual observation elements can also be vectors – for example, two or three components of velocity and a temperature at a point – but as with x, the modifications required to treat this case are straightforward, and scalar observations are assumed.) Could the measurements be used to make an estimate of x at a point r˜ α where no measurement is available? Or could many measurements be used to obtain a better estimate even at points where there exists a measurement? The idea is to exploit the concept that finite covariances carry predictive capabilities from known variables to unknown ones. A specific example would be to suppose the measurements are of temperature, y(r j ) = y0 (r j ) + n(r j ), where n is the noise and temperature estimates are sought at different locations, perhaps on a regular grid r˜ α ,
2.7 Minimum variance estimation and simultaneous equations
127
α = 1, 2, . . . , N . This special problem is one of gridding or mapmaking (the tilde is placed on rα as a device to emphasize that this is a location where an estimate is sought; the numerical values of these places or labels are assumed known). Alternatively, and somewhat more interesting, perhaps the measurements are more indirect, with y(ri ) representing a velocity field component at depth in a fluid and believed connected, through a differential equation, to the temperature field. We might want to estimate the temperature from measurements of the velocity. Given the previous statistical discussion (p. 31), it is reasonable to ask for an estimate x˜ (˜rα ), whose dispersion about its true value, x(˜rα ) is as small as possible, that is, P (˜rα , r˜ α ) = (x˜ (˜rα ) − x(˜rα ))(x˜ (˜rβ ) − x(˜rβ ))|r˜ α =˜rβ is to be minimized. If an estimate is needed at more than one point, r˜ α , the covariance of the errors in the different estimates would usually be required, too. Form a vector of values to be estimated, {x˜ (rα )} ≡ x˜ , and their uncertainty is P(˜rα , r˜ β ) = (x˜ (˜rα ) − x(˜rα ))(x˜ (˜rβ ) − x(˜rβ )) = (˜x − x)(˜x − x)T ,
α, β = 1, 2, . . . , N ,
(2.390)
where the diagonal elements, P(˜rα , r˜ α ), are to be individually minimized (not in the sum of squares). Thus a solution with minimum variance about the correct value is sought. What should the relationship be between data and estimate? At least initially, a linear combination of data is a reasonable starting point, x˜ (˜rα ) =
M
B(˜rα , r j )y(r j ),
(2.391)
j=1
for all α, which makes the diagonal elements of P in (2.390) as small as possible. By letting B be an N × M matrix, all of the points can be handled at once: x˜ (˜rα ) = B(˜rα , r j )y(r j ).
(2.392)
(This notation is redundant. Equation (2.392) is a shorthand for (2.391), in which the argument has been put into B explicitly as a reminder that there is a summation over all the data locations, r j , for all mapping locations, r˜ α , but it is automatically accounted for by the usual matrix multiplication convention. It suffices to write x˜ = By.) An important result, often called the “Gauss–Markov theorem,” produces the values of B that will minimize the diagonal elements of P.47 Substituting (2.392)
Basic machinery
128
into (2.390) and expanding, P(˜rα , r˜ β ) = (B(˜rα , r j )y(r j ) − x(˜rα ))(B(˜rβ , rl )y (rl ) − x(˜rβ ))T ≡ (By − x)(By − x)T
(2.393)
= B yy − xy B − B yx + xx . T
T
T
T
T
Using Rx y = RTyx , Eq. (2.393) is P = BR yy BT − Rx y BT − BRTx y + Rx x .
(2.394)
Notice that because Rx x represents the moments of x evaluated at the estimation positions, it is a function of r˜ α , r˜ β , whereas Rx y involves covariances of y at the data positions with x at the estimation positions, and is consequently a function Rx y (˜rα , r j ). Now completing the square (Eq. (2.37)) (by adding and subtracting T Rx y R−1 yy Rx y ), (2.394) becomes −1 T −1 T P = (B − Rx y R−1 yy )R yy (B − Rx y R yy ) − Rx y R yy Rx y + Rx x .
(2.395)
Setting r˜ α = r˜ β so that (2.395) is the variance of the estimate at point r˜ α about its true value, and noting that all three terms in Eq. (2.395) are positive definite, minimization of any diagonal element of P is obtained by choosing B so that the first term vanishes, or B(˜rα , r j ) = Rx y (˜rα , ri ) R yy (ri , r j )−1 = Rx y R−1 yy .
(2.396)
−1 T (The diagonal elements of (B − Rx y R−1 yy )R yy (B − Rx y R yy ) need to be written out explicitly to see that Eq. (2.396) is necessary. Consider the 2 × 2 case: the first term of Eq. (2.395) is of the form T C11 C12 R11 R12 C11 C12 , C21 C22 R21 R22 C21 C22
where C = B − Rx y R−1 yy . Then, one has the diagonal of
2 2 C11 R11 + C12 C11 (R21 + R12 ) + C12 R22 ·.
· , 2 2 R11 + C21 C22 (R21 + R12 ) + C22 R22 C21
and these diagonals vanish (with R11 , R22 > 0, only if C11 = C12 = C21 = C22 = 0). Thus the minimum variance estimate is x˜ (˜rα ) = Rx y (˜rα , ri ) R−1 yy (ri , r j )y(r j ),
(2.397)
and the minimum of the diagonal elements of P is found by substituting back into (2.394), producing T P(˜rα , r˜ β ) = Rx x (˜rα , r˜ β ) − Rx y (˜rα , r j )R−1 rβ , rk ). yy (r j , rk )Rx y (˜
(2.398)
2.7 Minimum variance estimation and simultaneous equations
129
The bias of (2.398) is
˜x − x = Rx y R−1 yy y − x.
(2.399)
If y = x = 0, the estimator is unbiassed, and called a “best linear unbiassed estimator,” or “BLUE”; otherwise it is biassed. The whole development here began with the assumption that x = y = 0; what is usually done is to remove the sample mean from the observations y, and to ignore the difference between the true and sample means. An example of using this machinery for mapping purposes will be seen in Chapter 3. Under some circumstances, this approximation is unacceptable, and the mapping error introduced by the use of the sample mean must be found. A general approach falls under the label of “kriging,” which is also briefly discussed in Chapter 3. 2.7.2 Linear algebraic equations The result shown in (2.396)–(2.398) is the abstract general case and is deceptively simple. Invocation of the physical problem of interpolating temperatures, etc., is not necessary: the only information actually used is that there are finite covariances between x, y, n. Although mapping will be explicitly explored in Chapter 3, suppose instead that the observations are related to the unknown vector x as in our canonical problem, that is, through a set of linear equations, Ex + n = y. The measurement covariance, R yy , can then be computed directly as R yy = (Ex + n)(Ex + n)T = ERx x ET + Rnn .
(2.400)
The unnecessary, but simplifying and often excellent, assumption was made that the cross-terms of form Rxn = RTnx = 0,
(2.401)
Rx y = x(Ex + n)T = Rx x ET ,
(2.402)
so that
that is, there is no correlation between the measurement noise and the actual state vector (e.g., that the noise in a temperature measurement does not depend upon whether the true value is 10◦ or 25◦ ). Under these circumstances, Eqs. (2.397) and (2.398) take on the following form: x˜ = Rx x ET (ERx x ET + Rnn )−1 y,
(2.403)
n˜ = y − E˜x,
(2.404) −1
P = Rx x − Rx x E (ERx x E + Rnn ) ERx x . T
T
(2.405)
130
Basic machinery
These latter expressions are extremely important; they permit discussion of the solution to a set of linear algebraic equations in the presence of noise using information concerning the statistics of both the noise and the solution. Notice that they are identical to the least-squares expression (2.135) if S = Rx x , W = Rnn , except that there the uncertainty was estimated about the mean solution; here it is taken about the true one. As is generally true of all linear methods, the uncertainty, P, is independent of the actual data, and can be computed in advance should one wish. From the matrix inversion lemma, Eqs. (2.403)–(2.405) can be rewritten as −1 T −1 T −1 x˜ = R−1 E Rnn y, x x + E Rnn E n˜ = y − E˜x, −1 T −1 P = R−1 . x x + E Rnn E
(2.406) (2.407) (2.408)
Although these alternate forms are algebraically and numerically identical to Eqs. (2.403)–(2.405), the size of the matrices to be inverted changes from M × M matrices to N × N , where E is M × N (but note that Rnn is M × M; the efficacy of this alternate form may depend upon whether the inverse of Rnn is known). Depending upon the relative magnitudes of M, N , one form may be more preferable to the other. Finally, (2.408) has an important interpretation that we will discuss when we come to recursive methods. Recall, too, the options we had with the SVD of solving M × M or N × N problems. Note that in the limit of complete a priori ignorance of the solution, R−1 x x → 0, Eqs. (2.406) and (2.408) reduce to −1 T −1 E Rnn y, x˜ = ET R−1 nn E −1 P = (ET R−1 nn E) ,
the conventional weighted least-squares solution, now with P = Cx x. More generally, the presence of finite R−1 x = x, x x introduces a bias into the solution so that ˜ which, however, produces a smaller solution variance than in the unbiassed solution. The solution shown in Eqs. (2.403)–(2.405) and (2.406)–(2.408) is an “estimator”; it was found by demanding a solution with the minimum dispersion about the true solution and it is found, surprisingly, identical to the tapered, weighted least-squares solution when S = Rx x , W = Rnn , the least-squares objective function weights are chosen. This correspondence of the two solutions often leads them to be seriously confused. It is essential to recognize that the logic of the derivations are quite distinct: we were free in the least-squares derivation to use weight matrices which were anything we wished – as long as appropriate inverses existed. The correspondence of least-squares with what is usually known as minimum variance estimation can be understood by recognizing that the Gauss–Markov estimator was derived by minimizing a quadratic objective function. The least-squares
2.7 Minimum variance estimation and simultaneous equations
131
estimate was obtained by minimizing a summation which is a sample estimate of the Gauss–Markov objective function when S, W are chosen properly. 2.7.3 Testing after the fact As with any statistical estimator, an essential step after an apparent solution has been found is the testing that the behavior of x˜ , n˜ is consistent with the assumed prior statistics reflected in Rx x , Rnn , and any assumptions about their means or other properties. Such a-posteriori checks are both necessary and very demanding. One sometimes hears it said that estimation using Gauss–Markov and related methods is “pulling solutions out of the air” because the prior covariance matrices Rx x , Rnn often are only poorly known. But producing solutions that pass the test of consistency with the prior covariances can be very difficult. It is also true that the solutions tend to be somewhat insensitive to the details of the prior covariances and it is easy to become overly concerned with the detailed structure of Rx x , Rnn . As stated previously, it is also rare to be faced with a situation in which one is truly ignorant of the covariances, true ignorance meaning that arbitrarily large or small numerical values of xi , n i would be acceptable. In the box inversions of Chapter 1 (to be revisited in Chapter 5), solution velocities of order 1000 cm/s might be regarded as absurd, and their absurdity is readily asserted by choosing Rx x = diag(10cm/s)2 , which reflects a mild belief that velocities are 0(10cm/s) with no known correlations with each other. Testing of statistical estimates against prior hypotheses is a highly developed field in applied statistics, and we leave it to the references already listed for their discussion. Should such tests be failed, one must reject the solutions x˜ , n˜ and ask why they failed – as it usually implies an incorrect model, E, and the assumed statistics of solution and/or noise. Example The underdetermined system 1 1 1 1 1 x+n= , 1 −1 −1 1 −1 with noise variance nnT = 0.01I, has a solution, if Rx x = I, of x˜ = ET (EET + 0.01I)−1 y = [0
0.4988
0.4988
0]T ,
n˜ = [0.0025, −0.0025]T . If the solution was thought to be covariance ⎧ 1 ⎪ ⎪ ⎨ 0.999 Rx x = 0.998 ⎪ ⎪ ⎩ 0.997
large scale and smooth, one might use the 0.999 1 0.999 0.998
0.998 0.999 1 0.999
⎫ 0.997⎪ ⎪ ⎬ 0.998 , 0.999⎪ ⎪ ⎭ 1
Basic machinery
132
which produces a solution x˜ = [0.2402 ± 0.028 0.2595 ± 0.0264 0.2595 ± 0.0264 0.2402 ± 0.0283]T , n˜ = [0.0006
−0.9615]T ,
having the desired large-scale property. (One might worry a bit about the structure of the residuals, but two equations are inadequate to draw any conclusions.)
2.7.4 Use of basis functions A superficially different way of dealing with prior statistical information is often commonly used. Suppose that the indices of xi refer to a spatial or temporal position, call it ri , so that xi = x(ri ). Then it is often sensible to consider expanding the unknown x in a set of basis functions, F j , for example, sines and cosines, Chebyschev polynomials, ordinary polynomials, etc. One might write x(ri ) =
L
α j F j (ri ),
j=1
or
x = Fα,
⎧ F1 (r1 ) ⎪ ⎪ ⎨ F1 (r2 ) F= · ⎪ ⎪ ⎩ F1 (r N )
F2 (r1 ) F2 (r2 ) · F2 (r N )
⎫ FL (r1 ) ⎪ ⎪ ⎬ FL (r2 ) , · ⎪ ⎪ ⎭ FL (r N )
··· ··· · ···
α = [α 1 · · · α L ]T ,
which, when substituted into (2.87), produces Tα + n = y,
T = EF.
(2.409)
If L < M < N , one can convert an underdetermined system into one which is formally overdetermined and, of course, the reverse is possible as well. It should be apparent, however, that the solution to (2.409) will have a covariance structure dictated in large part by that contained in the basis functions chosen, and thus there is no fundamental gain in employing basis functions, although they may be convenient, numerically or otherwise. If Pαα denotes the uncertainty of α, then P = FPαα FT ,
(2.410)
is the uncertainty of x˜ . If there are special conditions applying to x, such as boundary conditions at certain positions, r B , a choice of basis function satisfying those conditions could be more convenient than appending them as additional equations.
2.7 Minimum variance estimation and simultaneous equations
133
Example If, in the last example, one attempts a solution as a first-order polynomial, xi = a + bri ,
r1 = 0, r2 = 1, r3 = 2, . . . ,
the system will become two equations in the two unknowns a, b: a 4 6 a 1 EF = +n= , b 0 0 b −1 and if no prior information about the covariance of a, b is provided, ˜ = [0.0769, 0.1154], ˜ b] [a, % $ x˜ = 0.0769 ± 0.0077 0.1923 ± 0.0192 0.3076 ± 0.0308 0.4230 ± 0.0423 T , n˜ = [0.0002, −1.00]T , which is also large scale and smooth, but different than that obtained above. Although this latter solution has been obtained from a just-determined system, it is not clearly “better.” If a linear trend is expected in the solution, then the polynomial expansion is certainly convenient – although such a structure can be imposed through use of Rx x by specifying a growing variance with ri .
2.7.5 Determining a mean value Let the measurements of the physical quantity continue to be denoted yi and suppose that each is made up of an unknown large-scale mean, m, plus a deviation from that mean, θ i . Then m + θ i = yi ,
i = 1, 2, . . . , M,
(2.411)
or Dm + θ = y,
D = [1
1
1
···
1]T ,
(2.412)
˜ of m. In (2.411) or (2.412) the unknown x has and we seek a best estimate, m, become the scalar m, and the deviation of the field from its mean is the noise, that is, θ ≡ n, whose true mean is zero. The problem is evidently a special case of the use of basis functions, in which only one function – a zeroth-order polynomial, m, is retained. Set Rnn = θθ T . If we were estimating a large-scale mean temperature in a fluid flow filled with smaller-scale eddies, then Rnn is the sum of the covariance of the eddy field plus that of observational errors and any other fields contributing to the difference between yi and the true mean m. To be general, suppose Rx x = m 2 =
Basic machinery
134
m 20 , and that, from (2.406), m˜ = =
−1
1 + DT R−1 nn D m 20 1
DT R−1 nn y (2.413)
1/m 20 + DT R−1 nn D
D
T
R−1 nn y,
48 where DT R−1 nn D is a scalar. The expected uncertainty of this estimate is (2.408), −1 1 1 T −1 + D Rnn D = , (2.414) P= 2 2 m0 1/m 0 + DT R−1 nn D
(also a scalar). The estimates may appear somewhat unfamiliar; they reduce to more common expressions in certain limits. Let the θ i be uncorrelated, with uniform variance σ 2 ; Rnn is then diagonal and (2.413) is M M
m 20 1 yi = 2 yi , σ + Mm 20 i=1 1/m 20 + M/σ 2 σ 2 i=1
m˜ =
where the relations DT D = M, DT y = the estimate is
M i=1
(2.415)
yi were used. The expected value of
M
m 20 m 20 ˜ = 2
m
y = Mm = m, i σ + Mm 20 i σ 2 + Mm 20
(2.416)
that is, it is biassed, as inferred above, unless yi = 0, implying m = 0. P becomes P=
σ 2 m 20 1 = . 1/m 20 + M/σ 2 σ 2 + Mm 20
(2.417)
Under the further assumption that m 20 → ∞, m˜ =
M 1
yi , M i=1
P = σ 2 /M,
(2.418) (2.419)
which are the ordinary average and its variance (the latter expression is the wellknown “square root of M rule” for the standard deviation of an average; recall ˜ in (2.418) is readily seen to be the true mean – this estimate Eq. (2.42)); m has become unbiassed. However, the magnitude of (2.419) always exceeds that of (2.417) – acceptance of bias in the estimate (2.415) reduces the uncertainty of the result – a common trade-off.
2.7 Minimum variance estimation and simultaneous equations
y(t )
0
135
(a)
−20
−40
0
100
200
300
400
500
Time
R(t)
600
(b)
500 400 300
0
100
200
300 t = |t − t ′|
400
500
Figure 2.17 (a) Time series y(t) whose mean is required. (b) The autocovariance
y(t)y(t ) as a function of |t − t | (in this special case, it does not depend upon t, t separately.) The true mean of y(t) is zero by construction.
Equations (2.413) and (2.414) are the more general estimation rule – accounting through Rnn for correlations in the observations and their irregular distribution. Because samples may not be independent, (2.419) may be extremely optimistic. Equation (2.414) gives one the appropriate expression for the variance when the data are correlated (that is, when there are fewer degrees of freedom than the number of sample points). Example The mean is needed for the M = 500 values of the measured time series, shown in Fig. 2.17. If one calculates the ordinary average, m˜ = −20.0, the standard error, treating the measurements as uncorrelated, by Eq. (2.419) is ±0.31. If, on the other hand, one uses the covariance function displayed in Fig. 2.17, and Eqs. (2.413) and (2.414) with m 20 → ∞, one obtains m˜ = −23.7, with a standard error of ±20. The true mean of the time series is actually zero (it was generated that way), and one sees the dire effects of assuming uncorrelated measurement noise, when the correlation is actually very strong. Within two standard deviations (a so-called 95% confidence interval for the mean, if n is Gaussian), one finds, correctly, that the sample mean is indistinguishable from zero, whereas the mean assuming uncorrelated noise would appear to be very well determined and markedly different from zero.49 (One might be tempted to apply a transformation to render the observations uncorrelated before averaging, and so treat the result as having M degrees-of-freedom. But recall, e.g., that for Gaussian variables (p. 39),
136
Basic machinery
the resulting numbers will have different variances, and one would be averaging apples and oranges.) The use of the prior estimate, m 20 , is interesting. Letting m 20 go to infinity does not mean that an infinite mean is expected (Eq. (2.418) is finite). It is is merely a statement that there is no information whatsoever, before we start, as to the magnitude of the true average – it could be arbitrarily large (or small and of either sign) and if it came out that way, it would be acceptable. Such a situation is, of course, unlikely and even though we might choose not to use information concerning the probable size of the solution, we should remain aware that we could do so (the importance of the prior estimate diminishes as M grows – so that with an infinite amount of data it has no effect at all on the estimate). If a prior estimate of m itself is available, rather than just its mean square, the problem should be reformulated as one for the estimate of the perturbation about this value. It is very important not to be tempted into making a first estimate of m 20 by using (2.418), substituting into (2.415), thinking to reduce the error variance. For the Gauss–Markov theorem to be valid, the prior information must be truly independent of the data being used.
2.8 Improving recursively 2.8.1 Least-squares A common situation arises that one has a solution x˜ , n, ˜ P, and more information becomes available, often in the form of further noisy linear constraints. One way of using the new information is to combine the old and new equations into one larger system, and re-solve. This approach may well be the best one. Sometimes, however, perhaps because the earlier equations have been discarded, or for reasons of storage or both, one prefers to retain the information from the previous solution without having to re-solve the entire system. So-called recursive methods, in both least-squares and minimum variance estimation, provide the appropriate recipes. Let the original equations be re-labeled so that we can distinguish them from those that come later, in the form E(1)x + n(1) = y(1),
(2.420)
where the noise n(1) has zero mean and covariance matrix Rnn (1). Let the estimate of the solution to (2.420) from one of the estimators be written as x˜ (1), with uncertainty P(1). To be specific, suppose (2.420) is full-rank overdetermined, and was solved using row-weighted least-squares, as x˜ (1) = [E(1)T Rnn (1)−1 E(1)]−1 E(1)T Rnn (1)−1 y(1),
(2.421)
2.8 Improving recursively
137
with corresponding P(1) (column weighting is redundant in the full-rank fullydetermined case). Some new observations, y(2), are obtained, with the error covariance of the new observations given by Rnn (2), so that the problem for the unknown x is E(1) n(1) y(1) x+ = , (2.422) E(2) n(2) y(2) where x is the same unknown. We assume n(2) = 0 and
n(1)n(2)T = 0,
(2.423)
that is, no correlation of the old and new measurement errors. A solution to (2.422) should give a better estimate of x than (2.420) alone, because more observations are available. It is sensible to row weight the concatenated set to Rnn (1)−T/2 E(1) Rnn (1)−T/2 n(1) Rnn (1)−T/2 y(1) x + = . (2.424) Rnn (2)−T/2 E(2) Rnn (2)−T/2 n(2) Rnn (2)−T/2 y(2) “Recursive weighted least-squares” seeks the solution to (2.424) without inverting the new, larger matrix, by taking advantage of the existing knowledge of x˜ (1), P(1) – however they might actually have been obtained. The objective function corresponding to finding the minimum weighted error norm in (2.424) is J = (y(1) − E(1)x)T Rnn (1)−1 (y(1) − E(1)x)
(2.425)
+ (y(2) − E(2)x)T Rnn (2)−1 (y(2) − E(2)x).
Taking the derivatives with respect to x, the normal equations produce a new solution, x˜ (2) = {E(1)T Rnn (1)−1 E(1) + E(2)T Rnn (2)−1 E(2)}−1
(2.426)
× {E(1)T Rnn (1)−1 y(1) + E(2)T Rnn (2)−1 y(2)}.
This is the result from the brute-force re-solution. But one can manipulate (2.426) into50 (see Appendix 3 to this chapter): x˜ (2) = x˜ (1) + P(1)E(2)T [E(2)P(1)E(2)T + Rnn (2)]−1 [y(2) − E(2)˜x(1)] = x˜ (1) + K(2)[y(2) − E(2)˜x(1)],
(2.427)
P(2) = P(1) − K(2)E(2)P(1),
(2.428) −1
K(2) = P(1)E(2) [E(2)P(1)E(2) + Rnn (2)] . T
T
(2.429)
An alternate form for P(2), found from the matrix inversion lemma, is P (2) = [P (1)−1 + E (2)T Rnn (2)−1 E (2)]−1 .
(2.430)
Basic machinery
138
A similar alternate for x˜ (2), involving different dimensions of the matrices to be inverted, is also available from the matrix inversion lemma, but is generally less useful. (In some large problems, however, matrix inversion can prove less onerous than matrix multiplication.) The solution (2.427) is just the least-squares solution to the full set, but rearranged after a bit of algebra. The original data, y(1), and coefficient matrix, E(1), have disappeared, to be replaced by the first solution, x˜ (1), and its uncertainty, P(1). That is to say, one need not retain the original data and E(1) for the new solution to be computed. Furthermore, because the new solution depends only upon x˜ (1), and P(1), the particular methodology originally employed for obtaining them is irrelevant: they might even have been obtained from an educated guess, or from some previous calculation of arbitrary complexity. If the initial set of equations (2.420) is actually underdetermined, and should it have been solved using the SVD, one must be careful that P(1) includes the estimated error owing to the missing nullspace. Otherwise, these elements would be assigned zero error variance, and the new data could never affect them. Similarly, the dimensionality and rank of E (2) is arbitrary, as long as the matrix inverse exists. Example Suppose we have a single measurement of a scalar, x, so that x + n (1) = y(1), n(1) = 0, n(1)2 = R(1). Then an estimate of x is x˜ (1) = y(1), with uncertainty P(1) = R(1). A second measurement then becomes available, x + n(2) = y(2), n(2) = 0, n(2)2 = R(2). By Eq. (2.427), an improved solution is x˜ (2) = y(1) + R(1)/(R(1) + R(2))(y(1) − y(2)), with uncertainty by Eq. (2.430), P (2) = 1/ (1/R (1) + 1/R(2)) = R(1)R(2)/(R (1) + R (2)). If R (1) = R (2) = R, we have x˜ (2) = (y (1) + y (2)) /2, P (2) = R/2. If there are M successive measurements all with the same error variance, R, one finds the last estimate is x˜ (M) = x˜ (M − 1) + R/ (M − 1) (R/ (M − 1) + R)−1 y (M) 1 = x˜ (M − 1) + y (M) M 1 (y (1) + y (2) + · · · + y (M)) , = M with uncertainty P (M) =
1 R = , ((M − 1) /R + 1/R) M
2.8 Improving recursively
139
the conventional average and its variance. Note that each new measurement is given a weight 1/M relative to the average, x˜ (M − 1), already computed from the previous M − 1 data points. The structure of the improved solution (2.427) is also interesting and suggestive. It is made up of two terms: the previous estimate plus a term proportional to the difference between the new observations y(2), and a prediction of what those observations should have been were the first estimate the wholly correct one and the new observations perfect. It thus has the form of a “predictor-corrector.” The difference between the prediction and the forecast can be called the “prediction error,” but recall there is observational noise in y(2). The new estimate is a weighted average of this difference and the prior estimate, with the weighting depending upon the details of the uncertainty of prior estimate and new data. The behavior of the updated estimate is worth understanding in various limits. For example, suppose the initial uncertainty estimate is diagonal, P(1) = 2 I. Then, K(2) = E(2)T [E(2)E(2)T + Rnn (2)/2 ]−1 .
(2.431)
If the new observations are extremely accurate, the norm of Rnn (2)/2 is small, and if the second set of observations is full-rank underdetermined, K(2) −→ E(2)T (E(2)E(2)T )−1 and x˜ (2) = x˜ (1) + E(2)T (E(2)E(2)T )−1 [y(2) − E(2)˜x(1)] = [I − E(2)T (E(2)E(2)T )−1 E(2)]˜x(1) + E(2)T (E(2)E(2)T )−1 y(2).
(2.432)
Now, [I − E(2)T (E(2)E(2)T )−1 E(2)] = I N − VVT = Qv QTv , where V is the fullrank singular vector matrix for E(2), and it spans the nullspace of E(2) (see Eq. (2.291)). The update thus replaces, in the first estimate, all the structures given perfectly by the second set of observations, but retains those structures from the first estimate about which the new observations say nothing – a sensible result. (Compare Eq. (2.427) with Eq. (2.360).) At the opposite extreme, when the new observations are very noisy compared to the previous ones, Rnn /2 → ∞, K(2) → 0, and the first estimate is left unchanged. The general case represents a weighted average of the previous estimate with elements found from the new data, with the weighting depending both upon the relative noise in each, and upon the structure of the observations relative to the structure of x as represented in P(1), Rnn (2), E (2) . The matrix being inverted in (2.429) is the sum of the measurement error covariance, Rnn (2), and the error covariance of the “forecast” E(2)˜x(1). To see this, let γ be the error component in x˜ (1) = x (1) + γ, which by definition has covariance γγ T = P(1). Then the expected covariance
Basic machinery
140
of the error of prediction is E(1)γγ T E (1)T = E(1)P(1)E(1)T , which appears in K(2). Because of the assumptions (2.423), and γ (1) x (1)T = 0, it follows that
y (1) (y (2) − E (2) x˜ (1)) = 0.
(2.433)
That is, the prediction error or “innovation,” y (2) − E(2)˜x(1), is uncorrelated with the previous measurement. The possibility of a recursion based on Eqs. (2.427) and (2.428) (or (2.430)) is obvious – all subscript 1 variables being replaced by subscript 2 variables, which in turn are replaced by subscript 3 variables, etc. The general form would be: x˜ (m) = x˜ (m − 1) + K(m)[y(m) − E(m)˜x(m − 1)],
(2.434) −1
K(m) = P(m − 1)E(m) [E(m)P(m − 1)E(m) + Rnn (m)] ,
(2.435)
P(m) = P(m − 1) − K(m)E(m)P(m − 1),
(2.436)
T
T
m = 1, 2, . . . ,
where m conventionally starts with 0. An alternative form for Eq. (2.436) is, from (2.430), P (m) = [P (m − 1)−1 + E (m)T Rnn (m)−1 E (m)]−1 .
(2.437)
The computational load of the recursive solution needs to be addressed. A leastsquares solution does not require one to calculate the uncertainty P (although the utility of x˜ without such an estimate is unclear). But to use the recursive form, one must have P (m − 1) , otherwise the update step, Eq. (2.434) cannot be used. In very large problems, such as appear in oceanography and meteorology (Chapter 6), the computation of the uncertainty, from (2.436) or (2.437), can become prohibitive. In such a situation, one might simply store all the data, and do one large calculation – if this is feasible. Normally, it will involve less pure computation than will the recursive solution which must repeatedly update P(m). The comparatively simple interpretation of the recursive, weighted least-squares problem will be used in Chapter 4 to derive the Kalman filter and suboptimal filters in a very simple form. It also becomes the key to understanding “assimilation” schemes such as “nudging,” “forcing to climatology,” and “robust diagnostic” methods.
2.8.2 Minimum variance recursive estimates The recursive least-squares result is identical to a recursive estimation procedure, if appropriate least-squares weight matrices were used. To see this result, suppose there exist two independent estimates of an unknown vector x, denoted x˜ a , x˜ b with estimated uncertainties Pa , Pb , respectively. They are either unbiassed, or have the
2.8 Improving recursively
141
same mean, ˜xa = ˜xb = x B . How should the two be combined to give a third estimate x˜ + with minimum error variance? Try a linear combination, x˜ + = La x˜ a + Lb x˜ b .
(2.438)
If the new estimate is to be unbiassed, or is to retain the prior bias (that is, the same mean), it follows that,
˜x+ = La ˜xa + Lb ˜xb ,
(2.439)
x B = La x B + Lb x B ,
(2.440)
Lb = I − La .
(2.441)
or
or
Then the uncertainty is P+ = (˜x+ − x)(˜x+ − x)T = (La x˜ a + (I − La )˜xb )(La x˜ a + (I − La )˜xb )T (2.442) = La Pa LaT + (I − La )Pb (I − La )T , where the independence assumption has been used to set (˜xa − x) (˜xb − x) = 0. P+ is positive definite; minimizing its diagonal elements with respect to La yields (after writing out the diagonal elements of the products) La = Pb (Pa + Pb )−1 ,
Lb = Pa (Pa + Pb )−1 .
(Blithely differentiating and setting to zero produces the correct answer: + ∂P ∂(diag P+ ) = diag [2Pa La − Pb (I − La )] = 0, = diag ∂La ∂La or La = Pb (Pa + Pb )−1 .) The new combined estimate is x˜ + = Pb (Pa + Pb )−1 x˜ a + Pa (Pa + Pb )−1 x˜ b .
(2.443)
This last expression can be rewritten by adding and subtracting x˜ a as x˜ + = x˜ a + Pb (Pa + Pb )−1 x˜ a + Pa (Pa + Pb )−1 x˜ b − (Pa + Pb )(Pa + Pb )−1 x˜ a = x˜ a + Pa (Pa + Pb )−1 (˜xb − x˜ a ).
(2.444)
Notice, in particular, the re-appearance of a predictor-corrector form relative to x˜ a .
142
Basic machinery
The uncertainty of the estimate (2.444) is easily evaluated as P+ = Pa − Pa (Pa + Pb )−1 Pa .
(2.445)
or, by straightforward application of the matrix inversion lemma, −1 . P+ = Pa−1 + P−1 b
(2.446)
The uncertainty is again independent of the observations. Equations (2.444)–(2.446) are the general rules for combining two estimates with uncorrelated errors. Now suppose that x˜ a and its uncertainty are known, but that instead of x˜ b there are measurements, E(2)x + n(2) = y(2),
(2.447)
with n(2) = 0, n(2)n(2)T = Rnn (2). From this second set of observations, we estimate the solution, using the minimum variance estimator (2.406, 2.408) with no use of the solution variance; that is, let R−1 x x → 0. The reason for suppressing Rx x , which logically could come from Pa , is to maintain the independence of the previous and the new estimates. Then x˜ b = [E(2)T Rnn (2)−1 E(2)]−1 E(2)T Rnn (2)−1 y(2), −1
−1
Pb = [E(2) Rnn (2) E(2)] . T
(2.448) (2.449)
Substituting (2.448), (2.449) into (2.444), (2.445), and using the matrix inversion lemma, (see Appendix 3 to this chapter) gives x˜ + = x˜ a + Pa E(2)T [E(2)Pa E(2)T + Rnn (2)]−1 (y(2) − E(2)˜xa ),
(2.450)
P+ = [Pa−1 + E(2)T Rnn (2)−1 E(2)]−1 ,
(2.451)
which is the same as (2.434), (2.437) and thus a recursive minimum variance estimate coincides with a corresponding weighted least-squares recursion. The new covariance may also be confirmed to be that in either of Eqs. (2.436) or (2.437). Notice that if x˜ a was itself estimated from an earlier set of observations, then those data have disappeared from the problem, with all the information derived from them contained in x˜ a and Pa . Thus, again, earlier data can be wholly discarded after use. It does not matter where x˜ a originated, whether from over- or underdetermined equations or a pure guess – as long as Pa is realistic. Similarly, expression (2.450) remains valid whatever the dimensionality or rank of E (2) as long as the inverse matrix exists. The general implementation of this sequence for a continuing data stream corresponds to Eqs. (2.434)–(2.437).
2.9 Summary
143
2.9 Summary This chapter has not exhausted the possibilities for inverse methods, and the techniques will be extended in several directions in the next chapters. Given the lengthy nature of the discussion so far, however, some summary of what has been accomplished may be helpful. The focus is on making inferences about parameters or fields, x, n satisfying linear relationships of the form Ex + n = y. Such equations arise as we have seen, from both “forward” and “inverse” problems, but the techniques for estimating x, n and their uncertainty are useful whatever the physical origin of the equations. Two methods for estimating x, n have been the focus of the chapter: least-squares (including the singular value decomposition) and the Gauss–Markov or minimum variance technique. Least-squares, in any of its many guises, is a very powerful method – but its power and ease of use have (judging from the published literature) led many investigators into serious confusion about what they are doing. This confusion is compounded by the misunderstandings about the difference between an inverse problem and an inverse method. An attempt is made therefore, to emphasize the two distinct roles of least-squares: as a method of approximation, and as a method of estimation. It is only in the second formulation that it can be regarded as an inverse method. A working definition of an inverse method is a technique able to estimate unknown parameters or fields of a model, while producing an estimate of the uncertainties of the results. Solution plus uncertainty are the fundamental requirements. There are many desirable additional features of inverse methods. Among them are: (1) separation of nullspace uncertainties from observational noise uncertainties; (2) the ability to rank the data in its importance to the solution; (3) the ability to use prior statistical knowledge; (4) understanding of solution structures in terms of data structure; (5) the ability to trade resolution against variance. (The list is not exhaustive. For example, we will briefly examine in Chapter 3 the use of inequality information.) As with all estimation methods, one also trades computational load against the need for information. (The SVD, for example, is a powerful form of least-squares, but requires more computation than do other forms.) The Gauss–Markov approach has the strength of forcing explicit use of prior statistical information and is directed at the central goal of obtaining x, n with the smallest mean-square error, and for this reason might well be regarded as the default methodology for linear inverse problems. It has the added advantage that we know we can obtain precisely the same result with appropriate versions of least-squares, including the SVD, permitting the use of least-squares algorithms, but at the risk of losing sight of what we are actually
144
Basic machinery
attempting. A limitation is that the underlying probability densities of solution and noise have to be unimodal (so that a minimum variance estimate makes sense). If unimodality fails, one must look to other methods. The heavy emphasis here on noise and uncertainty may appear to be a tedious business. But readers of the scientific literature will come to recognize how qualitative much of the discussion is – the investigator telling a story about what he or she thinks is going on with no estimate of uncertainties, and no attempt to resolve quantitatively differences with previous competing estimates. In a quantitative subject, such vagueness is ultimately intolerable. A number of different procedures for producing estimates of the solution to a set of noisy simultaneous equations of arbitrary dimension have been described here. The reader may wonder which of the variants makes the most sense to use in practice. Because, in the presence of noise one is dealing with a statistical estimation problem, there is no single “best” answer, and one must be guided by model context and goals. A few general remarks might be helpful. In any problem where data are to be used to make inferences about physical parameters, one typically needs some approximate idea of just how large the solution is likely to be and how large the residuals probably are. In this nearly agnostic case, where almost nothing else is known and the problem is very large, the weighted, tapered least-squares solution is a good first choice – it is easily and efficiently computed and coincides with the Gauss–Markov and tapered SVD solutions, if the weight matrices are the appropriate covariances. Sparse matrix methods exist for its solution should that be necessary.51 Coincidence with the Gauss–Markov solution means one can reinterpret it as a minimum-variance or maximum-likelihood solution (see Appendix 1 to this chapter) should one wish. It is a comparatively easy matter to vary the trade-off parameter, γ 2 , to explore the consequences of any errors in specifying the noise and solution variances. Once a value for γ 2 is chosen, the tapered SVD can then be computed to understand the relationships between solution and data structures, their resolution and their variance. For problems of small to moderate size (the meaning of “moderate” is constantly shifting, but it is difficult to examine and interpret matrices of more than about 500 × 500), the SVD, whether in the truncated or tapered forms is probably the method of choice – because it provides the fullest information about data and its relationship to the solution. Its only disadvantages are that one can easily be overwhelmed by the available information, particularly if a range of solutions must be examined. The SVD has a flexibility beyond even what we have discussed – one could, for example, change the degree of tapering in each of the terms of (2.338) and (2.339) should there be reason to repartition the variance between solution and noise, or some terms could be dropped out of the truncated form at will – should the investigator know enough to justify it.
Appendix 1. Maximum likelihood
145
To the extent that either or both of x, n have expected structures expressible through covariance matrices, these structures can be removed from the problem through the various weight matrix and/or the Cholesky decomposition. The resulting problem is then one in completely unstructured (equivalent to white noise) elements x, n. In the resulting scaled and rotated systems, one can use the simplest of all objective functions. Covariance, resolution, etc., in the original spaces of x, n is readily recovered by appropriately applying the weight matrices to the results of the scaled/rotated space. Both ordinary weighted least-squares and the SVD applied to row- and columnweighted equations are best thought of as approximation, rather than estimation, methods. In particular, the truncated SVD does not produce a minimum variance estimate the way the tapered version can. The tapered SVD (along with the Gauss– Markov estimate, or the tapered least-squares solutions) produce the minimum variance property by tolerating a bias in the solution. Whether the bias is more desirable than a larger uncertainty is a decision the user must make. But the reader is warned against the belief that there is any single best method.
Appendix 1. Maximum likelihood The estimation procedures used in this book are based primarily upon the idea of minimizing the variance of the estimate about the true value. Alternatives exist. For example, given a set of observations with known joint probability density, one can use a principle of “maximum likelihood.” This very general and powerful principle attempts to find those estimated parameters that render the actual observations the most likely to have occurred. By way of motivation, consider the simple case of uncorrelated jointly normal stationary time series, xi , where
xi = m, (xi − m) (x j − m) = σ 2 δ i j . The corresponding joint probability density for x = [x1 , x2 , . . . , x N ]T can be written as px (X) =
1
(2.452) σN 1 2 2 2 × exp − 2 [(X 1 − m) + (X 2 − m) + · · · + (X N − m) ] . 2σ (2π)
N /2
Substitution of the observed values, X 1 = x1 , X 2 = x2 , . . . , into Eq. (2.452) permits evaluation of the probability that these particular values occurred. Denote the corresponding probability density as L . One can demand those values of m, σ rendering the value of L as large as possible. L will be a maximum if log (L) is as
Basic machinery
146
large as possible: that is, we seek to maximize 1 [(x1 − m)2 + (x2 − m)2 + · · · + (x N − m)2 ] 2σ 2 N + N log (σ ) + log(2π), 2
log (L) = −
with respect to m, σ . Setting the corresponding partial derivatives to zero and solving produces m˜ =
M N 1
1
˜ 2. xi , σ˜ 2 = (xi − m) N i=1 N i=1
That is, the usual sample mean, and biassed sample variance maximize the probability of the observed data actually occurring. A similar calculation is readily carried out using correlated normal, or any random variables with a different probability density. Likelihood estimation, and its close cousin, Bayesian methods, are general powerful estimation methods that can be used as an alternative to almost everything covered in this book.52 Some will prefer that route, but the methods used here are adequate for a wide range of problems.
Appendix 2. Differential operators and Green functions Adjoints appear prominently in the theory of differential operators and are usually discussed independently of any optimization problem. Many of the concepts are those used in defining Green functions. Suppose we want to solve an ordinary differential equation, du (ξ ) d2 u (ξ ) + = ρ(ξ ), dξ dξ 2
(2.453)
subject to boundary conditions on u (ξ ) at ξ = 0, L . To proceed, seek first a solution to α
∂v(ξ , ξ 0 ) ∂ 2 v(ξ , ξ 0 ) + = δ(ξ 0 − ξ ), ∂ξ ∂ξ 2
(2.454)
where α is arbitrary for the time being. Multiply (2.453) by v, and (2.454) by u, and subtract: v(ξ , ξ 0 )
∂ 2 v(ξ , ξ 0 ) du(ξ ) d2 u(ξ ) ∂v(ξ , ξ 0 ) (ξ ) + v(ξ , ξ 0 ) − u − u(ξ )α dξ ∂ξ dξ 2 ∂ξ 2
= v(ξ , ξ 0 )ρ (ξ ) − u(ξ )v(ξ , ξ 0 ).
(2.455)
Appendix 2. Differential operators and Green functions
147
Integrate this last equation over the domain, L ∂ 2 v(ξ , ξ 0 ) du (ξ ) d2 u (ξ ) ∂v(ξ , ξ 0 ) + v(ξ , ξ 0 ) − u (ξ ) v(ξ , ξ 0 ) dξ − u(ξ )α dξ ∂ξ dξ 2 ∂ξ 2 0 (2.456) L {v(ξ , ξ 0 )ρ(ξ ) − u (ξ ) δ(ξ 0 − ξ )}dξ , (2.457) = 0
or
L
d 0 dξ = 0
L 2 ∂ v(ξ , ξ 0 ) du dv ∂ 2 v(ξ , ξ 0 ) u dξ v − αu dξ + −u dξ dξ ∂ξ 2 ∂ξ 2 0
L
v(ξ , ξ 0 )ρ (ξ ) dξ − u(ξ 0 ).
(2.458)
If we choose α = −1, then the first term on the left-hand side is integrable, as L d {uv} dξ = uv|0L , (2.459) 0 dξ as is the second term on the left, L ∂v ∂u ∂v ∂u L d u −v ∂ξ = u −v , ∂ξ ∂ξ ∂ξ ∂ξ 0 0 dξ
(2.460)
and thus, u(ξ 0 ) =
L 0
v(ξ , ξ 0 )ρ (ξ ) dξ +
uv|0L
du dv −v + u dξ dξ
L .
(2.461)
0
Because the boundary conditions on v were not specified, we are free to choose them such that v = 0 on ξ = 0, L, e.g., the boundary terms reduce simply to [udv/dξ ]0L , which is then known. Here, v is the adjoint solution to Eq. (2.454), with α = −1, defining the adjoint equation to (2.453); it was found by requiring that the terms on the left-hand side of Eq. (2.458) should be exactly integrable. v is also the problem Green function (although the Green function is sometimes defined so as to satisfy the forward operator, rather than the adjoint one). Textbooks show that for a general differential operator, L, the requirement that v should render the analogous terms integrable is that u T Lv = v T LT u,
(2.462)
where here the superscript T denotes the adjoint. Equation (2.462) defines the adjoint operator (compare to (2.376)).
Basic machinery
148
Appendix 3. Recursive least-squares and Gauss–Markov solutions The recursive least-squares solution Eq. (2.427) is appealingly simple. Unfortunately, obtaining it from the concatenated least-squares form (2.426), x˜ (2) = {E (1)T Rnn (1)−1 E (1) + E (2)T Rnn (2)−1 E (2)}−1 × {E (1)T Rnn (1)−1 y (1) + E (2)T Rnn (2)−1 y} is not easy. First note that x˜ (1) = [E (1)T Rnn (1)−1 E (1)]−1 E (1)T Rnn (1)−1 y (1)
(2.463)
= P(1)E(1)T Rnn (1)−1 y(1), where P (1) = [E (1)T Rnn (1)−1 E (1)]−1 , are the solution and uncertainty of the overdetermined system from the first set of observations alone. Then x˜ (2) = {P (1)−1 + E(2)T Rnn (2)−1 E(2)}−1 × {E(1)T Rnn (1)−1 y(1) + E(2)T Rnn (2)−1 y(2)}. Apply the matrix inversion lemma, in the form Eq. (2.35), to the first bracket (using C → P (1)−1 , B → E (2) , A → Rnn (2)), x˜ (2) = {P(1) − P(1)E(2)T [E(2)P(1)E(2)T + Rnn (2)]−1 E(2)P(1)} × {E(1)T Rnn (1)−1 y(1)} + {P(1) − P(1)E(2)T [E(2)P(1)E(2)T + Rnn (2)]−1 E(2)P(1)}{E(2)T Rnn (2)−1 y(2)} = x˜ (1) − P(1)E(2)T [E(2)P(1)E(2)T + Rnn (2)]−1 E(2)˜x(1) + P(1)E(2)T {I − [E(2)P(1)E(2)T + Rnn (2)]−1 E(2)P(1)E(2)T } × Rnn (2)−1 y(2), using (2.463) and factoring E (2)T in the last line. Using the identity [E (2) P (1) E (2)T + Rnn (2)]−1 [E (2) P (1) E (2)T + Rnn (2)] = I, and substituting for I in the previous expression, factoring, and collecting terms, gives x˜ (2) = x˜ (1) + P(1)E(2)T [E(2)P(1)E(2)T + Rnn (2)]−1 [y(2) − E(2)˜x(1)], (2.464) which is the desired expression. The new uncertainty is given by (2.428) or (2.430).
Notes
149
Manipulation of the recursive Gauss–Markov solution (2.443) or (2.444) is similar, involving repeated use of the matrix inversion lemma. Consider Eq. (2.443) with xb from Eq. (2.448), −1 $ %−1 x˜ + = E (2)T R−1 Pa + E (2)T R−1 x˜ a nn E (2) nn E (2) $ %−1 T −1 T + Pa Pa + E (2) Rnn E (2) (E (2) Rnn E (2))−1 E (2)T R−1 nn y(2). −1 Using Eq. (2.36) on the first term (with A →(E(2)T R−1 nn E(2)) , B → I, C → Pa ), −1 and on the second term with C →(E(2)Rnn E(2)), A → Pa , B → I, this last expression becomes $ %−1 $ −1 % Pa x˜ a + E (2)T R−1 x˜ + = Pa−1 + E (2)T R−1 nn E (2) nn y (2) ,
yet another alternate form. By further application of the matrix inversion lemma,53 this last expression can be manipulated into Eq. (2.450), which is necessarily the same as (2.464). These expressions have been derived assuming that matrices such as E (2)T R−1 nn E (2) are non-singular (full-rank overdetermined). If they are singular, they can be inverted using a generalized inverse, but taking care that P(1) includes the nullspace contribution (e.g., from Eq. (2.272)). Notes 1 Noble and Daniel (1977), Strang (1988). 2 Lawson and Hanson (1995). 3 “Positive definite” will be defined below. Here it suffices to mean that cT Wc should never be negative, for any c. 4 Golub and Van Loan (1996). 5 Haykin (2002). 6 Lawson and Hanson (1995), Golub and van Loan (1996), Press et al. (1996), etc. 7 Determinants are used only rarely in this book. Their definition and properties are left to the references, as they are usually encountered in high school mathematics. 8 Rogers (1980) is an entire volume of matrix derivative identities, and many other useful properties are discussed by Magnus and Neudecker (1988). 9 Magnus and Neudecker (1988, p. 183). 10 Liebelt (1967, Sections 1–19). 11 The history of this not-very-obvious identity is discussed by Haykin (2002). 12 A good statistics text such as Cram´er (1946), or one on regression such as Seber and Lee (2003), should be consulted. 13 Feller (1957) and Jeffreys (1961) represent differing philosophies. Jaynes (2003) forcefully and colorfully argues the case for so-called Bayesian inference (following Jeffreys), and it seems likely that this approach to statistical inference will ultimately become the default method; Gauch (2003) has a particularly clear account of Bayesian methods. For most of the methods in this book, however, we use little more than the first moments of probability distributions, and hence can ignore the underlying philosophical debate. 14 It follows from the Cauchy–Schwarz inequality: Consider (ax + y )2 = a x 2 + y 2 + 2a x y ≥ 0 for any constant a. Choose a = − x y / x 2 , and one has − x y 2 / x 2 + y 2 ≥ 0, or 1 ≥ x y 2 /( x 2 y 2 ). Taking the square root of both sides, the required result follows. 15 Draper and Smith (1998), Seber and Lee (2003).
Basic machinery
150 1/2
16 Numerical schemes for finding Cξ ξ are described by Lawson and Hanson (1995) and Golub and Van Loan (1996) 17 Cram´er (1946) discusses what happens when the determinant of Cξ ξ vanishes, that is, if Cξ ξ is singular. 18 Bracewell (2000). 19 Cram´er (1946). 20 In problems involving time, one needs to be clear that “stationary” is not the same idea as “steady.” 21 If the means and variances are independent of i, j and the first cross-moment is dependent only upon |i − j|, the process x is said to be stationary in the “wide-sense.” If all higher moments also depend only on |i − j|, the process is said to be stationary in the “strict-sense,” or, more simply, just stationary. A Gaussian process has the unusual property that wide-sense stationarity implies strict-sense stationarity. 22 The terminology “least-squares” is reserved in this book, conventionally, for the minimization of discrete sums such as Eq. (2.89). This usage contrasts with that of Bennett (2002) who applies it .b to continuous integrals, such as a (u (q) − r (q))2 dq, leading to the calculus of variations and Euler–Lagrange equations. 23 Box et al. (1994), Draper and Smith (1998), or Seber and Lee (2003), are all good starting points. 24 Draper and Smith (1998, Chapter 3) and the references given there. 25 Gill et al. (1986). 26 Wunsch and Minster (1982). 27 Morse and Feshbach (1953, p. 238), Strang (1988). 28 See Sewell (1987) for an interesting discussion. 29 But the matrix transpose is not what the older literature calls the “adjoint matrix,” which is quite different. In the more recent literature the latter has been termed the “adjugate” matrix to avoid confusion. 30 In the meteorological terminology of Sasaki (1970) and others, exact relationships are called “strong” constraints, and those imposed in the mean-square are “weak” ones. 31 Claerbout (2001) displays more examples, and Lanczos (1961) gives a very general discussion of operators and their adjoints, Green functions, and their adjoints. See also the appendix to this chapter. 32 Wiggins (1972). 33 Brogan (1991) has a succinct discussion. 34 Lanczos (1961, pp. 117–18), sorts out the sign dependencies. 35 Lawson and Hanson (1995). 36 The singular value decomposition for arbitrary non-square matrices is apparently due to the physicist-turned-oceanographer Carl Eckart (Eckart and Young, 1939; see the discussion in Klema and Laub, 1980; Stewart, 1993; or Haykin, 2002). A particularly lucid account is given by Lanczos (1961) who, however, fails to give the decomposition a name. Other references are Noble and Daniel (1977), Strang (1988), and many recent books on applied linear algebra. The crucial role it plays in inverse methods appears to have been first noticed by Wiggins (1972). 37 Munk et al. (1995). 38 In physical oceanography, the distance would be that traveled by a ship between stops for measurement, and the water depth is clearly determined by the local topography. 39 Menke (1989). 40 Hansen (1992), or Lawson and Hanson (1995). Hansen’s (1992) discussion is particularly interesting because he exploits the “generalized SVD,” which is used to simultaneously diagonalize two matrices. 41 Munk and Wunsch (1982). 42 Seber and Lee (2003). 43 Luenberger (2003). 44 In oceanographic terms, the exact constraints describe the Stommel Gulf Stream solution. The eastward intensification of the adjoint solution corresponds to the change in sign of β in the adjoint model. See Schr¨oter and Wunsch (1986) for details and an elaboration to a non-linear situation.
Notes 45 46 47 48 49 50 51 52 53
151
Lanczos (1960) has a good discussion. See Lanczos (1961, Section 3.19). The derivation follows Liebelt (1967). Cf. Bretherton et al. (1976). The time series was generated as yt = 0.999yt−1 + θ t , θ t = 0, θ 2t = 1, a so-called autoregressive process of order 1 (AR(1)). The covariance yi y j can be determined analytically; see Priestley (1982, p. 119). Many geophysical processes obey similar rules. Stengel (1986); Brogan (1991). Paige and Saunders (1982). See especially, van Trees (2001). Liebelt (1967, p. 164).
3 Extensions of methods
In this chapter we extend and apply some of the methods developed in Chapter 2. The problems discussed there raise a number of issues concerning models and data that are difficult to address with the mathematical machinery already available. Among them are the introduction of left and right eigenvectors, model nonlinearity, the potential use of inequality constraints, and sampling adequacy.
3.1 The general eigenvector/eigenvalue problem To understand some recent work on so-called pseudospectra and some surprising recent results on fluid instability, it helps to review the more general eigenvector/ eigenvalue problem for arbitrary, square, matrices. Consider Egi = λi gi ,
i = 1, 2, . . . , N .
(3.1)
If there are no repeated eigenvalues, λi , then it is possible to show that there are always N independent gi , which are a basis, but which are not usually orthogonal. Because in most problems dealt with in this book, small perturbations can be made to the elements of E without creating any physical damage, it suffices here to assume that such perturbations can always ensure that there are N distinct λi . (Failure of the hypothesis leads to the Jordan form, which requires a somewhat tedious discussion.) A matrix E is “normal” if its eigenvectors form an orthonormal basis. Otherwise it is “non-normal.” (Any matrix of forms AAT , AT A, is necessarily normal.) Denote G ={gi }, Λ = diag(λi ). It follows immediately that E can be diagonalized: G−1 EG = Λ, but for a non-normal matrix, G−1 = GT . The general decomposition is E = GΛG−1 . 152
(3.2)
3.1 The general eigenvector/eigenvalue problem
153
Matrix ET has a different set of spanning eigenvectors, but the same eigenvalues: ET f j = λ j f j ,
j = 1, 2, . . . , N ,
(3.3)
which can be written fTj E =λ j fTj .
(3.4)
The gi are hence known as the “right eigenvectors,” and the fi as the “left eigenvectors.” Multiplying (3.1) on the left by fTj and (3.4) on the right by gi and subtracting shows 0 = (λi − λ j )fTj gi ,
(3.5)
fTj gi = 0,
(3.6)
or i = j.
That is to say, the left and right eigenvectors are orthogonal for different eigenvalues, but fTj g j = 0. (In general the eigenvectors and eigenvalues are complex even for purely real E. Note that some software automatically conjugates a transposed complex vector or matrix, and the derivation of (3.6) shows that it applies to the non-conjugated variables.) Consider now a “model,” Ax = b.
(3.7)
The norm of b is supposed bounded, b ≤ b, and the norm of x will be x =A−1 b.
(3.8)
What is the relationship of x to b? Let gi be the right eigenvectors of A. Write b=
N
β i gi ,
(3.9)
α i gi .
(3.10)
i=1
x=
N i=1
If the gi were orthogonal, |β i | ≤ b. But as they are not orthonormal, the β i will need to be found through a system of simultaneous equations (2.3) (recall the discussion in Chapter 2 of the expansion of an arbitrary vector in non-orthogonal vectors) and no simple bound on the β i is then possible; some may be very large. Substituting into (3.7), N i=1
α i λi gi =
N i=1
β i gi .
(3.11)
154
Extensions of methods
A term-by-term solution is evidently no longer possible because of the lack of orthogonality. But multiplying on the left by fTj , and invoking (3.6), produces α j λ j fTj g j = β j fTj g j ,
(3.12)
or α j = β j /λ j ,
λ j = 0.
(3.13)
Even if the λ j are all of the same order, the possibility that some of the β j are very large implies that eigenstructures in the solution may be much larger than b. This possibility becomes very interesting when we turn to time-dependent systems. At the moment, note that partial differential equations that are self-adjoint produce discretizations that have coefficient matrices A, such that AT = A. Thus self-adjoint systems have normal matrices, and the eigencomponents of the solution are all immediately bounded by b /λi . Non-self-adjoint systems produce nonnormal coefficient matrices and so can therefore unexpectedly generate very large eigenvector contributions. Example The equations ⎧ −0.32685 0.34133 ⎪ ⎪ ⎨ −4.0590 0.80114 −3.3601 0.36789 ⎪ ⎪ ⎩ −5.8710 1.0981
⎫ ⎡ ⎤ 0.69969 0.56619⎪ 0.29441 ⎪ ⎬ ⎢ −1.3362 ⎥ 3.0219 1.3683 ⎥ x =⎢ ⎣ 0.71432 ⎦, 2.6619 1.2135 ⎪ ⎪ ⎭ 3.9281 1.3676 1.6236
are readily solved by ordinary Gaussian elimination (or matrix inversion). If one attempts to use an eigenvector expansion, it is found from Eq. (3.1) (up to rounding errors) that λi = [2.3171, 2.2171, −0.1888, 0.1583]T , and the right eigenvectors, gi , corresponding to the eigenvalues are ⎧ ⎫ 0.32272 0.33385 0.20263 0.36466 ⎪ ⎪ ⎪ ⎪ ⎨ ⎬ 0.58335 0.58478 0.27357 −0.23032 G= , 0.46086 0.46057 0.53322 0.75938 ⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 0.58581 0.57832 −0.77446 −0.48715 and g1 ,2 are nearly parallel. If one expands the right-hand side, y, of the above equations in the gi , the coefficients, β = [25.2230, −24.7147, −3.3401, 2.9680]T , and the first two are markedly greater than any of the elements in y or in the gi . Note that, in general, such arbitrary matrices will have complex eigenvalues and eigenvectors (in conjugate pairs).
3.2 Sampling
155
3.2 Sampling In Chapter 2, on p. 129, we discussed the problem of making a uniformly gridded map from irregularly spaced observations. But not just any set of observations proves adequate to the purpose. The most fundamental problem generally arises under the topic of “sampling” and “sampling error.” This subject is a large and interesting one in its own right,1 and we can only outline the basic ideas. The simplest and most fundamental idea derives from consideration of a onedimensional continuous function, f (q), where q is an arbitrary independent variable, usually either time or space, and f (q) is supposed to be sampled uniformly at intervals, q, an infinite number of times to produce the infinite set of sample values { f (nq)}, −∞ ≤ n ≤ ∞, n integer. The sampling theorem, or sometimes the “Shannon–Whittaker sampling theorem”2 is a statement of the conditions under which f (q) should be reconstructible from the sample values. Let the Fourier transform of f (q) be defined as ∞ fˆ(r ) = f (q)e2iπrq dq, (3.14) −∞
and assumed to exist. The sampling theorem asserts that a necessary and sufficient condition to perfectly reconstruct f (q) from its samples is that | fˆ(r )| = 0,
|r | ≥ 1/(2q).
(3.15)
It produces the Shannon–Whittaker formula for the reconstruction f (q) =
∞ n=−∞
f (nq)
sin[(2π/2q)(q − nq)] . (2π/2q)(q − nq)
(3.16)
Mathematically, the Shannon–Whittaker result is surprising – because it provides a condition under which a function at an uncountable infinity of points – the continuous line – can be perfectly reconstructed from information known only at a countable infinity, nq, of them. For present purposes, an intuitive interpretation is all we seek and this is perhaps best done by considering a special case in which the conditions of the theorem are violated. Figure 3.1 displays an ordinary sinusoid whose Fourier transform can be represented as fˆ(r ) = 12 (δ(r − r0 ) − δ(r + r0 )),
(3.17)
which is sampled as depicted, and in violation of the sampling theorem (δ is the Dirac delta-function). It is quite clear that there is at least one more perfect sinusoid, the one depicted with the dashed line, which is completely consistent with all the sample points and which cannot be distinguished from it using the measurements
Extensions of methods
156
sin(2pt/8)
1
0
−1
0
50
100
150
t
Figure 3.1 Effects of undersampling a periodic function: solid curve is y (t) = sin (2π t/8) sampled at time intervals of t = 0.1. The dashed curve is the same function, but sampled at intervals t = 7. With this undersampling, the curve of frequency s = 1/8 time units is aliased into one that appears to have a frequency sa = 1/8 − 1/7 = 1/56 < 1/14. That is, the aliased curve appears to have a period of 56 time units.
alone. A little thought shows that the apparent frequency of this new sinusoid is ra = r0 ±
n , q
(3.18)
such that |ra | ≤
1 . 2q
(3.19)
The samples cannot distinguish the true high frequency sinusoid from this low frequency one, and the high frequency can be said to masquerade or “alias” as the lower frequency one.3 The Fourier transform of a sampled function is easily seen to be periodic with period 1/q in the transform domain, that is, in the r space.4 Because of this periodicity, there is no point in computing its values for frequencies outside |r | ≤ 1/2q (we make the convention that this “baseband,” i.e., the fundamental interval for computation, is symmetric about r = 0, over a distance 1/2q; see Fig. 3.2). Frequencies of absolute value larger than 1/2q, the so-called Nyquist frequency, cannot be distinguished from those in the baseband, and alias into it. Figure 3.2 shows a densely sampled, non-periodic function and its Fourier transform compared to that obtained from the undersampled version overlain. Undersampling is a very unforgiving practice. The consequences of aliasing range from the negligible to the disastrous. A simple example is that of the principal lunar tide, usually labeled M2 , with a period of 12.42 hours, r = 1.932 cycles/day. An observer measures the height of sea level at a fixed time, say 10 a.m. each day so that q = 1 day. Applying the formula (3.18), the apparent frequency of the tide will be 0.0676 cycles/day for a period of about 14.8 days (n = 2). To the extent that the observer understands what is going on, he or she will not conclude that the principal lunar tide has a period of
3.2 Sampling (a)
157 (b)
4
0.4 ))
subs
2
Rl(F(y)), Rl(F(y
y (nΔt ), y (7m Δt )
3
1 0 −1 −2
0
50
100
150
t
0.2 0 −0.2 −0.4
0.2
0.1 0 0.1 Cycles/unit time
0.2
Figure 3.2 (a) A non-periodic function sampled at intervals t = 0.1, and the same function sampled at intervals t = 7 time units. (b) The real part of the Fourier components of the two functions shown in (a). The subsampled function has a Fourier transform confined to |s| ≤ 1/(2.7) (dashed) while that of the original, more densely sampled, function (solid) extends to |s| ≤ 1/0.1 = 10, most of which is not displayed. The subsampled function has a very different Fourier transform from that of the original densely sampled one. Both transforms are periodic in frequency, s, with period equal to the width of their corresponding basebands. (This periodicity is suppressed in the plot.) Note in particular how erroneous an estimate of the temporal derivative of the undersampled function would be in comparison to that of the highly sampled one.
14.8 days, but will realize that the true period can be computed through (3.18) from the apparent one. But without that understanding, some bizarre theory might be produced.5 The reader should object that the Shannon–Whittaker theorem applies only to an infinite number of perfect samples and that one never has either perfect samples or an infinite number of them. In particular, it is true that if the duration of the data in the q domain is finite, then it is impossible for the Fourier transform to vanish over any finite interval, much less the infinite interval above the Nyquist frequency.6 Nonetheless, the rule of thumb that results from (3.16) has been found to be quite a good one. The deviations from the assumptions of the theorem are usually dealt with by asserting that sampling should be done so that q 1/2r0 .
(3.20)
Many extensions and variations of the sampling theorem exist – taking account of the finite time duration , the use of “burst-sampling” and known function derivatives, etc.7 Most of these variations are sensitive to noise. There are also extensions to multiple dimensions,8 which are required for mapmaking purposes. Because failure to acknowledge the possibility that a signal is undersampled is so dire, one concludes that consideration of sampling is critical to any discussion of field data.
Extensions of methods
158
3.2.1 One-dimensional interpolation Let there be two observations [y1 , y2 ]T = [x1 + n 1 , x2 + n 2 ]T located at positions [r1 , r2 ]T , where n i are the observation noise. We require an estimate of x(˜r ), where r1 < r˜ < r2 . The formula (3.16) is unusable – there are only two noisy observations, not an infinite number of perfect ones. We could try using linear interpolation: x˜ (˜r ) =
|r1 − r˜ | |r2 − r˜ | y(r1 ) + y(r2 ). |r2 − r1 | |r2 − r1 |
(3.21)
If there are N data points, ri , 1 = 1, 2, . . . , N , then another possibility is Aitken– Lagrange interpolation:9 x˜ (˜r ) =
N
l j (˜r )y j ,
(3.22)
j=1
l j (˜r ) =
(˜r − r1 ) · · · (˜r − r M ) . (r j − r1 ) · · · (r j − r j−1 )(r j − r j+1 ) · · · (r j − r M )
(3.23)
Equations (3.21)– (3.23) are only two of many possible interpolation formulas. When would one be better than the other? How good are the estimates? To answer these questions, let us take a different tack, and employ the Gauss–Markov theorem, assuming we know something about the necessary covariances. Suppose either x = n = 0 or that a known value has been removed from both (this just keeps our notation a bit simpler). Then Rx y (˜r , r j ) ≡ x(˜r )y(r j ) = x(˜r )(x(r j ) + n(r j )) = Rx x (˜r , r j ),
(3.24)
R yy (ri , r j ) ≡ (x(ri ) + n(ri ))(x(r j ) + n(r j ))
(3.25)
= Rx x (ri , r j ) + Rnn (ri , r j ),
(3.26)
where it has been assumed that x(r )n(q) = 0. From (2.396), the best linear interpolator is x˜ = By,
B(˜r , r) =
M
Rx x (˜r , r j ) {Rx x + Rnn }−1 ji ,
(3.27)
j=1
({Rx x + Rnn }−1 ji means the ji element of the inverse matrix) and the minimum possible error that results is P(˜rα , r˜β ) = Rx x (˜rα , r˜β ) −
M M j
Rx x (˜rα , r j ){Rx x + Rnn }−1 ji Rx x (ri , r˜β ),
i
(3.28) and n˜ = y − x˜ .
3.2 Sampling
159
3
(a)
x, y, xˆ
2 1 0 −1 −2
0
10
20
30
40
50
60
70
80
90
100
3
(b)
x, y, xˆ
2 1 0 −1 −2
0
10
20
30
40
50
60
70
80
90
100
r, rˆ
Figure 3.3 In both panels, the solid curve is the “true” curve, x(r ), from which noisy samples (denoted “x”) have been obtained. x (r ) was generated to have a true covariance S = exp(−r 2 /100), and the “data” values, y (ri ) = x (ri ) + n i , where n i = 0, n i n j = (1/4) δ i j , were generated from a Gaussian probability density. In (b), linear interpolation is used to generate the estimated values of x (r ) (dashed line). The estimates are identical to the observations at r = ri . In (a), objective mapping was used to make the estimates (dashed line). Note that x˜ (ri ) = y(ri ), and that an error bar is available – as plotted. The true values are generally within one standard deviation of the estimated value (but about 35% of the estimated values would be expected to lie outside the error bars), and the estimated value is within two standard deviations of the correct one everywhere. The errors in the estimates, x˜ (ri ) − x(ri ), are clearly spatially correlated, and can be inferred from Eq. (3.28) (not shown). The values of x(r ) were generated to have the inferred covariance S, by forming the matrix, S = toeplitz (S(ri , r j )), and obtaining its symmetric factorization, S = UUT , x (r ) = Uα, where the elements of α are pseudo-random numbers.
Results for both linear interpolation and objective mapping are shown in Fig. 3.3. Notice that, like other interpolation methods, the optimal one is a linear combination of the data. If any other set of weights B is chosen, then the interpolation is not as good as it could be in the mean-square error sense; the error of any such scheme
160
Extensions of methods
can be obtained by substituting it into (2.394) and evaluating the result (the true covariances still need to be known). Looking back now at the two familiar formulas in (3.21) and (3.22), it is clear what is happening: they represent a choice of B. Unless the covariance is such as to produce one of the two sets of weights as the optimum choice, neither Aitken– Lagrange nor linear (nor any other common choice, like a spline) is the best one could do. Alternatively, if any of (3.21)–(3.23) were thought to be the best one, they are equivalent to specifying the solution and noise covariances. If interpolation is done for two points, r˜α , r˜β , the error of the two estimates will usually be correlated, and represented by P(˜rα , r˜β ). Knowledge of the correlations between the errors in different interpolated points is often essential – for example, if one wishes to interpolate to uniformly spaced grid points so as to make estimates of derivatives of x. Such derivatives might be numerically meaningless if the mapping errors are small scale (relative to the grid spacing) and of large amplitude. But if the mapping errors are large scale compared to the grid, the derivatives may tend to remove the error and produce better estimates than for x itself. Both linear and Aitken–Lagrange weights will produce estimates that are exactly equal to the observed values if r˜α = r p , that is, on the data points. Such a result is characteristic of “true interpolation.” If no noise is present, then the observed value is the correct one to use at a data point. In contrast, the Gauss–Markov estimate will differ from the data values at the data points, because the estimator attempts to reduce the noise in the data by averaging over all observations, not just the one. The Gauss–Markov estimate is thus not a true interpolator; it is instead a “smoother.” One can recover true interpolation if Rnn → 0, although the matrix being inverted in (3.27) and (3.28) can become singular. The weights B can be fairly complicated if there is any structure in either of Rx x , Rnn . The estimator takes explicit account of the expected spatial structure of both x, n to weight the data in such a way as to most effectively “kill” the noise relative to the signal. One is guaranteed that no other linear filter can do better. If Rnn Rx x , x˜ → 0, manifesting the bias in the estimator; this bias was deliberately introduced so as to minimize the uncertainty (minimum variance about the true value). Thus, estimated values of zero-mean processes tend toward zero, particularly far from the data points. For this reason, it is common to use expressions such as (2.413) to first remove the mean, prior to mapping the residual, and re-adding the estimated mean at the end. The interpolated values of the residuals are nearly unbiassed, because their true mean is nearly zero. Rigorous estimates of P for this approach require some care, as the mapped residuals contain variances owing to the uncertainty of the estimated mean,10 but the corrections are commonly ignored. As we have seen, the addition of small positive numbers to the diagonal of a matrix usually renders it non-singular. In the formally noise-free case, Rnn → 0,
3.2 Sampling
161
and one has the prospect that Rx x by itself may be singular. To understand the meaning of this situation, consider the general case, involving both matrices. Then the symmetric form of the SVD of the sum of the two matrices is Rx x + Rnn = UUT .
(3.29)
If the sum covariance is positive definite, will be square with K = M and the inverse will exist. If the sum is not positive definite, but is only semi-definite, one or more of the singular values will vanish. The meaning is that there are possible structures in the data that have been assigned to neither the noise field nor the solution field. This situation is realistic only if one is truly confident that y does not contain such structures. In that case, the solution x˜ = Rx x (Rx x + Rnn )−1 y = Rx x (U−1 UT )y
(3.30)
will have components of the form 0/0, the denominator corresponding to the zero singular values and the numerator to the absent, impossible, structures of y. One can arrange that the ratio of these terms should be set to zero (e.g., by using the SVD). But such a delicate balance is not necessary. If one simply adds a small white noise covariance to Rx x + Rnn → Rx x + Rnn + 2 I, or Rx x → Rx x + 2 I, one is assured, by the discussion of tapering, that the result is no longer singular – all structures in the field are being assigned either to the noise or the solution (or in part to both). Anyone using a Gauss–Markov estimator to make maps must check that the result is consistent with the prior estimates of Rx x , Rnn . Such checks include determining whether the differences between the mapped values at the data points and the observed values have numerical values consistent with the assumed noise variance; a further check involves the sample autocovariance of n˜ and its test against Rnn (see books on regression for such tests). The mapped field should also have a variance and covariance consistent with the prior estimate Rx x . If these tests are not passed, the entire result should be rejected. 3.2.2 Higher-dimensional mapping We can now immediately write down the optimal interpolation formulas for an arbitrary distribution of data in two or more dimensions. Let the positions where data are measured be the set r j with measured value y(r j ), containing noise n. It is assumed that aliasing errors are unimportant. The mean value of the field is first estimated and subtracted from the measurements and we proceed as though the true mean were zero.11 As in the case where the positions are scalars, one minimizes the expected meansquare difference between the estimated and the true field x(˜rα ). The result is (3.27)
Extensions of methods
162 (a)
46 44
0 –0.87
42 Latitude
–0.01
0.86
40
4.3 7.7
1
2.7
38 8.2
36
6
7
6 6.5
2
5
34 32 (b)
3
4
278
280
282
284
286
288
290
292
288
290
292
46 44
4.5
Latitude
42
4
3 3
40
3
3.5 2.5
3.5
38
3 2.5
2.5
36 3.5
34 32 278
4
4.5
280
282
284 286 Longitude
Figure 3.4 (a) Observations, shown as solid dots, from which a uniformly gridded map is desired. Contours were constructed using a fixed covariance and the Gauss– Markov estimate Eq. (3.27). Noise was assumed white with a variance of 1. (b) Expected standard error of the mapped field in (a). Values tend, far from the observations points, to a variance of 25, √ which was the specified field variance, and hence the largest expected error is 25. Note the minima centered on the data points.
and (3.28), except that now everything is a function of the vector positions. If the field being mapped is also a vector (e.g., two components of velocity) with known covariances between the two components, then the elements of B become matrices. The observations could also be vectors at each point. An example of a two-dimensional map is shown in Fig. 3.4. The “data points,” y(ri ), are the dots, while estimates of x˜ (ri ) on the uniform grid were wanted.
3.2 Sampling
163
46 44 42 Latitude
0.0
40 0.2
38
0.8 0.4 0.2
36 34 0.0
32
278
280
282
284 286 Longitude
288
290
292
Figure 3.5 One of the rows of P corresponding to the grid point in Fig. 3.4 at 39◦ N, 282◦ E, displaying the expected correlations that occur in the errors of the mapped field. These errors would be important, e.g., in any use that differentiated the mapped field. For plotting purposes, the variance was normalized to 1.
The a priori noise was set to n = 0, Rnn = n i n j = σ 2n δ i j , σ 2n = 1, and the true field covariance was x = 0, Rx x = x(ri )x(r j ) = P0 exp −|ri − r j |2 /L 2 , P0 = 25, L 2 = 100. Figure 3.4 also shows the estimated values and Figs. 3.4 and 3.5 show the error variance estimate of the mapped values. Notice that, far from the data points, the estimated values are 0: the mapped field goes asymptotically to the estimated true mean, with the error variance rising to the full value of 25, which cannot be exceeded. That is to say, when we are mapping far from any data point, the only real information available is provided by the prior statistics – that the mean is 0, and the variance about that mean is 25. So the expected uncertainty of the mapped field, in the absence of data, cannot exceed the prior estimate of how far from the mean the true value is likely to be. The best estimate is then the mean itself. A complex error structure of the mapped field exists – even in the vicinity of the data points. Should a model be “driven” by this mapped field, one would need to make some provision in the model accounting for the spatial dependence in the expected errors of this forcing. In practice, most published objective mapping (often called “OI” for “objective interpolation,” although as we have seen, it is not true interpolation) has been based upon simple analytical statements of the covariances Rx x , Rnn as used in the example: that is, they are commonly assumed to be spatially stationary and isotropic
164
Extensions of methods
(depending upon |ri − r j | and not upon the two positions separately nor upon their orientation). The use of analytic forms removes the necessity for finding, storing, and computing with the potentially very large M × M data covariance matrices in which hypothetically every data or grid point has a different covariance with every other data or grid point. But the analytical convenience often distorts the solutions, as many fluid flows and other fields are neither spatially stationary nor isotropic.12
3.2.3 Mapping derivatives A common problem in setting up fluid models is the need to specify the fields of quantities such as temperature, density, etc., on a regular model grid. The derivatives of these fields must be specified for use in advection–diffusion equations, ∂C + v · ∇C = K ∇ 2 C, ∂t
(3.31)
where C is any scalar field of interest. Suppose the spatial derivative is calculated as a one-sided difference, C(˜r1 ) − C(˜r2 ) ∂C(˜r1 ) ∼ . ∂r r˜1 − r˜2
(3.32)
Then it is attractive to subtract the two estimates made from Eq. (3.27) , producing 1 ∂C(˜r1 ) ∼ (Rx x (˜r1 , r j ) − Rx x (˜r2 , r j ))(Rx x + Rnn )−1 y. ∂r r
(3.33)
Alternatively, an estimate of ∂C/∂r could be made directly from (2.392), using x = C(r1 ) − C(r2 ). R yy = Rx x + Rnn , which describes the data, does not change. Rx y does change: Rx y = (C(˜r1 ) − C(˜r2 ))(C(r j ) + n(r j )) = Rx x (˜r1 , r j ) − Rx x (˜r2 , r j ),
(3.34)
which when substituted into (2.396) produces (3.33). Thus, the optimal map of the finite difference field is simply the difference of the mapped values. More generally, the optimally mapped value of any linear combination of the values is that linear combination of the maps.13
3.3 Inequality constraints: non-negative least-squares In many estimation problems, it is useful to be able to impose inequality constraints upon the solutions. Problems involving tracer concentrations, for example, usually demand that they remain positive; empirical eddy diffusion coefficients are sometimes regarded as acceptable only when non-negative; in some fluid flow
3.3 Inequality constraints: non-negative least-squares
165
problems we may wish to impose directions, but not magnitudes, upon velocity fields. Such needs lead to consideration of the forms Ex + n = y,
(3.35)
Gx ≥ h,
(3.36)
where the use of a greater-than inequality to represent the general case is purely arbitrary; multiplication by minus 1 readily reverses it. G is of dimension M2 × N . Several cases need to be distinguished. (A) Suppose E is full rank and fully ˜ and there is no solution determined; then the SVD solution to (3.35) by itself is x˜ , n, nullspace. Substitution of the solution into (3.36) shows that the inequalities are either satisfied or that some are violated. In the first instance, the problem is solved, and the inequalities bring no new information. In the second case, the solution must ˜ will increase, given the noise-minimizing nature be modified and, necessarily, n of the SVD solution. It is also possible that the inequalities are contradictory, in which case there is no solution. (B) Suppose that E is formally underdetermined – so that a solution nullspace exists. If the particular SVD solution violates one or more of the inequalities and requires modification, two subcases can be distinguished: (1) Addition of one or more nullspace vectors permits the inequalities to be satisfied. Then the solution residual norm will be unaffected, but ˜x will increase. (2) The nullspace vectors by themselves are unable to satisfy the inequality constraints, and one or more range ˜ will increase. vectors are required to do so. Then both ˜x, n Case (A) is the conventional one.14 The so-called Kuhn–Tucker–Karush theorem is a requirement for a solution x˜ to exist. Its gist is as follows: Let M ≥ N and E be full rank; there are no vi in the solution nullspace. If there is a solution, there must exist a vector, q, of dimension M2 such that ET (E˜x − y) = GT q,
(3.37)
Gx − h = r,
(3.38)
where the M2 elements of q are divided into two groups. For group 1, of dimension m1, ri = 0,
qi ≥ 0,
(3.39)
and for group 2, of dimension m 2 = M2 − m 1 , ri > 0,
qi = 0.
(3.40)
To understand this theorem, recall that in the solution to the ordinary overdetermined least-squares problem, the left-hand side of (3.37) vanishes identically (2.91
166
Extensions of methods
and 2.265), being the projection of the residuals onto the range vectors, ui , of ET . If this solution violates one or more of the inequality constraints, structures that produce increased residuals must be introduced into the solution. Because there are no nullspace vi , the rows of G may each be expressed exactly by an expansion in the range vectors. In the second group of indices, the corresponding inequality constraints are already satisfied by the ordinary least-squares solution, and no modification of the structure proportional to vi is required. In the first group of indices, the inequality constraints are marginally satisfied, at equality, only by permitting violation of the demand (2.91) that the residuals should be orthogonal to the range vectors of E. If the ordinary least-squares solution violates the inequality, the minimum modification required to it pushes the solution to the edge of the acceptable bound, but at the price of increasing the residuals proportional to the corresponding ui . The algorithm consists of finding the two sets of indices and then the smallest coefficients of the vi corresponding to the group 1 indices required to just satisfy any initially violated inequality constraints. A canonical special case, to which more general problems can be reduced, is based upon the solution to G = I, h = 0 – called “non-negative least-squares.”15 The requirement, x ≥ 0, is essential in many problems involving tracer concentrations, which are necessarily positive. The algorithm can be extended to the underdetermined/rank-deficient case in which the addition, to the original basic SVD solution, of appropriate amounts of the nullspace of vi is capable of satisfying any violated inequality constraints.16 One simply chooses the smallest mean-square solution coefficients necessary to push the solution to the edge of the acceptable inequalities, producing the smallest norm. The residuals of the original problem do not increase – because only nullspace vectors are being used. G must have a special structure for this to be possible. The algorithm can be further generalized17 by considering the general case of rank-deficiency/underdeterminism where the nullspace vectors by themselves are inadequate to produce a solution satisfying the inequalities. In effect, any inequalities “left over” are satisfied by invoking the smallest perturbations necessary to the coefficients of the range vectors vi .
3.4 Linear programming In a number of important geophysical fluid problems, the objective functions are linear rather than quadratic functions. Scalar property fluxes such as heat, Ci , are carried by a fluid flow at rates Ci xi , which are linear functions of x. If one sought the extreme fluxes of C, it would require finding the extremal values of the corresponding linear function. Least-squares does not produce useful answers in such problems because linear objective functions achieve their minima or maxima
3.4 Linear programming
167
only at plus or minus infinity – unless the elements of x are bounded. The methods of linear programming are directed at finding extremal properties of linear objective functions subject to bounding constraints. Such problems can be written as minimize : J = cT x, E1 x = y 1 ,
(3.41)
E2 x ≥ y2 ,
(3.42)
E3 x ≤ y3 ,
(3.43)
a ≤ x ≤ b,
(3.44)
that is, as a collection of equality and inequality constraints of both greater than or less than form, plus bounds on the individual elements of x. In distinction to the least-squares and minimum variance equations that have been discussed so far, these constraints are hard ones; they cannot be violated even slightly in an acceptable solution. Linear programming problems are normally reduced to what is referred to as a canonical form, although different authors use different definitions of what it is. But all such problems are reducible to minimize : J = cT x,
(3.45)
Ex ≤ y
(3.46)
x ≥ 0.
(3.47)
The use of a minimum rather than a maximum is readily reversed by introducing a minus sign, and the inequality is similarly readily reversed. The last relationship, requiring purely positive elements in x, is obtained without difficulty by translation. Linear programming problems are widespread in many fields including, especially, financial and industrial management where they are used to maximize profits, or minimize costs, in, say, a manufacturing process. Necessarily then, the amount of a product of each type is positive, and the inequalities reflect such things as the need to consume no more than the available amounts of raw materials. In some cases, J is then literally a “cost” function. General methodologies were first developed during World War II in what became known as “operations research” (“operational research” in the UK),18 although special cases were known much earlier. Since then, because of the economic stake in practical use of linear programming, immense effort has been devoted both to textbook discussion and efficient, easyto-use software.19 Given this accessible literature and software, the methodologies of solution are not described here, and only a few general points are made. The original solution algorithm invented by G. Dantzig is usually known as the “simplex method” (a simplex is a convex geometric shape). It is a highly efficient
168
Extensions of methods
search method conducted along the bounding constraints of the problem. In general, it is possible to show that the outcome of a linear programming problem falls into several distinct categories: (1) The system is “infeasible,” meaning that it is contradictory and there is no solution; (2) the system is unbounded, meaning that the minimum lies at negative infinity; (3) there is a unique minimizing solution; and (4) there is a unique finite minimum, but it is achieved by an infinite number of solutions x. The last situation is equivalent to observing that if there are two minimizing solutions, there must be an infinite number of them because then any linear combination of the two solutions is also a solution. Alternatively, if one makes up a matrix from the coefficients of x in Eqs. (3.45)–(3.47), one can determine if it has a nullspace. If one or more such vectors exists, it is also orthogonal to the objective function, and it can be assigned an arbitrary amplitude without changing J . One distinguishes between feasible solutions, meaning those that satisfy the inequality and equality constraints but which are not minimizing, and optimal solutions, which are both feasible and minimize the objective function. An interesting and useful feature of a linear programming problem is that Eqs. (3.45)–(3.47) have a “dual”: maximize : J2 = yT μ,
(3.48)
E μ ≥ c,
(3.49)
μ ≥ 0.
(3.50)
T
It is possible to show that the minimum of J must equal the maximum of J2 . The reader may want to compare the structure of the original (the “primal”) and dual equations with those relating the Lagrange multipliers to x discussed in Chapter 2. In the present case, the important relationship is ∂J = μi . ∂ yi
(3.51)
That is, in a linear program, the dual solution provides the sensitivity of the objective function to perturbations in the constraint parameters y. Duality theory pervades optimization problems, and the relationship to Lagrange multipliers is no accident.20 Some simplex algorithms, called the “dual simplex,” take advantage of the different dimensions of the primal and dual problems to accelerate solution. In recent years much attention has focussed upon a new, non-simplex method of solution21 known as the “Karmarkar” or “interior point” method. Linear programming is also valuable for solving estimation or approximation problems in which norms other than the 2-norms, which have been the focus of this book, are used. For example, suppose that one sought the solution to the
3.5 Empirical orthogonal functions
169
constraints Ex + n = y, M > N , but subject not to the conventional minimum of J = i n i2 , but that of J = i |n i | (a 1-norm). Such norms are less sensitive to outliers than are the 2-norms and are said to be “robust.” The maximum likelihood idea connects 2-norms to Gaussian statistics, and similarly, 1-norms are related to maximum likelihood with exponential statistics.22 Reduction of such problems to linear programming is carried out by setting n i = n i+ − n i− , n i+ ≥ 0, n i− ≥ 0, and the objective function is min : J = n i+ + n i− . (3.52) i
Other norms, the most important23 of which is the so-called infinity norm, which minimizes the maximum element of an objective function (“mini-max” optimization), are also reducible to linear programming. 3.5 Empirical orthogonal functions Consider an arbitrary M × N matrix M. Suppose the matrix was representable, accurately, as the product of two vectors, M ≈ abT , where a was M × 1, and b was N × 1. Approximation is intended in the sense that M − abT < ε, for some acceptably small ε. Then one could conclude that the M N elements of A contain only M + N pieces of information contained in a, b. Such an inference has many uses, including the ability to recreate the matrix accurately from only M + N numbers, to physical interpretations of the meaning of a, b. More generally, if one pair of vectors is inadequate, some small number might suffice: M ≈ a1 bT1 + a2 bT2 + · · · + a K bTK .
(3.53)
A general mathematical approach to finding such a representation is through the SVD in a form sometimes known as the “Eckart–Young–Mirsky theorem.”24 This theorem states that the most efficient representation of a matrix in the form M≈
K
λi ui viT ,
(3.54)
i
where the ui , vi are orthonormal, is achieved by choosing the vectors to be the singular vectors, with λi providing the amplitude information (recall Eq. (2.227)). The connection to regression analysis is readily made by noticing that the sets of singular vectors are the eigenvectors of the two matrices MMT , MT M
170
Extensions of methods
(Eqs. (2.253) and (2.254)). If each row of M is regarded as a set of observations at a fixed coordinate, then MMT is just proportional to the sample second-moment matrix of all the observations, and its eigenvectors, ui , are the EOFs. Alternatively, if each column is regarded as the observation set for a fixed coordinate, then MT M is the corresponding sample second-moment matrix, and the vi are the EOFs. A large literature provides various statistical rules for use of EOFs. For example, the rank determination in the SVD becomes a test of the statistical significance of the contribution of singular vectors to the structure of M.25 In the wider context, however, one is dealing with the problem of efficient relationships amongst variables known or suspected to carry mutual correlations. Because of its widespread use, this subject is plagued by multiple discovery and thus multiple jargon. In different contexts and details (e.g., how the matrix is weighted), the problem is known as that of “principal components,”26 “empirical orthogonal functions” (EOFs), the “Karhunen–Lo`eve” expansion (in mathematics and electrical engineering),27 “proper orthogonal decomposition,”28 etc. Examples of the use of EOFs will be provided in Chapter 6. 3.6 Kriging and other variants of Gauss–Markov estimation A variant of the Gauss–Markov mapping estimators, often known as “kriging” (named for David Krige, a mining geologist), addresses the problem of a spatially varying mean field, and is a generalization of the ordinary Gauss–Markov estimator.29 Consider the discussion on p. 132 of the fitting of a set of functions f i (r) to an observed field y(r j ). That is, we put y(r j ) = Fα + q(r j ),
(3.55)
where F(r) = { f i (r)} is a set of basis functions, and one seeks the expansion coefficients, α, and q such that the data, y, are interpolated (meaning reproduced exactly) at the observation points, although there is nothing to prevent further breaking up q into signal and noise components. If there is only one basis function – for example a constant – one is doing kriging, which is the determination of the mean prior to objective mapping of q, as discussed above. If several basis functions are being used, one has “universal kriging.” The main issue concerns the production of an adequate statement of the expected error, given that the q are computed from a preliminary regression to determine the α.30 The method is often used in situations where large-scale trends are expected in the data, and where one wishes to estimate and remove them before analyzing and mapping the q. Because the covariances employed in objective mapping are simple to use and interpret only when the field is spatially stationary, much of the discussion
3.7 Non-linear problems
171
of kriging uses instead what is called the “variogram,” defined as V = (y(ri ) − y(r j ))(y(ri ) − y(r j )) , which is related to the covariance, and which is often encountered in turbulence theory as the “structure function.” Kriging is popular in geology and hydrology, and deserves wider use.
3.7 Non-linear problems The least-squares solutions examined thus far treat the coefficient matrix E as given. But in many of the cases encountered in practice, the elements of E are computed from data and are imperfectly specified. It is well known in the regression literature that treating E as known, even if n is increased beyond the errors contained in y, can lead to significant bias errors in the least-squares and related solutions, particularly if E is nearly singular.31 The problem is known as that of “errors in regressors or errors in variables” (EIV); it manifests itself in the classical simple least-squares problem (p. 43), where a straight line is fitted to data of the form yi = a + bti , but where the measurement positions, ti , are partly uncertain rather than perfect. In general terms, when E has errors, the model statement becomes ˜ + E)˜ ˜ x = y˜ + ˜y, (E
(3.56)
˜ ˜y, where the old n is now broken into two parts: where one seeks estimates, x˜ , E, ˜ x and ˜y. If such estimates can be made, the result can be used to rewrite (3.56) E˜ as ˜ x = y˜ , E˜
(3.57)
where the relation is to be exact. That is, one seeks to modify the elements of E such that the observational noise in it is reduced to zero.
3.7.1 Total least-squares For some problems of this form, the method of total least-squares (TLS) is a powerful and interesting method. It is worth examining briefly to understand why it is not always immediately useful, and to motivate a different approach.32 The SVD plays a crucial role in TLS. Consider that in Eq. (2.17 ) the vector y was written as a sum of the column vectors of E; to the extent that the column space does not fully describe y, a residual must be left by the solution x˜ , and ordinary least-squares can be regarded as producing a solution in which a new estimate, y˜ ≡ E˜x, of y is made; y is changed, but the elements of E are untouched. But suppose it were possible to introduce small changes in both the column vectors of E, as well as in y, such that the column vectors of the modified E + E produced
172
Extensions of methods
a spanning vector space for y + y, where both y , E were “small,” then the problem as stated would be solved. The simplest problem to analyze is the full-rank, formally overdetermined one. Let M ≥ N = K . Then, if we form the M × (N + 1) augmented matrix Ea = {E the solution sought is such that
y},
˜ {E
x˜ y˜ } =0 −1
(3.58)
˜ y˜ }. (exactly). If this solution is to exist, [˜x − 1]T must lie in the nullspace of {E A solution is thus ensured by forming the SVD of {E y}, setting λ N +1 = 0, and ˜ y˜ } out of the remaining singular vectors and values. Then [˜x − 1]T is forming {E the nullspace of the modified augmented matrix, and must therefore be proportional to the nullspace vector v N +1 . Also, ˜ ˜y} = −u N +1 λ N +1 vTN +1 . {E
(3.59)
Various complications can be considered, for example, if the last element of v N +1 = 0; this and other special cases are discussed in the reference (see note 24). Cases of non-uniqueness are treated by selecting the solution of minimum norm. A simple generalization applies to the underdetermined case: if the rank of the augmented matrix is p, one reduces the rank by one to p − 1. The TLS solution just summarized applies only to the case in which the errors in the elements of E and y are uncorrelated and of equal variance and in which there are no required structures – for example, where certain elements of E must always vanish. More generally, changes in some elements of E require, for reasons of physics, specific corresponding changes in other elements of E and in y, and vice versa. The fundamental difficulty is that the model (3.56) presents a non-linear estimation problem with correlated variables, and its solution requires modification of the linear procedures developed so far.
3.7.2 Method of total inversion

The simplest form of TLS does not readily permit the use of correlations and prior variances in the parameters appearing in the coefficient matrix, and does not provide any way of maintaining the zero structure there. Methods exist that permit accounting for prior knowledge of covariances.33 Consider a set of non-linear constraints in a vector of unknowns x,

g(x) + u = q.
(3.60)
This set of equations is the generalization of the linear models hitherto used; u again represents any expected error in the specification of the model. An example of a scalar non-linear model is 8x_1^2 + x_2^2 + u = 4. In general, there will be some expectations about the behavior of u. Without loss of generality, take its expected value to be zero, and its covariance to be Q = ⟨uu^T⟩. There is nothing to prevent us from combining x, u into one single set of unknowns ξ, and indeed if the model has some unknown parameters, ξ might as well include those as well. So (3.60) can be written as

L(ξ) = 0.
(3.61)
In addition, it is supposed that a reasonable initial estimate ξ̃(0) is available, with uncertainty P(0) ≡ ⟨(ξ − ξ̃(0))(ξ − ξ̃(0))^T⟩ (or the covariances of the u, x could be specified separately if their uncertainties are not correlated). An objective function whose minimum is sought is written

J = L(ξ)^T Q^{−1} L(ξ) + (ξ − ξ̃(0))^T P(0)^{−1} (ξ − ξ̃(0)).
(3.62)
The presence of the weight matrices Q, P(0) permits control of the elements most likely to change, specification of elements that should not change at all (e.g., by introducing zeros into P(0)), as well as the stipulation of covariances. It can be regarded as a generalization of the process of minimizing objective functions, which led to least-squares in previous chapters, and is sometimes known as the "method of total inversion."34 Consider an example for the two simultaneous equations

2x_1 + x_2 + n_1 = 1,
(3.63)
0 + 3x_2 + n_2 = 2,
(3.64)
where all the numerical values except the zero are now regarded as in error to some degree. One way to proceed is to write the coefficients of E in the specific perturbation form (3.56). For example, write E_11 = 2 + ΔE_11, and define the unknowns ξ in terms of the ΔE_ij. For illustration, retain the full non-linear form by setting

ξ_1 = E_11,   ξ_2 = E_12,   ξ_3 = E_21,   ξ_4 = E_22,   ξ_5 = x_1,   ξ_6 = x_2,   u_1 = n_1,   u_2 = n_2.
The equations are then

ξ_1 ξ_5 + ξ_2 ξ_6 + u_1 − 1 = 0,
(3.65)
ξ_3 ξ_5 + ξ_4 ξ_6 + u_2 − 2 = 0.
(3.66)
The y_i are being treated as formally fixed, but u_1, u_2 represent their possible errors (the division into knowns and unknowns is not unique). Let there be an initial estimate,

ξ_1 = 2 ± 1,   ξ_2 = 2 ± 2,   ξ_3 = 0 ± 0,   ξ_4 = 3.5 ± 1,   ξ_5 = x_1 = 0 ± 2,   ξ_6 = x_2 = 0 ± 2,
with no imposed correlations, so that P(0) = diag([1, 4, 0, 1, 4, 4]); the zero represents the requirement that E_21 remain unchanged. Let Q = diag([2, 2]). Then a useful objective function is

J = (ξ_1 ξ_5 + ξ_2 ξ_6 − 1)^2/2 + (ξ_3 ξ_5 + ξ_4 ξ_6 − 2)^2/2 + (ξ_1 − 2)^2 + (ξ_2 − 2)^2/4
    + 10^6 ξ_3^2 + (ξ_4 − 3.5)^2 + ξ_5^2/4 + ξ_6^2/4.    (3.67)
The 10^6 in front of the term in ξ_3^2 is a numerical approximation to the infinite value implied by a zero uncertainty in this term (an arbitrarily large value can cause numerical instability, characteristic of penalty and barrier methods).35 Such objective functions define surfaces in spaces of the dimension of ξ. Most procedures require the investigator to make a first guess at the solution, ξ̃(0), and attempt to minimize J by going downhill from the guess. Various deterministic search algorithms have been developed and are variants of steepest descent, conjugate gradient, Newton and quasi-Newton methods. The difficulties are numerous. Some methods require computation or provision of the gradients of J with respect to ξ, and the computational cost may become very great. The surfaces on which one is seeking to go downhill may become extremely tortuous, or very slowly changing. The search path can fall into local holes that are not the true minima. Non-linear optimization is something of an art. Nonetheless, existing techniques are very useful. The minimum of J corresponds to finding the solution of the non-linear normal equations that would result from setting the partial derivatives to zero. Let the true minimum be at ξ*. Assuming that the search procedure has succeeded, the objective function is locally

J = constant + (ξ − ξ*)^T H (ξ − ξ*) + ΔJ,
(3.68)
where H is the Hessian and ΔJ is a correction – assumed to be small. In the linear least-squares problem (2.89), the Hessian is evidently E^T E, the second derivative of the objective function with respect to x. The supposition is then that, near the true optimum, the objective function is locally quadratic with a small correction. To the extent that this supposition is true, the result can be analyzed in terms of the behavior of H as though it represented a locally defined version of E^T E. In particular, if H has a nullspace, or small eigenvalues, one can expect to see all the issues arising that
we dealt with in Chapter 2, including ill-conditioning and solution variances that may become large in some elements. The machinery used in Chapter 2 (row and column scaling, nullspace suppression, etc.) thus becomes immediately relevant here and can be used to help conduct the search and to understand the solution.

Example

It remains to find the minimum of J in (3.67).36 Most investigators are best-advised to tackle such problems by using one of the many general-purpose numerical routines written by experts.37 Here, a quasi-Newton method was employed to produce

E_11 = 2.0001,   E_12 = 1.987,   E_21 = 0.0,   E_22 = 3.5237,   x_1 = −0.0461,   x_2 = 0.556,
and the minimum of J = 0.0802. The inverse Hessian at the minimum is

H^{−1} =
  [  0.4990   0.0082  −0.0000  −0.0014   0.0061   0.0005 ]
  [  0.0082   1.9237   0.0000   0.0017  −0.4611  −0.0075 ]
  [ −0.0000   0.0000   0.0000  −0.0000  −0.0000   0.0000 ]
  [ −0.0014   0.0017  −0.0000   0.4923   0.0623  −0.0739 ]
  [  0.0061  −0.4611  −0.0000   0.0623   0.3582  −0.0379 ]
  [  0.0005  −0.0075   0.0000  −0.0739  −0.0379   0.0490 ]

The eigenvalues and eigenvectors of H are

λ_i = [ 2.075 × 10^6   30.4899   4.5086   2.0033   1.9252   0.4859 ],

V =
  [  0.0000  −0.0032   0.0288   0.9993   0.0213   0.0041 ]
  [ −0.0000   0.0381  −0.2504   0.0020   0.0683   0.9650 ]
  [ −1.0000   0.0000   0.0000  −0.0000  −0.0000   0.0000 ]
  [  0.0000   0.1382   0.2459  −0.0271   0.9590  −0.0095 ]
  [ −0.0000   0.1416  −0.9295   0.0237   0.2160  −0.2621 ]
  [  0.0000   0.9795   0.1095   0.0035  −0.1691   0.0017 ]

The large jump from the first eigenvalue to the others is a reflection of the conditioning problem introduced by having one element, ξ_3, with almost zero uncertainty. It is left to the reader to use this information about H to compute the uncertainty of the solution in the neighborhood of the optimal values – this would be the new uncertainty, P(1). A local resolution analysis follows from that of the SVD, employing knowledge of the V. The particular system is too small for a proper statistical test of the result against the prior covariances, but the possibility should be clear. If P(0), etc., are simply regarded as non-statistical weights, we are free to experiment with different values until a pleasing solution is found.
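For readers who wish to reproduce the flavor of this example, the following is a minimal Python sketch using SciPy's BFGS quasi-Newton routine. The text does not specify which routine was actually used, and the optimizer choice, tolerances, and starting point here are illustrative, so the numbers obtained will differ slightly from those quoted above.

import numpy as np
from scipy.optimize import minimize

def J(xi):
    # Objective function (3.67); xi = (E11, E12, E21, E22, x1, x2).
    x1, x2, x3, x4, x5, x6 = xi
    return ((x1 * x5 + x2 * x6 - 1.0) ** 2 / 2.0
            + (x3 * x5 + x4 * x6 - 2.0) ** 2 / 2.0
            + (x1 - 2.0) ** 2
            + (x2 - 2.0) ** 2 / 4.0
            + 1.0e6 * x3 ** 2
            + (x4 - 3.5) ** 2
            + x5 ** 2 / 4.0
            + x6 ** 2 / 4.0)

xi0 = np.array([2.0, 2.0, 0.0, 3.5, 0.0, 0.0])   # the initial estimate, xi-tilde(0)
result = minimize(J, xi0, method="BFGS")         # quasi-Newton descent from the first guess
xi_min = result.x                                # approximate minimizer (E11, E12, E21, E22, x1, x2)
Hinv = result.hess_inv                           # BFGS approximation to the inverse Hessian

The eigenvalues and eigenvectors of the inverse-Hessian estimate can then be inspected with np.linalg.eigh(Hinv), mirroring the conditioning and resolution discussion above.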
3.7.3 Variant non-linear methods, including combinatorial ones

As with the linear least-squares problems discussed in Chapter 2, many possibilities exist for objective functions that are non-linear in either data constraint terms or the model, and there are many variations on methods for searching for objective function minima. A very interesting and useful set of methods has been developed comparatively recently, called "combinatorial optimization." Combinatorial methods do not promise that the true minimum is found – merely that it is highly probable – because they search the space of solutions in clever ways that make it unlikely that one is very far from the true optimal solution. Two such methods, simulated annealing and genetic algorithms, have recently attracted considerable attention.38 Simulated annealing searches randomly for solutions that reduce the objective function from a present best value. Its clever addition to purely random guessing is a willingness to accept the occasional uphill solution – one that raises the value of the objective function – as a way of avoiding being trapped in purely local minima. The probability of accepting an uphill value, and the size of the trial random perturbations, depend upon a parameter: a temperature defined in analogy to the real temperature of a slowly cooling (annealing) solid. Genetic algorithms, as their name would suggest, are based upon searches generated in analogy to genetic drift in biological organisms.39 The recent literature is large and sophisticated, and this approach is not pursued here.
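To make the verbal description concrete, here is a toy simulated-annealing search, written as an addition to the text; the step size, cooling schedule, and iteration count are arbitrary choices and not a recommended implementation.

import numpy as np

def anneal(J, xi0, n_steps=20000, step=0.05, T0=1.0, seed=0):
    # Toy simulated annealing: random trial perturbations, with downhill moves
    # always accepted and uphill moves accepted with probability exp(-increase/T).
    rng = np.random.default_rng(seed)
    xi = np.asarray(xi0, dtype=float)
    Jxi = J(xi)
    best, Jbest = xi.copy(), Jxi
    for k in range(n_steps):
        T = T0 * (1.0 - k / n_steps) + 1.0e-8          # slowly "cooling" temperature
        trial = xi + step * rng.standard_normal(xi.size)
        Jtrial = J(trial)
        if Jtrial < Jxi or rng.random() < np.exp(-(Jtrial - Jxi) / T):
            xi, Jxi = trial, Jtrial                    # accept the move
            if Jxi < Jbest:
                best, Jbest = xi.copy(), Jxi           # remember the best point visited
    return best, Jbest

# For example, with J and xi0 from the quasi-Newton sketch above:
# best, Jbest = anneal(J, [2.0, 2.0, 0.0, 3.5, 0.0, 0.0])

Nothing guarantees that the returned point is the global minimum; the claim is only that the random, occasionally uphill, search makes entrapment in a poor local minimum unlikely.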
Notes

1 Freeman (1965), Jerri (1977), Butzer and Stens (1992), or Bracewell (2000).
2 In the Russian literature, Kotel'nikov's theorem.
3 Aliasing is familiar as the stroboscope effect. Recall the appearance of the spokes of a wagon wheel in the movies. The spokes can appear to stand still, or move slowly forward or backward, depending upon the camera shutter speed relative to the true rate at which the spokes revolve. (The terminology is apparently due to John Tukey.)
4 Hamming (1973) and Bracewell (2000) have particularly clear discussions.
5 There is a story, perhaps apocryphal, that a group of investigators was measuring the mass flux of the Gulf Stream at a fixed time each day. They were preparing to publish the exciting discovery that there was a strong 14-day periodicity to the flow, before someone pointed out that they were aliasing the tidal currents of period 12.42 hours.
6 It follows from the so-called Paley–Wiener criterion, and is usually stated in the form that "time-limited signals cannot be bandlimited."
7 Landau and Pollack (1962), Freeman (1965), Jerri (1977).
8 Petersen and Middleton (1962). An application, with discussion of the noise sensitivity, may be found in Wunsch (1989).
9 Davis and Polonsky (1965).
10 See Ripley (2004, Section 5.2).
11 Bretherton et al. (1976).
12 See Fukumori et al. (1991).
13 Luenberger (1969).
14 See Strang (1988) or Lawson and Hanson (1995); the standard full treatment is Fiacco and McCormick (1968).
15 Lawson and Hanson (1974).
16 Fu (1981).
17 Tziperman and Hecht (1987).
18 Dantzig (1963).
19 For example, Bradley et al. (1977), Luenberger (2003), and many others.
20 See Cacuci (1981), Hall and Cacuci (1984), Strang (1986), Rockafellar (1993), Luenberger (1997).
21 One of the few mathematical algorithms ever to be written up on the front page of the New York Times (November 19, 1984, story by J. Gleick) – a reflection of the huge economic importance of linear programs in industry.
22 Arthanari and Dodge (1993).
23 Wagner (1975), Arthanari and Dodge (1993).
24 Van Huffel and Vandewalle (1991).
25 The use of EOFs, with various normalizations, scalings, and in various row/column physical spaces, is widespread – for example, Wallace (1972), Wallace and Dickinson (1972), and many others.
26 Jolliffe (2002), Preisendorfer (1988), Jackson (2003).
27 Davenport and Root (1958), Wahba (1990).
28 Berkooz et al. (1993).
29 Armstrong (1989), David (1988), Ripley (2004).
30 Ripley (2004).
31 For example, Seber and Lee (2003).
32 Golub and van Loan (1996), Van Huffel and Vandewalle (1991).
33 Tarantola and Valette (1982), Tarantola (1987).
34 Tarantola and Valette (1982) labeled the use of similar objective functions and the determination of the minimum as the method of total inversion, although they considered only the case of perfect model constraints.
35 Luenberger (1984).
36 Tarantola and Valette (1982) suggested using a linearized search method, iterating from the initial estimate, which must be reasonably close to the correct answer. The method can be quite effective (e.g., Wunsch and Minster, 1982; Mercier et al., 1993). In a wider context, however, their method is readily recognizable as a special case of the many known methods for minimizing a general objective function.
37 Numerical Algorithms Group (2005), Press et al. (1996).
38 For simulated annealing, the literature starts with Pincus (1968) and Kirkpatrick et al. (1983), and general discussions can be found in van Laarhoven and Aarts (1987), Ripley (2004), and Press et al. (1996). A simple oceanographic application to experiment design was discussed by Barth and Wunsch (1990).
39 Goldberg (1989), Holland (1992), Denning (1992).
4 The time-dependent inverse problem: state estimation
4.1 Background

The discussion so far has treated models and data that most naturally represent a static world. Much data, however, describe systems that are changing in time to some degree. Many familiar differential equations represent phenomena that are intrinsically time-dependent, such as the wave equation,

(1/c^2) ∂^2 x(t)/∂t^2 − ∂^2 x(t)/∂r^2 = 0.
(4.1)
One may well wonder if the methods described in Chapter 2 have any use with data thought to be described by (4.1). An approach to answering the question is to recognize that t is simply another coordinate, and can be regarded, e.g., as the counterpart of one of the space coordinates encountered in the previous discussion of two-dimensional partial differential equations. From this point of view, time-dependent systems are nothing but versions of the systems already developed. (The statement is even more obvious for the simpler equation,

d^2 x(t)/dt^2 = q(t),
(4.2)
where the coordinate that is labeled t is arbitrary, and need not be time.) On the other hand, time often has a somewhat different flavor to it than does a spatial coordinate because it has an associated direction. The most obvious example occurs when one has data up to and including some particular time t, and one asks for a forecast of some elements of the system at a future time t' > t. Even this role of time is not unique: one could imagine a completely equivalent spatial forecast problem, in which, e.g., one required extrapolation of the map of an ore body beyond some area in which measurements exist. In state estimation, time does not introduce truly novel problems. The main issue is really a computational one: problems in two or more spatial dimensions, when time-dependent, typically generate system
dimensions that are too large for conventionally available computer systems. To deal with the computational load, state estimation algorithms are sought that are computationally more efficient than what can be achieved with the methods used so far. Consider, as an example,

∂C/∂t = κ∇^2 C,    (4.3)

a generalization of the Laplace equation (a diffusion equation). Using a one-sided time difference, and the discrete form of the two-dimensional Laplacian in Eq. (1.13), one has

[C_ij((n + 1)Δt) − C_ij(nΔt)]/Δt = κ{C_{i+1,j}(nΔt) − 2C_{i,j}(nΔt) + C_{i−1,j}(nΔt) + C_{i,j+1}(nΔt) − 2C_{i,j}(nΔt) + C_{i,j−1}(nΔt)}.    (4.4)

If there are N^2 elements defining C_ij at each time nΔt, then the number of elements over the entire time span of T time steps would be T N^2, which grows rapidly as the number of time steps increases. Typically the relevant observation numbers also grow rapidly through time. On the other hand, the operation

x = vec(C_ij(nΔt)),
(4.5)
renders Eq. (4.4) in the familiar form

A_1 x = 0,
(4.6)
and with some boundary conditions, some initial conditions and/or observations, and a big enough computer, one could use without change any of the methods of Chapter 2. As T, N grow, however, even the largest available computer becomes inadequate. Methods are sought that can take advantage of special structures built into time-evolving equations to reduce the computational load. (Note, however, that A_1 is very sparse.) This chapter is in no sense exhaustive; many entire books are devoted to the material and its extensions. The intention is to lay out the fundamental ideas, which are algorithmic rearrangements of methods already described in Chapters 2 and 3, with the hope that they will permit the reader to penetrate the wider literature. Several very useful textbooks are available for readers who are not deterred by discussions in contexts differing from their own applications.1 Most of the methods now being used in fields involving large-scale fluid dynamics, such as oceanography and meteorology, have been known for years under the general headings of control theory and control engineering. The experience in these latter areas is very helpful; the main issues in applications to fluid problems concern the size of the models
and data sets encountered: they are typically many orders of magnitude larger than anything contemplated by engineers. In meteorology, specialized techniques used for forecasting are commonly called “data assimilation.”2 The reader may find it helpful to keep in mind, through the details that follow, that almost all methods in actual use are, beneath the mathematical disguises, nothing but versions of leastsquares fitting of models to data, but reorganized so as to increase the efficiency of solution, or to minimize storage requirements, or to accommodate continuing data streams. Several notation systems are in wide use. The one chosen here is taken directly from the control theory literature; it is simple and adequate.3
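Referring back to Eqs. (4.3)–(4.6): as an illustration added here (not part of the original text), the following Python sketch shows how the discretized diffusion equation, combined with the vec operation, becomes a one-step matrix recursion. The grid size, diffusivity, time step, and crude boundary treatment are arbitrary choices for the example.

import numpy as np
import scipy.sparse as sp

N, kappa, dt = 10, 1.0, 0.1                      # N x N grid, unit grid spacing
I = sp.identity(N, format="csr")
D2 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(N, N), format="csr")
lap = sp.kron(I, D2) + sp.kron(D2, I)            # five-point Laplacian acting on vec(C)
A_step = sp.identity(N * N) + kappa * dt * lap   # one-step state transition matrix; very sparse
# Points outside the domain are implicitly treated as zero (a crude Dirichlet-type
# boundary); a real calculation would build proper boundary-condition rows.

C = np.zeros((N, N))
C[N // 2, N // 2] = 1.0                          # initial tracer distribution
x = C.ravel()                                    # x = vec(C_ij)
for n in range(50):                              # march the state vector forward in time
    x = A_step @ x
C_final = x.reshape(N, N)

Storing every x(nΔt), rather than only the current one, is where the T N^2 growth in problem size appears.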
4.2 Basic ideas and notation

4.2.1 Models

In the context of this chapter, "models" is used to mean statements about the connections between the system variables in some place at some time, and those in all other places and times. Maxwell's equations are a model of the behavior of time-dependent electromagnetic disturbances. These equations can be used to connect the magnetic and electric fields everywhere in space and time. Other physical systems are described by the Schrödinger, elastic, or fluid-dynamical equations. Static situations are special limiting cases, e.g., for an electrostatic field in a container with known boundary conditions. A useful concept is that of the system "state." By that is meant the internal information at a single moment in time required to forecast the system one small time step into the future. The time evolution of a system described by the tracer diffusion equation (4.3), inside a closed container, can be calculated with arbitrary accuracy at time t + Δt, as Δt → 0, if one knows C(r, t) and the boundary conditions C_B(t). C(r, t) is the state variable (the "internal" information), with the boundary conditions being regarded as separate, externally provided variables (but the distinction is to some degree arbitrary, as we will see). In practice, because such quantities as initial and boundary conditions, container shape, etc., are obtained from measurements, and are thus always imperfectly known, the problems are conceptually identical to those already considered. Consider any model, whether time-dependent or steady, but rendered in discrete form. The "state vector" x(t) (where t is discrete) is defined as those elements of the model employed to describe fully the physical state of the system at any time and all places as required by the model in use. For the discrete Laplace/Poisson equation in Chapter 1, x = vec(C_ij) is the state vector. In a fluid model, the state vector might consist of three components of velocity, pressure, and temperature at
each of millions of grid points, and it would be a function of time, x(t), as well. (One might want to regard the complete description,

x_B = [x(1Δt)^T, x(2Δt)^T, . . . , x(TΔt)^T]^T,
(4.7)
as the state vector, but by convention it refers to the subvectors, x(t = nΔt), each of which, given the boundary conditions, is sufficient to compute any future one.) Consider a partial differential equation,

∂(∇_h^2 p)/∂t + β ∂p/∂η = 0,
(4.8)
subject to boundary conditions. ∇_h is the two-dimensional gradient operator. For the moment, t is a continuous variable. Suppose it is solved by an expansion,

p(ξ, η, t) = Σ_{j=1}^{N/2} [a_j(t) cos(k_j · r) + b_j(t) sin(k_j · r)],    (4.9)

[k_j = (k_ξ, k_η), r = (ξ, η)]; then a(t) = [a_1(t), b_1(t), . . . , a_j(t), b_j(t), . . .]^T. The k_j are chosen to be periodic in the domain. The a_i, b_i are a partial discretization, reducing the time-dependence to a finite set of coefficients. Substitute into Eq. (4.8),

Σ_j {−|k_j|^2 (ȧ_j cos(k_j · r) + ḃ_j sin(k_j · r)) + β k_{1j} [−a_j sin(k_j · r) + b_j cos(k_j · r)]} = 0.

The dot indicates a time-derivative, and k_{1j} is the η component of k_j. Multiply this last equation through first by cos(k_j · r) and integrate over the domain:

−|k_j|^2 ȧ_j + β k_{1j} b_j = 0.

Multiply by sin(k_j · r) and integrate again to give

|k_j|^2 ḃ_j + β k_{1j} a_j = 0,

or

d/dt [ a_j ]   [         0            β k_{1j}/|k_j|^2 ] [ a_j ]
     [ b_j ] = [ −β k_{1j}/|k_j|^2           0         ] [ b_j ].

Each pair of a_j, b_j satisfies a system of ordinary differential equations in time, and each can be further discretized, so that

[ a_j(nΔt) ]   [          1              Δt β k_{1j}/|k_j|^2 ] [ a_j((n − 1)Δt) ]
[ b_j(nΔt) ] = [ −Δt β k_{1j}/|k_j|^2             1          ] [ b_j((n − 1)Δt) ].

The state vector is then the collection

x(nΔt) = [a_1(nΔt), b_1(nΔt), a_2(nΔt), b_2(nΔt), . . .]^T,
at time t = nΔt. Any adequate discretization can provide the state vector; it is not unique, and careful choice can greatly simplify calculations. In the most general terms, we can write any discrete model as a set of functional relations:

L[x(0), . . . , x(t − Δt), x(t), x(t + Δt), . . . , x(t_f), . . . , B(t)q(t), B(t + Δt)q(t + Δt), . . . , t] = 0,
(4.10)
where B(t)q(t) represents a general, canonical form for boundary and initial conditions/sources/sinks. A time-dependent model is a set of rules for computing the state vector at time t = nΔt from knowledge of its values at time t − Δt and the externally imposed forces and boundary conditions. We almost always choose the time units so that Δt = 1, and t becomes an integer (the context will normally make clear whether t is continuous or discrete). The static system equation,

Ax = b,
(4.11)
is a special case. In practice, the collection of relationships (4.10) always can be rewritten as a time-stepping rule – for example,

x(t) = L(x(t − 1), B(t − 1)q(t − 1), t − 1),    Δt = 1,
(4.12)
or, if the model is linear,

x(t) = A(t − 1) x(t − 1) + B(t − 1) q(t − 1).
(4.13)
If the model is time invariant, A(t) = A and B(t) = B. A(t) is called the "state transition matrix." It is generally true that any linear discretized model can be put into this canonical form, although it may take some work. By the same historical conventions described in Chapter 1, solution of systems like (4.12), subject to appropriate initial and boundary conditions, constitutes the forward, or direct, problem. Note that, in general, x(0) = x_0 has subcomponents that formally precede t = 0.
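As another illustration (added here, with arbitrary numerical values for β, the time step, and the wavenumbers), the Rossby-wave expansion above leads to a block-diagonal, time-invariant state transition matrix A, one 2 × 2 block per (a_j, b_j) pair, which can be stepped with the canonical rule (4.13):

import numpy as np

beta, dt = 2.0e-11, 3600.0                                      # illustrative values
kvecs = [(2.0e-6, 1.0e-6), (1.0e-6, 2.0e-6), (3.0e-6, 1.0e-6)]  # (k_xi, k_eta) pairs

blocks = []
for k_xi, k_eta in kvecs:
    c = dt * beta * k_eta / (k_xi**2 + k_eta**2)     # dt*beta*k_1j/|k_j|^2, k_1j = eta component
    blocks.append(np.array([[1.0, c], [-c, 1.0]]))   # the 2x2 block derived above

A = np.zeros((2 * len(kvecs), 2 * len(kvecs)))
for j, blk in enumerate(blocks):                     # assemble block-diagonal A
    A[2 * j:2 * j + 2, 2 * j:2 * j + 2] = blk

x = np.zeros(2 * len(kvecs))
x[0] = 1.0                                           # initial state [a_1, b_1, a_2, b_2, ...]
for n in range(24):
    x = A @ x                                        # x(t) = A x(t-1)

The same structure carries over unchanged if A is allowed to vary in time or if a forcing term B(t − 1)q(t − 1) is added.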
(4.14)
ξ (t + t) − 2ξ (t) + ξ (t − t) = 0,
(4.15)
which can be discretized as
Define x1 (t) = ξ (t), x2 (t) = ξ (t − t),
(a)
(b)
10 2 4 6 8 10
5
0 0
50
0 −0.05 5
10
−0.1
100
(c)
2 4 6 8 10
0.05
(d)
0.8 0.6 0.4
0 5
−0.2
0.2 10 5
10
5
10
−0.4
Figure 2.12 (a) Graph of the singular values of the coefficient matrix A of the numerical Neumann problem on a 10 × 10 grid. All λi are non-zero except the last one. (b) shows u100 , the nullspace vector of ET defining the solvability or consistency condition for a solution through uT100 y = 0. Plotted as mapped onto the two-dimensional spatial grid (r x , r y ) with x = y = 1. The interpretation is that the sum of the influx through the boundaries and from interior sources must √ vanish. Note that corner derivatives differ from other boundary derivatives by 1/ 2. The corresponding v100 is a constant, indeterminate with the information available, and not shown. (c) A source b (a numerical delta function) is present, not satisfying the solvability condition uT100 b = 0, because all boundary fluxes were set to vanish. (d) The particular SVD solution, x˜ , at rank K = 99. One confirms that A˜x − b is proportional to u100 as the source is otherwise inconsistent with no flux boundary conditions. With b a Kronecker delta function at one grid point, this solution is a numerical Green function for the Neumann problem and insulating boundary conditions.
Figure 6.2 (colour plate; caption only) Geostrophic velocity (colors, in cm/s) relative to a 1000-decibar reference level, or the bottom, whichever is shallower, and isopycnal surfaces (thick lines) used to define ten layers in the constraints. Part (a) is for 36°N, (b) for the Florida Straits, and (c) for 24°N east of the Florida Straits. Levels-of-no-motion corresponding to the initial calculation are visible at 1000 m in the deep sections, but lie at the bottom in the Florida Straits. Note the greatly varying longitude, depth, and velocity scales. (Axes: depth in meters versus longitude.)
Figure 6.17 (colour plate; caption only) Density field constructed from the measurements at the positions in Fig. 6.16. Technically, these contours are of so-called neutral density (γ^n), but for present purposes they are indistinguishable from the potential density. (Axes: depth in meters versus along-track distance in km and latitude.)
Figure 6.18 (colour plate; caption only) Same as Fig. 6.17, except showing the silicate concentration – a passive tracer – in μmol/kg. (Axes: depth in meters versus along-track distance in km and latitude.)
[Colour plate, caption not recovered: "Global Circulation Summary," showing the standard solution transports (Sv) by neutral density (γ^n) layer.]