NOISY INFORMATION AND COMPUTATIONAL COMPLEXITY

Leszek Plaskota
Institute of Applied Mathematics and Mechanics
University of Warsaw
Warsaw, Poland
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi

Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521553681
© Cambridge University Press 1996
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published 1996
A catalogue record for this publication is available from the British Library ISBN 9780521553681 hardback Transferred to digital printing 2009
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. Information regarding prices, travel timetables and other factual information given in this work is correct at the time of first printing, but Cambridge University Press does not guarantee the accuracy of such information thereafter.
To my wife Aleksandra, and our daughters Kinga, Klaudia, and Ola.
Contents

Preface
List of symbols

1 Overview

2 Worst case setting
2.1 Introduction
2.2 Information, algorithms, approximation
2.3 Radius and diameter of information
2.4 Affine algorithms for linear functionals
2.4.1 Existence of optimal affine algorithms
2.4.2 The case of Hilbert noise
2.5 Optimality of spline algorithms
2.5.1 Splines and smoothing splines
2.5.2 α-smoothing splines
2.6 Special splines
2.6.1 The Hilbert case with optimal α
2.6.2 Least squares and regularization
2.6.3 Polynomial splines
2.6.4 Splines in r.k.h.s.
2.7 Varying information
2.7.1 Nonadaptive and adaptive information
2.7.2 When does adaption not help?
2.8 Optimal nonadaptive information
2.8.1 Linear problems in Hilbert spaces
2.8.2 Approximation and integration of Lipschitz functions
2.9 Complexity
2.9.1 Computations over the space G
2.9.2 Cost and complexity, general bounds
2.10 Complexity of special problems
2.10.1 Linear problems in Hilbert spaces
2.10.2 Approximation and integration of Lipschitz functions
2.10.3 Multivariate approximation in a Banach space

3 Average case setting
3.1 Introduction
3.2 Information and its radius
3.3 Gaussian measures on Banach spaces
3.3.1 Basic properties
3.3.2 Gaussian measures as abstract Wiener spaces
3.4 Linear problems with Gaussian measures
3.4.1 Induced and conditional distributions
3.4.2 Optimal algorithms
3.5 The case of linear functionals
3.6 Optimal algorithms as smoothing splines
3.6.1 A general case
3.6.2 Special cases
3.6.3 Relations to worst case setting
3.7 Varying information
3.7.1 Nonadaptive and adaptive information
3.7.2 Adaption versus nonadaption
3.8 Optimal nonadaptive information
3.8.1 Linear problems with Gaussian measures
3.8.2 Approximation and integration on the Wiener space
3.9 Complexity
3.9.1 Adaption versus nonadaption
3.9.2 Complexity bounds
3.10 Complexity of special problems
3.10.1 Linear problems with Gaussian measures
3.10.2 Approximation and integration on the Wiener space

4 Worst-average case setting
4.1 Introduction
4.2 Affine algorithms for linear functionals
4.2.1 The one dimensional problem
4.2.2 Almost optimality of affine algorithms
4.2.3 Relations to other settings
4.3 Approximation of operators
4.3.1 Ellipsoidal problems in ℝⁿ
4.3.2 The Hilbert case

5 Average-worst case setting
5.1 Introduction
5.2 Linear algorithms for linear functionals
5.2.1 The one dimensional problem
5.2.2 Almost optimality of linear algorithms
5.2.3 Relations to other settings
5.3 Approximation of operators

6 Asymptotic setting
6.1 Introduction
6.2 Asymptotic and worst case settings
6.2.1 Information, algorithm and error
6.2.2 Optimal algorithms
6.2.3 Optimal nonadaptive information
6.3 Asymptotic and average case settings
6.3.1 Optimal algorithms
6.3.2 Convergence rate
6.3.3 Optimal nonadaptive information

References
Author index
Subject index
Preface
In the modern world, the importance of information can hardly be overestimated. Information also plays a prominent role in scientific computations. A branch of computational complexity which deals with problems for which information is partial, noisy and priced is called information-based complexity.
In a number of information-based complexity books, the emphasis was on partial and exact information. In the present book, the emphasis is on noisy information. We consider deterministic and random noise. The analysis of noisy information leads to a variety of interesting new algorithms and complexity results. The book presents a theory of computational complexity of continuous problems with noisy information. A number of applications are also given.

It is based on the results of many researchers in this area (including the results of the author), as well as new results not published elsewhere. This work would not have been completed if I had not received support from many people. My special thanks go to H. Woźniakowski, who encouraged me to write such a book and was always ready to offer his help. I appreciate the considerable help of J.F. Traub. I would also like to thank M. Kon, A. Werschulz, E. Novak, K. Ritter and other colleagues for their valuable comments on various portions of the manuscript. I wish to express my thanks to the Institute of Applied Mathematics and Mechanics at the University of Warsaw, where the book was almost entirely written. Some parts of the book were prepared at the Mathematical Institute of the Erlangen-Nuremberg University and in the Columbia University Computer Science Department. And finally, I am pleased to acknowledge the substantial support of my wife Aleksandra, my daughters Kinga, Klaudia and Ola, as well as my whole family, who patiently waited for the completion of this work.
List of symbols

S : F → G                      solution operator
g = S(f)                       exact solution
E ⊂ F                          set of problem elements
μ                              a priori distribution (measure) on F
N : F → Y                      exact information operator
N(f)                           exact information about f
Δ                              precision vector
Σ                              correlation matrix
ℕ : F → 2^Y, ℕ = {N, Δ}        (noisy) information operator (deterministic)
Π = {π_f}_f, Π = {N, Σ}        information distribution (random)
y ∈ ℕ(f)                       information about f (deterministic)
y ∼ π_f                        information about f (random)
x = y − N(f)                   information noise
δ, σ²                          noise level (bound and variance)
φ : Y → G                      algorithm
φ(y)                           approximate solution (approximation)
e^wor(ℕ, φ), e^wor(ℕ, φ; E)    worst case error of the algorithm φ
e^ave(Π, φ), e^ave(Π, φ; μ)    average error of φ
e^w-a(ℕ, φ), e^w-a(ℕ, φ; E)    worst-average error of φ
e^a-w(ℕ, φ), e^a-w(ℕ, φ; μ)    average-worst error of φ
rad^wor(ℕ), rad^wor(ℕ; E)      worst case radius of information ℕ
rad^ave(Π), rad^ave(Π; μ)      average radius of information Π
diam(ℕ)                        diameter of information
Λ                              class of permissible functionals
𝒩_n                            a class of exact information
r^wor(δ), r^wor(Δ)             minimal radius (worst case)
r^ave(σ²), r^ave(Σ)            minimal radius (average case)
c(δ), c(σ²)                    cost function
g                              cost of operations in G
R(T)                           Tth minimal radius
P                              program
cost^wor(P), cost^ave(P)       cost of executing P
comp^wor(ℕ, φ)                 worst case complexity of φ using ℕ
comp^ave(Π, φ)                 average complexity of φ using Π
IC^non(ε)                      (nonadaptive) information ε-complexity
Comp^wor(ε), Comp^ave(ε)       ε-complexity (of a problem)
ℝ                              reals
A*                             adjoint operator to A
App                            approximation problem
Int                            integration problem
R(·, ·)                        reproducing kernel (covariance kernel)
G_N                            Gram matrix for exact information N
tr(·)                          trace of an operator
{H, F}                         abstract Wiener space
w, w_r                         classical and r-fold Wiener measure
C_μ                            correlation operator of the measure μ
μ(· | y)                       conditional distribution on F
W_2^r                          r-fold square integrable functions
C^r                            r-fold continuous functions
N(a, Σ)                        Gaussian (normal) distribution on ℝⁿ
a_n ≈ b_n, a(ε) ≈ b(ε)         weak equivalence of sequences, of functions
a_n ≍ b_n, a(ε) ≍ b(ε)         strong equivalence of sequences, of functions
1 Overview
In the process of doing scientific computations we always rely on some information. In practice, this information is typically noisy, i.e., contaminated by error. Sources of noise include previous computations, inexact measurements, transmission errors, arithmetic limitations, and an adversary's lies.
Problems with noisy information have always attracted considerable attention from researchers in many different scientific fields, e.g., from statisticians, engineers, control theorists, economists, and applied mathematicians. There is also a vast literature, especially in statistics, where noisy information is analyzed from different perspectives. In this monograph, noisy information is studied in the context of the computational complexity of solving mathematical problems. Computational complexity focuses on the intrinsic difficulty of problems, as measured by the minimal amount of time, memory, or elementary operations necessary to solve them. Information-based complexity (IBC) is a branch of computational complexity that deals with problems for which the available information is partial, noisy, and priced.

Information being partial means that the problem is not uniquely determined by the given information. Information is noisy since it may be contaminated by error. Information is priced since we must pay for getting it. These assumptions distinguish IBC from combinatorial complexity, where information is complete, exact, and free. Since information about the problem is partial and noisy, only approximate solutions are possible. Approximations are obtained by algorithms that use this information. One of the main goals of IBC is to find the ε-complexity of the problem, i.e., the intrinsic cost of computing an approximation with accuracy ε.

Partial, noisy and priced information is typical of many problems arising in different scientific fields. These include, for instance, signal processing, control theory, computer vision, and numerical analysis. As a rule, a digital computer is used to perform scientific computations. A computer can only use a finite set of numbers, and usually these numbers cannot be entered exactly into the computer's memory. Hence problems described by infinitely many parameters can be `solved' only by using partial and noisy information.

The theory of optimal algorithms for solving problems with partial information has a long history. It can be traced back to the pioneering papers of Sard (1949), Nikolskij (1950), and Kiefer (1953). A systematic and unified approach to such problems was first presented by J.F. Traub and H. Woźniakowski in the monograph A General Theory of Optimal Algorithms, Academic Press, 1980. This was an important stage in the development of the theory of IBC. The monograph was followed by Information, Uncertainty, Complexity, Addison-Wesley, 1983, and Information-Based Complexity, Academic Press, 1988, both authored by J.F. Traub, G.W. Wasilkowski, and H. Woźniakowski. Computational complexity of approximately solved problems is also studied in Problem Complexity and Method Efficiency in Optimization by A.S. Nemirovski and D.B. Yudin, Wiley and Sons, 1983, Deterministic and Stochastic Error Bounds in Numerical Analysis by E. Novak, Springer-Verlag, 1988, and The Computational Complexity of Differential and Integral Equations by A.G. Werschulz, Oxford University Press, 1991.

Relatively few IBC papers study noisy information. One reason is the technical difficulty of its analysis. A second reason is that even if we are primarily interested in noisy information, the results on exact information establish a benchmark. Negative results for exact information are usually applicable to the noisy case. On the other hand, it is not clear whether positive results for exact information have a counterpart for noisy information.
In the mathematical literature, the word `noise' is used mainly by statisticians to mean random error that contaminates experimental observations. We also want to study deterministic error; therefore, by noise we mean random or deterministic error. Moreover, in our model, the source of the information noise is not important. We may say that `information is observed' or that it is `computed'. We also stress that the case of exact information is not excluded, either in the model or in most results. Exact information is obtained as a special case by setting the noise level to zero. This permits us to study the dependence of the results on the noise level, and to compare the noisy and exact information cases.

The general IBC model covers a large variety of problems. In this monograph, we are mainly interested in linear problems, i.e., problems which can be described in terms of approximating a linear operator from noisy information about values of some linear functionals. Examples include function approximation and integration, where information is given by noisy function values. For linear problems, many effective algorithms are already known. Among them, algorithms based on smoothing splines and regularization, or the least squares algorithm, are some of the most frequently used in computations. We shall see that the effectiveness of these algorithms can be confirmed in the IBC model.

In general, optimal algorithms and problem complexity depend on the setting. The setting is specified by the way the error and cost of algorithms are defined. In this monograph we study:

worst case setting,
average case setting,
worst-average case setting,
average-worst case setting,
asymptotic setting.
In the worst case setting, the error and cost are defined by their worst performance. In the average case setting, we consider the average error and cost. The mixed worst-average and average-worst case settings are obtained by combining the worst and average cases. In the asymptotic setting, we are interested in the asymptotic behavior of algorithms. Other settings, such as the probabilistic and randomized settings, are also important and will be topics of future research. Despite the differences, the settings have certain features in common. For instance, the smoothing spline, regularization and least squares algorithms possess optimality properties independent of the setting. This shows that these algorithms are universal and robust.

Most of the research presented in this monograph has been done over the last six to seven years by different people, including the author. Some of the results have not been previously reported. The references to the original results are given in the Notes and Remarks at the end of each section. Clearly, the author does not pretend to cover the whole subject of noisy information in one monograph. Only those topics are presented that are typical of IBC, or are needed for the complexity analysis. Many problems are still open; some of these are indicated in the text.

The monograph consists of six chapters. We start with the worst case setting in Chapter 2. Chapter 3 deals with the average case setting. Each of these two settings is studied following the same scheme. We first look for the best algorithms that use fixed information. Then we allow the information to vary and seek optimal information. We also analyze adaptive information and the problem of adaption versus nonadaption. Finally, complexity concepts are introduced, and complexity results are presented for some particular problems. Chapters 4 and 5 deal with the mixed settings, and Chapter 6 with the asymptotic setting. Each chapter consists of several sections, most followed by Notes and Remarks, and Exercises. A preview of the results is presented in the introduction to each chapter.
2 Worst case setting
2.1 Introduction

In this chapter we study the worst case setting. We shall present results already known, as well as some new ones. As already mentioned in the Overview, precise information about what is known and what is new can be found in the Notes and Remarks.

Our major goal is to obtain tight complexity bounds for the approximate solution of linear continuous problems that are defined on infinite dimensional spaces. We first explain what is to be approximated and how an approximation is obtained. Thus we carefully introduce the fundamental concepts of solution operator, noisy information and algorithm. Special attention will be devoted to information, which is most important in our analysis. Information is, roughly speaking, what we know about the problem to be solved. A crucial assumption is that information is noisy, i.e., it is given not exactly, but with some error. Since information is usually partial (i.e., many elements share the same information) and noisy, it is impossible to solve the problem exactly. We have to be satisfied with only approximate solutions. They are obtained by algorithms that use information as data.

In the worst case setting, the error of an algorithm is given by its worst performance over all problem elements and possible information. A sharp lower bound on the error is given by a quantity called the radius of information. We are obviously interested in algorithms with minimal error. Such algorithms are called optimal. In Sections 2.4 to 2.6 we study optimal algorithms and investigate whether they can be linear or affine. In many cases the answer is affirmative. This is the case for approximation of linear functionals, and for approximation of operators that act between spaces endowed with Hilbert seminorms, assuming that information is linear with noise bounded in a Hilbert seminorm. The optimal linear algorithms are based on the well known smoothing splines. This confirms a common opinion that smoothing splines are a very good practical tool for constructing approximations. We show that in some special cases smoothing splines are closely related to the least squares and regularization algorithms.

When using smoothing splines or regularization, we need to know how to choose the smoothing or regularization parameters. Often, special methods, such as cross-validation, are developed to find them. We show how to choose the smoothing and regularization parameters optimally in the worst case setting, and how this choice depends on the noise level and the domain of the problem. It turns out that in some cases the regularization parameter is independent of the noise level, provided that we have a sufficiently small bound on the noise.

In Sections 2.7 and 2.8 we allow not only algorithms but also information to vary. We assume that information is obtained by successive noisy observations (or computations) of some functionals. The choice of functionals and noise bounds depends on us. We stress that we do not exclude the case where errors coming from different observations are correlated. This also allows us to model information whose noise is bounded, say, in a Hilbert norm. With varying information, it is important to know whether adaption can lead to better approximations than nonadaption. We give sufficient conditions under which adaption is not better than nonadaption. These conditions are satisfied, for instance, if we use linear information whose noise is bounded in some norm.

Then we study the optimal choice of observations with given precision. This is in general a difficult problem; therefore we establish complete results only for two classes of problems. The first class consists of approximating compact operators acting between Hilbert spaces, where the noise is bounded in a weighted Euclidean norm. In particular, it turns out that in this case the error of approximation can be arbitrarily reduced by using observations of fixed precision. This does not hold for noise bounded in the supremum norm. When using this norm, to decrease the error of approximation we have to perform observations with higher precision. We stress that observations with noise bounded in the supremum norm seem to be most often used in practice. Exact formulas for the minimal errors are in this case obtained for approximating Lipschitz functions based on noisy function values.
In Section 2.9 we present the model of computation and define the ε-complexity of a problem as the minimal cost needed to obtain an approximation with (worst case) error at most ε. In the worst case setting, the cost of approximation is measured by the worst performance of an algorithm over all problem elements. In general, the cost of successive observations depends on their precisions. However, the model also covers the case when only observations with a given, fixed precision are allowed.

The complexity results are obtained using the previously established results on optimal algorithms, adaption and optimal information. We first give tight general bounds on the ε-complexity. It turns out that if the optimal algorithms are linear (or affine), then in many cases the cost of combining information is much less than the cost of gaining it. In such a case, the problem complexity is roughly equal to the information complexity, which is defined as the minimal cost of obtaining information that guarantees approximation within error ε. This is the reason why we are so interested in the existence of optimal linear or affine algorithms.

In the last section we specialize the general complexity results to some specific problems. First, we consider approximation of compact operators in Hilbert spaces, where the information is linear with noise bounded in a weighted Euclidean norm. We obtain sharp upper and lower complexity bounds. We also investigate how the complexity depends on the cost assigned to each precision. Next, we derive the ε-complexity of approximating and integrating Lipschitz functions. For a fixed positive bound on the noise, the complexity is infinite for sufficiently small ε. To make the complexity finite for all positive ε, we have to allow observations with arbitrary precision. Then the ε-complexity is roughly attained by information that uses observations of function values at equidistant points with the same precision, proportional to ε. Finally, we consider approximation of smooth multivariate functions in a Banach space. We assume that the noise of successive observations is bounded in the absolute or relative sense. We show that in either case the ε-complexity is roughly the same, and is achieved by polynomial interpolation based on data about function values at equispaced points with noise bounds proportional to ε.
2.2 Information, algorithms, approximation

Let F be a linear space and G a normed space, both over the reals. Let

S : F → G

be a mapping, called a solution operator. We are mainly interested in linear S. However, for the general presentation of the basic concepts we do not have to put any restrictions on S. We wish to approximate S(f) for f belonging to a given set

E ⊂ F

of problem elements. An approximation is constructed based only on some information about f. We now explain precisely what we mean by information and how the approximations are obtained.

An information operator (or simply information) is a mapping

ℕ : F → 2^Y,

where Y is a set of finite real sequences, Y ⊂ ⋃_{n≥1} ℝⁿ. That is, ℕ(f) is a subset of Y. We assume that ℕ(f) is nonempty for all f ∈ F. Any element y ∈ ℕ(f) will be called information about f. Note that knowing y, we conclude that f is a member of the set { f₁ ∈ F | y ∈ ℕ(f₁) }. This yields some information about the element f, and justifies the names for ℕ and y.

If the set ℕ(f) has exactly one element for all f ∈ F, the information ℕ is called exact. In this case, ℕ will be identified with the operator N : F → Y, where N(f) is the unique element of ℕ(f). If there exists f for which ℕ(f) has at least two elements, we say that ℕ is noisy.

Knowing the information y about f, we combine it to get an approximation. More precisely, the approximation is produced by an algorithm, which is given as a mapping

φ : Y → G.

The algorithm takes the information obtained as data. Hence the approximation to S(f) is φ(y), where y is information about f. The error of approximation is defined by the distance ‖S(f) − φ(y)‖, where ‖·‖ is the norm in the space G.

We illustrate the concepts of noisy information and algorithm by three simple examples.
Example 2.2.1 Suppose we want to approximate a real number (parameter) f based on its perturbed value y, |y − f| ≤ δ. This corresponds to F = G = ℝ and S(f) = f. The information is of the form

ℕ(f) = { y ∈ ℝ | |y − f| ≤ δ }.

For δ = 0 the information is exact, and for δ > 0 we have noisy information. An algorithm is a mapping φ : ℝ → ℝ. For instance, it may be given as φ(y) = y.
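The error of the algorithm φ(y) = y in this example is easy to compute: whatever f is, the worst admissible data are y = f ± δ, so the worst case error equals δ. A minimal numeric sketch (the function name is mine, for illustration only, not from the book):

```python
# Worst case error of phi(y) = y for Example 2.2.1:
# S(f) = f and information N(f) = { y : |y - f| <= delta }.
# The worst admissible observations are y = f - delta and y = f + delta,
# so the error equals delta for every problem element f.

def worst_case_error_identity(delta, f_values):
    return max(
        abs(f - y)
        for f in f_values
        for y in (f - delta, f + delta)   # extreme admissible data
    )

err = worst_case_error_identity(0.1, [-1.0, 0.0, 2.5])
```

Since the radius of this information is also δ, the identity algorithm is in fact optimal here.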
Example 2.2.2 Suppose we want to approximate a smooth function based on noisy function values at n points. This can be modeled as follows.

Let F be the space of twice continuously differentiable real functions f : [0, 1] → ℝ. We approximate f ∈ F in the norm of the space G = L₂(0, 1). That is, S(f) = f. For tᵢ ∈ [0, 1], the information operator is given by

ℕ(f) = { y ∈ ℝⁿ | Σᵢ₌₁ⁿ ( yᵢ − f(tᵢ) )² ≤ δ² }.

An example of an algorithm is provided by the smoothing spline. For γ > 0, this is defined as the function φ_γ(y) which minimizes the functional

η_γ(f, y) = γ ∫₀¹ ( f″(t) )² dt + Σᵢ₌₁ⁿ ( yᵢ − f(tᵢ) )²

over all f ∈ F.
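A discretized version of this smoothing spline can be computed by solving a small linear system. The grid discretization and all names below are my own illustration, not the book's construction; the second derivative is replaced by second differences on a uniform grid:

```python
# Sketch of the smoothing-spline algorithm phi_gamma of Example 2.2.2:
# minimize  gamma * int_0^1 (f'')^2 dt + sum_i (y_i - f(t_i))^2
# over grid functions on [0,1], with f'' replaced by second differences.

def smoothing_spline(data_idx, y, m, gamma):
    h = 1.0 / (m - 1)
    # Normal equations: (gamma*h/h^4 * D^T D + E^T E) f = E^T y, where
    # D holds the second-difference stencils and E samples the data indices.
    A = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for j in range(1, m - 1):
        stencil = ((j - 1, 1.0), (j, -2.0), (j + 1, 1.0))
        for p, cp in stencil:
            for q, cq in stencil:
                A[p][q] += gamma * h * cp * cq / h**4
    for yi, k in zip(y, data_idx):
        A[k][k] += 1.0
        b[k] += yi
    return gauss_solve(A, b)

def gauss_solve(A, b):
    # Plain Gaussian elimination with partial pivoting (self-contained).
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            fac = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * m
    for c in reversed(range(m)):
        x[c] = (M[c][m] - sum(M[c][k] * x[k] for k in range(c + 1, m))) / M[c][c]
    return x

# Data taken from a linear function has zero bending energy, so the
# spline reproduces it on the grid for any gamma > 0.
m = 11
idx = [0, 3, 7, 10]
y = [2 * (k / (m - 1)) + 1 for k in idx]
f = smoothing_spline(idx, y, m, gamma=0.5)
```

The last check illustrates a general feature of smoothing splines: functions annihilated by the penalty (here, linear ones) are fitted exactly when the data are consistent with them.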
Example 2.2.3 Let F be as in Example 2.2.2, or another `nice' class of smooth functions. The problem now is to approximate the integral of f based on noisy function values f(tᵢ) with different precisions. That is, the solution operator is given as

S(f) = ∫₀¹ f(t) dt

and information is defined as

ℕ(f) = { y ∈ ℝⁿ | |yᵢ − f(tᵢ)| ≤ δᵢ, 1 ≤ i ≤ n }.

An example of an algorithm is a quadrature formula φ(y) = Σᵢ₌₁ⁿ aᵢ yᵢ.
In all the above examples, the information operators belong to a common class. This class is defined in the following way.

An extended seminorm in a linear space X is a functional ‖·‖_X : X → [0, +∞] such that the set X₁ = { x ∈ X | ‖x‖_X < +∞ } is a linear subspace, and ‖·‖_X is a seminorm on X₁. That is,

(a) ‖αx‖_X = |α| ‖x‖_X, for all α ∈ ℝ and x ∈ X₁,
(b) ‖x₁ + x₂‖_X ≤ ‖x₁‖_X + ‖x₂‖_X, for all x₁, x₂ ∈ X₁.

We say that ℕ is a linear information operator with noise bounded in an extended seminorm if it has the form ℕ(f) = { y ∈ ℝⁿ | ‖y − N(f)‖_Y ≤ δ }, where N : F → ℝⁿ is a linear operator, ‖·‖_Y is an extended seminorm on ℝⁿ, and δ ≥ 0. That is, information y about f is a noisy
value of exact (linear) information N(f), and the noise x = y − N(f) is bounded by δ in the extended seminorm ‖·‖_Y. For instance, in Example 2.2.2 we have

N(f) = [ f(t₁), f(t₂), ..., f(tₙ) ].

As the extended seminorm ‖·‖_Y we may take the Euclidean norm, ‖x‖_Y = ‖x‖₂ = ( Σⱼ₌₁ⁿ xⱼ² )^{1/2}. In Example 2.2.3, the operator N is as above and ‖x‖_Y = max_{1≤i≤n} |xᵢ|/δᵢ, with δ = 1.

sup_{y∈Y₀} r(A(y)) = rad^wor(ℕ),
and consequently

inf_φ e^wor(ℕ, φ) ≥ rad^wor(ℕ).

To prove the reverse inequality, it suffices to observe that for any ε > 0 it is possible to select elements φ_ε(y), y ∈ Y₀, such that

sup_{f∈E(y)} ‖S(f) − φ_ε(y)‖ ≤ r(A(y)) + ε.
inf_{g∈G} sup_{a∈A} ‖a − g‖ ≥ inf_{g∈G} sup_{a∈A} ( ‖a − g‖ + ‖(2a* − a) − g‖ ) / 2
    ≥ inf_{g∈G} sup_{a∈A} ‖a − a*‖ = sup_{a∈A} ‖a − a*‖,

which shows that a* is a center. To prove the remaining equality, observe that

d(A) ≥ sup_{a∈A} ‖a − (2a* − a)‖ = 2 sup_{a∈A} ‖a − a*‖.
The diameter of information ℕ is defined as

diam(ℕ) = sup_{y∈Y₀} d(A(y)).

Observe that in view of the equality

d(A(y)) = sup{ ‖S(f₋₁) − S(f₁)‖ | f₋₁, f₁ ∈ E, y ∈ ℕ(f₋₁) ∩ ℕ(f₁) },

the diameter of information can be rewritten as

diam(ℕ) = sup ‖S(f₋₁) − S(f₁)‖,

where the supremum is taken over all f₋₁, f₁ ∈ E such that ℕ(f₋₁) ∩ ℕ(f₁) ≠ ∅. Thus diam(ℕ) measures the largest distance between two elements in S(E) which cannot be distinguished with respect to the information.

The diameter of information is closely related to the radius, though its definition is independent of the notion of an algorithm. That is, in view of (2.4), we have the following fact.
Theorem 2.3.2 For any information ℕ, diam(ℕ) = c · rad^wor(ℕ), where c = c(ℕ) ∈ [1, 2].
In general, c depends on the information, the solution operator and the set E. However, in some cases it turns out to be an absolute constant.
Example 2.3.2 Let S be a functional, i.e., let the range space G = ℝ. Then for any set A ⊂ ℝ we have d(A) = 2 r(A), and the center of A is (sup A + inf A)/2. Hence for any information ℕ the constant c in Theorem 2.3.2 is equal to 2.
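For a finite set A ⊂ ℝ these quantities can be verified by brute force (a small sketch of mine, not from the book):

```python
# Center, radius and diameter of a bounded set A of reals, as in
# Example 2.3.2: the center is (sup A + inf A)/2 and d(A) = 2 r(A).

def center_radius_diameter(A):
    lo, hi = min(A), max(A)
    center = (lo + hi) / 2
    radius = max(abs(a - center) for a in A)
    diameter = max(abs(a - b) for a in A for b in A)
    return center, radius, diameter

c, r, d = center_radius_diameter([-1.0, 0.5, 2.0, 3.0])
# c = 1.0, r = 2.0, d = 4.0, so d == 2 * r
```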
The relation between the radius and diameter of information allows us to show `almost' optimality of an important class of algorithms. An algorithm φ_itp is called interpolatory if for all y ∈ Y₀

φ_itp(y) = S(f_y),

for an element f_y ∈ E(y). Since S(f_y) is a member of A(y), for any f ∈ E(y) we have

‖S(f) − φ_itp(y)‖ = ‖S(f) − S(f_y)‖ ≤ d(A(y)).

E 2.3.7 Define the solution operator S̄ : F × Y → G, exact information operator N̄ : F × Y → Y, and set Ē ⊂ F × Y as

S̄(f, y) = S(f),    N̄(f, y) = y,    Ē = { (f, y) | f ∈ E, y ∈ ℕ(f) }.

Show that for any algorithm φ : Y → G we have

e^wor(N̄, φ; S̄, Ē) = e^wor(ℕ, φ; S, E),

where the first quantity stands for the error of φ over Ē, in approximating S̄(f, y) based on exact information y = N̄(f, y).
E 2.3.8 Show that information whose graph gr(ℕ; E) is convex and balanced satisfies the conditions (2.5) and (2.6) of Lemma 2.3.1.

E 2.3.9 Let

ℕ(f) = { y ∈ ℝⁿ | (y − N(f)) ∈ B },

where N : F → ℝⁿ is linear and B is a given subset of ℝⁿ. Show that if both the sets B and E are convex (and balanced), then the graph gr(ℕ; E) is convex (and balanced).
2.4 Affine algorithms for linear functionals

We start the study of problems with the case of the solution operator S being a linear operator. In this section, we assume that S is a linear functional. We are especially interested in finding optimal linear or affine algorithms.

2.4.1 Existence of optimal affine algorithms

Since now the space G = ℝ, we have

diam(ℕ) = 2 rad^wor(ℕ) = sup_{y∈Y₀} ( sup A(y) − inf A(y) ),

where Y₀ = ⋃_{f∈E} ℕ(f) and A(y) = { S(f) | f ∈ E, y ∈ ℕ(f) }. The algorithm

φ(y) = ( sup A(y) + inf A(y) ) / 2

is optimal and also central. We now ask if there exists an optimal algorithm which is linear or affine. It is easily seen that, in general, this is not true.
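The central algorithm above can be checked numerically on a toy problem of my own construction (not from the book): observe one coordinate of f with bounded noise, tabulate A(y) on a grid, and verify that the midpoint guess is worst case optimal among constant guesses.

```python
# Central algorithm phi(y) = (sup A(y) + inf A(y))/2 for a toy problem:
# F = R^2, E = [-1,1]^2, S(f) = f1 + f2, noisy data y about f1 only,
# with |y - f1| <= delta.

def A_set(y, delta, grid):
    # values S(f) over all f in E consistent with the information y
    return [f1 + f2
            for f1 in grid if abs(y - f1) <= delta
            for f2 in grid]

grid = [i / 50 - 1 for i in range(101)]   # grid on [-1, 1]
y, delta = 0.3, 0.1
vals = A_set(y, delta, grid)

center = (max(vals) + min(vals)) / 2
radius = max(abs(v - center) for v in vals)

# any other constant guess g does no better in the worst case
for g in (0.0, y, center + 0.5):
    assert max(abs(v - g) for v in vals) >= radius - 1e-12
```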
Example 2.4.1 Let F = ℝ² and

E = { f = (f₁, f₂) ∈ ℝ² | f₂ = f₁², |f₁| ≤ 1 }.

For u = (0, g) with g > 0, we have p(u) = g/r. Indeed, Lemma 2.3.1 yields

r = sup { |S(h)| | h ∈ bal(E), 0 ∈ ℕ(h) } = sup { a ∈ ℝ | (0, a) ∈ A₁ }.

Hence the infimum over all t > 0 such that (0, g/t) ∈ A₁ is equal to g/r. Recall that p(u) is a seminorm on the space P = { u ∈ ℝⁿ⁺¹ | p(u) < +∞ }.
S(h_y) = S(h₀) + (y/δ) · ( S(h_δ) − S(h₀) )
    ≥ r(0) + (y/δ) · ( r(δ) − r(0) ) − ε (1 + y/δ).

Letting ε → 0, we obtain the desired inequality K(y) ≥ K(δ).
We now prove that K(δ) is bounded. To this end, let φ_aff be the optimal affine algorithm for δ = 0. Then φ_lin(y) = φ_aff(y) − φ_aff(0) is a linear functional whose norm is

‖φ_lin‖_Y = sup_{‖x‖_Y ≤ 1} |φ_lin(x)|.

Assume now that δ > 0 and that the norm ‖·‖_Y is induced by an inner product ⟨·,·⟩_Y. Clearly, in this case the graph gr(ℕ, E) is convex, and an optimal affine algorithm exists. We also assume that the radius r = rad^wor(ℕ) is finite and is attained. That is, there exists h* = (f₁* − f₋₁*)/2 ∈ bal(E), with f₋₁*, f₁* ∈ E, such that ‖N(h*)‖_Y ≤ δ and r = S(h*). We shall see later that the latter assumption is not restrictive.
For two elements f_1i f1 E F, let I = I(f_1, f1) denote the interval I = {a f_1 + (1  a) f1 I 0 < a < 1}. It is clear that if f_1, fi E E then I(f_1, fi) C E and radwo`(N; I) radw"(N; E). In words, this
means that any one dimensional subproblem is at least as difficult as the original problem. Furthermore, for I* = I(f_{-1}*, f_1*) we have rad^wor(N; E) = rad^wor(N; I*) (compare with E 2.3.5). Hence the problem of approximating S(f) for f belonging to the one dimensional subset I* of E is as difficult as the problem of approximating S(f) for f in E. We shall say, for brevity, that I* is the hardest one dimensional subproblem contained in E. In particular, any algorithm which is optimal for E is also optimal for I*. The latter observation yields a method of finding all optimal affine algorithms: it suffices to find all such algorithms for I* and then check which of them do not increase the error when taken over the whole set E. In the sequel, we follow this approach.

Observe first that if ||N(h*)||_Y < delta then the only optimal affine algorithm is constant, phi(y) = S(f_0), where f_0 = (f_{-1}* + f_1*)/2. Indeed, let y = N(f_0) + x where ||x||_Y <= delta - ||N(h*)||_Y. Then y is noisy information for any f in I*, and therefore phi_aff(y) = S(f_0). Hence phi_aff is constant on a nontrivial ball, and its unique affine extension to R^n is phi_aff = S(f_0). In what follows, we assume that ||N(h*)||_Y = delta.
Lemma 2.4.2 For the one dimensional subproblem I*, all optimal affine algorithms are given as

    phi_aff(y) = S(f_0) + d . < y - N(f_0), w >_Y,                    (2.10)

where w = N(h*)/||N(h*)||_Y and d = cr/delta, for any c in [0, 1].
Proof Let y_0 = N(f_0) and w* = N(h*). For y_alpha = y_0 + alpha w*, alpha in R, the set of all elements which are in the interval S(I*) and cannot be distinguished with respect to the information y_alpha is given as S(I*) intersected with B(S(f_0) + alpha r, r), where B(a, tau) is the ball with center a and radius tau. From this it follows that for any optimal affine algorithm phi_aff we have

    phi_aff(y_alpha) = S(f_0) + c alpha r,                            (2.11)

where 0 <= c <= 1. Since alpha = < y_alpha - y_0, w >_Y / delta, (2.11) can be rewritten as

    phi_aff(y_alpha) = S(f_0) + c . (r/delta) . < y_alpha - y_0, w >_Y.   (2.12)

We now show that for any c in [0, 1], the formula (2.12) is valid not only for y_alpha, but for all y in R^n. To this end, it is enough to show that
for any y = y_0 + x, where ||x||_Y <= delta and <x, w>_Y = 0, we have phi_aff(y) = phi_aff(y_0) = S(f_0). Indeed, let phi_aff(y) = S(f_0) + a where (without loss of generality) a > 0. Then phi_aff(y_0 + eps x) = S(f_0) + eps a. Since y_0 + eps x is noisy information for f_eps = f_0 - h* sqrt(1 - eps^2 ||x||_Y^2/delta^2), we obtain

    e^wor(N, phi_aff; I*) >= phi_aff(y_0 + eps x) - S(f_eps)
                          =  eps a + r sqrt(1 - eps^2 ||x||_Y^2/delta^2).

For small eps > 0, the last expression is greater than r, which contradicts the assumption that the algorithm phi_aff is optimal, and completes the proof.
So the affine algorithm (2.10) is optimal for the hardest one dimensional subproblem I*. We now wish to find the values of d for which (2.10) is an optimal algorithm for the original problem E. To this end, we first evaluate the error e^wor(N, phi_aff; E) of the algorithm (2.10). For any f in E and y = N(f) + x in N(f) we have

    S(f) - phi_aff(y) = S(f) - S(f_0) - d <N(f) - y_0, w>_Y - d <x, w>_Y
                      = S(f) - phi_aff(N(f)) - d <x, w>_Y.

Hence

    sup_{||x||_Y <= delta} |S(f) - phi_aff(y)| = |S(f) - phi_aff(N(f))| + d delta.   (2.13)

It follows that e^wor(N, phi_aff; E) <= r if and only if, for all f in E,

    S(f_1*) - S(f)    >= d < N(f_1*) - N(f), w >_Y,                   (2.15)
    S(f_{-1}*) - S(f) <= d < N(f_{-1}*) - N(f), w >_Y.                (2.16)
We now show that (2.15) and (2.16) are equivalent to

    S(h*) - S(h) >= d < N(h*) - N(h), w >_Y,   for all h in bal(E).   (2.17)
Indeed, let (2.15) and (2.16) hold. Then for any h = (f_1 - f_{-1})/2 in bal(E), with f_{-1}, f_1 in E, we have

    S(h*) - S(h) = (1/2) ( (S(f_1*) - S(f_1)) - (S(f_{-1}*) - S(f_{-1})) )
                 >= (1/2) d ( <N(f_1* - f_1), w>_Y - <N(f_{-1}* - f_{-1}), w>_Y )
                 =  d < N(h*) - N(h), w >_Y.

Suppose now that (2.17) holds. Let f in E. Then, for h = (f - f_{-1}*)/2 in bal(E), we have

    S(f_1*) - S(f) = 2 (S(h*) - S(h)) >= 2d <N(h*) - N(h), w>_Y
                   = d <N(f_1*) - N(f), w>_Y,

which shows (2.15). Similarly, taking h = (f_1* - f)/2 we obtain (2.16). Thus the number d should be chosen in such a way that (2.17) holds. This condition has a nice geometrical interpretation. That is, for gamma >= 0, let

    r(gamma) = sup { S(h) | h in bal(E), ||N(h)||_Y <= gamma }

be the radius of information N with the noise level delta replaced by gamma.
Lemma 2.4.3 The condition (2.17) holds if and only if the line with slope d passing through (delta, r(delta)) lies above the graph of r(gamma), i.e.,

    r(gamma) <= r(delta) + d (gamma - delta),   for all gamma >= 0.   (2.18)
Proof Observe first that (2.18) can be rewritten as

    S(h*) - S(h) >= d ( ||N(h*)||_Y - ||N(h)||_Y ),   for all h in bal(E).   (2.19)

Indeed, if (2.18) holds then for any h in bal(E), with gamma = ||N(h)||_Y, we have

    S(h*) - S(h) >= r(delta) - r(gamma) >= d (delta - gamma)
                 =  d ( ||N(h*)||_Y - ||N(h)||_Y ).

Let (2.19) hold. Then for any gamma >= 0 and eps > 0 there is h_eps in bal(E) such that ||N(h_eps)||_Y <= gamma ...

... ||w*||_Y. Hence the optimal affine algorithm is also constant, phi_aff = sup_{f in E} S(f) - r(delta).
So far we have not covered the exact information case delta = 0. It turns out, however, that exact information can be treated as the limiting case. Indeed, let phi_delta = d_delta <., w_delta>_Y + g_delta be the optimal affine algorithm for delta > 0, and let w_0 be an attraction point of {w_delta} as delta -> 0+. As lim_{delta->0} d_delta = r'(0+) and

    S(h) - d_delta <N(h), w_delta>_Y <= r(delta) - delta d_delta,   for all h in bal(E),

letting delta -> 0+ we obtain

    S(h) - r'(0+) <N(h), w_0>_Y <= r(0),   for all h in bal(E).

Hence, for g_0 = sup_{f in E} ( S(f) - r'(0+) <N(f), w_0>_Y ) - r(0), we have

    | S(f) - r'(0+) <N(f), w_0>_Y - g_0 | <= r(0),   for all f in E,

and the algorithm

    phi_0(y) = r'(0+) . <y, w_0>_Y + g_0
is optimal. (See also E 2.4.8 for another construction.) We end this section with a simple illustration of Theorem 2.4.3.
Example 2.4.2 Let F be the linear space of Lipschitz functions f : [0, 1] -> R that satisfy f(0) = f(1), i.e., f is 1-periodic. Let

    E = { f in F | |f(x_1) - f(x_2)| <= |x_1 - x_2|, for all x_1, x_2 }.

We want to approximate the integral of f, i.e.,

    S(f) = int_0^1 f(t) dt.

Noisy information is given by perturbed evaluations of function values at equidistant points, y = [y_1, ..., y_n] in R^n, where y_i = f(i/n) + x_i, 1 <= i <= n, and the noise satisfies ||x||_2 = (sum_i x_i^2)^{1/2} <= delta. Since S is a functional and the set E is convex and balanced, Theorem 2.3.3 yields
    rad^wor(N) = sup { int_0^1 f(t) dt  |  f in E, sum_{i=1}^n f^2(i/n) <= delta^2 }.
The last supremum is attained for

    h(t) = delta/sqrt(n) + 1/(2n) - | t - (2i - 1)/(2n) |,   t in [(i-1)/n, i/n],  1 <= i <= n.
For f in F, let f_j = <f, e_j>_F be the jth Fourier coefficient of f. Consider the problem of approximating a functional S = <., s>_F for f from the unit ball of F, based on noisy values of the Fourier coefficients, y_j = f_j + x_j, where |x_j| <= delta_j, 1 <= j <= n. Osipenko showed, in particular, that the optimal linear algorithm phi_opt is in this case given as follows. Let lambda in (0, ||s||_F] be the solution of

    ||s||_F^2 - sum_{j=1}^n ( |s_j|^2 - lambda^2 delta_j^2 )_+ - lambda^2 = 0.

Then

    phi_opt(y) = sum_{j=1}^n ( 1 - lambda delta_j |s_j|^{-1} )_+ s_j y_j,

and the radius is

    rad^wor(N) = lambda + sum_{j=1}^n delta_j ( |s_j| - lambda delta_j )_+.
Exercises

E 2.4.1 Show that an optimal affine (linear) algorithm for a functional S exists if the set

    A = { (y, S(f)) | y in N(f), f in E }

is convex (convex and balanced).

E 2.4.2 (Magaril-Il'yaev and Osipenko, 1991) Let c(A) (cb(A)) be the smallest convex (convex and balanced) set which contains A. Show that an optimal affine (linear) algorithm exists if

    rad^wor(c(N), c(E)) = rad^wor(N; E)   ( rad^wor(cb(N), cb(E)) = rad^wor(N; E) ).
E 2.4.3 Suppose that the radius is attained for two elements h_1, h_2 in bal(E) such that N(h_1) != N(h_2). Show that then the only optimal affine algorithm is constant, phi_aff = (sup_{f in E} S(f) + inf_{f in E} S(f))/2. Use this result to show the formula for h* in Example 2.4.2.
E 2.4.4 Consider the problem of estimating a real parameter f from the interval I = [-tau, tau] in R, based on the data y such that |y - f| <= delta. Show that in this case the radius is equal to min{tau, delta}, and find the optimal affine algorithm.

E 2.4.5 Let ||N(f_1) - N(f_{-1})||_Y >= 2 delta. Show that for the interval I = [f_{-1}, f_1] we have

    rad^wor(N; I) = delta . |S(f_1) - S(f_{-1})| / ||N(f_1) - N(f_{-1})||_Y,

and the only optimal affine algorithm is given as

    phi_aff(y) = S(f_0) + ( (S(f_1) - S(f_{-1})) / ||N(f_1) - N(f_{-1})||_Y ) . < y - N(f_0), w >_Y,

where f_0 = (f_{-1} + f_1)/2 and w = (N(f_1) - N(f_{-1})) / ||N(f_1) - N(f_{-1})||_Y.

E 2.4.6 Let E be a convex set. Show that the radius

    r(delta) = sup { S(h) | h in bal(E), ||N(h)||_Y <= delta }

is a concave and subadditive function of delta.
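The scalar estimation exercise E 2.4.4 can be checked numerically. In the sketch below (all names and the brute-force grid are illustrative), the worst-case error of the linear rule phi_c(y) = c*y equals (1 - c)*tau + c*delta, so the radius min{tau, delta} is attained at c = 1 when delta < tau and at c = 0 otherwise.

```python
# Brute-force check for the problem: f in [-tau, tau], data y with
# |y - f| <= delta, linear rule phi_c(y) = c*y.

def worst_case_error(c, tau, delta, grid=101):
    """max over f in [-tau,tau] and extreme noise x = +-delta of |f - c(f+x)|."""
    err = 0.0
    for i in range(grid):
        f = -tau + 2.0 * tau * i / (grid - 1)
        for x in (-delta, delta):
            err = max(err, abs(f - c * (f + x)))
    return err

def radius(tau, delta):
    return min(tau, delta)
```

Since the error is linear in c, it is minimized at an endpoint of [0, 1]: observing is worthwhile exactly when the noise level delta is below the prior uncertainty tau.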
E 2.4.7 Why does the number d in Theorem 2.4.3 exist? For 0 < a < b < +oo, give an example where the set of all such d forms the closed interval [a, b].
E 2.4.8 (Bakhvalov, 1971) Let E be a convex and balanced set. Consider approximation of a linear functional S for f in E, based on exact linear information y = [L_1(f), ..., L_n(f)], where the functionals L_i are linearly independent on span E. Let

    r_k(x) = sup { S(h) | h in E, L_k(h) = x, L_i(h) = 0, i != k }.
Assuming that the derivatives r'_k(0) exist for all 1 <= k <= n, show that phi(y) = sum_{j=1}^n r'_j(0) y_j is the unique optimal linear algorithm.
E 2.4.9 Give an example showing that if the solution operator S is linear but not a functional, then the assertion of Theorem 2.4.2 is no longer true.
E 2.4.10 Find an example of a balanced but not convex set E such that for some linear functional S and some linear information N with noise bounded uniformly in a Hilbert norm we have rad^wor(N) = 0, but the error of any affine algorithm is infinite.

E 2.4.11 Find an optimal linear algorithm for the integration problem of Example 2.4.2, in which the function values are observed at arbitrary, not necessarily equidistant, points.
2.5 Optimality of spline algorithms

In this section we assume that S is an arbitrary linear operator and E is the unit ball in an extended seminorm ||.||_F,

    E = { f in F | ||f||_F <= 1 }.
    Gamma_alpha(f_y, y) ...

This means that f_y is an alpha-smoothing spline.

(iii) The orthogonal projection onto V is determined uniquely if ||.||_alpha is an extended norm on V. This in turn is equivalent to Gamma_alpha(f, N(f)) > 0 for f != 0, as claimed.

(iv) From (2.24) it follows that smoothing splines are linear on the subspace Y_1. That is, if s_alpha(y_1), s_alpha(y_2) are alpha-smoothing splines for y_1, y_2 in Y_1, then gamma_1 s_alpha(y_1) + gamma_2 s_alpha(y_2) is an alpha-smoothing spline for gamma_1 y_1 + gamma_2 y_2. Hence s_alpha(y) can be chosen in such a way that the mapping y -> s_alpha(y), y in R^n, is linear.

We now turn to the error of the alpha-smoothing spline algorithm.
Lemma 2.5.1 For any f in E and y in N(f),

    ||S(f) - phi_alpha(y)|| <= sqrt(1 - Gamma_alpha(y)) .
        sup { ||S(h)||  |  alpha ||h||_F^2 + (1 - alpha) delta^{-2} ||N(h)||_Y^2 <= 1 }.   (2.25)
In particular, if alpha in (0, 1) then

    e^wor(N, phi_alpha) <= c(alpha) . rad^wor(N),

where c(alpha) = max { alpha^{-1/2}, (1 - alpha)^{-1/2} }.

Proof Theorem 2.5.3(ii) yields
    ||(f, N(f)) - (0, y)||_alpha^2
      = ||(f, N(f)) - (s_alpha(y), N(s_alpha(y)))||_alpha^2 + ||(s_alpha(y), N(s_alpha(y))) - (0, y)||_alpha^2
      = alpha ||f - s_alpha(y)||_F^2 + (1 - alpha) delta^{-2} ||N(f - s_alpha(y))||_Y^2 + Gamma_alpha(y).

We also have

    ||(f, N(f)) - (0, y)||_alpha^2 = alpha ||f||_F^2 + (1 - alpha) delta^{-2} ||y - N(f)||_Y^2 <= 1.

Hence, setting h = f - s_alpha(y), we obtain

    alpha ||h||_F^2 + (1 - alpha) delta^{-2} ||N(h)||_Y^2 <= 1 - Gamma_alpha(y) ...

... Let L = { a in R^2 | <w, a> = c } be the line passing through a(x) and a(y). (Here <.,.> is the Euclidean inner product.) Since ||u||_infinity <= min{ ||a(x)||_infinity, ||a(y)||_infinity }, the line L passes through the half-lines { (t, 0) | t > 0 } and { (0, t) | t > 0 }. We consider two cases.

1. ||x - y||_0 > 0. Write x(t) = t x + (1 - t) y and u(t) = a( x(t)/||x(t)||_0 ). Since <x, y>_0 > 0, the 0-seminorm of x(t) is positive. Then { u(t) | -infinity < t < +infinity } is a continuous curve in A with lim_{t -> +-infinity} u(t) = a( (x - y)/||x - y||_0 ) in A.
Since the quadratic polynomial Q(t) = ||x(t)||_0^2 ( <w, u(t)> - c ) vanishes for t = 0, 1, the line L divides the curve into two curves that lie on opposite sides of L and join a(x) with a(y). One of them passes through [0, u].

2. ||x - y||_0 = 0. In this case ||x(t)||_0 = 1 for all t in R. Hence lim_{t -> +-infinity} ( u(t)/t^2 ) = a(x - y) != 0. Using this and the previous argument about the zeros of the polynomial Q(t), we conclude that the curve { u(t) | 0 <= t <= 1 } passes through [0, u].
We have shown that there exists an optimal linear smoothing spline algorithm provided that ||.||_F, ||.||_Y and ||.|| are all Hilbert extended seminorms. This was a consequence of the fact that for some alpha we have the equality

    ... ||h||_F <= 1, ||N(h)||_Y ...

... Then the optimal parameter alpha* is given as

    alpha* = arg min_{0 <= alpha <= 1} max { lambda | lambda in Sp(S A_alpha^{-1} S*) },

where lambda_1 >= lambda_2 >= ... >= 0 are the eigenvalues of S*S and the eta_i are the eigenvalues of N*N. (If d < +infinity then we formally set lambda_i = eta_i = 0 for i > d.) In this case, { S xi_i/||S xi_i||  |  xi_i not in ker S } is an orthonormal basis in S(F) of eigenelements of the operator S A_alpha^{-1} S*, and the corresponding eigenvalues are

    lambda_i(alpha) = lambda_i / ( alpha + delta^{-2} eta_i (1 - alpha) ),   i >= 1.
Hence, to find the optimal alpha and the radius of N, we have to minimize max_{i >= 1} lambda_i(alpha) over all alpha in [0, 1]. ... For alpha* != 1 (which holds when lambda_s > lambda_t) the parameter

    gamma = eta_s lambda_t / (lambda_s - lambda_t)

is constant. This means that the optimal algorithm is independent of the noise level, provided that delta is small. (The same algorithm is also optimal for exact information, see E 2.6.3.) This nice property is, however, not preserved in general, as shown in E 2.6.4.
2.6.2 Least squares and regularization

Consider the Hilbert case of Subsection 2.6.1 with ||.||_Y a Hilbert norm. In this case, as an approximation to S(f), one can take

    phi_ls(y) = S(u_ls(y)),

where u_ls(y) is the solution of the least squares problem. This is defined by the equation

    ||y - N(u_ls(y))||_Y = min_{f in F} ||y - N(f)||_Y,
or equivalently, N(u_ls(y)) = P_N y, where P_N is the orthogonal projection of y onto Y_N = N(F) (with respect to the inner product <.,.>_Y). This is in turn equivalent to the condition that u_ls(y) is the solution of the normal equations

    N*N f = N* y.                                                   (2.34)

Indeed, if (2.34) holds then <y - Nf, Ng>_Y = <N*y - N*Nf, g>_F = 0 for all g in F, which means that Nf = P_N y. On the other hand, since N* vanishes on the orthogonal complement of Y_N, for any solution h of the least squares problem we have N*Nh = N*P_N y = N*y, as claimed. The algorithm phi_ls is called the (generalized) least squares algorithm. For finite dimensional problems with small noise level, the least squares algorithm turns out to be optimal, as we see in the following theorem.
Theorem 2.6.2 Let dim F = dim N(F) = d < +infinity. Let g in G be such that ||g|| = 1 and

    ||S(N*N)^{-1}S* g|| = ||S(N*N)^{-1}S*||.

Let delta be chosen so that

    delta^2 < S(N*N)^{-2}S* g, g > <= ||S(N*N)^{-1}S*||.             (2.35)

Then the (generalized) least squares algorithm phi_ls is optimal. Furthermore,

    rad^wor(N) = e^wor(N, phi_ls) = delta sqrt( ||S(N*N)^{-1}S*|| ).
Proof We can assume that ||S(N*N)^{-1}S*|| > 0, since otherwise S = 0 and the theorem is trivially true. Let h = (N*N)^{-1}S* g and h^ = h/||h||_F. Observe that then ||S(h)|| = ||S(N*N)^{-1}S*|| and

    ||N(h)||_Y = < N(N*N)^{-1}S* g, N(N*N)^{-1}S* g >_Y^{1/2}
               = < S(N*N)^{-1}S* g, g >^{1/2} = ||S(N*N)^{-1}S*||^{1/2}.

Hence the condition (2.35) gives delta <= ||N(h^)||_Y, and consequently

    rad^wor(N; E) >= rad^wor(N; [-h^, h^]) = sup { ||S(f)||  |  f in [-h^, h^], ||N(f)||_Y <= delta } = delta ...

Unfortunately, in general, the least squares algorithm can be arbitrarily bad. For instance, for the one dimensional problem of Example 2.5.1 we have phi_ls(y) = y. Hence the error of phi_ls equals e^wor(N, phi_ls) = delta, while rad^wor(N) = min{a, delta}. Consequently,

    e^wor(N, phi_ls) / rad^wor(N) -> +infinity   as   a/delta -> 0.
Observe also that the solution of (2.34) is in general not unique, and therefore the least squares algorithm phi_ls is not uniquely determined.
A simple modification of the least squares algorithm relies on regularization of the normal equations (2.34). That is, instead of (2.34) we solve the 'perturbed' linear equations

    (omega I + N*N) f = N* y,                                       (2.36)

where I : F -> F is the identity operator and omega > 0 is a regularization parameter. Then the solution u_omega(y) of (2.36) exists and is unique for any y. Moreover, it turns out that for a properly chosen parameter omega the regularization algorithm S(u_omega(y)) is optimal. Indeed, we have the following fact.
Lemma 2.6.3 For 0 < alpha < 1, the alpha-smoothing spline is the regularized solution, i.e.,

    s_alpha(y) = u_omega(y) = (omega I + N*N)^{-1} N* y,   where omega = alpha (1 - alpha)^{-1} delta^2.

Equivalently, u_omega is the alpha-smoothing spline with alpha = omega (omega + delta^2)^{-1}.

Proof Let alpha in (0, 1). Define the Hilbert space F~ = F x R^n with the extended norm

    ||(f, y)||^2 = omega ||f||_F^2 + ||y||_Y^2,

where omega = alpha (1 - alpha)^{-1} delta^2. We know from Theorem 2.5.3 that s_alpha(y) is the alpha-smoothing spline if

    ||(0, y) - (s_alpha(y), N(s_alpha(y)))|| = min_{f in F} ||(0, y) - (f, N(f))||.

As in (2.34) we can show that s_alpha(y) is the solution of

    N~* N~ f = N~* y~,

where the information operator N~ : F -> F~ is defined by N~(f) = (f, N(f)) and y~ = (0, y). Since N~* y~ = N* y and N~* N~ = omega I + N*N, the lemma follows.
Thus the well known technique of regularization leads to the smoothing spline algorithms. Observe that the optimal value of the regularization parameter omega is the same as the gamma in Lemma 2.6.1.
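The regularized solution (2.36) is easy to sketch numerically. In the snippet below (matrices and data are illustrative), u_omega is computed by solving the perturbed normal equations, and the correspondence alpha = omega (omega + delta^2)^{-1} from Lemma 2.6.3 is included.

```python
# Regularization (Tikhonov) of the normal equations:
# u_omega(y) = (omega*I + N^T N)^{-1} N^T y, cf. (2.36).
import numpy as np

def regularized_solution(N, y, omega):
    n = N.shape[1]
    return np.linalg.solve(omega * np.eye(n) + N.T @ N, N.T @ y)

def alpha_from_omega(omega, delta):
    """Smoothing parameter corresponding to omega (Lemma 2.6.3)."""
    return omega / (omega + delta ** 2)
```

As omega -> 0+ the regularized solution tends to the least squares solution, while choosing omega = delta^2 gives alpha = 1/2 and hence, by the bound with c(alpha), an algorithm within factor sqrt(2) of optimal.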
Example 2.6.2 Let F = G with the complete orthonormal basis {xi_i}_{i >= 1}. Let

    S(f) = sum_{i=1}^infinity beta_i <f, xi_i>_F xi_i,

with beta_1 >= beta_2 >= ... >= 0, and let information consist of noisy evaluations of the Fourier coefficients, i.e.,

    N(f) = [<f, xi_1>_F, ..., <f, xi_n>_F]   and   N(f) + x with ||x||_2 <= delta.

... beta_1 > beta_{n+1} > 0. Then the regularization algorithm with the parameter

    omega* = beta_{n+1} / (beta_1 - beta_{n+1})
is an optimal algorithm.

Observe that for delta -> 0+ we have alpha* -> 1. However, the regularization parameter omega* is constant. This seems to contradict the intuition that omega* should decrease with delta. Clearly, we can always apply the algorithm phi_omega(y) = S(u_omega(y)) with omega = delta^2. Then omega -> 0+ as delta -> 0+ and

    e^wor(N, phi_omega) <= sqrt(2) . rad^wor(N).
2.6.3 Polynomial splines

In this subsection we recall the classical result that, for an appropriately chosen space F, polynomial splines are also alpha-smoothing splines.

Let a < b and let knots a <= t_1 < t_2 < ... < t_m <= b be given. A polynomial spline of order r >= 1 corresponding to the knots t_i is a function p : [a, b] -> R satisfying the following conditions:

(a) p belongs to Pi_{2r-1} on each subinterval [a, t_1], [t_m, b], and [t_i, t_{i+1}] for 1 <= i <= m - 1, ...
With any Hilbert space H of functions f : T -> R possessing the property that the functionals f -> f(t) are continuous, we can associate a uniquely determined r.k. R such that H = H_R is an r.k.h.s., namely,

    R(s, t) = <L_s, L_t>_H,   s, t in T,

where L_t in H is the representer of function evaluation at t. Hence there exists a one-to-one correspondence between symmetric nonnegative definite functions(1) and Hilbert spaces in which function evaluations are continuous functionals.
Example 2.6.3 Let a < b and r >= 1. Define the separable Hilbert space W^r_0(a, b) as

    W^r_0(a, b) = { f : [a, b] -> R  |  f^{(r-1)} abs. cont.,  f^{(i)}(a) = 0 for 0 <= i <= r - 1,  f^{(r)} in L_2(a, b) },

with inner product <f_1, f_2>_{W^r_0} = int_a^b f_1^{(r)}(u) f_2^{(r)}(u) du. Then W^r_0(a, b) is

(1) This means that for any n >= 1 and t_i in T, 1 <= i <= n, the matrix {R(t_i, t_j)}_{i,j=1}^n is symmetric and nonnegative definite.
an r.k.h.s. with r.k.

    R(s, t) = R_{r-1}(s, t) = int_a^b G_{r-1}(s, u) G_{r-1}(t, u) du,

where

    G_{r-1}(t, u) = (t - u)_+^{r-1} / (r - 1)!

and x_+ = max{x, 0}. Indeed, applying the formula f(t) = int_a^t f'(u) du r times we obtain

    f(t) = int_a^b ( (t - u)_+^{r-1} / (r - 1)! ) f^{(r)}(u) du = int_a^b G_{r-1}(t, u) f^{(r)}(u) du.   (2.38)
Hence f -> f(t) is a continuous functional with

    |f(t)| <= ( int_a^b |G_{r-1}(t, u)|^2 du )^{1/2} . ||f||_{W^r_0}.

Letting f = L_t (the representer of evaluation at t) in (2.38), we get that L_t^{(r)}(u) = G_{r-1}(t, u) and

    R_{r-1}(s, t) = <L_s, L_t>_{W^r_0} = int_a^b G_{r-1}(t, u) G_{r-1}(s, u) du,

as claimed. In particular, for r = 1 we have R_0(s, t) = min{s, t}.
The fact that any r.k.h.s. is determined by its r.k. R allows us to write the formulas for the alpha-smoothing spline in terms of R. That is, using Lemma 2.6.1 we immediately obtain the following theorem.
Theorem 2.6.4 Let F = H be an r.k.h.s. with r.k. R : T x T -> R. Let information y = N(f) + x, where

    N(f) = [f(t_1), f(t_2), ..., f(t_n)],

for t_i in T, 1 <= i <= n, and ||x||_2 <= delta. Define the matrix

    R_{t_1...t_n} = { R(t_i, t_j) }_{i,j=1}^n.

Then the alpha-smoothing spline is given as

    s_alpha(y) = sum_{j=1}^n z_j R(t_j, .),

where (gamma I + R_{t_1...t_n}) z = y and gamma = alpha (1 - alpha)^{-1} delta^2.
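Theorem 2.6.4 translates directly into a small computation. The sketch below uses the kernel R_0(s, t) = min{s, t} from Example 2.6.3 (the case r = 1); the nodes, data and gamma are illustrative choices.

```python
# Smoothing spline in an r.k.h.s.: s_alpha(y) = sum_j z_j R(t_j, .),
# where (gamma*I + R) z = y and R_{ij} = R(t_i, t_j).
import numpy as np

def smoothing_spline(ts, y, gamma, kernel=min):
    """Return (spline callable, coefficient vector z)."""
    R = np.array([[kernel(a, b) for b in ts] for a in ts])
    z = np.linalg.solve(gamma * np.eye(len(ts)) + R, y)
    spline = lambda t: sum(zj * kernel(tj, t) for zj, tj in zip(z, ts))
    return spline, z

ts = [0.25, 0.5, 0.75]
y = np.array([0.3, -0.1, 0.4])
s, z = smoothing_spline(ts, y, gamma=0.1)
```

A direct consequence of the formula: at the nodes the spline takes the values y - gamma z, since evaluating s_alpha at the t_i amounts to multiplying z by the kernel matrix R.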
Observe that the values of s_alpha(y) at the t_i, 1 <= i <= n, are equal to y - gamma z. Hence the spline is the function from span{ R(t_i, .) }_{i=1}^n ...

... G is compact. The set E of problem elements is the unit ball in F. The class Lambda of permissible information functionals consists of all linear functionals with norm bounded by 1,

    Lambda = { linear functionals L  |  ||L||_F = sup_{||f||_F = 1} |L(f)| <= 1 }.

We also assume that the observation noise is bounded in the weighted Euclidean norm. That is, for Delta = [delta_1, ..., delta_n] and N = [L_1, ..., L_n] in
Lambda^n, a sequence y in R^n is noisy information about f in F if

    sum_{i=1}^n delta_i^{-2} ( y_i - L_i(f) )^2 <= 1.

To cover the case where some of the L_i(f) are obtained exactly (delta_i = 0), we use above the convention that 0^{-2} a^2 = +infinity for a != 0, and 0^{-2} 0^2 = 0. Note that if all the delta_i are equal, delta_i = delta, then sum_{i=1}^n (y_i - L_i(f))^2 <= delta^2, i.e., the noise is bounded by delta in the Euclidean norm. Before stating a theorem about optimal information, we first introduce
some necessary notation. Let d = dim F <= +infinity, let {xi_i}_{i=1}^d be a complete F-orthonormal system of eigenelements of the operator S*S, and let lambda_i be the corresponding eigenvalues,

    S*S xi_i = lambda_i xi_i.

Since S is compact, we can assume without loss of generality that the lambda_i are ordered, lambda_1 >= lambda_2 >= ... >= 0. We consider the sequence {lambda_i} to be infinite by letting, if necessary, lambda_i = 0 for i > d; similarly, xi_i = 0 for i > d. Obviously, we have lim_{i -> infinity} lambda_i = 0. We also need the following important lemma.
Lemma 2.8.1 Let the nonincreasing sequences beta_1 >= beta_2 >= ... >= beta_n >= 0 and eta_1 >= eta_2 >= ... >= eta_n >= 0 be such that

    sum_{i=1}^n beta_i ... >= ... eta_{s+1}.

Proof Let U = {u_ij} in R^{(n-1)x(n-1)} be the required matrix for the sequences beta_1 >= ... >= beta_{s-1} >= beta_{s+1} >= ... >= beta_n and eta_s + eta_{s+1} - beta_s >= 0.
2.8 Optimal nonadaptive information
    a = ( gamma (eta_s - eta_{s+1}) / (beta_s - eta_{s+1}) )^{1/2},   b = (1 - a^2)^{1/2},
    c = ( (eta_s - eta_{s+1}) / (beta_s - eta_{s+1}) )^{1/2},         d = (1 - c^2)^{1/2}.
Elementary calculations show that as W we can take the matrix W = {w_ij}_{i,j=1}^n with the coefficients given as follows. For 1 <= i <= n - 1,

    w_ij = u_ij ...

... the eigenvalues lambda'_1 >= lambda'_2 >= ... of the operator S_1* S_1 : F_1 -> F_1 satisfy lambda'_i >= lambda_{n_0 + i}, for all i >= 1. Moreover, for L_i = <., xi_i>_F, 1 <= i <= n_0, we have lambda'_i = lambda_{n_0 + i}, for all i >= 1. Hence we obtain the desired result by reducing our problem to that of finding optimal N_1 in the class of (n - n_0)-observation information for approximating S_1 from data y in R^{n - n_0} satisfying ||y - N_1(f)||_{Y_1} <= 1.
Thus, to construct the optimal information N_0, we have first to solve the minimization problem (MP) and then to find the matrix W. The solution of (MP) will be given below; the matrix W can be found by following the construction from the proof of Lemma 2.8.1. Note that the optimal approximation phi_0 is given by the alpha*-smoothing spline, where alpha* comes from the solution of (MP).
We now show how to solve the problem (MP). For 0 <= alpha <= 1 and n_0 <= q <= r <= n, define the following two auxiliary problems.

Problem P_alpha(q, r): Minimize

    Omega_alpha^{qr}(eta_{q+1}, ..., eta_r) = max_{q+1 <= i <= r} lambda_i / ( alpha + (1 - alpha) eta_i )

over all eta_{q+1} >= ... >= eta_r >= 0 satisfying sum_{i=q+1}^r eta_i delta_i^{-2} = ...

Problem P(q): Minimize

    Omega_q(alpha; eta_{q+1}, ..., eta_n) = max_{q+1 <= i <= n} lambda_i / ( alpha + (1 - alpha) eta_i )

over all 0 <= alpha <= 1 and eta_{q+1} >= ... >= eta_n >= eta_{n+1} = 0 satisfying sum_{i=q+1}^n eta_i delta_i^{-2} = ...
Consider first the problem P_alpha(q, r). If alpha = 1 then Omega_alpha^{qr} = lambda_{q+1}. Let 0 <= alpha < 1. Then the solution eta* = (eta*_{q+1}, ..., eta*_r) of P_alpha(q, r) can be obtained as follows. Let gamma = gamma(alpha) = alpha/(1 - alpha), and let k = k(alpha; q, r) be the largest integer satisfying q + 1 <= k <= r and

    lambda_k >= gamma sum_{j=q+1}^k lambda_j / ( gamma (k - q) + sum_{j=q+1}^r delta_j^{-2} ).   (2.54)

Then

    eta_i* = ( ( gamma (k - q) + sum_{j=q+1}^r delta_j^{-2} ) / sum_{j=q+1}^k lambda_j ) lambda_i - gamma,   q + 1 <= i <= k,   (2.55)

and eta_i* = 0 for k + 1 <= i <= r. Furthermore,

    Omega_alpha^{qr}(eta*) = sum_{j=q+1}^k lambda_j / ( alpha (k - q) + (1 - alpha) sum_{j=q+1}^r delta_j^{-2} ).   (2.56)
We now pass to the solution of P(q). Let alpha_i, i >= q + 1, be defined in such a way that we have equality in (2.54) when k and gamma are replaced by i and gamma_i = alpha_i/(1 - alpha_i), respectively. Such an alpha_i exists only for i >= s = min{ j | lambda_j ... }. Then alpha_{q+2} >= ..., and the solution eta^i of the problem P_{alpha_i}(q, r) with r = n satisfies eta_{q+1} >= ... >= eta_n = 0. Since in addition the right hand side of (2.56) is a monotone function of alpha, we obtain

    min_{alpha, eta} Omega_q(alpha; eta) = min_{q+1 <= i <= n} Omega_q(alpha_i, eta^i).

Let the sequence n >= n_p > ... > n_0 >= 0 be (uniquely) defined by the conditions
    n_p = min { s >= n_0 | the solution of P(s) is acceptable },                    (2.59)
    n_i = min { s >= n_0 | the solution of P_{alpha*}(s, n_{i+1}) is acceptable },  (2.60)

for 0 <= i <= p - 1, where alpha* comes from the solution of P(n_p).
Theorem 2.8.2 Let p and the sequence n_0 < n_1 < ... < n_p be defined by (2.59) and (2.60). ...

... >= eta_k* >= eta_{k+1}* = 0, and the maximal value of lambda_j/(alpha* + (1 - alpha*) eta_j*), n_0 + 1 <= j <= n, is attained for j = n_p + 1. The definition of n_p yields in turn that the eta^{(p)} are the last n - n_p components of the solution of (MP) and that alpha* is optimal. This completes the proof.

As a consequence of this theorem we obtain the following corollary.
Corollary 2.8.1 Let n_p and k = k(n_p) be defined by (2.59) and (2.57), respectively. Then

    r_n^wor(Delta) = ( lambda_{k+1} + sum_{j=n_p+1}^k (lambda_j - lambda_{k+1}) / sum_{j=n_p+1}^k delta_j^{-2} )^{1/2}.

Observe that we always have sqrt(lambda_{n+1}) <= r_n^wor(Delta) <= sqrt(lambda_1). The lower bound is achieved if, for instance, the delta_i^2 are zero, i.e., if we are dealing with exact information. The upper bound, r_n^wor(Delta) = sqrt(lambda_1), is achieved if,
for instance, sum_{i=1}^n delta_i^{-2} <= 1, see also E 2.8.4. In this case the information is useless.
Let us now consider the case when all the delta_i are constant, delta_i = delta, with 0 < delta < 1. That is, noisy information satisfies

    sum_{i=1}^n ( y_i - L_i(f) )^2 <= delta^2.

Then the solution of P(0) is acceptable and k = k(0) = n. Hence the formula for the radius reduces to

    r_n^wor(Delta) = r_n^wor(delta) = ( lambda_{n+1} + (delta^2/n) sum_{j=1}^n (lambda_j - lambda_{n+1}) )^{1/2}.   (2.61)

If lambda_1 = lambda_{n+1} then r_n^wor(delta) = sqrt(lambda_1) and the zero approximation is optimal. For lambda_1 > lambda_{n+1} we have gamma* = delta^2 gamma**, where

    gamma** = n lambda_{n+1} / sum_{j=1}^n (lambda_j - lambda_{n+1}),

and the optimal eta_i are eta_i = delta^2 eta_i** with

    eta_i** = n (lambda_i - lambda_{n+1}) / sum_{j=1}^n (lambda_j - lambda_{n+1}),   1 <= i <= n.

The optimal information N_n = [<., f_1>_F, ..., <., f_n>_F] is given by Theorem 2.8.1 with the matrix W constructed for eta_i = eta_i** and beta_i = 1, 1 <= i <= n. The optimal algorithm is phi_n(y) = sum_{j=1}^n z_j S(f_j), where (gamma_n I + ...) z = y and the parameter gamma_n = gamma**. We stress that neither the optimal information nor the optimal algorithm depends on the noise level delta.
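The constant-noise formulas above can be sketched as follows, assuming (2.61) and the expressions for eta_i**; the eigenvalue sequence is an illustrative choice.

```python
# Minimal radius and optimal weights for constant observation noise.
import math

def minimal_radius(lams, n, delta):
    """(2.61): r_n(delta)^2 = lam_{n+1} + (delta^2/n) sum_{j<=n}(lam_j - lam_{n+1})."""
    tail = lams[n]                              # lambda_{n+1} (0-indexed list)
    s = sum(lams[j] - tail for j in range(n))
    return math.sqrt(tail + delta ** 2 * s / n)

def optimal_etas(lams, n):
    """eta_i** = n (lam_i - lam_{n+1}) / sum_j (lam_j - lam_{n+1}), 1 <= i <= n."""
    tail = lams[n]
    s = sum(lams[j] - tail for j in range(n))
    return [n * (lams[i] - tail) / s for i in range(n)]

lams = [4.0, 2.0, 1.5, 1.0, 1.0]                # lambda_1 >= ... >= lambda_5
```

For delta = 0 the radius collapses to sqrt(lambda_{n+1}), the exact-information value, and the weights eta_i** form a nonincreasing sequence summing to n, mirroring the noise-level independence stressed above.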
We now comment on the minimal radius r_n^wor(delta). If we fix n and let the noise level delta tend to zero, then r_n^wor(delta) approaches the minimal radius of exact information, r_n^wor(0) = sqrt(lambda_{n+1}). For r_n^wor(0) > 0 we have(1)

    r_n^wor(delta) - r_n^wor(0) ~ ( delta^2 / (2n sqrt(lambda_{n+1})) ) sum_{j=1}^n (lambda_j - lambda_{n+1}),

(1) The symbol '~' denotes here the strong equivalence of sequences: we write a_n ~ b_n if lim_{n -> infinity}(a_n/b_n) = 1.
while for r_n^wor(0) = 0 we have

    r_n^wor(delta) - r_n^wor(0) = r_n^wor(delta) = delta ( (1/n) sum_{j=1}^n lambda_j )^{1/2}.

Hence for r_n^wor(0) > 0 the convergence is quadratic, and for r_n^wor(0) = 0 it is linear in delta.

Consider now the case where the noise level delta is fixed and n -> +infinity. The formula (2.61) can be rewritten as

    r_n^wor(delta) = ( lambda_{n+1} (1 - delta^2) + (delta^2/n) sum_{j=1}^n lambda_j )^{1/2}.

The compactness of S*S implies lim_j lambda_j = 0. Hence lambda_{n+1}, as well as n^{-1} sum_{j=1}^n lambda_j, converges to zero with n, and consequently lim_{n -> infinity} r_n^wor(delta) = 0. This result should not be a surprise, since for noise bounded in the Euclidean norm we can obtain the value of any functional L at f with arbitrarily small error. Indeed, repeating observations of L(f) we obtain information y_1, ..., y_k such that sum_{j=1}^k (y_j - L(f))^2 <= delta^2. Hence for large k most of the y_j are very close to L(f), and for ||f||_F <= 1 the least squares approximation k^{-1} sum_{j=1}^k y_j converges uniformly to L(f). Observe also that, by (2.61), r_n^wor(delta) >= delta sqrt(lambda_1/n); thus for delta != 0 the radius cannot tend to zero faster than delta/sqrt(n).
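The repeated-observation remark can be made concrete. In the sketch below (values illustrative), the mean of k observations whose noise satisfies sum x_j^2 <= delta^2 deviates from L(f) by at most delta/sqrt(k), by the Cauchy-Schwarz inequality, and the bound is attained for x_j = delta/sqrt(k) for all j.

```python
# Averaging repeated noisy observations of a single functional value L(f).
import math

def mean_estimate(ys):
    return sum(ys) / len(ys)

def worst_case_mean_error(k, delta):
    """|mean - L(f)| = |sum x_j|/k <= sqrt(k) delta / k = delta/sqrt(k)."""
    return delta / math.sqrt(k)

k, delta, Lf = 100, 1.0, 0.7
ys = [Lf + delta / math.sqrt(k)] * k    # noise realization attaining the bound
```

This is exactly why, with Euclidean-bounded noise, any single functional can be learned to arbitrary accuracy at the cost of more observations.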
To see more precisely how r_n^wor(delta) depends on the eigenvalues lambda_j, suppose that

    lambda_j =x= ( ln^s j / j )^p   as   j -> +infinity,(1)

where p > 0 and s >= 0. Such behavior of the eigenvalues is typical of some multivariate problems defined in tensor product spaces; see NR 2.8.4. In this case, for delta > 0 we have

    r_n^wor(delta) =x= delta (ln^s n / n)^{p/2}        for 0 < p < 1,
    r_n^wor(delta) =x= delta (ln^{s+1} n / n)^{1/2}    for p = 1,            (2.62)
    r_n^wor(delta) =x= delta (1/n)^{1/2}               for p > 1,

(1) For two sequences, we write a_n =x= b_n if there exist constants 0 < c_1 <= c_2 < +infinity such that for all sufficiently large n, c_1 a_n <= b_n <= c_2 a_n. Such sequences are said to be weakly equivalent.
where the constants in the '=x=' notation do not depend on delta. Since

    r_n^wor(0) =x= ( ln^s n / n )^{p/2},

we conclude that for p < 1 the radius of noisy information essentially behaves as the radius of exact information, while for p > 1 the radius of noisy information essentially behaves as delta/sqrt(n). Hence, in the presence of noise, the best convergence rate is delta/sqrt(n).
2.8.2 Approximation and integration of Lipschitz functions

In this subsection we deal with noise bounded in the sup-norm. We assume that F is the space of Lipschitz functions f : [0, 1] -> R, and consider the following two solution operators on F.

Function approximation. This is defined as App : F -> C([0, 1]),

    App(f) = f,   f in F,

where C([0, 1]) is the space of continuous functions f : [0, 1] -> R with the norm

    ||f|| = ||f||_infinity = max_{0 <= t <= 1} |f(t)|.

Integration. The solution operator is given as Int : F -> R,

    Int(f) = int_0^1 f(t) dt.

The set E in F of problem elements consists of functions for which the Lipschitz constant is 1,

    E = { f : [0, 1] -> R  |  |f(t_1) - f(t_2)| <= |t_1 - t_2|, for all t_1, t_2 }.

... This observation should be contrasted to the case of noise bounded by delta in the Euclidean norm, where the radius always converges to zero as n -> +infinity, see (2.61).
Notes and remarks

NR 2.8.1 Some parts of Subsection 2.8.1 (e.g. Lemma 2.8.1) have been taken from Plaskota (1993a), where the corresponding problem in the average case setting was solved; see also Subsection 3.8.1. The other results of Section 2.8 are new.

NR 2.8.2 In the case of exact linear information, the minimal radius r_n^wor = r_n^wor(0) is closely related to the Gelfand n-widths and s-numbers from classical approximation theory. If one allows only linear algorithms, there are relations to the linear n-widths. These relations are discussed in Mathé (1990), Novak (1988), Traub and Woźniakowski (1980, Sect. 6 of Chap. 2 and Sect. 5 of Chap. 3), Traub et al. (1988, Sect. 5 of Chap. 4), and Kowalski et al. (1995). A survey of the theory of n-widths and many other references are presented in Pinkus (1985). We note that for noisy information such close relations no longer hold. Indeed, as we have seen, the minimal radius r_n^wor(delta) may for delta > 0 tend to zero arbitrarily more slowly than r_n^wor(0) (or it may not converge at all), and consequently the ratio of the n-width and r_n^wor(delta) may be arbitrary.

NR 2.8.3 For exact data, optimal nonadaptive information that uses n observations is usually called nth optimal information, and its radius the nth minimal radius; see e.g. Traub et al. (1988, Sect. 5.3 of Chap. 4). If observations are always performed with the same precision delta, the notions of nth minimal radius and nth optimal information carry over to the noisy case in a natural way.
NR 2.8.4 We now present an example of a problem for which the results of Subsection 2.8.1 can be applied. Let d >= 1 and r_i >= 1, 1 <= i <= d. Let F = W^{r_1 ... r_d}_0 be the r.k.h.s. defined in NR 2.6.7. That is, F is the r.k.h.s. of multivariate functions f : [0, 1]^d -> R with r.k. R = (x)_{i=1}^d R_{r_i}, where R_{r_i} is the r.k. of the r_i-fold Wiener measure. Define the solution operator as S : F -> G = L_2([0, 1]^d), S(f) = f. That is, we want to approximate functions in the L_2-norm. It is known (see e.g. Papageorgiou and Wasilkowski, 1990) that in this case the eigenvalues of the operator S*S satisfy

    lambda_j =x= ( ln^{k-1} j / j )^{2r},

where r = min{r_1, ..., r_d} and k is the number of indices i for which r_i = r. Observe that the exponent 2r >= 2. Hence, owing to (2.62), we have r_n^wor(delta) =x= delta/sqrt(n) for delta > 0, and r_n^wor(0) =x= (ln^{k-1} n / n)^r.

NR 2.8.5 It would be interesting to know the radius and optimal information in the Hilbert case for a restricted class Lambda of permissible functionals. For instance, for the problem of NR 2.8.4 it is natural to assume that only noisy function values can be observed. Much is known in the exact information case, see e.g. Lee and Wasilkowski (1986), Woźniakowski (1991, 1992), Wasilkowski and Woźniakowski (1995). Unfortunately, finding optimal noisy information turns out to be a very difficult problem. Some results can be obtained from the average case analysis of Chapter 3, see NR 3.8.5. Optimal information in the Hilbert case with noise bounded in the sup-norm is also unknown, even when delta_i = delta for all i.
NR 2.8.6 The fact used in (2.52) is a special case of the following well known theorem, see e.g. Marcus and Minc (1964).

Let A be a compact, self-adjoint and nonnegative definite operator acting in a separable Hilbert space X with an inner product <.,.>. Let {xi_i}_{i >= 1} be the orthonormal basis of eigenelements of A, and lambda_1 >= lambda_2 >= ... >= 0 the corresponding eigenvalues. Then for any 1 <= m <= dim X and orthonormal elements {x_i}_{i=1}^m we have

    sum_{i=1}^m <A x_i, x_i> <= sum_{i=1}^m lambda_i.               (2.67)

Indeed, since x_i = sum_{j >= 1} <x_i, xi_j> xi_j with sum_{j >= 1} <x_i, xi_j>^2 = 1, we have

    <A x_i, x_i> = < sum_{j >= 1} <x_i, xi_j> A xi_j,  sum_{k >= 1} <x_i, xi_k> xi_k >
                 = sum_{j,k >= 1} <x_i, xi_j> <x_i, xi_k> <A xi_j, xi_k>
                 = sum_{j >= 1} lambda_j <x_i, xi_j>^2.
Hence
m
m
E(Axi, xi) =
Aj (X,, j) 2
i=1 j>1
i=1
m
E
C
i=1 m
( E(Aj j=1
 Am+1)(xi,sj)2+ EA+n+l(xi,Sj)2 C
{{
j>1
m
m
E(Aj  Am+,)E(xi,Sj)2 + Am+, EE(xi,Sj)2 j=1
i=1 j>1
i=1
m
m
L(AjAm+1)+mAm+1 = EAj j=1
j=1
(for m = dim X we let Am+1 = 0), as claimed. If d = dim X < +oo then (2.67) can be equivalently written as d
d
E(Axi,xi) > Eai,
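As a quick numerical illustration of (2.67) (a sketch only; the matrix, dimensions and orthonormal system below are arbitrary choices, and NumPy is assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3

# A random symmetric, nonnegative definite operator A on R^d.
B = rng.standard_normal((d, d))
A = B @ B.T

# A random orthonormal system {x_i}_{i=1}^m: first m columns of a QR factor.
Q, _ = np.linalg.qr(rng.standard_normal((d, m)))

lhs = sum(Q[:, i] @ A @ Q[:, i] for i in range(m))
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]     # lambda_1 >= lambda_2 >= ...
rhs = eigs[:m].sum()

# (2.67): the sum of the m quadratic forms never exceeds the m largest eigenvalues.
assert lhs <= rhs + 1e-9
```

Equality is attained when the x_i are the top m eigenvectors themselves.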
Lemma 2.9.1
(i) Comp^{wor}(ε) ≥ IC^{non}(κε).
(ii) Let p ≥ 1. Suppose that there exist nonadaptive information N_ε using n(ε) observations, and a linear algorithm φ_ε, such that cost^{wor}(N_ε) ≤ p · IC^{non}(ε) and

e^{wor}(N_ε, φ_ε) ≤ ε.

Then

Comp^{wor}(ε) ≤ p · IC^{non}(ε) + (2 n(ε) − 1) g.

Proof (i) Let N^{ad} be arbitrary, in general adaptive, information with radius rad^{wor}(N^{ad}) ≤ ε. Let N^{non} be the nonadaptive information from Theorem 2.7.1 corresponding to N^{ad}. Then cost(N^{non}) ≤ cost(N^{ad}) and there is an algorithm φ such that

e^{wor}(N^{non}, φ) ≤ κ · rad^{wor}(N^{ad}) ≤ κε.

Hence Comp(ε) ≥ IC^{non}(κε).

(ii) Since the algorithm φ_ε is linear, for given information y it requires at most 2n(ε) − 1 linear operations in the space G to compute φ_ε(y), see Example 2.9.1. Hence

comp(N_ε, φ_ε) ≤ cost(N_ε) + (2 n(ε) − 1) g ≤ p · IC^{non}(ε) + (2 n(ε) − 1) g,

as claimed. □
Lemma 2.9.1 immediately gives the following theorem, which is the major result of this section.
Theorem 2.9.1 Suppose that the assumptions of Lemma 2.9.1 are fulfilled for all ε and some p independent of ε. If, in addition,

IC^{non}(ε) = O(IC^{non}(κε)) and n(ε) = O(IC^{non}(ε)),

then

Comp^{wor}(ε) ≍ IC^{non}(ε), as ε → 0+.¹

Hence, for problems satisfying the assumptions of Theorem 2.9.1, the ε-complexity, Comp(ε), is essentially equal to the information ε-complexity IC^{non}(ε). Note that the condition IC^{non}(ε) = O(IC^{non}(κε)) means that IC^{non}(ε) does not increase too fast as ε → 0. It is satisfied if, for instance, IC^{non}(ε) behaves polynomially in 1/ε. The condition n(ε) = O(IC^{non}(ε)) holds if the information operators N_ε use observations with costs bounded uniformly from below by a positive constant. Indeed, if c(δ) ≥ c_0 > 0, ∀δ ≥ 0, then n(ε) ≤ IC^{non}(ε)/c_0.
It turns out that the information complexity is closely related to the minimal radius of information. More precisely, let

R(T) = inf{ r^{wor}(N; Δ) | Δ = [δ_1, ..., δ_n], ∑_{i=1}^n c(δ_i) ≤ T, n ≥ 1 }

be the Tth minimal radius of nonadaptive information. Then we have the following relation.
Lemma 2.9.2 Suppose that the function

R⁻¹(ε) = inf{ T | R(T) ≤ ε }

is semicontinuous. That is, there exist a continuous function ψ and constants 0 < α ≤ β < +∞ such that for small ε, 0 < ε ≤ ε_0,

α ψ(ε) ≤ R⁻¹(ε) ≤ β ψ(ε).

Then

R⁻¹(ε) ≤ IC^{non}(ε) ≤ (β/α) · R⁻¹(ε), ∀ 0 < ε ≤ ε_0.

Proof For β' > 0, take information N_{β'} and an algorithm φ_{β'} such that e^{wor}(N_{β'}, φ_{β'}) ≤ ε and cost(N_{β'}) ≤ IC^{non}(ε) + β'. Then R⁻¹(ε) ≤ IC^{non}(ε) + β'. Since this holds for arbitrary β', we obtain R⁻¹(ε) ≤ IC^{non}(ε).

We now show the second inequality. Let 0 < a < ε and β' > 0. Then R(R⁻¹(ε − a) + β') ≤ ε − a, and there are information N_{β'} and an algorithm φ_{β'} such that cost(N_{β'}) ≤ R⁻¹(ε − a) + β' and e^{wor}(N_{β'}, φ_{β'}) ≤ ε. Then IC^{non}(ε) ≤ cost(N_{β'}). Since β' can be arbitrarily small, we get IC^{non}(ε) ≤ R⁻¹(ε − a). Letting a → 0+ and using continuity of ψ we finally obtain IC^{non}(ε) ≤ β ψ(ε) ≤ (β/α) · R⁻¹(ε). □

¹ For two functions, a(ε) ≍ b(ε) as ε → 0+ means the weak equivalence of functions. That is, there exist ε_0 > 0 and 0 < K_1 ≤ K_2 < +∞ such that K_1 ≤ a(ε)/b(ε) ≤ K_2 for all ε ≤ ε_0.

If c_fix(δ) = c_0 for δ ≥ δ_0 and c_fix(δ) = +∞ for δ < δ_0, where δ_0 > 0 and c_0 = c(δ_0), then IC^{non}(c; ε) ≤ IC^{non}(c_fix; ε). On the other hand, if the cost function c is bounded from below by a positive constant c_0, the information ε-complexity is not smaller than IC(ε) of the same problem with exact information, where the cost function c_exa ≡ c_0.
Notes and remarks

NR 2.9.1 In the case of exact information or information with fixed noise level, our model of computation corresponds to that of Traub et al. (1983, Chap. 5, and 1988, Chap. 3). As far as we know, the model with cost dependent on the noise level was first studied in Kacewicz and Plaskota (1990).
NR 2.9.2 Some researchers define a model of computation using the concept of a machine. The most widely known is the Universal Turing Machine
which can be used to study discrete problems. Another example is the Unlimited Register Machine (URM) discussed in Cutland (1980). Machines and complexity over the reals are presented in Blum et al. (1989, 1995). Recently, Novak (1995b) generalized these concepts and defined a machine that can be used to study the complexity of problems with partial information. For other models, see also Ko (1986, 1991), Schönhage (1986) and Weihrauch (1987).
NR 2.9.3 The choice of a model of computation and, in particular, the set of primitive operations is in general a delicate question. We believe that this choice should depend on the class of problems we want to solve. For instance, for discrete problems the Turing machine seems to be a good choice. For problems defined on the reals it seems that a real number model should be used. We study problems defined on linear spaces. Hence to obtain nontrivial results we have to assume that at least linear operations on the space G are possible.
NR 2.9.4 In view of the previous remark, we can say that we use the basic version of the model of computation over linear spaces. However, natural generalizations of this model are sometimes possible. We now give one example.
Suppose that G is a Cartesian product of some other linear spaces over ℝ, G = G_1 × G_2 × ... × G_s. For instance, G = ℝ² = ℝ × ℝ. Or, if G is a space of functions g : ℝ^d → ℝ^s, then any element g ∈ G can be represented as g(x) = (g_1(x), ..., g_s(x)) with g_i : ℝ^d → ℝ. In these cases it is natural to assume that we can perform linear operations over each 'coordinate' G_i. Clearly, we would then also be able to perform linear operations over G (via the identification g = (g_1, ..., g_s) ∈ G, g_i ∈ G_i), with cost equal to the sum of the costs of the linear operations over the G_i. For our purposes any such 'generalization' is, however, not necessary. As will turn out, for the problems considered in this monograph the complexity essentially equals the information complexity. Hence using a 'more powerful' model will not lead to appreciably different results.

NR 2.9.5 The assumption that we can use arbitrary elements of ℝ or G corresponds in practice to the fact that precomputation is possible. This may sometimes be too idealized an assumption. For instance, even if we know theoretically that the optimal algorithm is linear, φ_opt(y) = ∑_{i=1}^n y_i g_i, the elements g_i may sometimes not be known exactly, or may be very difficult to precompute. We believe, however, that such examples are exceptional. We also stress that the 'precomputed' elements can depend on ε. One may assume that precomputation is independent of ε. This leads to another, also useful, model, in which one wants to have a 'good' single program which allows one to produce an ε-approximation to S(f) for any ε > 0. Some examples of this can be found in Kowalski (1989), Novak (1995b), and Paskov (1993).

NR 2.9.6 Clearly, our model also assumes other idealizations. One of them is that the cost of observing a noisy value of a functional depends on the noise level only. That is, we neglect the dependence on the element for which information is obtained. Errors that may occur when the value of φ(y) is computed are also neglected.

NR 2.9.7 One can argue that the assumption that linear operations over G are allowed is not very realistic. In practice digital computers are usually
used to perform calculations, and they can only manipulate bits. This is certainly true. On the other hand, computers have been successfully used for solving some very complicated problems including, in particular, continuous and high dimensional problems which require at least computations over the reals. This paradox is possible only because computer arithmetic (which is in fact discrete) can imitate computations in the real number model very well. Similarly, by using an appropriate discrete model, we can make computations over an arbitrary linear space G possible. This point can also be expressed as follows. Even if it is true that the real world is discrete in nature, it is often more convenient (and simpler!) to use a continuous model to describe, study and understand some natural phenomena. We believe that the same applies to scientific computations.
NR 2.9.8 We consider a sequential model of computation, where only one instruction can be performed at each step. It would also be interesting to study a parallel model, see, e.g., Heinrich and Kern (1991), Kacewicz (1990), Nemirovski (1994).
NR 2.9.9 We note that the assertion of Theorem 2.9.1 does not always hold if all the assumptions are not satisfied. That is, there are linear problems for which the ε-complexity is much larger than the information ε-complexity (or even infinite), see e.g. Wasilkowski and Wozniakowski (1993).
NR 2.9.10 Information about the programming language Pascal can be found, e.g., in Jensen and Wirth (1975).
Exercises

E 2.9.1 Show that if the conditional and repetitive statements were not allowed then only algorithms using nonadaptive information would be realizable and the cost of computing φ(y) would be independent of y.
E 2.9.2 Give an example of an algorithm φ and information N that cannot be realized.
E 2.9.3 Let N be an information operator with Y = ℝⁿ, and let φ be an algorithm of the form

φ(y) = ∑_{i=1}^n q_i(y) g_i,

where g_i ∈ G and the q_i are some real rational functions of the n variables y_1, ..., y_n. Show that then there exists a realization of φ using N.

E 2.9.4 Let φ_1 and φ_2 be two algorithms that use the same information N. Show that if φ_2 = A ∘ φ_1, where A : G → G is a linear transformation, then comp(N, φ_2) ≤ comp(N, φ_1). If, in addition, A is one-to-one then comp(N, φ_1) = comp(N, φ_2).
E 2.9.5 Give an example of a problem for which optimal information is nonadaptive and the upper bound in Lemma 2.9.1 is not sharp, i.e., Comp(ε) < IC^{non}(ε) + (2n(ε) − 1) g.
E 2.9.6 Prove that

lim_{a→0+} R⁻¹(ε − a) ≥ IC^{non}(ε) ≥ R⁻¹(ε).
2.10 Complexity of special problems

In this section, we derive the ε-complexity for several classes of problems. To this end, we use the general bounds given in the previous section. Special attention will be devoted to the dependence of Comp(ε) on the cost function.

2.10.1 Linear problems in Hilbert spaces

We begin with the problem defined in Subsection 2.8.1. That is, we assume that S is a compact operator acting between separable Hilbert spaces F and G, and E is the unit ball of F. The class Λ of permissible information functionals consists of all continuous linear functionals with norm bounded by 1. The noise x satisfies ∑_{i=1}^n (x_i/δ_i)² ≤ 1, where n is the length of x and [δ_1, ..., δ_n] is the precision vector used. In this case, it is convenient to introduce the function

c̃(x) = c(x^{−1/2}), 0 < x < +∞,

where c is the cost function. We assume that c̃ is concave or convex. We first show how in general the Tth minimal radius can be evaluated. As in Subsection 2.8.1, we denote by ξ_j, j ≥ 1, the orthonormal basis of eigenvectors of the operator S*S, and by λ_j the corresponding eigenvalues, λ_1 ≥ λ_2 ≥ .... We shall also use the function Ω which was defined in Subsection 2.8.1,
Ω = Ω(η_1, ..., η_n; λ_1, ..., λ_n),

with η_i ≥ 0, 1 ≤ i ≤ n, satisfying (a1) for concave c̃, and (1/n) ∑_{j=1}^n λ_j ≤ ε².
It turns out that the cost function c_lin is in some sense the 'worst' possible. That is, we have the following fact.

Lemma 2.10.2 Let c be an arbitrary cost function. Let δ_0 be such that c(δ_0) < +∞. Then ∀ε > 0,

Comp^{wor}(c; ε) ≤ M · Comp^{wor}(c_lin; ε),

where M = M(c, δ_0) = ⌈2δ_0²⌉ (c(δ_0) + 2g).
Proof Since R(c_lin; T)² ≥ λ_1 / max{1, T}, we have

IC^{non}(c_lin; ε) = 0 for ε ≥ √λ_1, and IC^{non}(c_lin; ε) ≥ 1 for ε < √λ_1.

In the first case, zero is the best approximation and the lemma is true. Let ε < √λ_1. Let N be such that rad^{wor}(N) = ε and cost(c_lin; N) = IC^{non}(c_lin; ε). We can assume that N uses n = ⌊IC^{non}(c_lin; ε)⌋ observations with the same precision δ satisfying δ^{−2} = IC^{non}(c_lin; ε)/n. Let k = k(δ_0) = ⌈2δ_0²⌉. Consider the information operator Ñ which repeats, k times, observations of the same functionals as in N, but with precisions δ̄ where δ² = δ̄²/k. We obviously have rad^{wor}(Ñ) = rad^{wor}(N) and

cost(c; Ñ) = k n c(δ̄) ≤ k n c(δ_0) ≤ k c(δ_0) IC^{non}(c_lin; ε).

Since the optimal algorithm φ for information Ñ is linear, we finally obtain

Comp^{wor}(c; ε) ≤ comp(c; Ñ, φ) ≤ cost(c; Ñ) + (2kn − 1) g ≤ M · IC^{non}(c_lin; ε) ≤ M · Comp^{wor}(c_lin; ε). □

¹ a(ε) ≈ b(ε) means the strong equivalence of functions, i.e., lim_{ε→0+} a(ε)/b(ε) = 1 (0/0 = ∞/∞ = 1).

Consider now the fixed cost function given by c_fix(δ) = c_0 for δ ≥ δ_0, and c_fix(δ) = +∞ for δ < δ_0, with c_0, δ_0 > 0. Since c_fix(δ) ≥ c_0 δ_0² c_lin(δ), we have IC^{non}(c_fix; ε) ≥ c_0 δ_0² IC^{non}(c_lin; ε). Hence c_fix is also the 'worst' cost function and Comp(c_fix; ε) ≍ Comp(c_lin; ε).
We now consider the exact information case where the cost function is constant, e.g., c_exa ≡ 1. In this case r_n^{wor}(0) = √λ_{n+1}. Hence R(c_exa; T) = √λ_{n+1}, where n = n(T) = ⌊T⌋, and

Comp(c_exa; ε) ≍ min{ n ≥ 1 | λ_{n+1} ≤ ε² }.

Clearly, Comp(c_exa; ε) gives a lower bound for the complexity corresponding to a cost function that is bounded from below by a positive constant. That is, if c(δ) ≥ c_0 > 0 for all δ ≥ 0, then Comp(c; ε) ≥ c_0 · Comp(c_exa; ε).
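In the exact information case the quantity min{ n ≥ 1 | λ_{n+1} ≤ ε² } is straightforward to compute. A small sketch, with a hypothetical eigenvalue sequence λ_j = j^{−2}:

```python
def comp_exact(eps, lam):
    """Smallest n >= 1 with lam(n + 1) <= eps**2 (lam is 1-based)."""
    n = 1
    while lam(n + 1) > eps ** 2:
        n += 1
    return n

lam = lambda j: j ** (-2.0)   # hypothetical eigenvalues lambda_j = j^(-2)
n = comp_exact(0.1, lam)      # (n + 1)^(-2) <= 0.01 first holds at n = 9
```

For polynomially decaying eigenvalues λ_j ≍ j^{−p} this gives Comp(c_exa; ε) ≍ ε^{−2/p}.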
Let us see more precisely how the complexity depends on the cost function c and the eigenvalues λ_j. Consider

c_q(δ) = (1 + δ^{−2})^q for δ > 0, and c_q(0) = +∞,

where q ≥ 0. Note that for q ≥ 1 the function c̃_q is convex, while for 0 < q ≤ 1 it is concave. The case q = 0 corresponds to exact information. Since for all q we have Comp(q; ε) = O(Comp(1; ε)), we can restrict ourselves to 0 < q ≤ 1.

To obtain the formula for the Tth minimal radius we set, for simplicity, α = 1/2 in the α-smoothing spline algorithm and use Lemma 2.10.1. The minimum over η_1 ≥ ... ≥ η_n ≥ η_{n+1} = 0 such that ∑_{i=1}^n (1 + η_i)^q ≤ T is attained at

η_j = T^{1/q} λ_j ( ∑_{i=1}^n λ_i^q )^{−1/q} − 1, 1 ≤ j ≤ n,

where n is the largest integer satisfying ∑_{j=1}^n λ_j^q ≤ λ_n^q T.
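This minimizer can be checked for internal consistency: with (1 + η_j)^q = T λ_j^q / ∑_{i≤n} λ_i^q the budget constraint is met exactly, and the choice of n guarantees η_n ≥ 0. A sketch (the closed forms for η_j and n are as reconstructed in the surrounding text, so treat this as illustrative only):

```python
import numpy as np

def eta_waterfill(lam, q, T):
    """Return n and the minimizing eta_1 >= ... >= eta_n >= 0 for budget T."""
    lam = np.asarray(lam, dtype=float)
    # n = largest integer with sum_{j<=n} lam_j^q <= lam_n^q * T
    n, s = 0, 0.0
    for j, l in enumerate(lam, start=1):
        if s + l ** q <= (l ** q) * T:
            s += l ** q
            n = j
        else:
            break
    eta = (T * lam[:n] ** q / s) ** (1.0 / q) - 1.0
    return n, eta

lam = [1.0 / j ** 2 for j in range(1, 200)]
n, eta = eta_waterfill(lam, q=0.5, T=100.0)

assert abs(float(np.sum((1.0 + eta) ** 0.5)) - 100.0) < 1e-8   # budget is tight
assert np.all(np.diff(eta) <= 1e-12) and eta[-1] >= -1e-12     # monotone, >= 0
```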
Furthermore,

R(q; T)² = b T^{−1/q} ( ∑_{i=1}^n λ_i^q )^{1/q},

where 1/2 ≤ b ≤ 1. Assume now that the eigenvalues of the operator S*S satisfy

λ_j ≍ ( ln^s j / j )^p

with p > 0 and s ≥ 0. As we know, this corresponds to function approximation in tensor product spaces, see NR 2.8.4. In this case we have

R(q, p, s; T) ≍ R(1, pq, s; T)^{1/q}.

The formulas for R(1, pq, s; T) can be derived based on (2.70). We obtain
that for all q

R(q, p, s; T)² ≍ (1/T)^{1/q} for pq > 1, ( ln^{s+1} T / T )^p for pq = 1, and ( ln^s T / T )^p for 0 < pq < 1.

Comp^{wor}_{abs}(ε) ≥ n*(κε) c(Kκε).
0 n* (rc e) c(K K e).
Proof We first show that ICa (e) > n* (e) c(K e).
(2.74)
If E > IISIIF then the zero approximation is optimal. Hence n*(e) = 0 and (2.74) follows.
Let e < IISIIF Let N = IN, A} where N = [Ll, ... , Ln] and A =
2.10 Complexity of special problems
111
[61'...'64 0 < Si < _< Sn, be an arbitrary information operator with radius radabs (N) < e. Let k = max { i < n I Si < a Kdiamab8 (N) } . (If Si > Kdiamabs(N)/2 then k = 0.) We claim that k > n* (E).
(2.75)
To show this, it suffices that for information N' = {N', 0'} where N' = [L1,. .. , Lk] and A' = [Sl, ... , Sk] (or for information N'  {0} if k = 0), we have diamabs(N') < 2e. Indeed, suppose to the contrary that diamabs (N') > 2 E. Then there is h E F such that IIhtIF < 1, IL2(h)I < Si, 1 < i < k, and 2IIS(h)II > 2e > diamab. (N). Let h'
=min { 1,
K (IS(h)II 1
.
h.
Then 11 W11 F < 1, and for all k + 1 < j < n the following holds:
min{KIIS(h)II,S;} = S.
2IIS(h')II 2 min{ 1,
Sk+i
K IIS(h)II
} IIS(h)II
min {2IIS(h)II, 2Lk+ } > diamabs(N), which is a contradiction. Hence diamab8 (ICY') < 2 e. Since the information N' uses k observations, (2.75) follows.
Observe that (2.75) also yields k > 1. Hence we have n
cost(N)
k
_ Ec(Si) > Ec(Si) > k c (2Kdiam(N)) i=1
i=1
> kc(KE) > n* (E) c(KE). Since N was arbitrary, the proof of (2.74) is complete. Now Lemma 2.9.1 together with (2.74) yields Compab8(E) > IC,on (ICE) > n*(ICE) . C(/CE)
which proves the lemma.
To show an upper bound, we assume that for any n ≥ 1 and δ > 0, there exist information N that uses n observations with the precision vector Δ = [δ, ..., δ], and a linear algorithm φ, satisfying the error bound (2.76). Moreover, we assume that for all a > 0,

n*(aε) ≍ n*(ε) and c(aε) ≍ c(ε), as ε → 0+.

Theorem 2.10.4 If the conditions (2.73) and (2.76) are satisfied and a κ-hard element exists, then

Comp^{wor}_{abs}(ε) ≍ n*(ε) c(ε).

Furthermore, optimal information uses n ≍ n*(ε) observations with the same precision δ ≍ ε.

This theorem has a very useful interpretation. It states that the ε-complexity is proportional to the cost c(ε) of obtaining the value of a functional with precision ε, and to the complexity in the exact information case with c ≡ 1. We stress that Theorem 2.10.4 applies only to
problems for which c(ε) and n*(ε) tend to infinity at most polynomially in 1/ε, as ε → 0+. It also seems worthwhile to mention that we obtained the complexity results without knowing exact formulas for the minimal radii r_n^{wor}(Δ).

We now pass to the case of relative noise. We assume that all functionals L ∈ Λ satisfy

||L||_F ≤ 1.   (2.77)
Theorem 2.10.5 Suppose that the assumptions of Theorem 2.10.4 and the condition (2.77) are satisfied. Then

Comp^{wor}_{rel}(ε) ≍ Comp^{wor}_{abs}(ε) ≍ n*(ε) c(ε).

Proof Observe first that if |y_i − L_i(f)| ≤ δ_i |L_i(f)|, then |y_i − L_i(f)| ≤ δ_i ||L_i||_F ||f||_F ≤ δ_i. This means that, for any information N and an element f with ||f||_F ≤ 1, information with relative noise is also information with absolute noise. Furthermore, one obtains

rad^{wor}_{rel}(N) ≥ min{ 1 − ||a h_0||_F, aA/2 } rad^{wor}_{abs}(N).

Taking a = (1 + A/2)^{−1} we obtain

rad^{wor}_{rel}(N) ≥ ( A/(2(A + 2)) ) rad^{wor}_{abs}(N).

Hence for any B ≥ 2(A + 2)/A we have

Comp_{rel}(ε) ≥ IC^{non}_{rel}(ε) ≥ IC^{non}_{abs}(Bε) ≍ IC^{non}_{abs}(ε) ≍ Comp_{abs}(ε),

which shows the lower bound for Comp_{rel}(ε) and completes the proof. □
Thus, under some assumptions, the cases of relative and absolute noise are (almost) equivalent. We note that such an equivalence does not always hold. For instance, for the problems App and Int of Subsection 2.10.2 and information N using n observations of function values with precision δ ∈ (0, 1), we have rad^{wor}_{rel}(N) = +∞. Indeed, the vector y = [a, ..., a] (n times), a > 0, is noisy information about f_1 ≡ a/(1 − δ) and f_{−1} ≡ a/(1 + δ). We also have f_1, f_{−1} ∈ E. Hence, for S ∈ {App, Int},

rad^{wor}_{rel}(N) ≥ (1/2) ||S(f_1) − S(f_{−1})|| = ( δ/(1 − δ²) ) a → +∞ as a → +∞.

(See also E 2.10.8.)
Multivariate approximation

We now apply the results obtained above to a concrete problem. We consider approximation of multivariate functions from noisy data. Let F = F^s be the space of all real valued functions defined on the s dimensional unit cube D = [0,1]^s that possess all continuous partial derivatives of order r, r ≥ 1. The norm in F^s is given as ||f||_F = ... (r − 1)^s, where M is independent of n and δ. Assume first that r ≥ 2. Let n ≥ r^s. Let k ≥ 1 be the largest integer such that (k(r − 1) + 1)^s = m ≤ n. Information N consists of function evaluations at m equispaced points, i.e., N(f) = { f(t) }_{t∈K}, where K = { t ∈ ...
Exercises

E 2.10.1 Suppose that there exists x_0 > 0 such that x c(x) ≥ d, ∀x ≥ x_0. Prove that then

IC^{non}(c; ε) ≥ (d/x_0) · IC^{non}(c_0; √(1 + 1/x_0) ε), ∀ε > 0.

Show also that if the condition is not satisfied then IC^{non}(c; ε) = 0, ∀ε > 0.
E 2.10.2 Let the cost function c_ln(δ) = ln(1 + δ^{−2}). Prove that then in the Hilbert case we have

R(T)² = d ( ∏_{j=1}^n λ_j )^{1/n} e^{−T/n},

where n = n(T) is the largest integer for which λ_n ≥ ( ∏_{j=1}^n λ_j )^{1/n} e^{−T/n}, and 1/2 ≤ d ≤ 1.

E 2.10.3 ... while for λ_j = e^{−j} we have Comp(c_ln; ε) ≍ ( ln(1/ε) )² and Comp(c_0; ε) ≍ ln(1/ε).

E 2.10.4 Show that the equivalence IC^{non}(App; 2ε) ≍ IC^{non}(App; ε) holds if c(δ) tends to infinity at most polynomially in 1/δ, as δ → 0.

E 2.10.5 Show that Theorem 2.10.4 can be applied for the problems App and Int.
E 2.10.6 (Kacewicz and Plaskota, 1990) Let F = F^s be the space defined as in the multivariate approximation problem. Let Λ be the class of functionals L : F → ℝ of the form

L(f) = ∂^i f/(∂x_1^{k_1} ... ∂x_s^{k_s})(t), for some t ∈ [0,1]^s,

for any integers k_1, ..., k_s and i such that 0 ≤ k_1 + ... + k_s = i ≤ k, where 0 ≤ k ≤ r. Show that

sup_{||f||_F ≤ 1} inf_{L∈Λ} |L(f)| ≥ e^{−min{s,k}} ( min{1, k/s} )^k.
For p ≥ 1, define F = { f ∈ ℝ^∞ | ||f||_p < +∞ }, where ||f||_p = ( ∑_i |f_i|^p )^{1/p}, f = [f_1, f_2, ...] ∈ ℝ^∞. For ||f||_p ≤ 1, we approximate values S(f) of the operator S : F → F,

S(f) = [a_1 f_1, a_2 f_2, ...], where a_i = 2^{(1−i)/p}.

Exact information is given as N_n(f) = [f_1, f_2, ..., f_n]. Show that

sup_{||f||_p ≤ 1} min_{1 ≤ i ≤ n} |f_i| = n^{−1/p}.

Example 3.2.1 Let λ > 0 and σ² ≥ 0. In this case, F = G = ℝ and S(f) = f. The measure µ on F is defined as
µ(B) = (2πλ)^{−1/2} ∫_B e^{−x²/(2λ)} dx, ∀ Borel sets B of ℝ.
Furthermore, the set Y = ℝ and the measures π_f = N(f, σ²). That is, for σ > 0 we have

π_f(B) = (2πσ²)^{−1/2} ∫_B e^{−(y−f)²/(2σ²)} dy,

while for σ = 0 we have π_f(B) = 0 if f ∉ B, and π_f(B) = 1 otherwise. Hence for σ = 0 information is exact, while for σ > 0 it is noisy.
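For this one dimensional example everything is available in closed form: among linear estimators a·y of f, the average squared error is (1 − a)²λ + a²σ², minimized at a* = λ/(λ + σ²) with minimal value λσ²/(λ + σ²). These are standard conjugate-Gaussian facts rather than formulas quoted from the text; the sketch below just confirms them numerically:

```python
import numpy as np

lam, sigma2 = 2.0, 0.5

def avg_sq_error(a):
    # E (f - a*y)^2 with f ~ N(0, lam), y = f + x, x ~ N(0, sigma2).
    return (1 - a) ** 2 * lam + a ** 2 * sigma2

a_grid = np.linspace(0.0, 1.0, 100001)
a_best = a_grid[np.argmin(avg_sq_error(a_grid))]

a_star = lam / (lam + sigma2)          # coefficient of the conditional mean
assert abs(a_best - a_star) < 1e-4
assert abs(avg_sq_error(a_star) - lam * sigma2 / (lam + sigma2)) < 1e-12
```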
Example 3.2.2 Suppose we wish to approximate the value of the integral S(f) = ∫_0^1 f(t) dt of a continuous function f : [0,1] → ℝ. In this case, as µ we take the classical Wiener measure, µ = w. Recall that w is defined on the σ-field of Borel sets of the space

F = { f : [0,1] → ℝ | f continuous, f(0) = 0 },

with the sup norm, ||f|| = sup_{x∈[0,1]} |f(x)|. It is uniquely determined by the following condition. Let n ≥ 1 and B be a Borel set of ℝⁿ. Let B_{t_1...t_n} = { f ∈ F | (f(t_1), ..., f(t_n)) ∈ B }, where 0 < t_1 < t_2 < ... < t_n ≤ 1. Then

w(B_{t_1...t_n}) = ( (2π)ⁿ t_1 (t_2 − t_1) ... (t_n − t_{n−1}) )^{−1/2}
  × ∫_B exp{ −(1/2) ( x_1²/t_1 + (x_2 − x_1)²/(t_2 − t_1) + ... + (x_n − x_{n−1})²/(t_n − t_{n−1}) ) } dx_1 dx_2 ... dx_n.
Information about f may be given by independent noisy observations of f at n points. That is, in the ith observation we obtain y_i = f(t_i) + x_i, where x_i ~ N(0, σ²), 1 ≤ i ≤ n, and σ > 0. This corresponds to Y = ℝⁿ and

π_f(B) = (2πσ²)^{−n/2} ∫_B exp{ −(1/(2σ²)) ∑_{i=1}^n (y_i − f(t_i))² } dy_1 ... dy_n.
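The finite dimensional distributions above are easy to sample: a Wiener path on a grid is a cumulative sum of independent Gaussian increments, and the noisy information adds independent N(0, σ²) perturbations. A sketch (grid, σ and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 4, 0.1
t = np.array([0.2, 0.4, 0.6, 0.8])

n_paths = 50_000
# Increments f(t_i) - f(t_{i-1}) are independent N(0, t_i - t_{i-1}).
dt = np.diff(np.concatenate(([0.0], t)))
f = np.cumsum(rng.standard_normal((n_paths, n)) * np.sqrt(dt), axis=1)

y = f + sigma * rng.standard_normal((n_paths, n))    # noisy observations

# Sanity check of the Wiener covariance E f(s) f(t) = min(s, t):
cov = np.mean(f[:, 0] * f[:, 2])                     # s = 0.2, t = 0.6
assert abs(cov - 0.2) < 0.02
assert abs(np.var(y[:, 3]) - (0.8 + sigma ** 2)) < 0.05
```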
Let III = {π_f} be a given information distribution. The average case error of an algorithm φ that uses information y ~ π_f is given as

e^{ave}(III, φ) = ( ∫_F ∫_Y ||S(f) − φ(y)||² π_f(dy) µ(df) )^{1/2}.

That is, we average with respect to the joint distribution (measure) µ̃ on the product space F × Y,

µ̃(B) = ∫_F π_f(B_f) µ(df), ∀ Borel sets B of F × Y,

where B_f = { y ∈ Y | (f, y) ∈ B }. In order that the error be well defined, we assume measurability of φ with respect to the measure µ_1 defined by (3.1) below. (Note that by slightly redefining the error, this assumption can be relaxed, see NR 3.2.2.) An algorithm φ_opt is optimal if

e^{ave}(III, φ_opt) = inf_φ e^{ave}(III, φ).
Let µ_1 be the a priori distribution of information values y on Y,

µ_1(B) = ∫_F π_f(B) µ(df), ∀ Borel sets B of Y.   (3.1)

We assume that there exists a unique (up to a set of µ_1-measure zero) family of probability measures µ_2(·|y), y ∈ Y, that satisfy the following conditions:

(i) µ_2(·|y) are for a.e. y probability measures on the σ-field of F,
(ii) the maps y ↦ µ_2(B|y) are µ_1-measurable for all measurable sets B ⊂ F, and
(iii) µ̃(B) = ∫_Y µ_2(B_y|y) µ_1(dy), for any Borel set B ⊂ F × Y, where B_y = { f ∈ F | (f, y) ∈ B }.

Such a family is called a regular conditional probability distribution. It exists under some mild assumptions, e.g., if F is a separable Banach space and Y = ℝⁿ; see NR 3.2.3. We interpret µ_2(·|y) as the a posteriori (or conditional) distribution on F, after information y has been observed.

The most important will be property (iii). It says that the joint measure µ̃ can be equivalently defined by the right hand side of the first equality in (iii). Hence the error of an algorithm φ can be rewritten as

e^{ave}(III, φ) = ( ∫_Y ∫_F ||S(f) − φ(y)||² µ_2(df|y) µ_1(dy) )^{1/2}.   (3.2)

For a probability measure ω on G, let

r(ω) = inf_{a∈G} ( ∫_G ||g − a||² ω(dg) )^{1/2}.

We call r(ω) a radius of the measure ω. An element g_ω ∈ G is a center of ω iff r(ω) = ( ∫_G ||g − g_ω||² ω(dg) )^{1/2}.
Example 3.2.3 Suppose that the measure ω is centrosymmetric. That is, there exists g* ∈ G such that ω(B) = ω({ 2g* − g | g ∈ B }) holds for any measurable set B ⊂ G. Then g* is a center of ω and r(ω) = ( ∫_G ||g − g*||² ω(dg) )^{1/2}. Indeed, since

||x + y||² + ||x − y||² ≥ (1/2) ( ||x + y|| + ||x − y|| )² ≥ 2 ||x||²,

for any a ∈ G we have

∫_G ||g − a||² ω(dg) = ∫_G ||2g* − g − a||² ω(dg)
  = (1/2) ∫_G ( ||(g − g*) + (g* − a)||² + ||(g − g*) − (g* − a)||² ) ω(dg)
  ≥ ∫_G ||g − g*||² ω(dg).
For y ∈ Y, define the measures ν_2(·|y) = µ_2(S^{−1}(·)|y). That is, ν_2(·|y) is the a posteriori distribution of the elements S(f) after information y has been observed. Assuming the mapping y ↦ r(ν_2(·|y)) is µ_1-measurable, an (average) radius of the information distribution III is given as

rad^{ave}(III) = ( ∫_Y ( r(ν_2(·|y)) )² µ_1(dy) )^{1/2}.

Hence, in other words, rad^{ave}(III) is the average radius of the conditional distributions in G.
Lemma 3.2.1 If the space G is separable then the function

ψ(y) = inf_{a∈G} ∫_G ||a − g||² ν_2(dg|y), y ∈ Y,

is µ_1-measurable.

Proof It suffices to show that the set

B = { y ∈ Y | ψ(y) ≥ a }

is µ_1-measurable for any a ∈ ℝ. Let

ϕ(x, y) = ∫_G ||x − g||² ν_2(dg|y), x ∈ G, y ∈ Y.

Then ϕ is continuous with respect to x and measurable with respect to y, and ψ(y) = inf_{x∈G} ϕ(x, y). Choosing a countable set A dense in G, we obtain

B = { y ∈ Y | ∀x ∈ G, ϕ(x, y) ≥ a } = { y ∈ Y | ∀x ∈ A, ϕ(x, y) ≥ a }
  = ∩_{x∈A} { y ∈ Y | ϕ(x, y) ≥ a }.

Hence B is a countable intersection of measurable sets, which implies that B is also measurable. □
From now on we assume that the space G is separable. As we have just shown, separability of G makes the radius of information well defined. We are ready to show the main result of this section.
Theorem 3.2.1 For any information distribution III we have

inf_φ e^{ave}(III, φ) = rad^{ave}(III).

If rad^{ave}(III) < +∞ then a necessary and sufficient condition for the existence of an optimal algorithm is that for a.e. y ∈ Y there exists a center g_y of the measure ν_2(·|y). In particular, the algorithm φ_ctr(y) = g_y is then optimal.
Proof We first show that for any ε > 0 there is an algorithm with error at most rad^{ave}(III) + ε. We can assume that rad^{ave}(III) < +∞. Then the set

A = { y ∈ Y | r(ν_2(·|y)) = +∞ }

is of µ_1-measure zero. Let ϕ(x, y) be as in the proof of Lemma 3.2.1. We have already mentioned that ϕ is continuous with respect to x. We also have sup_{x∈G} ϕ(x, y) = +∞. Indeed, there exists t > 0 such that the ball B_t = { g ∈ G | ||g|| ≤ t } satisfies ν_2(B_t|y) > 0. Hence, for ||x|| ≥ t, we have

ϕ(x, y) ≥ ∫_{B_t} ||x − g||² ν_2(dg|y) ≥ ν_2(B_t|y) ( ||x|| − t )²,

and consequently ϕ(x, y) → +∞ as ||x|| → +∞. Thus, for fixed y ∈ Y \ A, the function ϕ(·, y) assumes all values from the interval [ (r(ν_2(·|y)))², +∞ ). Hence for any ε > 0 we can find an element a_y ∈ G such that

ϕ(a_y, y) = ∫_G ||g − a_y||² ν_2(dg|y) = ( r(ν_2(·|y)) )² + ε².   (3.3)

We now define φ_ε(y) = a_y for y ∈ Y \ A, and φ_ε(y) = 0 for y ∈ A. Then the algorithm φ_ε is µ_1-measurable and, using (3.2) and (3.3), we have

e^{ave}(III, φ_ε) = ( ∫_Y ( r(ν_2(·|y)) )² µ_1(dy) + ε² )^{1/2} ≤ rad^{ave}(III) + ε,

as claimed.
On the other hand, for an arbitrary algorithm φ we have

( e^{ave}(III, φ) )² = ∫_Y ∫_F ||S(f) − φ(y)||² µ_2(df|y) µ_1(dy)
  = ∫_Y ∫_G ||g − φ(y)||² ν_2(dg|y) µ_1(dy)
  ≥ ∫_Y ( r(ν_2(·|y)) )² µ_1(dy) = ( rad^{ave}(III) )²,

which proves the first part of the theorem.

Let φ be an algorithm such that e^{ave}(III, φ) = rad^{ave}(III). Let ψ_1(y) = ∫_G ||g − φ(y)||² ν_2(dg|y) and ψ_2(y) = ( r(ν_2(·|y)) )². Then ψ_1(y) ≥ ψ_2(y), ∀y ∈ Y, and ∫_Y ψ_1(y) µ_1(dy) = ∫_Y ψ_2(y) µ_1(dy). This can hold if and only if ψ_1(y) = ψ_2(y) for a.e. y. Since the last equality means that φ(y) is the center of ν_2(·|y), the proof is complete. □

Following the terminology of the worst case setting, we can call φ_ctr a central algorithm. Unlike in the worst case, in the average case setting there is not much difference between the central and the optimal algorithm. Indeed, they can differ from each other only on a set of µ_1-measure zero.
In some cases, the optimal algorithms turn out to be mean elements of conditional distributions. Recall that m_ω is the mean element of a measure ω defined on a separable Banach space G if for any continuous linear functional L : G → ℝ we have

∫_G L(g) ω(dg) = L(m_ω).

We also recall that by ν we denote the a priori distribution of S(f) ∈ G, i.e., ν = µS^{−1}.

Lemma 3.2.2 Let G be a separable Hilbert space and m(y) the mean element of the measure ν_2(·|y), y ∈ Y. Then the unique (up to a set of µ_1-measure zero) central algorithm is φ_ctr(y) = m(y) and

rad^{ave}(III) = e^{ave}(III, φ_ctr) = ( ∫_G ||g||² ν(dg) − ∫_Y ||m(y)||² µ_1(dy) )^{1/2}.

Proof For any y ∈ Y and a ∈ G we have

∫_G ||g − a||² ν_2(dg|y)
  = ||a||² − 2 (a, m(y)) + ∫_G ||g||² ν_2(dg|y)
  = ||a − m(y)||² + ∫_G ||g||² ν_2(dg|y) − ||m(y)||².

The minimum of this quantity is attained only at a = m(y). Hence φ_opt(y) = m(y), a.e. y, and

( rad^{ave}(III) )² = ( e^{ave}(III, φ_opt) )² = ∫_Y ∫_G ||g||² ν_2(dg|y) µ_1(dy) − ∫_Y ||m(y)||² µ_1(dy).

To complete the proof, observe that

∫_G ||g||² ν_2(dg|y) = ∫_F ||S(f)||² µ_2(df|y),

and consequently

∫_Y ∫_G ||g||² ν_2(dg|y) µ_1(dy) = ∫_Y ∫_F ||S(f)||² µ_2(df|y) µ_1(dy) = ∫_F ||S(f)||² µ(df) = ∫_G ||g||² ν(dg). □
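For a discrete measure on G = ℝ² the content of Lemma 3.2.2 is elementary to verify: the map a ↦ ∑_i w_i ||g_i − a||² is minimized at the mean ∑_i w_i g_i, and the minimal value equals ∑_i w_i ||g_i||² − ||∑_i w_i g_i||². A sketch (atoms and weights are arbitrary choices):

```python
import numpy as np

g = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # atoms in G = R^2
w = np.array([0.5, 0.3, 0.2])                         # probabilities

mean = w @ g

def risk(a):
    # Average squared distance of the atoms to a.
    return float(np.sum(w * np.sum((g - a) ** 2, axis=1)))

# Radius squared via the identity in Lemma 3.2.2:
r2 = float(np.sum(w * np.sum(g ** 2, axis=1)) - mean @ mean)
assert abs(risk(mean) - r2) < 1e-12

# Any perturbation of the mean does strictly worse:
for d in np.eye(2):
    assert risk(mean + 0.1 * d) > risk(mean)
```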
Notes and remarks

NR 3.2.1 Neglecting some details, the results of this section have been adopted from Traub et al. (1988, Sect. 2.3 of Chap. 6) (see also Wasilkowski (1983)), where exact information is considered. The concepts of the radius and center of a measure were introduced in Novak and Ritter (1989).

NR 3.2.2 We assume that the algorithm is a measurable mapping. One can allow arbitrary algorithms and define the error e^{ave}(III, φ) as the upper integral, as in the papers cited in NR 3.2.1 (see also Novak, 1988, where even nonmeasurable S and N are allowed). As it will turn out, for problems considered in this monograph, optimal algorithms are the same in the two cases.

NR 3.2.3 We now give a general theorem on the existence of the regular conditional probability distribution. Let X and Y be two separable Banach spaces, and ω a probability measure on Borel sets of X. Let ψ : X → Y be a measurable mapping and ω_1 = ωψ^{−1}. Then there exists a family of probability measures ω_2(·|y), y ∈ Y, such that
(i) ω_2(ψ^{−1}(y)|y) = 1, a.e. y,
(ii) for any Borel set B the mapping y ↦ ω_2(B|y) is measurable, and
(iii) ω(B) = ∫_Y ω_2(B|y) ω_1(dy).
Moreover, any other family satisfying (i)–(iii) can differ from ω_2(·|y) only on a set of ω_1-measure zero. For a proof, see Parthasarathy (1967) or Varadarajan (1961).

Observe that this theorem tells us about decomposition of the measure ω with respect to the 'exact' mapping ψ. The 'noisy' version can be derived as follows. We set X = F × Y, ω = µ̃, and ψ(f, y) = y, ∀f ∈ F, ∀y ∈ Y. Then ωψ^{−1} = µ_1. Hence there exists a family of measures µ̃_2(·|y) defined on F × Y such that the µ̃_2(·|y) are a.e. concentrated on F × {y}, the maps y ↦ µ̃_2(B|y) are measurable, and µ̃(B) = ∫_Y µ̃_2(B|y) µ_1(dy). Letting µ_2(B|y) = µ̃_2(B × {y}|y), ∀y ∈ Y, we obtain that the µ_2(·|y) are for a.e. y concentrated on F, the maps y ↦ µ_2(B|y) are measurable, and

µ̃(B) = ∫_Y µ̃_2(B|y) µ_1(dy) = ∫_Y µ_2(B_y|y) µ_1(dy),

as claimed. We also note that if F is a Banach space and Y = ℝⁿ, then F × Y is also a Banach space and the regular conditional distribution exists.

NR 3.2.4 In the exact information case with linear information N, the radius of N is closely related to average widths, see e.g. Magaril-Il'yaev (1994), Maiorov (1993, 1994), Sun and Wang (1994).
Exercises

E 3.2.1 Give an example of a measure ω for which
1. the center does not exist,
2. the center is not unique.

E 3.2.2 The diameter of a measure ω on G is defined as

d(ω) = ( ∫_G ∫_G ||g_1 − g_2||² ω(dg_1) ω(dg_2) )^{1/2}.

Consequently, the (average) diameter of information III is given as

diam^{ave}(III) = ( ∫_Y ( d(ν_2(·|y)) )² µ_1(dy) )^{1/2}.

Show that r(ω) ≤ d(ω) ≤ 2 r(ω) and rad^{ave}(III) ≤ diam^{ave}(III) ≤ 2 rad^{ave}(III).

E 3.2.3 Let G be a separable Hilbert space. Show that then d(ω) = √2 r(ω) and diam^{ave}(III) = √2 rad^{ave}(III).

E 3.2.4 Let F = ℝ^m and µ be the weighted Lebesgue measure,

µ(A) = ∫_A α(f) d_m f,

for some positive α : ℝ^m → ℝ_+ such that ∫_{ℝ^m} α(f) d_m f = 1, where d_m is the m dimensional Lebesgue measure. Consider the information distribution III with Y = ℝⁿ and

π_f(B) = ∫_B β(y − N(f)) d_n y,
where N : ℝ^m → ℝⁿ, β : ℝⁿ → ℝ_+, and ∫_{ℝⁿ} β(y) d_n y = 1. Show that in this case

µ_1(B) = ∫_B γ(y) d_n y

and

µ_2(A|y) = γ(y)^{−1} ∫_A α(f) β(y − N(f)) d_m f,

where γ(y) = ∫_{ℝ^m} α(f) β(y − N(f)) d_m f, ∀y ∈ Y.
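The formulas of E 3.2.4 can be checked numerically in the scalar case m = n = 1, with hypothetical Gaussian choices for α and β and a hypothetical linear N; the posterior density then has a known conjugate-Gaussian form:

```python
import numpy as np

f = np.linspace(-8, 8, 4001)
df = f[1] - f[0]

def gauss(x, var):
    return np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

alpha = gauss(f, 1.0)            # prior density alpha = N(0, 1)
N = lambda x: 2.0 * x            # hypothetical information functional
beta_var = 0.25                  # noise density beta = N(0, 0.25)

y = 0.7
gamma_y = np.sum(alpha * gauss(y - N(f), beta_var)) * df   # gamma(y)
post = alpha * gauss(y - N(f), beta_var) / gamma_y         # mu_2(.|y) density

assert abs(np.sum(post) * df - 1.0) < 1e-6                 # integrates to 1
# Conjugate check: posterior mean of f is (2y/0.25) / (1 + 4/0.25).
m_post = 2 * y / beta_var / (1.0 + 4.0 / beta_var)
assert abs(np.sum(f * post) * df - m_post) < 1e-6
```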
E 3.2.5 Let the solution operator S : F → G, measure µ on F and information III with Y = ℝⁿ be given. Define the space F̃ = F × Y, solution operator S̃ : F̃ → G, measure µ̃ on F̃ and exact information Ñ : F̃ → Y as

S̃(f, y) = S(f), µ̃(B) = ∫_F π_f(B_f) µ(df), Ñ(f, y) = y.

Show that for any algorithm φ : Y → G we have

e^{ave}(III, φ; S, µ) = e^{ave}(Ñ, φ; S̃, µ̃),

where the second quantity stands for the average error of φ with respect to µ̃, for approximating S̃(f, y) based on exact information y = Ñ(f, y).
3.3 Gaussian measures on Banach spaces

Gaussian measures defined on Banach spaces will play a crucial role in our studies. In this section, we recall what a Gaussian measure is and cite those properties of Gaussian measures that will be needed later.

3.3.1 Basic properties

Assume first that F is a finite dimensional space, F = ℝ^d, where d < +∞. A Gaussian measure µ on ℝ^d is uniquely defined by its mean element m ∈ ℝ^d and its correlation operator (matrix) Σ : ℝ^d → ℝ^d, which is symmetric and nonnegative definite, i.e., Σ = Σ* ≥ 0. If m = 0 and Σ is positive definite (Σ > 0) then

µ(B) = (2π)^{−d/2} (det Σ)^{−1/2} ∫_B exp{ −(1/2) (Σ^{−1}f, f)_2 } df.   (3.4)

(Here df stands for the Lebesgue measure on ℝ^d.) In the case of m ≠ 0 and/or singular Σ, the Gaussian measure µ is concentrated on m + X_1, where X_1 = Σ(ℝ^d), and given as follows. Let Σ_1 : X_1 → X_1, Σ_1(x) = Σ(x), ∀x ∈ X_1, and let d_1 = dim X_1. Then, for any B = m + B_1, where B_1 is a Borel subset of X_1, the measure µ(B) is given by the right hand side of (3.4) with Σ, d and B replaced by Σ_1, d_1 and B_1, respectively, and with the Lebesgue measure df on X_1.
If m = 0 and Σ = I is the identity then μ is called the standard d-dimensional Gaussian distribution. Let μ be a Gaussian measure on ℝ^d. Then for any x, x₁, x₂ ∈ ℝ^d we have
$$ \int_{\mathbb{R}^d} (x, f)_2 \, \mu(df) \,=\, (x, m)_2 $$
and
$$ \int_{\mathbb{R}^d} (x_1, f - m)_2\, (x_2, f - m)_2 \, \mu(df) \,=\, (\Sigma x_1, x_2)_2 \,=\, (\Sigma x_2, x_1)_2. $$
Consider now the more general case of a separable Banach space F. A Borel measure μ on F is Gaussian if for any n and any continuous linear mapping N : F → ℝⁿ, the induced measure μ_N = μN⁻¹, given as
$$ \mu N^{-1}(B) \,=\, \mu\{ f \in F \mid N(f) \in B \}, \qquad \forall\, \text{Borel sets } B \text{ of } \mathbb{R}^n, $$
is Gaussian.
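As a quick sanity check of the two moment identities above, the following sketch estimates both integrals by Monte Carlo for d = 2. The mean m, correlation matrix Σ, and test vectors are arbitrary illustrative choices, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A concrete 2-dimensional Gaussian measure: mean m and correlation matrix Sigma
# (both chosen arbitrarily for illustration).
m = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Draw samples f ~ N(m, Sigma).
samples = rng.multivariate_normal(m, Sigma, size=200_000)

x1 = np.array([1.0, 0.0])
x2 = np.array([0.3, -0.7])

# First identity: the mean of (x, f)_2 is (x, m)_2.
lhs_mean = np.mean(samples @ x1)
rhs_mean = x1 @ m

# Second identity: the covariance of (x1, f-m)_2 and (x2, f-m)_2 is (Sigma x1, x2)_2.
centered = samples - m
lhs_cov = np.mean((centered @ x1) * (centered @ x2))
rhs_cov = (Sigma @ x1) @ x2

print(lhs_mean, rhs_mean)   # close to each other
print(lhs_cov, rhs_cov)     # close to each other
```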
As in the finite dimensional case, any Gaussian measure μ defined on a separable Banach space F is determined by its mean element m_μ ∈ F and its correlation operator C_μ : F* → F.¹ These are characterized by the relations
$$ L(m_\mu) \,=\, \int_F L(f)\, \mu(df), \qquad \forall\, L \in F^*, $$
and
$$ L_1(C_\mu L_2) \,=\, \int_F L_1(f - m_\mu)\, L_2(f - m_\mu)\, \mu(df), \qquad \forall\, L_1, L_2 \in F^*. $$
That is, for any mapping N(f) = [L₁(f), …, Lₙ(f)], where Lᵢ ∈ F*, 1 ≤ i ≤ n, the Gaussian measure μN⁻¹ has mean element m = N(m_μ) and correlation matrix Σ = {Lᵢ(C_μLⱼ)}ⁿᵢ,ⱼ₌₁. The correlation operator is always symmetric, L₁(C_μL₂) = L₂(C_μL₁), and nonnegative definite, L(C_μL) ≥ 0. It is positive definite, i.e., L(C_μL) > 0, ∀ L ≠ 0, iff μ has full support, supp μ = F. In general, μ is concentrated on the hyperplane m_μ + closure(C_μ(F*)).

¹ For F = ℝ^d or, more generally, for the case of F being a Hilbert space, we have F* = F. Then C_μ can be considered as an operator in F, C_μ : F → F.

Let F be a separable Hilbert space. Then C_μ : F* = F → F is the correlation operator of a Gaussian measure on F iff it is symmetric,
nonnegative definite and has finite trace, i.e.,
$$ \mathrm{tr}(C_\mu) \,=\, \int_F \|f\|^2\, \mu(df) \,=\, \sum_{i=1}^{\infty} (C_\mu \eta_i, \eta_i) \,<\, +\infty, $$
where {ηᵢ} is a complete orthonormal system in F.

The complete characterization of correlation operators of Gaussian measures on Banach spaces is not known. However, in this case we have the following fact. Let C_μ be the correlation operator of a Gaussian measure on F. Let a ∈ F and let C′ : F* → F be a symmetric operator, L₁(C′L₂) = L₂(C′L₁), such that
$$ 0 \,\le\, L(C'L) \,\le\, L(C_\mu L), \qquad \forall\, L_1, L_2, L \in F^*. $$
Then there exists a (unique) Gaussian measure on F with mean element a and correlation operator C′.

The characteristic functional ψ_μ : F* → ℂ (where ℂ stands for the complex numbers) of a measure μ is given as
$$ \psi_\mu(L) \,=\, \int_F e^{\,i L(f)}\, \mu(df) \qquad (i = \sqrt{-1}\,). $$
Any measure is uniquely determined by its characteristic functional. If μ is Gaussian with mean m_μ and correlation operator C_μ then
$$ \psi_\mu(L) \,=\, \exp\Big\{ i\, L(m_\mu) \,-\, \frac{1}{2}\, L(C_\mu L) \Big\}. $$
The correlation operator C_μ generates a semi-inner-product (·,·)_μ : F* × F* → ℝ on the space F*. This is defined as
$$ (L_1, L_2)_\mu \,=\, L_1(C_\mu L_2) \,=\, L_2(C_\mu L_1) \,=\, \int_F L_1(f - m_\mu)\, L_2(f - m_\mu)\, \mu(df), \qquad L_1, L_2 \in F^*. $$
We denote by ‖·‖_μ the corresponding seminorm, ‖L‖_μ = √(L, L)_μ. If supp μ = F then C_μ is one-to-one, so that (·,·)_μ is an inner product and ‖·‖_μ is a norm. The space F* with the norm ‖·‖_μ is complete only if dim F < +∞. μ-orthogonality in F* means orthogonality with respect to (·,·)_μ.
3.3.2 Gaussian measures as abstract Wiener spaces

We noticed that any Gaussian measure μ is determined by its mean element and correlation operator. Sometimes it is convenient to define μ in another way.
Let H be a separable Hilbert space. For any (cylindrical) set B ⊂ H of the form B = { g ∈ H | P(g) ∈ A }, where P is the H-orthogonal projection onto a finite dimensional subspace of H and A is a Borel set of P(H), we let
$$ \mu'(B) \,=\, \frac{1}{(\sqrt{2\pi}\,)^{n}} \int_A e^{-\|g\|_H^2/2}\, dg, \qquad (3.5) $$
where n = dim P(H) and dg is the Lebesgue measure on P(H). That is, μ′ is the standard weak distribution on the algebra of cylindrical sets.
Note that μ′ is an additive measure but, in the case dim H = +∞, it cannot be extended to a σ-additive measure on the Borel σ-field of H. Let ‖·‖_* be another norm on H which is weaker than the original norm ‖·‖_H, i.e., ‖g‖_* ≤ c‖g‖_H, ∀ g ∈ H, for some c > 0. Let F be the closure of H with respect to ‖·‖_*. It turns out that if ‖·‖_* possesses some additional properties (it is in some sense measurable, see NR 3.3.2), then there exists a unique σ-additive measure μ defined on the Borel sets of F such that the following holds. For any n and continuous linear functionals Lᵢ ∈ F*, 1 ≤ i ≤ n, we have
$$ \mu(\{\, f \in F \mid (L_1(f), \ldots, L_n(f)) \in B \,\}) \,=\, \mu'(\{\, g \in H \mid ((g_{L_1}, g)_H, \ldots, (g_{L_n}, g)_H) \in B \,\}), $$
for all Borel sets B ⊂ ℝⁿ. Here g_L is the representer of L in H, i.e., L(f) = (g_L, f)_H for f ∈ H. The pair {H, F} is called an abstract Wiener space.

Such an extension of μ′ to a Gaussian measure μ always exists. For instance, we can take ‖g‖_*² = (Ag, g)_H, where A : H → H is an arbitrary symmetric, positive definite operator with finite trace. Then the resulting space F is a separable Hilbert space and the correlation operator of μ is given by the continuous extension of A to the operator A : F → F.

On the other hand, for any separable Banach space F equipped with a zero-mean Gaussian measure μ, there exists a unique separable Hilbert space H such that C_μ(F*) ⊂ H ⊂ closure(C_μ(F*)), and {H, F₁} with F₁ = supp μ = closure(C_μ(F*)) is an abstract Wiener space. The space H is given as follows. Let H₀ = C_μ(F*). For f₁, f₂ ∈ H₀ we define (f₁, f₂)_H = (L₁, L₂)_μ, where the Lᵢ, i = 1, 2, are arbitrary functionals satisfying C_μLᵢ = fᵢ. Since (f₁, f₂)_H does not depend on the choice of the Lᵢ, it is a well defined inner product on H₀. Then H is the closure of H₀ with respect to the norm ‖·‖_H = √(·,·)_H. Clearly, (3.5) also holds.
Thus any zero-mean Gaussian measure on a separable Banach space can be viewed as an abstract Wiener space (and, of course, vice versa). Let μ be the Gaussian measure for an abstract Wiener space {H, F}. Then for L, L₁, L₂ ∈ F* we have
$$ \int_F L(f)\, \mu(df) \,=\, \big( 2\pi \|g_L\|_H^2 \big)^{-1/2} \int_{\mathbb{R}} x \exp\{ -x^2/(2\|g_L\|_H^2) \}\, dx \,=\, 0 $$
and
$$ L_1(C_\mu L_2) \,=\, \int_F L_1(f)\, L_2(f)\, \mu(df) \,=\, (g_{L_1}, g_{L_2})_H. \qquad (3.6) $$
Hence μ has mean element zero and positive definite correlation operator C_μ(L) = g_L, ∀ L ∈ F*.

Finally, consider the case when H is an r.k.h.s. with r.k. R : T × T → ℝ (see Subsection 2.6.4). Suppose that the function evaluations L_t(f) = f(t), f ∈ F, are continuous functionals on F for all t ∈ T. In this case, by (3.6) we have
$$ L_t(C_\mu L_s) \,=\, (R_t, R_s)_H \,=\, R(t, s), \qquad \forall\, s, t \in T. $$
For this reason, the reproducing kernel R is also called the covariance kernel. The measure μ is uniquely determined by its covariance kernel.
Example 3.3.1 Let r ≥ 0. Let H = W⁰ᵣ₊₁ be the reproducing kernel Hilbert space of Example 2.6.3 with (a, b) = (0, 1). That is,
$$ W^0_{r+1} \,=\, \{\, f : [0,1] \to \mathbb{R} \mid f^{(r)} \text{ abs. cont.},\ f^{(i)}(0) = 0,\ 0 \le i \le r,\ f^{(r+1)} \in \mathcal{L}_2([0,1]) \,\}. $$
Then ‖f‖ […] K > 0, such that for all t₁, t₂ ∈ [0, 1]
$$ \int_{\mathbb{R}^2} |x_1 - x_2|^r\, \mu_{t_1 t_2}(dx) \,\le\, K\, |t_1 - t_2|^{1+\lambda}. $$
Here x = (x₁, x₂) and μ_{t₁t₂} is the Gaussian measure in ℝ² with correlation matrix {R(tᵢ, tⱼ)}²ᵢ,ⱼ₌₁.
NR 3.3.4 Let F_all be the space of all functions f : [a, b] → ℝ. Then any function f(t), a ≤ t ≤ b, can be viewed as a realization of the stochastic process corresponding to a covariance kernel R(s, t), a ≤ s, t ≤ b. For instance, the process corresponding to the kernel R(s, t) = min{s, t} is called a Brownian motion. The reader interested in stochastic processes is referred to, e.g., Gikhman and Skorohod (1965).
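The Brownian motion kernel R(s, t) = min{s, t} can be illustrated numerically: the sketch below (grid and sample sizes are arbitrary choices) draws paths whose covariance matrix is {R(tᵢ, tⱼ)} and checks that their empirical covariance reproduces min(s, t).

```python
import numpy as np

rng = np.random.default_rng(1)

# Grid on (0, 1]; the Wiener measure has covariance kernel R(s, t) = min(s, t).
t = np.linspace(0.01, 1.0, 50)
K = np.minimum.outer(t, t)

# Sample paths of the process: W = L x with x ~ N(0, I), where K = L L^T.
L = np.linalg.cholesky(K + 1e-12 * np.eye(len(t)))  # tiny jitter for numerical safety
paths = (L @ rng.standard_normal((len(t), 20_000))).T

# The empirical covariance of the sampled (zero-mean) paths should reproduce min(s, t).
emp_cov = (paths.T @ paths) / paths.shape[0]
max_err = np.max(np.abs(emp_cov - K))
print(max_err)  # small
```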
NR 3.3.5 In the case of multivariate functions, Gaussian measures may be defined based on Gaussian distributions on univariate functions. An example is provided by the Wiener sheet measure, which is given as follows.

Let d ≥ 1 and rᵢ ≥ 0, 1 ≤ i ≤ d. Let F be the Banach space of functions f : [0, 1]^d → ℝ that are rᵢ times continuously differentiable with respect to the ith variable,
$$ F \,=\, C^{r_1 \ldots r_d}_{0 \ldots 0} \,=\, \{\, f : [0,1]^d \to \mathbb{R} \mid D^{r_1 \ldots r_d} f \text{ cont.},\ (D^{i_1 \ldots i_d} f)(t) = 0,\ 0 \le i_j \le r_j,\ 1 \le j \le d, \text{ when at least one } t_j \text{ is zero} \,\}, $$
with the norm ‖f‖ = sup_{t ∈ [0,1]^d} |(D^{r₁…r_d} f)(t)|. The Wiener sheet measure on F is defined as
$$ w_{r_1 \ldots r_d}(B) \,=\, w_{0 \ldots 0}\big( D^{r_1 \ldots r_d}(B) \big), \qquad \forall\, \text{Borel sets } B \text{ of } F, $$
where w₀…₀ is the classical Wiener measure on C⁰⋯⁰₀…₀. Its covariance kernel is given as
$$ R_{0 \ldots 0}(s, t) \,=\, \int_{C^{0 \ldots 0}_{0 \ldots 0}} f(s) f(t)\; w_{0 \ldots 0}(df) \,=\, \prod_{j=1}^d \min\{ s_j, t_j \}, $$
where s = (s₁, …, s_d), t = (t₁, …, t_d).
It is easy to see that w_{r₁…r_d} is the zero-mean Gaussian measure with covariance kernel
$$ R_{r_1 \ldots r_d}(s, t) \,=\, \prod_{j=1}^d R_{r_j}(s_j, t_j), $$
where R_{r_j} is the covariance kernel of the r_j-fold Wiener measure on C^{r_j}₀. Hence the abstract Wiener space for w_{r₁…r_d} is {W⁰_{r₁+1…r_d+1}, C^{r₁…r_d}₀…₀}, where W⁰_{r₁+1…r_d+1} is the r.k.h.s. defined in NR 2.6.7.

Another example of a Gaussian distribution on multivariate functions is the isotropic Wiener measure (or the Brownian motion in Lévy's sense), which is defined on the space C([0, 1]^d). Its mean is zero and its covariance kernel is given as
$$ R(s, t) \,=\, \frac{1}{2}\big( \|s\|_2 + \|t\|_2 - \|s - t\|_2 \big), \qquad s, t \in [0, 1]^d, $$
see, e.g., Ciesielski (1975) for more details.
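Both covariance kernels named in this remark can be checked numerically for nonnegative definiteness; the sketch below (with arbitrarily chosen points in [0, 1]², not taken from the text) builds the kernel matrices and inspects their eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random evaluation points in [0, 1]^2.
pts = rng.random((40, 2))

def wiener_sheet_kernel(s, t):
    # Covariance of the Wiener sheet: product of min(s_j, t_j) over coordinates.
    return np.prod(np.minimum(s, t))

def levy_kernel(s, t):
    # Covariance of the isotropic Wiener measure (Brownian motion in Levy's sense).
    return 0.5 * (np.linalg.norm(s) + np.linalg.norm(t) - np.linalg.norm(s - t))

for kernel in (wiener_sheet_kernel, levy_kernel):
    K = np.array([[kernel(s, t) for t in pts] for s in pts])
    # A covariance kernel must yield a symmetric nonnegative definite matrix.
    assert np.allclose(K, K.T)
    eigs = np.linalg.eigvalsh(K)
    print(kernel.__name__, eigs.min())  # nonnegative up to round-off
```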
Exercises

E 3.3.1 Let H be a separable Hilbert space. Let {eᵢ} be a complete orthonormal system in H, and let Pₙ : H → ℝⁿ, n ≥ 1, be defined as Pₙ(x) = {(x, eᵢ)}ⁿᵢ₌₁. Prove that there is no Gaussian measure μ on H such that for all n, μPₙ⁻¹ is the n-dimensional zero-mean Gaussian measure with identity correlation operator.

E 3.3.2 The space l₂ can be treated as the space of functions f : {1, 2, …} → ℝ such that ‖f‖² = Σ∞ᵢ₌₁ f²(i) < +∞. Show that R(i, j) = λᵢ δᵢⱼ, i, j ≥ 1, is the covariance kernel of a Gaussian measure on l₂ iff Σ∞ᵢ₌₁ λᵢ < +∞.

E 3.3.3 Let H and F be separable Hilbert spaces with the same elements and equivalent norms. Show that {H, F} is an abstract Wiener space if and only if dim H = dim F < +∞.

E 3.3.4 Let {H, F} be an abstract Wiener space with dim H = +∞, and μ the corresponding Gaussian measure. Show that μ(H) = 0.

E 3.3.5 Show that the r-fold Wiener measures w_r satisfy w_r(B) = w_s(D^{r−s}(B)), where D^k is the differential operator of order k, and r ≥ s ≥ 0.

E 3.3.6 Let R_r be the covariance kernel of the r-fold Wiener measure. Check that
$$ R_r(s, t) \,=\, \int_0^s \int_0^t R_{r-1}(u_1, u_2)\, du_1\, du_2. $$
3.4 Linear problems with Gaussian measures

We start the study of linear problems with Gaussian measures. These are problems for which:
- F is a separable Banach space, G is a separable Hilbert space, and the solution operator S : F → G is continuous and linear,
- the a priori distribution μ on F is a zero-mean Gaussian measure,
- information is linear with Gaussian noise.

The latter assumption means that the information distribution 𝕀 = {π_f} is given as
$$ \pi_f \,=\, N(N(f), \sigma^2 \Sigma), \qquad (3.7) $$
where N : F → ℝⁿ is a continuous linear operator,
$$ N(f) \,=\, [L_1(f), L_2(f), \ldots, L_n(f)], \qquad f \in F, $$
Σ : ℝⁿ → ℝⁿ is a symmetric and nonnegative definite operator (matrix), and σ ≥ 0. That is, information y about f is obtained by noisy observation of the value N(f), and the noise x = y − N(f) is a zero-mean Gaussian random variable with covariance matrix σ²Σ.

The parameter σ is interpreted as the noise level. Hence it plays a similar role to δ in the worst case setting. If σ → 0⁺ then noisy information approaches exact information, which is obtained by letting σ = 0. Note that independent observations of the Lᵢ correspond to a diagonal matrix Σ. If Σ is the identity, Σ = I, then the noise x = y − N(f) is
said to be white. In this case x ~ N(0, σ²I). For instance, in Example 3.2.2 we have
$$ N(f) \,=\, [f(t_1), f(t_2), \ldots, f(t_n)] $$
and the information noise is white.

The goal of this section is to find optimal algorithms and radii of information for linear problems with Gaussian measures. To this end, we first give formulas for induced and conditional distributions.
3.4.1 Induced and conditional distributions

The following lemma is well known. For completeness, we provide it with a proof.

Lemma 3.4.1 Let ω be the Gaussian measure on F with mean element m_ω and correlation operator C_ω. Then the induced measure ωS⁻¹ is also Gaussian. The mean element of ωS⁻¹ is S(m_ω) and the correlation operator equals S(C_ω S*), where S* : G = G* → F* is the adjoint operator to S, i.e., S*(g) = (S(·), g).

Proof Indeed, the characteristic functional of ωS⁻¹ is given as
$$ \int_G e^{\,i(x, g)}\, \omega S^{-1}(dx) \,=\, \int_F e^{\,i(S(f), g)}\, \omega(df) \,=\, \int_F e^{\,i(S^*g)(f)}\, \omega(df) $$
$$ =\, \exp\big\{ i\,(S^*g)(m_\omega) - \tfrac{1}{2}\,(S^*g)\big( C_\omega(S^*g) \big) \big\} \,=\, \exp\big\{ i\,(S(m_\omega), g) - \tfrac{1}{2}\,\big( S C_\omega(S^*g), g \big) \big\}. $$
Hence S(m_ω) is the mean element and S(C_ω S*) is the correlation operator of ωS⁻¹. ∎
For given information (3.7), let
$$ G_N \,=\, \{ (L_j, L_k)_\mu \}_{j,k=1}^n $$
be the (μ-)Gram matrix of the functionals Lᵢ. Clearly, G_N is symmetric and nonnegative definite. Let Y₁ = (σ²Σ + G_N)(ℝⁿ). Then for any y ∈ Y₁ there is exactly one element z ∈ Y₁ satisfying (σ²Σ + G_N)z = y.

Lemma 3.4.2 For any L ∈ F* we have N(C_μL) ∈ Y₁.

Proof Indeed, any L ∈ F* can be decomposed as L = L₀ + Σⁿⱼ₌₁ aⱼLⱼ, where L₀ ⊥_μ span{L₁, …, Lₙ}. Then N(C_μL) = G_N(a) with the vector a = (a₁, …, aₙ). Since both matrices Σ and G_N are symmetric and nonnegative definite, we have G_N(ℝⁿ) ⊂ (σ²Σ + G_N)(ℝⁿ) = Y₁, and N(C_μL) ∈ Y₁. ∎
We now show formulas for the regular conditional distribution. Recall that the distribution of information y on ℝⁿ is denoted by μ₁, and the conditional distribution on F with respect to y is denoted by μ₂(·|y).

Theorem 3.4.1 For linear information with Gaussian noise, μ₁ is a zero-mean Gaussian measure and its correlation matrix is σ²Σ + G_N. Furthermore, the conditional measure μ₂(·|y), y ∈ Y₁, is also Gaussian. Its mean element equals
$$ m(y) \,=\, \sum_{j=1}^n z_j\, (C_\mu L_j), $$
where z = z(y) = (z₁, …, zₙ) ∈ Y₁ satisfies (σ²Σ + G_N)z = y. The correlation operator of μ₂(·|y) is independent of y and is given as
$$ C_{\mu_2}(L) \,=\, C_\mu(L) \,-\, m\big( N(C_\mu L) \big), \qquad \forall\, L \in F^*. $$

Observe that the measure μ₁ is concentrated on the subspace Y₁. Therefore it suffices to define μ₂(·|y) only for y ∈ Y₁.
Proof The characteristic functional of the measure μ₁ is given as (a ∈ ℝⁿ and i = √−1)
$$ \psi_{\mu_1}(a) \,=\, \int_{\mathbb{R}^n} e^{\,i(y, a)_2}\, \mu_1(dy) \,=\, \int_F \int_{\mathbb{R}^n} e^{\,i(y, a)_2}\, \pi_f(dy)\, \mu(df) $$
$$ =\, \int_F \exp\big\{ i\,(N(f), a)_2 - \tfrac{1}{2}\sigma^2 (\Sigma a, a)_2 \big\}\, \mu(df). $$
Since for L_a(·) = (N(·), a)₂ we have L_a(C_μL_a) = (G_N a, a)₂, we find that
$$ \psi_{\mu_1}(a) \,=\, \exp\big\{ -\tfrac{1}{2} \big( (\sigma^2\Sigma + G_N)a, a \big)_2 \big\}. $$
Hence μ₁ is the zero-mean Gaussian measure with correlation matrix σ²Σ + G_N.

We now pass to the conditional distribution. Observe first that, owing to Lemma 3.4.2, the element m(N(C_μL)) in the definition of C_{μ₂} is well defined. For y ∈ Y₁, let μ₂′(·|y) be the Gaussian measure on F with mean m′(y) = Σⁿⱼ₌₁ zⱼ(C_μLⱼ), where (σ²Σ + G_N)z = y, and correlation operator C′(·) = C_{μ₂}(·). These are well defined Gaussian measures. Indeed, for y ∈ Y₁ we have
$$ L(m'(y)) \,=\, L\Big( \sum_{j=1}^n z_j C_\mu L_j \Big) \,=\, \big( (\sigma^2\Sigma + G_N)^{-1} y,\; N C_\mu L \big)_2 \,=\, \big( y,\; (\sigma^2\Sigma + G_N)^{-1} N C_\mu L \big)_2. $$
Hence, for any L, L′ ∈ F*, we have
$$ L(C'L') \,=\, L(C_\mu L') - L\big( m'(N C_\mu L') \big) \,=\, L(C_\mu L') - \big( N C_\mu L',\; (\sigma^2\Sigma + G_N)^{-1} N C_\mu L \big)_2 $$
$$ =\, L'(C_\mu L) - \big( N C_\mu L,\; (\sigma^2\Sigma + G_N)^{-1} N C_\mu L' \big)_2 \,=\, L'(C'L), $$
and 0 ≤ L(C′L) ≤ L(C_μL).

We need to show that the characteristic functional of the joint measure μ̃ equals the characteristic functional of the measure μ̃′ defined as
$$ \tilde\mu'(B) \,=\, \int_{\mathbb{R}^n} \mu_2'(B_y\,|\,y)\, \mu_1(dy), \qquad \forall\, \text{Borel sets } B \text{ of } \tilde F = F \times \mathbb{R}^n. $$
To this end, let L̃ ∈ F̃*. Then there exist L ∈ F* and w ∈ ℝⁿ such that
$$ \tilde L(f, y) \,=\, L(f) + (y, w)_2, \qquad \forall\, \tilde f = (f, y) \in \tilde F. $$
We have
$$ \psi_{\tilde\mu'}(\tilde L) \,=\, \int_{\mathbb{R}^n} \Big( \int_F \exp\{ i( L(f) + (y, w)_2 ) \}\, \mu_2'(df\,|\,y) \Big)\, \mu_1(dy) $$
$$ =\, \int_{\mathbb{R}^n} \exp\{ i(y, w)_2 \} \Big( \int_F \exp\{ i L(f) \}\, \mu_2'(df\,|\,y) \Big)\, \mu_1(dy) $$
$$ =\, \int_{\mathbb{R}^n} \exp\Big\{ i\big( (y, w)_2 + L(m'(y)) \big) \,-\, \tfrac{1}{2}\big( L(C_\mu L) - L(m'(N C_\mu L)) \big) \Big\}\, \mu_1(dy). $$
Recall that for y ∈ Y₁ we have L(m′(y)) = (y, (σ²Σ + G_N)⁻¹NC_μL)₂. Hence L(m′(NC_μL)) = (NC_μL, (σ²Σ + G_N)⁻¹NC_μL)₂, and
$$ \psi_{\tilde\mu'}(\tilde L) \,=\, \exp\Big\{ -\tfrac{1}{2}\big( L(C_\mu L) - (N C_\mu L, (\sigma^2\Sigma + G_N)^{-1} N C_\mu L)_2 \big) \Big\} \times \int_{\mathbb{R}^n} \exp\big\{ i\big( y,\; w + (\sigma^2\Sigma + G_N)^{-1} N C_\mu L \big)_2 \big\}\, \mu_1(dy) $$
$$ =\, \exp\Big\{ -\tfrac{1}{2}\big( L(C_\mu L) - (N C_\mu L, (\sigma^2\Sigma + G_N)^{-1} N C_\mu L)_2 \big) \Big\} $$
$$ \quad \times\, \exp\Big\{ -\tfrac{1}{2}\Big( \big( (\sigma^2\Sigma + G_N)w, w \big)_2 + \big( N C_\mu L, (\sigma^2\Sigma + G_N)^{-1} N C_\mu L \big)_2 + 2\,(w, N C_\mu L)_2 \Big) \Big\} $$
$$ =\, \exp\Big\{ -\tfrac{1}{2}\Big( L(C_\mu L) + 2\,(w, N C_\mu L)_2 + \big( (\sigma^2\Sigma + G_N)w, w \big)_2 \Big) \Big\}. $$
On the other hand, for the characteristic functional ψ_{μ̃} of the measure μ̃ we have
$$ \psi_{\tilde\mu}(\tilde L) \,=\, \int_F \int_{\mathbb{R}^n} \exp\{ i( L(f) + (y, w)_2 ) \}\, \pi_f(dy)\, \mu(df) $$
$$ =\, \int_F \exp\{ i L(f) \} \Big( \int_{\mathbb{R}^n} \exp\{ i(y, w)_2 \}\, \pi_f(dy) \Big)\, \mu(df) $$
$$ =\, \exp\big\{ -\tfrac{1}{2}\sigma^2 (\Sigma w, w)_2 \big\} \int_F \exp\{ i( L(f) + (N(f), w)_2 ) \}\, \mu(df) $$
$$ =\, \exp\Big\{ -\tfrac{1}{2}\Big( L(C_\mu L) + 2\,(w, N C_\mu L)_2 + \big( (\sigma^2\Sigma + G_N)w, w \big)_2 \Big) \Big\}. $$
Thus ψ_{μ̃′} = ψ_{μ̃}, which completes the proof. ∎
3.4.2 Optimal algorithms

We are now ready to give formulas for the optimal algorithm φ_opt and the radius of information rad^ave(𝕀). These can easily be found using the formulas for induced and conditional distributions. Indeed, Lemma 3.2.2 states that φ_opt is determined uniquely (up to a set of y having μ₁-measure zero), and that φ_opt(y), y ∈ Y₁, is the mean element of the measure μ₂(S⁻¹(·)|y). By Lemma 3.4.1, φ_opt(y) = S(m(y)), where m(y) is the mean element of μ₂(·|y). Theorem 3.4.1 yields in turn that m(y) = Σⁿⱼ₌₁ zⱼ(C_μLⱼ), where z = (σ²Σ + G_N)⁻¹y ∈ Y₁. Furthermore, by Lemma 3.2.2, the radius of information 𝕀 is given as
$$ \big( \mathrm{rad}^{ave}(\mathbb{I}) \big)^2 \,=\, \big( e^{ave}(\mathbb{I}, \varphi_{opt}) \big)^2 \,=\, \int_G \|g\|^2\, \mu S^{-1}(dg) \,-\, \int_Y \|\varphi_{opt}(y)\|^2\, \mu_1(dy) $$
$$ =\, \mathrm{tr}(S C_\mu S^*) \,-\, \mathrm{tr}\big( \varphi_{opt}\, (\sigma^2\Sigma + G_N)\, \varphi_{opt}^* \big), $$
where φ*_opt : G → Y₁ is the adjoint operator to φ_opt, i.e., it is defined by (φ_opt(y), g) = (y, φ*_opt(g))₂. Observe now that
$$ \varphi_{opt}^*(g) \,=\, (\sigma^2\Sigma + G_N)^{-1}\, N C_\mu(S^* g). $$
Indeed, for any y ∈ Y₁ and g ∈ G we have
$$ (\varphi_{opt}(y), g) \,=\, \Big( \sum_{j=1}^n z_j\, S(C_\mu L_j),\, g \Big) \,=\, \sum_{j=1}^n z_j \big( S(C_\mu L_j), g \big) \,=\, \sum_{j=1}^n z_j\, (S^* g)(C_\mu L_j) \,=\, \sum_{j=1}^n z_j\, L_j(C_\mu S^* g) $$
$$ =\, \big( z,\, N C_\mu(S^* g) \big)_2 \,=\, \big( y,\, (\sigma^2\Sigma + G_N)^{-1} N C_\mu(S^* g) \big)_2 \,=\, \big( y,\, \varphi_{opt}^*(g) \big)_2. $$
Thus φ_opt(σ²Σ + G_N)φ*_opt g = φ_opt(NC_μ(S*g)), ∀ g ∈ G.
Theorem 3.4.2 Let the solution operator S be continuous and linear, and let information IlI be linear with Gaussian noise. Then the optimal algorithm is linear and is given by n
zj S(CcLj),
coopt(y) _
y E Y1,
j=1
where z = z(y) E Y1 satisfies (o2E + GN) z = y. Furthermore, radave(III) = eave(LI Wopt)
tr (SCS*)
 tr (cPopt(NCN,S*)) .
The above formulas can be simplified if we assume a special form of N and Σ. Suppose that information consists of independent observations of n functionals that are μ-orthonormal. That is, the matrix Σ = diag{η₁, …, ηₙ} and N = [L₁, …, Lₙ] with (Lᵢ, Lⱼ)_μ = δᵢⱼ. (Here δᵢⱼ stands for the Kronecker delta.) In this case, the Gram matrix G_N is the identity and σ²Σ + G_N = diag{1 + σ²η₁, …, 1 + σ²ηₙ}. Hence
$$ \varphi_{opt}(y) \,=\, \sum_{j=1}^n (1 + \sigma^2\eta_j)^{-1}\, y_j\, S(C_\mu L_j). $$
If we replace y above by N(C_μL), L ∈ F*, then
$$ \varphi_{opt}(N C_\mu L) \,=\, \sum_{j=1}^n \frac{ (L, L_j)_\mu }{ 1 + \sigma^2\eta_j }\; S(C_\mu L_j), $$
so that for g ∈ G we have
$$ \big( \varphi_{opt}(N C_\mu(S^* g)),\, g \big) \,=\, \sum_{j=1}^n \frac{ (S^* g, L_j)_\mu }{ 1 + \sigma^2\eta_j }\; \big( S(C_\mu L_j),\, g \big). $$
Choosing an orthonormal basis {gᵢ}∞ᵢ₌₁ in G, we obtain
$$ \mathrm{tr}\big( \varphi_{opt}(N C_\mu S^*) \big) \,=\, \sum_{i=1}^{\infty} \sum_{j=1}^n \frac{ \big( S(C_\mu L_j),\, g_i \big)^2 }{ 1 + \sigma^2\eta_j } \,=\, \sum_{j=1}^n \frac{ \| S(C_\mu L_j) \|^2 }{ 1 + \sigma^2\eta_j }. $$
Thus we have the following corollary.

Corollary 3.4.1 If the functionals Lᵢ are μ-orthonormal, (Lᵢ, Lⱼ)_μ = δᵢⱼ for 1 ≤ i, j ≤ n, and the observations of the Lᵢ(f) are independent, Σ = diag{η₁, …, ηₙ}, then the optimal algorithm is
$$ \varphi_{opt}(y) \,=\, \sum_{j=1}^n \frac{ y_j }{ 1 + \sigma^2\eta_j }\; S(C_\mu L_j), $$
and the radius of information is
$$ \mathrm{rad}^{ave}(\mathbb{I}) \,=\, \sqrt{\, \mathrm{tr}(S C_\mu S^*) \,-\, \sum_{j=1}^n \frac{ \| S(C_\mu L_j) \|^2 }{ 1 + \sigma^2\eta_j } \,}. $$
It turns out that the assumptions of Corollary 3.4.1 are not restrictive. More precisely, any linear information with Gaussian noise can be linearly transformed to information of the same radius, consisting of independent observations of μ-orthonormal functionals. Moreover, this transformation does not depend on the noise level σ. The proof is given in NR 3.4.3.
Let us see how the radius of information depends on the noise level σ. As explained above, we can assume without loss of generality that Σ = D = diag{η₁, …, ηₙ} and (Lᵢ, Lⱼ)_μ = δᵢⱼ, 1 ≤ i, j ≤ n. Let r(σ) be the radius of the information y ~ N(N(f), σ²D). Then
$$ r^2(\sigma) \,=\, r^2(0) \,+\, \sum_{j=1}^n \frac{ \sigma^2\eta_j }{ 1 + \sigma^2\eta_j }\, \| S(C_\mu L_j) \|^2. $$
Letting σ → 0⁺, for r(0) > 0 we have
$$ r(\sigma) - r(0) \;\approx\; \frac{\sigma^2}{2} \cdot \frac{ \sum_{j=1}^n \eta_j\, \| S(C_\mu L_j) \|^2 }{ \Big( \mathrm{tr}(S C_\mu S^*) - \sum_{j=1}^n \| S(C_\mu L_j) \|^2 \Big)^{1/2} }, $$
while for r(0) = 0 we have
$$ r(\sigma) - r(0) \,=\, r(\sigma) \;\approx\; \sigma\, \sqrt{ \sum_{j=1}^n \eta_j\, \| S(C_\mu L_j) \|^2 }. $$
Thus the radius r(σ) is a constant or increasing function of σ. If r(0) = 0 then the radius of noisy information converges to the radius of exact information linearly in σ. Otherwise we have quadratic convergence in σ. For S a functional, this is in contrast to the results of the worst case setting, where we always have linear convergence of r(δ) to r(0); see Theorem 2.4.2.
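The two convergence rates can be observed numerically. In the sketch below the weights η_j, the values s_j = ‖S(C_μL_j)‖², and the trace are arbitrary illustrative numbers, not data from the text.

```python
import numpy as np

# Illustrative numbers (assumed): mu-orthonormal functionals with noise weights
# eta_j, values s_j = ||S(C_mu L_j)||^2, and the trace tr(S C_mu S^*).
eta = np.array([1.0, 2.0, 4.0])
s = np.array([0.5, 0.3, 0.2])
trace = 1.5   # > s.sum(), so r(0) > 0 in the first scenario

def r(sig, tr):
    # r^2(sigma) = tr(S C_mu S^*) - sum_j s_j / (1 + sigma^2 eta_j)
    return np.sqrt(tr - np.sum(s / (1.0 + sig**2 * eta)))

# Scenario r(0) > 0: the increment r(sigma) - r(0) behaves like const * sigma^2.
r0 = r(0.0, trace)
ratios = [(r(sig, trace) - r0) / sig**2 for sig in (1e-2, 1e-3, 1e-4)]
print(ratios)   # roughly constant

# Scenario r(0) = 0 (trace = s.sum()): now r(sigma) behaves like const * sigma.
ratios0 = [r(sig, s.sum()) / sig for sig in (1e-2, 1e-3, 1e-4)]
print(ratios0)  # roughly constant
```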
Notes and remarks

NR 3.4.1 This section is based mainly on Plaskota (1990), where the case of white noise Σ = I is considered. Exact information is treated, e.g., in Traub et al. (1988).

NR 3.4.2 It is worth while to mention that the space G in Theorem 3.4.2 need not be a Hilbert space. That is, the algorithm φ(y) = S(m(y)), where m(y) is the mean element of μ₂(·|y), is still optimal when G is a separable Banach space. Indeed, observe that in this case the measure μ₂(S⁻¹(·)|y) remains Gaussian with the mean element S(m(y)) (see E 3.4.3). Any Gaussian measure is centrosymmetric with respect to its mean element (see e.g. Vakhania et al., 1987). Hence, owing to Example 3.2.3, the element S(m(y)) is the center of μ₂(S⁻¹(·)|y), and the algorithm φ(y) = S(m(y)) is optimal.

NR 3.4.3 We now show the result already announced: any linear information with Gaussian noise can be linearly transformed to information of the same radius, consisting of independent observations of μ-orthonormal functionals.
Let 𝕀 = {π_f} with π_f = N(N(f), σ²Σ) and N : F → ℝⁿ. Suppose first that the matrix Σ is nonsingular. Denote by L′ᵢ the functionals which form the operator Σ^{-1/2}N, i.e., Σ^{-1/2}N = [L′₁, …, L′ₙ]. Let G′ = {(L′ᵢ, L′ⱼ)_μ}ⁿᵢ,ⱼ₌₁, and let {q⁽ⁱ⁾}ⁿᵢ₌₁ be an orthonormal basis of eigenvectors of the matrix G′, G′q⁽ⁱ⁾ = ηᵢq⁽ⁱ⁾, where η₁ ≥ ⋯ ≥ η_m > 0 = η_{m+1} = ⋯ = ηₙ. Letting Q be the (orthogonal) n × n matrix of the vectors q⁽ⁱ⁾, and D₁ the m × n diagonal matrix diag{η₁^{-1/2}, …, η_m^{-1/2}}, we define the linear transformation
$$ T \,=\, D_1 Q^* \Sigma^{-1/2} : \mathbb{R}^n \to \mathbb{R}^m. $$
The problem of approximating S(f) from the data y = N(f) + x, where x ~ N(0, σ²Σ), can be translated to the problem of approximating S(f) from ỹ = Ty = Ñ(f) + x̃, where Ñ = TN and x̃ = Tx. The functionals forming Ñ are μ-orthonormal. Indeed, let Ñ = [L̃₁, …, L̃_m]. Then L̃ᵢ = Σⁿⱼ₌₁ ηᵢ^{-1/2} qⱼ⁽ⁱ⁾ L′ⱼ and
$$ (\tilde L_i, \tilde L_j)_\mu \,=\, \sum_{s,t=1}^n \eta_i^{-1/2} \eta_j^{-1/2}\, q^{(i)}_s q^{(j)}_t\, (L'_s, L'_t)_\mu \,=\, \eta_i^{-1/2} \eta_j^{-1/2} \big( G' q^{(i)}, q^{(j)} \big)_2 \,=\, \delta_{ij}. $$
Moreover, the random variable ỹ is Gaussian with mean Ñ(f) and correlation matrix
$$ \sigma^2 \tilde\Sigma \,=\, \sigma^2\, D_1 Q^* \Sigma^{-1/2}\, \Sigma\, \Sigma^{-1/2} Q D_1^* \,=\, \sigma^2\, \mathrm{diag}\{ \eta_1^{-1}, \ldots, \eta_m^{-1} \}. $$
Thus information ỹ is obtained by independent observations of μ-orthonormal functionals.

We show that the information 𝕀 and 𝕀̃ = {π̃_f} with π̃_f = N(Ñ(f), σ²D̃), D̃ = diag{η₁⁻¹, …, η_m⁻¹}, are of the same radius, i.e., rad^ave(𝕀) = rad^ave(𝕀̃). To this end, it suffices that the conditional measures with respect to the information 𝕀 and 𝕀̃ have the same correlation operator. This in turn holds if the functionals Σⁿⱼ₌₁ zⱼLⱼ, where (σ²Σ + G_N)z = NC_μL, and Σ^m_{j=1} z̃ⱼL̃ⱼ, where (σ²D̃ + I)z̃ = ÑC_μL, coincide for all L ∈ F*. Indeed, simple calculations show that z and z̃ satisfy Q*Σ^{1/2}z = (D̃^{1/2}z̃, 0, …, 0). Hence
$$ \sum_{j=1}^m \tilde z_j \tilde L_j(f) \,=\, \big( \tilde z, \tilde N(f) \big)_2 \,=\, \big( \tilde z,\; D_1 Q^* \Sigma^{-1/2} N(f) \big)_2 \,=\, \big( Q^* \Sigma^{1/2} z,\; Q^* \Sigma^{-1/2} N(f) \big)_2 $$
$$ =\, \big( \Sigma^{-1/2} Q Q^* \Sigma^{1/2} z,\; N(f) \big)_2 \,=\, (z, N(f))_2 \,=\, \sum_{j=1}^n z_j L_j(f). $$
Consider now the case when Σ is singular, dim Σ(ℝⁿ) = k < n. Then there exists a nonsingular and symmetric matrix V such that VΣV is diagonal with n − k zeros and k ones. Let VN = [L′₁, …, L′ₙ]. We can assume that the functionals L′ᵢ and L′ⱼ are μ-orthogonal for 1 ≤ i ≤ n − k < j ≤ n, since otherwise the L′ⱼ can be replaced by their μ-orthogonal projections onto (span{L′₁, …, L′_{n−k}})^⊥. Let N₀ = [L′₁, …, L′_{n−k}] and N₁ = [L′_{n−k+1}, …, L′ₙ]. Let O be the zero matrix in ℝ^{n−k} and I the identity matrix in ℝ^k. Now we can use the procedure above to transform y⁽⁰⁾ ~ N(N₀(f), O) and y⁽¹⁾ ~ N(N₁(f), σ²I) into equivalent information ỹ⁽⁰⁾ and ỹ⁽¹⁾, where ỹ⁽⁰⁾ is exact and both consist of independent observations of μ-orthonormal functionals. Then ỹ = [ỹ⁽⁰⁾, ỹ⁽¹⁾] also consists of independent observations of μ-orthonormal functionals, and ỹ is equivalent to y.
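For a finite dimensional model, the transformation T = D₁Q*Σ^{-1/2} of this remark can be carried out explicitly. In the sketch below all matrices are random test data (not from the text), chosen so that Σ is nonsingular and G′ has full rank, i.e., m = n.

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite-dimensional sketch of NR 3.4.3 (assumed test data): prior correlation C,
# information operator N, nonsingular noise covariance E.
d, n = 6, 3
A = rng.standard_normal((d, d)); C = A @ A.T
N = rng.standard_normal((n, d))
B = rng.standard_normal((n, n)); E = B @ B.T

def inv_sqrt(M):
    # Symmetric inverse square root via the eigendecomposition.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals**-0.5) @ vecs.T

# G' = Gram matrix of the functionals forming E^{-1/2} N in the mu-inner product.
Eih = inv_sqrt(E)
Gp = Eih @ (N @ C @ N.T) @ Eih
eta, Q = np.linalg.eigh(Gp)          # here all eta > 0 (generic data, so m = n)

# The transformation T = D_1 Q^* E^{-1/2} with D_1 = diag(eta^{-1/2}).
T = np.diag(eta**-0.5) @ Q.T @ Eih
N_tilde = T @ N

# Transformed functionals are mu-orthonormal: their Gram matrix is the identity...
gram_tilde = N_tilde @ C @ N_tilde.T
# ...and the transformed noise is independent with covariance diag(1/eta).
noise_tilde = T @ E @ T.T
print(np.allclose(gram_tilde, np.eye(n)), np.allclose(noise_tilde, np.diag(1.0 / eta)))
```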
Exercises

E 3.4.1 Prove that N(C_μ(F*)) = G_N(ℝⁿ), and that for σ > 0
$$ N(C_\mu(F^*)) + \Sigma(\mathbb{R}^n) \,=\, (\sigma^2\Sigma + G_N)(\mathbb{R}^n). $$

E 3.4.2 Show that the joint measure μ̃ defined on F × ℝⁿ is Gaussian, that the mean element of μ̃ is zero, and that its correlation operator is given as
$$ C_{\tilde\mu}(\tilde L) \,=\, \Big( C_\mu\Big( L + \sum_{j=1}^n w_j L_j \Big),\; N(C_\mu L) + (\sigma^2\Sigma + G_N)w \Big) \,\in\, F \times \mathbb{R}^n, $$
where L̃(f, y) = L(f) + (y, w)₂, f ∈ F, y ∈ ℝⁿ.

E 3.4.3 Let F and G be separable Banach spaces and S : F → G a continuous linear operator. Let ω be a Gaussian measure on F with mean element m_ω and correlation operator C_ω. Show that then the measure ωS⁻¹ on G is also Gaussian, that its mean element is S(m_ω), and that its correlation operator is C_{ωS⁻¹}(L) = S(C_ω(LS)), L ∈ G*.

E 3.4.4 Suppose that the functionals Lⱼ, 1 ≤ j ≤ n, are μ-orthonormal and Σ = diag{η₁, …, ηₙ}. Let P_N : F* → F* be the μ-orthogonal projection onto the subspace V = span{L₁, …, Lₙ}, and let D : V → V be defined by D(Lⱼ) = (1 + σ²ηⱼ)Lⱼ, 1 ≤ j ≤ n. Show that then the correlation operator C_{μ₂} of the conditional distribution μ₂(·|y) can be rewritten as C_{μ₂} = C_μ(I − D⁻¹P_N), and that for small σ the operator C_{μ₂} is roughly the superposition of the 'almost' μ-orthogonal projection onto V^⊥, and C_μ.

E 3.4.5 Show that φ_opt(y) = φ₀(y − σ²Σz), where (σ²Σ + G_N)z = y and φ₀ is the optimal algorithm for exact information (σ = 0).

E 3.4.6 Let S : F → G be a given solution operator, and let 𝕀 be given linear information with Gaussian noise. Let μ_m be a Gaussian measure on F with mean element m, not necessarily equal to zero. Let φ_m and rad^ave_m(𝕀) denote the optimal algorithm and the radius of information with respect to μ_m. Show that for all m ∈ F we have rad^ave_m(𝕀) = rad^ave₀(𝕀) and φ_m(y) = S(m) + φ₀(y − N(m)).

E 3.4.7 Let 𝕀 be linear information with Gaussian noise, π_f = N(N(f), Σ), with Y = ℝⁿ. Let B : ℝⁿ → ℝ^m be a linear mapping. Show that for the information 𝕀̃ given as π̃_f = N(BN(f), BΣB*) we have rad^ave(𝕀) ≤ rad^ave(𝕀̃). If B is nonsingular then rad^ave(𝕀̃) = rad^ave(𝕀) and the corresponding optimal algorithms satisfy φ̃_opt(y) = φ_opt(B⁻¹y).

E 3.4.8 Suppose we approximate S(f) from information y = [N(f) + x, t], where N : F → ℝⁿ is a continuous linear operator and the noise (x, t) has mean zero and is Gaussian. Prove that if the random variables x and t are independent then the 'pure noise' data t do not count. That is, the radius of information corresponding to y is equal to the radius of y⁽¹⁾ = N(f) + x, and the optimal algorithm uses only y⁽¹⁾.
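The identity of E 3.4.5 is easy to verify numerically in a finite dimensional model; in the sketch below all matrices are random test data (not from the text), with G_N nonsingular so that the exact-information algorithm is well defined.

```python
import numpy as np

rng = np.random.default_rng(5)

# Numerical check of E 3.4.5 in a finite-dimensional model (assumed test data).
d, n = 5, 3
A = rng.standard_normal((d, d)); C = A @ A.T     # prior correlation
S = rng.standard_normal(d)                        # a linear functional
N = rng.standard_normal((n, d))
B = rng.standard_normal((n, n)); E = B @ B.T     # noise covariance
sigma = 0.4

G = N @ C @ N.T                                   # Gram matrix G_N (nonsingular here)
M = sigma**2 * E + G

def phi(y, Mat):
    # Optimal algorithm with matrix Mat in place of (sigma^2 E + G_N);
    # Mat = G gives the exact-information algorithm phi_0.
    return (S @ C @ N.T) @ np.linalg.solve(Mat, y)

y = rng.standard_normal(n)
z = np.linalg.solve(M, y)

lhs = phi(y, M)                                   # phi_opt(y), noisy information
rhs = phi(y - sigma**2 * (E @ z), G)              # phi_0(y - sigma^2 E z)
print(np.isclose(lhs, rhs))
```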
3.5 The case of linear functionals

In this section we make the additional assumption that the solution operator S is a continuous linear functional. In this case, the formulas for φ_opt and rad^ave(𝕀) in Theorem 3.4.2 can be expressed in a simple way. That is, we have
$$ \varphi_{opt}(y) \,=\, \sum_{j=1}^n z_j\, S(C_\mu L_j) \,=\, \sum_{j=1}^n z_j\, L_j(C_\mu S) \,=\, \big( z, N(C_\mu S) \big)_2, $$
where (σ²Σ + G_N)z = y, or equivalently,
$$ \varphi_{opt}(y) \,=\, (y, w)_2, $$
where w satisfies (σ²Σ + G_N)w = N(C_μS). To find the radius, observe that tr(S(C_μS*)) = ‖S‖²_μ and tr(φ_opt(NC_μS*)) = φ_opt(NC_μS). Hence
$$ \mathrm{rad}^{ave}(\mathbb{I}) \,=\, \sqrt{\, \|S\|_\mu^2 - \varphi_{opt}(N C_\mu S) \,}. $$
For independent observations of μ-orthonormal functionals, i.e., for Σ = D = diag{η₁, …, ηₙ} and (Lᵢ, Lⱼ)_μ = δᵢⱼ, we have
$$ \varphi_{opt}(y) \,=\, \sum_{j=1}^n \frac{ (S, L_j)_\mu }{ 1 + \sigma^2\eta_j }\; y_j $$
and
$$ \mathrm{rad}^{ave}(\mathbb{I}) \,=\, \sqrt{\, \|S\|_\mu^2 - \sum_{j=1}^n \frac{ (S, L_j)_\mu^2 }{ 1 + \sigma^2\eta_j } \,}. $$
Observe that Σⁿⱼ₌₁ (S, Lⱼ)_μLⱼ is the μ-orthogonal projection of S onto V = span{L₁, …, Lₙ}. Denoting this projection by P_N S, the above formulas can be rewritten as
$$ \varphi_{opt}(y) \,=\, (P_N S)(f) \,+\, \sum_{j=1}^n \frac{ (S, L_j)_\mu }{ 1 + \sigma^2\eta_j } \big( x_j - \sigma^2\eta_j L_j(f) \big) \qquad (y = N(f) + x), $$
and
$$ \mathrm{rad}^{ave}(\mathbb{I}) \,=\, \sqrt{\, \| S - P_N S \|_\mu^2 \,+\, \sigma^2 \sum_{j=1}^n \frac{ \eta_j }{ 1 + \sigma^2\eta_j }\, (S, L_j)_\mu^2 \,}. $$
Hence, for small noise level σ, the optimal algorithm is close to the μ-orthogonal projection of S onto V, and the radius is close to the μ-distance of S from V. In particular, for exact information φ_opt(N(f)) = (P_N S)(f) is the exact projection and rad^ave(𝕀) is the exact μ-distance.

In Subsection 2.4.2 we noticed that, in the worst case setting, the problem of approximating a linear functional based on linear information with uniformly bounded noise is as difficult as the hardest one dimensional subproblem contained in the original problem. E 2.3.5 shows that this holds not only for functionals but also for all linear solution operators. We shall see that in the average case setting the corresponding property is preserved only for S a linear functional.
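In a finite dimensional model, the formulas above for a linear functional S take only a few lines; the following sketch uses random test data (not from the text), and the closing posterior-variance cross-check is a standard reformulation rather than part of the statement.

```python
import numpy as np

rng = np.random.default_rng(6)

# Finite-dimensional sketch for a linear functional S (assumed test data).
d, n = 6, 3
A = rng.standard_normal((d, d)); C = A @ A.T
S = rng.standard_normal(d)
N = rng.standard_normal((n, d))
B = rng.standard_normal((n, n)); E = B @ B.T
sigma = 0.25

G = N @ C @ N.T
M = sigma**2 * E + G
NCS = N @ C @ S                       # the vector N(C_mu S)

# phi_opt(y) = (y, w)_2 with (sigma^2 E + G_N) w = N(C_mu S).
w = np.linalg.solve(M, NCS)

# rad^2 = ||S||_mu^2 - phi_opt(N(C_mu S)).
rad_sq = S @ C @ S - NCS @ w

# Cross-check: the same number as the posterior variance S C_post S^T,
# with C_post = C - C N^T (sigma^2 E + G_N)^{-1} N C.
C_post = C - C @ N.T @ np.linalg.solve(M, N @ C)
print(np.isclose(rad_sq, S @ C_post @ S), rad_sq > 0)
```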
We first explain what we mean by one dimensional subproblems in the average case setting. For a functional K ∈ F* with ‖K‖_μ > 0, let P_K : F → F be given as
$$ P_K(f) \,=\, f \,-\, \frac{ K(f) }{ \|K\|_\mu^2 }\; C_\mu K. $$
Note that for any L ∈ F* we have
$$ P_K(C_\mu L) \,=\, C_\mu\Big( L \,-\, \frac{ (K, L)_\mu }{ \|K\|_\mu^2 }\, K \Big). $$
That is, P_K C_μ : F* → F is the superposition of the μ-orthogonal projection onto the subspace μ-orthogonal to K, and C_μ. We also have that ker P_K = span{C_μK} and P_K(F) = ker K. The a priori Gaussian measure μ on F can be decomposed with respect to P_K, i.e.,
$$ \mu \,=\, \int_{\ker K} \mu_K(\cdot\,|\,g)\; \mu P_K^{-1}(dg), $$
where μ_K(·|g) is the conditional measure on F given g = P_K(f). Obviously, μ_K(·|g) is concentrated on the line
$$ P_K^{-1}(g) \,=\, \{\, g + \alpha\, C_\mu K \mid \alpha \in \mathbb{R} \,\}. $$
We also formally allow K ∈ F* with ‖K‖_μ = 0. In this case we set P_K(f) = f, ∀ f. Then μP_K⁻¹ = μ and μ_K(·|g) is the Dirac measure concentrated on {g}, ∀ g ∈ F.

Lemma 3.5.1 Let ‖K‖_μ > 0. Then for a.e. g the measure μ_K(·|g) is Gaussian with mean m(g) = g and correlation operator
$$ A_K(L) \,=\, \frac{ (L, K)_\mu }{ \|K\|_\mu^2 }\; C_\mu K. $$

Proof Consider the characteristic functional ψ_ω of the measure ω = μP_K⁻¹. We have
$$ \psi_\omega(L) \,=\, \int_{\ker K} \exp\{ i L(g) \}\, \mu P_K^{-1}(dg) \,=\, \int_F \exp\{ i L(P_K f) \}\, \mu(df) \,=\, \exp\Big\{ -\tfrac{1}{2}\, (L P_K)\big( C_\mu(L P_K) \big) \Big\}. $$
Hence the measure ω is a zero-mean Gaussian and its correlation operator is given as
$$ C_\omega(L) \,=\, P_K\big( C_\mu(L P_K) \big) \,=\, C_\mu L \,-\, \frac{ (K, L)_\mu }{ \|K\|_\mu^2 }\; C_\mu K. $$
Now let μ′_K(·|g) be the Gaussian measure with mean g and correlation operator A_K. Then the characteristic functional of the measure μ′ = ∫_{ker K} μ′_K(·|g) μP_K⁻¹(dg) is given as
$$ \psi_{\mu'}(L) \,=\, \int_{\ker K} \int_F \exp\{ i L(f) \}\, \mu'_K(df\,|\,g)\, \mu P_K^{-1}(dg) \,=\, \int_{\ker K} \exp\Big\{ i L(g) - \frac{ (K, L)_\mu^2 }{ 2 \|K\|_\mu^2 } \Big\}\, \mu P_K^{-1}(dg) $$
$$ =\, \exp\Big\{ -\frac{ (K, L)_\mu^2 }{ 2 \|K\|_\mu^2 } \Big\} \int_{\ker K} \exp\{ i L(g) \}\, \mu P_K^{-1}(dg) \,=\, \exp\big\{ -\tfrac{1}{2}\, (L, L)_\mu \big\}. $$
This shows that μ = μ′. Since the conditional distribution is determined uniquely (up to a set of μP_K⁻¹-measure zero), the lemma follows. ∎
Any functional K ∈ F* determines a family of one dimensional subproblems, which is indexed by g ∈ ker K and given as follows. For g ∈ ker K, the subproblem relies on minimizing the average error
$$ e^{ave}\big( \mathbb{I}, \varphi;\; \mu_K(\cdot|g) \big) \,=\, \sqrt{ \int_F \int_{\mathbb{R}^n} | S(f) - \varphi(y) |^2\, \pi_f(dy)\, \mu_K(df\,|\,g) } $$
over all algorithms φ. That is, in the subproblem g we use the additional (exact) information that P_K(f) = g or, in other words, that f lies on the line P_K⁻¹(g).

Since the measures μ_K(·|g) have the same correlation operator, the radius rad^ave(𝕀; μ_K(·|g)) is independent of g (compare with E 3.4.6). We denote this radius by rad^ave_K(𝕀). It is clear that
$$ \mathrm{rad}^{ave}_K(\mathbb{I}) \,\le\, \mathrm{rad}^{ave}(\mathbb{I}). \qquad (3.8) $$
Indeed, we have
$$ \big( \mathrm{rad}^{ave}_K(\mathbb{I}) \big)^2 \,=\, \int_{\ker K} \big( \mathrm{rad}^{ave}(\mathbb{I};\, \mu_K(\cdot|g)) \big)^2\, \mu P_K^{-1}(dg) \,=\, \int_{\ker K} \inf_\varphi \big( e^{ave}(\mathbb{I}, \varphi;\, \mu_K(\cdot|g)) \big)^2\, \mu P_K^{-1}(dg) $$
$$ \le\, \inf_\varphi \int_{\ker K} \big( e^{ave}(\mathbb{I}, \varphi;\, \mu_K(\cdot|g)) \big)^2\, \mu P_K^{-1}(dg) \,=\, \inf_\varphi \big( e^{ave}(\mathbb{I}, \varphi;\, \mu) \big)^2 \,=\, \big( \mathrm{rad}^{ave}(\mathbb{I};\, \mu) \big)^2. $$
We now prove that for a special choice of K we have equality in (3.8).
Theorem 3.5.1
Consider the family of one dimensional subproblems determined by the functional n
K. = S  cpoptN = S  EwjLi, j=1
where (a2E + GN)W = N(C,,S). Then for a.e. g the algorithm coopt(y) = (y, w)2 is optimal not only for the original problem, but also for the subproblem indexed by g E ker K.. Moreover,
radK(III) = radave(11). Proof If IIK.IIµ = 0 then S(f) = cpopt(N(f )), V f E C,(F*). In this case PK = I, the measure /2K. (. g) is concentrated on {g}, and any algorithm with the property cp(N(g)) = S(g) is optimal for the subproblem indexed by g. Hence optimality of opt for all subproblems (a.e.) follows from the fact that PK. (F) = ker K. C C, (F*). We also have
  rad^{ave}(ℕ) = √(S(C_μK*)) = 0 = rad_{K*}^{ave}(ℕ).

Assume now that ‖K*‖_μ > 0. Let ω be the zero mean Gaussian measure with correlation operator A = A_{K*}, where A_{K*} is defined in Lemma 3.5.1. We need to show that the algorithm φ_opt = ⟨·, w⟩₂ is optimal if the average error over f is taken with respect to ω, i.e.,

  rad^{ave}(ℕ, ω) = inf_φ e^{ave}(ℕ, φ; ω) = e^{ave}(ℕ, φ_opt; ω),

where

  (e^{ave}(ℕ, φ; ω))² = ∫_F ∫_{ℝⁿ} ‖S(f) − φ(y)‖² π_f(dy) ω(df).

Owing to Theorem 3.4.2, the optimal algorithm with respect to ω is given by φ_ω(y) = Σ_{j=1}^n zⱼ S(ALⱼ), where (σ²Σ + H_N)z = y, H_N = {Lᵢ(ALⱼ)}ᵢ,ⱼ₌₁ⁿ, and z, y ∈ (σ²Σ + H_N)(ℝⁿ). We have

  Lᵢ(ALⱼ) = (K*, Lᵢ)_μ (K*, Lⱼ)_μ ‖K*‖_μ^{-2}   and   S(ALⱼ) = (K*, S)_μ (K*, Lⱼ)_μ ‖K*‖_μ^{-2}.

Hence, letting a = N(C_μK*), we obtain

  φ_ω(y) = (S, K*)_μ ‖K*‖_μ^{-2} ⟨z, a⟩₂,        (3.9)
3.5 The case of linear functionals
153
where z satisfies

  σ²Σz + ‖K*‖_μ^{-2} ⟨a, z⟩₂ a = y.        (3.10)

Observe now that

  a = N(C_μK*) = N(C_μS) − Σ_{j=1}^n wⱼ N(C_μLⱼ) = N(C_μS) − G_N w = σ²Σw.

This and (3.10) yield

  ⟨y, w⟩₂ = σ²⟨Σz, w⟩₂ + ⟨a, z⟩₂ ⟨a, w⟩₂ ‖K*‖_μ^{-2}
         = ⟨z, a⟩₂ (1 + σ²⟨Σw, w⟩₂ ‖K*‖_μ^{-2}),

so that

  ⟨z, a⟩₂ = ‖K*‖²_μ ⟨y, w⟩₂ / (‖K*‖²_μ + σ²⟨Σw, w⟩₂).        (3.11)
We also have

  (S, K*)_μ = ‖S‖²_μ − ⟨w, N(C_μS)⟩₂
            = (‖S‖²_μ − 2⟨w, N(C_μS)⟩₂ + ⟨G_N w, w⟩₂)
              + (⟨w, (σ²Σ + G_N)w⟩₂ − ⟨G_N w, w⟩₂)
            = ‖K*‖²_μ + σ²⟨Σw, w⟩₂.        (3.12)

Taking (3.9), (3.11) and (3.12) together we finally obtain

  φ_ω(y) = (S, K*)_μ ⟨y, w⟩₂ / (‖K*‖²_μ + σ²⟨Σw, w⟩₂) = ⟨y, w⟩₂,

as claimed.
Now let ω_g be the Gaussian measure with mean g ∈ ker K* and correlation operator A. By E 3.4.6, the optimal algorithm for ω_g is given as

  φ_g(y) = S(g) + ⟨y − N(g), w⟩₂ = K*(g) + ⟨y, w⟩₂ = ⟨y, w⟩₂ = φ_opt(y),

since K*(g) = S(g) − ⟨N(g), w⟩₂ = 0 for g ∈ ker K*. As μ_{K*}(·|g) = ω_g for a.e. g, the algorithm φ_opt is optimal for all subproblems almost everywhere.
To prove the equality rad_{K*}^{ave}(ℕ) = rad^{ave}(ℕ), observe that

  (rad_{K*}^{ave}(ℕ))²
    = ∫_{ker K*} (rad^{ave}(ℕ; μ_{K*}(·|g)))² μP_{K*}^{-1}(dg)
    = ∫_{ker K*} ∫_F ( ∫_{ℝⁿ} ‖S(f) − φ_opt(y)‖² π_f(dy) ) μ_{K*}(df|g) μP_{K*}^{-1}(dg)
    = ∫_F ∫_{ℝⁿ} ‖S(f) − φ_opt(y)‖² π_f(dy) μ(df)
    = (rad^{ave}(ℕ))².

This completes the proof of the theorem.
Theorem 3.5.1 together with (3.8) immediately yields the following corollary.
Corollary 3.5.1  If the solution operator S is a linear functional then for any information ℕ we have

  rad^{ave}(ℕ) = sup_{K ∈ F*} rad_K^{ave}(ℕ) = rad_{K*}^{ave}(ℕ),

where the functional K* is given as K*(f) = S(f) − φ_opt(N(f)), f ∈ F.

That is, the difficulty of the original problem is determined by the difficulty of the hardest one dimensional subproblem, and this subproblem is determined by the functional K*. If S is not a functional then Corollary 3.5.1 is no longer true. For an example, see E 3.5.4.
Notes and remarks NR 3.5.1 This section is based on Plaskota (1994).
Exercises

E 3.5.1  Suppose we want to estimate a real random variable f, where f ~ 𝒩(0, λ) with λ > 0, based on the data y = f + x, where x ~ 𝒩(0, σ²). Show that in this case the radius equals

  σ √λ / √(σ² + λ),

and the optimal algorithm is

  φ_opt(y) = λ/(σ² + λ) · y,   y ∈ ℝ.
E 3.5.2  Consider the problem of the previous exercise but with information y = [y₁, …, yₙ], where yᵢ ~ 𝒩(f, σᵢ²) and σᵢ > 0, 1 ≤ i ≤ n. Show that the radius of information is given as

  r(σ₁, …, σₙ) = ( λ / (1 + λ Σ_{i=1}^n σᵢ^{-2}) )^{1/2}

and

  φ_opt(y) = λ Σ_{i=1}^n σᵢ^{-2} yᵢ / (1 + λ Σ_{i=1}^n σᵢ^{-2}).

In particular, show that n observations of f with variances σᵢ² are as good as one observation of f with the variance σ² = (Σ_{i=1}^n σᵢ^{-2})^{-1}.
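The two exercises above are easy to check numerically. The following sketch (our illustration, not part of the book; all variable names are ours) draws one sample from the model of E 3.5.2 and verifies that its radius coincides with the E 3.5.1 radius of a single observation with the harmonic variance.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                          # prior variance: f ~ N(0, lam)
sig = np.array([0.5, 1.0, 2.0])    # noise std of each observation

f = rng.normal(0.0, np.sqrt(lam))
y = f + rng.normal(0.0, sig)       # y_i = f + x_i,  x_i ~ N(0, sig_i^2)

w = sig ** -2
# optimal (posterior mean) algorithm and radius from E 3.5.2
phi = lam * np.sum(w * y) / (1.0 + lam * np.sum(w))
r = np.sqrt(lam / (1.0 + lam * np.sum(w)))

# one observation with the harmonic variance gives the same radius (E 3.5.1)
sig_eq2 = 1.0 / np.sum(w)
r_one = np.sqrt(lam * sig_eq2 / (lam + sig_eq2))

assert abs(r - r_one) < 1e-12
print(phi, r)
```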
In particular, show that n observations of f with variances vi are as good as one observation of f with the variance v2 = (E vz 2)1 E 3.5.3 Consider the one dimensional linear problem with the correlation operator CC(L) = AL(fo) fo, VL E F*, where A > 0 and fo E F, and information y " v2E). Let go = S(fo) E R and yo = N(fo) E R' . Show that for yo E E(Rn) we have 0,2',
Ig0
o2 + (E1yo, yo)2 1
Wopt(y)
= go a2 +(A (Eyoyo,yo)2'
while for yo V E(Rn) we have radaVe(III) = 0 and cpopt(y) = go
(Pyo,y)2 (Pyo, y0)2
where P is the orthogonal projection in Rn onto
E 3.5.4  Let F = G = ℝᵈ and S be the identity. Let μ be the standard Gaussian measure on ℝᵈ, μ = 𝒩(0, I). Consider information ℕ consisting of n < d noisy observations. Show that (rad^{ave}(ℕ))² ≥ d − n, while for any functional K ∈ F* we have rad_K^{ave}(ℕ) ≤ 1. Hence show that

  (rad^{ave}(ℕ))² ≥ (d − n) · sup_{K ∈ F*} (rad_K^{ave}(ℕ))²,

so that the original problem is at least d − n times as difficult as any one dimensional subproblem.
3.6 Optimal algorithms as smoothing splines

Recall that in the worst case setting we defined an α-smoothing spline algorithm as φ_α(y) = S(s_α(y)). Here s_α(y) is the α-smoothing spline element, which minimizes the functional

  Γ_α(f, y) = α ‖f‖²_F + (1 − α) δ^{-2} ‖y − N(f)‖²_Y,        (3.13)

where ‖·‖_F and ‖·‖_Y are extended Hilbert seminorms. Moreover, if the set of problem elements is the unit ball with respect to ‖·‖_F and the noise is uniformly bounded in ‖·‖_Y by δ, then for appropriately chosen α the algorithm φ_α turns out to be optimal; see Subsection 2.5.2. In this section, we show that in the average case setting optimal algorithms can also be viewed as smoothing spline algorithms.
3.6.1 A general case

We consider the linear problem with Gaussian noise of Section 3.4. That is, the measure μ on F is Gaussian with zero mean and correlation operator C_μ. Information ℕ = {π_f} is linear with Gaussian noise, π_f = 𝒩(N(f), σ²Σ) with σ² > 0. The operator N consists of functionals Lᵢ ∈ F*, N = [L₁, L₂, …, Lₙ].

Let H be the separable Hilbert space such that the pair (H, C̄_μ(F*)) is an abstract Wiener space, see Subsection 3.3.2; here C̄_μ(F*) denotes the closure of C_μ(F*) in F. Recall that C_μ(F*) ⊂ H ⊂ C̄_μ(F*). Let ‖·‖_Y be the extended norm in ℝⁿ defined as

  ‖x‖_Y = ⟨Σ⁻¹x, x⟩₂^{1/2}   for x ∈ Σ(ℝⁿ),   and   ‖x‖_Y = +∞   otherwise.

We denote by s(y) ∈ H the smoothing spline that minimizes

  Γ(f, y) = ‖f‖²_H + σ^{-2} ‖y − N(f)‖²_Y

over all f ∈ H. That is, in the average case setting, Γ corresponds to 2Γ_{1/2} with δ replaced by σ in (3.13). For instance, in the case of independent observations, Σ = diag{σ₁², …, σₙ²} and σ² = 1, s(y) is the minimizer of

  ‖f‖²_H + Σ_{j=1}^n σⱼ^{-2} (yⱼ − Lⱼ(f))²
(with the convention that 0/0 = 0). As usual, the smoothing spline algorithm is given as

  φ_spl(y) = S(s(y)),   y ∈ ℝⁿ.

Let fⱼ = C_μLⱼ ∈ H, 1 ≤ j ≤ n. Then fⱼ is the representer of Lⱼ in H, and for all f ∈ H we have

  N(f) = [ ⟨f₁, f⟩_H, ⟨f₂, f⟩_H, …, ⟨fₙ, f⟩_H ].

Applying Lemma 2.6.1 we immediately obtain that Γ(y) = inf_{f∈H} Γ(f, y) is finite if and only if y ∈ Y₂ = N(H) + Σ(ℝⁿ). For y ∈ Y₂, the smoothing spline is unique and given as

  s(y) = Σ_{j=1}^n zⱼ fⱼ,

where z satisfies (γΣ + G_N)z = y and γ = σ². Comparing this with Theorem 3.4.1, and noting that Y₂ = Y₁ = (σ²Σ + G_N)(ℝⁿ), we obtain that s(y) is the mean element m(y) of the conditional distribution for information y. Hence φ_spl(y) = S(s(y)) = S(m(y)) is the optimal algorithm.
Theorem 3.6.1  In the average case setting, the smoothing spline algorithm φ_spl is optimal.
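In finite dimensions this optimality is easy to illustrate. The sketch below (ours, not from the book; all names are illustrative) takes ‖f‖²_H = fᵀC⁻¹f and ‖x‖²_Y = xᵀΣ⁻¹x, and checks that the dual formula s(y) = Σⱼ zⱼ fⱼ with (σ²Σ + G_N)z = y agrees with the direct minimizer of Γ(f, y).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sig2 = 4, 3, 0.3

A = rng.normal(size=(d, d))
C = A @ A.T + 0.1 * np.eye(d)        # prior covariance C_mu (SPD)
Nmat = rng.normal(size=(n, d))       # functionals L_1..L_n as rows
Sigma = np.diag([1.0, 2.0, 0.5])     # noise covariance factor

y = rng.normal(size=n)               # some observed data vector

# dual form: s(y) = C N^T z with (sig2*Sigma + G_N) z = y, G_N = N C N^T
z = np.linalg.solve(sig2 * Sigma + Nmat @ C @ Nmat.T, y)
s_dual = C @ Nmat.T @ z

# primal form: minimizer of ||f||_H^2 + sig2^{-1} ||y - N f||_Y^2
Si = np.linalg.inv(Sigma)
s_primal = np.linalg.solve(np.linalg.inv(C) + Nmat.T @ Si @ Nmat / sig2,
                           Nmat.T @ Si @ y / sig2)

assert np.allclose(s_dual, s_primal)   # the same smoothing spline
```

The agreement of the two forms is an instance of the Sherman–Morrison–Woodbury identity.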
We stress that, unlike in the worst case, this time we have no problems with the optimal choice of the parameters α and γ. That is, we have α* = 1/2 and γ* = σ².

3.6.2 Special cases

The formulas for α-smoothing splines in some special cases were given in Section 2.6. Clearly, these formulas can be applied to obtain optimal algorithms in the average case. It suffices to set α = 1/2, replace δ² and γ by σ², and replace the norm ‖·‖_F by ‖·‖_H. We now discuss relations between the smoothing spline algorithm and regularized approximation, as well as analyzing the least squares algorithm in the average case setting.

Consider the linear problem of Subsection 3.6.1 with a positive definite matrix Σ. Then ‖·‖_Y is a Hilbert norm. We denote by Y the Hilbert space of vectors in ℝⁿ with inner product ⟨·,·⟩_Y = ⟨Σ⁻¹·,·⟩₂. Let N_H : H → Y be the restriction of N : F → Y to the subspace H ⊂ F, i.e., N_H(f) = N(f), ∀f ∈ H. Let N_H* : Y → H be the adjoint operator to N_H. That is, N_H* is defined by ⟨N_H(f), y⟩_Y = ⟨f, N_H*(y)⟩_H, ∀f ∈ H, ∀y ∈ Y. Similarly, we define the operators S_H : H → G and S_H* : G → H. Recall that the regularized approximation is φ_γ(y) = S(u_γ(y)), where u_γ(y) ∈ H is the solution of the equation

  (γ I_H + N_H* N_H) f = N_H* y.
Here γ > 0 is the regularization parameter and I_H is the identity in H; compare with Subsection 2.6.2. In view of Lemma 2.6.3 and Theorem 3.6.1, we immediately obtain the following result.

Corollary 3.6.1  The regularized solution

  u_γ(y) = (γ I_H + N_H* N_H)^{-1} N_H* y

is the smoothing spline, u_γ(y) = s(y), if and only if the regularization parameter γ = σ². Hence the regularization algorithm φ_{σ²}(y) = S(u_{σ²}(y)) is optimal.
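As a numerical illustration of Corollary 3.6.1 (our sketch, not from the book): for a finite dimensional problem with white noise (Σ = I) and S the identity, the exact average error of the regularization algorithm φ_γ is smallest precisely at γ = σ².

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sig2 = 5, 4, 0.25
A = rng.normal(size=(d, d))
C = A @ A.T                       # prior covariance C_mu
Nmat = rng.normal(size=(n, d))    # information operator, white noise (Sigma = I)

def avg_err2(gamma):
    """Exact average error E||f - u_gamma(y)||^2 for S = identity."""
    M = C @ Nmat.T @ np.linalg.solve(Nmat @ C @ Nmat.T + gamma * np.eye(n),
                                     np.eye(n))
    R = np.eye(d) - M @ Nmat
    # bias part + noise part of the error covariance, traced
    return np.trace(R @ C @ R.T) + sig2 * np.trace(M @ M.T)

gammas = np.linspace(0.01, 1.0, 100)
errs = [avg_err2(g) for g in gammas]
best = gammas[int(np.argmin(errs))]
assert abs(best - sig2) < 0.02     # minimum attained near gamma = sigma^2
```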
Let us now consider the case where the space F is finite dimensional, dim F = d < +∞. Recall that the worst case error is then minimized by the least squares algorithm, provided that the noise level δ is small, see Theorem 2.6.2. We want to see whether a similar result holds in the average case setting.

We assume that the correlation operator C_μ of the a priori Gaussian measure μ is positive definite. Information about f is given as y = N(f) + x ∈ ℝⁿ, where dim N(F) = d and x ~ 𝒩(0, σ²Σ). Recall that the (generalized) least squares algorithm is defined as φ_ls = S(N*N)^{-1}N*, or equivalently, φ_ls = S N^{-1} P_N, where P_N is the orthogonal projection onto N(F) with respect to ⟨·,·⟩_Y. Since, in the average case setting, the optimal value of the regularization parameter is γ = σ², the least squares algorithm is optimal only for exact information (σ = 0), and then its error is zero. It turns out, however, that for small noise levels σ as well, this algorithm is nearly optimal. That is, we have the following theorem.

Theorem 3.6.2  For the (generalized) least squares algorithm φ_ls we have

  e^{ave}(ℕ, φ_ls) = σ √( tr(S(N*N)^{-1}S*) )

and

  e^{ave}(ℕ, φ_ls) = rad^{ave}(ℕ) (1 + O(σ²))   as   σ → 0⁺.
Proof  The formula for e^{ave}(ℕ, φ_ls) follows from the fact that for any f

  ∫_{ℝⁿ} ‖S(f) − φ_ls(N(f) + x)‖² π(dx) = ∫_{ℝⁿ} ‖S N^{-1} P_N(x)‖² π(dx)
    = σ² tr((SN^{-1})(SN^{-1})*) = σ² tr(S(N*N)^{-1}S*).

We now derive a formula for the radius rad^{ave}(ℕ). Let {ξᵢ} be the orthonormal basis in H (which is now equal to F with the inner product induced by C_μ) of eigenelements of N_H*N_H, and let the ηᵢ be the corresponding eigenvalues. Since dim N(F) = dim F, all ηᵢ are positive. Theorem 3.4.2 and the equalities N_H* = C_μN* and S_H* = C_μS* yield

  (rad^{ave}(ℕ))² = tr(SC_μS*) − tr(φ_opt(NC_μS*))
    = tr(SC_μS*) − tr( S(σ²I_H + N_H*N_H)^{-1}N_H*(NC_μS*) )
    = tr(S_H S_H*) − tr( S_H(σ²I_H + N_H*N_H)^{-1}N_H*N_H S_H* )
    = Σ_{i=1}^d ‖S_H ξᵢ‖² − Σ_{i=1}^d ‖S_H((σ²I_H + N_H*N_H)^{-1}N_H*N_H)^{1/2} ξᵢ‖²
    = σ² Σ_{i=1}^d ‖S(ξᵢ)‖² / (σ² + ηᵢ).        (3.14)

Since S(N*N)^{-1}S* = S_H(N_H*N_H)^{-1}S_H*,

  (e^{ave}(ℕ, φ_ls))² = σ² tr( S_H(N_H*N_H)^{-1}S_H* ) = σ² Σ_{i=1}^d ‖S(ξᵢ)‖² / ηᵢ.

Comparing this with (3.14) we have e^{ave}(ℕ, φ_ls) = rad^{ave}(ℕ)(1 + O(σ²)), as claimed.
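Theorem 3.6.2 is easy to check numerically. In the sketch below (ours; we take C_μ = I and Σ = I, so H carries the Euclidean inner product) we evaluate the least squares error and the radius via (3.14) and watch their ratio approach 1 as σ → 0.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 5
Nmat = rng.normal(size=(n, d))     # dim N(F) = d (full column rank a.s.)
S = rng.normal(size=(2, d))        # solution operator S : R^d -> G = R^2

# with C_mu = I and white noise, N_H^* N_H = N^T N;
# eta_i are its eigenvalues, the columns of V its eigenvectors xi_i
eta, V = np.linalg.eigh(Nmat.T @ Nmat)

for sig in (1.0, 0.1, 0.01):
    # error of generalized least squares (Theorem 3.6.2)
    e_ls = sig * np.sqrt(np.trace(S @ np.linalg.inv(Nmat.T @ Nmat) @ S.T))
    # radius of information, formula (3.14)
    rad = np.sqrt(sum(sig**2 * np.linalg.norm(S @ V[:, i])**2 / (sig**2 + eta[i])
                      for i in range(d)))
    print(sig, e_ls / rad)         # ratio tends to 1 as sig -> 0
```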
3.6.3 Relations to worst case setting

The fact that smoothing spline algorithms are optimal in the worst and average case settings enables us to show a relation between the two settings. That is, consider the following two problems:

(WW) Approximate S(f) for f in the unit ball E = { f ∈ H | ‖f‖_H ≤ 1 }, based on information y = N(f) + x ∈ ℝⁿ, where x ∈ Σ(ℝⁿ) and ‖x‖_Y = ⟨Σ⁻¹x, x⟩₂^{1/2} ≤ δ.

(AA) Approximate S(f) for f distributed according to the Gaussian measure μ, based on information y = N(f) + x ∈ ℝⁿ, where x ~ 𝒩(0, σ²Σ).

Assume that δ = σ > 0. Let {fᵢ} ⊂ H be the complete and H-orthonormal basis consisting of eigenelements of the operator N_H*N_H : H → H, i.e., N_H*N_H fᵢ = ηᵢ fᵢ, i ≥ 1. Owing to NR 3.6.3 we have

  (e^{ave}(ℕ, φ_spl))² = (rad^{ave}(ℕ))² = σ² Σ_{i=1}^{dim H} ‖S(fᵢ)‖² / (σ² + ηᵢ).        (3.15)

On the other hand, in view of Lemma 2.5.1 and Theorem 2.6.2 we have

  (e^{wor}(N, φ_spl))² ≤ 2 δ² ‖S(δ²I_H + N_H*N_H)^{-1/2}‖² ≤ 2 (rad^{wor}(N))².

Note that any operator T : H → G satisfies ‖T‖² ≤ tr(T*T), with equality if T is a functional. This and (3.15) yield

  (e^{wor}(N, φ_spl))² ≤ 2 δ² Σ_{i=1}^{dim H} ‖S(δ²I_H + N_H*N_H)^{-1/2} fᵢ‖²
    = 2 δ² Σ_{i=1}^{dim H} ‖S(fᵢ)‖² / (δ² + ηᵢ) = 2 (rad^{ave}(ℕ))²,

which proves rad^{wor}(N) ≤ √2 rad^{ave}(ℕ); this is the content of Theorem 3.6.3. If S is a functional then, in addition, rad^{ave}(ℕ) = e^{wor}(N, φ_spl).
E 3.6.2  Show that the ratio rad^{wor}(N)/rad^{ave}(ℕ), for δ = σ > 0, assumes all values from the interval [1, √2].
E 3.6.3  Suppose that the solution operator S in Theorem 3.6.3 is finite dimensional, i.e., dim S(F) = d < +∞. Show that then

  rad^{ave}(ℕ) ≤ √d · rad^{wor}(N).

3.7 Varying information

In this section we allow observations with different precisions. Nonadaptive information is determined by exact information N = [L₁, …, Lₙ], Lᵢ ∈ F*, and a diagonal matrix Σ = diag{σ₁², …, σₙ²} with σᵢ² ≥ 0, 1 ≤ i ≤ n. Using N and Σ we obtain information y = [y₁, …, yₙ], where yᵢ = Lᵢ(f) + xᵢ and the xᵢs are independent random variables with xᵢ ~ 𝒩(0, σᵢ²). That is, ℕ = {π_f} is identified with the pair {N, Σ},

  ℕ = {N, Σ},

and given by

  π_f = 𝒩(N(f), Σ).

We now define adaptive information. As in the worst case, we assume that the set Y of possible information values satisfies the condition:
  for any (y₁, y₂, …) ∈ ℝ^∞ there exists exactly one index n such that (y₁, y₂, …, yₙ) ∈ Y.

(Recall that we also assume measurability of the sets Yₙ = Y ∩ ℝⁿ.) An adaptive information distribution ℕ is determined by a family N = {N_y}_{y∈Y} of exact nonadaptive information,

  N_y = [ L₁(·), L₂(·; y₁), …, L_{n(y)}(·; y₁, …, y_{n(y)−1}) ],

where Lᵢ(·; y₁, …, y_{i−1}) ∈ Λ, 1 ≤ i ≤ n(y), and n(y) is the length of y, y = (y₁, …, y_{n(y)}), and by a family Σ = {Σ_y}_{y∈Y} of diagonal matrices,

  Σ_y = diag{ σ₁², σ₂²(y₁), …, σ²_{n(y)}(y₁, …, y_{n(y)−1}) }.

To complete the definition, we have to specify the distributions π_f on Y, for f ∈ F. These are defined as follows. We let W₁ = ℝ, and

  W_m = { (y₁, …, y_m) ∈ ℝ^m | (y₁, …, yⱼ) ∉ Y,  1 ≤ j ≤ m−1 }

for m ≥ 2. (In words, y ∈ W_m iff y ∈ ℝ^m and y can be extended to a sequence belonging to Y.) Note that the W_m are measurable. Indeed, so is W₁, and for arbitrary m ≥ 2 we have W_m = (W_{m−1} \ Y_{m−1}) × ℝ. Observe also that W_{m−1} is the domain of L_m(f; ·) and σ_m²(·).
Assuming that the maps Lᵢ(f; ·) : W_{i−1} → ℝ and σᵢ²(·) : W_{i−1} → ℝ₊ are Borel measurable, the distribution w_{m,f} on W_m is defined as follows. Let 𝒩(·|t, σ²) be the one dimensional Gaussian measure with mean t ∈ ℝ and variance σ² ≥ 0. Then

  w_{1,f}(B) = 𝒩(B | L₁(f), σ₁²),
  w_{m+1,f}(B) = ∫_{W_m \ Y_m} 𝒩(B^{(t)} | L_{m+1}(f; t), σ²_{m+1}(t)) w_{m,f}(dt),

where t ∈ ℝ^m and B^{(t)} = { u ∈ ℝ | (t, u) ∈ B }. That is, the conditional distribution of y_{m+1} given (y₁, …, y_m) is Gaussian with mean L_{m+1}(f; y₁, …, y_m) and variance σ²_{m+1}(y₁, …, y_m). The distribution π_f is now given as

  π_f(·) = Σ_{m=1}^∞ w_{m,f}(· ∩ Y_m).
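To make the construction concrete, here is a small simulation sketch (entirely ours; the rules `next_functional`, `next_variance` and the termination set are hypothetical) generating one realization y ∈ Y of adaptive information: each observation uses a functional and a variance that may depend on all previously observed values.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy adaptive information for f in R^2: observe <l_i, f> + noise, where both
# the functional and the noise variance of step i depend on y_1..y_{i-1}
f = np.array([1.0, -0.5])

def next_functional(y_hist):
    # hypothetical adaptive rule: alternate coordinates, tilt by last value
    l = np.zeros(2)
    l[len(y_hist) % 2] = 1.0
    if y_hist:
        l += 0.1 * np.tanh(y_hist[-1]) * np.ones(2)
    return l

def next_variance(y_hist):
    # hypothetical rule: observe more accurately after large values
    return 1.0 / (1.0 + sum(abs(v) for v in y_hist))

def terminated(y_hist):
    # termination set Y: here simply all sequences of length 5
    return len(y_hist) == 5

y = []
while not terminated(y):
    l = next_functional(y)
    var = next_variance(y)
    y.append(float(l @ f + rng.normal(0.0, np.sqrt(var))))

print(y)   # one realization of adaptive information about f
```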
Lemma 3.7.1  π_f is a well defined probability measure on Y.

Proof  The σ-field on Y is generated by the cylindrical sets of the form

  B = ( ⋃_{i=1}^{m−1} Bᵢ ) ∪ { y ∈ Y | y^m ∈ A_m },

where the Bᵢ are Borel sets of Yᵢ, A_m is a Borel set of W_m, and y^m is the sequence consisting of the first m components of y, m ≥ 1. For any such set, we let

  π̃_f(B) = Σ_{i=1}^{m−1} w_{i,f}(Bᵢ) + w_{m,f}(A_m).

Observe that π̃_f(B) is well defined since it does not depend on the representation of B. Indeed, a representation of the same set B with m replaced by m+1 is given as

  B = ( (⋃_{i=1}^{m−1} Bᵢ) ∪ (A_m ∩ Y_m) ) ∪ { y ∈ Y | y^{m+1} ∈ (A_m \ Y_m) × ℝ }.

Then

  Σ_{i=1}^{m−1} w_{i,f}(Bᵢ) + w_{m,f}(A_m ∩ Y_m) + w_{m+1,f}((A_m \ Y_m) × ℝ)
3.7 Varying information
165
    = Σ_{i=1}^{m−1} w_{i,f}(Bᵢ) + w_{m,f}(A_m ∩ Y_m) + w_{m,f}(A_m \ Y_m)
    = Σ_{i=1}^{m−1} w_{i,f}(Bᵢ) + w_{m,f}(A_m).

π̃_f is an additive measure defined on the cylindrical sets. Hence it can be uniquely extended to a σ-additive measure defined on the Borel sets of Y. Since π̃_f(Y) = w_{1,f}(W₁) = 1, this is a probability measure. Now, for any B = ⋃_{i=1}^∞ Bᵢ where Bᵢ = B ∩ Yᵢ, we have

  π̃_f(B) = lim_{m→∞} π̃_f( ⋃_{i=1}^m Bᵢ ) = lim_{m→∞} Σ_{i=1}^m w_{i,f}(Bᵢ) = π_f(B).

Thus π_f = π̃_f and π_f is well defined.
Clearly, nonadaptive information can also be treated as adaptive information. Then Y = ℝⁿ and the maps Lᵢ(·; y₁, …, y_{i−1}) = Lᵢ and σᵢ²(y₁, …, y_{i−1}) = σᵢ² are independent of y₁, …, y_{i−1} ∈ ℝ^{i−1}, 1 ≤ i ≤ n.
3.7.2 Adaption versus nonadaption

In the worst case setting, for adaptive information N it is often possible to select y* ∈ Y in such a way that the radius of the nonadaptive information N_{y*} is not much larger than the radius of N, see Theorem 2.7.1. Our aim now is to show a corresponding result in the average case setting for linear problems with Gaussian measures. That is, we assume that F is a separable Banach space, G is a separable Hilbert space, and the solution operator S : F → G is continuous and linear. The measure μ is zero mean and Gaussian with correlation operator C_μ : F* → F.

Let ℕ = {ℕ_y}_{y∈Y} be arbitrary adaptive information. Recall that for y = (y₁, …, y_{n(y)}) ∈ Y the nonadaptive information ℕ_y = {N_y, Σ_y} is given by N_y = [L_{1,y}, L_{2,y}, …, L_{n(y),y}] and

  Σ_y = diag{ σ²_{1,y}, σ²_{2,y}, …, σ²_{n(y),y} },

where, for brevity, L_{i,y} = Lᵢ(·; y₁, …, y_{i−1}) and σ²_{i,y} = σᵢ²(y₁, …, y_{i−1}), 1 ≤ i ≤ n(y). Recall also that μ₁ denotes the a priori distribution of information y in Y, i.e.,

  μ₁(·) = ∫_F π_f(·) μ(df),

and μ₁ is in general not Gaussian even when Y = ℝⁿ. For any f ∈ F, the measure π_f is supported on Y_{1,f} = { y ∈ Y | y ∈ N_y(f) + Σ_y(ℝ^{n(y)}) }. Hence μ₁ is supported on

  Y₁ = { y ∈ Y | y ∈ N_y(F₁) + Σ_y(ℝ^{n(y)}) },

where F₁ = C̄_μ(F*) = supp μ, or equivalently, Y₁ = { y ∈ Y | y ∈ (Σ_y + G_{N_y})(ℝ^{n(y)}) }.
We need a theorem about the conditional distribution of a Gaussian measure with respect to adaptive information.

Theorem 3.7.1  For adaptive information ℕ = {N_y, Σ_y}_{y∈Y} the conditional distribution μ₂(·|y), y ∈ Y₁, is Gaussian. Its mean element is

  m(y) = Σ_{j=1}^{n(y)} zⱼ (C_μ L_{j,y}),

where z is the solution of (Σ_y + G_{N_y})z = y, Σ_y = diag{σ²_{1,y}, …, σ²_{n(y),y}}, G_{N_y} = {(L_{i,y}, L_{j,y})_μ}_{i,j=1}^{n(y)}, and y = [y₁, …, y_{n(y)}]. The correlation operator of μ₂(·|y) is given as

  C_{μ₂,y}(L) = C_μ(L) − m(N_y(C_μL)),   L ∈ F*.
Proof  We first give a proof for adaptive information ℕ with n(y) ≡ n, i.e., Y = ℝⁿ. Let F̃ = F × ℝⁿ and μ̃ be the joint probability on F̃,

  μ̃(B) = ∫_F π_f(B^{(f)}) μ(df),

where B^{(f)} = { y ∈ Y | (f, y) ∈ B }. Let χ_B be the characteristic function of B, i.e., χ_B(f, y) = 1 for (f, y) ∈ B and χ_B(f, y) = 0 otherwise. We denote by μ₁(·|ℕ₁) the a priori distribution of information values with respect to (adaptive or nonadaptive) information ℕ₁, and by μ₂(·|y, ℕ₁) the conditional distribution on F given y.

In view of Theorem 3.4.1, we only need to show that μ₂(·|y, ℕ) = μ₂(·|y, ℕ_y). To this end, we shall use induction on n.

If n = 1 then any adaptive information is also nonadaptive and the proof is obvious. Suppose that n > 1. Let ℕ^{n−1} be the adaptive information consisting of noisy evaluations of the first (n−1) functionals of ℕ. For y ∈ ℝⁿ, we write y = (y^{n−1}, yₙ), where y^{n−1} ∈ ℝ^{n−1} and yₙ ∈ ℝ. Then we have

  μ̃(B) = ∫_F ∫_{ℝⁿ} χ_B(f, y) π_f(dy) μ(df)
    = ∫_F ∫_{ℝ^{n−1}} ∫_ℝ χ_B(f, y) 𝒩(dyₙ | L_{n,y^{n−1}}(f), σ²_{n,y^{n−1}}) w_{n−1,f}(dy^{n−1}) μ(df)
    = ∫_{ℝ^{n−1}} ∫_F ∫_ℝ χ_B(f, y) 𝒩(dyₙ | L_{n,y^{n−1}}(f), σ²_{n,y^{n−1}})
        μ₂(df | y^{n−1}, ℕ^{n−1}) μ₁(dy^{n−1} | ℕ^{n−1}).

Using the inductive assumption and Theorem 3.4.1, μ₂(·|y^{n−1}, ℕ^{n−1}) can be interpreted as the conditional distribution on F with respect to the nonadaptive information ℕ^{n−1}_{y^{n−1}}. Hence, denoting by ρ the distribution of yₙ given y^{n−1}, and using the decomposition of μ₂(·|y^{n−1}, ℕ^{n−1}) with respect to yₙ, we have

  ∫_F ∫_ℝ χ_B(f, y) 𝒩(dyₙ | L_{n,y^{n−1}}(f), σ²_{n,y^{n−1}}) μ₂(df | y^{n−1}, ℕ^{n−1}) = ∫_ℝ h(y) ρ(dyₙ),

where h(y) = ∫_F χ_B(f, y) μ₂(df | y, ℕ_y). As a consequence, we obtain

  μ̃(B) = ∫_{ℝ^{n−1}} ∫_ℝ h(y) ρ(dyₙ) μ₁(dy^{n−1} | ℕ^{n−1})
        = ∫_{ℝⁿ} h(y) μ₁(dy)
        = ∫_{ℝⁿ} ∫_F χ_B(f, y) μ₂(df | y, ℕ_y) μ₁(dy).

On the other hand, μ̃(B) = ∫_{ℝⁿ} ∫_F χ_B(f, y) μ₂(df | y, ℕ) μ₁(dy). Thus μ₂(·|y, ℕ) = μ₂(·|y, ℕ_y) for a.e. y, as claimed.

In the general case (Y ⊂ ℝ^∞), we have

  μ̃(B) = ∫_F ∫_Y χ_B(f, y) π_f(dy) μ(df)
        = Σ_{m=1}^∞ ∫_F ∫_{Y_m} χ_B(f, y) w_{m,f}(dy) μ(df)
        = Σ_{m=1}^∞ ∫_{Y_m} ∫_F χ_B(f, y) μ₂(df | y, ℕ_y) μ₁(dy).

The proof is complete.
Thus the conditional distribution μ₂(·|y) with respect to adaptive information ℕ is equal to the conditional distribution with respect to the nonadaptive information ℕ_y and the same y. This should be intuitively clear. Indeed, in both cases the information y is obtained by using the same functionals L_{i,y} and precisions σ²_{i,y}.

Theorem 3.7.1 gives almost immediately the following result corresponding to Theorem 2.7.1 of the worst case setting.

Theorem 3.7.2  For any adaptive information ℕ = {ℕ_y}_{y∈Y} there exists y* ∈ Y such that for the nonadaptive information ℕ_{y*} we have

  rad^{ave}(ℕ_{y*}) ≤ rad^{ave}(ℕ).

Proof  There exists y* ∈ Y₁ such that

  (rad^{ave}(ℕ))² = ∫_Y ( r(μ₂(·|y)) )² μ₁(dy) ≥ ( r(μ₂(·|y*)) )²,        (3.16)

where, as in Section 3.2, r(·) denotes the radius of a measure. By Theorem 3.4.1, the measures μ₂(·|y, ℕ_{y*}) have the same correlation operator for a.e. y. This, Theorem 3.7.1, and (3.16) yield

  (rad^{ave}(ℕ_{y*}))² = ∫ ( r(μ₂(·|y, ℕ_{y*})) )² μ₁(dy | ℕ_{y*})
    = ( r(μ₂(·|y*, ℕ_{y*})) )² = ( r(μ₂(·|y*)) )² ≤ (rad^{ave}(ℕ))²,

as claimed.
Observe that Theorem 3.7.2 does not say anything about the construction of y*. Actually, y* can assume arbitrary values, see E 3.7.2. Thus the situation differs from that in the worst case, in which for linear problems with uniformly bounded noise we have y* = 0 and rad^{wor}(N₀) ≤ rad^{wor}(ℕ).

3.8 Optimal nonadaptive information

In this section we study the minimal radius of nonadaptive information consisting of n noisy observations. Let ξ₁, ξ₂, … be a complete orthonormal system in G of eigenelements of the operator C = S_H S_H* = SC_μS* : G → G, and let λ₁ ≥ λ₂ ≥ λ₃ ≥ … ≥ 0 be the corresponding eigenvalues, C ξᵢ = λᵢ ξᵢ. We may assume that the sequence {λᵢ} is infinite by letting, if necessary, λᵢ = 0 for i > dim G. For λᵢ > 0, define the functionals

  Kᵢ = λᵢ^{-1/2} ⟨S(·), ξᵢ⟩.

Clearly, these functionals are μ-orthonormal:
  (Kᵢ, Kⱼ)_μ = (λᵢλⱼ)^{-1/2} (S_H*ξᵢ, S_H*ξⱼ)_H = (λᵢλⱼ)^{-1/2} ⟨S_H S_H* ξᵢ, ξⱼ⟩ = δᵢⱼ.

For λᵢ = 0 we set Kᵢ = 0. Since C = S_H S_H*, the λᵢs are also eigenvalues of the compact operator S_H* S_H : H → H, and the corresponding H-orthonormal eigenelements are C_μKᵢ ∈ H, i ≥ 1.

Recall that in the worst case setting the problem of optimal information was related to a certain minimization problem. The corresponding problem in the average case is as follows:

Problem (MP)
Minimize

  Ω(η_{no+1}, …, ηₙ) = Σ_{i=no+1}^n λᵢ / (1 + ηᵢ)        (3.17)

over all η_{no+1} ≥ … ≥ ηₙ ≥ 0 satisfying

  Σ_{i=r}^n ηᵢ ≤ Σ_{i=r}^n σᵢ^{-2},   no+1 ≤ r ≤ n,   and   Σ_{i=no+1}^n ηᵢ = Σ_{i=no+1}^n σᵢ^{-2}.
Theorem 3.8.1  Let η* = (η*_{no+1}, …, η*ₙ) with η*_{no+1} ≥ … ≥ η*_m > 0 = η*_{m+1} = … = η*ₙ be the solution of the problem (MP). Then

  r_n^{ave}(Σ) = ( Ω(η*_{no+1}, …, η*ₙ) + Σ_{j=n+1}^∞ λⱼ )^{1/2},

and this radius is attained by the information N_Σ constructed below.

Proof  Consider first the case no = 0, and let ℕ = {N, Σ} with N = [L₁, …, Lₙ] ∈ 𝒩ₙ. We know from NR 3.4.3 that the radius of ℕ = {N, Σ} is equal to the radius of the information ℕ̃ = {Ñ, Σ̃}, where Ñ consists of m μ-orthonormal functionals

  L̃ᵢ = ηᵢ^{-1/2} Σ_{j=1}^n (qⱼ^{(i)}/σⱼ) Lⱼ

and Σ̃ = diag{η₁^{-1}, …, η_m^{-1}}. Here η₁ ≥ … ≥ η_m > 0 are the nonzero eigenvalues of the matrix Gₙ = { (Lᵢ/σᵢ, Lⱼ/σⱼ)_μ }ᵢ,ⱼ₌₁ⁿ and q^{(i)} are the corresponding orthonormal eigenvectors. The estimate

  Σ_{i=1}^m ηᵢ/(1 + ηᵢ) ‖S(C_μL̃ᵢ)‖² ≤ Σ_{i=1}^m ηᵢ/(1 + ηᵢ) λᵢ

(compare with E 3.8.1) then implies the following lower bound on rad^{ave}(N, Σ):

  (rad^{ave}(N, Σ))² ≥ Σ_{i=1}^∞ λᵢ − Σ_{j=1}^n ηⱼ/(1 + ηⱼ) λⱼ = Ω(η₁, …, ηₙ) + Σ_{j=n+1}^∞ λⱼ.

Observe that for all 1 ≤ r ≤ n we also have (see again NR 2.8.6)

  Σ_{i=r}^n ηᵢ ≤ Σ_{i=r}^n ⟨Gₙeᵢ, eᵢ⟩₂ = Σ_{i=r}^n σᵢ^{-2} ‖Lᵢ‖²_μ ≤ Σ_{i=r}^n σᵢ^{-2},

so that (η₁, …, ηₙ) is feasible for the problem (MP) and Ω(η₁, …, ηₙ) ≥ Ω(η₁*, …, ηₙ*). Consequently,

  (rad^{ave}(N, Σ))² ≥ Ω(η₁*, …, ηₙ*) + Σ_{i=n+1}^∞ λᵢ.        (3.19)
We now show that rad^{ave}(N_Σ, Σ) is equal to the right hand side of (3.19). Let W = {w_{is}} be the n × n matrix, given by Lemma 2.8.1, whose columns w^{(i)} are orthogonal with ‖w^{(i)}‖² = ηᵢ* and whose rows satisfy Σ_{s=1}^n w_{is}² = σᵢ^{-2}, and define N_Σ = [L₁^Σ, …, Lₙ^Σ] with

  Lᵢ^Σ = σᵢ Σ_{s=1}^n w_{is} K_s.

Indeed, since

  (Lᵢ^Σ, Lⱼ^Σ)_μ = ( σᵢ Σ_{s=1}^n w_{is}K_s, σⱼ Σ_{t=1}^n w_{jt}K_t )_μ = σᵢσⱼ Σ_{s=1}^n w_{is}w_{js},        (3.20)

the corresponding matrix Gₙ equals WW*, the eigenvectors q^{(i)} of Gₙ are the normalized columns w^{(i)} of W, and Gₙ w^{(i)} = ηᵢ* w^{(i)}, 1 ≤ i ≤ n. Furthermore, for 1 ≤ i ≤ m, the functionals L̃ᵢ are equal to

  L̃ᵢ = ηᵢ^{*-1/2} Σ_{s=1}^n (q_s^{(i)}/σ_s) L_s^Σ
      = ηᵢ^{*-1} Σ_{s=1}^n w_{si} Σ_{j=1}^n w_{sj} Kⱼ
      = ηᵢ^{*-1} Σ_{j=1}^n Kⱼ Σ_{s=1}^n w_{si} w_{sj}
      = ηᵢ^{*-1} ηᵢ* Kᵢ = Kᵢ.

Hence

  (rad^{ave}(N_Σ, Σ))² = Σ_{i=1}^∞ λᵢ − Σ_{i=1}^n ηᵢ*/(1 + ηᵢ*) ‖S(C_μL̃ᵢ)‖²
    = Σ_{i=1}^∞ λᵢ − Σ_{j=1}^n ηⱼ*/(1 + ηⱼ*) λⱼ = Ω(η₁*, …, ηₙ*) + Σ_{j=n+1}^∞ λⱼ.

Since, in view of (3.20), we also have ‖Lᵢ^Σ‖²_μ = σᵢ² Σ_{s=1}^n w_{is}² = 1, the information N_Σ is in 𝒩ₙ. This completes the proof for the case no = 0.

Suppose now that no ≥ 1. Then the a posteriori Gaussian measure on F with respect to exact information N^{(0)} = [L₁, …, L_{no}] has the correlation operator C_μ^{N^{(0)}} = C_μ(I − P_{N^{(0)}}), where P_{N^{(0)}} : F* → F* is the μ-orthogonal projection onto span{L₁, …, L_{no}}. For the dominating eigenvalues λ̃ᵢ of SC_μ^{N^{(0)}}S* (which is the correlation operator of the a posteriori measure on G with respect to N^{(0)}) we have λ̃ᵢ ≥ λ_{no+i}, ∀i ≥ 1. Moreover, if N^{(0)} = N₀ = [K₁, …, K_{no}] then λ̃ᵢ = λ_{no+i}, and the corresponding eigenelements are ξ̃ᵢ = ξ_{no+i}, ∀i ≥ 1. Hence we obtain the desired result by reducing our problem to that of finding optimal N^{(1)} = [L_{no+1}, …, Lₙ] for the correlation matrix Σ^{(1)} = diag{σ²_{no+1}, …, σₙ²} and the Gaussian a priori distribution on F with correlation operator C_μ(I − P_{N₀}).

We now give an explicit formula for the solution of the minimization problem (MP), as well as for the minimal radius r_n^{ave}(Σ). For no ≤ q < r ≤ n, define the following auxiliary minimization problem.
Problem P(q, r)  Minimize

  Ω_{qr}(η_{q+1}, …, η_r) = Σ_{j=q+1}^r λⱼ / (1 + ηⱼ)

over all η_{q+1} ≥ … ≥ η_r ≥ 0 satisfying Σ_{i=q+1}^r ηᵢ = Σ_{i=q+1}^r σᵢ^{-2}.

The solution η* = (η*_{q+1}, …, η*_r) of P(q, r) is as follows. Let k = k(q, r) be the largest integer satisfying q+1 ≤ k ≤ r and

  Σ_{j=q+1}^k λⱼ^{1/2} ≤ λ_k^{1/2} ( Σ_{j=q+1}^r σⱼ^{-2} + (k − q) ).

Then

  ηᵢ* = λᵢ^{1/2} ( Σ_{j=q+1}^r σⱼ^{-2} + (k − q) ) / Σ_{j=q+1}^k λⱼ^{1/2} − 1,   q+1 ≤ i ≤ k,        (3.21)

and ηᵢ* = 0 for k+1 ≤ i ≤ r. Furthermore,

  Ω_{qr}(η*) = ( Σ_{j=q+1}^k λⱼ^{1/2} )² / ( Σ_{j=q+1}^r σⱼ^{-2} + (k − q) ) + Σ_{j=k+1}^r λⱼ.        (3.22)
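The solution (3.21)–(3.22) is a water-filling formula and can be evaluated directly. The following sketch (ours, not from the book; written for q = 0 to keep indices simple) computes η* and verifies the closed-form minimum against the objective.

```python
import numpy as np

def solve_P(lam, T):
    """Solution of P(q, r), shifted so that q = 0 and r = len(lam).

    lam : decreasing positive eigenvalues lambda_{q+1..r}
    T   : sum of sigma_j^{-2} over j = q+1..r
    Returns eta* and the minimal value Omega_qr(eta*), cf. (3.21)-(3.22).
    """
    lam = np.asarray(lam, dtype=float)
    r = len(lam)
    s = np.sqrt(lam)
    # largest k with  sum_{j<=k} sqrt(lam_j) <= sqrt(lam_k) * (T + k)
    k = max(j + 1 for j in range(r) if s[: j + 1].sum() <= s[j] * (T + j + 1))
    c = (T + k) / s[:k].sum()
    eta = np.concatenate([c * s[:k] - 1.0, np.zeros(r - k)])
    val = s[:k].sum() ** 2 / (T + k) + lam[k:].sum()
    return eta, val

lam = np.array([4.0, 1.0, 0.25, 0.01])
eta, val = solve_P(lam, T=3.0)
assert abs(eta.sum() - 3.0) < 1e-12                 # constraint holds
assert abs(val - np.sum(lam / (1.0 + eta))) < 1e-12 # (3.22) matches (3.17)
```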
We shall say that the solution η* = (η*_{q+1}, …, η*_r) of P(q, r) is acceptable if

  Σ_{j=s}^r η*ⱼ ≤ Σ_{j=s}^r σⱼ^{-2}   for q+1 ≤ s ≤ r.

The minimal radius r_n^{ave}(Σ) is then obtained by partitioning {no+1, …, n} into consecutive blocks on which the solutions of the corresponding problems P(nᵢ, n_{i+1}) are acceptable. In particular, if all observations are performed with the same variance σ² and only finitely many eigenvalues are positive, then for sufficiently large n

  r_n^{ave}(σ²) = σ Σ_{i=1}^m λᵢ^{1/2} / (n + σ²m)^{1/2},

where m is the largest integer such that λ_m > 0. For fixed σ and n → +∞, the radius converges to zero, but not faster than σ/√n.

Suppose that the eigenvalues λⱼ satisfy

  λⱼ ≍ ( lnˢ j / j )ᵖ   as   j → +∞,        (3.28)

where p > 1 and s > 0. After some calculations we find that for σ > 0

  r_n^{ave}(σ²) ≍ σ ( ln^{sp} n / n^{p−1} )^{1/2}   for 1 < p < 2,
  r_n^{ave}(σ²) ≍ σ ( ln^{2(s+1)} n / n )^{1/2}     for p = 2,
  r_n^{ave}(σ²) ≍ σ ( 1/n )^{1/2}                   for p > 2,

where the constants in the ≍ notation do not depend on σ. For exact information we have

  r_n^{ave}(0) ≍ ( ln^{sp} n / n^{p−1} )^{1/2}.

Hence for 1 < p < 2 the radius of noisy information behaves as r_n^{ave}(0), while for p > 2 it achieves the best possible rate of convergence, which is σ/√n. Note that the eigenvalues (3.28) correspond to multivariate function approximation in 𝓛₂((0,1)^d) with respect to the Wiener sheet measure, see NR 3.8.4.

Relations to worst case setting

We now discuss the relations mentioned previously between optimal information in the average case and in the worst case setting of Subsection 2.8.1.
Consider the pair of problems defined as in Subsection 3.6.3. That is, assume that {H, F} is the abstract Wiener space for a zero mean Gaussian measure μ. We approximate values S(f) of a continuous linear operator S : F → G, based on noisy information y = N(f) + x, where N = [L₁, …, Lₙ] and the functionals satisfy ‖Lᵢ‖_μ = ‖(Lᵢ)_H‖_H ≤ 1. In view of Theorem 3.6.3, for fixed N the optimal algorithms in the worst and average case settings are (almost) the same. We now want to see whether a similar result holds with respect to optimal information. More precisely, suppose we want to choose information N ∈ 𝒩ₙ and an algorithm φ so as to minimize

(WW) the worst case error e^{wor}(N, φ) for the set E = { f ∈ H | ‖f‖_H ≤ 1 } and noise ‖x‖_Y = ⟨Σ⁻¹x, x⟩₂^{1/2} ≤ δ,

(AA) the average case error e^{ave}(ℕ, φ) for the Gaussian measure μ and noise x ~ 𝒩(0, σ²Σ), where Σ is an n × n diagonal matrix.

Observe first that in both problems the minimal radii are determined by the n dominating eigenvalues of the operator S_H S_H* = SC_μS*, and the optimal functionals are linear combinations of the functionals Kᵢ, 1 ≤ i ≤ n. Furthermore, in the case Σ = I and δ = σ > 0, the radii r_n^{wor}(δ) and r_n^{ave}(σ²) decrease to zero as n → +∞; however, their convergence may be different and is not faster than n^{-1/2}.

It turns out that an even stronger correspondence between the two problems holds, similar to that of Theorem 3.6.3. That is, there exists information N* which is almost optimal in both the worst and average case settings. To show this, assume additionally that δ = σ and that 0 = γ₁ = … = γ_{no} < γ_{no+1} ≤ … ≤ γₙ are the diagonal elements of the matrix Σ. Let η^w_{no+1} ≥ … ≥ η^w_n ≥ 0 be the solution of the minimization problem (MP) of Subsection 2.8.1 with δᵢ = γᵢ, and let η^a_{no+1} ≥ … ≥ η^a_n ≥ 0 be the solution of the minimization problem (MP) of the present section with σᵢ = γᵢ. Next, let the information N* be given as in Theorem 3.8.1 (or in Theorem 2.8.1) with σᵢ = γᵢ/√2 (or δᵢ = γᵢ/√2) and ηᵢ = (η^w_i + η^a_i)/2, no+1 ≤ i ≤ n. Observe that N* is well defined since for any no+1 ≤ r ≤ n we have

  Σ_{i=r}^n ηᵢ = Σ_{i=r}^n (η^w_i + η^a_i)/2 ≤ Σ_{i=r}^n 2γᵢ^{-2},

and the assumptions of Lemma 2.8.1 are satisfied. Also, N* ∈ 𝒩ₙ. We have the following theorem.

Theorem 3.8.3  For the information N* and the spline algorithm φ_spl we have

  e^{wor}({N*, Δ}, φ_spl) ≤ 2 r_n^{wor}(Δ)   and   e^{ave}({N*, Σ}, φ_spl) ≤ 2 r_n^{ave}(Σ),

where Δ = diag{δ₁², …, δₙ²} and Σ = diag{σ₁², …, σₙ²}. Hence, for the eigenvalues (3.28) and δ = σ > 0, we obtain

  r_n^{ave}(σ) / r_n^{wor}(δ) ≍ n^{1−p/2} ln^{sp/2} n   for 1 < p < 2,
  r_n^{ave}(σ) / r_n^{wor}(δ) ≍ ln^{s+1} n              for p = 2,
  r_n^{ave}(σ) / r_n^{wor}(δ) ≍ 1                       for p > 2,

as n → +∞. For exact information we have r_n^{ave}(0)/r_n^{wor}(0) ≍ √n.
3.8.2 Approximation and integration on the Wiener space

In this subsection, we study optimal information for approximation and integration of continuous scalar functions, based on noisy observations of function values at n points. More precisely, we let F be the space of functions defined as

  F = C⁰ = { f : [0,1] → ℝ | f continuous, f(0) = 0 },

equipped with the classical Wiener measure w. Recall that w is uniquely determined by its mean element zero and covariance kernel

  R(s, t) = ∫_{C⁰} f(s) f(t) w(df) = min{s, t},   0 ≤ s, t ≤ 1.

The solution operator corresponding to function approximation is given as

  App : C⁰ → 𝓛₂(0,1),   App(f) = f,   ∀f ∈ C⁰,
while the integration operator is defined as

  Int : C⁰ → ℝ,   Int(f) = ∫₀¹ f(x) dx,   ∀f ∈ C⁰.

We assume that Λ = Λ^std. That is, information about f ∈ C⁰ is supplied by n noisy values of f at arbitrary points from [0, 1],

  N(f) = [ f(t₁), f(t₂), …, f(tₙ) ],        (3.29)

where 0 = t₀ ≤ t₁ < t₂ < … < tₙ ≤ 1. The information noise is assumed to be white, i.e., x ~ 𝒩(0, σ²I). Hence information is determined by N and the noise level σ.
We start with a remark about the radius of information. Let S ∈ {App, Int}, let the variance be σ², and let exact information N of the form (3.29) be given. We know that the radius rad^{ave}(S, N, σ²) is attained by the algorithm φ_opt(y) = S(m(y)), y ∈ ℝⁿ, where m(y) is the mean of the conditional measure w(·|y) with respect to the observed vector y. Furthermore,

  rad^{ave}(S, N, σ²) = ( ∫_{C⁰} ‖S(f)‖² w(df|0) )^{1/2}.        (3.30)

Let R_N : [0,1]² → ℝ denote the covariance kernel of the conditional measure w(·|y) (it is independent of y). Applying (3.30) and the Fubini theorem we obtain

  (rad^{ave}(App, N, σ²))² = ∫_{C⁰} ( ∫₀¹ f²(x) dx ) w(df|0)
    = ∫₀¹ ( ∫_{C⁰} f²(x) w(df|0) ) dx = ∫₀¹ R_N(x, x) dx        (3.31)

and

  (rad^{ave}(Int, N, σ²))² = ∫_{C⁰} ( ∫₀¹ f(x) dx )² w(df|0)
    = ∫_{C⁰} ( ∫₀¹ ∫₀¹ f(s) f(t) ds dt ) w(df|0) = ∫₀¹ ∫₀¹ R_N(s, t) ds dt.        (3.32)
Hence the problem of finding the minimal error r_n^{ave}(S, σ²) can be reduced to minimizing (3.31) for function approximation and (3.32) for integration, over all N of the form (3.29). We shall find the optimal information in three steps. We first give formulas for R_N. Then, using these formulas, we estimate the radius of information Nₙ consisting of observations at equidistant points. Finally, we present lower bounds on r_n^{ave}(S, σ²), S ∈ {App, Int}, from which it will follow that the information Nₙ is almost optimal.

Covariance kernel of the conditional distribution

Recall that the mean element m(y) of w(·|y) is the natural linear spline interpolating the data {tᵢ, zᵢ}ᵢ₌₀ⁿ, where the zᵢs are obtained by smoothing the original data {yᵢ}ᵢ₌₁ⁿ, see Subsection 2.6.4. We now find formulas for the covariance kernel R_N of w(·|y). Define the sequences {aᵢ}ᵢ₌₀ⁿ, {cᵢ}ᵢ₌₀ⁿ, {dᵢ}ᵢ₌₀ⁿ, and {bᵢ}ᵢ₌₁ⁿ as follows:

  a₀ = c₀ = 0,
  cᵢ = σ² (tᵢ − a_{i−1}) / ( σ² + (tᵢ − a_{i−1}) ),        (3.33)
  aᵢ = tᵢ − cᵢ,   i = 1, 2, …, n,

and

  dₙ = cₙ,
  bᵢ = a_{i−1} + (tᵢ − a_{i−1})² / ( (tᵢ − a_{i−1}) − dᵢ ),        (3.34)
  d_{i−1} = (t_{i−1} − a_{i−1}) (bᵢ − t_{i−1}) / (bᵢ − a_{i−1}),   i = n, n−1, …, 1.

(To make these and the following formulas well defined for all σ and tᵢ, we use the convention that 0/0 = 0.) Note that the numbers bᵢ, bᵢ ≥ tᵢ, are defined in such a way that for the parabola

  pᵢ(t) = (bᵢ − t)(t − a_{i−1}) / (bᵢ − a_{i−1})

we have pᵢ(t_{i−1}) = d_{i−1}. If the information is exact, σ = 0, then aᵢ = bᵢ = tᵢ, while for σ > 0 we have

  0 ≤ a_{i−1} < t_{i−1} < tᵢ < bᵢ,   1 ≤ i ≤ n.
,
(ii) Let 0 < a < 1 and K > a/(1  a2). Then for sufficiently large n we have
ci > ao
di>Kof.
7n=
Proof  Observe that the function

  ξ(x) = σ²(x + 1/n)/(σ² + x + 1/n) − x = −(x² + x/n − σ²/n)/(σ² + x + 1/n),   x ≥ 0,

is decreasing and is zero at

  g = (1/2) ( −1/n + √(1/n² + 4σ²/n) ) ≤ σ n^{-1/2},

and that cᵢ₊₁* = cᵢ* + ξ(cᵢ*). Moreover, if 0 ≤ x ≤ g then x + ξ(x) ≤ g. Hence the sequence {cᵢ*} is increasing and cᵢ* ≤ g ≤ σn^{-1/2}, ∀i ≥ 0, which proves (i).

To show (ii) observe that if c_s* ≤ h ≤ g then, since ξ is decreasing,

  c_s* = Σ_{j=0}^{s−1} (c*_{j+1} − c*_j) = Σ_{j=0}^{s−1} ξ(c*_j) ≥ s ξ(h).

Hence, for any 0 ≤ h ≤ g and s ≥ 0,

  c_s* ≥ min{ h, s ξ(h) } = min{ h, s (σ²/n − h² − h/n)/(σ² + h + 1/n) }.        (3.40)

Letting h = α σ n^{-1/2} in (3.40), we find that the inequality cᵢ* ≥ α σ n^{-1/2} holds for all i ≥ α(1 + o(1))/(1 − α²) · σ√n; since K > α/(1 − α²), (ii) follows for sufficiently large n.
Lemma 3.8.2
(i) Let 0 < α < 1 and K > α/(1 − α²). Then for sufficiently large n we have

  dᵢ ≥ α σ/(2√n),   ∀ i ≥ K σ √n.

(ii) Let β > 1 and L > 2 ln(1/(β − 1)). Then for sufficiently large n

  d_{n−i} ≤ β σ/(2√n),   ∀ i ≥ L σ √n.

Proof  By (3.34), the sequence {dᵢ} satisfies the backward recursion dᵢ = Aᵢ d_{i+1} + Bᵢ with

  Aᵢ = ( cᵢ*/(cᵢ* + 1/n) )²,   Bᵢ = cᵢ* / ( n (cᵢ* + 1/n) ).

By Lemma 3.8.1(ii), for i ≥ Kσ√n we have cᵢ* ≥ ασ/√n. Hence, for such i and n,

  dᵢ = Aᵢ d_{i+1} + Bᵢ ≥ A d_{i+1} + B ≥ … ≥ A^{n−i} dₙ + B Σ_{j=0}^{n−i−1} A^j
     = A^{n−i} ( dₙ − B/(1−A) ) + B/(1−A),

where

  A = ( 1 + (ασ√n)^{-1} )^{-2},   B = ασ/(ασn + √n).

Since cₙ* ≍ σ/√n and

  B/(1−A) = (ασ/(2√n)) (2 + 2ασ√n)/(1 + 2ασ√n) ≥ ασ/(2√n),

for sufficiently large n we have dₙ = cₙ* ≥ B/(1−A), and consequently dᵢ ≥ B/(1−A) ≥ ασ/(2√n) for all i ≥ Kσ√n, which proves (i).

To show (ii), let

  C = ( 1 + (σ√n)^{-1} )^{-2},   D = σ/(σn + √n).

Then, owing to Lemma 3.8.1(i), we have Aᵢ ≤ C and Bᵢ ≤ D, ∀i. Hence

  d_{n−i} ≤ C^i ( dₙ − D/(1−C) ) + D/(1−C) ≤ C^i dₙ + D/(1−C).

Since dₙ ≤ σ/√n, C^i ≤ e^{−2L(1+o(1))} for i ≥ Lσ√n, and D/(1−C) = (σ/(2√n))(1 + o(1)), we obtain d_{n−i} ≤ (σ/(2√n))(1 + 2e^{−2L} + o(1)) ≤ βσ/(2√n) for sufficiently large n, which proves (ii).
0 we have rave(a2) n
Inkk n/ x f1ar/V"
r = 0,
r>1,
and r,ae(0) x n (r+1/2) In (k1)(r+') n. NR 3.8.5 We now give a concrete application of the correspondence theorem of Subsection 3.6.3. We let F be the Hilbert space, f (0) = 0, f is abs. cont., f' E G2(0,1) }, F = W° = [0,
with inner product (fl, f2) F = fo fl(t)f2(t)dt. Consider the problem of approximating the integral Int(f) in the worst case setting with E the unit ball of F. Information consists of n function evaluations and noise is bounded in the Euclidean norm, s 1 x1 < 62. As we know, {W°, C°} is an abstract Wiener space and the classical Wiener measure w is the corresponding Gaussian measure on Co. Hence we can apply Theorem 3.6.3 and Corollary 3.8.2 to get that the minimal radius for this problem is given as
   r_n^{wor}(Int, δ) = q_n ( δ/√n + 1/n ),

where q_n ∈ [1/√2, √2]. These bounds are attained by the (1/2)-smoothing spline algorithm using noisy function values at equidistant points.

NR 3.8.6 We assume that each value f(t_i) is observed with the same variance σ². One may consider a model in which f(t_i) is observed with variance σ_i², where the σ_i² may be different. It is easy to verify that in this case Theorem 3.8.4 remains valid provided that σ² in the formulas (2.5) and (2.6) is replaced by σ_i². However, formulas for the minimal radii are in this case unknown.
NR 3.8.7 The problems App and Int with F = C⁰ and μ the r-fold Wiener measure w_r (r ≥ 1) were studied in Plaskota (1992). It was shown that if the class Λ consists of function values or derivatives of order at most r, then

   r_n^{ave}(App, r, σ²) ≍ σ/√n + (1/n)^{r+1/2}

and

   r_n^{ave}(Int, r, σ²) ≍ σ/√n + (1/n)^{r+1}.

These bounds are attained by information

   N_n(f) = [ f^{(r)}(t₁), f^{(r)}(t₂), ..., f^{(r)}(t_n) ],    (3.52)

where t_i = i/n, 1 ≤ i ≤ n; see E 3.8.9. One can show that for integration the same bound can be obtained using only function values. However, this fact does not apply to the function approximation, which follows from more general results of Ritter (1994). He considered numerical differentiation, S(f) = Diff_k(f) = f^{(k)} (0 ≤ k ≤ r), with respect to the same r-fold Wiener measure. Assuming that only observations of function values are allowed, he showed that

   r_n^{ave}(Diff_k, r, σ²) ≍ (σ/√n)^{(2(r−k)+1)/(2r+2)} + (1/n)^{r−k+1/2}.
In particular, for approximation from noisy function values (σ > 0), the minimal radius has the exponent (2r+1)/(2r+2), which is worse than the exponent 1/2 obtained when the rth derivatives are observed. Hence, for function approximation, noisy information about rth derivatives is much more powerful than information about function values.
Exercises

E 3.8.1 Let a₁ ≥ a₂ ≥ ··· ≥ a_n > 0 and let Λ, Λ_i ... Consider the solution operator S_a : C⁰ → L₂(a,1), (S_a(f))(t) = f(t), where a ∈ (0,1). Observe that for any 𝕀 and φ we have e^{ave}(S_a, 𝕀, φ) ≤ e^{ave}(App, 𝕀, φ). To find a lower bound on e^{ave}(S_a, 𝕀, φ), use the technique from the proof of Theorem 3.8.6.
E 3.8.8 Let w_r be the r-fold Wiener measure and L(f) = f^{(k)}(t), f ∈ C⁰, with 0 ≤ k ≤ r. Show that

   ||L||²_{w_r} = ∫ L²(f) w_r(df) = t^{2(r−k)+1} / ( ((r−k)!)² (2(r−k)+1) ) ≤ 1.
E 3.8.9 Let F = C⁰, μ = w_r, and N_n^r be information defined by (3.52). Show the inequalities

   rad^{ave}(App, r+1, N_n^r) ≤ rad^{ave}(Int, r, N_n^r) ≤ rad^{ave}(App, r, N_n^r).

Use this and the previous exercise to obtain that for Λ consisting of function values and derivatives of order at most r we have

   r_n^{ave}(App, r+1, σ²) ⪯ r_n^{ave}(Int, r, σ²) ⪯ r_n^{ave}(App, r, σ²)

for all r ≥ 1 and σ ≥ 0.
3.9 Complexity

In this section we deal with the problem complexity in the average case setting. Recall that the problem is defined by the solution operator S : F → G, the probability measure μ on F, and the class Λ of permissible functionals. As in the worst case setting, we assume that approximations are obtained by executing a program. The program is defined in Section 2.9. The only difference is in the interpretation of the information statement. That is, I(d; L, v) now means that the real variable d takes a value of the real Gaussian random variable whose mean element is L(f) and whose variance is v². The cost of executing this statement is c(v²) where, as before, c is a nonnegative and nonincreasing cost function with positive values for small v > 0. We recall that the program specifies not only how information is collected, but also which primitive operations are to be performed. The operations are arithmetic operations and comparisons over ℝ, elementary linear operations over G, and logical operations over the Boolean values.
Let P be a program which is a realization of an algorithm φ using
information y ~ π_f. The (average) cost of computing an approximation with the program P equals

   cost^{ave}(P) = ∫_Y cost(P; y) μ₁(dy),

where, as before, cost(P; y) is the cost of computing φ(y), and μ₁ is the a priori distribution of the noisy information y on Y,

   μ₁(B) = ∫_F π_f(B) μ(df)    (3.53)

(compare with Section 3.2). The definition of the cost of approximation yields the algorithm complexity, comp^{ave}(𝕀, φ), and the problem ε-complexity, Comp^{ave}(ε), in the average case setting. That is,

   comp^{ave}(𝕀, φ) = inf { cost^{ave}(P) | P is a realization of φ using 𝕀 },

and for ε ≥ 0,

   Comp^{ave}(ε) = inf { comp^{ave}(𝕀, φ) | e^{ave}(𝕀, φ) ≤ ε }

(inf ∅ = +∞).
Our aim now is to obtain tight bounds on the average complexity of linear problems with Gaussian measures. That is, we assume that S is a continuous linear operator acting between a separable Banach space F and a separable Hilbert space G, and μ is a zero-mean Gaussian measure on F.
To establish complexity bounds, we need further relations between nonadaptive and adaptive information in terms of the cost function c(.).
3.9.1 Adaption versus nonadaption

In Subsection 3.7.2 we compared the radii of adaptive and nonadaptive information. Theorem 3.7.2 says that for any adaptive information 𝕀 there exists y ∈ Y such that the average radius of the nonadaptive information 𝕀_y is not larger than the average radius of 𝕀. However, it is not excluded that the cost of observations in 𝕀_y could be much larger than the average cost of observations in 𝕀. In the present section we study this issue.
Let 𝕀 = {N_y, Σ_y}_{y∈Y} with Σ_y = diag{σ₁², ..., σ²_{n(y)}(y₁, ..., y_{n(y)−1})} be arbitrary information. The average cost of 𝕀 is given as

   cost^{ave}(𝕀) = ∫_Y Σ_{i=1}^{n(y)} c( σ_i²(y₁, ..., y_{i−1}) ) μ₁(dy).

Clearly, if 𝕀 is nonadaptive then we have cost^{ave}(𝕀) = Σ_{i=1}^n c(σ_i²).
For a ∈ ℝ and y⁽¹⁾, y⁽²⁾ ∈ Y, let 𝕀′ = 𝕀′(y⁽¹⁾, y⁽²⁾, a) be information defined in the following way. Denote by n_j the length of y⁽ʲ⁾ and by y₁ the first component of y. Let

   Y′ = { y ∈ ℝ^{n₁} | y₁ ≤ a } ∪ { y ∈ ℝ^{n₂} | y₁ > a },

and for y ∈ Y′,

   {N′_y, Σ′_y} = {N_{y⁽¹⁾}, Σ_{y⁽¹⁾}}   if y₁ ≤ a,   and   {N_{y⁽²⁾}, Σ_{y⁽²⁾}}   if y₁ > a.

We set 𝕀′ = {N′_y, Σ′_y}_{y∈Y′}. Observe that the information 𝕀′ is almost nonadaptive since it uses only at most two nonadaptive sequences of observations. It turns out that the class of such information is as powerful as the class of adaptive information. That is, we have the following theorem.
Theorem 3.9.1 Let 𝕀 = {N_y, Σ_y}_{y∈Y} be adaptive information. Then there exist y⁽¹⁾, y⁽²⁾ ∈ Y and a ∈ ℝ such that for the information 𝕀′ = 𝕀′(y⁽¹⁾, y⁽²⁾, a) we have

   cost^{ave}(𝕀′) ≤ cost^{ave}(𝕀)   and   rad^{ave}(𝕀′) ≤ rad^{ave}(𝕀).
Proof Let ω be the a priori distribution of the variable y ↦ cost(𝕀_y) = Σ_{i=1}^{n(y)} c(σ_i²(y₁, ..., y_{i−1})) on ℝ₊, i.e.,

   ω(B) = μ₁({ y ∈ Y | cost(𝕀_y) ∈ B }),   ∀ Borel sets B of ℝ₊,

where μ₁ is given by (3.53). Clearly,

   cost^{ave}(𝕀) = ∫_{ℝ₊} T ω(dT).    (3.54)

The measure μ₁ can be decomposed with respect to the mapping y ↦ cost(𝕀_y) as

   μ₁(·) = ∫_{ℝ₊} μ₁(·|T) ω(dT),
where μ₁(·|T) is a probability measure on Y which is supported on the set Y_T = { y | cost(𝕀_y) = T }, for all T such that Y_T ≠ ∅. This, (3.2) and Theorem 3.4.1 yield

   (rad^{ave}(𝕀))² = ∫_Y ( r(ν₂(·|y)) )² μ₁(dy) = ∫_{ℝ₊} ψ(T) ω(dT),    (3.55)

where

   ψ(T) = ∫_{Y_T} ( r(ν₂(·|y)) )² μ₁(dy|T)   if Y_T ≠ ∅,   and   ψ(T) = 0 otherwise.    (3.56)

Here ν₂(·|y) = μ₂(·|y) is the conditional distribution of S(f) given y, and r(·) is the radius of a measure, see Section 3.2.
We now show that it is possible to select real numbers 0 ≤ T₁ ≤ T₂ and α* ∈ [0,1] such that

   α* T₁ + (1 − α*) T₂ ≤ ∫_{ℝ₊} T ω(dT)    (3.57)

and

   α* ψ(T₁) + (1 − α*) ψ(T₂) ≤ ∫_{ℝ₊} ψ(T) ω(dT).    (3.58)

Indeed, suppose this is not possible. Then, for a suitable comparison function ψ̃ and all T ≥ 0 we have ψ̃(T) ≥ ψ(T), and the inequality '≥' can be replaced by '>' on the interval [0, T₀] or on [T₀, +∞). Hence we obtain

   ∫_{ℝ₊} ψ(T) ω(dT) > ∫_{ℝ₊} ψ̃(T) ω(dT) = ∫_{ℝ₊} ψ(T) ω(dT),

which is a contradiction.
Let T₁, T₂ and α* satisfy (3.57) and (3.58). We now choose two sequences y⁽ʲ⁾, j = 1, 2, in such a way that cost(𝕀_{y⁽ʲ⁾}) = T_j and

   ( rad^{ave}(𝕀_{y⁽ʲ⁾}) )² ≤ ψ(T_j),
as well as the number a such that

   (1/√(2π σ₁²)) ∫_{−∞}^{a} exp{ −x²/(2σ₁²) } dx = α*,

where σ₁² = L₁(C_μ L₁) + v₁² is the variance of the Gaussian random variable y₁. From (3.54) to (3.58) it now follows that for the information 𝕀′ = 𝕀′(y⁽¹⁾, y⁽²⁾, a) we have

   cost^{ave}(𝕀′) = α* cost(𝕀_{y⁽¹⁾}) + (1 − α*) cost(𝕀_{y⁽²⁾}) ≤ ∫_{ℝ₊} T ω(dT) = cost^{ave}(𝕀)

and

   ( rad^{ave}(𝕀′) )² = α* ( rad^{ave}(𝕀_{y⁽¹⁾}) )² + (1 − α*) ( rad^{ave}(𝕀_{y⁽²⁾}) )²
                      ≤ α* ψ(T₁) + (1 − α*) ψ(T₂) ≤ ∫_{ℝ₊} ψ(T) ω(dT) = ( rad^{ave}(𝕀) )²,
as claimed. □

We now make the following observation. Assume without loss of generality that cost(𝕀_{y⁽¹⁾}) ≤ cost^{ave}(𝕀) and rad^{ave}(𝕀_{y⁽²⁾}) ≤ rad^{ave}(𝕀) (if this were not true, it would be possible to select y⁽¹⁾ = y⁽²⁾). Let 0 < p < 1. Then for α* ≥ p we have

   cost(𝕀_{y⁽¹⁾}) ≤ cost^{ave}(𝕀)   and   rad^{ave}(𝕀_{y⁽¹⁾}) ≤ rad^{ave}(𝕀)/√p ...

The second assumption means that IC^{non}(ε) increases at most polynomially in 1/ε as ε → 0⁺. This condition can often be replaced by the semiconvexity of IC^{non}(·). That is, we have the following lemma.
Lemma 3.9.2 Let ε₀ = ∫_F ||S(f)||² μ(df). Suppose that IC^{non}(√ε) is a semiconvex function of ε on the interval [0, ε₀], i.e., there exist 0 < α ≤ β and a convex function h : [0, ε₀] → [0, +∞] such that

   α h(ε) ≤ IC^{non}(√ε) ≤ β h(ε),   ∀ 0 ≤ ε ≤ ε₀.

Then

   Comp^{ave}(√ε) ≥ (α/β) IC^{non}(√ε),   ∀ 0 < ε ≤ ε₀.
Proof Let 𝕀 = {𝕀_y}_{y∈Y} be arbitrary adaptive information with radius rad^{ave}(𝕀) ≤ √ε, where ε ≤ ε₀. Let

   ψ(y) = ( r(μ₂(·|y)) )².

Clearly, ψ(y) ≤ ε₀. Define the probability measure ω on ℝ as

   ω(B) = μ₁({ y ∈ Y | ψ(y) ∈ B }),   ∀ Borel sets B of ℝ.

The convexity of h and the inequality

   cost(𝕀_y) ≥ IC^{non}( √ψ(y) ) ≥ α h( ψ(y) )

yield

   cost^{ave}(𝕀) = ∫_Y cost(𝕀_y) μ₁(dy) ≥ α ∫_{ℝ₊} h(x) ω(dx)
                 ≥ α h( (rad^{ave}(𝕀))² ) ≥ (α/β) IC^{non}( rad^{ave}(𝕀) ) ≥ (α/β) IC^{non}(√ε).

Since 𝕀 was arbitrary and

   Comp^{ave}(√ε) ≥ inf { cost^{ave}(𝕀) | rad^{ave}(𝕀) ≤ √ε },

the lemma follows. □
The essence of the proven estimates is that it is enough to know the information ε-complexity to obtain bounds on the problem complexity. As in the worst case setting, IC^{non}(ε) can be derived as the inverse of the Tth minimal radius. More precisely, the Tth minimal (average) radius is defined as

   R(T) = inf { r^{ave}( diag{σ₁², ..., σₙ²} ) | n ≥ 1, Σ_{i=1}^n c(σ_i²) ≤ T }.
Knowing R(T) we can find its inverse

   R⁻¹(ε) = inf { T | R(T) ≤ ε }.

If this function is semicontinuous then, similarly to Lemma 2.9.2,

   IC^{non}(ε) ≈ R⁻¹(ε)   as   ε → 0.
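As an illustration (not from the book), the relation IC^{non}(ε) ≈ R⁻¹(ε) can be made concrete for exact information with the constant cost function c ≡ 1 and an assumed eigenvalue sequence λ_j = j⁻²: a budget T buys ⌊T⌋ exact observations of the dominant functionals, R(T)² is the tail sum of the remaining eigenvalues, and inverting R numerically gives the information ε-complexity.

```python
# Sketch: IC^non(eps) as the inverse of the T-th minimal radius, for exact
# information (cost c == 1) and the assumed eigenvalues lambda_j = j^(-2).
# A budget T buys floor(T) exact observations, so R(T)^2 is the tail sum
# of the remaining eigenvalues.

def ic_non(eps, n_terms=200000):
    """Smallest budget T (= number of observations) with R(T) <= eps."""
    tail = sum(1.0 / j ** 2 for j in range(1, n_terms))  # truncated sum of lambda_j
    n = 0
    while tail > eps ** 2:
        n += 1
        tail -= 1.0 / n ** 2                             # remove lambda_n from the tail
    return n
```

Since the tail behaves like 1/n here, IC^{non}(ε) grows like ε⁻² in this example.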
Notes and remarks

NR 3.9.1 First results on adaption versus nonadaption in the average case setting were obtained by Wasilkowski (1986), who studied exact information; see also Traub et al. (1988, Sect. 5.6 of Chap. 6). The results on adaptive information with noise have been taken mainly from Plaskota (1995a).

NR 3.9.2 Let

   IC^{ad}(ε) = inf { cost^{ave}(𝕀) | 𝕀 adaptive and rad^{ave}(𝕀) ≤ ε }    (3.59)

be the adaptive information ε-complexity. In terms of IC^{ad}(ε) and IC^{non}(ε), the results of Theorem 3.9.1 mean that for any ε and 0 < p < 1, at least one of the two following inequalities holds:

   IC^{ad}(ε) ≥ IC^{non}(ε/√p)   or   IC^{ad}(ε) ≥ (1 − p) IC^{non}(ε).

It turns out that this estimate is sharp. More precisely, it was proven by Plaskota (1993b) that for exact information (i.e. for the cost function c ≡ const > 0) the following theorem holds. Let the nonzero solution operator S : F → G and the Gaussian measure μ with dim(supp μ) = +∞ be given. Then there exists a class Λ ⊂ F* of permissible information functionals such that:

(i) for any α, β > 0 satisfying α + β ≥ 1, and for any ε₀ > 0, there exists ε ≤ ε₀ such that IC^{ad}(ε) ... ;
(ii) for any γ > 0 and ε₀ > 0 there exists ε ≤ ε₀ such that IC^{ad}(ε) ... ;
(iii) for any γ > 0 and ε₀ > 0 there exists ε ≤ ε₀ such that IC^{ad}(ε) ≤ γ IC^{non}(ε).
Exercises

E 3.9.1 Show that the adaptive information ε-complexity defined in (3.59) satisfies

   IC^{ad}(ε) = inf { α IC^{non}(ε₁) + (1 − α) IC^{non}(ε₂) | 0 ≤ α ≤ 1, 0 ≤ ε₁ ≤ ε ≤ ε₂, α ε₁² + (1 − α) ε₂² = ε² },

and that IC^{ad}(√ε) is a convex function of ε.

E 3.9.2 Suppose that the function T ↦ R²(T) is semiconvex, i.e., there exist T₀ ≥ 0, 0 < α ≤ β, and a convex function h : [0, +∞) → [0, +∞) such that

   α h(T) ≤ R²(T),   ∀ T ≥ 0,   and   R²(T) ≤ β h(T),   ∀ T ≥ T₀.

Show that then for any information 𝕀 with cost^{ave}(𝕀) ≤ T we have

   rad^{ave}(𝕀) ≥ √(α/β) R(T),   ∀ T ≥ T₀.
3.10 Complexity of special problems

In this section we analyze the ε-complexity of the problems considered in Section 3.8.

3.10.1 Linear problems with Gaussian measures

We begin with the problem defined in Subsection 3.8.1. That is, S : F → G is an arbitrary continuous linear operator, μ is a zero-mean Gaussian measure, and the class Λ consists of linear functionals bounded by 1 in the μ-norm. The technique of evaluating Comp^{ave}(ε) will be similar to that used in Subsection 2.10.1, where the corresponding problem in the worst case setting is studied. Therefore we only sketch some proofs.
For a given cost function c, we let c̄(x) = c(x⁻¹), 0 < x < +∞. We assume that the function c̄ is concave or convex, and c(0) = +∞. Recall that {ξ_i} ⊂ G is the complete orthonormal system of eigenelements of SC_μS*, λ₁ ≥ λ₂ ≥ ··· ≥ 0 are the corresponding eigenvalues, and K_i = λ_i^{−1/2} S*ξ_i. The function Ω is given by (3.17).
Lemma 3.10.1 The Tth minimal radius is equal to

   R(T) = inf Ω(η₁, ..., η_n),

where the infimum is taken over all n and η_i ≥ 0, 1 ≤ i ≤ n, satisfying

(a) for concave c̄:   Σ_{i=1}^n c̄(η_i) ≤ T;
(b) for convex c̄:   n c̄( (1/n) Σ_{i=1}^n η_i ) ≤ T.

Proof For concave c̄ we have

   R²(T) = inf { Ω(η₁, ..., η_n) | n ≥ 1, Σ_{i=1}^n c̄(η_i) ≤ T }.

For convex c̄, Jensen's inequality gives Σ_{i=1}^n c̄(η_i) ≥ n c̄(η₀), where η₀ = (1/n) Σ_{i=1}^n η_i, so that equal variances are optimal and

   R²(T) = inf { ( r_n^{ave}(σ² I) )² | n ≥ 1, n c(σ²) ≤ T }.

The rest of the lemma follows from Theorem 3.8.1. □
If the number n = n(T) defined by (3.61) satisfies n(T) = O(T) as T → +∞, then IC(c_lin; ε) is attained by information that uses O(T) observations, and the complexity behaves as IC(c_lin; ε).
Observe that the condition n(T) = O(T) means that zero is not an attraction point of the sequence (1/n) Σ_{j=1}^n ( (λ_j/λ_n)^{1/2} − 1 ). When this is the case, we can show that c_lin is the 'worst' cost function, the result corresponding to Lemma 2.10.2 of the worst case setting.
Lemma 3.10.2 Let c be an arbitrary cost function. Let σ₀² be such that c(σ₀²) < +∞. If there exists a > 0 such that for sufficiently large n

   (1/n) Σ_{j=1}^n ( (λ_j/λ_n)^{1/2} − 1 ) ≥ a,    (3.63)

then for small ε > 0 we have

   Comp(c; ε) ≤ M Comp(c_lin; ε),

where M = M(c, σ₀) = a⁻¹ ⌈2aσ₀²⌉ ( c(σ₀²) + 2g ).

Proof Let n₀ be such that (3.63) holds for all n ≥ n₀. Let ε₀ satisfy ε₀ ≤ R(c_lin; a n₀) and IC^{non}(c_lin; ε₀) ≥ a. We shall show that the required inequality holds for all ε ≤ ε₀. To this end, we proceed similarly to the proof of Lemma 2.10.2. We choose information 𝕀 for which rad^{ave}(𝕀) = ε and cost(c_lin; 𝕀) = IC^{non}(c_lin; ε). Owing to (3.63), we can assume that the number of observations in 𝕀 is n = ⌊IC^{non}(c_lin; ε)/a⌋ and they are performed with the same variance σ² = n/IC^{non}(c_lin; ε). Let k = ⌈2aσ₀²⌉. Then for the information 𝕀̄ which repeats the same observations as 𝕀 k times but with variance σ̄² = kσ², we have rad^{ave}(𝕀̄) = rad^{ave}(𝕀) and

   cost(c; 𝕀̄) = k n c(kσ²) ≤ k n c(σ₀²) ≤ a⁻¹ k c(σ₀²) IC^{non}(c_lin; ε).

Hence

   Comp(c; ε) ≤ a⁻¹ k c(σ₀²) Comp(c_lin; ε) + (2kn − 1) g ≤ a⁻¹ k ( c(σ₀²) + 2g ) Comp(c_lin; ε),

as claimed. □
We note that the condition (3.63) holds for many sequences {λ_j} of eigenvalues. For instance, for λ_j = j^{−p} with p > 1 we have

   lim_{n→∞} (1/n) Σ_{j=1}^n ( (λ_j/λ_n)^{1/2} − 1 )  =  p/(2 − p)   for 1 < p < 2,   and   +∞   for p ≥ 2.
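This limit is easy to check numerically; the sketch below (an illustration, not part of the text) evaluates the average in condition (3.63) for λ_j = j^{−p} and compares it with p/(2 − p).

```python
# Numerical check of the limit: for lambda_j = j^(-p) with 1 < p < 2,
# (1/n) * sum_{j=1}^n ((lambda_j / lambda_n)^(1/2) - 1)  ->  p / (2 - p).

def condition_363(n, p):
    """The average appearing in condition (3.63) for lambda_j = j^(-p)."""
    return sum((n / j) ** (p / 2.0) - 1.0 for j in range(1, n + 1)) / n

p = 1.5
limit = p / (2.0 - p)              # = 3.0 for p = 1.5
approx = condition_363(200000, p)  # slowly approaches the limit from below
# For p >= 2 the same average diverges to +infinity, as stated above.
```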
Hence we can take a = 1. This means, in particular, that Comp(c_lin; ε) can be achieved by using no more than ⌊Comp(c_lin; ε)⌋ observations. There are, however, sequences {λ_j} for which (3.63) is not satisfied, and consequently the Tth minimal radius cannot be achieved by information using O(T) observations. An example is given in E 3.10.2.

Clearly, when the cost function is bounded from below by a positive constant, the lower bound (up to a constant) on the ε-complexity is provided by Comp^{non}(c_exa; ε), where c_exa ≡ 1 is the cost function for exact information. In this case, letting n = n(ε) ≥ 0 be the minimal n for which

   Σ_{j=n+1}^∞ λ_j ≤ ε²,

we get Comp(c_exa; ε) ≍ n(ε).

Consider now the cost functions

   c_q(σ²) = σ^{−2q},   σ² > 0,   and   c_q(0) = +∞,

where q ≥ 0. Note that for q = 0 we have exact information. Assuming (3.63), for q ≥ 1 we have Comp(q; ε) ≍ Comp(1; ε). Therefore in the following calculations we restrict ourselves to 0 ≤ q ≤ 1. Using Lemma 3.10.1 we obtain
   R(q; T)² = (1/T)^{1/q} ( Σ_{i=1}^n λ_i^r )^{1/r} + Σ_{j=n+1}^∞ λ_j,    (3.64)

where r = q/(1+q) and n = n(T) is the largest integer satisfying

   ( 1 + Σ_{i=1}^{n−1} (λ_i/λ_n)^r )^{1/r} − ( Σ_{i=1}^{n−1} (λ_i/λ_n)^r )^{1/r} ≤ T^{1/q}.

Furthermore, R(q; T) is attained by observing K₁, ..., K_n with variances σ_i² proportional to λ_i^{1/(1+q)}, 1 ≤ i ≤ n, the normalizing factor being determined by n and T.
Consider now a problem for which the eigenvalues satisfy

   λ_j ≍ (1/j)^p ln^{sp} j,

where p > 1 and s ≥ 0. Recall that such behavior of the eigenvalues can be observed for the function approximation with respect to the Wiener sheet measure, see NR 3.3.5. Then we have

   R(q, p, s; T)² ≍ (1/T)^{p−1} ln^{(s+1)p} T   for (p − 1)q̂ > 1,

with corresponding expressions in the cases (p − 1)q̂ = 1 and 0 < (p − 1)q̂ < 1, where q̂ = min{1, q}. We check that R(q, p, s; T)² is a semiconvex function of T and that the sequence {λ_j} satisfies (3.63). Hence Comp^{non}(q, p, s; ε) is also semiconvex and we obtain the following formulas for the ε-complexity.
Theorem 3.10.1

   Comp^{ave}(q, p, s; ε) ≍ (1/ε)^{2/(p−1)} ( ln(1/ε) )^{(s+1)p/(p−1)}   for (p − 1)q̂ > 1,

with corresponding expressions in the remaining cases. Note that if 0 < p − 1 ≤ 2 then the exponent of 1/ε is 2/(p − 1) and does not depend on q. In the latter case, the complexity behaves independently of the cost function. The situation is then similar to that in the corresponding problem of the worst case setting, see Theorem 2.10.1. The only difference is that p is replaced by p − 1.
3.10.2 Approximation and integration on the Wiener space

We pass to the approximation and integration problems of Subsection 3.8.2. Recall that both problems are defined on the Wiener space of continuous functions and information consists of noisy observations of function values. In that subsection we proved tight bounds on the minimal errors r_n^{ave}(App, σ²) and r_n^{ave}(Int, σ²) with σ ≥ 0. They allow us to find the complexity for fixed observations with variance σ₀² or, in other
words, when the cost function is c_fix(σ²) = c₀ > 0 for σ² ≥ σ₀², and c_fix(σ²) = +∞ for σ² < σ₀². That is, we have R(c_fix; T) = r_n^{ave}(σ₀²) with n = n(T) = ⌊T/c₀⌋, and owing to Corollary 3.8.2,

   IC^{non}(App, c_fix; ε) ≈ c₀ ( p_n σ₀²/(4ε⁴) + 1/(6ε²) )

and

   IC^{non}(Int, c_fix; ε) ≈ c₀ ( q_n σ₀²/ε² + 1/ε ),

where p_n, q_n ∈ [1/√3, 1]. Since for both problems IC^{non}(c_fix; √ε) is a semiconvex function of ε, the last estimates with '≈' replaced by '≍' also hold for the problem complexity.
It turns out that similar bounds can be proven for the cost function c_lin(σ²) = σ⁻². Indeed, the upper bound on Comp(c_lin; ε) is provided by Comp(c_fix; ε) with σ₀ = 1 = c₀, while the lower bound follows from the following lemma.
Lemma 3.10.3 For all T we have

   ( R(App, c_lin; T) )² ≥ 1/(6√T)

and

   ( R(Int, c_lin; T) )² ≥ 1/(3(1 + T)).
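The integration bound has a simple scalar interpretation (an illustrative sketch, not from the book): for a single Gaussian functional value s ~ N(0, λ) observed with total precision T = Σ_i σ_i⁻², the posterior variance is exactly λ/(1 + λT), which for λ = λ₁ = 1/3 is bounded below by 1/(3(1 + T)).

```python
# Scalar analogue of the integration bound (assumed model): s ~ N(0, lam) is
# observed as y_i = s + N(0, 1/p_i).  With total precision T = sum(p_i), the
# posterior variance of s is lam / (1 + lam*T); how the budget T is split
# among observations does not matter.

def posterior_variance(lam, precisions):
    return 1.0 / (1.0 / lam + sum(precisions))

lam = 1.0 / 3.0                                   # plays the role of lambda_1
T = 10.0
v_one = posterior_variance(lam, [T])              # whole budget in one observation
v_four = posterior_variance(lam, [T / 4] * 4)     # budget split four ways
lower = lam / (1.0 + T)                           # the bound of the lemma
```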
Proof Let 𝕀 be arbitrary nonadaptive information using observations at the points t_i with variances σ_i², 1 ≤ i ≤ n, such that

   cost(c_lin; 𝕀) = Σ_{i=1}^n σ_i⁻² ≤ T.

Consider first the approximation problem. Proceeding exactly as in the proof of Lemma 3.8.3 we can show the following generalization of that lemma: namely, for any 0 ≤ a < t < b ≤ 1, the covariance kernel of the conditional distribution, R_N(t,t), satisfies

   R_N(t,t) ≥ ψ(t) / (1 + T_{ab} ψ(t)),    (3.65)

where ψ(t) = (t − a)(b − t)/(b − a), T_{ab} = Σ_i σ_i⁻², and the summation is taken over all i such that t_i ∈ (a, b). We now use (3.65) to obtain a lower bound on R(App, c_lin; T). To
this end, we divide the unit interval into k equal subintervals (u_{i−1}, u_i), 1 ≤ i ≤ k. For 1 ≤ i ≤ k, let T_i = Σ_{j∈A_i} σ_j⁻², where

   A_i = { j | 1 ≤ j ≤ n, t_j ∈ (u_{i−1}, u_i) }.

Denoting ψ_i(t) = (t − u_{i−1})(u_i − t)/(u_i − u_{i−1}) and applying (3.31) and (3.65) we obtain

   ( rad^{ave}(App, 𝕀) )² ≥ Σ_{i=1}^k ∫_{u_{i−1}}^{u_i} ψ_i(t) / (1 + T_i/(4k)) dt = Σ_{i=1}^k 2 / (3k(T_i + 4k)).

The last quantity, as a function of the nonnegative arguments T₁, ..., T_k with Σ_{i=1}^k T_i ≤ T, is minimized for T_i = T/k. Hence, for any k,

   ( rad^{ave}(App, 𝕀) )² ≥ 2k / (3(T + 4k²)).

Taking k = ⌊√(T/4)⌋ we obtain the desired bound.
For integration we have

   ( rad^{ave}(Int, 𝕀) )² ≥ λ₁/(1 + T),

where λ₁ = ∫_F Int²(f) w(df) = 1/3. This completes the proof. □

Thus we have proven the following theorem.
Theorem 3.10.2 For the cost functions c_lin and c_fix with σ₀ > 0 we have

   Comp^{ave}(App, c_fix; ε) ≍ Comp^{ave}(App, c_lin; ε) ≍ ε⁻⁴

and

   Comp^{ave}(Int, c_fix; ε) ≍ Comp^{ave}(Int, c_lin; ε) ≍ ε⁻²,

as ε → 0⁺.
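The rates in the theorem can be read off from Lemma 3.10.3 directly; a small sketch (illustration only): inverting the lower bound R(App, c_lin; T)² ≥ 1/(6√T) shows that error ε already forces a budget of order ε⁻⁴, and the integration bound similarly forces order ε⁻².

```python
# Where the rates come from: R(App;T)^2 >= 1/(6*sqrt(T)) implies that
# R(App;T) <= eps requires T >= 1/(36*eps^4); similarly for Int.

def app_budget_lower_bound(eps):
    return 1.0 / (36.0 * eps ** 4)

def int_budget_lower_bound(eps):
    # From R(Int;T)^2 >= 1/(3*(1+T)): T >= 1/(3*eps^2) - 1.
    return 1.0 / (3.0 * eps ** 2) - 1.0

# Halving the error multiplies the App budget by 16 and the Int budget by ~4.
```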
Notes and remarks

NR 3.10.1 Most of Subsection 3.10.1 is based on Plaskota (1995a). Subsection 3.10.2 is new.
NR 3.10.2 We can apply Theorem 3.10.1 to the multivariate approximation
with respect to the Wiener sheet measure, i.e., to the problem formally defined in NR 3.8.4. We obtain

   Comp^{ave}(ε) ≍  (1/ε)^{2q̂},                                          q̂ > (r + 1/2)⁻¹,
                    (1/ε)^{1/(r+1/2)} ( ln(1/ε) )^{k(r+1)/(r+1/2)},      q̂ = (r + 1/2)⁻¹,
                    (1/ε)^{1/(r+1/2)} ( ln(1/ε) )^{(k−1)(r+1)/(r+1/2)},  q̂ < (r + 1/2)⁻¹,

where k and r are as in NR 3.8.4, and q̂ is as in Theorem 3.10.1.
NR 3.10.3 Complexity for the function approximation and integration with respect to the r-fold Wiener measure with r ≥ 1 can be derived from Plaskota (1992, 1995b) and Ritter (1994), see also NR 3.8.7. That is, suppose that the class Λ consists of function values and derivatives of order at most r, and the cost function c = c_fix, i.e., observations are performed with the same variance σ₀² > 0 and cost c₀. Then

   Comp^{ave}(App; ε) ≍ c₀ ( σ₀²/ε² + (1/ε)^{1/(r+1/2)} )

and

   Comp^{ave}(Int; ε) ≍ c₀ ( σ₀²/ε² + (1/ε)^{1/(r+1)} ),

where the constants in the '≍' notation do not depend on σ₀. Suppose now that only observations of function values are allowed. Then

   Comp^{ave}(App; ε) ≍ c₀ σ₀² (1/ε)^{2+1/(r+1/2)},

while Comp^{ave}(Int; ε) remains unchanged.
NR 3.10.4 We recall that in the case of the solution operator S being a functional we have the correspondence Theorem 3.6.3. It says that the minimal radii of the corresponding problems in the worst case and average case settings are equal up to the constant factor √2. We can formulate an analogous correspondence theorem about the worst and average case complexities.

Let {H, F} be an abstract Wiener space for a Gaussian measure μ on F. Let the solution operator S : F → ℝ be a continuous linear functional and the class Λ of permissible functionals be given. Consider the problem of finding the ε-complexity in the two settings:

(WW) The worst case setting with respect to E = { f ∈ H | ||f||_H ≤ 1 }, noise bounded in the weighted Euclidean norm, Σ_i ((y_i − L_i(f))/δ_i)² ≤ 1, and a cost function c_w(δ);
(AA) The average case setting with respect to the measure μ, independent noise with (y_i − L_i(f)) ~ N(0, σ_i²), and a cost function c_a(σ²).

If c_w(x) = c_a(x²) then

   (IC^{non})^{wor}(√2 ε) ≤ (IC^{non})^{ave}(ε) ≤ (IC^{non})^{wor}(ε).

If, additionally, (IC^{non})^{ave}(√ε) is semiconvex and (IC^{non})^{ave}(√2 ε) behaves as (IC^{non})^{ave}(ε), then

   Comp^{wor}(ε) ≍ Comp^{ave}(ε)   as   ε → 0⁺.
For instance, the results of Subsection 3.10.2 can be applied to get complexity results for the corresponding problem in the worst case setting (compare also with NR 3.8.5).
Exercises

E 3.10.1 Show that the condition Σ_{j=1}^∞ λ_j^{1/2} < +∞ implies

   lim_{n→∞} (1/n) Σ_{j=1}^n ( (λ_j/λ_n)^{1/2} − 1 ) = +∞.

That is, Lemma 3.10.2 can be applied to such eigenvalues.
E 3.10.2 Let 1/2 < p < 1. Let a_n = n^{−p} and P_n = n⁻¹ Σ_{j=1}^n a_j/a_n, n ≥ 1. For α₁ ≥ α₂ ≥ ··· → 0, let 0 = n₀ < n₁ < ··· be the sequence of integers defined inductively by the condition

   (n_{i−1})^{1−p} P_{n_{i−1}} / (n_i − n_{i−1}) < α_i,

with P₀ = 0. For n ≥ 1 we let λ_n = a_{n_i}, where i is the unique positive integer such that n_{i−1} < n ≤ n_i. Show that for any n satisfying

   Σ_{i=1}^n λ_i^{1/2} ≥ λ_n^{1/2} (T_i + n),

with T_i = α_i n_i, we have n/T_i ≥ 1/α_i → +∞ as i → +∞.

E 3.10.3 Let 0 ≤ q ≤ 1. Show that Comp^{ave}(q, p, s; ε) is not a strictly convex function of ε.
E 3.10.4 Show that for the cost function

   c_*(σ²) = 1 + σ⁻²,   σ² > 0,   and   c_*(0) = +∞,

we have

   R(c_*; T)² = ( Σ_{i=1}^n λ_i^{1/2} )² / (T − n) + Σ_{j=n+1}^∞ λ_j,

where n = n(T) is the largest integer satisfying

   Σ_{i=1}^n λ_i^{1/2} ≤ λ_n^{1/2} (T − n)/2.
4 Worst-average case setting

4.1 Introduction

In the previous two chapters, we studied settings in which we made exclusively deterministic assumptions on problem elements f and information noise x (the worst case setting), or exclusively stochastic assumptions (the average case setting). In the first setting we analyzed the worst performance of algorithms, while in the other we analyzed the average performance. The deterministic and stochastic assumptions can be combined, and this leads to mixed settings.

In this chapter, we study the first mixed setting in which we have deterministic assumptions on the problem elements and stochastic assumptions on noise. We call it the worst-average case setting. More precisely, we want to approximate values S(f) for elements f belonging to a set E ⊂ F. Information about f is given with random noise. That is, nonadaptive or adaptive information 𝕀 is defined as in the average case setting of Chapter 3. The error of an algorithm φ that uses information 𝕀 is given as

   e^{w-a}(𝕀, φ) = sup_{f∈E} ( ∫_Y ||S(f) − φ(y)||² π_f(dy) )^{1/2},    (4.1)

where Y is the set of all possible values y of noisy information, and π_f is the distribution of y for the element f, i.e., 𝕀 = {π_f}_{f∈F}. This setting has been studied extensively in statistics. It is often called statistical estimation, and the problem of minimizing the error over a class of algorithms is called the minimax (statistical) problem.

In the mixed settings, the complexity results are not as complete as in the worst and average case settings. The reason for this lies in the technical difficulty. For instance, even for apparently simple one dimensional problems, optimal algorithms are not linear (or not affine), and they are actually not known explicitly.

The body of this chapter consists of three sections. In Section 4.2, we study approximation of a linear functional over a convex set E. We consider nonadaptive linear information with Gaussian noise. It turns out that, although optimal algorithms are not affine, we lose about 11% by using affine algorithms. Hence, affine approximations prove once more to
be (almost) optimal. Optimal affine algorithms are constructed. These results are obtained by using the concept of a hardest one dimensional subproblem, and by establishing a relation between the worstaverage and worst case settings. In particular, it turns out that appropriately calibrating the levels of random noise in one setting and deterministic noise in the other setting, we get the same optimal affine algorithm. If E is the unit ball in a Hilbert norm, there are also close relations between the worstaverage and the corresponding average case settings. This enables us to show that these three settings are almost equivalent. The same smoothing spline algorithms are almost optimal in any of them. The situation becomes much more complicated when the solution operator is not a functional. This case is considered in Section 4.3. We only present some special results about optimal algorithms when, roughly speaking, information is given `coordinatewise'. In particular, we show optimality of the least squares algorithm when E = Rd. For arbitrary information, optimal algorithms are unknown even for problems defined on Hilbert spaces.
4.2 Affine algorithms for linear functionals

Optimal algorithms often turn out to be linear or affine for approximating a linear functional in the worst or average case setting. In this section, we investigate whether a similar result holds in the mixed worst-average case setting. To begin with, we consider a one dimensional problem. We shall see that even in this simple case the situation is complicated.
4.2.1 The one dimensional problem

Consider the problem of approximating a real parameter f ∈ [−τ, τ], τ ≥ 0, from data y = f + x, where x is distributed according to the zero-mean one dimensional Gaussian measure with variance σ² ≥ 0. That is, we formally have S : ℝ → ℝ, S(f) = f, and π_f = N(f, σ²).

Clearly, for σ = 0 we have exact information. In this case, the algorithm φ(y) = y gives the exact value of S(f) with probability 1 and its error is zero. For σ > 0, the error of any algorithm φ : ℝ → ℝ is positive and given as
   e^{w-a}(𝕀, φ) = e^{w-a}(τ, σ²; φ) = sup_{|f|≤τ} ( (1/√(2πσ²)) ∫_ℝ |f − φ(f + x)|² exp{ −x²/(2σ²) } dx )^{1/2}.
First consider linear algorithms. That is, φ is of the form φ(y) = c y for all y ∈ ℝ. Let

   r_lin(τ, σ²) = inf { e^{w-a}(τ, σ²; φ) | φ linear }

be the minimal error of linear algorithms.
Lemma 4.2.1 For all τ ≥ 0 and σ ≥ 0 we have

   r_lin(τ, σ²) = στ / √(τ² + σ²).

The optimal linear algorithm φ(y) = c_opt y is uniquely determined and its coefficient is given as

   c_opt = c_opt(τ, σ²) = τ² / (τ² + σ²).

Proof We have already noticed that the lemma is true for σ = 0. Let σ > 0. Then for any linear algorithm φ(y) = cy we have

   ( e^{w-a}(τ, σ²; φ) )² = sup_{|f|≤τ} ( (1 − c)² f² + c² σ² ) = (1 − c)² τ² + c² σ².

Minimizing the right hand side over c, we obtain the unique minimizer c = c_opt and the minimal value σ²τ²/(τ² + σ²), which completes the proof. □
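The computation behind the lemma is elementary and easy to verify numerically (a sketch, not part of the book): for φ(y) = cy the squared error (1 − c)²τ² + c²σ² is a parabola in c whose minimum sits at c_opt.

```python
# Numerical check of Lemma 4.2.1: for phi(y) = c*y, the squared worst-average
# error is (1 - c)^2 * tau^2 + c^2 * sigma^2, minimized at
# c_opt = tau^2 / (tau^2 + sigma^2) with value sigma^2*tau^2/(tau^2 + sigma^2).

def squared_error(c, tau, sigma):
    return (1.0 - c) ** 2 * tau ** 2 + c ** 2 * sigma ** 2

def c_opt(tau, sigma):
    return tau ** 2 / (tau ** 2 + sigma ** 2)

tau, sigma = 2.0, 1.0
grid = [i / 1000.0 for i in range(1001)]
c_star = min(grid, key=lambda c: squared_error(c, tau, sigma))  # brute-force search
r_lin_sq = sigma ** 2 * tau ** 2 / (tau ** 2 + sigma ** 2)      # = 0.8 here
```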
For σ ≥ 1 we have r_lin(1, σ²) ≤ 1 and r_non(1, σ²) ≥ ψ^{1/2}(1), where the last inequality follows from (4.5) by taking f = 1 and using the monotonicity of ψ. On the other hand, for σ ≤ 1 we have r_lin(1, σ²) ≤ σ and r_non(1, σ²) ≥ σ ψ^{1/2}(1), where now the last inequality follows from (4.5) by taking f = σ. Thus for any σ we have

   r_lin(1, σ²) ≤ r_non(1, σ²) / ψ^{1/2}(1).
By numerical computation we find that ψ^{−1/2}(1) = 1.49... < 1.5. Since r_lin(1, σ²) ≈ σ as σ → 0, to obtain the first limit in the theorem it suffices to show that r_non(1, σ²) ≈ σ as σ → 0. To this end, observe that for any φ we have

   ( e^{w-a}(1, σ²; φ) )² ≥ (1/2) ∫_{−1}^{1} ( (1/√(2πσ²)) ∫_ℝ (f − φ(y))² e^{−(y−f)²/(2σ²)} dy ) df
                          = (1/√(2πσ²)) ∫_ℝ ( (1/2) ∫_{−1}^{1} (f − φ(y))² e^{−(y−f)²/(2σ²)} df ) dy.    (4.6)

The inner integral in (4.6) is minimized by

   φ₂(y) = ( ∫_{−1}^{1} x e^{−(y−x)²/(2σ²)} dx ) / ( ∫_{−1}^{1} e^{−(y−x)²/(2σ²)} dx )
         = y − σ ( ∫ x e^{−x²/2} dx ) / ( ∫ e^{−x²/2} dx ),
where the last integrals are taken from (y−1)/σ to (y+1)/σ. Put φ = φ₂ and change variables in (4.6), y = f + σu. After some calculations we get

   ( e^{w-a}(1, σ²; φ) )² ≥ (σ²/2) ∫_{−1}^{1} ( (1/√(2π)) ∫_ℝ ψ₁²(f, u, σ²) e^{−u²/2} du ) df,

where

   ψ₁(f, u, σ²) = ( ∫ (u − x) e^{−x²/2} dx ) / ( ∫ e^{−x²/2} dx ),

and the integrals are taken from u + (f−1)/σ to u + (f+1)/σ. Observe now that for |f| ≤ a < 1 and |u| ≤ A < +∞, the function ψ₁(f, u, σ²) converges uniformly to u as σ → 0. Hence

   lim_{σ→0} (1/2) ∫_{−1}^{1} ( (1/√(2π)) ∫_ℝ ψ₁²(f, u, σ²) e^{−u²/2} du ) df
      = (1/2) ∫_{−1}^{1} ( (1/√(2π)) ∫_ℝ u² e^{−u²/2} du ) df = 1.

Consequently, r_non(1, σ²) ≈ σ as σ → 0, as claimed. To prove the second limit, observe that by taking f = 1 in (4.5) we obtain

   ( e^{w-a}(1, σ²; φ) )² ≥ ψ(1/σ) ≥ e^{−1/(2σ²)} (1/√(2π)) ∫_ℝ u² e^{−u²/2} du = e^{−1/(2σ²)},

which tends to 1 as σ → +∞. Since lim_{σ→+∞} r_lin(1, σ²) = 1, the proof is complete. □
The upper bound 1.5 in Theorem 4.2.1 can be improved. Actually, the best bound is known. Thus let us define the constant

   κ* = sup_{τ,σ} r_lin(τ, σ²) / r_non(τ, σ²).    (4.7)

Then κ* = 1.11..., see NR 4.2.2.
4.2.2 Almost optimality of affine algorithms

We pass to the general problem. We assume that the functional S is defined on a linear space F. We want to approximate S(f) for f belonging to a convex set E ⊂ F, based on linear information with Gaussian noise. That is, we have at our disposal information y = N(f) + x ∈ ℝⁿ, where N : F → Y = ℝⁿ is a linear operator and the noise x ~ N(0, σ²Σ).
The symmetric matrix Σ ∈ ℝ^{n×n} is assumed to be positive definite. It induces an inner product in ℝⁿ defined by ⟨y, z⟩_Y = ⟨Σ⁻¹y, z⟩₂. The error e^{w-a}(𝕀, φ) is given by (4.1). We denote by rad_aff^{w-a}(𝕀; E) and rad_non^{w-a}(𝕀; E) the minimal errors of affine and arbitrary nonlinear algorithms over E,

   rad_aff^{w-a}(𝕀; E) = inf { e^{w-a}(𝕀, φ; E) | φ affine },
   rad_non^{w-a}(𝕀; E) = inf { e^{w-a}(𝕀, φ; E) | φ arbitrary }.

Algorithms that attain the first and second infimum will be called optimal affine algorithms and optimal algorithms, respectively.
Consider first the case where the set E is an interval. That is,

   E = I(f₋₁, f₁) = { α f₋₁ + (1 − α) f₁ | 0 ≤ α ≤ 1 },

... where d_δ ≥ 0 is any number such that r(γ) ≤ r(δ) + d_δ (γ − δ), ∀ γ, see Section 2.4. The set of all such d_δ will be denoted by ∂r(δ). Observe that by taking δ = σ we obtain an algorithm which is close to
an optimal affine one in the mixed worst-average case setting. Indeed, for any affine φ = g + d ⟨w, ·⟩_Y with ||w||_Y = 1, we have

   |S(f) − φ(N(f) + x)|² = |S(f) − φ(N(f))|² − 2d (S(f) − φ(N(f))) ⟨w, Σ⁻¹x⟩₂ + d² ⟨w, Σ⁻¹x⟩₂².

If we integrate this over x ~ π = N(0, σ²Σ), the second component will vanish and the third one will become σ²d². Hence

   e^{w-a}(𝕀, φ; E) = ( sup_{f∈E} |S(f) − φ(N(f))|² + σ² d² )^{1/2}.

Since e^{wor}(N_δ, φ; E) = sup_{f∈E} |S(f) − φ(N(f))| + δ d, for δ = σ we have

   (1/√2) e^{wor}(N_σ, φ; E) ≤ e^{w-a}(𝕀, φ; E) ≤ e^{wor}(N_σ, φ; E).
p ; E) < v rad ona(IH; E)
(b = a). It turns out that for appropriately chosen 6, W6 is an optimal affine algorithm. Obviously, if a = 0 then information is exact and we can
take 6=0. Theorem 4.2.2 Let o, > 0. (i) If r(b) is a homogeneous function, i.e., r(b) = 6r'(0+), then W6 = cpo, V6, and cpo is an optimal affine algorithm. Furthermore,
rad' a(E; E) = ea(N, wo; E) = a r'(0+).
4 Worstaverage case setting
(ii) If r(δ) is not homogeneous, then there exist δ = δ(σ) and d_δ ∈ ∂r(δ) such that

   d_δ = δ r(δ) / (σ² + δ²)                                        (4.9)

and φ_δ is an optimal affine algorithm in the mixed worst-average case setting. Furthermore,

   rad^{w-a}_{aff}(ℕ; E) = e^{w-a}(ℕ, φ_δ; E) = σ r(δ) / √(σ² + δ²).
Proof  (i) In this case, the worst case error of φ₀ with information N_δ is given as

   e^{wor}(N_δ, φ₀; E) = sup_{f∈E} |S(f) − φ₀(N(f))| + δ d₀ = r(0) + δ d₀ = δ r′(0⁺) = r(δ),

which shows the optimality of φ₀ in the worst case setting for all δ. Similarly, we show that e^{w-a}(ℕ, φ₀; E) = σ r′(0⁺).

For ε > 0, let h = (f₁ − f₋₁)/2 ∈ bal(E), f₁, f₋₁ ∈ E, be such that ‖N(h)‖_Y ≤ δ and S(h) ≥ r(δ) − ε. Let I = I(f₋₁, f₁). Then the error over E is not smaller than the error over the interval I. The formula for rad^{w-a}(ℕ; I) given in Lemma 4.2.2 yields

   rad^{w-a}_{aff}(ℕ; E) ≥ rad^{w-a}_{aff}(ℕ; I) ≥ σ (r(δ) − ε) / √(σ² + δ²),

and since ε is arbitrary,

   rad^{w-a}_{aff}(ℕ; E) ≥ σ r(δ) / √(σ² + δ²).                    (4.10)

Now, replacing r(δ) by δ r′(0⁺) and letting δ → +∞, we obtain rad^{w-a}_{aff}(ℕ; E) ≥ σ r′(0⁺), which proves the optimality of φ₀.
(ii) We first show the existence of δ = δ(σ) and d_δ ∈ ∂r(δ) satisfying (4.9). Since r(γ) is a concave function of γ (see E 2.4.6), the set

   { (γ, d) | γ ≥ 0, d ∈ ∂r(γ) }

forms a continuous and nonincreasing curve. We also have that the function ψ(γ) = γ r(γ)(σ² + γ²)⁻¹ is nonnegative, continuous, and takes the value zero for γ = 0. Hence d₀ ≥ ψ(0), and it suffices to show that for large γ we have d_γ ≤ ψ(γ). Indeed, assume without loss of generality that d_γ > 0, ∀γ. Then, owing to the concavity of r(γ), there exists γ₀ > 0 such that

   r(γ₀) / (γ₀ d_{γ₀}) = a > 1.

This and the inequality

   d_γ ≤ (r(γ) − r(γ₀)) / (γ − γ₀) ≤ d_{γ₀},    γ > γ₀,
yield

   r(γ)/d_γ ≥ r(γ₀)/d_{γ₀} + (γ − γ₀) ≥ γ + γ₀(a − 1).

Consequently,

   ψ(γ)/d_γ ≥ (γ² + γγ₀(a − 1)) / (γ² + σ²),
which is larger than 1 for γ > σ²/(γ₀(a − 1)). Hence δ = δ(σ) exists.

In view of (4.9), the (squared) error of φ_δ is equal to

   (e^{w-a}(ℕ, φ_δ; E))² = sup_{f∈E} |S(f) − φ_δ(N(f))|² + σ² d_δ²
                         = (r(δ) − δ d_δ)² + σ² d_δ²
                         = σ² r²(δ) / (σ² + δ²).

This together with (4.10) proves the optimality of φ_δ and completes the proof.
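As a concrete illustration (ours, not the book's), take the one dimensional problem of Subsection 4.2.1, where the worst case radius is r(δ) = min(δ, τ). Condition (4.9) then picks δ(σ) = τ, and the optimal affine algorithm reduces to the classical shrinkage φ(y) = c_opt·y with c_opt = τ²/(τ² + σ²). A brute-force search over linear coefficients confirms this:

```python
import numpy as np

# One dimensional problem: approximate f in [-tau, tau] from y = f + x,
# x ~ N(0, sigma^2).  For a linear algorithm phi(y) = c*y the squared
# worst-average error is sup_{|f|<=tau} E(f - c(f+x))^2
#                      = (1-c)^2 tau^2 + c^2 sigma^2.
tau, sigma = 1.0, 0.5

cs = np.linspace(0.0, 1.0, 100001)
errs2 = (1 - cs)**2 * tau**2 + cs**2 * sigma**2

c_best = cs[np.argmin(errs2)]
e_best = np.sqrt(errs2.min())

# Closed forms predicted by the theory (delta(sigma) = tau here):
c_opt = tau**2 / (tau**2 + sigma**2)
r_lin = sigma * tau / np.sqrt(sigma**2 + tau**2)

assert abs(c_best - c_opt) < 1e-3
assert abs(e_best - r_lin) < 1e-6
print(c_opt, r_lin)
```

The brute-force minimizer and the closed form agree, which is exactly the content of part (ii) in this special case.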
Thus the optimal algorithm in the mixed setting turns out to be optimal in the worst case setting with appropriately chosen δ. Observe that in the proof we also showed that the minimal affine worst-average error over E equals the minimal affine worst-average error over the hardest one dimensional subset I ⊂ E. We emphasize this important fact in the following corollary.

Corollary 4.2.1  For approximating linear functionals over convex sets E we have

   rad^{w-a}_{aff}(ℕ; E) = sup_{I⊂E} rad^{w-a}_{aff}(ℕ; I).

Furthermore, if the worst case radius r(δ) is attained at h* = (f₁* − f₋₁*)/2 ∈ bal(E), f₁*, f₋₁* ∈ E, then I* = I(f₋₁*, f₁*) is the hardest one dimensional subproblem, i.e., rad^{w-a}_{aff}(ℕ; E) = rad^{w-a}_{aff}(ℕ; I*).
We note that if E is not only convex but also balanced, then the optimal affine algorithm φ_δ is linear and the hardest one dimensional subproblem is symmetric about zero. We now give two simple illustrations of Theorem 4.2.2 and Corollary 4.2.1.
Example 4.2.2  For the one dimensional problem of Subsection 4.2.1 with τ = +∞, the radius r(δ) = δ is a homogeneous function. Then the optimal linear algorithm in the worst and worst-average case settings is independent of the noise level and given as φ_opt(y) = y. The hardest subproblem does not exist, however:

   r_lin(+∞, σ²) = σ = lim_{τ→∞} r_lin(τ, σ²).

Example 4.2.3  Consider the integration problem over the class E of 1-Lipschitz periodic functions f : [0, 1] → ℝ, as in Example 2.4.2. The information is given as y_i = f(i/n) + x_i, 1 ≤ i ≤ n, where the x_i are independent and x_i ~ N(0, σ²). Recall that in this case r(γ) = 1/(4n) + γ/√n. The worst case optimal algorithm is independent of δ and equals φ_lin(y) = n⁻¹ Σ_{i=1}^n y_i. We check that (4.9) holds for δ = δ(σ) = 4σ²√n. Hence Theorem 4.2.2 yields that φ_lin is also the optimal linear algorithm in the mixed case for any σ, and

   rad^{w-a}_{aff}(ℕ; E) = √( σ²/n + 1/(16 n²) ).
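The claimed radius can be checked by simulation. The sketch below (our own; the sawtooth h is the reconstructed hardest function of this example, so its exact form should be treated as an assumption) verifies that the arithmetic mean attains squared error σ²/n + 1/(16n²) at h:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 8, 0.2

# Hypothetical hardest function: a sawtooth riding at height 4*sigma^2;
# h(i/n) = 4 sigma^2 at the sample points and its integral is 4 sigma^2 + 1/(4n).
def h(t):
    i = np.clip(np.ceil(t * n), 1, n)            # t lies in [(i-1)/n, i/n]
    return 4 * sigma**2 + 1/(2*n) - np.abs(t - (2*i - 1)/(2*n))

true_integral = 4 * sigma**2 + 1 / (4 * n)

# phi_lin: arithmetic mean of the n noisy samples y_i = h(i/n) + x_i.
t_obs = np.arange(1, n + 1) / n
m = 200000
y = h(t_obs) + sigma * rng.standard_normal((m, n))
mse = np.mean((true_integral - y.mean(axis=1))**2)

predicted = sigma**2 / n + 1 / (16 * n**2)       # bias^2 + variance
assert abs(mse - predicted) / predicted < 0.02
print(np.sqrt(predicted))
```

The bias 1/(4n) comes from the sawtooth's area above the sample values, and the variance σ²/n from averaging n independent noisy samples.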
The hardest one dimensional subproblem is [−h*, h*] where

   h*(t) = 4σ² + 1/(2n) − | t − (2i−1)/(2n) |,    t ∈ [(i−1)/n, i/n],  1 ≤ i ≤ n.

We now pass to the behavior as σ → 0⁺. We can assume that r is not constant, since otherwise the information is useless. Then, in view of Lemma 4.2.2 and Theorem 4.2.1, we have to show that it is possible to select δ = δ(σ) and d_δ in such a way that σ/δ converges to 0 or to +∞ as σ → 0⁺. Indeed, if this were not true, we would have δ → 0⁺. However, by (4.9),

   σ²/δ² = r(δ)/(δ d_δ) − 1,

and so the limit

   lim_{δ→0⁺} σ²/δ² = 0     if r(0) = 0,
   lim_{δ→0⁺} σ²/δ² = +∞    if r(0) > 0,

as claimed.
Thus we have shown that nonaffine algorithms can only be slightly better than affine algorithms. Moreover, the optimal affine algorithm 'becomes' optimal among arbitrary nonlinear algorithms as the noise level σ decreases to zero.

The Hilbert case

We end this subsection by considering the special case where E is the unit ball in a separable Hilbert space. We first show the following lemma.

Lemma 4.2.3  Suppose that {H, F} is an abstract Wiener space and µ the corresponding Gaussian measure. If E ⊂ F is the unit ball with respect to the H-norm, then for any linear algorithm φ_lin we have e^{w-a}(ℕ, φ_lin; E) = e^{ave}(ℕ, φ_lin; µ). Hence

   rad^{w-a}_{lin}(ℕ; E) = rad^{ave}_{lin}(ℕ; µ).
Proof  Indeed, writing φ_lin = d⟨·, w⟩_Y with ‖w‖_Y = 1, we obtain

   (e^{ave}(ℕ, φ_lin; µ))² = ∫_F ∫_{ℝⁿ} (S(f) − φ_lin(N(f) + x))² π(dx) µ(df)
                           = ∫_F (S(f) − φ_lin(N(f)))² µ(df) + σ²d².

Recall that for any continuous linear functional L defined on F we have ∫_F L²(f) µ(df) = ‖f_L‖²_H, where f_L ∈ H is the representer of L in H, see Subsection 3.3.2. Since K = S − φ_lin N is a continuous functional on F and f_K = f_S − d Σᵢ w_i f_i, we have

   (e^{ave}(ℕ, φ_lin; µ))² = ‖ f_S − d Σᵢ w_i f_i ‖²_H + σ²d²
                           = sup_{‖f‖_H ≤ 1} (S(f) − φ_lin(N(f)))² + σ²d²
   = (e^{w-a}(ℕ, φ_lin; E))².

Let {f_j}_{j≥n+1} be a complete orthonormal basis of W^⊥. Define F̄ as the closure of F with respect to the norm

   ‖f‖_{F̄} = ‖f_W‖_F + ( Σ_{j=n+1}^∞ λ_j a_j² )^{1/2},    f = f_W + Σ_{j=n+1}^∞ a_j f_j,  f_W ∈ W,

where {λ_j} is a positive sequence with Σ_{j=n+1}^∞ λ_j < +∞. Then {F, F̄} is an abstract Wiener space. (And the corresponding zero-mean Gaussian measure µ has correlation operator given by C_µ f = f for f ∈ W, and C_µ f_j = λ_j f_j.) Furthermore, it is easy to see that we can take S(f) = S(f_W) and N(f) = N(f_W).

In the Hilbert case, the hardest one dimensional subproblem can also be shown explicitly. Indeed, we know from Theorem 3.5.1 that the
average case approximation of S(f) with respect to the measure µ is as difficult as the average case approximation with respect to the zero-mean Gaussian measure µ_{K*}. Here K* = S − φ_lin N, and the correlation operator of µ_{K*} is given by

   A_{K*}(L) = ( L(C_µ K*) / ‖K*‖²_µ ) C_µ K*,    L ∈ F*.

Furthermore, the algorithm φ_lin is optimal in both cases. Note that µ_{K*} is concentrated on the one dimensional subspace V = span{C_µ K*}. Hence, owing to Lemma 4.2.3, φ_lin is also the optimal linear algorithm in the mixed setting for the set

   E_{K*} = { α C_µ K* ∈ V | |α| ‖K*‖_µ ≤ 1 },

and E_{K*} ⊂ E. Hence [−h*, h*], with h* = C_µ K*/‖K*‖_µ, is the hardest one dimensional subproblem.
4.2.3 Relations to other settings

We summarize relations between the mixed worst-average and other settings discussed in Subsection 4.2.2. Let S be a linear functional on a linear space F, and let information about f be given as y = N(f) + x. Consider the problem of approximating S(f) from data y in the following three settings.

(WA) Mixed worst-average case setting with a convex set E ⊂ F and the noise x ~ N(0, σ²Σ).
(WW) Worst case setting with a convex set E ⊂ F and the noise bounded by ‖x‖_Y = ⟨Σ⁻¹x, x⟩₂^{1/2} ≤ δ.
(AA) Average case setting with a Gaussian measure µ defined on F and x ~ N(0, σ²Σ).

We denote the optimal affine algorithm in the mixed setting (WA) by φ_σ.

Theorem 4.2.5
(i) If σ = δ then the algorithm φ_σ is almost optimal in the worst case setting (WW), e^{wor}(N, φ_σ; E) ≤ √2 rad^{wor}(N; E), and

   q₁ rad^{wor}(N; E) ≤ rad^{w-a}_{non}(ℕ; E) ≤ rad^{wor}(N; E),

where q₁ = 1/(√2 κ₁) = 0.63... .
(ii) Let {H, F} be the abstract Wiener space for the measure µ and let E be the unit ball with respect to the norm in H. Then the algorithm φ_σ is optimal in the average case setting (AA), e^{ave}(ℕ, φ_σ; µ) = rad^{ave}(ℕ; µ), and

   q₂ rad^{ave}(ℕ; µ) ≤ rad^{w-a}_{non}(ℕ; E) ≤ rad^{ave}(ℕ; µ),

where q₂ = 1/κ₁ = 0.90... .

We can say even more. For any σ ∈ [0, +∞] we can find δ = δ(σ) ∈ [0, +∞] such that φ_δ is an optimal affine algorithm in the mixed setting (WA) and in the worst case setting (WW). Moreover, the converse is true: for any δ ∈ [0, +∞] there exists σ = σ(δ) ∈ [0, +∞] (σ² = δ(r(δ)/d_δ − δ)) such that the algorithm φ_δ is optimal affine for (WA) and (WW). Since φ_δ is also an optimal affine algorithm in the average case setting (AA), similar relations hold between the worst and average case settings.
Example 4.2.4  Consider the abstract Wiener space {H, F} where H = W⁰_{r+1} and F = C⁰_r (r ≥ 0), and its r-fold Wiener measure w_r (see Example 3.3.1). Suppose we want to approximate a functional S ∈ F*, e.g., S(f) = ∫₀¹ f(t) dt, from noisy information y = N(f) + x, where

   N(f) = [ f(t₁), f(t₂), ..., f(t_n) ]

and the noise x is white. We know that in the average case (AA) with µ = w_r, the unique optimal algorithm is the smoothing spline algorithm. It is given as φ_σ(y) = S(s(y)), where s(y) is the natural polynomial spline of order r which belongs to W⁰_{r+1} and minimizes

   ∫₀¹ (f^{(r+1)}(t))² dt + (1/σ²) Σ_{j=1}^n (y_j − f(t_j))²

(for σ = 0, s(y) interpolates the data exactly, i.e., s(y)(t_i) = y_i, ∀i). Hence this is the unique optimal affine algorithm in the mixed setting (WA) with E the unit ball in H, and close to optimal among all algorithms in the mixed and worst case settings (WA) and (WW). Let {φ_σ} be the family of smoothing spline algorithms for 0 ≤ σ ≤ +∞. Then each φ_σ is an optimal affine algorithm in any of the three settings for appropriately chosen noise levels.
Notes and remarks

NR 4.2.1  The one dimensional problem of Subsection 4.2.1 has been studied by many authors. Bickel (1981), Casella and Strawderman (1981) and Levit (1980) looked for optimal nonlinear algorithms. It is known that the optimal algorithm is the Bayes estimator with respect to the least favorable distribution on [−τ, τ]. This least favorable distribution is concentrated on a finite number of points. Moreover, for τ/σ ≤ 1.05, this distribution assigns mass 1/2 each to ±τ. Hence in this case the algorithm φ₁ defined by (4.4) with f = τ is optimal and

   (r_non(τ, σ²))² = τ² e^{−(τ/σ)²/2} (1/√(2π)) ∫_{−∞}^{+∞} e^{−u²/2} / cosh(uτ/σ) du.

As τ/σ increases, the number of points also increases and the least favorable distribution 'tends' to the uniform distribution.
NR 4.2.2  The fact that the ratio r_lin(τ, σ²)/r_non(τ, σ²) is bounded from above by a finite constant was pointed out by Ibragimov and Hasminski (1984), who studied the case of N = I and convex balanced E. Donoho et al. (1990) and Brown and Feldman (1990) independently calculated the value of κ₁ = 1.11... .
NR 4.2.3  Li (1982) and Speckman (1979) showed optimality properties of smoothing splines for approximating functionals defined on Hilbert spaces. The results of Subsection 4.2.2 (under some additional assumptions) were first obtained by Donoho (1994), who also considered other error criteria. The generalization to an arbitrary convex set E and arbitrary linear S and N, as well as the results about asymptotic optimality of affine algorithms (as σ → 0) and the special results for the Hilbert case, is new. Lemma 4.2.3 was pointed out to me by K. Ritter in a conversation.
NR 4.2.4  In the proof of Lemma 4.2.2 we used the fact that for the one dimensional problem the 'pure noise' data are useless. This is a consequence of a more general theorem giving a sufficient condition under which randomized algorithms using noisy but fixed information are not better than nonrandomized algorithms. More precisely, let information ℕ = {π_f}, where the π_f are distributions on Y, and a set T with probability distribution ω on T, be given. Suppose that approximations to S(f), for f ∈ E, are constructed using an algorithm φ_t which is chosen randomly according to ω. The family {φ_t} with t ~ ω is called a randomized algorithm. The error of {φ_t} is defined as

   e^{ran}(ℕ, {φ_t}) = sup_{f∈E} ( ∫_T ∫_Y ‖S(f) − φ_t(y)‖² π_f(dy) ω(dt) )^{1/2}.
(We note that randomized algorithms are a special case of the well known and extensively studied Monte Carlo methods which, in general, use a random choice of algorithms as well as information. Usually exact information is studied, see e.g. Heinrich (1993), Mathé (1994), Novak (1988).) Let A be a given class of permissible algorithms. Assume that, for the information ℕ, there exists a probability distribution µ on the set E such that the average and worst-average case radii of ℕ are the same, i.e.,

   rad^{ave}(ℕ; A, µ) = rad^{w-a}(ℕ; A, E).                        (4.12)

Then randomized algorithms are not better than nonrandomized algorithms. Indeed, denote by

   e²(f, t) = ∫_Y (S(f) − φ_t(y))² π_f(dy)

the squared average error of the (nonrandomized) algorithm φ_t at f. Using the mean value theorem and assumption (4.12) we obtain

   (e^{ran}(ℕ, {φ_t}))² ≥ sup_{f∈E} ∫_T e²(f, t) ω(dt) ≥ ∫_E ∫_T e²(f, t) ω(dt) µ(df)
                        = ∫_T ∫_E e²(f, t) µ(df) ω(dt) ≥ ∫_E e²(f, t*) µ(df)
                        ≥ (rad^{ave}(ℕ; A, µ))² = (rad^{w-a}(ℕ; A, E))²,
as claimed.

For instance, for the one dimensional problem of Subsection 4.2.1, the measure µ satisfying (4.12) exists for the class of linear algorithms as well as for the class of nonlinear algorithms. For linear algorithms this measure puts equal mass at ±τ, µ({−τ}) = µ({τ}) = 1/2, which follows from the fact that the error of any linear algorithm is attained at the endpoints. For nonlinear algorithms, µ is concentrated on a finite set of points, see NR 4.2.1.
To see that pure noise data are useless for one dimensional problems, it now suffices to observe that any algorithm using such data can be interpreted as a randomized algorithm. Indeed, suppose that information y = [f + x₁, x₂, ..., x_n], where the x_i are independent. Then, for any algorithm φ : ℝⁿ → ℝ, we can define a randomized algorithm

   t = [x₂, ..., x_n],    φ_t(y₁) = φ(y),

whose 'randomized' error is equal to the worst-average case error of φ. Existence of the measure µ gives the desired result.
Exercises

E 4.2.1  Suppose we want to approximate f ∈ [−τ, τ] based on n independent observations of f, y_i = f + x_i, where x_i ~ N(0, σ²). Show that the sample mean, φ_n(y) = n⁻¹ Σ_{i=1}^n y_i, is an asymptotically optimal algorithm,

   e^{w-a}(φ_n) = σ/√n ≈ rad^{w-a}_{non}(n)    as n → +∞,

where rad^{w-a}_{non}(n) is the corresponding nth minimal error of arbitrary algorithms.

E 4.2.2  Consider the problem of E 4.2.1 with f ∈ ℝ (τ = +∞) and observations with possibly different variances, x_i ~ N(0, σ_i²), 1 ≤ i ≤ n. Show that then the algorithm

   φ(y) = ( Σ_{i=1}^n y_i/σ_i² ) / ( Σ_{i=1}^n 1/σ_i² )

is optimal among nonlinear algorithms and its error equals ( Σ_{i=1}^n σ_i⁻² )^{−1/2}.
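The solution of E 4.2.2 is the familiar precision-weighted mean; a quick Monte Carlo check (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Observations y_i = f + x_i, x_i ~ N(0, sigma_i^2), f unrestricted.
# Precision-weighted mean and its predicted error (sum sigma_i^-2)^(-1/2).
sigmas = np.array([0.5, 1.0, 2.0])
w = sigmas**-2 / np.sum(sigmas**-2)
predicted = np.sum(sigmas**-2) ** -0.5

f = 0.7                       # any fixed f; the estimator is unbiased
m = 400000
y = f + sigmas * rng.standard_normal((m, len(sigmas)))
est = y @ w
rmse = np.sqrt(np.mean((est - f)**2))

assert abs(rmse - predicted) / predicted < 0.01
print(predicted)
```

Because the estimator is unbiased and linear, its error does not depend on f, which is why a single fixed f suffices for the check.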
E 4.2.3  Suppose we approximate values S(f) of a linear functional for f belonging to a convex and balanced set E. Let information y = S(f) + x, x ~ N(0, σ²). Show that

   rad^{w-a}_{lin}(ℕ; E) = r_lin(τ, σ²)    and    rad^{w-a}_{non}(ℕ; E) = r_non(τ, σ²),

where τ = sup_{f∈E} S(f). Moreover, the optimal linear algorithm is φ_opt(y) = c_opt(τ, σ²) y.

E 4.2.4  Consider approximation of a linear functional S with F = ℝⁿ, based on information y = f + x, x ~ N(0, σ²I). Show that for a convex and balanced set E ⊂ ℝⁿ we have

   rad^{w-a}_{aff}(ℕ; E) = σ sup_{f∈E} √( S²(f) / (σ² + ‖f‖²₂) ).

E 4.2.5  Prove the uniqueness of the optimal affine algorithm φ_{δ(σ)} from Theorem 4.2.2 (if it exists). Hint: Consider first the case of E being an interval.
E 4.2.6  Consider the problem defined on a Hilbert space as in Theorem 4.2.4, but with uniformly bounded information noise, ⟨Σ⁻¹x, x⟩₂ ≤ δ². Show that then the optimal value of the regularization parameter γ is equal to

   γ(δ) = σ²(δ) = δ ( r(δ)/d_δ − δ )

(with the convention a/0 = +∞).
E 4.2.7  Let F be a separable Hilbert space. Consider approximation of a nonzero functional S = ⟨·, s⟩_F from information y = N(f) + x, where N = [⟨·, f₁⟩_F, ..., ⟨·, f_n⟩_F], ⟨f_i, f_j⟩_F = δ_{ij}, and x ~ N(0, σ²I). Denote by s₁ the orthogonal projection of s onto span{f₁, ..., f_n} and by s₂ its orthogonal complement. Show that

   δ(σ)/√(1 − δ²(σ)) = σ² ‖s₁‖_F / ( (1 + σ²) ‖s₂‖_F )    for 0 ≤ σ < +∞,

and

   σ²(δ) = δ ‖s₂‖_F / ( ‖s₁‖_F √(1 − δ²) − δ ‖s₂‖_F )    for 0 ≤ δ < ‖s₁‖_F/‖s‖_F.

4.3 Approximation of operators

Consider first the following rectangular problem. We want to approximate a vector f from the rectangle

   R(τ) = { f ∈ ℝⁿ | |f_i| ≤ τ_i, 1 ≤ i ≤ n },

with τ_i > 0, ∀i. Information y about f is given coordinatewise, i.e., y_i = f_i + x_i, 1 ≤ i ≤ n, where the x_i are independent and x_i ~ N(0, σ_i²).

Lemma 4.3.1  For the rectangular problem above we have

   rad^{w-a}_{lin}(ℕ; R(τ)) = ( Σ_{i=1}^n r²_lin(τ_i, σ_i²) )^{1/2},
   rad^{w-a}_{non}(ℕ; R(τ)) = ( Σ_{i=1}^n r²_non(τ_i, σ_i²) )^{1/2},

and the (unique) optimal linear algorithm is φ_τ(y) = (c₁y₁, ..., c_ny_n), where

   c_i = c_opt(τ_i, σ_i²) = τ_i² / (σ_i² + τ_i²),    1 ≤ i ≤ n.
Proof  Indeed, in this case the error of any algorithm φ = (φ₁, ..., φ_n), φ_i : ℝⁿ → ℝ, can be written as

   (e^{w-a}(ℕ, φ; R(τ)))² = sup_{f∈R(τ)} ∫_{ℝⁿ} Σ_{i=1}^n (f_i − φ_i(f + x))² π(dx)
                          = Σ_{i=1}^n ( sup_{|f_i|≤τ_i} ∫_ℝ (f_i − φ_i(f + x))² π_i(dx_i) ),

where π_i = N(0, σ_i²). Hence the optimal (linear or nonlinear) approximation is coordinatewise. Moreover, as the y_i are independent, the optimal φ_i uses only y_i. The lemma now follows from results about the one dimensional problem given in Lemma 4.2.1.

It turns out that such a 'coordinatewise' algorithm is an optimal linear one over a larger set than the rectangle. Indeed, observe that the squared error of φ_τ at any f ∈ ℝⁿ is Σ_{i=1}^n ((1 − c_i)² f_i² + σ_i² c_i²). Since for f ∈ R(τ) this error is maximized at f = τ, the algorithm φ_τ is also optimal over the set of f satisfying the inequality

   Σ_{i=1}^n (1 − c_i)² f_i² ≤ Σ_{i=1}^n (1 − c_i)² τ_i².
Taking into account the formulas for c_i, we get that this set is ellipsoidal,

   E(τ) = { f ∈ ℝⁿ | Σ_{i=1}^n (f_i²/a_i²) ≤ 1 },

where

   a_i²(τ) = (1 + τ_i²/σ_i²)² Σ_{j=1}^n τ_j² (1 + τ_j²/σ_j²)⁻².
Theorem 4.3.1  Let E be the ellipsoid

   E = { f ∈ ℝⁿ | Σ_{i=1}^n f_i²/b_i² ≤ 1 },

where b₁ ≥ b₂ ≥ ... ≥ b_n > b_{n+1} = 0. Let

   τ_i*² = σ_i² ( b_i (1 + Σ_{j=1}^k σ_j²/b_j²) / (Σ_{j=1}^k σ_j²/b_j) − 1 )

for 1 ≤ i ≤ k, and τ_i* = 0 for k+1 ≤ i ≤ n, where k is the smallest positive integer satisfying

   b_{k+1} ≤ ( Σ_{j=1}^k σ_j²/b_j ) / ( 1 + Σ_{j=1}^k σ_j²/b_j² ).

Then the 'coordinatewise' algorithm φ_{τ*} is the (unique) optimal linear one, and

   rad^{w-a}_{lin}(ℕ; E) = e^{w-a}(ℕ, φ_{τ*}; E)
                         = ( Σ_{j=1}^k σ_j² − (Σ_{j=1}^k σ_j²/b_j)² / (1 + Σ_{j=1}^k σ_j²/b_j²) )^{1/2}.

Furthermore,

   rad^{w-a}_{lin}(ℕ; E) ≤ κ₁ · rad^{w-a}_{non}(ℕ; E)    (κ₁ = 1.11...),

and

   rad^{w-a}_{lin}(ℕ; E) ≈ rad^{w-a}_{non}(ℕ; E)    as σ_i → 0⁺,  1 ≤ i ≤ n.
Proof  Observe first that τ* is well defined. Indeed, the minimality of k implies b_k > (Σ_{j=1}^{k−1} σ_j²/b_j)/(1 + Σ_{j=1}^{k−1} σ_j²/b_j²), and hence

   b_k (1 + Σ_{j=1}^k σ_j²/b_j²) = b_k (1 + Σ_{j=1}^{k−1} σ_j²/b_j²) + σ_k²/b_k
                                 > Σ_{j=1}^{k−1} σ_j²/b_j + σ_k²/b_k = Σ_{j=1}^k σ_j²/b_j,

which means that τ_i* > 0 for 1 ≤ i ≤ k. Obviously R(τ*) ⊂ E. We can also easily check that E ⊂ E(τ*). Indeed, for 1 ≤ i ≤ k we have a_i(τ*) = b_i, while for k+1 ≤ i ≤ n we have

   a_i(τ*) = ( Σ_{j=1}^k σ_j²/b_j ) / ( 1 + Σ_{j=1}^k σ_j²/b_j² ) ≥ b_{k+1} ≥ b_i.

Hence R(τ*) is the hardest rectangular subproblem contained in E, and we can apply Corollary 4.3.1. We obtain that φ_{τ*} is the unique optimal linear algorithm, and the radius rad^{w-a}_{lin}(ℕ; E) = e^{w-a}(ℕ, φ_{τ*}; E) can be easily calculated. As r_lin(τ_i, σ_i²) ≤ κ₁ r_non(τ_i, σ_i²), in view of Lemma 4.3.1 we have

   rad^{w-a}_{lin}(ℕ; E) = rad^{w-a}_{lin}(ℕ; R(τ*)) ≤ κ₁ rad^{w-a}_{non}(ℕ; R(τ*)) ≤ κ₁ rad^{w-a}_{non}(ℕ; E).

We also have r_lin(τ_i, σ_i²) ≈ r_non(τ_i, σ_i²) as σ_i/τ_i → 0. Hence, to complete the proof it suffices to observe that σ_i/τ_i* → 0 as all the σ_i decrease to zero.
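The construction of the hardest rectangle is easy to code. The sketch below (our own; the arrays are arbitrary test data) computes k and τ*, verifies R(τ*) ⊂ E, and checks the closed-form radius against the coordinatewise sum from Lemma 4.3.1:

```python
import numpy as np

# Hardest rectangle inside the ellipsoid sum f_i^2/b_i^2 <= 1, following
# the construction of Theorem 4.3.1 (notation as above).
b = np.array([2.0, 1.0, 0.5, 0.1])          # b_1 >= ... >= b_n
sig2 = np.array([0.09, 0.04, 0.04, 0.01])   # noise variances sigma_i^2

n = len(b)
lam = np.cumsum(sig2 / b) / (1 + np.cumsum(sig2 / b**2))  # lambda_k, k = 1..n
b_next = np.append(b[1:], 0.0)
# smallest k with b_{k+1} <= lambda_k (0-based; assumes such k exists):
k = int(np.argmax(b_next <= lam))

tau2 = np.where(np.arange(n) <= k, sig2 * (b / lam[k] - 1.0), 0.0)
assert np.all(tau2 >= -1e-12)
assert np.sum(tau2 / b**2) <= 1 + 1e-12     # R(tau*) sits inside the ellipsoid

# Radius two ways: closed form and the sum of one dimensional errors.
r2_closed = np.sum(sig2[:k+1]) \
    - np.sum(sig2[:k+1] / b[:k+1])**2 / (1 + np.sum(sig2[:k+1] / b[:k+1]**2))
r2_sum = np.sum(sig2 * tau2 / (sig2 + tau2 + (tau2 == 0)))
assert abs(r2_closed - r2_sum) < 1e-12
print(np.sqrt(r2_closed))
```

For the active coordinates the rectangle touches the ellipsoid (Σ τ_i*²/b_i² = 1), while the remaining coordinates are dropped entirely; this is the discrete analogue of Pinsker-type shrinkage.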
Another characterization of problems whose difficulty is determined by the difficulty of rectangular subproblems is given as follows. We shall say that a set E is orthosymmetric if (f₁, ..., f_n) ∈ E implies (s₁f₁, ..., s_nf_n) ∈ E for all choices of s_i ∈ {+1, −1}. A set E is quadratically convex if

   Q(E) = { (f₁², ..., f_n²) | (f₁, ..., f_n) ∈ E }

is convex. Examples of orthosymmetric and quadratically convex sets include rectangles, ellipsoids, and l_p-bodies with p ≥ 2, i.e.,

   E = { f ∈ ℝⁿ | Σ_{i=1}^n |f_i|^p / |a_i|^p ≤ 1 }.
Lemma 4.3.2  Let E be a bounded convex set of ℝⁿ. If E is orthosymmetric and quadratically convex then the condition (4.13) holds, i.e.,

   rad^{w-a}_{lin}(ℕ; E) = sup_{R(τ)⊂E} rad^{w-a}_{lin}(ℕ; R(τ)).

Proof  Let τ* be the maximizer of rad^{w-a}_{lin}(ℕ; R(τ)) over τ ∈ E. As E is orthosymmetric and convex, R(τ*) ⊂ E. We now show that E ⊂ E(τ*). For x_i ≥ 0, ∀i, let

   Ψ(x₁, ..., x_n) = ( rad^{w-a}_{lin}(ℕ; R(√x₁, ..., √x_n)) )² = Σ_{i=1}^n x_i σ_i² / (σ_i² + x_i).

Denoting by ∂A the boundary of a set A, we have that P = Q(∂E(τ*)) is a hyperplane which is adjacent to the set

   B = { x | Ψ(x) ≥ ( rad^{w-a}_{lin}(ℕ; R(τ*)) )² }.

As Q(E) is convex and the interiors of Q(E) and Q(B) have empty intersection, both sets are separated by P. Hence Q(E) ⊂ Q(E(τ*)), which implies E ⊂ E(τ*). Now we can apply Corollary 4.3.1, completing the proof.
4.3.2 The Hilbert case

We now apply the results obtained above to get optimal algorithms for some problems defined on Hilbert spaces. We assume that S is a compact operator acting between separable Hilbert spaces F and G. We want to approximate S(f) for f from the unit ball E of F. Information is linear with Gaussian noise, i.e., y = N(f) + x, where N = [⟨·, f₁⟩_F, ..., ⟨·, f_n⟩_F] and x ~ N(0, σ²Σ), Σ > 0. As always, we assume that the operators S*S and N*N (where the adjoints are meant with respect to the inner products ⟨·,·⟩_F and ⟨·,·⟩_Y) have a common orthonormal basis of eigenelements. Write this basis as {ξ_i}_{i≥1} and the corresponding eigenvalues as λ_i and η_i,

   S*S ξ_i = λ_i ξ_i,    N*N ξ_i = η_i ξ_i,    i ≥ 1,

where λ₁ ≥ λ₂ ≥ ... ≥ 0.

Our aim is to find the optimal linear algorithm and its error. It is clear that we can restrict our considerations to φ such that φ(ℝⁿ) ⊂ S(F), since otherwise we would project φ onto S(F) to obtain a better algorithm. We write φ in the form

   φ(y) = Σ_j φ_j(y) S(ξ_j)/√λ_j,

where φ_j : ℝⁿ → ℝ and the summation is taken over all j ≥ 1 with λ_j > 0. As the elements S(ξ_j)/√λ_j are orthonormal in G, for such φ we have

   (e^{w-a}(ℕ, φ))² = sup_{Σ_i ⟨f,ξ_i⟩²_F ≤ 1} ∫_{ℝⁿ} Σ_j λ_j ( ⟨f, ξ_j⟩_F − λ_j^{−1/2} φ_j(y) )² π_f(dy).

Let m denote the number of indices i ≥ 1 such that η_i > 0. Clearly, m ≤ n.
For j with η_j > 0, let q_j = N(ξ_j)/√η_j. Then the vectors q_j are orthonormal in ℝⁿ with respect to the inner product ⟨·,·⟩_Y. Let Q be an orthogonal n×n matrix whose first m columns are Σ^{−1/2}q_j, and let

   D₁ = diag{ η₁^{−1/2}, ..., η_m^{−1/2} }.

Letting ỹ consist of the first m coordinates of D₁QᵀΣ^{−1/2}y, we transform the data y = N(f) + x to

   ỹ = M(f) + x̃,    M(f) = ( ⟨f, ξ₁⟩_F, ..., ⟨f, ξ_m⟩_F ),

where the x̃_i are independent and x̃ ~ N(0, σ² diag{η₁^{−1}, ..., η_m^{−1}}). (Compare with the analogous transformation in Subsection 3.4.2.) Writing φ(y) = φ̃(ỹ) and f_j = ⟨f, ξ_j⟩_F, we obtain

   (e^{w-a}(ℕ, φ))² = sup_{Σ_j f_j² ≤ 1} Σ_j λ_j ∫ ( f_j − φ̃_j(M(f) + x̃) )² π̃(dx̃).

Thus the problem reduces to approximation over the unit ball of l₂, and the hardest rectangular subproblems are of the form R(τ), where τ ∈ l₂ is in the ellipsoid Σ_i (τ_i²/λ_i) ≤ 1. It is not difficult to see that the parameter τ* is given as follows. Let
   s = min{ i ≥ 0 | η_{i+1} = 0 or λ_{i+1} = 0 }.                  (4.14)

Then τ_i* = 0 for i ≥ s + 2, and τ₁*, ..., τ*_{s+1} are the maximizer of

   Σ_{j=1}^{s+1} σ_j² τ_j² / (σ_j² + τ_j²)

over the ellipsoid Σ_{j=1}^{s+1} (τ_j²/λ_j) ≤ 1 (if λ_{s+1} = 0 then τ_{s+1} = 0 and the last summation is taken from 1 to s), where the noise levels are σ_j² = σ² λ_j / η_j. Solving this maximization problem, we obtain the following formulas. Let k be the smallest integer satisfying k ∈ {1, 2, ..., s} and

   √λ_{k+1} ≤ ( Σ_{j=1}^k σ_j²/√λ_j ) / ( 1 + Σ_{j=1}^k σ_j²/λ_j ).    (4.15)

Theorem 4.3.2  Let λ₁ ≥ λ₂ ≥ ... > 0. Let s and k be defined by (4.14) and (4.15).
 

(i) If 1 ≤ k ≤ s then

   rad^{w-a}_{lin}(ℕ) = σ ( Σ_{i=1}^k λ_i/η_i − σ² (Σ_{j=1}^k √λ_j/η_j)² / (1 + σ² Σ_{j=1}^k η_j^{−1}) )^{1/2},

and the optimal linear algorithm is given as

   φ_lin(y) = Σ_{i=1}^k ( 1 − (σ² Σ_{j=1}^k √λ_j/η_j) / (√λ_i (1 + σ² Σ_{j=1}^k η_j^{−1})) ) η_i^{−1} ⟨N(ξ_i), y⟩_Y S(ξ_i).
(ii) If k = s + 1 then

   rad^{w-a}_{lin}(ℕ) = ( λ_{s+1} + σ² Σ_{i=1}^s (√λ_i − √λ_{s+1})² / η_i )^{1/2},

and the optimal linear algorithm is given as

   φ_lin(y) = Σ_{i=1}^s ( 1 − √(λ_{s+1}/λ_i) ) η_i^{−1} ⟨N(ξ_i), y⟩_Y S(ξ_i).

In both cases, for nonlinear algorithms we have

   rad^{w-a}_{lin}(ℕ) ≤ κ₁ · rad^{w-a}_{non}(ℕ)    (κ₁ = 1.11...).
Let us see more carefully what happens when σ → 0. We have two cases. Assume first that λ_{s+1} = 0, i.e., the radius of exact information is zero. Then we fall into case (i),

   rad^{w-a}_{lin}(ℕ) ≈ σ ( Σ_{i=1}^s λ_i/η_i )^{1/2} = σ √( tr(S(N*N)⁻¹S*) ) ≈ rad^{w-a}_{non}(ℕ)

(where the last equivalence follows from the fact that σ_j/τ_j* → 0 as σ → 0, 1 ≤ j ≤ s), and the algorithm φ_lin 'tends' to

   φ*(y) = Σ_{i=1}^s η_i^{−1} ⟨N(ξ_i), y⟩_Y S(ξ_i).

Observe that if F = span{ξ₁, ..., ξ_s} then φ* is nothing else but the least squares algorithm φ_ls = S N⁻¹ P_N (where P_N is the orthogonal projection in the space Y onto N(F) with respect to ⟨·,·⟩_Y). Indeed, for y = N(f) + x with P_N x = 0, we have

   φ*(y) = Σ_{i=1}^s ⟨f, ξ_i⟩_F S(ξ_i) = S(f) = φ_ls(y).

Suppose now that λ_{s+1} > 0, which means that the radius of exact information is positive. Then

   lim_{σ→0} rad^{w-a}_{lin}(ℕ) = √λ_{s+1} = lim_{σ→0} rad^{w-a}_{non}(ℕ),

and the optimal linear algorithm is independent of the noise level σ, provided that it is small.
In the statistical literature, this is called a nonparametric regression model, and it was studied, e.g., in Golubev and Nussbaum (1990), Nussbaum (1985), Speckman (1985), Stone (1982) (see also the book of Eubank, 1988). It is known that for this problem the radius of information is asymptotically (as n → ∞) achieved by a smoothing spline algorithm, and that this radius satisfies

   rad^{w-a}_{non}(ℕ; n) ≈ γ^{1/2}(r) P^{1/(2r+1)} (σ²/n)^{r/(2r+1)},        (4.18)

where γ(r) = (2r+1)^{1/(2r+1)} ( r/(π(r+1)) )^{2r/(2r+1)} is Pinsker's constant, see Nussbaum (1985). The main idea for proving this result is as follows. For large n, the L₂-norm of a function f roughly equals ‖f‖_n = ( n⁻¹ Σ_{i=1}^n f²(i/n) )^{1/2}. So we can consider the error with respect to the seminorm ‖·‖_n instead of the L₂-norm. The set E_n = { (f(0), f(1/n), ..., f((n−1)/n), f(1)) | f ∈ E_P } is an ellipsoid. Hence the original problem can be reduced to that of approximating a vector v ∈ E_n ⊂ ℝⁿ from information y = v + x, where x ~ N(0, σ²I). If we find the coordinates of E_n (which is the main difficulty in this problem), the results of Subsection 4.3.1 can be applied. Golubev and Nussbaum (1990) showed that we cannot do better by allowing observations at arbitrary points. That is, the estimate (4.18) holds also for the nth minimal radius, and equidistant observations are asymptotically optimal.
NR 4.3.4  Recently, Donoho and Johnstone (1992) (see also Donoho, 1995, and Donoho et al., 1995) developed a new algorithm for approximating functions from their noisy samples at equidistant points. The algorithm is nonlinear. It uses the wavelet transform and relies on translating the empirical wavelet coefficients towards the origin by √(2 log n) · σ/√n. Surprisingly enough, such a simple algorithm turns out to be nearly optimal for estimating many classes of functions, in standard Hölder and Sobolev spaces as well as in more general Besov and Triebel spaces. More precisely, suppose that f is in the unit ball of the Besov space B^s_{p,q} or the Triebel space F^s_{p,q}. Then the minimal errors of arbitrary and linear algorithms using n noisy samples are given as

   rad^{w-a}_{non}(n) ≍ n^{−k}    and    rad^{w-a}_{lin}(n) ≍ n^{−k′},

where

   k = s / (s + 1/2)    and    k′ = (s + 1/p₋ − 1/p) / (s + 1/2 + 1/p₋ − 1/p),

with p₋ = max{p, 2}. Hence, for p < 2, no linear algorithm can achieve the optimal rate of convergence. For information on Besov and Triebel spaces see, e.g., Triebel (1992). An introduction to wavelets may be found in Daubechies (1992) or Meyer (1990).
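A self-contained toy version of this scheme (our own illustration; the Haar transform stands in for a general wavelet transform, and since the noise here has unit variance per sample, the universal threshold is σ√(2 log n) rather than the normalized σ√(2 log n)/√n above):

```python
import numpy as np

rng = np.random.default_rng(2)

def haar(v):                       # orthonormal Haar transform, len power of 2
    v = np.asarray(v, dtype=float)
    details = []
    while len(v) > 1:
        a = (v[0::2] + v[1::2]) / np.sqrt(2)
        d = (v[0::2] - v[1::2]) / np.sqrt(2)
        details.append(d)
        v = a
    return v, details[::-1]        # coarse coefficient, details coarse->fine

def ihaar(a, details):
    v = a
    for d in details:
        out = np.empty(2 * len(v))
        out[0::2] = (v + d) / np.sqrt(2)
        out[1::2] = (v - d) / np.sqrt(2)
        v = out
    return v

n, sigma = 1024, 1.0               # sigma assumed known
t = np.arange(n) / n
f = np.where(t < 0.5, t, 1.5 - t)  # piecewise linear test signal
y = f + sigma * rng.standard_normal(n)

thr = sigma * np.sqrt(2 * np.log(n))          # universal threshold
a, det = haar(y)
det = [np.sign(d) * np.maximum(np.abs(d) - thr, 0) for d in det]  # soft shrink
fhat = ihaar(a, det)

assert np.mean((fhat - f)**2) < np.mean((y - f)**2)   # denoising helps
print(np.mean((fhat - f)**2))
```

Soft thresholding kills the many small noise-dominated coefficients while keeping (and slightly shrinking) the few large signal-dominated ones, which is why the procedure adapts to spatially inhomogeneous smoothness where linear methods cannot.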
NR 4.3.5  As far as we know, the computational complexity in the mixed worst-average case setting has not been studied so far. There are two main difficulties that have to be overcome before finding concrete complexity formulas. The first difficulty lies in obtaining optimal information. As optimal algorithms are not known exactly even for problems defined in Hilbert spaces, results on optimal information are limited. (For some special cases see NR 4.3.3 and E 4.3.6, 4.3.7.) The second difficulty is adaption. In the mixed worst-average case setting, the situation seems to be much more complicated than in the worst and average case settings, even for problems defined on convex and balanced sets. For instance, Ibragimov and Hasminski (1982) and Golubev (1992) proved that in the nonparametric regression model the (nonadaptive) equidistant design is asymptotically optimal in the class of adaptive designs. This is, however, no longer true for integration. Recently, Plaskota (1995b) gave an example
where adaption helps significantly for multivariate integration for a convex and balanced class of functions. More precisely, let

   F = { f : D = [0,1]^d → ℝ | |f(x) − f(y)| ≤ ‖x − y‖ },    d ≥ 2.

Suppose we want to approximate the integral S(f) = ∫_D f(u) du from observations of the function values. Let r_non(n, σ) and r_ad(n, σ) be the minimal errors that can be achieved using exactly n nonadaptive and adaptive observations with variance σ², respectively. Then, for any σ ≥ 0, we have r_non(n, σ) ≍ n^{−1/d}, but

   r_ad(n, σ) ≍ n^{−1/d}          for σ = 0,
   r_ad(n, σ) ≍ n^{−1/2 − 1/d}    for σ > 0.

Thus, for large d, adaptive information can be much better than nonadaptive information. Moreover, using adaptive information one can obtain much better approximations for noisy data than for exact data. These results hold because adaption and noisy information make Monte Carlo simulation possible.
Exercises

E 4.3.1  Show that the number k in Theorem 4.3.1 can be equivalently defined as the largest integer satisfying 1 ≤ k ≤ n and

   b_k > ( Σ_{j=1}^{k−1} σ_j²/b_j ) / ( 1 + Σ_{j=1}^{k−1} σ_j²/b_j² ).

For σ > 0 we define E_σ = σE = { σf ∈ F | f ∈ E } and the information ℕ_σ with π_{σ,f} = N(N(f), σ²Σ). Show that for an arbitrary linear solution operator S, the minimal linear, affine, or nonlinear errors satisfy

   rad^{w-a}(ℕ_σ; E) = σ rad^{w-a}(ℕ₁; E_{1/σ}).

E 4.3.6  Consider the problem of approximating a vector f from the unit ball of ℝ^d. Let r_n(σ₁, ..., σ_n) (0 < σ₁ ≤ ... ≤ σ_n) be the minimal error that can be attained by linear algorithms that use n (n ≥ d) independent observations y_i = ⟨f, f_i⟩₂ + x_i with Gaussian noise, x_i ~ N(0, σ_i²), where ‖f_i‖₂ ≤ 1, 1 ≤ i ≤ n. Show that
   r_n(σ₁, ..., σ_n) = min_η ( Σ_{i=1}^d (1 + η_i)^{−1} )^{1/2},

where the minimum is taken over all η_i ≥ 0 such that

   Σ_{j=r}^n η_j ≤ Σ_{j=r}^n σ_j^{−2},    1 ≤ r ≤ n.                (4.19)
In particular, for n = d we have

   r_n(σ₁, ..., σ_n) = ( Σ_{i=1}^d σ_i² / (1 + σ_i²) )^{1/2}.

What is the optimal information? Hint: Use Lemma 2.8.1 to obtain the upper bound and optimal information.
E 4.3.7  Consider the optimal information problem as in E 4.3.6, but with E = ℝ^d and an arbitrary solution operator S : ℝ^d → G. Let λ₁ ≥ ... ≥ λ_d > 0 be the eigenvalues of S*S. Show that

   r_n(σ₁, ..., σ_n) = min_η ( Σ_{i=1}^d λ_i/η_i )^{1/2},

where the minimum is taken over all η_i > 0 satisfying (4.19). In particular, show that for equal variances σ_i² = σ² we have

   r_n(σ) = (σ/√n) Σ_{i=1}^d λ_i^{1/2}.

Find the optimal information.
5
Average-worst case setting
5.1 Introduction

In the previous chapter, we studied the mixed worst-average case setting, in which we have deterministic assumptions on the problem elements and stochastic assumptions on the noise. In the present chapter, we analyze the second mixed setting, called the average-worst case setting. It is obtained by exchanging the assumptions of the previous chapter. More precisely, we assume some (Gaussian) distribution µ on the elements f. The information operator is defined as in the worst case setting of Chapter 2. That is, N(f) is a set of finite real sequences. The error of an algorithm φ that uses information N is given as

   e^{a-w}(N, φ) = ( ∫_F sup_{y∈N(f)} ‖S(f) − φ(y)‖² µ(df) )^{1/2}.

The average-worst case setting seems to be new and its study has been initiated only recently. Nevertheless, as a counterpart of the widely studied worst-average case setting, it is also important and leads to interesting and nontrivial results. The main results of this chapter concern approximating linear functionals, and are presented in Section 5.2. It turns out that in this case the average-worst case setting can be analyzed similarly to the worst-average case setting, although they seem to be quite different. Assuming that the information noise is bounded in a Hilbert norm, we establish a close relation between the average-worst setting and the corresponding average
case setting. That is, in both settings, optimal linear algorithms belong to the same class of smoothing spline algorithms. Moreover, the minimal errors are (almost) attained by the same algorithm, and the ratio of the minimal errors is bounded by at most 1.5. Using once more the concept
of hardest one dimensional subproblems, we show that nonlinear algorithms cannot be much better than linear algorithms for approximating linear functionals. The relation between the average-worst and average settings, together with relations established in the previous sections, enables us to formulate a theorem about the (almost) equivalence of all four corresponding settings for linear functionals. In Section 5.3, we present some results for operators. In particular, we show optimality properties of the least squares algorithm.
5.2 Linear algorithms for linear functionals In this section, we construct almost optimal algorithms for the case when
the solution operator S is a linear functional. To do this, we use ideas similar to those in the worst-average case setting.
5.2.1 The one dimensional problem
Suppose we want to approximate a real random variable f which has zero-mean normal distribution with variance λ > 0, f ~ N(0, λ). We assume that instead of f we know only its noisy value y = f + x, where |x| ≤ δ. That is, the information operator is N(f) = [f − δ, f + δ]. In this case, the error of an algorithm φ is given as

e^{aw}(N, φ) = e^{aw}(λ, δ; φ) = ( (2πλ)^{−1/2} ∫_ℝ sup_{|x|≤δ} |f − φ(f + x)|² exp{−f²/(2λ)} df )^{1/2}.
First, we consider linear algorithms. Let

r_lin(λ, δ) = inf { e^{aw}(λ, δ; φ) | φ linear }.
Lemma 5.2.1 For all λ > 0 and δ > 0 we have

r_lin(λ, δ) =
    δ                                                        for δ² ≤ 2λ/π,
    ( λδ²(1 − 2/π) / (λ + δ² − 2δ(2λ/π)^{1/2}) )^{1/2}       for 2λ/π ≤ δ² ≤ πλ/2,
    λ^{1/2}                                                  for πλ/2 ≤ δ².

The optimal linear algorithm φ(y) = c_opt · y is uniquely determined and its coefficient is given as

c_opt = c_opt(λ, δ) =
    1                                                        for δ² ≤ 2λ/π,
    ( λ − δ(2λ/π)^{1/2} ) / ( λ + δ² − 2δ(2λ/π)^{1/2} )      for 2λ/π ≤ δ² ≤ πλ/2,
    0                                                        for πλ/2 ≤ δ².
Proof For a linear algorithm φ(y) = c·y and f ∈ ℝ we have

sup_{|x|≤δ} |f − φ(f + x)|² = sup_{|x|≤δ} |(1 − c)f − cx|² = ( |1 − c| |f| + |c| δ )²,

and the formulas of the lemma follow by integrating this expression with respect to the distribution of f and minimizing over c.

For nonlinear algorithms we have the lower bound

r_non(1, δ) ≥ r_lin(1, δ) ψ(δ),        (5.2)

where

ψ(δ) = (2/π)^{1/2} e^{−δ²/2} ∫_0^{+∞} exp{−y²/2} / cosh(δ y) dy.

The inequality r_non(1, δ) ≤ r_lin(1, δ) is obvious. To show the reverse inequality, observe that the function ψ is decreasing. This, (5.2) and Lemma 5.2.1 yield that for δ ∈ [0, 1] the ratio r_non(1, δ)/r_lin(1, δ) is bounded from below by a positive constant.
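The closed-form expressions of Lemma 5.2.1 can be checked numerically. From the proof, the error of a linear algorithm φ(y) = cy with c ∈ [0, 1] satisfies e^{aw}(λ, δ; φ)² = (1 − c)²λ + 2c(1 − c)δ(2λ/π)^{1/2} + c²δ². The Python sketch below (the test values of λ and δ are arbitrary choices) minimizes this expression over a grid of c and compares the result with the closed form:

```python
import math

def c_opt(lam, delta):
    """Optimal coefficient of the linear algorithm phi(y) = c*y (Lemma 5.2.1)."""
    m = math.sqrt(2.0 * lam / math.pi)          # m = E|f| for f ~ N(0, lam)
    if delta ** 2 <= 2.0 * lam / math.pi:
        return 1.0
    if delta ** 2 >= math.pi * lam / 2.0:
        return 0.0
    return (lam - delta * m) / (lam + delta ** 2 - 2.0 * delta * m)

def err(lam, delta, c):
    """Average-worst error of phi(y) = c*y for c in [0, 1]."""
    m = math.sqrt(2.0 * lam / math.pi)
    return math.sqrt((1 - c) ** 2 * lam + 2 * c * (1 - c) * delta * m
                     + c ** 2 * delta ** 2)

def r_lin(lam, delta):
    """Minimal linear error from Lemma 5.2.1."""
    m = math.sqrt(2.0 * lam / math.pi)
    if delta ** 2 <= 2.0 * lam / math.pi:
        return delta
    if delta ** 2 >= math.pi * lam / 2.0:
        return math.sqrt(lam)
    return math.sqrt(lam * delta ** 2 * (1 - 2 / math.pi)
                     / (lam + delta ** 2 - 2 * delta * m))

# brute-force minimization over c reproduces the closed form in all three regimes
for lam, delta in [(1.0, 0.3), (1.0, 1.0), (1.0, 2.0)]:
    grid_min = min(err(lam, delta, k / 10000.0) for k in range(10001))
    assert abs(grid_min - r_lin(lam, delta)) < 1e-4
    assert abs(err(lam, delta, c_opt(lam, delta)) - r_lin(lam, delta)) < 1e-9
```

The regime boundaries 2λ/π and πλ/2 in δ² correspond to the three cases of the lemma.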
Let

rad_lin^{aw}(N; µ) = inf { e^{aw}(N, φ; µ) | φ linear },
rad_non^{aw}(N; µ) = inf { e^{aw}(N, φ; µ) | φ arbitrary },
be the minimal errors of linear and arbitrary nonlinear algorithms. Consider first the case when the measure µ is concentrated on a one dimensional subspace.
Lemma 5.2.2 Let h ∈ F and let µ be the zero-mean Gaussian measure with correlation operator

C_µ(L) = L(h) h,   ∀L ∈ F*.

(i) If N(h) = 0 then rad_lin^{aw}(N; µ) = rad_non^{aw}(N; µ) = |S(h)|, and φ = 0 is the optimal algorithm.
(ii) If N(h) ≠ 0 then

rad_lin^{aw}(N; µ) = |S(h)| r_lin(1, δ/||N(h)||_Y),
rad_non^{aw}(N; µ) = |S(h)| r_non(1, δ/||N(h)||_Y),

and the optimal linear algorithm is given as

φ(y) = c_opt(1, δ/||N(h)||_Y) · S(h) · ⟨y, Σ^{−1}N(h)⟩_2 / ||N(h)||²_Y,

where r_non(·,·) and c_opt(·,·) are as in Lemma 5.2.1.
Proof (i) Since N vanishes on span{h}, information consists of pure noise only. Let φ : ℝⁿ → ℝ be an arbitrary algorithm. Then, for any a ∈ ℝⁿ with ||a||_Y ≤ δ, the error of the constant algorithm φ_a ≡ φ(a) is not larger than the error of φ. Hence zero provides the best approximation, and the minimal error equals |S(h)|, as claimed.
(ii) Define the random variable a = a(f) by f = a·h. Then a has the standard Gaussian distribution. Similarly to the proof of Lemma 4.2.2, letting z = Σ^{−1/2}y/||N(h)||_Y, we can transform the data y = N(f) + x to z = a·w + x′, where w = Σ^{−1/2}N(h)/||N(h)||_Y and ||x′||_2 ≤ δ/||N(h)||_Y. Using an orthogonal matrix Q with Qw = e_1 we get that the original problem of approximating S(f) from data y is equivalent to that of approximating s(a) = a·S(h), a ~ N(0, 1), from data

z̃ = Qz = [a + x̃_1, x̃_2, ..., x̃_n],   where ||x̃||_2 = ||x′||_2 ≤ δ/||N(h)||_Y.
Assume first that σ(δ) > 0. Then

( rad_lin^{aw}(N; µ) )² ≥ ∫_F ( rad_aff^{aw}(N; µ_{K_σ}(·|g)) )² µP_{K_σ}^{−1}(dg),

where P_{K_σ}(f) = f − K_σ(f) C_µK_σ/||K_σ||²_µ. Since the measures µ_{K_σ}(·|g) have the same (independent of g) correlation operator

A_{K_σ}(L) = L(C_µK_σ) C_µK_σ / ||K_σ||²_µ,   a.e. g,

and they differ only in the mean g, the minimal error of affine algorithms with respect to µ_{K_σ}(·|g) is independent of g. That is,

rad_aff^{aw}(N; µ_{K_σ}(·|g)) = rad_lin^{aw}(N; µ_{K_σ}),

where µ_{K_σ} = µ_{K_σ}(·|0). Now we can use Lemma 5.2.2 to obtain the affine algorithm φ_g with error equal to rad_aff^{aw}(N; µ_{K_σ}(·|g)). We have

φ_g(y) = S(g) + c_opt(1, δ/||N(h_σ)||_Y) · S(h_σ) · ⟨ y − N(g), Σ^{−1}N(h_σ) ⟩_2 / ||N(h_σ)||²_Y,
where h_σ = C_µK_σ/||K_σ||_µ. We find that

N(h_σ) = N(C_µK_σ)/||K_σ||_µ,   ||N(h_σ)||_Y = ρ^{−1}(σ),

and

S(h_σ) = S(C_µK_σ)/||K_σ||_µ = ||K_σ||_µ + ρ^{−1}(σ) ⟨Σw_σ, w_σ⟩_2^{1/2}

(compare with the proof of Theorem 3.5.1). Hence

φ_g(y) = S(g) + c_opt(1, δρ(σ)) (1 + σ²ρ²(σ)) ⟨y − N(g), w_σ⟩_2.

Since σ satisfies (5.4) and S(g) − ⟨w_σ, N(g)⟩_2 = K_σ(g) = 0 for all g ∈ P_{K_σ}(F), we finally obtain

φ_g = c_opt(1, δρ(σ)) (1 + σ²ρ²(σ)) ⟨·, w_σ⟩_2 = φ_σ.

Thus the same (linear) algorithm φ_σ minimizes the errors over µ_{K_σ}(·|g) for a.e. g. Hence this is the optimal linear algorithm. To find the error of φ_σ, observe that by the formula for S(h_σ) above, Lemma 5.2.2 gives

rad_non^{aw}(N; µ) ≥ rad_non^{aw}(N; µ_{K_σ}),
e^{aw}(N, φ_σ; µ) = |S(h_σ)| r_lin(1, δρ(σ)) = ( ||K_σ||_µ + ρ^{−1}(σ) ⟨Σw_σ, w_σ⟩_2^{1/2} ) r_lin(1, δρ(σ)).

Consider now the case when σ(δ) = 0. Proceeding as for σ(δ) > 0 we get that for any γ > 0

rad_lin^{aw}(N; µ) ≥ rad_lin^{aw}(N; µ_{K_γ}) ≥ ( ||K_γ||_µ + ρ^{−1}(γ) ⟨Σw_γ, w_γ⟩_2^{1/2} ) r_lin(1, δρ(γ)).

We have already noticed that in the case σ(δ) = 0 we have ρ(0) < +∞ and c_opt(1, δρ(0)) = 1. In view of Lemma 5.2.1, this means that r_lin(1, δρ(0)) = δρ(0). We also have ||K_0||_µ = 0, which follows from the proof of Lemma 5.2.3. Hence, letting γ → 0⁺ and using continuity arguments, we get

rad_lin^{aw}(N; µ) ≥ ρ^{−1}(0) ⟨Σw_0, w_0⟩_2^{1/2} · δρ(0) = δ ⟨Σw_0, w_0⟩_2^{1/2}.   (5.5)
On the other hand, in the case σ(δ) = 0 we have S(f) = φ_0(N(f)) a.e. f. Hence the error of φ_0 comes from the noise only:

e^{aw}(N, φ_0; µ) = sup_{||x||_Y ≤ δ} |⟨x, w_0⟩_2| = δ ⟨Σw_0, w_0⟩_2^{1/2},

so the lower bound (5.5) is attained by φ_0 in the limit γ → 0⁺, completing the proof.
Thus nonlinear algorithms can only be slightly better than linear algorithms.
5.2.3 Relations to other settings
In Subsection 4.2.3 we established close relations between optimal approximation of functionals in the mixed worst-average case setting and in the other settings. In this subsection we show similar relations for the mixed average-worst case setting. These follow from the results of Subsection 5.2.2.

Let µ be a zero-mean Gaussian measure on F and let H ⊂ F be a separable Hilbert space such that {H, F_1} with F_1 = supp µ is the abstract Wiener space for µ. We consider the problem of approximating a continuous linear functional S(f) from noisy information y = N(f) + x in the following four settings.

(AW) Average-worst case setting with the measure µ on F and the noise bounded by ||x||_Y = ⟨Σ^{−1}x, x⟩_2^{1/2} ≤ δ.
(WW) Worst case setting with E the unit ball in H and ||x||_Y = ⟨Σ^{−1}x, x⟩_2^{1/2} ≤ δ.
(AA) Average case setting with the measure µ and x ~ N(0, σ²Σ).
(WA) Mixed worst-average case setting with E the unit ball in H and x ~ N(0, σ²Σ).

As before, we denote by φ_o the optimal (linear) algorithm in the average case setting. Recall that φ_o can be interpreted as the smoothing spline algorithm, φ_o(y) = S(s(y)), where s(y) is the minimizer of

||f||²_H + σ^{−2} ||y − N(f)||²_Y

in H.

Theorem 5.2.4 Let σ = δ. Then we have

e^{aw}(N, φ_o; µ) ≤ √2 rad_lin^{aw}(N; µ) ≤ √2 κ₂ rad_non^{aw}(N; µ)

(κ₂ ≤ 1.5), and the corresponding minimal errors in the settings (AW), (WW), (AA) and (WA) differ from one another by at most constant factors.

Show that for δ² ≥ π/2 the optimal linear algorithm is φ_o ≡ 0, while for δ² < π/2 it is given as φ_σ(y) = (1 + σ²)^{−1} S(y), where σ = σ(δ) = ( 1/c_opt(1, δ) − 1 )^{1/2}. Furthermore, show that the error of φ_σ equals ||S|| r_lin(1, δ).
E 5.2.7 Let δ > 0. Show that the necessary and sufficient condition for the algorithm φ_o to be optimal is that K_0 = S − φ_o N = 0 a.e. on F, and

δ² ≥ (π/2) ⟨Σw_0, G^{−1} Σw_0⟩_2 / ⟨Σw_0, w_0⟩_2.

Suppose now that f = (f_1, ..., f_n) has independent zero-mean Gaussian coordinates with variances λ_i > 0, 1 ≤ i ≤ n. That is, the joint probability distribution µ on ℝⁿ is the zero-mean Gaussian measure with diagonal correlation matrix. Information about f is given coordinatewise, y_i = f_i + x_i where |x_i| ≤ δ_i, 1 ≤ i ≤ n.
A sequence y ∈ ℝ^∞ is called noisy information about f if for all n ≥ 1 the vector xⁿ = yⁿ − Nⁿ(f) is in a given set B(∆ⁿ, Nⁿ(f)) ⊂ ℝⁿ of all possible values of the nth information noise corresponding to exact information Nⁿ(f). Here yⁿ = [y_1, ..., y_n] and the sets B(∆ⁿ, Nⁿ(f)) satisfy conditions (i)–(iii) of Subsection 2.7.1. That is,

(i) B(0, z) = {0},
(ii) if ∆̃ⁿ ≤ ∆ⁿ then B(∆̃ⁿ, z) ⊂ B(∆ⁿ, z),
(iii) B(∆ⁿ, zⁿ) = { x ∈ ℝⁿ | ∃ a ∈ ℝ, [x, a] ∈ B(∆ⁿ⁺¹, zⁿ⁺¹) }.

Defining the nth information N_n = {Nⁿ, ∆ⁿ} as

N_n(f) = { yⁿ ∈ ℝⁿ | yⁿ − Nⁿ(f) ∈ B(∆ⁿ, Nⁿ(f)) },

we can equivalently say that y ∈ ℝ^∞ is information about f if for all n ≥ 1 the vector yⁿ is the nth noisy information about f, yⁿ ∈ N_n(f).

We now pass to adaptive information. It is given by a family N = {N_y}_{y∈ℝ^∞}, where N_y = {N_y, ∆_y} is nonadaptive information with

N_y = [L_1, L_2(·; y_1), ..., L_n(·; y_1, ..., y_{n−1}), ...]

and ∆_y = [δ_1, δ_2(y_1), ..., δ_n(y_1, ..., y_{n−1}), ...]. For adaptive N, a sequence y is called (noisy) information about f (y ∈ N(f)) if yⁿ ∈ N_y^n(f) for all n ≥ 1.
For a given solution operator S : F → G, where G is a normed space, an approximation to S(f) is provided by an algorithm. The algorithm is a sequence of transformations φ = {φ_n}_{n≥0}, where φ_n : ℝⁿ → G (φ_0 is a fixed element of G). The nth error of the algorithm φ at f ∈ F with information y ∈ N(f) is given by the distance ||S(f) − φ_n(yⁿ)||. We shall only consider the case where the solution operator S is linear, the functionals L_i are linear, and the sets B(∆ⁿ, zⁿ) = B(∆ⁿ) are unit balls in some extended norms || · ||_{∆ⁿ} of ℝⁿ.
Recall that the last assumption and the conditions (i)–(iii) imply that (see E 2.7.1)

||xⁿ||_{∆ⁿ} = min_{t∈ℝ} ||[xⁿ, t]||_{∆ⁿ⁺¹},   n ≥ 1, xⁿ ∈ ℝⁿ.   (6.1)
6.2.2 Optimal algorithms
Our first goal is to characterize the best possible behavior of the error ||S(f) − φ_n(yⁿ)|| for fixed (adaptive) information N. The nth (worst case) radii of nonadaptive information N_y will play a crucial role. These radii are given as the usual worst case radii of N_y^n with respect to the unit ball of F, i.e.,

rad_n^{wor}(N_y) = rad^{wor}(N_y^n) = inf_{φ_n} sup_{||f||_F ≤ 1} sup_{yⁿ ∈ N_y^n(f)} ||S(f) − φ_n(yⁿ)||.
For n ≥ 1 write n = 2^k + i, 0 ≤ i ≤ 2^k − 1, and let the nth approximation φ_n(yⁿ) be given by the linear spline interpolating the data yⁿ. Then for any f ∈ F and yⁿ ∈ N_n(f) we have

||f − φ_n(yⁿ)||_∞ ≤ M/n,

where M = M(f) depends only on the Lipschitz constant of f. Hence for all f we have at least linear convergence of the successive approximations to f, while the radii rad_n^{wor}(N) do not converge at all.

In the last example, the space F is not complete. The completeness of F and the continuity of information turn out to be crucial assumptions: if these conditions are met, {rad_n^{wor}(N_y)} also establishes a lower bound on the speed of convergence of the error ||S(f) − φ_n(yⁿ)||. This is made precise in Theorem 6.2.1 below.
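The linear spline bound is easy to reproduce numerically. The sketch below is an illustration only: it uses an assumed Lipschitz function with constant M = 1 and n + 1 exact equispaced samples on [0, 1] (rather than the dyadic design of the example), and checks that the sup-norm error of the interpolating linear spline is at most M/n:

```python
import numpy as np

def lipschitz_f(x):
    # |f(s) - f(t)| <= M |s - t| with M = 1 for f(x) = |x - 0.37|
    return np.abs(x - 0.37)

def linear_spline_error(n):
    """Sup-norm error of the linear spline interpolating n+1 equispaced samples."""
    knots = np.linspace(0.0, 1.0, n + 1)
    xs = np.linspace(0.0, 1.0, 20001)          # dense grid for the sup norm
    spline = np.interp(xs, knots, lipschitz_f(knots))
    return np.max(np.abs(lipschitz_f(xs) - spline))

# for a Lipschitz-1 function the error is at most M/n at every resolution
for n in [4, 16, 64, 256]:
    assert linear_spline_error(n) <= 1.0 / n + 1e-12
```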
Theorem 6.2.1 Suppose that the domain F of the linear solution operator S is a Banach space, and that information consists of continuous linear functionals L_i(·; y_1, ..., y_{i−1}). Let τ(y), y ∈ ℝ^∞, be arbitrary nonnegative sequences such that

τ(y) = [τ_1, τ_2(y_1), ..., τ_n(y_1, ..., y_{n−1}), ...]

and lim_{n→∞} τ_n(y) = 0. Then the set

A = { f ∈ F | limsup_{n→∞} ||S(f) − φ_n(yⁿ)|| / ( τ_n(y) rad_n^{wor}(N_y) ) < +∞, ∀y ∈ N(f) }

is boundary, i.e., it does not contain any ball in F. (Here 0/0 = +∞.)
Proof Suppose to the contrary that A contains a closed ball B of radius r, 0 < r ≤ 1. We shall show that then it is possible to find an element f* ∈ B and information y* about f* such that

limsup_{n→∞} ||S(f*) − φ_n((y*)ⁿ)|| / ( τ_n(y*) rad_n^{wor}(N_{y*}) ) = +∞.   (6.4)
We first construct by induction a sequence {f_k}_{k≥1} ⊂ B, a sequence of integers 0 = n_0 < n_1 < n_2 < ⋯, and y_k ∈ N_{y_k}(f_k), k ≥ 1, which satisfy the conditions below. Suppose that for k ≥ 1 we have constructed f_1, ..., f_k, n_0 < ⋯ < n_{k−1}, and y_1, ..., y_{k−1}. We select y_k = [y_{k,1}, y_{k,2}, ...] in such a way that y_{k,i} = y_{k−1,i} for 1 ≤ i ≤ n_{k−1}, and

|| y_k^n − N_{y_k}^n(f_k) ||_{∆ⁿ} = || y_k^{n_{k−1}} − N_{y_k}^{n_{k−1}}(f_k) ||_{∆^{n_{k−1}}},   n ≥ n_{k−1} + 1.

Note that in view of (6.1) this selection is possible. (For k = 1 we set y_{1,i} = L_i(f_1; y_1^{i−1}), so that y_1 = N_{y_1}(f_1).) Clearly, y_k is noisy information about f_k. Now we choose r_k > 0 such that for ||f − f_k|| ≤ r_k we have ||S(f) − S(f_k)|| ≤ (1/3)||S(f_k) − S(f_{k−1})||. (For k = 1 we choose r_1 analogously.) Next we take n_k > n_{k−1} for which

τ_{n_k}(y_k) ≤ min { r_k, (r/2)^k }   and   ||S(f_k) − φ_{n_k}(y_k^{n_k})|| ≤ (1/10) τ_{n_k}(y_k) rad_{n_k}^{wor}(N_{y_k}).

Owing to (6.2) there exists h_k ∈ F such that

(i) ||N_{y_k}(h_k)||_{∆^{n_k}} ≤ τ_{n_k}(y_k),
(ii) ||h_k||_F ≤ τ_{n_k}(y_k),
(iii) ||S(h_k)|| ≥ (1/4) τ_{n_k}(y_k) rad_{n_k}^{wor}(N_{y_k}).
We now set f_{k+1} = f_k + h_k. Observe that for k = 1 we have

|| y_1^{n_1} − N_{y_1}^{n_1}(f_2) ||_{∆^{n_1}} = || N_{y_1}^{n_1}(h_1) ||_{∆^{n_1}} ≤ 2^{−1},

while for k ≥ 2 we have

|| y_k^{n_k} − N_{y_k}^{n_k}(f_{k+1}) ||_{∆^{n_k}} ≤ Σ_{j=1}^{k} 2^{−j}.

Since ||f_{k+1} − f_k||_F ≤ (r/2)^k, the sequence {f_k} converges to an element f* ∈ B. Let y* be defined by (y*)^{n_k} = y_k^{n_k}, k ≥ 1. For m > k we have

|| (y*)^{n_k} − N_{y*}^{n_k}(f_m) ||_{∆^{n_k}}
    ≤ || (y*)^{n_k} − N_{y*}^{n_k}(f_{k+1}) ||_{∆^{n_k}} + Σ_{j=k+1}^{m−1} || N_{y*}^{n_k}(f_{j+1} − f_j) ||_{∆^{n_k}}
    ≤ Σ_{j=1}^{k} 2^{−j} + Σ_{j=k+1}^{m−1} 2^{−j} ≤ 1.

Letting m → +∞ and using the continuity of the functionals, we find that ||(y*)^{n_k} − N_{y*}^{n_k}(f*)||_{∆^{n_k}} ≤ 1. This in turn yields (y*)ⁿ ∈ N_y^n(f*) for all n ≥ 1, i.e., y* is noisy information about f*.

Finally, for k ≥ 1 we obtain

||S(f*) − φ_{n_k}((y*)^{n_k})|| ≥ ||S(f*) − S(f_k)|| − ||S(f_k) − φ_{n_k}((y*)^{n_k})||
    ≥ (1/2) ||S(f_{k+1}) − S(f_k)|| − (1/10) τ_{n_k}(y*) rad_{n_k}^{wor}(N_{y*})
    ≥ (1/40) τ_{n_k}(y*) rad_{n_k}^{wor}(N_{y*}),

which implies (6.4) and contradicts the fact that f* ∈ B. The proof is complete.
Observe that the sequences τ(y) can be selected in such a way that they converge to zero arbitrarily slowly. This means that the speed of convergence of the radii {rad_n^{wor}(N_y)} essentially cannot be beaten by any algorithm φ. In this sense, the optimal convergence rate is given by that of the radii, and the (ordinary) spline algorithm φ_o is optimal.

Observe also that the sequences τ(y) cannot be eliminated from the formulation of Theorem 6.2.1. Indeed, for the spline algorithm φ_o we have

{ f ∈ F | limsup_{n→∞} ||S(f) − φ_n^o(yⁿ)|| / rad_n^{wor}(N_y) < +∞, ∀y ∈ N(f) } = F.

For further discussion of Theorem 6.2.1, see NR 6.2.2.
6.2.3 Optimal nonadaptive information In this subsection we consider the problem of optimal information. We wish to select the functionals Li in such a way that the speed of convergence is maximized. In view of Theorem 6.2.1, we can restrict ourselves
to nonadaptive information, since adaption does not help. (The behavior of the error ||S(f) − φ_n(yⁿ)|| is characterized by that of the nth radii of the nonadaptive information N_y.) More precisely, let 𝒩 be the class of nonadaptive (exact) information operators N = [L_1, L_2, ...], where the functionals L_i ∈ Λ belong to a given class Λ ⊂ F*. Suppose the precision vector is fixed,

∆ = [δ_1, δ_2, δ_3, ...].

It is clear that the error cannot tend to zero essentially faster than the sequence of the nth minimal worst case radii {r_n^{wor}(∆)} defined by

r_n^{wor}(∆) = inf_{N∈𝒩} rad_n^{wor}({Nⁿ, ∆ⁿ}),   n ≥ 1
(compare with the corresponding definition in Section 2.8). We shall show that it is often possible to construct information N ∈ 𝒩 for which this convergence is achieved.

We assume that the extended norms || · ||_{∆ⁿ} satisfy the following condition. For any n ≥ 1 and permutation {p_j}_{j=1}^n of {1, 2, ..., n} we have

||(x_1, ..., x_n)||_{[δ_1,...,δ_n]} = ||(x_{p_1}, ..., x_{p_n})||_{[δ_{p_1},...,δ_{p_n}]}   (6.5)

for all (x_1, ..., x_n) ∈ ℝⁿ. This condition expresses the property that the power of information does not depend on the order of performing observations. Indeed, for information N_1 = {[L_1, ..., L_n], [δ_1, ..., δ_n]} and N_2 = {[L_{p_1}, ..., L_{p_n}], [δ_{p_1}, ..., δ_{p_n}]} we have [y_1, ..., y_n] ∈ N_1(f) iff [y_{p_1}, ..., y_{p_n}] ∈ N_2(f). Clearly, (6.5) holds, for instance, for the weighted sup norm and the Euclidean norm.
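For instance, the weighted Euclidean norm ||x||_{[δ_1,...,δ_n]} = ( Σ_i (x_i/δ_i)² )^{1/2}, extended by the convention that the norm is infinite when δ_i = 0 and x_i ≠ 0, satisfies (6.5). A short check (the coordinate values are assumed example data):

```python
import math
from itertools import permutations

def weighted_euclid(x, deltas):
    """Extended weighted Euclidean norm ||x||_Delta = sqrt(sum (x_i/delta_i)^2);
    the norm is infinite when some delta_i = 0 while x_i != 0."""
    total = 0.0
    for xi, di in zip(x, deltas):
        if di == 0.0:
            if xi != 0.0:
                return math.inf
        else:
            total += (xi / di) ** 2
    return math.sqrt(total)

x = (0.3, -1.2, 0.7)
deltas = (1.0, 0.5, 2.0)
base = weighted_euclid(x, deltas)

# condition (6.5): permuting observations together with their precisions
# leaves the norm unchanged
for p in permutations(range(3)):
    xp = tuple(x[i] for i in p)
    dp = tuple(deltas[i] for i in p)
    assert abs(weighted_euclid(xp, dp) - base) < 1e-12
```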
Let η > 1. For any n ≥ 1, let information N_n ∈ 𝒩 be such that

rad_n^{wor}({N_n^n, ∆ⁿ}) ≤ η r_n^{wor}(∆).

Define

N_∆ = [N_1^1, N_2^2, N_4^4, ..., N_{2^k}^{2^k}, ...],

where, as always, N_n^n denotes the first n functionals of N_n.
Lemma 6.2.2 Suppose that

δ_1 ≥ δ_2 ≥ δ_3 ≥ ⋯ ≥ 0.   (6.6)

Then for the information N_∆ = {N_∆, ∆} and the ordinary spline algorithm φ_o we have

||S(f) − φ_n^o(yⁿ)|| ≤ K(f) r^{wor}_{⌈(n+1)/4⌉}(∆),   f ∈ F, y ∈ N_∆(f),
where K(f) = η max{ 2, (1 + ρ) ||f||_F }.
Proof For n ≥ 1, let k = k(n) be the largest integer satisfying n ≥ Σ_{i=0}^{k} 2^i = 2^{k+1} − 1. Then all the functionals of N_{2^k}^{2^k} are contained in N_∆^n and, in view of (6.6), these functionals are observed with smaller noise bounds using the information N_∆ than using {N_{2^k}^{2^k}, ∆^{2^k}}. This, (6.1) and (6.5) yield

rad_n^{wor}(N_∆) ≤ rad_{2^k}^{wor}({N_{2^k}^{2^k}, ∆^{2^k}}) ≤ η r_{2^k}^{wor}(∆).

Using (6.3), for any f ∈ F and y ∈ N_∆(f) we obtain

||S(f) − φ_n^o(yⁿ)|| ≤ max{ 2, (1 + ρ)||f||_F } rad_n^{wor}(N_∆),

and the lemma follows since 2^k ≥ ⌈(n + 1)/4⌉.
Lemma 6.2.2 often yields optimality of the information N_∆. Indeed, for many problems the nth minimal radius r_n^{wor}(∆) behaves polynomially in 1/n, i.e., r_n^{wor}(∆) ≍ n^{−p} for some p > 0. Then r^{wor}_{⌈(n+1)/4⌉}(∆) ≍ r_n^{wor}(∆), the error ||S(f) − φ_n^o(yⁿ)|| achieves the optimal convergence rate, and the information N_∆ is optimal.
Corollary 6.2.2 If the nth minimal radii satisfy

r_n^{wor}(∆) ≍ n^{−p},   p > 0,

then the information N_∆ is optimal, i.e., for the spline algorithm φ_o we have

||S(f) − φ_n^o(yⁿ)|| = O(n^{−p}),   f ∈ F, y ∈ N_∆(f).
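The bookkeeping behind N_∆ and Corollary 6.2.2 can be illustrated with a small computation. Assuming the minimal radii behave exactly as r_m^{wor}(∆) = m^{−p}, after n observations of the concatenated design the largest fully observed block has size 2^k with 2^k ≥ (n + 2)/4, so the guaranteed radius stays within the constant factor 4^p of the best n-observation design:

```python
def largest_complete_block(n):
    """Largest k with 2**(k+1) - 1 <= n: after n observations of the
    concatenated design N_Delta = [N_1^1, N_2^2, N_4^4, ...], the whole
    block N_{2^k} has been observed."""
    k = 0
    while 2 ** (k + 2) - 1 <= n:
        k += 1
    return k

def achieved_radius(n, p):
    """Radius guaranteed after n observations, assuming r_m = m**(-p)."""
    k = largest_complete_block(n)
    return (2 ** k) ** (-p)

# the loss compared with the best n-observation design is a bounded factor:
# 2^k >= (n+2)/4, hence achieved/optimal <= 4**p for all n
p = 1.5
for n in range(1, 2000):
    assert achieved_radius(n, p) <= 4 ** p * n ** (-p)
```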
Notes and remarks NR 6.2.1 The first results which revealed relations between the asymptotic and worst case settings were obtained by Trojan (1983) who analyzed the linear case with exact information. His results were then generalized by Kacewicz (1987) to the nonlinear case with exact information. The particular nonlinear problems of multivariate global optimization and scalar zero finding were studied in Plaskota (1989) and Sikorski and Trojan (1990), respectively. The results for noisy information were obtained by Kacewicz and Plaskota (1991, 1992, 1993). This section is based mainly on these last three papers. NR 6.2.2 One can try to strengthen Theorem 6.2.1 by making the sequences r(y) dependent not only on the information y, but also on the problem elements f . That is, we select nonnegative sequences r (f ; y) such that
and lim_{n→∞} τ_n(f; y) = 0, and replace the set A in Theorem 6.2.1 by

B = { f ∈ F | limsup_{n→∞} ||S(f) − φ_n(yⁿ)|| / ( τ_n(f; y) rad_n^{wor}(N_y) ) < +∞, ∀y ∈ N(f) }.
Then such a `reformulated' theorem is in general no longer true, as can be illustrated by the following example.
Let F = G be a separable, infinite dimensional Hilbert space with the orthonormal basis {ξ_i}_{i≥1}. Let S be given by Sξ_i = λ_i ξ_i, i ≥ 1, where |λ_1| ≥ |λ_2| ≥ ⋯ > 0. Let the information N be exact with N = [⟨·, ξ_1⟩_F, ⟨·, ξ_2⟩_F, ...]. In this case we have rad_n^{wor}(N) = |λ_{n+1}|. On the other hand, for the algorithm φ = {φ_n} where φ_n(yⁿ) = Σ_{i=1}^{n} y_i λ_i ξ_i, we have

||S(f) − φ_n(yⁿ)|| / |λ_{n+1}| = (1/|λ_{n+1}|) ( Σ_{i=n+1}^{∞} ⟨f, ξ_i⟩²_F λ_i² )^{1/2} ≤ ( Σ_{i=n+1}^{∞} ⟨f, ξ_i⟩²_F )^{1/2},

which converges to zero as n → +∞ for all f ∈ F and y ∈ N(f). Hence, taking τ_n(f; y) = τ_n(f) = ||S(f) − φ_n(yⁿ)||/|λ_{n+1}|, we have B = F. In this example, the ratio ||S(f) − φ_n(yⁿ)||/rad_n^{wor}(N) converges to zero for all f and information y about f. However, owing to Theorem 6.2.1, on a dense set of f this convergence is arbitrarily slow.
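The example of NR 6.2.2 can be reproduced for concrete values; in the sketch below the eigenvalues λ_i = 2^{−i} and the element f are assumed illustrative choices:

```python
# Concrete instance of the example: S xi_i = lambda_i xi_i with
# lambda_i = 2^{-i} (assumed values), exact coordinate information,
# and one fixed element f of the space.
lam = [2.0 ** (-i) for i in range(1, 40)]
f = [0.5 / (i * i) for i in range(1, 40)]       # an arbitrary element

def error(n):
    """||S(f) - phi_n(y^n)||: only coordinates i > n remain unresolved."""
    return sum((f[i] * lam[i]) ** 2 for i in range(n, len(lam))) ** 0.5

def radius(n):
    """rad_n^wor(N) = |lambda_{n+1}|."""
    return lam[n]

ratios = [error(n) / radius(n) for n in range(1, 30)]
# the ratio error/radius tends to zero for this f, as claimed in the remark
assert all(b < a for a, b in zip(ratios, ratios[1:]))
assert ratios[-1] < 1e-3
```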
NR 6.2.3 In this section we analyzed the behavior of algorithms for fixed f as the number of observations increases to infinity. It is also possible to study the behavior of the cost of computing an ε-approximation as ε → 0. Clearly, we want this cost to grow as slowly as possible. The corresponding computational model would be as follows. The approximations are obtained by executing a program P. This time, however, the result of computation is a sequence g_0, g_1, g_2, ... of approximations rather than a single approximation. That is, the execution consists (at least theoretically) of infinitely many steps. At each (nth) step a noisy value y_n of a functional L_n(f; y_1, ..., y_{n−1}) is observed and then the nth approximation g_n = φ_n(y_1, ..., y_n) is computed. Obviously, such an infinite process usually requires infinitely many constants and variables. However, we assume that for any n, the nth approximation is obtained using a finite number of constants and variables, as well as a finite number of primitive operations. In other words, the first n steps of P constitute a program in the sense of the worst case setting of Subsection 2.9.1.

Let P be a program that realizes an algorithm φ using information N. For ε > 0, let

m(P; f, y)(ε) = min { k ≥ 0 | ||S(f) − φ_i(yⁱ)|| ≤ ε, ∀i ≥ k }   (6.7)

be the minimal number of steps for which all elements g_m, g_{m+1}, g_{m+2}, ... are ε-approximations to S(f). (If such a k does not exist we let m(P; f, y)(ε) = +∞.) Then the cost of obtaining an ε-approximation using the program P is given as

cost(P; f, y)(ε) = cost_m(P; y),   f ∈ F, y ∈ N(f),

where m = m(P; f, y)(ε) is defined by (6.7), and cost_m(P; y) is the cost of performing m steps using the program P with information y. (If m = +∞ then cost(P; f, y)(ε) = +∞.)
A similar model of (asymptotic) cost was studied by Kacewicz and Plaskota (1992, 1993). They showed that, under some additional assumptions, the best behavior of cost(P; f, y)(ε) is essentially determined by the worst case complexity Comp^{wor}(ε). Hence there are close relations between the asymptotic and worst case settings not only with respect to the error but also with respect to the cost of approximation.

NR 6.2.4 In the previous remark we assumed that the computational process is infinite. It is clear that in practice the computation must be terminated. The choice of an adequate termination criterion is an important practical problem. Obviously, (6.7) cannot serve as a computable termination criterion, since m(P; f, y)(ε) explicitly depends on f, which is unknown. Suppose that we want to compute approximations using a program P which realizes a linear algorithm φ using nonadaptive information N. Suppose also that we know some bound on the norm of f, say ||f||_F ≤ K. Then to obtain an ε-approximation it is enough to terminate after

m = min { i | e^{wor}(N_i, φ_i) ≤ ε min{1, 1/K} }

steps. In view of the results of Kacewicz and Plaskota (1992, 1993), this is the best we can do. On the other hand, if we do not have any additional information about the norm ||f||_F, then any computable termination criterion fails; see E 6.2.5.
Exercises

E 6.2.1 Let ∆ ∈ ℝ^∞ be a given precision vector. Show that

||x||_∆ = lim_{n→∞} ||xⁿ||_{∆ⁿ},   x ∈ ℝ^∞,

is a well defined extended norm in ℝ^∞, and that for all n ≥ 1 we have

||xⁿ||_{∆ⁿ} = min_{z∈ℝ^∞} ||[xⁿ, z]||_∆,   xⁿ ∈ ℝⁿ.
E 6.2.2 Show that Corollary 6.2.1 also holds for the smoothing spline algorithm defined in Subsection 2.5.1.

E 6.2.3 Let N and φ be arbitrary information and algorithm. Show that for the spline algorithm φ_o and τ_n(y) as in Theorem 6.2.1, the set

{ f ∈ F | limsup_{n→∞} ||S(f) − φ_n(yⁿ)|| / ( τ_n(y) ||S(f) − φ_n^o(yⁿ)|| ) < +∞, ∀y ∈ N(f) }

does not contain any ball.
E 6.2.4 Suppose that the solution operator S is compact and acts between separable Hilbert spaces, and that observations of all functionals with norm bounded by 1 are allowed. Let

N_0 = [⟨·, ξ_1⟩_F, ⟨·, ξ_2⟩_F, ...],

where {ξ_j} is the complete orthonormal basis of eigenelements of S*S and the corresponding eigenvalues satisfy λ_1 ≥ λ_2 ≥ ⋯ ≥ 0. Assuming exact observations, ∆ = [0, 0, 0, ...], show that for the spline algorithm φ_spl we have ||S(f) − φ_n^spl(yⁿ)|| → 0 for every f ∈ F.

E 6.2.5 Suppose that δ_n > 0 for all n ≥ 1, and let t_n : ℝⁿ → {0, 1} be arbitrary termination functions. That is, for f ∈ F and y ∈ N(f) calculations are terminated after

m(y) = min { i ≥ 0 | t_i(y_1, ..., y_i) = 1 }

steps. Show that for any ε > 0 and y ∈ N(F), there exists f ∈ F such that y ∈ N(f) and ||S(f) − φ_{m(y)}(y^{m(y)})|| ≥ ε.
6.3 Asymptotic and average case settings
In this section, we assume that the information noise has random character. We show close relations between the asymptotic and average case settings. This will be done under the following assumptions: F is a separable Banach space equipped with a zero-mean Gaussian measure µ, the solution operator S is continuous and linear and acts between F and a separable Hilbert space G, and information consists of independent observations of continuous linear functionals with Gaussian noise.

To be more specific, (adaptive) information with Gaussian noise in the asymptotic setting is given as N = {N_y, Σ_y}_{y∈ℝ^∞}, where

N_y = [L_1(·), L_2(·; y_1), ..., L_n(·; y_1, ..., y_{n−1}), ...]

is an infinite sequence of continuous linear functionals, and

Σ_y = diag{σ_1², σ_2²(y_1), ..., σ_n²(y_1, ..., y_{n−1}), ...}

is an infinite diagonal matrix. By the ith observation we obtain y_i = L_i(f; y_1, ..., y_{i−1}) + x_i, where x_i ~ N(0, σ_i²(y_1, ..., y_{i−1})). That is, for f ∈ F the probability distribution of information y = [y_1, y_2, ...] ∈ ℝ^∞ about f is defined on the σ-field generated by the (cylindrical) sets of the form B = A × ℝ^∞, where A is a Borel set of ℝⁿ and n ≥ 1, and is given as follows. For any such B we have π_f(B) = π_f^n(A), where π_f^n is the distribution of [y_1, ..., y_n] ∈ ℝⁿ corresponding to the first n observations. It is defined as in Subsection 3.7.1 for information Nⁿ = {N_y^n, Σ_y^n}_{y∈ℝⁿ} with

N_y^n = [L_1(·), L_2(·; y_1), ..., L_n(·; y_1, ..., y_{n−1})]
and
Σ_y^n = diag{σ_1², σ_2²(y_1), ..., σ_n²(y_1, ..., y_{n−1})}.

Now, for an arbitrary Borel set B ⊂ ℝ^∞,

π_f(B) = lim_{n→∞} π_f^n(Bⁿ),   (6.8)

where Bⁿ = { yⁿ ∈ ℝⁿ | y ∈ B } is the projection of B onto ℝⁿ.
6.3.1 Optimal algorithms
We now deal with the problem of an optimal algorithm. Recall that in the average case setting the optimal algorithm φ_opt is obtained by applying S to the mean of the conditional distribution corresponding to information about f. Also, φ_opt can be interpreted as a smoothing spline algorithm, φ_opt = φ_spl. We now show that the same type of algorithm can be successfully used in the asymptotic setting.

More precisely, for y ∈ ℝ^∞, let the algorithm φ_spl = {φ_n^spl} be given as φ_n^spl(yⁿ) = S(m(yⁿ)), where

m(yⁿ) = Σ_{j=1}^{n} z_j^n C_µ( L_j(·; y^{j−1}) ),

zⁿ = (z_1^n, ..., z_n^n) is the solution of

( Σ_y^n + G_y^n ) zⁿ = yⁿ,

Σ_y^n = diag{σ_1², σ_2²(y_1), ..., σ_n²(y_1, ..., y_{n−1})}, and G_y^n = { ⟨L_i(·; y^{i−1}), L_j(·; y^{j−1})⟩_µ }_{i,j=1}^{n} is the Gram matrix. Let {H, F_1} (F_1 = supp µ) be the abstract Wiener space for µ. Then m(yⁿ) can be equivalently defined as the minimizer in the Hilbert space H of

||f||²_H + Σ_{j=1}^{n} σ_j^{−2}(y^{j−1}) ( y_j − L_j(f; y^{j−1}) )²

(compare with Section 3.6 and Subsection 3.7.2).
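A finite-dimensional, nonadaptive sketch of these formulas: here F = ℝ^d with µ = N(0, C) standing in for the Gaussian measure (C playing the role of C_µ), the functionals are L_j(f) = ⟨a_j, f⟩, and all matrices and data are synthetic. The code checks that solving (Σⁿ + Gⁿ)zⁿ = yⁿ and setting m(yⁿ) = Σ_j z_j^n C a_j agrees with minimizing the regularized functional ||f||²_H + Σ_j σ_j^{−2}(y_j − L_j(f))², where ||f||²_H = fᵀC^{−1}f:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 8

# Gaussian "prior" mu = N(0, C) on F = R^d (stand-in for the measure mu)
M = rng.standard_normal((d, d))
C = M @ M.T + 0.1 * np.eye(d)

A = rng.standard_normal((n, d))            # rows a_j represent L_j(f) = <a_j, f>
Sigma = np.diag(rng.uniform(0.1, 1.0, n))  # noise covariance diag(sigma_j^2)

f = rng.standard_normal(d)
y = A @ f + rng.multivariate_normal(np.zeros(n), Sigma)

# spline estimate: solve (Sigma + G) z = y with Gram matrix G_ij = a_i^T C a_j,
# then m(y) = sum_j z_j * C a_j = C A^T z
G = A @ C @ A.T
z = np.linalg.solve(Sigma + G, y)
m_spline = C @ A.T @ z

# equivalently, m(y) minimizes f^T C^{-1} f + sum_j sigma_j^{-2} (y_j - a_j^T f)^2,
# whose normal equations are (C^{-1} + A^T Sigma^{-1} A) f = A^T Sigma^{-1} y
m_reg = np.linalg.solve(np.linalg.inv(C) + A.T @ np.linalg.solve(Sigma, A),
                        A.T @ np.linalg.solve(Sigma, y))

assert np.allclose(m_spline, m_reg)
```

The equivalence of the two solutions is an instance of the Sherman–Morrison–Woodbury identity.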
In what follows, we shall use the joint distribution µ̃ on the space F × ℝ^∞. This represents the probability of the occurrence of f ∈ F and information y about f, and is generated by the measure µ and the distributions π_f. That is, for measurable sets A ⊂ F and B ⊂ ℝ^∞, we have

µ̃(A × B) = ∫_A π_f(B) µ(df).
Observe that in view of (6.8) we can also write

µ̃(A × B) = lim_{n→∞} µ̃ⁿ(A × Bⁿ),

where µ̃ⁿ(A × Bⁿ) = ∫_A π_f^n(Bⁿ) µ(df) is the joint probability on F × ℝⁿ, i.e., the projection of µ̃ onto F × ℝⁿ. Obviously, m(yⁿ) is the mean element of the conditional distribution µ̃ⁿ(·|yⁿ) on F. Hence φ_n^spl minimizes the average error with respect to µ̃ⁿ.

The algorithm φ_spl is optimal in the asymptotic setting in the following sense.

Theorem 6.3.1 For any algorithm φ = {φ_n}, its error almost nowhere tends to zero faster than the error of φ_spl. That is, the set

A = { (f, y) ∈ F × ℝ^∞ | lim_{n→∞} ||S(f) − φ_n(yⁿ)|| / ||S(f) − φ_n^spl(yⁿ)|| = 0 }

is of µ̃-measure zero. (By convention, 0/0 = 1.)

The proof of this theorem is based on the following lemma.
Lemma 6.3.1 Let ω be a Gaussian measure on G with mean m_ω. Then for any g_0 ∈ G and q ∈ (0, 1), the measure ω({ g ∈ G | ||g − g_0|| ≤ q ||m_ω − g_0|| }) can be bounded from above by a quantity smaller than 1 depending only on q. The proof relies on the fact that for a zero-mean Gaussian measure ω on G, any r > 0 and a ∈ G we have

ω(B_r(0)) ≥ ω(B_r(a)).
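The inequality ω(B_r(0)) ≥ ω(B_r(a)) (Anderson's inequality) is easy to confirm by simulation; the dimension, covariance and shifts below are assumed example values:

```python
import math
import random

random.seed(3)

# Monte Carlo illustration of Anderson's inequality: for a zero-mean Gaussian,
# a ball centered at the origin carries at least as much mass as any translate.
def ball_mass(center, r, n_samples=100000):
    """Estimate w(B_r(center)) for w = N(0, diag(1, 0.25)) on R^2."""
    hits = 0
    for _ in range(n_samples):
        g = (random.gauss(0.0, 1.0), random.gauss(0.0, 0.5))
        if math.dist(g, center) <= r:
            hits += 1
    return hits / n_samples

m0 = ball_mass((0.0, 0.0), 1.0)
for a in [(0.5, 0.0), (0.0, 0.7), (1.0, 1.0)]:
    assert ball_mass(a, 1.0) <= m0 + 0.01       # 0.01 covers Monte Carlo noise
```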
Lemma 6.3.3 Let ω be a Gaussian measure on G and C_ω its correlation operator. Then for any a ∈ G and r > 0,

ω(B_r(a)) ≤ (4/3) ψ( r / (tr(C_ω))^{1/2} ),

where ψ(x) = (2/π)^{1/2} ∫_0^x e^{−t²/2} dt.
We are now ready to show that the error of any algorithm (and in particular the error of φ_spl) cannot converge to zero faster than {rad_n^{ave}(N_y)}.
Theorem 6.3.2 For any algorithm φ the set

A_1 = { (f, y) ∈ F × ℝ^∞ | lim_{n→∞} ||S(f) − φ_n(yⁿ)|| / rad_n^{ave}(N_y) = 0 }

has µ̃-measure zero.
Proof We choose q ∈ (0, 1) and define

A_{1,n} = { (f, y) ∈ F × ℝ^∞ | ||S(f) − φ_n(yⁿ)|| ≤ q rad_n^{ave}(N_y) }

and B_{1,n} = { (f, yⁿ) ∈ F × ℝⁿ | (f, y) ∈ A_{1,n} }. Similarly to the proof of Theorem 6.3.1, we have A_1 ⊂ ∪_{k≥1} ∩_{n≥k} A_{1,n} and µ̃(A_1) ≤ limsup_{n→∞} µ̃ⁿ(B_{1,n}). It now suffices to show that the last limit tends to zero as q → 0. Indeed, using the conditional distribution of µ̃ⁿ we get

µ̃ⁿ(B_{1,n}) = ∫_{ℝⁿ} µ̃ⁿ({ f ∈ F | (f, yⁿ) ∈ B_{1,n} } | yⁿ) µ̃_1(dyⁿ)
    = ∫_{ℝⁿ} ν_2({ g ∈ G | ||g − φ_n(yⁿ)|| ≤ q rad_n^{ave}(N_y) } | yⁿ) µ̃_1(dyⁿ).

Since rad_n^{ave}(N_y) is given by the trace of the correlation operator of the Gaussian measure ν_2(·|yⁿ), we can use Lemma 6.3.3 to get that

ν_2({ g ∈ G | ||g − φ_n(yⁿ)|| ≤ q rad_n^{ave}(N_y) } | yⁿ) ≤ (4/3) ψ(2q).

Thus µ̃ⁿ(B_{1,n}) ≤ (4/3) ψ(2q). Since this tends to zero with q → 0⁺, µ̃(A_1) = 0.

We now show that in some sense the sequence {rad_n^{ave}(N_y)} also provides an upper bound on the convergence rate.
Theorem 6.3.3 For the algorithm φ_spl the set

A_2 = { (f, y) ∈ F × ℝ^∞ | lim_{n→∞} rad_n^{ave}(N_y) / ||S(f) − φ_n^spl(yⁿ)|| = 0 }

has µ̃-measure zero.
Proof Choose q ∈ (0, 1) and define

A_{2,n} = { (f, y) ∈ F × ℝ^∞ | ||S(f) − φ_n^spl(yⁿ)|| ≥ (1/q) rad_n^{ave}(N_y) }

and B_{2,n} = { (f, yⁿ) ∈ F × ℝⁿ | (f, y) ∈ A_{2,n} }. Then A_2 is a subset of ∪_{k≥1} ∩_{n≥k} A_{2,n} and µ̃(A_2) ≤ limsup_{n→∞} µ̃ⁿ(B_{2,n}). Using the decomposition of µ̃ⁿ we have

µ̃ⁿ(B_{2,n}) = ∫_{ℝⁿ} ν_2({ g ∈ G | ||g − φ_n^spl(yⁿ)|| ≥ (1/q) ( tr(C_{ν_2,yⁿ}) )^{1/2} } | yⁿ) µ̃_1(dyⁿ).

We now use a slight generalization of the Chebyshev inequality to estimate the Gaussian measure of the set of all g which are not in the ball centered at the mean element. That is, if ω is a Gaussian measure on G then for any r > 0

tr(C_ω) = ∫_G ||g − m_ω||² ω(dg) ≥ ∫_{||g−m_ω||>r} ||g − m_ω||² ω(dg) ≥ r² ω(G \ B_r(m_ω)),

and consequently

ω({ g ∈ G | ||g − m_ω|| > r }) ≤ tr(C_ω)/r².

Since φ_n^spl(yⁿ) is the mean element of ν_2(·|yⁿ), applying this bound with r = (1/q)(tr(C_{ν_2,yⁿ}))^{1/2} yields µ̃ⁿ(B_{2,n}) ≤ q². Letting q → 0⁺ completes the proof.
For n ≥ 1, let

r_n^{ave}(Σ) = inf_{N∈𝒩} rad_n^{ave}({Nⁿ, Σⁿ})

be the minimal average error that can be achieved using the first n nonadaptive observations. It is clear that for any information N ∈ 𝒩 we have rad_n^{ave}(N, Σ) ≥ r_n^{ave}(Σ), i.e., the radii rad_n^{ave}(N, Σ) do not converge faster than r_n^{ave}(Σ). Consequently, Theorem 6.3.2 yields that for arbitrary information N ∈ 𝒩 the µ̃-measure of the set

{ (f, y) ∈ F × ℝ^∞ | lim_{n→∞} ||S(f) − φ_n(yⁿ)|| / r_n^{ave}(Σ) = 0 }

is zero. We now exhibit information N_Σ whose radii behave in many cases as r_n^{ave}(Σ). To this end, we use the construction from Subsection 6.2.3. We let η > 1 and choose N_n ∈ 𝒩 for n ≥ 1 in such a way that

rad_n^{ave}(N_n, Σ) ≤ η r_n^{ave}(Σ).

Then

N_Σ = [N_1^1, N_2^2, N_4^4, ..., N_{2^k}^{2^k}, ...].
Lemma 6.3.4 Suppose that

σ_1 ≥ σ_2 ≥ σ_3 ≥ ⋯ ≥ 0.
Then for the information N_Σ and the algorithm φ_spl the set

{ (f, y) ∈ F × ℝ^∞ | lim_{n→∞} r^{ave}_{⌈(n+1)/4⌉}(Σ) / ||S(f) − φ_n^spl(yⁿ)|| = 0 }

has µ̃-measure zero.

Proof Proceeding as in the proof of Lemma 6.2.2 we show that

rad_n^{ave}(N_Σ, Σ) ≤ η r^{ave}_{⌈(n+1)/4⌉}(Σ).   (6.13)

Hence the lemma is a consequence of (6.13) and Theorem 6.3.3.

As in Subsection 6.2.2, Lemma 6.3.4 immediately gives the following corollary.
Corollary 6.3.1 If the nth minimal radii satisfy r_n^{ave}(Σ) ≍ n^{−p} for some p > 0, then the information N_Σ is optimal.
Notes and remarks NR 6.3.1 Relations between the asymptotic and average case settings were first established by Wasilkowski and Wozniakowski (1987) who studied exact information. The results for information with random noise are new; however, we adopted techniques from the paper cited to prove Theorems 6.3.1, 6.3.2 and 6.3.3. Lemma 6.3.3 is due to Kwapien.
NR 6.3.2 We cannot claim in general that the sequence ||S(f) − φ_n^spl(yⁿ)|| behaves at least as well as rad_n^{ave}(N_y) with probability one. Actually, the probability that ||S(f) − φ_n^spl(yⁿ)|| = O(rad_n^{ave}(N_y)) can even be zero, as illustrated by the following example. Let F = G be the space of infinite real sequences with

||f||²_F = Σ_{i=1}^{∞} f_i² < +∞.

We equip F with the zero-mean Gaussian measure µ such that C_µ e_i = λ_i e_i, where λ_i = a^i and 0 < a < 1. Consider approximation of f ∈ F from exact information about the coordinates of f, i.e., N(f) = [f_1, f_2, f_3, ...] and Σ = diag{0, 0, 0, ...}. We shall see that in this case the set

A_3 = { (f, y) ∈ F × ℝ^∞ | limsup_{n→∞} ||S(f) − φ_n^spl(yⁿ)|| / rad_n^{ave}(N) < +∞ }

has µ̃-measure zero. Indeed, as the noise is zero, the measure µ̃ is concentrated on {(f, N(f)) | f ∈ F} and µ̃(A_3) equals the µ-measure of the set B = { f ∈ F | (f, N(f)) ∈ A_3 }. Moreover, since in this case φ_n^spl(yⁿ) = [y_1, ..., y_n, 0, 0, 0, ...] and (rad_n^{ave}(N))² = Σ_{i=n+1}^{∞} λ_i, we have B = ∪_{k≥1} B_k where

B_k = { f ∈ F | Σ_{i=n}^{∞} f_i² ≤ k² Σ_{i=n}^{∞} λ_i, ∀n ≥ 1 }.

Observe now that the condition E°°n i=
2=fi < k2 >°°n Ai implies
00
1:At = k
I fn l < k
lava =
lk a
i=n
fin.
Hence 00
Hence µ(B_k) = 0 for every k ≥ 1 (the bound above forces an infinite product of probabilities bounded away from 1), so µ(B) = 0 and µ̃(A_3) = 0.

Proof of Lemma 6.3.3 Let ωⁿ⁻¹ be the joint distribution of gⁿ⁻¹ = (g_1, ..., g_{n−1}) and let ω_n be the distribution of g_n. Conditioning on the first n − 1 coordinates and applying the one dimensional estimate to the last one gives

ω(B_r(a)) ≤ ω(B_r(aⁿ⁻¹, 0)).

Proceeding in this way with successive coordinates we obtain a bound by the measure of a ball centered at zero, and a direct one dimensional computation then yields the bound (4/3) ψ( r / (tr(C_ω))^{1/2} ). The proof is complete.
Exercises

E 6.3.1 Consider the problem of approximating a parameter f ∈ ℝ from information y = [y_1, y_2, y_3, ...] ∈ ℝ^∞, where y_i = f + x_i and the x_i are independent, x_i ~ N(0, σ²), i ≥ 1. Show that then for any f

π_f({ y ∈ ℝ^∞ | lim_{n→∞} (1/n) Σ_{j=1}^{n} y_j = f }) = 1,

where, as always, π_f is the distribution of information y about f. That is, the algorithm φ_n(yⁿ) = (1/n) Σ_{j=1}^{n} y_j converges to the `true' solution f with probability 1.
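A simulation of this exercise (the values f = 1.25 and σ = 0.5 are assumed for illustration):

```python
import random

random.seed(7)

# Simulation of E 6.3.1: y_j = f + x_j with x_j ~ N(0, sigma^2), and the
# running sample means phi_n(y^n) = (1/n) * sum_j y_j.
def sample_mean_path(f, sigma, n):
    total, means = 0.0, []
    for j in range(1, n + 1):
        total += f + random.gauss(0.0, sigma)
        means.append(total / j)
    return means

means = sample_mean_path(f=1.25, sigma=0.5, n=200000)
# the error is typically of order sigma / sqrt(n)
assert abs(means[-1] - 1.25) < 0.01
```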
E 6.3.2 Consider the one dimensional problem of E 6.3.1. For y belonging to the set

C = { y ∈ ℝ^∞ | the limit m(y) = lim_{n→∞} (1/n) Σ_{j=1}^{n} y_j exists and is finite },

let ω_y be the Dirac measure on ℝ centered at m(y). Let µ_1 be the distribution of information y ∈ ℝ^∞,

µ_1(·) = ∫_F π_f(·) µ(df).

Show that µ_1(C) = 1 and

µ̃(A × B) = ∫_B ω_y(A) µ_1(dy)

for any measurable sets A ⊂ F and B ⊂ ℝ^∞, i.e., {ω_y} is the regular conditional distribution on ℝ with respect to the information y ∈ ℝ^∞, ω_y = µ̃_2(·|y).
E 6.3.3 Give an example where
$$\tilde\mu\bigl(\bigl\{\, (f,y) \in F\times\mathbb{R}^{\infty} \;\big|\; \|S(f)-\varphi^{\rm spl}(y^n)\| \asymp \mathrm{rad}^{\rm ave}_n(\mathbb{I}) \,\bigr\}\bigr) \;=\; 1.$$

E 6.3.4 Let $\mathbb{I}$ be given information. Can we claim that for a.e. $(f,y)$ there exists a subsequence $\{n_k\}$ such that
$$\|S(f)-\varphi^{\rm spl}(y^{n_k})\| \;\lesssim\; \mathrm{rad}^{\rm ave}_{n_k}(\mathbb{I})\,?$$
E 6.3.5 Suppose the class $\Lambda$ consists of functionals whose $\mu$-norm is bounded by 1. Let $N_0 = [\,\langle\cdot,\xi_1\rangle_\mu, \langle\cdot,\xi_2\rangle_\mu, \ldots\,]$, where $\{\xi_i\}$ is the complete orthonormal basis of eigenelements of $S C_\mu S^*$ and the corresponding eigenvalues satisfy $\lambda_1 \ge \lambda_2 \ge \cdots > 0$. Assuming exact observations, $\Sigma = \mathrm{diag}\{0,0,0,\ldots\}$, show that the information $N_0$ is optimal independently of the behavior of $\mathrm{rad}^{\rm ave}_n(N_0) = \sqrt{\sum_{j\ge n+1}\lambda_j}$.
References
ARESTOV, V.V.
(1990) Best recovery of operators, and related problems. Vol. 189 of Proc. of the Steklov Inst. of Math., pp. 1-20.
ARONSZAJN, N.
(1950) Theory of reproducing kernels. Trans. AMS, 68, 337404. BABENKO, K.I.
(1979) Theoretical Background and Constructing Computational Algorithms for MathematicalPhysical Problems. Nauka, Moscow. (In Russian.) BAKHVALOV, N.S.
(1971) On the optimality of linear methods for operator approximation in convex classes. Comput. Math. Math. Phys., 11, 244249. BICKEL, P.J.
(1981) Minimax estimation of the mean of a normal distribution when the parameter space is restricted. Ann. Statist., 9, 13011309. BJORCK, A.
(1990) Least squares methods. In Handbook of Numerical Analysis. Ed. by P.G. Ciarlet and J.L. Lions, Elsevier, NorthHolland, pp. 465652. BLUM, L., CUCKER, F., SHUB, M. AND SMALE, S.
(1995) Complexity and Real Computation: A Manifesto. To appear. BLUM, L., SHUB, M. AND SMALE, S.
(1989) On a theory of computation and complexity over the real numbers: NPcompleteness, recursive functions and universal machines. Bull. AMS (new series), 21, 146. BROWN, L.D. AND FELDMAN, I.
(1990) The minimax risk for estimating a bounded normal mean. Unpublished manuscript. CASELLA, G. AND STRAWDERMAN, W.E.
(1981) Estimating bounded normal mean. Ann. Statist., 9, 870878.
CIESIELSKI, Z.
(1975) On Levy's Brownian motion with severaldimensional time. In Probability Winter School. Ed. by Z. Ciesielski et al. Vol. 472 of Lecture Notes in Math., SpringerVerlag, New York, pp. 2956. CUTLAND, N.J.
(1980) Computability. Cambridge Univ. Press, Cambridge. DAUBECHIES, I.
(1992) Ten Lectures on Wavelets. Vol. 61 of CBMS-NSF Ser. in Appl. Math. SIAM, Philadelphia.
DONOHO, D.L.
(1994) Statistical estimation and optimal recovery. Ann. Statist., 22, 238270.
(1995) Denoising by softthresholding. IEEE Trans. on Inform. Th., 41, 613627. DONOHO, D.L. AND JOHNSTONE, I.M.
(1994) Minimax risk over l_p-balls for l_q-error. Probab. Theory Related Fields, 99, 277-303.
(1992) Minimax estimation via wavelet shrinkage. To appear in Ann. Statist.
DONOHO, D.L., JOHNSTONE, I.M., KERKYACHARIAN, G. AND PICARD, D.
(1995) Wavelet shrinkage: asymptopia? J. Roy. Stat. Soc., ser. B, 57, 301369.
DONOHO, D.L., LIU, R.C. AND MACGIBBON, K.B.
(1990) Minimax risk over hyperrectangles, and implications. Ann. of Statist., 18, 1416-1437.
EUBANK, R.L.
(1988) Spline Smoothing and Nonparametric Regression. Dekker, New York. GAL, S. AND MICCHELLI, C.A.
(1980) Optimal sequential and nonsequential procedures for evaluating a functional. Appl. Anal., 10, 105120. GIKHMAN, I.I. AND SKOROHOD, A.V.
(1965) Introduction to the Theory of Random Processes. Nauka, Moscow. (In Russian.) GOLOMB, M. AND WEINBERGER, H.F.
(1959) Optimal approximation and error bounds. In On Numerical Approximation. Ed. by R.F. Langer, Univ. of Wisconsin Press, Madison, pp. 117190. GOLUB, G.H., HEATH, M.T. AND WAHBA, G.
(1979) Validation as a method for choosing a good ridge parameter. Technometrics, 21, 215223.
GOLUBEV, G.K.
(1992) On sequential experimental designs for nonparametric estimation of smooth regression functions. Problems Inform. Transmission, 28, 76-79. (In Russian.)
GOLUBEV, G.K. AND NUSSBAUM, M.
(1990) A risk bound in Sobolev class regression. Ann. of Statist., 18, 758778.
GREVILLE, T.N.E.
(1969) Introduction to spline functions. In Theory and Applications of Spline Functions, Ed. by T.N.E. Greville, Academic Press, New York, pp. 135. HANSEN, P.C.
(1992) Analysis of discrete illposed problems by means of the Lcurve. SIAM Review, 34, 561580. HEINRICH, S.
(1993) Random approximation in numerical analysis. In Proc. of the Functional Analysis Conf., Essen 1991, Ed. by K.D. Bierstedt et al.. Marcel Dekker, New York, pp. 123171. HEINRICH, S. AND KERN, J.D.
(1991) Parallel informationbased complexity. J. Complexity, 7, 339370. HOLLADAY, J.C.
(1957) Smoothest curve approximation. Math. Tables Aids Computation, 11, 233243. IBRAGIMOV, I.A. AND HASMINSKI, R.Z.
(1982) Bounds for the risk of nonparametric regression estimates. Theory Probab. Appl., 28, 8194. (In Russian.)
(1984) On the nonparametric estimation of the value of a linear functional in Gaussian white noise. Theory Probab. Appl., 29, 1932. (In Russian.) JENSEN, K. AND WIRTH, N.
(1975) Pascal. User Manual and Report. SpringerVerlag, Berlin. KACEWICZ, B.Z.
(1987) Asymptotic error of algorithms for solving nonlinear problems. J. Complexity, 3, 4156.
(1990) On sequential and parallel solution of initial value problems. J. Complexity, 6, 136148. KACEWICZ, B.Z. AND KOWALSKI, M.A.
(1995a) Approximating linear functionals on unitary spaces in the presence of bounded data errors with applications to signal recovery. Intern. J. of Adaptive Control and Signal Processing, 9, 1931.
(1995b) Recovering linear operators from inaccurate data. J. Complexity, 11, 227239. KACEWICZ, B.Z., MILANESE, M., TEMPO, R. AND VICINO, A.
(1986) Optimality of central and projection algorithms for bounded uncertainty. Systems Control Lett., 8, 161171. KACEWICZ, B.Z. AND PLASKOTA, L.
(1990) On the minimal cost of approximating linear problems based on information with deterministic noise. Numer. Funct. Anal. Optimiz., 11, 511525. (1991) Noisy information for linear problems in the asymptotic setting. J. Complexity, 7, 3557.
(1992) Termination conditions for approximating linear problems with noisy information. Math. of Comp., 59, 503513.
(1993) The minimal cost of approximating linear operators using perturbed informationthe asymptotic setting. J. Complexity, 9, 113134. KADANE, J.B., WASILKOWSKI, G.W. AND WOZNIAKOWSKI, H.
(1988) On adaption with noisy information. J. Complexity, 4, 257276. KIEFER, J.
(1953) Sequential minimax search for a maximum. Proc. AMS, 4. 502505. KIELBASINSKI, A. AND SCHWETLICK, H.
(1988) Numerische Lineare Algebra. VEB Deutscher Verlag der Wissenschaften, Berlin. KIMELDORF, G.S. AND WAHBA, G.
(1970) A correspondence between Bayesian estimation of stochastic processes
and smoothing by splines. Ann. Math. Statist., 41, 495-502.
KO, KER-I.
(1986) Applying techniques of discrete complexity theory to numerical computation. In Studies in Complexity Theory. Ed. by R.V. Book, Pitman, London, pp. 1-62.
(1991) Complexity Theory of Real Functions. Birkhauser, Boston, Massachusetts. KON, M.A. AND NOVAK, E.
(1989) On the adaptive and continuous information problems. J. Complexity,
5,345362.
(1990) The adaption problem for approximating linear operators. Bull. AMS, 23, 159165. KORNEICHUK, N.P.
(1994) Optimization of active algorithms for recovery of monotonic functions from Holder's class. J. Complexity, 10, 265269.
KOWALSKI, M.A.
(1989) On approximation of bandlimited signals. J. Complexity, 5, 283302. KOWALSKI, M.A., SIKORSKI, K. AND STENGER, F.
(1995) Selected Topics in Approximation and Computation. Oxford Univ. Press, New York.
Kuo, H.H. (1975) Gaussian Measures in Banach Spaces. Vol. 463 of Lecture Notes in Math., SpringerVerlag, Berlin. LAWSON, C.L. AND HANSON, R.J.
(1974) Solving Least Squares Problems. PrenticeHall, Englewood Cliffs, New Jersey. LEE, D.
(1986) Approximation of linear operators on a Wiener space. Rocky Mount. J. Math., 16, 641659. LEE, D., PAVLIDIS, T. AND WASILKOWSKI, G.W.
(1987) A note on the tradeoff between sampling and quantization in signal processing. J. Complexity, 3, 359371. LEE, D. AND WASILKOWSKI, G.W.
(1986) Approximation of linear functionals on a Banach space with a Gaussian measure. J. Complexity, 2, 12-43.
LEVIT, B.Y.
(1980) On asymptotic minimax estimates of the second order. Theory Probab. Appl., 25, 552-568.
Li, K.C. (1982) Minimaxity of the method of regularization on stochastic processes. Ann. Statist., 10, 937942. MAGARILIL'YAEV, G.G.
(1994) Average widths of Sobolev classes on R"`. J. Approx. Th., 76, 6576. MAGARILIL'YAEV, G.G. AND OSIPENKO, K.Yu.
(1991) On optimal recovery of functionals from inaccurate data. Matem. Zametki, 50, 8593. (In Russian.) MAIOROV, V.
(1993) Average n-widths of the Wiener space in the L-infinity norm. J. Complexity, 9, 222-230.
(1994) Linear widths of function spaces equipped with the Gaussian measure. J. Approx. Th., 77, 74-88.
MARCHUK, A.G. AND OSIPENKO, K.Yu.
(1975) Best approximation of functions specified with an error at a finite number of points. Math. Notes, 17, 207212.
MARCUS, M. AND MINC, H.
(1964) A Survey of Matrix Theory and Matrix Inequalities. Allyn and Bacon, Boston, Massachusetts.
MATHÉ, P.
(1990) s-Numbers in information-based complexity. J. Complexity, 6, 41-66.
(1994) Approximation theory of stochastic numerical methods. Habilitation thesis. Institut fur Angewandte Analysis and Stochastic, Berlin. MELKMAN, A.A. AND MICCHELLI, C.A.
(1979) Optimal estimation of linear operators in Hilbert spaces from inaccurate data. SIAM J. Numer. Anal., 16, 87105. MEYER, Y.
(1990) Ondelettes et Opérateurs. Hermann, Paris.
MICCHELLI, C.A.
(1993) Optimal estimation of linear operators from inaccurate data: a second look. Numer. Algorithms, 5, 375390. MICCHELLI, C.A. AND RIVLIN, T.J.
(1977) A survey of optimal recovery. In Estimation in Approx. Th. Ed. by C.A. Micchelli and T.J. Rivlin, Plenum, New York, pp. 1-54.
MOROZOV, V.A.
(1984) Methods for Solving Incorrectly Posed Problems. Springer-Verlag, New York.
NEMIROVSKI, A.S.
(1994) On parallel complexity of nonsmooth convex optimization. J. Complexity, 10, 451463. NEMIROVSKI, A.S. AND YUDIN, D.B.
(1983) Problem Complexity and Method Efficiency in Optimization. Wiley and Sons, New York. NIKOLSKIJ, S.M.
(1950) On the estimation error of quadrature formulas. Uspekhi Mat. Nauk, 5, 165177. (In Russian.) NOVAK, E.
(1988) Deterministic and Stochastic Error Bounds in Numerical Analysis. Vol. 1349 of Lecture Notes in Math., SpringerVerlag, Berlin. (1993) Quadrature formulas for convex classes of functions. In H. Brass and G. Hammerlin, editors, Numerical Integration IV, Birkhauser, Basel.
(1995a) The adaption problem for nonsymmetric convex sets. J. Approx. Th., 82, 123134.
(1995b) The real number model in numerical analysis. J. Complexity, 11, 5773.
(1995c) On the power of adaption. Manuscript. NOVAK, E. AND RITTER, K. (1989) A stochastic analog to Chebyshev centers and optimal average case
algorithms. J. Complexity, 5, 6079. NUSSBAUM, M.
(1985) Spline smoothing in regression models and asymptotic efficiency in L2. Ann. Statist., 13, 984-997.
OSIPENKO, K.Yu.
(1994) Optimal recovery of periodic functions from Fourier coefficients given with an error. To appear in J. Complexity. PACKEL, E.W.
(1986) Linear problems (with extended range) have linear optimal algorithms. Aequationes Math., 30, 1825. PAPAGEORGIOU, A. AND WASILKOWSKI, G.W.
(1990) Average complexity of multivariate problems. J. Complexity, 5, 123. PARTHASARATHY, K.R.
(1967) Probability Measures on Metric Spaces. Academic Press, New York. PARZEN, E.
(1962) An approach to time series analysis. Ann. Math. Statist., 32, 951989.
(1963) Probability density functionals and reproducing kernel Hilbert spaces. In M. Rosenblatt, editor, Proc. Symposium on Time Series Analysis, Wiley, New York, pp. 155169. PASKOV, S.H.
(1993) Average case complexity of multivariate integration for smooth functions. J. Complexity, 9, 291312. PINKUS, A.
(1985) nWidths in Approximation Theory. SpringerVerlag, Berlin. PINSKER, M.S.
(1980) Optimal filtering of square integrable signals in Gaussian white noise. Problems Inform. Transmission, 16, 5268. (In Russian.) PLASKOTA, L.
(1989) Asymptotic error for the global maximum of functions in sdimensions. J. Complexity, 5, 369378. (1990) On average case complexity of linear problems with noisy information. J. Complexity, 6, 199230.
(1992) Function approximation and integration on the Wiener space with noisy data. J. Complexity, 8, 301323. (1993a) Optimal approximation of linear operators based on noisy data on functionals. J. Approx. Th., 73, 93105.
(1993b) A note on varying cardinality in the average case setting. J. Complexity, 9, 458470. (1994) Average case approximation of linear functionals based on information
with deterministic noise. J. of Comp. and Inform., 4, 2139.
(1995a) Average complexity for linear problems in a model with varying information noise. J. Complexity, 11, 240264. (1995b) On sequential designs in statistical estimation, or, how to benefit from noise. Int. Comp. Sci. Inst. at Berkeley. Report. RITTER, K.
(1994) Almost optimal differentiation using noisy data. To appear in J. Approx. Th.
RITTER, K., WASILKOWSKI, G.W. AND WOZNIAKOWSKI, H.
(1995) Multivariate integration and approximation for random fields satisfying SacksYlvisaker conditions. Ann. of Appl. Prob., 5, 518540. SACKS, J. AND YLVISAKER, D.
(1966) Designs for regression problems with correlated errors. Ann. Math. Statist., 37, 6689. (1968) Designs for regression problems with correlated errors; many param
eters. Ann. Math. Statist., 39, 4969. (1970) Designs for regression problems with correlated errors III. Ann. Math. Statist., 41, 20572074. SARD, A.
(1949) Best approximate integration formulas: best approximation formulas.
Amer. J. Math., 71, 8091. SCHOENBERG, I.J.
(1946) Contributions to the problem of approximation of equidistant data by analytic functions. Quart. Appl. Math., 4, 4499, 112141.
(1964a) On interpolation by spline functions and its minimum properties. Intern. Ser. Numer. Anal., 5, 109129. (1964b) Spline functions and the problem of graduation. Proc. Nat. Acad. Sci. USA, 52, 947949. (1973) Cardinal Spline Interpolation. Vol. 12 of CBMS, SIAM, Philadelphia.
SCHOENBERG, I.J. AND GREVILLE, T.N.E.
(1965) Smoothing by generalized spline functions. SIAM Rev., 7, 617. SCHONHAGE, A.
(1986) Equation solving in terms of computational complexity. In Proc. Intern. Congress Math., Berkeley, California. SCHUMAKER, L.
(1981) Spline Functions: Basic Theory. Wiley and Sons, New York. SIKORSKI, K. AND TROJAN, G.M.
(1990) Asymptotic near optimality of the bisection method. Numer. Math., 57, 421433. SKOROHOD, A.V.
(1974) Integration in Hilbert Spaces. SpringerVerlag, New York. SMOLYAK, S.A.
(1965) On optimal recovery of functions and functionals of them. PhD thesis, Moscow State Univ. SPECKMAN, P.
(1979) Minimax estimates of linear functionals in a Hilbert space. Unpublished manuscript. (1985) Spline smoothing and optimal rates of convergence in nonparametric regression models. Ann. Statist., 13, 970983. STECKIN, S.B. AND SUBBOTIN, Yu.N.
(1976) Splines in Numerical Mathematics. Nauka, Moscow. (In Russian.) STONE, C.J.
(1982) Optimal global rates of convergence for nonparametric regression. Ann. Statist., 10, 10401053. SUKHAREV, A.G.
(1986) On the existence of optimal affine methods for approximating linear functionals. J. Complexity, 2, 317322. SULDIN, A.V.
(1959) Wiener measure and its applications to approximation methods, I. Izv. Vyssh. Uchebn. Zaved. Mat., 13, 145158. (In Russian.) (1960) Wiener measure and its applications to approximation methods, II. Izv. Vyssh. Uchebn. Zaved. Mat., 18, 165179. (In Russian.) SUN, Y. AND WANG, C.
(1994) µ-Average n-widths on the Wiener space. J. Complexity, 10, 428-436.
TIKHONOV, A.N.
(1963) On regularization of illposed problems. Dokl. Akad. Nauk SSSR, 153, 4952.
TIKHONOV, A.N. AND ARSENIN, V.JA.
(1979) Methods for Solving Illposed Problems. Wiley and Sons, New York. TRAUB, J.F., WASILKOWSKI, G.W. AND WOZNIAKOWSKI, H.
(1983) Information, Uncertainty, Complexity. Addison-Wesley, Reading, Massachusetts.
(1988) Informationbased Complexity. Academic Press, New York. TRAUB, J.F. AND WOZNIAKOWSKI, H.
(1980) A General Theory of Optimal Algorithms. Academic Press, New York. TRIEBEL, H.
(1992) Theory of Function Spaces II. Birkhauser Verlag, Basel. TROJAN, G.M.
(1983) Asymptotic setting for linear problems. Unpublished manuscript. See also Traub et al. (1988), pp. 383400. VAKHANIA, N.N.
(1981) Probability Distributions on Linear Spaces. NorthHolland, New York. VAKHANIA, N.N., TARIELADZE, V.I. AND CHOBANYAN, S.A.
(1987) Probability Distributions on Banach Spaces. Reidel, Netherlands. VARADARAJAN, V.S.
(1961) Measures on topological spaces. Mat. Sbornik, 55, 35100. (In Russian.) WAHBA, G.
(1971) On the regression design problem of Sacks and Ylvisaker. Ann. Math. Statist., 42, 1035-1043.
(1990) Spline Models for Observational Data. Vol. 59 of CBMSNSF Ser. in Appl. Math., SIAM, Philadelphia. WASILKOWSKI, G.W.
(1983) Local average error. Columbia University Comp. Sc. Report. (1986) Information of varying cardinality. J. Complexity, 2, 204228. (1994) Integration and approximation of multivariate functions: average case complexity with isotropic Wiener measure. J. Approx. Th., 77, 212227. WASILKOWSKI, G.W. AND WOZNIAKOWSKI, H.
(1987) On optimal algorithms in an asymptotic model with Gaussian measure. SIAM J. Math. Anal., 3, 632647. (1993) There exists a linear problem with infinite combinatory cost. J. Complexity, 7, 326337.
(1995) Explicit cost bounds of algorithms for solving multivariate problems. J. Complexity, 11, 156. WEIHRAUCH, K.
(1987) Computability. SpringerVerlag, Berlin. WERSCHULZ, A.G.
(1987) An information-based approach to ill-posed problems. J. Complexity, 3, 270-301.
(1991) The Computational Complexity of Differential and Integral Equations. Oxford Univ. Press, Oxford.
WERSCHULZ, A.G. AND WOZNIAKOWSKI, H.
(1986) Are linear algorithms always good for linear problems? Aequationes Math., 30, 202212. WIENER, N.
(1923) Differential space. J. Math. and Phys., 2, 131174. WILANSKY, A.
(1978) Modern Methods in Topological Vector Spaces. McGrawHill, New York. WOZNIAKOWSKI, H.
(1991) Average case complexity of multivariate integration. Bull. AMS, 24, 185194.
(1992) Average case complexity of multivariate linear problems I, II. J. Complexity, 8, 337392.
(1994) Tractability of linear multivariate problems. J. Complexity, 10, 96128.
Author index
Arestov, V.V., 12 Aronszajn, N., 61 Arsenin, V.Ja., 61 Babenko, K.I., 118 Bakhvalov, N.S., 30, 31, 69 Bickel, P.J., 231 Bjorck, A., 61 Blum, L., 98 Brown, L.D., 231 Casella, G., 231 Ciesielski, Z., 138
Cutland, N.J., 98 Daubechies, I., 245 Donoho, D.L., 30, 231, 244, 245
Eubank, R.J., 244 Feldman, I., 231
Gal, S., 69 Gikhman, I.I., 138 Golomb, M., 60 Golub, G.H., 60 Golubev, G.K., 244, 245 Greville, T.N.E., 60 Hansen, P.C., 60 Hanson, R.J., 61 Hasminski, R.Z., 231, 245 Heinrich, S., 99, 232 Holladay, J.C., 60 Ibragimov, I.A., 231, 245
Jensen, K., 99 Johnstone, I.M., 244, 245
Kacewicz, B.Z., 30, 43, 61, 70, 97, 99, 118, 119, 278, 280, 281 Kadane, J.B., 169 Kern, J.D., 99 Kiefer, J., 2 Kielbasinski, A., 61 Kimeldorf, G.S., 161 Ko, Ker-I., 98 Kon, M.A., 70 Korneichuk, N.P., 70 Kowalski, M.A., 30, 85, 98 Kwapien, S., 289 Lawson, C.L., 61 Lee, D., 86, 118, 194 Levit, B.Y., 231 Li, K.-C., 231 Magaril-Il'yaev, G.G., 12, 30, 31, 131 Maiorov, V., 131 Marchuk, A.G., 30 Marcus, M., 86 Mathé, P., 85, 232 Melkman, A.A., 43 Meyer, Y., 245 Micchelli, C.A., 18, 30, 43, 69 Minc, H., 86 Morozov, V.A., 60
Nemirovski, A.S., 2, 99 Nikolskij, S.M., 2 Novak, E., 2, 70, 85, 98, 130, 232 Nussbaum, M., 244, 245 Osipenko, K.Yu., 12, 30, 31
Packel, E.W., 30 Papageorgiou, A., 86, 194 Parthasarathy, K.R., 131, 137 Parzen, E., 61
Paskov, S.H., 98 Pinkus, A., 85 Pinsker, M.S., 244 Plaskota, L., 43, 70, 85, 97, 118, 119, 146, 154, 169, 194, 195, 204, 212, 213, 245, 261, 278, 280, 281
Ritter, K., 130, 194, 195, 213, 232 Rivlin, T.J., 18, 30 Sacks, J., 194 Sard, A., 2 Schonhage, A., 98 Schoenberg, I.J., 60 Schumaker, L., 60 Schwetlick, H., 61 Sikorski, K., 278 Skorohod, A.V., 137, 138 Smolyak, S.A., 29 Speckman, P., 231, 244 Steckin, S.B., 60 Stone, C.J., 244 Strawderman, W.E., 231 Subbotin, Yu.N., 60 Sukharev, A.G., 30 Suldin, A.V., 194
Sun, Y., 131 Tikhonov, A.N., 61 Traub, J.F., 2, 12, 18, 30, 43, 69, 85, 97, 130, 146, 161, 204, 269 Triebel, H., 245 Trojan, G.M., 278 Vakhania, N.N., 61, 137, 146 Varadarajan, V.S., 131
Wahba, G., 60, 61, 161, 194 Wang, C., 131 Wasilkowski, G.W., 2, 86, 99, 130, 194, 204, 289 Weihrauch, K., 98 Weinberger, H.F., 60 Werschulz, A.G., 2, 30, 61 Wiener, N., 137 Wilansky, A., 12 Wirth, N., 99 Wozniakowski, H., 2, 18, 30, 43, 69, 85, 86, 99, 194, 289 Ylvisaker, D., 194 Yudin, D.B., 2
Subject index
a priori measure, 123 abstract Wiener space, 135, 138, 156, 159, 194, 227, 230, 260 adaption versus nonadaption, 66, 68, 165, 198, 199, 201, 204, 246 adaptive information, 65, 163, 270, 281 algorithm, 8, 271 complexity, 93, 198 α-smoothing spline, 57, 59
of compact operators, 71, 100, 238 of continuous operators, 170, 205 of functionals, 20, 148, 152, 220, 252 of Lipschitz functions, 81, 82, 106 on the Wiener space, 180, 195, 210212 operator, 81, 180 asymptotic setting, 268 average case setting, 121 averageworst case setting, 248 balanced set, 11 ,l3smoothing spline algorithm, 41 Brownian motion, 138 center of a measure, 126 of a set, 13, 15 central algorithm, 15, 129 centrosymmetric set, 15 characteristic functional, 134, 140142, 150
Chebyshev's inequality, 287, 291 classical Wiener measure, 125, 138, 180 combinatorial complexity, 2 combinatory cost, 92
computational complexity, 1 constants and variables (of a program), 89
convex set, 10 correlation operator, 132, 133 correspondence theorem, 159, 178, 179, 213, 230, 261 cost of approximation, 92, 198, 279 of primitive operations, 92 cost function, 92, 197 covariance kernel, 137139, 180, 183 crossvalidation, 60, 62 curve estimation, 244 cylindrical set, 135, 281
deterministic noise, 3 diameter of a measure, 131 of a set, 15 of information, 15-18, 131 differentiation, 195 ellipsoidal problem, 236 ε-approximation, 94 ε-complexity, 2, 94-96, 103, 105, 106, 112, 113, 117, 198, 202, 208, 210, 212, 213 equidistant sampling, 29, 185 error of an algorithm, 12, 125, 215, 248 of approximation, 8 exact information, 8, 28, 31, 62, 124, 190, 270 expression, 90 extended seminorm, 10
fixed point representation, 118 floating point representation, 12, 118
Fourier coefficients, 53
model of computation, 89, 197 Monte Carlo, 232, 246 multivariate approximation, 86, 114,
Gaussian measure of a ball, 285, 286 Gaussian noise, 139 Gram matrix, 45, 140 graph of information, 18
117, 194, 212 µ-orthogonality, 134, 145, 149 µ-semi-inner product, 134
HahnBanach theorem, 21 hardest one dimensional subproblem, 24, 154, 225
noise, 1, 3, 10 bounded in a Hilbert norm, 23, 44,
illposed problem, 61 independent observations, 125, 139, 144, 145, 149, 156 induced measure, 123, 140 information, 1 ecomplexity, 94, 96, 103, 106, 202, 204, 207, 211,
about f, 8, 66, 124, 270 cost, 92, 94, 199 distribution, 124 operator, 8 informationbased complexity (IBC), 1 integration of Lipschitz functions, 81, 82, 106 of periodic functions, 29, 226 on the Wiener space, 180, 195, 210212 operator, 81, 181 interpolatory algorithm, 16 isotropic Wiener measure, 138
joint measure, 125 ishard element, 67, 69, 95
Lagrange interpolation, 117 least squares algorithm, 51, 52, 54, 158, 242, 243, 265
problem, 50 linear (affine) versus nonlinear algorithms, 218, 226, 249, 259 linear information with Gaussian noise, 139, 145, 156 with uniformly bounded noise, 10, 12, 17, 252
linear problem, 3 with Gaussian measures, 139, 144, 156, 205
machine, 97 mean element, 129, 132, 133 minimal radius, 71, 7880, 82, 170, 171, 176, 177, 190 minimization problem, 73, 78, 171, 174 Minkowski functional, 12
252
bounded in absolute sense, 110, 112, 117, 119
bounded in relative sense, 68, 110, 113, 117, 119
bounded in sup-norm, 30, 64, 81, 84, 106, 109 noise level, 10, 29, 139, 149 noisy information, 1, 8, 124 nonadaptive information, 63, 163, 270 nonparametric regression, 244 normal equations, 51 nth error of algorithm, 271 nth information, 270 n-widths, 85, 131
observations, 3 one dimensional problem, 216, 249 one dimensional subproblem, 24, 150, 151
optimal a, 41, 46, 47, 49, 76, 157 optimal affine algorithm, 20, 24, 221, 223, 228
optimal algorithm, 13, 18, 50, 51, 54, 79, 84, 94, 125, 144, 145, 148, 152, 157, 221, 273, 283 optimal information, 71, 74, 79, 82, 94, 170, 171, 277, 288 optimal linear algorithm, 2931, 234, 236, 241, 253, 256, 263 ordinary spline algorithm, 33, 271 element, 32, 271 orthosymmetric set, 237
parallel computations, 99 partial information, 1 Pascal programming language, 89 permissible functionals, 64, 170 Pinsker's constant, 245 polynomial spline, 54, 57, 60 precision vector, 64, 270 precomputation, 98 priced information, 1 primitive operations, 89, 90 program, 89, 197, 279
pure noise, 148, 221, 222, 232, 253, 254, 266
quadratically convex set, 237 radius
of a measure, 126 of a set, 13
of information, 13, 16, 23, 29, 50, 57, 127129, 145, 149, 181, 185, 190,
smoothing spline, 9 algorithm, 34, 43, 156, 157, 159, 282 element, 34, 156 s-numbers, 85 solution operator, 8 statements (of a program), 90, 197 statistical estimation, 215 strong equivalence of functions, 103 of sequences, 79
193
random noise, 3 randomized algorithm, 232 real number model of computation, 98 realization of an algorithm, 93 rectangular problem, 234 regular conditional distribution, 126, 130, 141, 150, 166, 182 regularization, 52, 157 algorithm, 52, 54, 158 parameter, 52, 158 regularized solution, 53, 158 reproducing kernel, 58, 63 Hilbert space, 58, 59 r-fold Wiener measure, 136, 139, 195, 213
set of problem elements, 8 setting, 3
tensor product problem, 86 space, 61
termination criterion, 66 trace of an operator, 134 T-th minimal radius, 96, 100, 105, 203, 205, 207, 209-211 Turing machine, 97 wavelets, 245 weak equivalence of functions, 96 of sequences, 80 weak measure, 135 white noise, 140, 181, 231 Wiener sheet measure, 138 worst case setting, 5 worst-average case setting, 215