Contents

Eustasio del Barrio
Empirical and Quantile Processes in the Asymptotic Theory of Goodness-of-fit Tests
1 Introduction
2 Testing fit to a fixed distribution
  2.1 Some notes on principal component decompositions of quadratic statistics
3 Some tools from empirical processes theory
  3.1 Some inequalities
  3.2 The Central Limit Theorem
  3.3 Strong approximations
4 Testing fit to a family of distributions
  4.1 Adaptation of tests coming from the fixed-distribution setup
  4.2 The empirical process with estimated parameters
  4.3 Correlation and regression tests
5 Tests based on Wasserstein distance
6 Asymptotics for L2-functionals of the quantile process
  6.1 Weak convergence of L2 linear combinations of exponential r.v.'s
  6.2 Weak convergence of weighted L2 functionals of the quantile process
  6.3 Weighted Wasserstein tests of fit to location-scale families of distributions
References

Paul Deheuvels
Topics on Empirical Processes
1 Introduction, notation and preliminaries
  1.1 Introduction
  1.2 Distribution and quantile functions
  1.3 Topologies on spaces of measures and functions
  1.4 The quantile transform
2 Fluctuations of partial sums
  2.1 Some large deviations theory
  2.2 Martingale inequalities
  2.3 Some useful examples of large deviation inequalities
  2.4 Strong approximations of partial sums of i.i.d. random variables
  2.5 The Erdős-Rényi theorem
3 Empirical Processes
  3.1 Uniform empirical distribution and quantile functions
  3.2 Uniform empirical and quantile processes
  3.3 Some further martingale inequalities
  3.4 Relations between empirical and Poisson processes
  3.5 Strong approximations of empirical and quantile processes
  3.6 Some results for weighted processes
  3.7 Finkelstein's theorem via invariance principles
  3.8 Local and tail empirical process
  3.9 Modulus of continuity of αn and βn
  3.10 The Bahadur-Kiefer representation
  3.11 Application to density estimation
4 Auxiliary results
  4.1 Some Gaussian process theory
  4.2 A functional LIL for superpositions of Gaussian processes
  4.3 Karhunen-Loève expansions
  4.4 The RKHS of the Wiener process and Brownian bridge
  4.5 KL expansions for weighted Wiener processes and Brownian bridges
  4.6 Bessel functions
References

Sara van de Geer
Oracle Inequalities and Regularization
1 Statistical models
  1.1 Parametric models
  1.2 The empirical distribution
  1.3 Regularization
  1.4 Bibliographical remarks
2 M-estimators
  2.1 Some examples
  2.2 General framework
  2.3 Estimation and approximation error
  2.4 Where empirical process theory comes in
  2.5 Some first results, assuming ready-to-use empirical process theory
  2.6 Balancing estimation and approximation error
  2.7 Bibliographical remarks
3 The sequence space formulation
  3.1 Reformulation of the regression problem
  3.2 Estimating the mean of a normal vector
  3.3 A collection of models
  3.4 The model an oracle would select
  3.5 Hard- and soft-thresholding
  3.6 A probability inequality for the empirical process
  3.7 Bibliographical remarks
4 Overruling the variance
  4.1 Estimation and approximation error
  4.2 Finite models
  4.3 Nested, finite models
  4.4 General penalties
  4.5 Application to the "classical" penalty of Chapter 1
  4.6 Bibliographical remarks
5 The ℓ1-penalty
  5.1 Robust regression
  5.2 Density estimation
  5.3 Binary classification
  5.4 The behavior on Ξn
  5.5 Bibliographical remarks
6 Tools from empirical process theory
  6.1 Concentration inequalities
  6.2 Symmetrization
  6.3 Contraction principle
  6.4 Weighted empirical processes
  6.5 The case of a Lipschitz transformation of a linear space
  6.6 Modulus of continuity of the empirical process
  6.7 Bibliographical remarks
References
Empirical and Quantile Processes in the Asymptotic Theory of Goodness-of-fit Tests

Eustasio del Barrio

Abstract. This text contains the material presented at the European Mathematical Society Summer School on Theory and Statistical Applications of Empirical Processes in Laredo, Spain in August-September 2004.
1. Introduction

In 1900 Karl Pearson proposed the first test of goodness of fit: the $\chi^2$ test. The subsequent research devoted to enhancements of this elementary goodness-of-fit procedure became a major source of motivation for the development of key areas in Probability and Statistics such as the theory of weak convergence in general spaces and the asymptotic theory of empirical processes. In this course we will analyze some suggestive aspects which arise from the development of the asymptotic theory of goodness-of-fit tests over the last century. We will pay special attention to the parallel evolution of the theory of empirical processes and the asymptotic theory of goodness-of-fit tests. Doubtless, this evolution is a good indicator of the vast transformation that Probability and Statistics experienced during that century. Certainly, the names that contributed to the theory are the main guarantee for this assertion. Pearson, Fisher, Cramér, von Mises, Kolmogorov, Smirnov, Feller, ... laid the foundations of the theory. In some cases, the mathematical derivation of the asymptotic distribution of goodness-of-fit tests in that period had the added merit that, in a certain sense, the limit law was blindly pursued. In Mathematics the main difficulty in showing convergence consists, without doubt, in obtaining a convincing candidate for the limit. Thus, proofs in that period could be considered as major pieces of precision and inventiveness. A systematic method of handling adequate candidates for the limit law begins in 1950 with the heuristic work by Doob [36], made precise by Donsker through the Invariance Principle. The subsequent construction of adequate metric spaces and the development of the corresponding weak convergence theory as the right
probabilistic setup for the study of asymptotic distributions had a wide and fast diffusion, with notable advances due to Prohorov and Skorohod among others. The contribution of Billingsley's book [7] to this diffusion must also be pointed out. The study of Probability in Banach spaces has been another source of useful results for the goodness-of-fit theory. The names of Varadhan, Dudley, Araujo, Giné, Zinn, Ledoux, Talagrand, ... are necessary references for anyone interested in asymptotics in Statistics. For example, the Central Limit Theorem in Hilbert spaces had a main role in the derivation of the asymptotic behavior of Cramér-von Mises type statistics. Last, we must indicate the significance of the "Hungarian school", developing the strong approximation techniques initiated by Skorohod with his "embedding". Breiman's book [8] had the merit of initially spreading Skorohod's embedding. Now, the strong approximations due to Komlós, Major, Tusnády, M. and S. Csörgő, Révész, Deheuvels, Horváth, Mason, ..., are an invaluable tool in the study of asymptotics in Statistics, as we will point out in this course.
2. Testing fit to a fixed distribution

The simplest goodness-of-fit problem consists of testing fit to a single fixed distribution, namely, given a random sample of real r.v.'s $X_1, X_2, \dots, X_n$ with common d.f. $F$, testing the null hypothesis $H_0 : F = F_0$ for a fixed d.f. $F_0$. While this procedure is usually of limited interest in applications, the solutions proposed to this problem provided the main idea in subsequent generalizations designed for testing fit to composite null hypotheses.

Pearson's chi-squared test can be considered as the first approach to the problem of testing fit to a fixed distribution. The solution proposed by Pearson consisted of dividing the real line into $k$ disjoint categories or "cells" $C_1, \dots, C_k$ into which data would fall, under the null hypothesis, with probabilities $p_1, \dots, p_k$. That is, if $H_0$ were true, then $P(X_1 \in C_i) = p_i$, $i = 1, \dots, k$. If $O_i$ is the number of observations in cell $i$, then $O_i$ has a binomial distribution with parameters $n$ and $p_i$; hence, the De Moivre-Laplace central limit theorem (C.L.T.) states that
$$(np_i(1-p_i))^{-1/2}(O_i - np_i) \to_w N(0,1).$$
The multivariate C.L.T. shows that, if $l \le k$, then $B_l = n^{-1/2}(O_1 - np_1, \dots, O_l - np_l)^T$ has a limit distribution which is centered Gaussian and has covariance matrix $\Sigma_l$ whose $(i,j)$ element, $\sigma_{i,j}$, satisfies $\sigma_{i,j} = -p_i p_j$ for $i \ne j$, and $\sigma_{i,i} = p_i(1-p_i)$. On the other hand, if $p_i > 0$, $i = 1, \dots, k$, then $\Sigma_{k-1}$ is nondegenerate and $\Sigma_{k-1}^{-1}$ has $(i,j)$ element $\nu_{i,j}$ satisfying $\nu_{i,j} = p_k^{-1}$ for $i \ne j$, and $\nu_{i,i} = p_i^{-1} + p_k^{-1}$. Simple matrix algebra shows that $B_{k-1}^T \Sigma_{k-1}^{-1} B_{k-1}$ converges in law to a $\chi^2_{k-1}$ distribution. Further, straightforward computations show that
$$\chi^2 := \sum_{j=1}^{k} \frac{(O_j - np_j)^2}{np_j} = B_{k-1}^T \Sigma_{k-1}^{-1} B_{k-1},$$
providing thus a well-known result in the asymptotic theory of tests of fit:
Theorem 2.1. Under $H_0$, $\chi^2$ has asymptotic distribution $\chi^2_{k-1}$.

Theorem 2.1 reduces the problem of testing fit to a fixed distribution to the analysis of a multinomial distribution, thus providing a widely applicable and easy to use method for testing fit which immediately carries over to the multivariate setup. Moreover, this test also allows some freedom in choosing the number, the location or the size of the cells $C_1, \dots, C_k$. This point will be discussed in the next section. However, as pointed out by many authors (see, e.g., [68]), consideration of only the cell frequencies when $F$ is continuous produces a loss of information that results in lack of power (the $\chi^2$ statistic will not distinguish two different distributions sharing the same cell probabilities). Therefore, in order to improve our method for testing fit we should try to make use of the complete information provided by the data. On the other hand, the multivariate C.L.T. and elementary matrix algebra were the only tools needed in the derivation of the asymptotic distribution in Theorem 2.1. This will not be the case when handling more complicated statistics.

A way to improve Pearson's statistic consists of employing a functional distance to measure the discrepancy between the hypothesized d.f. $F_0$ and the empirical d.f. $F_n$. The first representatives of this method were proposed in the late 20's and in the 30's. Cramér [15] and, in a more general form, von Mises [103] proposed
$$\omega_n^2 = n \int_{-\infty}^{\infty} (F_n(x) - F_0(x))^2 \rho(x)\, dx$$
for some suitable weight function $\rho$ as an adequate measure of discrepancy. Kolmogorov [53] studied
$$D_n = \sqrt{n} \sup_{-\infty < x < \infty} |F_n(x) - F_0(x)|$$
…
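As a small numerical companion to the three discrepancy measures named in this section (Pearson's $\chi^2$, the Cramér-von Mises $\omega_n^2$ and Kolmogorov's $D_n$), the following sketch computes each of them from data. It is only an illustration under stated assumptions: the function names are hypothetical, $F_0$ is assumed continuous, and for $\omega_n^2$ the weight is assumed to be $\rho = F_0'$, which yields the classical computing formula $\frac{1}{12n} + \sum_i (U_{(i)} - \frac{2i-1}{2n})^2$ with $U_{(i)} = F_0(X_{(i)})$.

```python
import math

def pearson_chi2(observed, probs):
    """Pearson's statistic: sum_j (O_j - n p_j)^2 / (n p_j)."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

def cramer_von_mises(sample, cdf0):
    """omega_n^2 with the assumed weight rho = F_0', via the classical
    computing formula 1/(12n) + sum_i (U_(i) - (2i-1)/(2n))^2."""
    u = sorted(cdf0(x) for x in sample)
    n = len(u)
    return 1.0 / (12 * n) + sum(
        (ui - (2 * i - 1) / (2 * n)) ** 2 for i, ui in enumerate(u, start=1)
    )

def kolmogorov(sample, cdf0):
    """D_n = sqrt(n) * sup_x |F_n(x) - F_0(x)| for a continuous F_0."""
    u = sorted(cdf0(x) for x in sample)
    n = len(u)
    d_plus = max(i / n - ui for i, ui in enumerate(u, start=1))
    d_minus = max(ui - (i - 1) / n for i, ui in enumerate(u, start=1))
    return math.sqrt(n) * max(d_plus, d_minus)

# Hypothetical toy data: three cells with equal probabilities, and a
# three-point sample tested against the uniform d.f. on (0, 1).
uniform_cdf = lambda x: x
print(pearson_chi2([18, 22, 20], [1 / 3, 1 / 3, 1 / 3]))
print(cramer_von_mises([0.7, 0.1, 0.4], uniform_cdf))
print(kolmogorov([0.7, 0.1, 0.4], uniform_cdf))
```

The $\chi^2$ function only sees cell counts, while the other two use every observation through $F_0(X_i)$, which is the point made above about the loss of information in Pearson's statistic.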
The Hoffmann-Jørgensen inequalities bound moments of sums by the corresponding moment of a maximum plus a quantile of the sum:
Theorem 3.3. For each $p > 0$ there exist constants $K_p$, $c_p$ such that, if $X_i$ are i.i.d. or independent and symmetric r.v.'s, then
$$\|S_n\|_p \le K_p\,[\,t_0 + \|X_n^*\|_p\,],$$
where $\|Y\|_p = (E|Y|^p)^{1/p}$, $X_n^* = \max_{1\le i\le n}|X_i|$ and
$$t_0 = \inf\{t > 0 : \Pr(|S_n| > t) \le c_p\}.$$
This inequality gives the following result on comparison of moments:

Theorem 3.4. For each $0 < q < p$ there exists a constant, $K$, such that, if $X_i$ are i.i.d. or independent and symmetric r.v.'s, then
$$\|S_n\|_p \le K\,[\,\|S_n\|_q + \|X_n^*\|_p\,].$$
The following randomization/symmetrization result by Rademacher variables (symmetric r.v.'s taking values in $\{-1, 1\}$) shows that the above theorem is also valid for centered, independent r.v.'s.

Theorem 3.5. Let $X_i$ be independent, centered r.v.'s in $L_p$, $p \ge 1$, and let $\{\varepsilon_i\}$ be independent Rademacher r.v.'s, independent of the $X_i$. Then
$$2^{-p} E\Big|\sum_{i=1}^n \varepsilon_i X_i\Big|^p \le E\Big|\sum_{i=1}^n X_i\Big|^p \le 2^p E\Big|\sum_{i=1}^n \varepsilon_i X_i\Big|^p$$
and
$$E (S_n^*)^p \le 2^{p+1} E\Big|\sum_{i=1}^n \varepsilon_i X_i\Big|^p,$$
where $S_n^* = \max_{1\le k\le n}|S_k|$.
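The two-sided symmetrization bound of Theorem 3.5 can be checked exactly on a small discrete example. The sketch below is only an illustration: the two-point law, the values of $n$ and $p$, and the function names are hypothetical choices, not taken from the text. It enumerates all outcomes of $n$ i.i.d. centered two-point variables and all Rademacher sign patterns, and compares $E|\sum X_i|^p$ with $E|\sum \varepsilon_i X_i|^p$.

```python
from itertools import product

def moment_of_sum(values, probs, n, p):
    """E|X_1 + ... + X_n|^p by exact enumeration for i.i.d. discrete X_i."""
    total = 0.0
    for idx in product(range(len(values)), repeat=n):
        weight, s = 1.0, 0.0
        for i in idx:
            weight *= probs[i]
            s += values[i]
        total += weight * abs(s) ** p
    return total

def moment_of_randomized_sum(values, probs, n, p):
    """E|eps_1 X_1 + ... + eps_n X_n|^p with independent Rademacher signs."""
    total = 0.0
    for idx in product(range(len(values)), repeat=n):
        weight = 1.0
        for i in idx:
            weight *= probs[i]
        for signs in product((-1, 1), repeat=n):
            s = sum(e * values[i] for e, i in zip(signs, idx))
            total += weight * abs(s) ** p / 2 ** n
    return total

# Hypothetical centered two-point law: X = 2 w.p. 1/3 and X = -1 w.p. 2/3.
values, probs = (2.0, -1.0), (1 / 3, 2 / 3)
n, p = 4, 3
plain = moment_of_sum(values, probs, n, p)
randomized = moment_of_randomized_sum(values, probs, n, p)
# Checks 2^{-p} E|sum eps_i X_i|^p <= E|sum X_i|^p <= 2^p E|sum eps_i X_i|^p.
print(2 ** -p * randomized <= plain <= 2 ** p * randomized)
```

Exact enumeration is used instead of simulation so that the comparison involves no Monte Carlo error; it is feasible only for small $n$.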
As an example of the usefulness of these inequalities we give the following result on the $L_1$-norm of the empirical process on the line.

Theorem 3.6. Let $X, X_i$, $i \in \mathbb{N}$, be i.i.d. random variables with common distribution $F$. Let
$$Y(t) := I_{X>t} - \Pr\{X > t\}, \quad -\infty < t < \infty, \qquad (3.1)$$
and let $Y_i$, $i \in \mathbb{N}$, denote the processes obtained by replacing $X$ by $X_i$ in (3.1). Then the sequence
$$\Big\| \sum_{i=1}^n \frac{Y_i}{\sqrt{n}} \Big\|_{L_1} = \sqrt{n} \int_{-\infty}^{\infty} |F_n(t) - F(t)|\, dt, \quad n \in \mathbb{N},$$
is stochastically bounded if and only if
$$\Lambda_{2,1}(X) := \int_0^{\infty} \sqrt{\Pr\{|X| > t\}}\, dt < \infty.$$
Proof. The sufficiency part follows easily from Markov's inequality. For the necessity part we can assume, w.l.o.g., that $X \ge 0$. It is now convenient to write
$$Z(t) := I_{X>t}, \quad Z_i(t) = I_{X_i>t}, \quad i \in \mathbb{N},\ t \in \mathbb{R},$$
so that $Y(t) = Z(t) - EZ(t)$ and likewise for $Y_i$. The stochastic boundedness hypothesis simply asserts
$$\lim_{M\to\infty} \sup_n \Pr\Big\{ \frac{1}{\sqrt{n}} \Big\| \sum_{i=1}^n (Z_i - EZ_i) \Big\|_{L_1} > M \Big\} = 0.$$
The Lévy type inequality for i.i.d. random vectors then implies that
$$\lim_{M\to\infty} \sup_n \Pr\Big\{ \frac{1}{\sqrt{n}} \max_{1\le i\le n} \|Z_i - EZ_i\|_{L_1} > M \Big\} = 0.$$
The classical inequality for independent random variables, say $\xi_i$,
$$\Pr\Big\{ \max_i |\xi_i| > t \Big\} \ge 1 - \exp\Big( -\sum_i \Pr\{|\xi_i| > t\} \Big)$$
then gives that there is a constant $M < \infty$ such that
$$\sup_n\, n \Pr\Big\{ \frac{1}{\sqrt{n}} \|Z - EZ\|_{L_1} > M \Big\} < \infty,$$
or, equivalently,
$$\Lambda_{2,\infty}(Z - EZ) := \sup_{t>0} t^2 \Pr\{ \|Z - EZ\|_{L_1} > t \} < \infty.$$
A first consequence of this is that
$$\begin{aligned}
E \max_{1\le i\le n} \frac{\|Z_i - EZ_i\|_{L_1}}{\sqrt{n}}
&= \frac{1}{\sqrt{n}} \int_0^{\infty} \Pr\Big\{ \max_{1\le i\le n} \|Z_i - EZ_i\|_{L_1} > t \Big\}\, dt \\
&\le 1 + \sqrt{n} \int_{\sqrt{n}}^{\infty} \Pr\{ \|Z_i - EZ_i\|_{L_1} > t \}\, dt \\
&\le 1 + \sqrt{n}\, \Lambda_{2,\infty}(Z - EZ) \int_{\sqrt{n}}^{\infty} t^{-2}\, dt \\
&= 1 + \Lambda_{2,\infty}(Z - EZ) < \infty. \qquad (3.2)
\end{aligned}$$
A second consequence is that $EX = E\|Z\|_{L_1} < \infty$. To wit, if $t_0 < \infty$ is such that $\Pr\{X > t_0\} \le 1/2$, then, since finiteness of $\Lambda_{2,\infty}$ obviously implies $E\|Z - EZ\|_{L_1} < \infty$, we have
$$\infty > E\|Z - EZ\|_{L_1} = E \int_0^{\infty} |I_{X>t} - \Pr\{X > t\}|\, dt \ge E\, I_{X>t_0} \int_{t_0}^{X} \frac{1}{2}\, dt = \frac{1}{2} E(X - t_0)^+,$$
so that $EX = t_0 + E(X - t_0) \le t_0 + E(X - t_0)^+ < \infty$. Now, Hoffmann-Jørgensen's inequality gives that for every $r > 0$ there exist finite positive constants $c_i$, $i = 1, 2$, depending only on $r$, such that
$$E \Big\| \frac{\sum_{i=1}^n (Z_i - EZ_i)}{\sqrt{n}} \Big\|_{L_1}^r \le c_1 \Big[ E \max_{1\le i\le n} \frac{\|Z_i - EZ_i\|_{L_1}^r}{n^{r/2}} + t_{0,n}^r \Big],$$
where
$$t_{0,n} = \inf\Big\{ t : \Pr\Big\{ \Big\| \frac{\sum_{i=1}^n (Z_i - EZ_i)}{\sqrt{n}} \Big\|_{L_1} > t \Big\} \le c_2 \Big\}.$$
On one hand, the stochastic boundedness hypothesis implies $\sup_n t_{0,n} < \infty$, and on the other, inequality (3.2) asserts the finiteness of the sup over $n$ of the first summand at the right-hand side of the last inequality for $r = 1$. We thus conclude that
$$\sup_n E \Big\| \frac{\sum_{i=1}^n (Z_i - EZ_i)}{\sqrt{n}} \Big\|_{L_1} < \infty. \qquad (3.3)$$
Now, there exist positive finite constants $C_1$ and $C_2$ such that
$$\mathcal{L}(\xi) = \mathrm{Bin}(n, p) \ \text{with}\ \frac{C_1}{n} \le p \le \frac{1}{2} \quad \text{implies} \quad E|\xi - E\xi| \ge C_2 \sqrt{np}$$
(this follows, for instance, from symmetrization and Corollary 3.4 in Giné and Zinn, 1983). Applying this to the empirical process yields
$$\sqrt{\Pr\{X > t\}} \le \frac{1}{C_2} E \Big| \frac{\sum_{i=1}^n (I_{X_i>t} - \Pr\{X_i > t\})}{\sqrt{n}} \Big| \quad \text{for } \mathrm{med}(X) < t < Q(1 - C_1/n).$$
Then, integrating and applying inequality (3.3), we obtain
$$\sup_n \int_{\mathrm{med}(X)}^{Q(1-C_1/n)} \sqrt{\Pr\{X > t\}}\, dt \le \frac{1}{C_2} \sup_n E \int_{\mathrm{med}(X)}^{Q(1-C_1/n)} \Big| \frac{\sum_{i=1}^n (I_{X_i>t} - \Pr\{X_i > t\})}{\sqrt{n}} \Big|\, dt < \infty.$$
Since $Q(1 - C_1/n) \to \operatorname{ess\,sup} X$ as $n \to \infty$, this last inequality gives
$$\int_0^{\infty} \sqrt{\Pr\{X > t\}}\, dt < \infty,$$
that is, $\Lambda_{2,1}(X) < \infty$, proving the theorem.
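To see what the condition of Theorem 3.6 means in a concrete case, the sketch below evaluates $\Lambda_{2,1}(X) = \int_0^\infty \sqrt{\Pr\{|X| > t\}}\,dt$ (the form of the functional that the proof arrives at) for a Pareto-type tail. The law is an assumption made for illustration only: $\Pr\{X > t\} = t^{-a}$ for $t \ge 1$ and $1$ for $t < 1$; the function names are hypothetical. The truncated integral has a closed form, so no numerical quadrature is needed.

```python
import math

def lambda21_pareto_truncated(a, upper):
    """int_0^upper sqrt(P(X > t)) dt for the assumed tail P(X > t) = t^-a, t >= 1."""
    if upper <= 1:
        return upper                      # the integrand is 1 on [0, 1]
    if a == 2:
        return 1.0 + math.log(upper)      # int_1^upper t^-1 dt
    e = 1.0 - a / 2.0
    return 1.0 + (upper ** e - 1.0) / e   # int_1^upper t^(-a/2) dt

def lambda21_pareto(a):
    """Limit of the truncated integral as upper -> infinity (finite iff a > 2)."""
    return 1.0 + 2.0 / (a - 2.0) if a > 2 else math.inf

for a in (4.0, 2.5, 2.0):
    print(a, lambda21_pareto_truncated(a, 1e8), lambda21_pareto(a))
```

For this family the integral converges exactly when $a > 2$; at $a = 2$ the truncated value grows like $\log$ of the truncation point, so the functional is infinite.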
We finally include here a different type of maximal inequality, the Birnbaum-Marshall inequality for martingales. It can be used to bound the tails of weighted supremum functionals of the empirical process as we will see later.

Lemma 3.7. Let $\{|S_k|, \mathcal{F}_k\}_{0\le k\le n}$ be a submartingale with $S_0 = 0$. Let $b_1 \ge \cdots \ge b_n \ge b_{n+1} = 0$ and let $r \ge 1$. Call $M_n = \max_{1\le k\le n} b_k |S_k|$. Then for all $\lambda > 0$
$$P(M_n \ge \lambda) \le \frac{1}{\lambda^r} \sum_{k=1}^n (b_k^r - b_{k+1}^r)\, E|S_k|^r = \frac{1}{\lambda^r} \sum_{k=1}^n b_k^r\, [E|S_k|^r - E|S_{k-1}|^r].$$

Proof. Assume without loss of generality $r = 1$ and $S_k \ge 0$. Set $A_k = (\max_{1\le j<k} b_j S_j < \lambda \le b_k S_k)$. Then
$$\begin{aligned}
\lambda P(M_n \ge \lambda) = \lambda \sum_{k=1}^n P(A_k)
&\le \sum_{k=1}^n \int_{A_k} b_k S_k\, dP = \sum_{k=1}^n \sum_{j=k}^n (b_j - b_{j+1}) \int_{A_k} S_k\, dP \\
&\le \sum_{k=1}^n \sum_{j=k}^n (b_j - b_{j+1}) \int_{A_k} S_j\, dP = \sum_{j=1}^n (b_j - b_{j+1}) \int_{M_j \ge \lambda} S_j\, dP \\
&\le \sum_{k=1}^n (b_k - b_{k+1})\, E|S_k|.
\end{aligned}$$
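The bound of Lemma 3.7 can be illustrated numerically. The sketch below is a hedged example, not part of the original argument: it assumes a Gaussian random walk $S_k$ (so that $|S_k|$ is a submartingale and $E|S_k|^2 = k$), hypothetical weights $b_k = 1/k$, and $r = 2$. It evaluates both expressions of the bound, which agree by summation by parts, and compares them with a Monte Carlo estimate of $P(M_n \ge \lambda)$.

```python
import random

def bound_first_form(b, moments, lam, r):
    """(1/lam^r) * sum_k (b_k^r - b_{k+1}^r) E|S_k|^r, with b_{n+1} = 0."""
    ext = list(b) + [0.0]
    return sum(
        (ext[k] ** r - ext[k + 1] ** r) * moments[k] for k in range(len(b))
    ) / lam ** r

def bound_second_form(b, moments, lam, r):
    """(1/lam^r) * sum_k b_k^r (E|S_k|^r - E|S_{k-1}|^r), with S_0 = 0."""
    prev = [0.0] + list(moments[:-1])
    return sum(bk ** r * (m - q) for bk, m, q in zip(b, moments, prev)) / lam ** r

def simulate_tail(b, lam, reps, seed=0):
    """Monte Carlo estimate of P(max_k b_k |S_k| >= lam) for a Gaussian walk."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        s = 0.0
        for bk in b:
            s += rng.gauss(0.0, 1.0)
            if bk * abs(s) >= lam:
                hits += 1
                break
    return hits / reps

n, lam, r = 50, 1.5, 2
b = [1.0 / k for k in range(1, n + 1)]            # assumed weights b_k = 1/k
moments = [float(k) for k in range(1, n + 1)]     # E|S_k|^2 = k for this walk
print(bound_first_form(b, moments, lam, r), bound_second_form(b, moments, lam, r))
print(simulate_tail(b, lam, reps=2000))
```

With $r = 2$ and these weights the second form reduces to $\lambda^{-2}\sum_k k^{-2}$, which makes the role of the increments $E|S_k|^r - E|S_{k-1}|^r$ visible.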
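Lemma 3.8 (Birnbaum and Marshall). Let $\{|S_t|, \mathcal{F}_t\}_{0\le t\le\theta}$ be a submartingale with right-continuous sample paths. Assume $S(0) = 0$ and $\nu(t) = ES^2(t) < \infty$ on $[0, \theta]$. Let $q > 0$ be a nondecreasing and right-continuous function on $[0, \theta]$. Then
$$P\Big( \sup_{0\le t\le\theta} \frac{|S(t)|}{q(t)} \ge 1 \Big) \le \int_0^{\theta} \frac{1}{q(t)^2}\, d\nu(t).$$

Proof. By right-continuity of sample paths and $S(0) = 0$,
$$P\Big( \sup_{0\le t\le\theta} \frac{|S(t)|}{q(t)} \le 1 \Big) = P\Big( \max_{0\le i\le 2^n} \frac{|S(\theta i/2^n)|}{q(\theta i/2^n)} \le 1 \ \text{for all } n \Big)$$
…

… $\varepsilon > 0$, $\max_{1\le k\le k_n} P(|X_{n,k}| > \varepsilon) \to 0$, as $n \to \infty$.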
It turns out that, under infinitesimality, the class of possible limit laws of $S_n - a_n$, where $a_n$ are possibly needed centering constants, is the class of the so-called infinitely divisible laws. A law $\gamma$ is said to be infinitely divisible if, for every $n$, it can be expressed as an $n$th convolution power: $\gamma = \underbrace{\gamma_n * \cdots * \gamma_n}_{n}$ (that is, $\gamma$ is the law of the sum of $n$ i.i.d. r.v.'s). Infinitely divisible laws can be characterized as the convolution of a Gaussian law and a Poissonization of a Lévy measure: $\gamma = N(\nu, \sigma^2) * c_\tau \mathrm{Pois}\,\mu$. If $\mu$ is a finite measure on $\mathbb{R}$, then the associated Poissonization is
$$\mathrm{Pois}\,\mu := e^{-\mu(\mathbb{R})} \sum_{n=0}^{\infty} \underbrace{\mu * \cdots * \mu}_{n} / n!.$$
If $\mu$ is a positive measure, not necessarily finite, but it integrates $\min(1, x^2)$ then the weak limit
$$w\text{-}\lim_{\delta\downarrow 0} \Big( \mathrm{Pois}(\mu|_{\{|x|>\delta\}}) * \delta_{-\int_{\delta<|x|\le\tau} x\, d\mu(x)} \Big) =: c_\tau \mathrm{Pois}\,\mu$$
exists and defines the ($\tau$-centered) Poissonization of $\mu$. Positive measures on the real line satisfying $\int \min(1, x^2)\, d\mu(x) < \infty$ are called Lévy measures. Thus $c_\tau \mathrm{Pois}\,\mu$ is defined for Lévy measures. The characteristic function of the infinitely divisible law $\gamma = N(\nu, \sigma^2) * c_\tau \mathrm{Pois}\,\mu$ is
$$\Psi(t) = \exp\Big\{ it\nu - \frac{\sigma^2 t^2}{2} + \int (e^{itx} - 1 - itx\, I\{|x| \le \tau\})\, d\mu(x) \Big\}.$$
We are now ready to state the CLT on the real line. We will denote $X_{n,k,\delta} = X_{n,k} I\{|X_{n,k}| \le \delta\}$ and $S_{n,\delta} = \sum_{1\le k\le k_n} X_{n,k,\delta}$.
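For a finite measure with finitely many atoms, the series definition of $\mathrm{Pois}\,\mu$ given above can be checked against the exponential form of its characteristic function. The following sketch is an illustration under assumptions: $\mu = \sum_j m_j \delta_{x_j}$ with hypothetical atoms and masses, no centering term (so it corresponds to the uncentered Poissonization, not to $c_\tau \mathrm{Pois}\,\mu$), and hypothetical function names. It uses the fact that the $n$-fold convolution $\mu^{*n}$ has Fourier transform $\hat\mu(t)^n$.

```python
import cmath
import math

def mu_hat(atoms, t):
    """Fourier transform of the finite measure mu = sum_j m_j * delta_{x_j}."""
    return sum(m * cmath.exp(1j * t * x) for x, m in atoms)

def pois_cf_series(atoms, t, terms=60):
    """Ch.f. of Pois(mu) from the series e^{-mu(R)} sum_n mu^{*n} / n!,
    using that the transform of mu^{*n} is mu_hat(t) ** n."""
    total_mass = sum(m for _, m in atoms)
    z = mu_hat(atoms, t)
    term, acc = 1.0 + 0j, 0j
    for n in range(terms):
        acc += term              # adds z**n / n!
        term *= z / (n + 1)
    return math.exp(-total_mass) * acc

def pois_cf_closed(atoms, t):
    """Closed form exp( int (e^{itx} - 1) dmu(x) )."""
    return cmath.exp(sum(m * (cmath.exp(1j * t * x) - 1) for x, m in atoms))

# Hypothetical measure: mass 0.5 at x = 1 and mass 1.5 at x = -2.
atoms = [(1.0, 0.5), (-2.0, 1.5)]
print(abs(pois_cf_series(atoms, 0.7) - pois_cf_closed(atoms, 0.7)))
```

The closed form is the Lévy-Khintchine exponent without Gaussian part and without the compensating term $itx\,I\{|x|\le\tau\}$, which is not needed when $\mu$ is finite.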
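Theorem 3.9. (CLT on the real line) Let $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$ be an infinitesimal array of row-wise independent real random variables and $\{a_n\}$ a sequence of constants. Then, $S_n - a_n$ converges weakly iff
(i) There exists a Lévy measure $\mu$ such that
$$\sum_{1\le k\le k_n} P(X_{n,k} > \delta) \to \mu(\delta, \infty), \qquad \sum_{1\le k\le k_n} P(X_{n,k} < -\delta) \to \mu(-\infty, -\delta)$$
for every $\delta > 0$ such that $\mu\{-\delta, \delta\} = 0$.
(ii) There exists $\sigma \ge 0$ such that
$$\lim_{\delta\downarrow 0} \limsup_{n} \sum_{1\le k\le k_n} \mathrm{Var}(X_{n,k,\delta}) = \lim_{\delta\downarrow 0} \liminf_{n} \sum_{1\le k\le k_n} \mathrm{Var}(X_{n,k,\delta}) = \sigma^2.$$
(iii) If $\delta > 0$ satisfies $\mu\{-\delta, \delta\} = 0$ then
$$E(S_{n,\delta}) - a_n \to a_\delta.$$
Then, if this is the case,
$$S_n - a_n \to_w N(a_\delta, \sigma^2) * c_\delta \mathrm{Pois}\,\mu.$$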
If we restrict our interest to triangular arrays coming from i.i.d. sequences, that is, those arrays with $k_n = n$ and $X_{n,k} = X_k / a_n$ for some i.i.d. r.v.'s $\{X_n\}_n$ and a sequence of normalizing constants $\{a_n\}_n$, then the class of limiting distributions reduces to the stable laws. A law $\gamma$ is said to be stable if it is of the same type as any of its $n$th convolution powers, that is, if $X_1, \dots, X_n$ are i.i.d. $\gamma$, then
$$X_1 + \cdots + X_n \overset{d}{=} a_n X_1 + b_n.$$
The only possible constants $a_n$ in the above expression are of type $a_n = n^{1/\alpha}$ for some $\alpha \in (0, 2]$. This $\alpha$ is called the stability index of the stable law $\gamma$. The case $\alpha = 2$ corresponds to the Gaussian law. A stable law with index $\alpha < 2$ is an infinitely divisible law without normal part and Lévy measure $\mu(c_1, c_2; \alpha)$ defined by
$$d\mu(c_1, c_2; \alpha)(x) = \begin{cases} c_1 x^{-1-\alpha}\, dx & \text{for } x > 0, \\ c_2 |x|^{-1-\alpha}\, dx & \text{for } x < 0, \end{cases}$$
for some $c_1, c_2 \ge 0$. Assume $\{X_n\}_n$ is a sequence of i.i.d. r.v.'s having law $\rho$. Then we say that $\rho$ belongs to the domain of attraction of the stable law $\gamma$ if there exist constants $a_n > 0$, $b_n \in \mathbb{R}$ such that
$$\frac{X_1 + \cdots + X_n}{a_n} - b_n \to_w \gamma.$$
Stable laws are the only laws having a nonvoid domain of attraction. Domains of attraction can be characterized in terms of regular variation of the tails and the truncated variances of a law. A function $L$ is regularly varying (at $\infty$) with exponent $\nu \in \mathbb{R}$ if
$$\lim_{t\to\infty} \frac{L(tx)}{L(t)} = x^{\nu}.$$
We say that it is slowly varying if the above exponent equals 0.
Theorem 3.10. (Domains of attraction on the line)
(i) A law $\rho$ belongs to the domain of attraction of the normal law iff the truncated moment
$$U(x) = \int_{-x}^{x} y^2\, d\rho(y)$$
is slowly varying or, equivalently, iff
$$\frac{x^2 \rho((-x, x)^c)}{U(x)} \to 0.$$
(ii) A law $\rho$ belongs to the domain of attraction of the stable law with Lévy measure $\mu(c_1, c_2; \alpha)$, $\alpha < 2$, iff $\rho((-x, x)^c)$ is regularly varying with exponent $-\alpha$ and
$$\frac{\rho(x, \infty)}{\rho((-x, x)^c)} \to \frac{c_1}{c_1 + c_2}, \qquad \frac{\rho(-\infty, -x)}{\rho((-x, x)^c)} \to \frac{c_2}{c_1 + c_2}.$$
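Part (i) of Theorem 3.10 covers laws with infinite variance. As a worked illustration (the law below is an assumption chosen for convenience, not an example from the text), take a symmetric law with $\rho((-x,x)^c) = x^{-2}$ for $x \ge 1$. Then $|X|$ has density $2y^{-3}$ on $[1,\infty)$, the variance is infinite, and $U(x) = \int_1^x y^2 \cdot 2y^{-3}\,dy = 2\log x$ is slowly varying. The sketch evaluates the criterion $x^2\rho((-x,x)^c)/U(x) = 1/(2\log x)$ and the ratio $U(tx)/U(t)$; the function names are hypothetical.

```python
import math

def tail(x):
    """rho((-x, x)^c) = P(|X| > x) for the assumed law: x^-2 for x >= 1."""
    return 1.0 if x < 1 else x ** -2

def truncated_moment(x):
    """U(x) = int_{-x}^{x} y^2 d rho(y) = 2 log x for x >= 1 (0 below 1)."""
    return 0.0 if x < 1 else 2.0 * math.log(x)

def normal_criterion(x):
    """x^2 * rho((-x, x)^c) / U(x); Theorem 3.10 (i) asks for this to tend to 0."""
    return x * x * tail(x) / truncated_moment(x)

for x in (1e2, 1e4, 1e8, 1e16):
    print(x, normal_criterion(x))

# Slow variation of U: the ratio U(t x) / U(t) approaches 1 for fixed x.
print(truncated_moment(2e50) / truncated_moment(1e50))
```

The convergence to 0 is only logarithmic, which is typical for laws on the boundary of the normal domain of attraction.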
We finish this section with some results on the CLT on separable Banach spaces. We consider a triangular array $\{X_{n,k} : n \in \mathbb{N}, 1 \le k \le k_n\}$, where $X_{n,1}, \dots, X_{n,k_n}$ are independent $(B, \|\cdot\|)$-valued r.v.'s, call $S_n = \sum_{k=1}^{k_n} X_{n,k}$ and consider the problem of finding necessary and sufficient conditions for the weak convergence of $S_n - a_n$ and of characterizing the possible limiting distributions. The theory parallels to a great extent that for the real line. There are some important differences, though. As in the real case, we say that the array is infinitesimal if
$$\max_{1\le k\le k_n} P(\|X_{n,k}\| > \varepsilon) \to 0, \quad \text{as } n \to \infty.$$
Under infinitesimality, the class of possible weak limits of $\mathcal{L}(S_n - a_n)$ is, still in this new setup, the class of infinitely divisible laws (the probability measures expressible as $n$th convolution powers for every $n$). Infinitely divisible laws, $\rho$, on $B$ can be characterized as convolutions of Gaussian measures and Poissonizations of Lévy measures: $\rho = \gamma * c_\tau \mathrm{Pois}\,\mu$. The characterization of Lévy measures, though, is not so straightforward as on the real line. A Lévy measure on $B$ is a $\sigma$-finite positive measure, $\mu$, for which there exist $\tau > 0$ and a probability measure $\eta$ having characteristic functional
$$\Phi(f) = \exp\Big\{ \int \big( e^{if(x)} - 1 - if(x)\, I\{\|x\| \le \tau\} \big)\, d\mu(x) \Big\}$$
for all $f \in B'$ (here $B'$ denotes the topological dual of $B$). In this case we define $c_\tau \mathrm{Pois}\,\mu := \eta$. Integrability of $\min(1, \|x\|^2)$ is neither a necessary nor a sufficient condition for a measure to be a Lévy measure on a general separable Banach space. The following result is a general CLT in Banach spaces.

Theorem 3.11. (CLT in separable Banach spaces) Let $\{X_{n,k}\}$ be infinitesimal. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent iff
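(i) There exists a $\sigma$-finite measure $\mu$ with $\mu\{0\} = 0$ such that
$$\sum_j \mathcal{L}(X_{n,j})|_{B_\delta^c} \to_w \mu|_{B_\delta^c}$$
for every $\delta > 0$ such that $\mu(\partial B_\delta) = 0$.
(ii) The limit
$$\phi(f) = \lim_{\delta\downarrow 0} \Big\{ \begin{matrix} \limsup_n \\ \liminf_n \end{matrix} \Big\} \sum_{1\le k\le k_n} \mathrm{Var}(f(X_{n,k,\delta}))$$
exists and is finite for every $f \in B'$ (at least for $f \in W \subset B'$ weak-star total).
(iii) There exists a (for all) sequence $\{F_k\}$ of finite-dimensional subspaces with $B = \overline{\bigcup_k F_k}$ and $\beta > 0$, $p > 0$ (for all $\beta > 0$, $p > 0$) such that
$$\lim_k \sup_n E\, d^p(S_{n,\beta} - ES_{n,\beta}, F_k) = 0.$$
Then
(a) $\mu$ is a Lévy measure and there exists a centered Gaussian p.m. $\gamma$ such that $\phi(f) = \int f^2\, d\gamma$, $f \in B'$.
(b) For every $\delta > 0$ such that $\mu(\partial B_\delta) = 0$,
$$\mathcal{L}(S_n - ES_{n,\delta}) \to_w \gamma * c_\delta \mathrm{Pois}\,\mu.$$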
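In the particular case of a separable Hilbert space $(H, \langle\cdot,\cdot\rangle)$ the CLT can be slightly simplified. In Hilbert space Lévy measures are positive Borel measures integrating $\min(1, \|x\|^2)$. Condition (iii) in Theorem 3.11 can be replaced by
(iii′) There exists a c.o.n.s. $\{\varphi_i\}_{i\ge 1}$ in $H$ such that, for some (all) $\varepsilon > 0$,
$$\lim_{k\to\infty} \limsup_{n\to\infty} \sum_{j=1}^{k_n} \Big[ E\|X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}\|^2 - \sum_{i=1}^{k} E\langle X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}, \varphi_i\rangle^2 \Big] = 0.$$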
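Thus, convergence of sums to a Gaussian limit can be characterised with the following theorem:

Theorem 3.12. (CLT in separable Hilbert spaces, normal case) Let $\{X_{n,k}\}$ be an infinitesimal $H$-valued array. Then $\mathcal{L}(S_n - a_n)$ is weakly convergent to a Gaussian limit iff
(i) For every $\varepsilon > 0$
$$\sum_j P(\|X_{n,j}\| > \varepsilon) \to 0.$$
(ii) The limit
$$\phi(h) = \lim_{\delta\downarrow 0} \Big\{ \begin{matrix} \limsup_n \\ \liminf_n \end{matrix} \Big\} \sum_{1\le k\le k_n} \mathrm{Var}(\langle X_{n,k,\delta}, h\rangle)$$
exists and is finite for every $h \in H$.
(iii) There exists a c.o.n.s. $\{\varphi_i\}_{i\ge 1}$ in $H$ such that, for some (all) $\varepsilon > 0$,
$$\lim_{k\to\infty} \limsup_{n\to\infty} \sum_{j=1}^{k_n} \Big[ E\|X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}\|^2 - \sum_{i=1}^{k} E\langle X_{n,j,\varepsilon} - EX_{n,j,\varepsilon}, \varphi_i\rangle^2 \Big] = 0.$$
Then there exists a centered Gaussian p.m. $\gamma$ such that $\phi(h) = \int \langle x, h\rangle^2\, d\gamma(x)$, $h \in H$, and
$$\mathcal{L}(S_n - ES_{n,\delta}) \to_w \gamma$$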
for every $\delta > 0$.

The case of sequences of i.i.d. $H$-valued random variables also becomes simpler in the Hilbert space setup. We conclude the subsection with the Lindeberg-Lévy Theorem in Hilbert space. It is a very easy consequence of Theorem 3.12.

Theorem 3.13. Let $\{X_n\}$ be a sequence of centered i.i.d. r.v.'s. Then
$$\frac{X_1 + \cdots + X_n}{\sqrt{n}}$$
converges weakly iff $E\|X_1\|^2 < \infty$. In that case the limit law is centered Gaussian with covariance $\phi(h) = E\langle X_1, h\rangle^2$, $h \in H$.

3.3. Strong approximations

As we said before, a different approach to proving limit theorems for the empirical process on the line can be based on finding suitable versions of the empirical process, $\alpha_n(t)$, and Brownian bridges $B_n(t)$ such that
$$\sup_{0\le t\le 1} |\alpha_n(t) - B_n(t)| \to 0, \qquad (3.4)$$
almost surely or in probability. The study of results of type (3.4), generically known as strong approximations, began with the Skorohod embedding, consisting of imitating the partial sum process by using a Brownian motion evaluated at random times (see [8]). Successive refinements of this idea became one of the most important methodologies in the research related to empirical processes. A major breakthrough in this line was achieved by the so-called Hungarian construction due to Komlós, Major and Tusnády. We summarize the main results in this section.

We assume that $\{X_n\}_n$ are i.i.d. centered r.v.'s with unit variance and finite moment generating function in a neighborhood of the origin; we call $F$ their distribution function and set $S_k = X_1 + \cdots + X_k$. We also assume that $\{Y_n\}_n$ are i.i.d. standard normal and set $T_k = Y_1 + \cdots + Y_k$.

Theorem 3.14. We can define, on the probability space on which $\{Y_n\}_n$ are defined, a sequence $\{X_n\}_n$ of i.i.d. r.v.'s with d.f. $F$ such that
$$P\Big( \max_{1\le k\le n} |S_k - T_k| > C \log n + x \Big) \le K e^{-\lambda x}, \quad x > 0,\ n \ge 1,$$
where $C, K, \lambda$ are constants depending only on $F$.
Theorem 3.14 is a deep result with a long, difficult proof. It has, though, many important consequences for important processes in Statistics. A first, direct result in this line can be obtained for the partial sum process:
$$S^{(n)}(t) := \frac{1}{\sqrt{n}} \sum_{k=1}^{[nt]} X_k, \quad 0 \le t \le 1.$$
We will assume that $\{W(t)\}_{t\ge 0}$ is Brownian motion and $W^{(n)}(t) = \frac{1}{\sqrt{n}} W(nt)$ (observe that $W^{(n)}(t)$ is itself a Brownian motion).

Theorem 3.15. We can define, on a sufficiently rich probability space, versions of $\{X_n\}_n$ and $\{W(t)\}_{t\ge 0}$ such that
$$P\Big( \sqrt{n} \sup_{0\le t\le 1} |S^{(n)}(t) - W^{(n)}(t)| > C \log n + x \Big) \le K e^{-\lambda x},$$
where $C, K, \lambda$ are positive constants not depending on $n$.

Proof. It follows easily from Theorem 3.14 and the reflection principle for Brownian motion.

Csörgő and Révész [21] used this result to give a strong approximation for the uniform quantile process:
$$u_n(t) := \sqrt{n}\, (G_n^{-1}(t) - t), \quad 0 < t < 1,$$
where $G_n(t) = \frac{1}{n} \sum_{i=1}^n I(U_i \le t)$, $0 < t < 1$, and $\{U_n\}_n$ are i.i.d. uniform r.v.'s on $(0, 1)$. In fact, the well-known distributional equality
$$(U_{(1)}, \dots, U_{(n)}) \overset{d}{=} \Big( \frac{S_1}{S_{n+1}}, \dots, \frac{S_n}{S_{n+1}} \Big),$$
with $S_k = \xi_1 + \cdots + \xi_k$ and $\{\xi_k\}_k$ i.i.d. exponentials with mean 1, allows us to assume
…
Using Theorem 3.15 and some other standard techniques we can give exponential bounds for all terms on the right-hand side of this last inequality to conclude (see [21] for details):

Theorem 3.16. We can define, on a sufficiently rich probability space, versions of $\{U_n\}_n$ and $\{W(t)\}_{t\ge 0}$ such that for every $n \ge 1$ and $|x| \le c\sqrt{n}$,
\[ P\Big( \sqrt{n} \sup_{0\le t\le 1} |u_n(t) - B_n(t)| > C \log n + x \Big) \le K e^{-\lambda x}. \]

Komlós, Major and Tusnády [55, 56] also gave a construction, similar to the one in Theorem 3.14, for the uniform empirical process:

Theorem 3.17. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[ P\Big( \sup_{0\le t\le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log n) \Big) \le K \exp(-\lambda x) \tag{3.5} \]
for some absolute positive constants $C, K, \lambda$. The best constants in (3.5) are $C = 12$, $K = 2$, $\lambda = 1/6$, see [9].

While the above Theorems give the best possible rates of approximation of the empirical or the quantile process by Brownian bridges, there was still room for some improvement through a careful study of the approximation at the tails. The following results come from [17]. They have important applications in the study of weighted norms of empirical and quantile processes.

Theorem 3.18. We can define, on a sufficiently rich probability space, a sequence $\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[ P\Big( \sup_{0\le t\le d/n} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x) \tag{3.6} \]
and
\[ P\Big( \sup_{1-d/n\le t\le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x) \tag{3.7} \]
for some absolute positive constants $C, K, \lambda$.

Theorem 3.19. We can define, on a sufficiently rich probability space, a sequence $\{U_n\}_n$ and Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[ P\Big( \sup_{0\le t\le d/n} |u_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x) \tag{3.8} \]
and
\[ P\Big( \sup_{1-d/n\le t\le 1} |u_n(t) - B_n(t)| > n^{-1/2}(x + C \log d) \Big) \le K \exp(-\lambda x) \tag{3.9} \]
for some absolute positive constants $C, K, \lambda$.

3.3.1. Weighted approximations of empirical and quantile processes.

We will see here how the improved construction in [17] can be used to give weighted approximations of empirical and quantile processes.

Theorem 3.20. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[ \sup_{0\le t\le 1} |\alpha_n(t) - B_n(t)| = O(n^{-1/2} \log n) \quad \text{a.s.} \tag{3.10} \]
and
\[ n^{1/2-\nu} \sup_{\lambda/n \le t \le 1-\lambda/n} \frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1) \tag{3.11} \]
for all $0 < \nu \le 1/2$ and $0 < \lambda < \infty$.

Proof. We use the construction in Theorem 3.18. (3.10) follows from (3.6) (taking $d = n$ and $x = (2/\lambda) \log n$) and the Borel-Cantelli lemma:
\[ P\Big( \sup_{0\le t\le 1} |\alpha_n(t) - B_n(t)| > n^{-1/2} \log n \,(2/\lambda + C) \Big) \le \frac{K}{n^2}. \]
To show (3.11), take first $\lambda = 1$ and $0 < \nu \le 1/2$ and call
\[ \Delta^{(1)}_{n,\nu} := n^{1/2-\nu} \sup_{1/n \le t \le 1} \frac{|\alpha_n(t) - B_n(t)|}{t^{\nu}}, \qquad \Delta^{(2)}_{n,\nu} := n^{1/2-\nu} \sup_{0 \le t \le 1-1/n} \frac{|\alpha_n(t) - B_n(t)|}{(1-t)^{\nu}}. \]
It is enough to prove that $\Delta^{(i)}_{n,\nu} = O_P(1)$, $i = 1, 2$. By symmetry it reduces to showing that $\Delta^{(1)}_{n,\nu} = O_P(1)$. Now, for $e < d < \infty$ let $d_i = id$, $i = 1, 2, \ldots, i_n - 1$ and
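$d_{i_n} = n$, where $i_n = \max\{i : d_{i-1} \le n\}$. Set $I_1 = [1/n, d_1/n]$, $I_i = [d_{i-1}/n, d_i/n]$, $i = 2, \ldots, i_n$, and $\delta_{n,i} = \sup\{|\alpha_n(t) - B_n(t)| : t \in [0, d_i/n]\}$, $i = 1, \ldots, i_n$. Now
\[ P\big(\Delta^{(1)}_{n,\nu} > (C+1) \log d\big) \le P\big(\delta_{n,1} > n^{-1/2}(C+1) \log d\big) + \sum_{i=2}^{i_n} P\big(\delta_{n,i} > n^{-1/2}(C+1) d_{i-1}^{\nu}\big) =: \sum_{i=1}^{i_n} P_{i,n}(d). \]
Hence, using (3.6) and the fact that $d_{i-1}^{\nu} \ge \log d_i$ for $d$ large enough, we have, for large $d$, $P_{1,n}(d) \le K \exp(-\lambda \log d) = K d^{-\lambda}$ and $P_{i,n}(d) \le K \exp(-\lambda d_{i-1}^{\nu}) \le K ((i-1)d)^{-2}$. Thus
\[ \sum_{i=1}^{i_n} P_{i,n}(d) \le K \frac{1}{d^{\lambda}} + K \frac{1}{d^2} \sum_{i=1}^{\infty} \frac{1}{i^2}, \]
which can be made arbitrarily small by taking $d$ large enough. This shows $\Delta^{(1)}_{n,\nu} = O_P(1)$ for $\lambda = 1$. To prove this for general $\lambda$ it suffices to cover the range $[\lambda/n, 1/n]$ for fixed $0 < \lambda < 1$, but
\[ n^{1/2-\nu} \sup_{\lambda/n \le t \le 1/n} \frac{|\alpha_n(t)|}{t^{\nu}} \le \lambda^{-\nu} \big(n G_n(1/n) + 1\big) = O_P(1), \]
since $n G_n(1/n)$ converges weakly to a Poisson r.v., and
\[ n^{1/2-\nu} \sup_{\lambda/n \le t \le 1/n} \frac{|B_n(t)|}{t^{\nu}} \le \lambda^{-\nu} n^{1/2} \sup_{0 \le t \le 1/n} |B_n(t)| \overset{d}{=} \lambda^{-\nu} \sup_{0\le t\le 1} \Big| W(t) - \frac{t}{n} W(n) \Big| = O_P(1). \]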
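The Poisson limit for $n G_n(1/n)$ used at the end of the proof can be checked numerically. The sketch below is only an illustration: it relies on the fact that $n G_n(1/n)$ counts the uniforms falling in $[0, 1/n]$ and is therefore Binomial$(n, 1/n)$, and it compares that law with the Poisson law of mean 1 in total variation. The function names and the truncation level `kmax` are assumptions made for this example.

```python
import math

def binom_pmf(n, p, k):
    # P(Bin(n, p) = k); zero outside the support.
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

def poisson_pmf(mu, k):
    # P(Poisson(mu) = k).
    return math.exp(-mu) * mu**k / math.factorial(k)

def tv_binomial_vs_poisson(n, kmax=40):
    # Total variation distance between Bin(n, 1/n) and Poisson(1),
    # truncated at kmax (the neglected tail mass is negligible for mean 1).
    p = 1.0 / n
    diff = sum(abs(binom_pmf(n, p, k) - poisson_pmf(1.0, k)) for k in range(kmax + 1))
    return 0.5 * diff

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, tv_binomial_vs_poisson(n))
```

The distances shrink roughly like a constant times $1/n$, in line with the classical bound $n p^2 = 1/n$ for the Poisson approximation of a binomial law.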
Corollary 3.21. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges $\{B_n(t) : 0 \le t \le 1\}$ such that
\[ n^{1/2-\nu} \sup_{U_{1:n} \le t \le U_{n:n}} \frac{|\alpha_n(t) - B_n(t)|}{(t(1-t))^{\nu}} = O_P(1) \tag{3.12} \]
for all 0 < ν ≤ 1/2. Corollary 3.22. We can define, on a sufficiently rich probability space, a sequence of Brownian bridges {Bn (t) : 0 ≤ t ≤ 1} such that |αn (t) − Bn (t)| = OP (1) (t(1 − t))ν 0 0, q nondecreasing in a neighbourhood of 0 1/2 E(q, c) = s−3/2 q(s) exp(−cq 2 (s)/s)ds
F C0 :=
0
1/2
I(q, c) = 0
1 exp(−cq 2 (s)/s)ds s
Lemma 3.26. Assume $q \in FC_0$.
(i) If $I(q, c) < \infty$ then $E(q, d) < \infty$ for any $d > c$ and $q(t)/t^{1/2} \to \infty$ as $t \to 0$.
(ii) If $E(q, c) < \infty$ and $q(t)/t^{1/2} \to \infty$ as $t \to 0$ then $I(q, c) < \infty$.

Proof. (ii) Call $C = \inf_{0 < t \le 1/2} q(t)/t^{1/2} > 0$. Then $E(q, c) \ge C\, I(q, c)$.

(i) If $I(q, c_1) < \infty$ then $I(q, c_2) < \infty$ for any $c_2 > c_1$. Hence we can assume $c > 1$. Fix $\theta$ small enough so that $q$ is non-decreasing on $(0, \theta]$. If $t \in (0, \theta/c]$ then
\[ \int_t^{ct} \frac{1}{s} \exp(-c q^2(s)/s)\,ds \ge \int_t^{ct} \frac{1}{s} \exp(-c q^2(s)/t)\,ds \ge \int_t^{ct} \frac{1}{s} \exp(-c q^2(ct)/t)\,ds = (\log c) \exp\big(-c^2 q^2(ct)/(ct)\big). \]
Since the left-hand side tends to 0 as $t \to 0$, we conclude that $q(t)/t^{1/2} \to \infty$. We also have, for $d - c > 0$ and small enough $\eta$, $q(t)/t^{1/2} \le \exp((d-c) q^2(t)/t)$ for $t \in (0, \eta]$. Consequently,
\[ \int_0^{\eta} \frac{1}{t} \big(q(t)/t^{1/2}\big) \exp(-d q^2(t)/t)\,dt \le \int_0^{\eta} \frac{1}{t} \exp(-c q^2(t)/t)\,dt. \]
This completes the proof.

Lemma 3.27. If $q \in FC_0$ and $\int_0^{1/2} 1/q^2(t)\,dt < \infty$ then $I(q, c) < \infty$ for all $c > 0$.

Proof. For all $x, c > 0$ we have that $x \exp(-cx) \le 1/(ec)$. Therefore,
\[ \int_0^{1/2} \frac{1}{t} \exp(-c q^2(t)/t)\,dt = \int_0^{1/2} \frac{1}{q^2(t)} \frac{q^2(t)}{t} \exp(-c q^2(t)/t)\,dt \le \frac{1}{ec} \int_0^{1/2} \frac{1}{q^2(t)}\,dt. \]
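The two ingredients of the proof of Lemma 3.27 can be sketched numerically. The code below is an illustration under stated assumptions: the weight $q(t) = t^{1/4}$, the grid sizes and the function names are choices made for this example only and do not come from the text. It checks the elementary bound $x e^{-cx} \le 1/(ec)$ on a grid and compares a midpoint-rule value of $I(q, c)$ with the bound $(ec)^{-1} \int_0^{1/2} q^{-2}(t)\,dt$.

```python
import math

def sup_x_exp(c, x_max=50.0, grid=100000):
    # Grid approximation of sup_{0 < x <= x_max} x * exp(-c * x).
    return max((x_max * i / grid) * math.exp(-c * x_max * i / grid)
               for i in range(1, grid + 1))

def midpoint(f, a, b, m=200000):
    # Plain midpoint rule on [a, b] with m cells.
    h = (b - a) / m
    return h * sum(f(a + (i + 0.5) * h) for i in range(m))

def I_example(c):
    # I(q, c) for the illustrative weight q(t) = t**0.25, so q(t)**2 / t = t**(-0.5).
    return midpoint(lambda s: math.exp(-c / math.sqrt(s)) / s, 0.0, 0.5)

def bound_example(c):
    # (1 / (e c)) * int_0^{1/2} t**(-1/2) dt = (1 / (e c)) * sqrt(2).
    return math.sqrt(2.0) / (math.e * c)

if __name__ == "__main__":
    for c in (0.5, 1.0, 3.0):
        print(c, sup_x_exp(c), 1.0 / (math.e * c), I_example(c), bound_example(c))
```

The supremum is attained at $x = 1/c$, which is why the grid value sits just below $1/(ec)$.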
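Theorem 3.28. Assume $q \in FC_0$. Then $\limsup_{t\to 0} |W(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$ iff
\[ I(q, c) \begin{cases} < \infty & \text{for any } c > \beta^2/2, \\ = \infty & \text{for any } c < \beta^2/2. \end{cases} \]

Proof. We assume that $I(q, c) < \infty$ and show that $\limsup_{t\to 0} |W(t)|/q(t) \le (2c)^{1/2}$. Assume $f$ is an increasing function on $(0, b]$ and let $0 < a = t_0 < t_1 < \cdots < t_n = b \le 1$. Then
\[ P\big(W(t) > f(t) \text{ for some } t \in [a, b]\big) \le P\big(T_{f(a)} \le a\big) + \sum_{m=1}^{n} P\big(t_{m-1} < T_{f(t_{m-1})} \le t_m\big) \]
\[ \le \int_0^a (2\pi t^3)^{-1/2} f(a) \exp\big(-f^2(a)/(2t)\big)\,dt + \sum_{m=1}^{n} \int_{t_{m-1}}^{t_m} (2\pi t^3)^{-1/2} f(t_{m-1}) \exp\Big(-\frac{f^2(t_{m-1})}{2t}\Big)\,dt. \]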
We fix now some $\lambda > 1$, take $a = b/\lambda^n$ and set $t_m = (1/\lambda)^{n-m} b$. We have now that, for $t \in [t_{m-1}, t_m]$,
\[ t^{-1/2} f(t_{m-1}) \exp\big(-f^2(t_{m-1})/(2t)\big) \le f(t_{m-1})\, t_{m-1}^{-1/2} \exp\big(-f^2(t_{m-1})/(2\lambda t_{m-1})\big). \]
Take $\rho > \lambda$; then $x \exp(-x^2/(2\lambda)) \le \exp(-x^2/(2\rho))$ for large $x$. If we define $f(t) = (2\rho^2 c)^{1/2} q(t)$ then, for small $b$ (recall that $q(t)/t^{1/2} \to \infty$ as $t \to 0$), we get that, for $t \in [t_{m-1}, t_m]$,
\[ t^{-1/2} f(t_{m-1}) \exp\big(-f^2(t_{m-1})/(2t)\big) \le \exp\big(-f^2(t_{m-1})/(2\rho t_{m-1})\big) \le \exp\big(-f^2(t_{m-1})/(2\rho t)\big) \le \exp\big(-f^2(t/\lambda)/(2\rho t)\big) \le \exp\big(-f^2(t/\rho)/(2\rho t)\big). \]
Taking $n \to \infty$ we obtain that
\[ P\big(W(t) > (2\rho^2 c)^{1/2} q(t) \text{ for some } t \in (0, b]\big) \le \frac{1}{(2\pi)^{1/2}} \int_0^b \frac{1}{t} \exp(-c q^2(t)/t)\,dt. \]
Hence (letting $b \to 0$),
\[ \limsup_{t\to 0} |W(t)|/q(t) \le (2\rho^2 c)^{1/2} \quad \text{a.s.} \]
Since $\rho$ can be taken arbitrarily close to 1, this proves our claim. It remains to be shown that, if $\limsup_{t\to 0} |W(t)|/q(t) = \beta$ for some $0 \le \beta < \infty$, then $I(q, c) < \infty$ for any $c > \beta^2/2$. The proof can be found in [19], pp. 181–188.

We summarize the consequences of the above results for the finiteness of
B/q ∞ in the following corollary. Now F C0,1 is the class of functions nondecreasing on a neighborhood of 0, nonincreasing in a neighborhood of 1 and such that inf q(x) > 0, δ ∈ (0, 1/2). δ<x M and KnM , φk ⊗ φl = 0 otherwise. Combining this with (6.43) we obtain that
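\[ \|K_n^M - K^M\|_2^2 = \sum_{k>M} \langle K_n - K, \phi_k \otimes \phi_k \rangle^2 + 2 \sum_{k \ne l;\, k,l>M} \langle K_n - K, \phi_k \otimes \phi_l \rangle^2 \le \sum_{k} \langle K_n - K, \phi_k \otimes \phi_k \rangle^2 + 2 \sum_{k \ne l} \langle K_n - K, \phi_k \otimes \phi_l \rangle^2 = \|K_n - K\|_2^2. \]
The last inequality implies that $K_n^M \to_{L_2} K^M$ and also that $\|K_n^M\|_2 \to \|K^M\|_2$. Now this convergence, combined with the fact that $\mathrm{Var}(\|Y_n^M - EY_n^M\|_2^2) \le 8 \|K_n^M\|_2^2$, proves claim (6.42). Now, the proposition follows from (6.39), (6.40), (6.41) and (6.42) through a standard $3\varepsilon$ argument.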
We should remark that this proposition also holds true if we replace the sequence of exponential random variables by an i.i.d. sequence of square integrable random variables, with only formal changes in the proof. Both Theorem 6.8 and Proposition 6.11 are practically exercises on the central limit theorem in Hilbert space. However, Proposition 6.11 can be seen as a limit theorem for quadratic forms, and this subject has a long history, reviewed e.g. in Guttorp and Lockhart (1988). Theorem 1 in de Wet and Venter (1973) and Theorem 5 in Guttorp and Lockhart (1988) could seemingly apply to give Proposition 6.11; however, the conditions in either theorem are quite difficult to verify and we could not check them in our case, whereas the conditions in Proposition 3.5 are very easy to decide in general.

We conclude this subsection giving sufficient conditions for convergence in law of $\{\|Y_n\|_2^2 - E\|Y_n\|_2^2\}$. The result is not directly applicable to Wasserstein distances, but the changes needed for that are straightforward and omitted here. A warning on notation: we write $\langle Y - EY, h \rangle = \sum_k \sqrt{\lambda_k} \langle \phi_k, h \rangle Z_k$ for $\phi_k$ orthonormal, $h \in L_2(0,1)$ and $\sum_k \lambda_k^2 < \infty$, although $Y - EY$ may not make sense.

Theorem 6.12. If $\max_i \|c_{n,i}\|_2 \to 0$, $K_n \to_{L_2} K$ and $m_n \to_{L_2} m$, then
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \|Y - EY\|_2^2 - E\|Y - EY\|_2^2 + 2\langle Y - EY, m \rangle := \sum_{k=1}^{\infty} \lambda_k (Z_k^2 - 1) + 2 \sum_{k=1}^{\infty} \sqrt{\lambda_k} \langle m, \phi_k \rangle Z_k, \]
where $\|Y - EY\|_2^2 - E\|Y - EY\|_2^2$ is defined as in (6.37) and $\{Z_k\}$ is an ortho-Gaussian sequence.

Proof. Formally we require the proof of Proposition 6.11 rather than its statement. First we note
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 = \big( \|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \big) + 2 \langle Y_n - EY_n, EY_n \rangle. \tag{6.44} \]
As in the previous proof,
\[ E \langle Y_n - EY_n, EY_n \rangle \langle Y_n - EY_n, \phi_k \rangle = \langle K_n, m_n \otimes \phi_k \rangle \to \langle K, m \otimes \phi_k \rangle = \lambda_k \langle m, \phi_k \rangle. \]
This implies that for each $M$ we have convergence in law of the vector
\[ \big( \langle Y_n - EY_n, \phi_1 \rangle, \ldots, \langle Y_n - EY_n, \phi_M \rangle, \langle Y_n - EY_n, EY_n \rangle \big) \]
to the Gaussian vector
\[ \Big( \sqrt{\lambda_1} Z_1, \ldots, \sqrt{\lambda_M} Z_M, \sum_{k=1}^{\infty} \sqrt{\lambda_k} \langle m, \phi_k \rangle Z_k \Big). \]
This gives weak convergence, for every $M < \infty$, of the random variables
\[ \sum_{k=1}^{M} \big( \langle Y_n - EY_n, \phi_k \rangle^2 - E \langle Y_n - EY_n, \phi_k \rangle^2 \big) + 2 \langle Y_n - EY_n, EY_n \rangle, \]
in analogy with (6.39). By (6.44) these random variables are 'finite-dimensional' approximations of the sequence of interest, and the result now follows by the approximation argument in the previous proof, as a consequence of the limit (6.42).

The hypotheses in this theorem are very natural. We will not deal with the question of whether they are necessary (given infinitesimality); note, however, that the existence of $K$ and $m$ is necessary in order to define the limit.

c) Shift convergence of $\|Y_n\|_2^2$, II. There are some situations in which $K_n$ is not convergent in $L_2$ but, nevertheless, $\{\|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2\}$ is weakly convergent. From the definitions we see that
\[ \|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 = \sum_{1 \le i \ne j \le n} c_{n,i,j} (\xi_i - 1)(\xi_j - 1) + \sum_{i=1}^{n} c_{n,i,i} \big( (\xi_i - 1)^2 - 1 \big) \]
\[ = \sum_{i=1}^{n} \Big[ 2 \Big( \sum_{j=1}^{i-1} c_{n,i,j} (\xi_j - 1) \Big) (\xi_i - 1) + c_{n,i,i} \big( (\xi_i - 1)^2 - 1 \big) \Big] = \sum_{i=1}^{n} x_{n,i}, \]
where
\[ x_{n,i} = 2 \Big( \sum_{j=1}^{i-1} c_{n,i,j} (\xi_j - 1) \Big) (\xi_i - 1) + c_{n,i,i} \big( (\xi_i - 1)^2 - 1 \big) \]
(and we use the convention that $\sum_{j=1}^{0} a_j = 0$). If $\{\xi_i'\}$ denotes an independent copy of the sequence $\{\xi_i\}$ and we set
\[ \tilde{x}_{n,i} = 2 \Big( \sum_{j=1}^{i-1} c_{n,i,j} (\xi_j - 1) \Big) (\xi_i' - 1) + c_{n,i,i} \big( (\xi_i' - 1)^2 - 1 \big) \]
for $i = 1, \ldots, n$ and $F_{n,i} = \sigma(\xi_1, \ldots, \xi_i)$, then, for each $n \in \mathbb{N}$, $\{x_{n,i}\}$ and $\{\tilde{x}_{n,i}\}$ are tangent sequences with respect to $\{F_{n,i}\}$, that is, $L(x_{n,i} | F_{n,i-1}) = L(\tilde{x}_{n,i} | F_{n,i-1})$, and the random variables $\tilde{x}_{n,i}$ are conditionally independent given the sequence $\{\xi_i\}$. Hence, $\{\tilde{x}_{n,i}\}$ is a decoupled tangent sequence to $\{x_{n,i}\}$ (see, e.g., de la Peña and Giné (1999), Chapter 6). Decoupling introduces enough independence among the summands in $\sum_{i=1}^{n} \tilde{x}_{n,i}$ to enable us to use the CLT in order to obtain their asymptotic distribution. The principle of conditioning (Theorem 1.1 in Jakubowski (1986), reproduced in de la Peña and Giné (1999), Theorem 7.1.4) can then be used to conclude convergence in law of $\sum_{i=1}^{n} x_{n,i}$ itself. The proof of our next result follows this approach.

Theorem 6.13. Let $Z$ be a standard normal random variable. If
\[ \max_i \|c_{n,i}\|_2 \to 0, \tag{6.45} \]
\[ 2 \|K_n\|_2^2 + 6 \sum_{i=1}^{n} \|c_{n,i}\|_2^4 \to \sigma^2, \tag{6.46} \]
and
\[ \sum_{j \ne k} \Big( \sum_{i: i > j \vee k} \langle c_{n,i}, c_{n,j} \rangle \langle c_{n,i}, c_{n,k} \rangle \Big)^2 + \sum_{j} \Big( \sum_{i: i > j} \langle c_{n,i}, c_{n,j} \rangle \langle c_{n,i}, c_{n,i} + c_{n,j} \rangle \Big)^2 \to 0, \tag{6.47} \]
then
\[ \|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 \to_d \sigma Z. \tag{6.48} \]
If, instead of conditions (6.45), (6.46) and (6.47), we have
\[ \max_i \big( \|c_{n,i}\|_2^2 + |\langle c_{n,i}, m_n \rangle| \big) \to 0, \tag{6.49} \]
\[ 2 \|K_n\|_2^2 + 2 \sum_{i=1}^{n} \|c_{n,i}\|_2^4 + 4 \sum_{i} \langle c_{n,i}, c_{n,i} + m_n \rangle^2 \to \sigma^2, \tag{6.50} \]
and
\[ \sum_{j,k} \Big( \sum_{i: i > j \vee k} \langle c_{n,i}, c_{n,j} \rangle \langle c_{n,i}, c_{n,k} \rangle \Big)^2 + \sum_{j} \Big( \sum_{i: i > j} \langle c_{n,i}, c_{n,j} \rangle \langle c_{n,i}, c_{n,i} + m_n + c_{n,j} \rangle \Big)^2 \to 0, \tag{6.51} \]
then
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z. \tag{6.52} \]
Proof. We first prove the limit in (6.48). If we set $\tilde{U}_n = \sum_{i=1}^{n} \tilde{x}_{n,i}$, with $\tilde{x}_{n,i}$ defined as above, the principle of conditioning (Jakubowski (1986)) reduces the proof to showing that
\[ L(\tilde{U}_n | \{\xi_i\}) \to_w N(0, \sigma^2) \]
in probability. Arguing as in the proof of Lemma 6.10, we can see that this is equivalent to proving that
\[ A_n := \max_i E(\tilde{x}_{n,i}^2 | \{\xi_j\}) = 4 \max_i \Big[ c_{i,i}^2 + \Big( c_{i,i} + \sum_{j=1}^{i-1} c_{i,j} (\xi_j - 1) \Big)^2 \Big] \to_{Pr} 0 \tag{6.53} \]
and
\[ B_n := \sum_i E(\tilde{x}_{n,i}^2 | \{\xi_j\}) = 4 \sum_i c_{i,i}^2 + 4 \sum_i \Big( c_{i,i} + \sum_{j=1}^{i-1} c_{i,j} (\xi_j - 1) \Big)^2 \to_{Pr} \sigma^2. \tag{6.54} \]
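The closed form of the conditional second moment in (6.53) and (6.54) rests only on the first four central moments of a mean-one exponential variable. The following sketch is a sanity check, not part of the proof: with $\eta = \xi' - 1$, $a$ standing for the frozen sum $\sum_{j<i} c_{i,j}(\xi_j - 1)$ and $c$ for $c_{i,i}$, it verifies $E(2a\eta + c(\eta^2 - 1))^2 = 4c^2 + 4(c + a)^2$ in exact integer arithmetic. The function names are assumptions made for this illustration.

```python
from math import comb, factorial

def centered_exp_moment(k):
    # E(xi - 1)**k for xi exponential with mean 1, using E xi**j = j!.
    return sum(comb(k, j) * factorial(j) * (-1) ** (k - j) for j in range(k + 1))

def conditional_second_moment(a, c):
    # E[(2*a*eta + c*(eta**2 - 1))**2], expanded in the central moments of eta.
    m1, m2, m3, m4 = (centered_exp_moment(k) for k in (1, 2, 3, 4))
    return 4 * a * a * m2 + 4 * a * c * (m3 - m1) + c * c * (m4 - 2 * m2 + 1)

def closed_form(a, c):
    # Right-hand side appearing in (6.53)-(6.54): 4*c**2 + 4*(c + a)**2.
    return 4 * c * c + 4 * (c + a) ** 2

if __name__ == "__main__":
    print([centered_exp_moment(k) for k in range(1, 5)])
    print(conditional_second_moment(3, 2), closed_form(3, 2))
```

The central moments computed here are 0, 1, 2 and 9, and these are what produce the constants in the displayed formulas.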
After a straightforward but cumbersome computation that we omit, we can see that
\[ E B_n = 2 \|K_n\|_2^2 + 6 \sum_{i=1}^{n} \|c_{n,i}\|_2^4 \]
and
\[ \mathrm{Var}(B_n) = 16 \sum_{j \ne k} \Big( \sum_{i: i > j \vee k} c_{i,j} c_{i,k} \Big)^2 + 4 \sum_{j} \Big( \sum_{i: i > j} c_{i,j} (c_{i,i} + c_{i,j}) \Big)^2, \]
which, by (6.46) and (6.47), immediately give (6.54). We check now (6.53), which is equivalent to $\max_i c_{i,i} \to 0$ and $\max_i |\sum_{j=1}^{i-1} c_{i,j} (\xi_j - 1)| \to_{Pr} 0$. This last convergence follows from (6.47) and the use of a refined Ottaviani's maximal inequality (e.g., Proposition 1.1.2 in de la Peña and Giné (1999)):
\[ P\Big( \max_i \Big| \sum_{j=1}^{i-1} c_{i,j} (\xi_j - 1) \Big| > t \Big) \le 3 \max_i P\Big( \Big| \sum_{j=1}^{i-1} c_{i,j} (\xi_j - 1) \Big| > t/3 \Big) \le \frac{27}{t^2} \max_i \sum_{j=1}^{i-1} c_{i,j}^2 \]
\[ \le \frac{27}{t^2} \max_i \sum_{j=1}^{n} c_{i,j}^2 = \frac{27}{t^2} \max_i \langle c_{n,i} \otimes c_{n,i}, K_n \rangle \le \frac{27}{t^2} \|K_n\|_2 \max_i \|c_{n,i}\|_2^2 \to 0. \]
This concludes the proof of the limit (6.48). We pay attention now to the limit (6.52). The fact that
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 = \sum_{1 \le i \ne j \le n} c_{i,j} (\xi_i - 1)(\xi_j - 1) + \sum_{i=1}^{n} \Big[ c_{i,i} \big( (\xi_i - 1)^2 - 1 \big) + 2 \langle c_{n,i}, m_n \rangle (\xi_i - 1) \Big] = \sum_{i=1}^{n} y_{n,i}, \]
where
\[ y_{n,i} = 2 \Big( \sum_{j=1}^{i-1} c_{n,i,j} (\xi_j - 1) \Big) (\xi_i - 1) + c_{n,i,i} \big( (\xi_i - 1)^2 - 1 \big) + 2 \langle c_{n,i}, m_n \rangle (\xi_i - 1), \]
can be used to conclude (6.52) by reproducing the proof of (6.48) almost verbatim.

The tool for Theorem 6.13, namely the principle of conditioning, which could be easily replaced by the Brown-Eagleson central limit theorem for martingales, has been used before in analogous situations. We will just mention P. Hall (1984), who uses it in density estimation, in order to prove a limit theorem for degenerate
U-statistics with varying kernels. His result is different from ours and does not apply here, but there are similarities in the proofs.

The assumptions in the above theorem are quite tight (for instance, it can be shown that they are necessary for the limits (6.53) and (6.54)). An easier to check set of (stronger) sufficient conditions, more adapted to the quantile process case, can be stated with the following notation. We define
\[ (K_n \circ K_n)(s, t) := \int_0^1 K_n(s, u) K_n(t, u)\,du = \sum_{i,j} \langle c_{n,i}, c_{n,j} \rangle\, c_{n,i} \otimes c_{n,j}\,(s, t). \]
It can be easily checked that
\[ \|K_n \circ K_n\|_2^2 = \sum_{j,k} \Big( \sum_{i} c_{n,i,j} c_{n,i,k} \Big)^2 \]
and also that
\[ \|K_n \circ K_n\|_2^2 = \int_0^1 \int_0^1 \int_0^1 \int_0^1 K_n(s, t) K_n(u, v) K_n(s, u) K_n(t, v)\,ds\,dt\,du\,dv. \]
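The identities for $\|K_n\|_2^2$ and $\|K_n \circ K_n\|_2^2$ in terms of the Gram coefficients $c_{n,i,j} = \langle c_{n,i}, c_{n,j} \rangle$ can be checked exactly on step functions. The sketch below is an illustration under stated assumptions: the functions $c_{n,i}$ are taken piecewise constant on $m$ equal cells of $(0,1)$ with randomly chosen values, so that all integrals reduce to finite sums, and $K_n = \sum_i c_{n,i} \otimes c_{n,i}$. All names and sizes are hypothetical.

```python
import random

def gram(cs, m):
    # Gram matrix G[i][j] = <c_i, c_j> for step functions on m equal cells of (0, 1).
    n = len(cs)
    return [[sum(cs[i][r] * cs[j][r] for r in range(m)) / m for j in range(n)]
            for i in range(n)]

def kernel(cs, m):
    # K(s, t) = sum_i c_i(s) c_i(t), tabulated cell by cell.
    return [[sum(c[r] * c[q] for c in cs) for q in range(m)] for r in range(m)]

def compose(K, m):
    # (K o K)(s, t) = int_0^1 K(s, u) K(t, u) du on the grid of cells.
    return [[sum(K[r][u] * K[q][u] for u in range(m)) / m for q in range(m)]
            for r in range(m)]

def l2_norm_sq(K, m):
    # Squared L2((0,1)^2) norm of a kernel that is constant on each cell pair.
    return sum(K[r][q] ** 2 for r in range(m) for q in range(m)) / m ** 2

def demo(n=4, m=6, seed=0):
    rng = random.Random(seed)
    cs = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(n)]
    G, K = gram(cs, m), kernel(cs, m)
    norm_K = l2_norm_sq(K, m)
    gram_K = sum(G[i][j] ** 2 for i in range(n) for j in range(n))
    norm_KK = l2_norm_sq(compose(K, m), m)
    gram_KK = sum(sum(G[i][j] * G[i][k] for i in range(n)) ** 2
                  for j in range(n) for k in range(n))
    return norm_K, gram_K, norm_KK, gram_KK

if __name__ == "__main__":
    print(demo())
```

In matrix terms, with $G$ the Gram matrix, the two quantities are the squared Frobenius norms of $G$ and of $G^2$, so only the inner products of the $c_{n,i}$ need to be computed.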
The assumptions in the following result are often easier to deal with than (6.51). The proof can be found in [31].

Corollary 6.14. If
\[ \sum_{i=1}^{n} \langle c_{n,i}, m_n \rangle^2 \to 0, \qquad \sum_{i=1}^{n} \|c_{n,i}\|_2^4 \to 0 \qquad \text{and} \qquad \|K_n \circ K_n\|_2 \to 0, \tag{6.55} \]
\[ K_n^*(s, t) := \sum_i |c_{n,i}(s) c_{n,i}(t)| \le C K_n(s, t) \tag{6.56} \]
for some absolute constant $C < \infty$, and
\[ \|K_n\|_2^2 \to \sigma^2/2, \tag{6.57} \]
then
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_d \sigma Z, \]
where $Z$ is standard normal.

Note that if $\|K_n \circ K_n\|_2 \to 0$ we cannot have $K_n \to K$ in $L_2$ unless $K = 0$.

Remark 6.14.1. The results on convergence or shift convergence in law of $\|Y_n\|_2^2$ derived so far in this article assume infinitesimality of the coefficients $c_{n,i}$ ($\max_i \|c_{n,i}\|_2 \to 0$). Of course, if conditions of this type are removed, other asymptotic distributions can be obtained. It is straightforward to see, for instance, that if
\[ \sum_{1 \le i, j \le n} (c_{n,i,j} - \gamma_{i,j})^2 \to 0, \tag{6.58} \]
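for some real numbers $\{\gamma_{i,j}\}$ satisfying $\sum_{i,j} \gamma_{i,j}^2 < \infty$, then
\[ \|Y_n - EY_n\|_2^2 - E\|Y_n - EY_n\|_2^2 = \sum_{i,j=1}^{n} c_{n,i,j} \big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big) \to_{L_2} \sum_{i,j=1}^{\infty} \gamma_{i,j} \big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big). \]
Note that the limiting random variable is well defined because the condition $\sum_{i,j} \gamma_{i,j}^2 < \infty$ implies that the associated partial sums are $L_2$ convergent. If, further,
\[ \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} c_{n,i,j} - \beta_i \Big)^2 \to 0 \tag{6.59} \]
for some real numbers $\beta_i$ such that $\sum_{i=1}^{\infty} \beta_i^2 < \infty$, then we also have that
\[ \|Y_n\|_2^2 - E\|Y_n\|_2^2 \to_{L_2} \sum_{i,j=1}^{\infty} \gamma_{i,j} \big( (\xi_i - 1)(\xi_j - 1) - \delta_{i,j} \big) + 2 \sum_{i=1}^{\infty} \beta_i (\xi_i - 1). \tag{6.60} \]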
d) Shift convergence of $\|Z_n\|_2^2$. $Y_n$ can be replaced by $Z_n$ in Theorem 6.8 as an immediate consequence of the law of large numbers, whereas it can be replaced in Theorem 6.12 and Corollary 6.14 because of the following proposition.

Proposition 6.15. Suppose $\|Y_n\|_2^2 - E\|Y_n\|_2^2$ converges in law. Then $\|Z_n\|_2^2 - E\|Y_n\|_2^2$ converges in law to the same limit if and only if
\[ \frac{E\|Y_n\|_2^2}{\sqrt{n}} \to 0. \tag{6.61} \]
In particular, this condition is satisfied if both conditions
\[ \frac{\sum_i c_{n,i,i}}{\sqrt{n}} \to 0 \qquad \text{and} \qquad \sum_i \langle c_{n,i}, m_n \rangle^2 \to 0 \tag{6.62} \]
hold. If (6.61) holds, we also have $\langle Z_n - EY_n, h \rangle - \langle Y_n - EY_n, h \rangle \to 0$ for any $h \in L_2(0, 1)$.

Proof. Since
\[ \|Z_n\|_2^2 - \|Y_n\|_2^2 = \Big( \frac{n-1}{S_n} \Big)^2 \Big( 1 - \frac{S_n}{n-1} \Big) \Big( 1 + \frac{S_n}{n-1} \Big) \|Y_n\|_2^2 = O_P(n^{-1/2}) \|Y_n\|_2^2, \]
by the central limit theorem and the law of large numbers, the necessity and sufficiency of condition (6.61) follows from Lemmas 2.2 and 3.1. Now, by (6.26) and Cauchy-Schwarz,
\[ \frac{1}{\sqrt{n}} E\|Y_n\|_2^2 = \frac{1}{\sqrt{n}} \sum_i \Big( c_{i,i} + \sum_j c_{i,j} \Big) \le \frac{1}{\sqrt{n}} \sum_i c_{i,i} + \Big( \sum_i \Big( \sum_j \langle c_{n,i}, c_{n,j} \rangle \Big)^2 \Big)^{1/2}, \]
which gives the sufficiency of (6.62).
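The normalization by sums of exponentials in this section comes from the distributional equality recalled in Section 3.3, $(U_{(1)}, \ldots, U_{(n)}) \overset{d}{=} (S_1/S_{n+1}, \ldots, S_n/S_{n+1})$ with $S_k = \xi_1 + \cdots + \xi_k$. A minimal simulation sketch of that representation follows; the function names, the sample sizes and the seed are choices made for this illustration, and the only check performed is that the empirical means are close to $E U_{(k)} = k/(n+1)$.

```python
import random

def uniform_order_stats_via_exponentials(n, rng):
    # Returns (S_1/S_{n+1}, ..., S_n/S_{n+1}) for i.i.d. mean-one exponentials.
    partial, s = [], 0.0
    for _ in range(n + 1):
        s += rng.expovariate(1.0)
        partial.append(s)
    total = partial[-1]
    return [partial[k] / total for k in range(n)]

def mean_order_stats(n, reps, seed=0):
    # Monte Carlo means of the n simulated order statistics.
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(reps):
        for k, v in enumerate(uniform_order_stats_via_exponentials(n, rng)):
            sums[k] += v
    return [v / reps for v in sums]

if __name__ == "__main__":
    n = 4
    print(mean_order_stats(n, 20000))
    print([k / (n + 1) for k in range(1, n + 1)])
```

The simulated vector is increasing by construction, so no sorting is needed, which is one practical advantage of this representation.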
6.2. Weak convergence of weighted L2 functionals of the quantile process

We distinguish between the uniform and the general quantile processes.

The uniform quantile process. Recall from the beginning of this section that if $u_n$ is the uniform quantile process, then
\[ L_n := \int_{1/n}^{1-1/n} \Big( \frac{u_n(t)}{g(t)} \Big)^2 dt \overset{d}{=} \Big( \frac{n}{S_{n+1}} \Big)^2 \Big\| \sum_{i=1}^{n+1} c_{n,i} \xi_i \Big\|_2^2, \]
where $a_{n,i}$ are as defined in (6.3) and $c_{n,i}(t) = n^{-1/2} a_{n,i}(t) I_{[1/n, 1-1/n]}(t)/g(t)$. With the help of Lemma 6.1, the results of Section 6.1 can be easily specialized to this situation.

a) The infinitesimality condition (6.27): It follows from the definitions that
\[ \frac{1}{2n} \int_{1/n}^{1-1/n} \frac{t^2 + (1-t)^2}{g^2(t)}\,dt \le \max_i \|c_{n,i}\|_2^2 \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt, \]
and from this we conclude that condition (6.27) is equivalent to
\[ \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0. \tag{6.63} \]

b) Convergence of $K_n$ and definition of $K$. Also from the definitions (in (6.21) and Lemma 6.1), we have
\[ K_n(s, t) = \frac{1}{n} \frac{\tilde{K}_n(s, t)}{g(s) g(t)} I\{1/n \le s, t \le 1 - 1/n\}, \]
so that by Lemma 6.1 iii),
\[ K_n(s, t) \to K(s, t) := \frac{s \wedge t - st}{g(s) g(t)} \tag{6.64} \]
pointwise; hence, by Lemma 6.1 iii) and dominated convergence, $K_n \to_{L_2} K$ if and only if $K \in L_2((0,1) \times (0,1))$, if and only if
\[ \int_0^1 \int_0^1 \frac{(s \wedge t - st)^2}{g^2(s) g^2(t)}\,ds\,dt < \infty. \tag{6.65} \]
Next we see that the limiting kernel $K$ is trace-class and the limit (6.29) holds if and only if
\[ \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt < \infty. \tag{6.66} \]
In fact, by Lemma 6.1 iii),
\[ \sum_{i=1}^{n+1} \|c_{n,i}\|_2^2 = \frac{1}{n} \int_{1/n}^{1-1/n} \frac{\tilde{K}_n(t, t)}{g^2(t)}\,dt \to \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt \]
regardless of whether the limiting integral is finite or not. Thus (6.66) is necessary in order to get a finite limit in (6.29). On the other hand, if (6.66) holds and $B_g(t) = B(t)/g(t)$, where $B(t)$, $0 < t < 1$, is a Brownian bridge, then $B_g$ is a centered, $L_2(0,1)$-valued Gaussian process with covariance function $K(s, t)$. Thus, if $\lambda_i$ and $\phi_i$ denote the eigenvalues and eigenfunctions, respectively, of the kernel $K$, then
\[ \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt = \int_0^1 \frac{E B^2(t)}{g^2(t)}\,dt = E \int_0^1 B_g^2(t)\,dt = E \sum_{i=1}^{\infty} \Big( \int_0^1 \frac{B(t)}{g(t)} \phi_i(t)\,dt \Big)^2 \]
\[ = \sum_{i=1}^{\infty} E \Big( \int_0^1 \frac{B(t)}{g(t)} \phi_i(t)\,dt \Big)^2 = \sum_{i=1}^{\infty} \int_0^1 \int_0^1 \frac{s \wedge t - st}{g(s) g(t)} \phi_i(s) \phi_i(t)\,ds\,dt = \sum_{i=1}^{\infty} \lambda_i < \infty. \]
Hence, if (6.66) holds then $K$ is trace-class and (6.29) holds.

c) Convergence of $m_n$ to $m = 0$ assuming infinitesimality. If the infinitesimality condition (6.63) holds, then
\[ \Big\| \sum_{i=1}^{n+1} c_{n,i} \Big\|_2^2 = \frac{1}{n} \int_{1/n}^{1-1/n} \frac{\tilde{m}_n(t)^2}{g^2(t)}\,dt \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0, \]
showing that $m_n \to 0$ in $L_2$.

Finally, note that condition (6.66) implies conditions (6.63) and (6.65): the first, by dominated convergence, and condition (6.65) because, since $(s \wedge t - st)^2 \le s(1-s)t(1-t)$, we have
\[ \int_0^1 \int_0^1 \frac{(s \wedge t - st)^2}{g^2(s) g^2(t)}\,ds\,dt \le \int_0^1 \int_0^1 \frac{s(1-s) t(1-t)}{g^2(s) g^2(t)}\,ds\,dt = \Big( \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt \Big)^2. \]
Summarizing, the above observations and the law of large numbers for $S_n/n$ give the following:

Theorem 6.16. Let $u_n(g)$ denote the weighted uniform quantile process, that is, $u_n(g)(t) = (u_n(t)/g(t)) I\{1/n \le t \le 1 - 1/n\}$, $0 < t < 1$, where $g$ is a non-zero measurable function. Assume
\[ \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0. \]
Then the sequence of processes $\{u_n(g)\}$ is weakly convergent in $L_2(0,1)$ to a nondegenerate limit if and only if
\[ \int_0^1 \frac{t(1-t)}{g^2(t)}\,dt < \infty. \]
In this case,
\[ u_n(g) \to_d B_g \]
in $L_2(0,1)$, where $B_g(t) = B(t)/g(t)$ and $B$ is a Brownian bridge. In particular,
\[ \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt \to_d \int_0^1 \frac{B^2(t)}{g^2(t)}\,dt. \]
Only the necessity part of this theorem may be considered new; the sufficiency is well known (see e.g., Mason (1984), Csörgő and Horváth (1988) and (1993) p. 354).
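The conclusion of Theorem 6.16 can be illustrated by simulation in the simplest case $g \equiv 1$, which satisfies both conditions of the theorem. The sketch below is a rough Monte Carlo check under that assumption, not a proof: it computes $\int_{1/n}^{1-1/n} u_n^2(t)\,dt$ exactly on each cell $((k-1)/n, k/n]$, where $u_n(t) = \sqrt{n}(U_{(k)} - t)$, and compares the empirical mean with $E \int_0^1 B^2(t)\,dt = \int_0^1 t(1-t)\,dt = 1/6$. The function names, sample sizes and seed are assumptions made for this example.

```python
import random

def weighted_quantile_l2(n, rng):
    # Exact value of int_{1/n}^{1-1/n} u_n(t)^2 dt for g = 1, where
    # u_n(t) = sqrt(n) * (U_(k) - t) on the cell ((k-1)/n, k/n].
    u = sorted(rng.random() for _ in range(n))
    total = 0.0
    for k in range(2, n):                 # cells covering [1/n, 1 - 1/n]
        a, b = (k - 1) / n, k / n
        x = u[k - 1]                      # U_(k)
        total += ((x - a) ** 3 - (x - b) ** 3) / 3.0
    return n * total

def mean_statistic(n=200, reps=2000, seed=0):
    # Monte Carlo mean of the statistic over independent samples.
    rng = random.Random(seed)
    return sum(weighted_quantile_l2(n, rng) for _ in range(reps)) / reps

if __name__ == "__main__":
    print(mean_statistic(), 1.0 / 6.0)
```

For other weights $g$ the same loop applies after replacing the closed-form cell integral by a numerical one, which is the only place where $g \equiv 1$ is used.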
Since, under infinitesimality, $m_n \to 0$ in $L_2$, we have that the analogue of Theorem 6.12 for $Y_n = (S_{n+1}/n) u_n$ holds under conditions (6.63) and (6.65). In order to get rid of the factor $S_{n+1}/n$, according to Proposition 6.15 we must have $E\|Y_n\|_2^2/\sqrt{n} \to 0$, which follows from conditions (6.62). Now, if conditions (6.63) and (6.65) hold, then so do conditions (6.62): the second condition in (6.62) is obvious because $m_n \to 0$ in $L_2$ and $\sup_n \|K_n\|_2 < \infty$ (as $K_n$ converges in $L_2$), so
\[ \sum_i \langle c_{n,i}, m_n \rangle^2 = \langle K_n, m_n \otimes m_n \rangle \le \|K_n\|_2 \|m_n\|_2^2 \to 0, \]
and the first follows because, by Lemma 6.1 iii),
\[ \frac{1}{\sqrt{n}} \sum_i \langle c_{n,i}, c_{n,i} \rangle = \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{\tilde{K}_n(t, t)}{n g^2(t)}\,dt \le \frac{3}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt, \]
and it is easy to see that this last expression tends to zero as a consequence of (6.63) (divide the domain of integration at the points $1/\sqrt{n}$, $1/2$ and $1 - 1/\sqrt{n}$). Hence, Theorem 6.12 together with Proposition 6.15 give:

Theorem 6.17. If conditions
\[ \frac{1}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0 \qquad \text{and} \qquad \int_0^1 \int_0^1 \frac{(s \wedge t - st)^2}{g^2(s) g^2(t)}\,ds\,dt < \infty \]
hold, then
\[ \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \to_d \int_0^1 \frac{B^2(t) - E B^2(t)}{g^2(t)}\,dt, \tag{6.67} \]
where the integral of $(B^2 - EB^2)/g^2$ is defined in a limiting $L_2$ sense.

Proof. By Theorem 6.12 and the above observations, it only remains to show that we can actually replace $E\|Y_n\|_2^2$ by the centering constants in (6.67). By (6.63) and Lemma 6.1, we have
\[ \Big| E\|Y_n\|_2^2 - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \Big| \le \frac{1}{n} \int_{1/n}^{1-1/n} \frac{|n t(1-t) - \tilde{K}_n(t, t) - \tilde{m}_n^2(t)|}{g^2(t)}\,dt \le \frac{4}{n} \int_{1/n}^{1-1/n} \frac{1}{g^2(t)}\,dt \to 0. \]

For $g(t) = \phi(\Phi^{-1}(t))$, where $\phi$ and $\Phi$ denote the standard normal density and distribution function respectively, this result goes back, in one form or other, to de Wet and Venter (1972), but it seems to be new in the generality it is given here. See also Gregory (1977) and del Barrio, Cuesta-Albertos, Matrán and Rodríguez-Rodríguez (1999).

Next we examine the normal convergence case (as a consequence of Corollary 6.14). We will further relax integrability (so, condition (6.65) will not hold), but, for convenience, will impose regular variation of $g$ at at least one of the end points 0 and 1. Standard use of the basic properties of regular variation shows that,
if $g$ is regularly varying at 0 and at 1 with exponent $\alpha$, then the hypotheses of Theorem 6.17, (6.63) and (6.65), both hold for $\alpha < 1$ and fail if $\alpha > 1$. We study now the borderline case in which $\alpha = 1$ and (6.65) fails, that is,
\[ L(x) := 2 \int_x^{1-x} \int_x^{1-x} \frac{(s \wedge t - st)^2}{g^2(s) g^2(t)}\,ds\,dt \to \infty \quad \text{as } x \to 0. \tag{6.68} \]
This case will fall within the scope of Corollary 6.14 and we will obtain normal convergence. Besides the function $L(x)$ just defined, it is convenient to introduce two more functions,
\[ M(x) = 2 \int_x^{1-x} \int_x^{1-x} \frac{s \wedge t - st}{g^2(s) g^2(t)}\,ds\,dt \]
and
\[ R(x) = \int_{\Delta_x} \frac{(s \wedge t - st)(s \wedge u - su)(t \wedge v - tv)(u \wedge v - uv)}{g^2(s) g^2(t) g^2(u) g^2(v)}\,ds\,dt\,du\,dv, \]
where $\Delta_x = (x, 1-x)^4$, and establish their relationship with $L$. The next lemma will be helpful for this purpose. The proof (cumbersome but routine calculations based on regular variation) is omitted. It can be found in [31].

Lemma 6.18. Assume $g > 0$ is regularly varying at 0 and at 1 with exponent 1, or that $g$ is regularly varying with exponent one at one of these points and with smaller exponent at the other. Assume also that $L(x) \to \infty$ as $x \to 0$. Then
\[ \lim_{x \to 0} \frac{x M(x)}{L(x)} = 0 \tag{6.69} \]
and
\[ \lim_{x \to 0} \frac{R(x)}{L^2(x)} = 0. \tag{6.70} \]

Theorem 6.19. Assume $g > 0$ is regularly varying at 0 and at 1 with exponent 1, or that $g$ is regularly varying with exponent one at one of these points and with smaller exponent at the other. Assume also that $L(x) \to \infty$ as $x \to 0$. Let $Z$ denote a standard normal random variable. Then
\[ \frac{1}{\sqrt{L(1/n)}} \Big( \int_{1/n}^{1-1/n} \frac{u_n^2(t)}{g^2(t)}\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g^2(t)}\,dt \Big) \to_d Z. \]

Proof. We only consider the case when $g$ is symmetric about 1/2. We can apply Corollary 6.14 with
\[ c_{n,i}(t) = \frac{1}{\sqrt{n L^{1/2}(1/n)}} \frac{a_{n,i}(t)}{g(t)} I\{1/n \le t \le 1 - 1/n\}. \]
Now we
have that 2 Kn 22 =
2 2 n L(1/n)
and n+1 i=1
cn,i 42
2 = 2 n L(1/n)
(1/n)
1−1/n
1/n
1−1/n
1/n
1−1/n 1/n
1−1/n
1/n
˜ 2 (s, t) K n ds dt g 2 (s)g 2 (t)
n+1
a2n,i (s)a2n,i (t) ds dt. g 2 (s)g 2 (t)
i=1
We claim that 2 Kn 22
n+1
→ 1,
cn,i 42
→0
and
i=1
n+1
cn,i , cn,i + mn 2 → 0.
(6.71)
i=1
In fact, from Lemma 6.1 we obtain that by Lemma 6.18, implies that
n+1 i=1
a2n,i (s)a2n,i (t) ≤ 3n(s∧t−st), which,
n+1
1 M (1/n) →0
cn,i 42 ≤ 6 n L(1/n) i=1
and proves the second part of (6.71). The first part can be obtained using Lemma ˜ n (s, t)2 − n2 (s ∧ t − st)2 | = |K ˜ n (s, t) + n(s ∧ t − 6.1 ii) and iii) to see that |K ˜ st)||Kn (s, t) − n(s ∧ t − st)| ≤ 8n(s ∧ t − st) and, consequently, that 1−1/n 1−1/n K ˜ 2 (s, t) − n2 (s ∧ t − st)2 2 n 2 Kn 2 − 1 = ds dt 2 2 (s)g 2 (t) n2 L(1/n) 1/n g 1/n 1 M (1/n) ≤ 16 n → 0. L(1/n)
Finally, the third part of claim (6.71) is a consequence of n+1
cn,i , mn 2 = Kn , mn ⊗ mn
i=1
=
1 n2 L(1/n)
1−1/n
1/n
1−1/n
1/n
1 ˜ n (s, t)m M (1/n) ˜ n (s)m ˜ n (t) K ds dt ≤ n → 0. 2 2 g (s)g (t) L(1/n)
since (a + b) ≤ 2a + 2b . The limits (6.71) prove the first two limits in (6.55) and the limit in (6.57) (with σ 2 = 1). Lemma 6.1 iii) gives that (6.56) is also satisfied (with C = 6). Finally, the third limit in (6.55) follows from Lemma 6.18 since 2
2
2
Kn ◦ Kn 22 ≤
81R((1/n) → 0. n4 L2 (1/n)
Corollary 6.14 implies now that Yn 22 − E Yn 22 →w N (0, 1). The conditions (6.62) from Proposition 6.15 are also satisfied because of the last two limits in (6.71) (see the argument immediately before Theorem 6.17) and therefore we also have Zn 22 − E Yn 22 →w N (0, 1). Now we are only left with showing that we 1−1/n can replace E Yn 22 by L−1/2 (1/n) 1/n t(1 − t)g −2 (t)dt as centering constants. Arguing as in the proof of Theorem 6.17 we see that 1−1/n 1−1/n t(1 − t) 1 4 1 2 dt ≤ dt → 0, E Yn 2 − 1/2 g 2 (t) g 2 (t) L (1/n) 1/n nL1/2 (1/n) 1/n where the last limit is a consequence of (6.59).
Finally, we consider smaller functions g, corresponding to Remark 6.16.1. The following result is only given for completeness and only symmetric weights are considered in the proof. The one sided analogue is already contained in Cs¨ org˝ o and Horv´ ath (1988). Proposition 6.20. Let g be a positive function on (0, 1), regularly varying at 0 and at 1 with exponent α > 1 and with equal or smaller exponent at the other extreme, and such that limx→0 g(x)/g(1 − x) := c ∈ [0, ∞]. Set 1−x t(1 − t) dt. E(x) := g 2 (t) x Then, 1−1/n 2 1 un (t) dt E(1/n) 1/n g 2 (t) ∞ (S (1) − y)2 ∞ (S (2) − y)2
1 c2 1 [y] [y] → dy + dy , d α − 1 1 + c2 1 y 2α 1 + c2 1 y 2α (1)
where {S[y] : y ≥ 1} is the partial sum process associated to the sequence {ξj } of [y] (1) (2) independent exponential random variables, that is, S[y] = j=1 ξj , and {S[y] } is (1)
an independent copy of {S[y] }. Proof. As above, we only consider the case when g is symmetric. Symmetry of g and the fact that an,j (1 − t) = −an,n+2−j (t) show that 2 1−1/n 2 1−1/n n+1 n 2 un (t) 1 1 j=1 an,j (t)ξj dt = dt E(1/n) 1/n g 2 (t) Sn+1 nE(1/n) 1/n g 2 (t) n 2 (Vn(1) + Vn(2) ), = Sn+1 where n+1 2 a (t)ξ n,j j 1/2 1 j=1 dt Vn(1) = 1 g 2 (t) nE( n ) 1/n and
Vn(2) =
1 nE( n1 )
n+1 1/2
1/n
an,j (t)ξn+2−j
j=1
g 2 (t)
2 dt.
We set bn,j (t) = I{j − 1 < nt} and define 2 1/2 n+1 1 j=1 bn,j (t)ξj − nt (1) dt Wn = g 2 (t) nE( n1 ) 1/n
(2)
and, similarly, Wn , replacing ξj with ξn+2−j . Now, since √
2 1 1/2 t2 dt = OP (1) → 0 |(Vn(1) )1/2 − (Wn(1) )1/2 |2 ≤ n(1 − Sn+1 ) n Pr E( n1 ) 1/n g 2 (t) (1)
(1)
(1)
(2)
(2)
and Vn = OP (1), we see that Vn − Wn = oP (1). Analogously, Vn − Wn = (1) (2) oP (1), showing that (6.72) is equivalent to convergence in law of Wn + Wn (1) d (2) (1) to the right-hand side there. Obviously, Wn = Wn and, moreover, Wn and (2) Wn are asymptotically independent: they are indeed independent if n is odd (1) since bn,j (t) = 0 if t ≤ 1/2 and j > (n + 1)/2 and therefore Wn depends only (2) on ξ1 , . . . , ξ(n+1)/2 while Wn depends only on ξ(n+3)/2 , . . . , ξn+1 ; the overlapping that arises if n is even is negligible. Hence, in order to prove (6.72) if suffices to show that ∞ (S (1) − y)2 1 [y] Wn(1) → dy. (6.72) d α−1 1 y 2α To see this, we note that cn,i,j (ξi − 1)(ξj − 1) + 2 dn,i (ξi − 1) + en Wn(1) = i,j
i
with cn,i,j
cn,1,1 = cn,2,2 ,
dn,1 = dn,2
n/2
1 dy if i ∨ j > 1, 2 (y/n) g i∨j−1 n/2 1 ([y] − y) dy if i > 1, dn,i = 2 1 n E( n ) i−1 g 2 (y/n) n/2 1 ([y] − y)2 and en = 2 1 dy. g 2 (y/n) n E( n ) 1
1 = 2 1 n E( n )
Similarly, 1 α−1
∞
(1)
(S[y] − y)2 y 2α
1
dy =
γi,j (ξi − 1)(ξj − 1) + 2
i,j
δi (ξi − 1) + ,
i
∞ ([y]−y) 1 1 dy if i ∨ j > 1, γ1,1 = γ2,2 , δi = α−1 dy if i∨j−1 y 2α i−1 y 2α ∞ ([y]−y)2 1 i > 1, δ1 = δ2 and = α−1 1 dy. Standard regular variation techniques 2α y 2 show that i,j (cn,i,j − γi,j ) → 0, i (dn,i − δi )2 → 0 and en → , yielding as in where γi,j =
1 α−1
∞
Remark 3.10 that Wn(1) → L2
and proving (6.72).
1 α−1
1
∞
(1)
(S[y] − y)2 y 2α
dy
The general quantile process. By Proposition 6.5, we can transfer the results in the previous subsection to the general quantile process just by taking $g(t) = f(F^{-1}(t))/\sqrt{w(t)}$. Let us denote as General Hypotheses or (GH) the following conditions on the cdf $F$ and the weight $w$:

(GH) $F$ is twice differentiable on its open support $(a_F, b_F)$ with $f(x) = F'(x) > 0$ there, and satisfies conditions (6.7), (6.13) and (6.14). $w$ is a non-negative measurable function on $(0, 1)$ and satisfies conditions (6.15).

These, together with (6.11), are the conditions under which we can transfer results on $u_n$ to $v_n$ by Proposition 6.5. (6.11) is not included because it will be subsumed by other conditions; in fact, by dominated convergence,
\[ \int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))} w(t)\,dt < \infty \ \Rightarrow\ (6.11) \ \Rightarrow\ \frac{1}{n} \int_{1/n}^{1-1/n} \frac{w(t)}{f^2(F^{-1}(t))}\,dt \to 0. \]
We then have:

Theorem 6.21. Let $B$ be a Brownian bridge on $(0, 1)$ and let $Z$ be a standard normal random variable.

a) If $F$ and $w$ satisfy (GH) and
\[ \int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))} w(t)\,dt < \infty \tag{6.73} \]
then
\[ \sqrt{w(t)}\, v_n(t) \to \frac{B(t) \sqrt{w(t)}}{f(F^{-1}(t))} \quad \text{in law in } L_2(0, 1); \]
in particular,
\[ \int_0^1 v_n^2(t) w(t)\,dt \to \int_0^1 \frac{B^2(t)}{f^2(F^{-1}(t))} w(t)\,dt \quad \text{in distribution.} \]

b) If $F$ and $w$ satisfy (GH) and
\[ \frac{1}{\sqrt{n}} \int_{1/n}^{1-1/n} \frac{t^{1/2} (1-t)^{1/2}}{f^2(F^{-1}(t))} w(t)\,dt \to 0 \qquad \text{and} \qquad \int_0^1 \int_0^1 \frac{(s \wedge t - st)^2}{f^2(F^{-1}(s)) f^2(F^{-1}(t))} w(s) w(t)\,ds\,dt < \infty \tag{6.74} \]
then
\[ \int_0^1 v_n^2(t) w(t)\,dt - \int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))} w(t)\,dt \to_w \int_0^1 \frac{B^2(t) - E B^2(t)}{f^2(F^{-1}(t))} w(t)\,dt. \]
c) Assume F is twice differentiable on its open support (aF , bF ) with f (x) = F (x) > 0 there, that F satisfies condition (6.7) and that the function g := f (F −1 ) is regularly varying with exponent one at at least one of the two points
Empirical and Quantile Processes
75
0 and at 1, and with exponent not larger than one in the other. Assume also that 1−x 1−x (s ∧ t − st)2 dsdt → ∞ (6.75) L(x) := 2 f 2 (F −1 (s)f 2 (F −1 (t)) x x as x → 0. Then, 1 1 1 t(1 − t)
dt → Z in distribution. vn2 (t)dt − 2 −1 (t)) L(1/n) 0 0 f (F
(6.76)
Proof. By Proposition ?? and the remark above on condition (6.11), statements a) and b) of the theorem do not require proof. But part c) does (Proposition ?? does not apply in this case). As usual, we assume $f(F^{-1})$ symmetric about 1/2. If we can replace $\|v_n\|_2^2$ in (6.76) by $\|u_n/f(F^{-1})\|_{2,n}^2 = \int_{1/n}^{1-1/n} u_n^2(t)/f^2(F^{-1}(t))\,dt$, the result will follow from Theorem 6.19. By the proof of Lemma 6.4, we can replace $\|v_n\|_2^2$ by $\|v_n\|_{2,n}^2$ if we show that
\[ \lim_{x\to 0} \frac{x^2}{f^2(F^{-1}(x))\sqrt{L(x)}} = 0 \quad\text{and}\quad \lim_{x\to 0} \frac{1}{x\sqrt{L(x)}}\int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt = 0, \]
and, by the proof of Lemma 6.3 (see (2.20) and (2.21)), we can replace $\|v_n\|_{2,n}^2$ by $\|u_n/f(F^{-1})\|_{2,n}^2$ if
\[ \lim_{n\to\infty} \frac{1}{\sqrt{nL(1/n)}}\int_{1/n}^{1-1/n} \frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt = 0. \]
The first and third of these limits follow just as the limit (??) in the proof of Lemma 6.18, using L'Hôpital and the equivalence (??). To show that the second limit also holds, let us set $h(x) = \int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt$ and observe that
\[ h'(x) = \frac{2}{f(F^{-1}(x))}\int_0^x (F^{-1}(x) - F^{-1}(t))\,dt = \frac{2}{f(F^{-1}(x))}\int_0^x\!\!\int_t^x \frac{du}{f(F^{-1}(u))}\,dt = \frac{2}{f(F^{-1}(x))}\int_0^x \frac{u}{f(F^{-1}(u))}\,du \simeq \frac{2x^2}{f^2(F^{-1}(x))}, \]
the last equivalence being a consequence of regular variation. This, (??) and regular variation imply, in turn,
\[ h(x) = \int_0^x h'(y)\,dy \simeq 2\int_0^x \frac{y^2}{f^2(F^{-1}(y))}\,dy \simeq \frac{2x^3}{f^2(F^{-1}(x))} \simeq 2x^2\int_x^\delta \frac{dy}{f^2(F^{-1}(y))} \]
and, consequently, that
\[ \lim_{x\to 0} \frac{1}{x\sqrt{L(x)}}\int_0^x (F^{-1}(x) - F^{-1}(t))^2\,dt = 2\lim_{x\to 0} \frac{x\int_x^\delta \frac{1}{f^2(F^{-1}(y))}\,dy}{\sqrt{L(x)}} = 0 \]
by (??).

Examples

a) Consider the distribution functions
\[ F_\alpha(x) = \begin{cases} \frac12 e^{-|x|^\alpha} & \text{if } x \le 0,\\ 1 - \frac12 e^{-x^\alpha} & \text{if } x \ge 0, \end{cases} \qquad \text{for } \alpha > 0, \]
and take $w \equiv 1$. Let $f_\alpha$ be the corresponding densities, which are symmetric about zero. Then, it is easy but somewhat cumbersome to see that
\[ f_\alpha(F_\alpha^{-1}(x)) \approx x(1-x)\log^{(\alpha-1)/\alpha}\frac{1}{x(1-x)}, \qquad f_\alpha'(F_\alpha^{-1}(x)) \approx x(1-x)\log^{(\alpha-2)/\alpha}\frac{1}{x(1-x)}, \]
where $a(x) \approx b(x)$ means that $0 < \lim_{x\to 0} a(x)/b(x) < \infty$ and likewise for $x \to 1$, whereas $0 < \inf_{t\in I} a(t)/b(t) \le \sup_{t\in I} a(t)/b(t) < \infty$ for any closed interval $I$ contained in $(0,1)$ (for instance, $f_\alpha(F_\alpha^{-1}(x)) = \alpha x|\log 2x|^{(\alpha-1)/\alpha} I_{\{x \le 1/2\}} + \alpha(1-x)|\log 2(1-x)|^{(\alpha-1)/\alpha} I_{\{x > 1/2\}}$). So, $f_\alpha(F_\alpha^{-1})$ is symmetric about 1/2 and of regular variation with exponent 1 at 0 (and at 1). It then follows easily that $F_\alpha$ is in case a) of the above theorem iff $\alpha > 2$, in case b) iff $4/3 < \alpha \le 2$, hence for the normal distribution, and in case c) iff $0 < \alpha \le 4/3$, in particular for the symmetric exponential distribution. As mentioned above, if the tail probabilities are of different order, the largest dominates and these theorems still hold, so that the same conclusions apply to the one-sided families.

b) Likewise, if
\[ f_\alpha(x) = \alpha x^{\alpha-1}e^{-x^\alpha}, \quad x > 0,\ \alpha > 0, \]
is the Weibull family of densities, then $f(F^{-1}(u)) = \alpha(1-u)\log^{(\alpha-1)/\alpha}\frac{1}{1-u}$, and, as in the above example, $f_\alpha$ is in case a), b) or c) according to whether $\alpha > 2$, $4/3 < \alpha \le 2$ or $0 < \alpha \le 4/3$.

c) (McLaren and Lockhart (1987)). For the logistic distribution $F(x) = (1 + e^{-x})^{-1}$, $x \in \mathbb{R}$, and the exponential with parameter one, both falling in case c), computation of $L$ and the centering gives
\[ \frac{1}{2^{3/2}\sqrt{\log n}}\left(\int_0^1 v_n^2(t)\,dt - 2\log n\right) \to Z \quad \text{in distribution,} \]
and for the extreme value distribution $H(x) = \exp(-e^{-x})$, also in case c),
\[ \frac{1}{2\sqrt{\log n}}\left(\int_0^1 v_n^2(t)\,dt - \log n\right) \to Z \quad \text{in distribution.} \]
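As a numerical sanity check on the logistic case of example c), the following Python sketch evaluates the centering integral $\int_{1/n}^{1-1/n} t(1-t)/f^2(F^{-1}(t))\,dt$. It relies on the standard identity $f(F^{-1}(t)) = t(1-t)$ for the logistic distribution, so that the integrand is $1/(t(1-t))$ and the integral equals $2\log(n-1)$, which behaves like $2\log n$. The function names and the midpoint rule are illustrative choices, not part of the text.

```python
import math

def centering(n, steps=200_000):
    """Midpoint-rule approximation of the integral of t(1-t)/g(t)^2 over
    [1/n, 1-1/n], where g(t) = t(1-t) is f(F^{-1}(t)) for the logistic law."""
    a, b = 1.0 / n, 1.0 - 1.0 / n
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        t = a + (k + 0.5) * h
        g = t * (1.0 - t)
        total += t * (1.0 - t) / (g * g)
    return total * h

def centering_closed_form(n):
    """Exact value: the antiderivative of 1/(t(1-t)) is log(t/(1-t))."""
    return 2.0 * math.log(n - 1)

if __name__ == "__main__":
    for n in (100, 1000):
        print(n, centering(n), centering_closed_form(n), 2.0 * math.log(n))
```

The printed columns allow a direct comparison of the numerical integral, its closed form and the asymptotic centering $2\log n$.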
We cannot apply Proposition ?? when $f(F^{-1}(t))$ is too large at 0 and 1 (regularly varying of exponent 1 or lower), as we saw in Case c) of the above theorem (exponent 1). Here is a situation where the exponent is larger than one.

Theorem 6.22. Assume $F$ satisfies conditions (6.7), $f(F^{-1})$ varies regularly at 0 or at 1 with exponent $\gamma \in (1, 3/2)$ and with equal or smaller exponent at the other extreme, and $\lim_{x\to 0} |F^{-1}(x)|/F^{-1}(1-x) = c \in [0, \infty]$. Then, denoting $\alpha = 1 - \gamma$,
\[ \frac{1}{E(1/n)}\int_0^1 v_n^2(t)\,dt \to_w \frac{2}{|\alpha|}\left[\frac{c^2}{1+c^2}\int_0^\infty \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2\,dy + \frac{1}{1+c^2}\int_0^\infty \big((S^{(2)}_{[y]+1})^\alpha - y^\alpha\big)^2\,dy\right]. \]
If $F$ satisfies conditions (6.7), $f(F^{-1})$ varies regularly at 0 or at 1 with exponent $\gamma = 3/2$ (i.e., $\alpha = -1/2$) and with equal or smaller exponent at the other extreme, $\lim_{x\to 0} |F^{-1}(x)|/F^{-1}(1-x) = c \in [0, \infty]$ and $\int_0^1 (F^{-1}(t))^2\,dt < \infty$, then
\[ \frac{1}{E(1/n)}\left(\int_0^1 v_n^2(t)\,dt - c_n\right) \to_w 4\left[\frac{c^2}{1+c^2}\left(\int_1^\infty \big((S^{(1)}_{[y]+1})^{-1/2} - y^{-1/2}\big)^2\,dy + \frac{1}{S^{(1)}_1} - \frac{4}{\sqrt{S^{(1)}_1}}\right) + \frac{1}{1+c^2}\left(\int_1^\infty \big((S^{(2)}_{[y]+1})^{-1/2} - y^{-1/2}\big)^2\,dy + \frac{1}{S^{(2)}_1} - \frac{4}{\sqrt{S^{(2)}_1}}\right)\right], \]
where $c_n = \int_0^{1/n} (F^{-1}(t))^2\,dt + \int_{1-1/n}^1 (F^{-1}(t))^2\,dt$. In both cases $\{S^{(1)}_{[y]+1} : y \ge 0\}$ is the partial sum process associated to the sequence $\{\xi_j\}$ of independent exponential random variables, that is, $S^{(1)}_{[y]+1} = \sum_{j=1}^{[y]+1} \xi_j$, and $\{S^{(2)}_{[y]+1} : y \ge 0\}$ is an independent copy of $\{S^{(1)}_{[y]+1}\}$.

Remark 6.22.1. Regular variation of $f(F^{-1})$ with exponent $\gamma$ is, essentially, equivalent to regular variation of $F^{-1}$ with exponent $\alpha \in (-1/2, 0)$. In fact, if $f(F^{-1}) \in RV_\gamma(0)$, then $F^{-1} \in RV_\alpha(0)$ and, provided $f$ is monotone in a neighborhood of $-\infty$, if $F^{-1} \in RV_\alpha(0)$ then $f(F^{-1}) \in RV_\gamma(0)$ (see, e.g., Resnick (1987), Propositions 0.6 and 0.7). With the assumption of regular variation, finiteness of $\int_0^1 v_n^2(t)\,dt$ requires $\alpha \ge -1/2$. Thus, Theorem 6.22 completes the picture of all the possible limiting distributions of $\int_0^1 v_n^2(t)\,dt$ for distributions with regularly varying tails.

Remark 6.22.2. It follows easily from the law of the iterated logarithm that
\[ \limsup_{y\to\infty} \frac{(S^{(1)}_{[y]+1})^\alpha - y^\alpha}{|\alpha|\,y^{\alpha-1/2}\sqrt{2\log\log y}} = 1 \quad \text{a.s.} \]
for all $\alpha < 0$, see, e.g., Samorodnitsky and Taqqu (1994), p. 31. This implies that the limiting integrals in Theorem 6.22 are a.s. finite. Integrability of $\big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2$ at 0 needs $\alpha > -1/2$. When $\alpha = -1/2$ the effect of the centering constants, $c_n$, is to remove this lack of integrability, still leading to a limiting distribution.

Next we collect some elementary properties of regularly varying functions that will be useful in our proof of Theorem 6.22.
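Lemma 6.23. a) If $l \in RV_0(0)$ is positive and $\varepsilon > 0$ then $\lim_{n\to\infty} (\log n)^{-\varepsilon}\, l\big(\tfrac{\log n}{n}\big)\big/ l\big(\tfrac1n\big) = 0$.
b) If $l \in RV_\alpha(0)$ and $\alpha < 0$ then $\lim_{n\to\infty} l\big(\tfrac{\log n}{n}\big)\big/ l\big(\tfrac1n\big) = 0$.
c) If $l \in RV_\alpha(0)$ and $\beta > -\alpha$ then $x^\beta l(x) \to 0$ as $x \to 0$.

Proof. b) is a trivial consequence of a) and the proof of c) can be found, e.g., in Resnick (1987), so we only prove a). By Karamata's theorem (see, e.g., Resnick (1987), p. 17) $l$ can be written as $l(x) = c(x)\exp\big(\int_x^1 (\varepsilon(t)/t)\,dt\big)$ with $c(x) \to c \in (0, \infty)$ and $\varepsilon(x) \to 0$ as $x \to 0$. Therefore, taking $n_0$ large enough to ensure that $|\varepsilon(t)| < \varepsilon/2$ for $t \le \log n_0/n_0$ and $n \ge n_0$ we have that
\[ (\log n)^{-\varepsilon}\,\frac{l\big(\frac{\log n}{n}\big)}{l\big(\frac1n\big)} \le 2(\log n)^{-\varepsilon}\exp\left(\frac{\varepsilon}{2}\int_{1/n}^{\log n/n} \frac{1}{t}\,dt\right) = 2(\log n)^{-\varepsilon/2} \to 0. \]

Proof of Theorem 6.22. We will assume in this proof that $0 > \alpha > -1/2$. The case $\alpha = -1/2$ can be handled with straightforward changes. We set, as in the proof of Proposition 6.20, $b_{n,j}(t) = I\{j - 1 < nt\}$ and $H^{-1}(x) = -F^{-1}(1-x)$ and observe, using the fact that $b_{n,j}(1-t) = 1 - b_{n,n+2-j}(t)$ except in a null set, that
\[ \frac{1}{E(1/n)}\int_0^1 v_n^2(t)\,dt = \frac{n}{E(1/n)}\int_0^{\frac{\log n}{n}} \Big(F^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt + \frac{n}{E(1/n)}\int_{1-\frac{\log n}{n}}^{1} \Big(F^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt + \frac{1}{E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} v_n^2(t)\,dt \]
\[ = \frac{n}{E(1/n)}\int_0^{\frac{\log n}{n}} \Big(F^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt + \frac{n}{E(1/n)}\int_0^{\frac{\log n}{n}} \Big(H^{-1}\Big(\frac{1}{S_{n+1}}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_{n+2-j}\Big) - H^{-1}(t)\Big)^2 dt + \frac{1}{E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} v_n^2(t)\,dt \]
\[ =: V_n^{(1)} + V_n^{(2)} + V_n^{(3)}. \]
We also set
\[ W_n^{(1)} := \frac{n}{E(1/n)}\int_0^{\frac{\log n}{n}} \Big(F^{-1}\Big(\frac{1}{n}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt \]
and
\[ W_n^{(2)} := \frac{n}{E(1/n)}\int_0^{\frac{\log n}{n}} \Big(H^{-1}\Big(\frac{1}{n}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_{n+2-j}\Big) - H^{-1}(t)\Big)^2 dt. \]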
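The two combinatorial facts behind this decomposition can be checked mechanically: with $b_{n,j}(t) = I\{j-1 < nt\}$, the sum $\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j$ is the partial sum $S_{\lceil nt\rceil}$, and the reflection identity $b_{n,j}(1-t) = 1 - b_{n,n+2-j}(t)$ holds whenever $nt$ is not an integer. The sketch below uses exact rational arithmetic; the helper names and the test values standing in for the $\xi_j$ are illustrative assumptions.

```python
from fractions import Fraction
from math import ceil

def b(n, j, t):
    """Indicator b_{n,j}(t) = 1 if j - 1 < n t, else 0."""
    return 1 if j - 1 < n * t else 0

def indicator_sum(xi, t):
    """Sum over j = 1..n+1 of b_{n,j}(t) * xi_j, where len(xi) = n + 1."""
    n = len(xi) - 1
    return sum(b(n, j, t) * xi[j - 1] for j in range(1, n + 2))

def partial_sum(xi, m):
    """S_m = xi_1 + ... + xi_m."""
    return sum(xi[:m])

def reflection_holds(n, t):
    """Check b_{n,j}(1 - t) == 1 - b_{n,n+2-j}(t) for every j = 1..n+1."""
    return all(b(n, j, 1 - t) == 1 - b(n, n + 2 - j, t) for j in range(1, n + 2))

if __name__ == "__main__":
    n = 7
    xi = [Fraction(k, 3) for k in range(1, n + 2)]  # arbitrary positive stand-ins
    for t in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
        print(t, indicator_sum(xi, t) == partial_sum(xi, ceil(n * t)), reflection_holds(n, t))
```

The exceptional null set is the set of $t$ with $nt$ an integer; for instance, with $n = 7$ and $t = 3/7$ the reflection identity fails at $j = 5$.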
Observe that $W_n^{(1)}$ and $W_n^{(2)}$ are independent since they are functions of disjoint sets of independent exponential r.v.'s $\xi_j$. We will proceed now by showing that the central part, $V_n^{(3)}$, is negligible and that the upper and lower integrals are asymptotically independent and weakly convergent to the above stated limits. This will be achieved by proving the following three claims:

Claim 1. $V_n^{(3)} \to_{Pr} 0$.

Claim 2. $(V_n^{(i)})^{1/2} - (W_n^{(i)})^{1/2} \to_{Pr} 0$, $i = 1, 2$.

Claim 3. $W_n^{(1)} \to_w \frac{2c^2}{|\alpha|(1+c^2)}\int_0^\infty \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy$ and $W_n^{(2)} \to_w \frac{2}{|\alpha|(1+c^2)}\int_0^\infty \big((S^{(2)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy$.

Proof of Claim 1. We show first that
\[ V_n^{(3)} - \frac{1}{E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{Pr} 0. \]
As in the proof of Proposition 6.3 this reduces to showing that
\[ \frac{1}{nE(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{1}{f^2(F^{-1}(t))}\,dt \to 0 \quad\text{and}\quad \frac{1}{\sqrt{n}\,E(1/n)}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt \to 0. \]
To ease the computations we will assume in the remainder of the proof of this claim that $c = 1$ and replace $E(1/n)$ by $F^{-1}(1/n)^2$ in the last two denominators (the ratio of the two sequences converges, by regular variation, to a positive constant). Extension to general $c$ is straightforward. Regular variation implies that
\[ \lim_{x\to 0} \frac{x/f(F^{-1}(x))}{F^{-1}(x)} = \alpha \quad\text{and}\quad \lim_{x\to 0} \frac{x/f^2(F^{-1}(x))}{\int_x^{1-x} \frac{1}{f^2(F^{-1}(t))}\,dt} = \frac12 - \alpha, \]
which implies, in turn, that
\[ \lim_{n\to\infty} \frac{1}{nF^{-1}(\frac1n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{1}{f^2(F^{-1}(t))}\,dt = \frac{\alpha^2}{1/2 - \alpha}\lim_{n\to\infty} \frac{l^{(1)}(\log n/n)}{l^{(1)}(1/n)} = 0, \]
where $l^{(1)}(x) = x/f^2(F^{-1}(x)) \in RV_{2\alpha-1}(0)$ and $2\alpha - 1 \in (-2, -1)$, and the last limit follows from Lemma 6.23. Similarly,
\[ \lim_{n\to\infty} \frac{1}{\sqrt{n}\,F^{-1}(\frac1n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{t^{1/2}(1-t)^{1/2}}{f^2(F^{-1}(t))}\,dt = \frac{\alpha^2}{1/4 - \alpha}\lim_{n\to\infty} \frac{l^{(2)}(\log n/n)}{l^{(2)}(1/n)} = 0, \]
since $l^{(2)}(x) = x^{3/2}/f^2(F^{-1}(x)) \in RV_{2\alpha-1/2}(0)$ and $2\alpha - 1/2 \in (-3/2, -1/2)$. Now we can prove Claim 1 by showing that $\frac{1}{F^{-1}(1/n)^2}\int_{\log n/n}^{1-\log n/n} \frac{u_n^2(t)}{f^2(F^{-1}(t))}\,dt \to_{Pr} 0$. But taking expectations we can see that it suffices to show that
\[ \lim_{n\to\infty} \frac{1}{F^{-1}(\frac1n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt = 0. \]
Using again regular variation properties and Lemma 6.23 we have that
\[ \lim_{n\to\infty} \frac{1}{F^{-1}(\frac1n)^2}\int_{\frac{\log n}{n}}^{1-\frac{\log n}{n}} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt = |\alpha|\lim_{n\to\infty} \frac{l^{(3)}(\log n/n)}{l^{(3)}(1/n)} = 0, \]
now with $l^{(3)}(x) = x^2/f^2(F^{-1}(x)) \in RV_{2\alpha}(0)$ and $2\alpha \in (-1, 0)$. This completes the proof of Claim 1.

Proof of Claim 2. We will show that $(V_n^{(1)})^{1/2} - (W_n^{(1)})^{1/2} \to_{Pr} 0$. It suffices to show that
\[ \frac{1}{F^{-1}(\frac1n)^2}\int_0^{\log n} \big(F^{-1}(S^{(1)}_{[y]+1}/S_{n+1}) - F^{-1}(S^{(1)}_{[y]+1}/n)\big)^2 dy \to_{Pr} 0. \]
To ease the notation we will omit the superscript from $S^{(1)}_{[y]+1}$. Similarly as in (6.10) we consider a Taylor expansion
\[ F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big) = \Big(1 - \frac{S_{n+1}}{n}\Big)\frac{S_j}{S_{n+1}}\,\frac{1}{f(F^{-1}(\frac{S_j}{n}))} - \frac12\Big(1 - \frac{S_{n+1}}{n}\Big)^2\Big(\frac{S_j}{S_{n+1}}\Big)^2\frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}, \]
for some $\xi$ between $S_j/S_{n+1}$ and $S_j/n$, which enables us to write, using the obvious analogues of (??) and (??), and the fact that $\sup_{n\ge 1} nE\big(1 - \frac{S_{n+1}}{n}\big)^2 < \infty$, that
\[ \Big|F^{-1}\Big(\frac{S_j}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_j}{n}\Big)\Big| \le O_P(1)\left[\frac{1}{\sqrt{n}}\,\frac{S_j/n}{f(F^{-1}(\frac{S_j}{n}))} + \frac{1}{n}\,\frac{S_j/n}{f(F^{-1}(\frac{S_j}{n}))}\right] \le O_P(1)\,\frac{1}{\sqrt{n}}\,\frac{S_j/n}{f(F^{-1}(\frac{S_j}{n}))}, \]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend on $j \in [1, \log n]$. We take now $\varepsilon > 0$ such that $2\gamma - 3 + \varepsilon < 0$. From this bound and regular variation (Lemma 6.23, c)) we obtain that
\[ \int_0^{\log n} \Big(F^{-1}\Big(\frac{S_{[y]+1}}{S_{n+1}}\Big) - F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big)\Big)^2 dy \le O_P(1)\,\frac{1}{n}\int_0^{\log n} \frac{(S_{[y]+1}/n)^2}{f^2(F^{-1}(\frac{S_{[y]+1}}{n}))}\,dy \le O_P(1)\,\frac{1}{n}\sum_{j=1}^{[\log n+1]} \frac{(S_j/n)^2}{f^2(F^{-1}(\frac{S_j}{n}))} \]
\[ \le O_P(1)\,\frac{1}{n}\sum_{j=1}^{[\log n+1]} \Big(\frac{S_j}{n}\Big)^{2-2\gamma-\varepsilon} \le O_P(1)\,n^{2\gamma-3+\varepsilon}\sum_{j=1}^{[\log n+1]} j^{2-2\gamma-\varepsilon} \to 0. \]
This completes the proof of Claim 2 (note that we need not divide by $F^{-1}(\frac1n)^2$ to obtain the equivalence of the two sequences if $\gamma < 3/2$; if $\gamma = 3/2$ that division still gives the result).
Proof of Claim 3. We set
\[ W_{n,k}^{(1)} = \frac{n}{E(1/n)}\int_0^{k/n} \Big(F^{-1}\Big(\frac{1}{n}\sum_{j=1}^{n+1} b_{n,j}(t)\xi_j\Big) - F^{-1}(t)\Big)^2 dt \quad\text{and}\quad W_k^{(1)} = \frac{2c^2}{|\alpha|(1+c^2)}\int_0^k \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy. \]
With the change of variable $t = y/n$ we can rewrite $W_{n,k}^{(1)}$ as
\[ W_{n,k}^{(1)} = b_n\int_0^k \left(\frac{F^{-1}(S^{(1)}_{[y]+1}/n) - F^{-1}(y/n)}{F^{-1}(1/n)}\right)^2 dy, \]
where $b_n = F^{-1}(\frac1n)^2/E(\frac1n) \to \frac{2c^2}{|\alpha|(1+c^2)}$, and we conclude that, by regular variation, $W_{n,k}^{(1)} \to_{Pr} W_k^{(1)}$. To prove that $W_n^{(1)} \to_w \frac{2c^2}{|\alpha|(1+c^2)}\int_0^\infty \big((S^{(1)}_{[y]+1})^\alpha - y^\alpha\big)^2 dy$ it suffices, using a $3\varepsilon$ argument, to show that
\[ \lim_{k\to\infty}\limsup_{n\to\infty} P\big(|W_{n,k}^{(1)} - W_n^{(1)}| > \varepsilon\big) = 0 \tag{6.77} \]
for all $\varepsilon > 0$. As in the proof of Claim 2 we consider a Taylor expansion
\[ F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big) = \frac{S_{[y]+1} - y}{nf(F^{-1}(y/n))} - \frac12\,\frac{(S_{[y]+1} - y)^2}{n^2}\,\frac{f'(F^{-1}(\xi))}{f^3(F^{-1}(\xi))}, \]
for some $\xi$ between $S_{[y]+1}/n$ and $y/n$, which enables us to write, using the obvious equivalents of (??) and (??), and the fact that $\sup_{y\ge 1} E\big((S_{[y]+1} - y)/y^{1/2}\big)^2 < \infty$, that
\[ \Big|F^{-1}\Big(\frac{S_{[y]+1}}{n}\Big) - F^{-1}\Big(\frac{y}{n}\Big)\Big| \le O_P(1)\left[\frac{1}{\sqrt{n}}\,\frac{\sqrt{y}/\sqrt{n}}{f(F^{-1}(y/n))} + \frac{1}{n}\,\frac{1}{f(F^{-1}(y/n))}\right], \]
where $O_P(1)$ stands for a stochastically bounded sequence which does not depend on $y \in [k, \log n]$. From this bound we obtain that
\[ |W_{n,k}^{(1)} - W_n^{(1)}| = \frac{b_n}{F^{-1}(\frac1n)^2}\int_k^{\log n} \big(F^{-1}(S^{(1)}_{[y]+1}/n) - F^{-1}(y/n)\big)^2 dy \]
\[ \le O_P(1)\,\frac{1}{F^{-1}(\frac1n)^2}\left[\frac{1}{n}\int_k^{\log n} \frac{y/n}{f^2(F^{-1}(y/n))}\,dy + \frac{1}{n^2}\int_k^{\log n} \frac{1}{f^2(F^{-1}(y/n))}\,dy\right] = O_P(1)\,\frac{1}{F^{-1}(\frac1n)^2}\left[\int_{\frac{k}{n}}^{\frac{\log n}{n}} \frac{t}{f^2(F^{-1}(t))}\,dt + \frac{1}{n}\int_{\frac{k}{n}}^{\frac{\log n}{n}} \frac{1}{f^2(F^{-1}(t))}\,dt\right]. \]
From regular variation we obtain that
\[ \frac{1}{F^{-1}(\frac1n)^2}\left[\int_{\frac{k}{n}}^{\frac{\log n}{n}} \frac{t}{f^2(F^{-1}(t))}\,dt + \frac{1}{n}\int_{\frac{k}{n}}^{\frac{\log n}{n}} \frac{1}{f^2(F^{-1}(t))}\,dt\right] \to C_1k^{2-2\gamma} + C_2k^{1-2\gamma} \]
and this, combined with the last estimate and the fact that $2 - 2\gamma < 0$, completes the proof of (6.77) and, consequently, of Claim 3.

6.3. Weighted Wasserstein tests of fit to location-scale families of distributions

Finally, we apply the foregoing to weighted Wasserstein tests. Recall that
\[ R_n^w = \frac{W_w^2(F_n, \mathcal{H})}{\sigma_w^2(F_n)} \]
relates to the quantile process $v_n$ via equations (5.6) (assuming conditions (5.3)-(5.5)).

Theorem 6.24. Let $w$ be a non-negative measurable function satisfying condition (5.3). Let $\mathcal{H}$ be a location scale family of distributions as defined in the Introduction, such that $\int_0^1 (F^{-1}(t))^2w(t)\,dt < \infty$ for any (hence for all) $F \in \mathcal{H}$, let $G_0 \in \mathcal{H}$ be chosen so as to satisfy conditions (5.4) and (5.5) and let $g_0 = G_0'$. Assume the distribution functions $F \in \mathcal{H}$ and the weight $w$ satisfy conditions (GH) and that moreover
\[ \int_0^1 \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt < \infty. \tag{6.78} \]
Then, under the null hypothesis $F \in \mathcal{H}$, we have
\[ nR_n^w \to_d \int_0^1 \frac{B^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt - \left(\int_0^1 \frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\right)^2 - \left(\int_0^1 \frac{B(t)\,G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\right)^2. \tag{6.79} \]
Note that the hypotheses on $F \in \mathcal{H}$ are either satisfied by all or by none of the functions in $\mathcal{H}$.

Proof. By equivariance we can assume $F = G_0$. The result follows directly from Theorem 4.5 a) as soon as we show that $\sigma_w^2(F_n) \to 1$ in probability. By Theorem 4.5 a) $\|v_n\|_{2,w} = O_P(1)$, and therefore (recall $F = G_0$ and (5.5))
\[ |\sigma_w(F_n) - 1| = \big|\,\|F_n^{-1}\|_{2,w} - \|F^{-1}\|_{2,w}\big| \le \|F_n^{-1} - F^{-1}\|_{2,w} = \frac{1}{\sqrt{n}}\|v_n\|_{2,w} \to_{Pr} 0. \]

If $\mathcal{H}$ in Theorem 6.24 were only a location family or only a scale family then the limit would exhibit only the loss of one degree of freedom, that is, one of the last two integrals would be absent from the limit in (6.79): see Csörgő (2002), where a theorem of this sort for scale families is proved.

Theorem 6.25. Under the hypotheses of Theorem 6.24, except that now condition (6.78) is replaced by the weaker conditions (6.11) and
\[ \int_0^1\!\!\int_0^1 \frac{(s\wedge t - st)^2}{g_0^2(G_0^{-1}(t))\,g_0^2(G_0^{-1}(s))}\,w(s)w(t)\,ds\,dt < \infty, \tag{6.80} \]
we have
\[ nR_n^w - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt \to_d \int_0^1 \frac{B^2(t) - EB^2(t)}{g_0^2(G_0^{-1}(t))}\,w(t)\,dt - \left(\int_0^1 \frac{B(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\right)^2 - \left(\int_0^1 \frac{B(t)\,G_0^{-1}(t)}{g_0(G_0^{-1}(t))}\,w(t)\,dt\right)^2. \]

Proof. As above, we can take $F = G_0$. By Theorem 4.5 b) properly modified to account for the weighted integrals of the Brownian bridge (as done in Theorem 6.17), it suffices to prove that
\[ \sigma_w^2(F_n) \to_{Pr} 1 \quad\text{and}\quad \left(\frac{1}{\sigma_w^2(F_n)} - 1\right)\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt \to_{Pr} 0, \]
which, by condition (6.11), reduce to proving $\sqrt{n}(\sigma_w^2(F_n) - 1) = O_P(1)$. We have
\[ \sqrt{n}(\sigma_w^2(F_n) - 1) = \frac{1}{\sqrt{n}}\|v_n\|_{2,w}^2 + 2\langle v_n, F^{-1}\rangle_w. \]
By checking the proof of Theorem 6.17 (by way of Theorem 6.12), it is easy to see that
\[ \langle u_n/f(F^{-1}), F^{-1}\rangle_{w,n} \to_d \langle B/f(F^{-1}), F^{-1}\rangle_w \]
(note that (5.5) implies $F^{-1} \in L_2(w(t)\,dt)$). Hence Proposition 2.5 gives $\langle v_n, F^{-1}\rangle_w = O_P(1)$. Likewise, by Theorem 4.5 b), $\|v_n\|_{2,w}^2$ is shift convergent in law with shifts $\int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,w(t)\,dt$ which, by (6.11), are $o(\sqrt{n})$, so that $\|v_n\|_{2,w}^2/\sqrt{n} \to_{Pr} 0$.

A version of this theorem for scale families is proved in Csörgő (2002); however, the hypotheses there are stronger by factors of the order of $\log n$ or $(\log n)^2$, the integrals at the end points are not treated analytically and the proof is different (it relies on strong approximations, which account for the stronger assumptions).

Next we consider convergence to a normal distribution. This case is less interesting in connection with testing since, as indicated in the Introduction, $\int_\delta^{1-\delta} v_n^2(t)w(t)\,dt \to_d \int_\delta^{1-\delta} \frac{B^2(t)}{f^2(F^{-1}(t))}\,w(t)\,dt$ if $f$ does not vanish on $\mathrm{supp}\,F$, and therefore, if we divide by $\sqrt{L(1/n)} \to \infty$, as we must by Theorem 6.21 c), this part of the statistic has no influence on the limit. So, when a distribution satisfies the hypotheses of Theorem 6.21 c) (meaning that $g = f(F^{-1})$ is regularly varying with exponent 1 and $L(x)$ with this $g$ tends to infinity), if one wishes to have a sensible test of fit, it is probably best to find a weight $w$ so that one can apply Theorems 6.24 or 6.25. Hence, we will only consider the normal convergence case with weight $w \equiv 1$.

Theorem 6.26. Let $\mathcal{H}$ be a location scale family of distributions and assume for simplicity that the distribution $G_0 \in \mathcal{H}$ with mean zero and variance one is the distribution function of a symmetric random variable. Assume that the following conditions hold for some (hence for all) $F \in \mathcal{H}$: $F$ is twice differentiable on $(a_F, b_F)$ and satisfies condition (6.7), its density $f$ is strictly positive on $(a_F, b_F)$, the function $f(F^{-1})$ is regularly varying of exponent 1 at 0 and at 1, and $L(x) \to \infty$ as $x \to 0$. Then, under the null hypothesis $F \in \mathcal{H}$,
\[ \frac{1}{\sqrt{L(1/n)}}\left(nR_n - \int_{1/n}^{1-1/n} \frac{t(1-t)}{g_0^2(G_0^{-1}(t))}\,dt\right) \to_d Z, \tag{6.81} \]
where $Z$ is standard normal and $nR_n := nR_n^w$ is as defined in (5.6) for $w \equiv 1$.

Proof. As in the previous two theorems we can assume $F = G_0$. Since $f(F^{-1})$ is regularly varying of exponent 1 at 0 and at 1, it follows that $F^{-1}(x) = \int_{1/2}^x \frac{dt}{f(F^{-1}(t))}$ is slowly varying at 0 and at 1. Hence, $\int_0^1 |F^{-1}(t)|^r\,dt < \infty$ for all $r > 0$ and therefore, all the moments $\int_{-\infty}^\infty |x|^r\,dF(x)$, $r > 0$, are finite. In particular, if $X_i$ are i.i.d. with distribution $F$,
\[ \int_0^1 v_n(t)\,dt = \frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i - EX_i) = O_P(1) \]
by the central limit theorem, and $\sigma^2(F_n) \to 1$ a.s. by the law of large numbers. So, it suffices to show that
\[ \frac{1}{\sqrt{L(1/n)}}\left(\|v_n\|_2^2 - \langle v_n, F^{-1}\rangle^2 - \int_{1/n}^{1-1/n} \frac{t(1-t)}{f^2(F^{-1}(t))}\,dt\right) \to_d Z. \]
The arguments in the proof of Theorem 6.21 c) not only show that here we can replace $\|v_n\|_2^2$ by $\|u_n/f(F^{-1})\|_{2,n}^2$, but also that $\langle v_n, F^{-1}\rangle^2$ can be replaced by $\langle u_n/f(F^{-1}), F^{-1}\rangle_n^2$; therefore, the theorem will follow from Theorem 6.21 c) (hence from Theorem 6.19) if we show that the sequence
\[ \Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n := \int_{1/n}^{1-1/n} \frac{u_n(t)F^{-1}(t)}{f(F^{-1}(t))}\,dt, \quad n \in \mathbb{N}, \tag{6.82} \]
is stochastically bounded (as it will then tend to zero upon dividing by $\sqrt{L(1/n)}$). For this, we show that the product of the $n$th variable in (6.82) by $S_{n+1}/n$ has expected value tending to zero and variance dominated by a constant independent of $n$. By (6.2), Lemma 6.1 i) and slow variation of $F^{-1}$ at 0 and 1, we have
\[ \left|E\left(\frac{S_{n+1}}{n}\Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n\right)\right| = \left|E\int_{1/n}^{1-1/n} \frac{F^{-1}(t)\sum_{i=1}^{n+1} a_{n,i}\xi_i}{\sqrt{n}\,f(F^{-1}(t))}\,dt\right| \le \frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{|F^{-1}(t)|}{f(F^{-1}(t))}\,dt \]
\[ = \frac{1}{\sqrt{n}}\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)} |u|\,du = \frac{(F^{-1}(1/n))^2 + (F^{-1}(1-1/n))^2}{2\sqrt{n}} \to 0. \]
Let $X$ be a random variable with distribution $F$. By Lemma 6.1 iii) and finiteness of the absolute moments of $X$, we have
\[ \mathrm{Var}\left(\frac{S_{n+1}}{n}\Big\langle \frac{u_n}{f(F^{-1})}, F^{-1}\Big\rangle_n\right) = E\left(\frac{1}{\sqrt{n}}\int_{1/n}^{1-1/n} \frac{F^{-1}(t)\sum_{i=1}^{n+1} a_{n,i}(\xi_i - 1)}{f(F^{-1}(t))}\,dt\right)^2 = \frac{1}{n}\int_{1/n}^{1-1/n}\!\!\int_{1/n}^{1-1/n} \frac{\tilde{K}_n(s,t)F^{-1}(s)F^{-1}(t)}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt \]
\[ \le 6\int_{1/n}^{1-1/n}\!\!\int_{1/n}^{t} \frac{s(1-t)|F^{-1}(s)||F^{-1}(t)|}{f(F^{-1}(s))\,f(F^{-1}(t))}\,ds\,dt = 6\int_{F^{-1}(1/n)}^{F^{-1}(1-1/n)}\!\!\int_{F^{-1}(1/n)}^{v} F(u)(1 - F(v))|u||v|\,du\,dv \]
\[ \le 6\left[\int_{F^{-1}(1/n)}^{0}\!\!\int_{F^{-1}(1/n)}^{v} F(u)|u||v|\,du\,dv + \int_{0}^{F^{-1}(1-1/n)}\!\!\int_{F^{-1}(1/n)}^{0} F(u)(1 - F(v))|u||v|\,du\,dv + \int_{0}^{F^{-1}(1-1/n)}\!\!\int_{0}^{v} (1 - F(v))|u||v|\,du\,dv\right] \]
\[ \le \frac34 EX^4 + (EX^2)^2 < \infty, \]
where at the last step we use Fubini and integration by parts.
As in Theorem 6.21 c), symmetry of $G_0$ is not necessary. Csörgő (2002) also proves a result for correlation tests where the limit is normal, but only for the special case of Weibull scale families. Likewise, Theorem 6.22 can be used to obtain the limiting distribution of $nR_n$ when $f(F^{-1})$ is regularly varying at the end points with exponent $\gamma > 1$, but we refrain from doing so, to avoid too much repetition.

Example. (Gauss-Laplace location-scale families) This is a modification of a result in Csörgő (2002). Consider the distribution functions from the above Example,
\[ F_\alpha(x) = \begin{cases} \frac12 e^{-|x|^\alpha} & \text{if } x \le 0,\\ 1 - \frac12 e^{-x^\alpha} & \text{if } x \ge 0, \end{cases} \qquad \text{for } \alpha > 0. \]
Then, just as in that example, Theorem 6.24 with $w \equiv 1$ holds for the location scale family based on $F_\alpha$ iff $\alpha > 2$, Theorem 6.25 with $w \equiv 1$ holds iff $4/3 < \alpha \le 2$, hence for the normal distribution (which gives Shapiro-Wilk), and Theorem 6.26 with $w \equiv 1$ holds for $0 < \alpha \le 4/3$, in particular for the symmetric exponential distribution. As mentioned above, if the tail probabilities are of different order, the largest dominates and the same conclusions apply to the one-sided families.
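The classification by $\alpha$ rests on the behaviour of the density-quantile function $f_\alpha(F_\alpha^{-1}(u))$ near 0 and 1. As a hedged illustration, the Python sketch below compares a closed form derived directly from the definition of $F_\alpha$, namely $\alpha v|\log 2v|^{(\alpha-1)/\alpha}$ with $v = \min(u, 1-u)$, against a central finite-difference derivative of $F_\alpha$ evaluated at $F_\alpha^{-1}(u)$. The function names and the step size are illustrative assumptions.

```python
import math

def F(x, alpha):
    """Gauss-Laplace distribution function F_alpha."""
    if x <= 0:
        return 0.5 * math.exp(-abs(x) ** alpha)
    return 1.0 - 0.5 * math.exp(-x ** alpha)

def F_inv(u, alpha):
    """Quantile function of F_alpha for 0 < u < 1."""
    if u <= 0.5:
        return -((-math.log(2.0 * u)) ** (1.0 / alpha))
    return (-math.log(2.0 * (1.0 - u))) ** (1.0 / alpha)

def density_quantile_closed_form(u, alpha):
    """alpha * v * |log 2v|^((alpha-1)/alpha) with v = min(u, 1-u), u != 1/2."""
    v = min(u, 1.0 - u)
    return alpha * v * abs(math.log(2.0 * v)) ** ((alpha - 1.0) / alpha)

def density_quantile_numeric(u, alpha, h=1e-6):
    """Central-difference approximation of f_alpha at F_alpha^{-1}(u)."""
    x = F_inv(u, alpha)
    return (F(x + h, alpha) - F(x - h, alpha)) / (2.0 * h)

if __name__ == "__main__":
    for alpha in (1.0, 1.5, 2.0):
        for u in (0.05, 0.2, 0.9):
            print(alpha, u, density_quantile_closed_form(u, alpha),
                  density_quantile_numeric(u, alpha))
```

For $\alpha = 1$ (the Laplace case) the closed form reduces to $\min(u, 1-u)$, which makes the regular variation with exponent 1 at both end points visible at a glance.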
Example. (Testing fit to the Laplace location scale family) It follows from the last example and the comments immediately before Theorem 6.26 that a weighted Wasserstein test would be convenient for the Gauss-Laplace location scale family when the index $\alpha$ is between 0 and 4/3. For any given $\alpha > 0$ these families are (in terms of the densities):
\[ \mathcal{H}_\alpha = \left\{F_{\beta,\gamma} : f_{\beta,\gamma}(x) := \frac{\alpha}{2\gamma\Gamma(1/\alpha)}\,e^{-|(x-\beta)/\gamma|^\alpha},\ x \in \mathbb{R},\ \beta \in \mathbb{R},\ \gamma > 0\right\}. \]
The weight should approach zero near 0 and 1. For simplicity we will only present a test for the Laplace family $\mathcal{H}_1$. Simple but tedious computations using the approximations in the previous example show that a weight of the order $w(t) \sim \frac{1}{|\log t(1-t)|^\tau}$ will allow us to apply Theorem 6.24 if $\tau > 1$ and Theorem 6.25 if $1/2 < \tau \le 1$ (the determining conditions are (6.78), which holds for all $\tau > 1$, and (6.80), which holds for $1/2 < \tau \le 1$). If $w$ is too small near 0 and 1, we make the extreme part of the distribution count less, whereas possibly the limit has more variability as the integral of $B^2 - EB^2$ is closer to being divergent. de Wet (2000) convincingly suggests taking $\tau = 1$ (see also Csörgő (2002)). Concretely, we define $w(t) := [\ldots]$

\[ G(u) = \begin{cases} \omega_- := \inf\{x : F(x) > 0\} & \text{for } u = 0,\\ \inf\{x \in \mathbb{R} : F(x) \ge u\} = \sup\{x \in \mathbb{R} : F(x) < u\} & \text{for } 0 < u < 1,\\ \omega_+ := \sup\{x : F(x) < 1\} & \text{for } u = 1. \end{cases} \tag{1.7} \]
The values of the distribution endpoints, $\mathrm{ess\,inf}(X) := \omega_- \ge -\infty$ and $\mathrm{ess\,sup}(X) := \omega_+ \le \infty$, of $P_X$ are possibly infinite. The distribution of $X$ is said to be degenerate when $X$ is almost surely [a.s.] constant, namely, when there exists an $x_0 \in \mathbb{R}$ such that $P(X = x_0) = 1$. In this case, one has
\[ -\infty < \omega_- = \omega_+ = x_0 < \infty. \tag{1.8} \]
On the other hand, for non-degenerate distributions, the following inequalities hold:
\[ -\infty \le \omega_- < \omega_+ \le \infty. \tag{1.9} \]
Non-degeneracy of $P_X$ is conveniently characterized by the property that
\[ \sup_{x\in\mathbb{R}} P(X = x) < 1. \tag{1.10} \]
The quantile function $G(\cdot)$ in (1.7), as well as its right-continuous version, defined, for $0 < u < 1$, by $G(u+) := \inf\{x : F(x) > u\}$, and, for $u = 0$ or 1, through the relations (1.12) below, fulfills
\[ 0 \le u \le v \le 1 \ \Rightarrow\ \omega_- \le G(u) \le G(v) \le \omega_+; \tag{1.11} \]
\[ G(0+) := G(0) = \omega_- = \lim_{u\downarrow 0} G(u); \qquad G(1+) := G(1) = \omega_+ = \lim_{u\uparrow 1} G(u); \tag{1.12} \]
\[ G(u) = G(u-) := \lim_{\varepsilon\downarrow 0} G(u - \varepsilon) \quad \text{for } 0 < u \le 1. \tag{1.13} \]
The right-continuous version $G(\cdot+)$ of the quantile function $G(\cdot)$ of $X$ fulfills (1.11) and (1.12), but not (1.13). The latter relation is replaced by
\[ G(u+) := \lim_{\varepsilon\downarrow 0} G(u + \varepsilon) = \inf\{x : F(x) > u\} = \sup\{x : F(x) \le u\} \quad \text{for } 0 < u < 1. \tag{1.14} \]
One has the following refinement of (1.12) for $G(\cdot+)$:
\[ G(0+) = \omega_- = \lim_{u\downarrow 0} G(u+); \qquad G(1+) = \omega_+ = \lim_{u\uparrow 1} G(u+). \tag{1.15} \]
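To make the difference between $G(u)$ and $G(u+)$ concrete, here is a small Python sketch for a discrete random variable, where the two versions differ exactly at the levels attained by $F$. The three-point distribution and the helper names are illustrative assumptions; exact fractions avoid rounding issues at the jump levels.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical example: X takes the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
VALUES = [0, 1, 2]
PROBS = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
CUM = list(accumulate(PROBS))  # F evaluated at the atoms: 1/4, 3/4, 1

def F(x):
    """Distribution function F(x) = P(X <= x)."""
    return sum((p for v, p in zip(VALUES, PROBS) if v <= x), Fraction(0))

def F_left(x):
    """Left limit F(x-) = P(X < x)."""
    return sum((p for v, p in zip(VALUES, PROBS) if v < x), Fraction(0))

def G(u):
    """Left-continuous quantile function inf{x : F(x) >= u}, for 0 < u < 1."""
    return next(v for v, c in zip(VALUES, CUM) if c >= u)

def G_plus(u):
    """Right-continuous version inf{x : F(x) > u}, for 0 < u < 1."""
    return next(v for v, c in zip(VALUES, CUM) if c > u)

if __name__ == "__main__":
    for u in (Fraction(1, 8), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
        print(u, G(u), G_plus(u))
```

At $u = 1/4$ and $u = 3/4$, which are values taken by $F$, the two versions disagree, while at every other level in $(0,1)$ they coincide.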
Example 1.1. Let $F$ and $G$ be, respectively, the df and qf of a rv $X$.
1°) Show that the set of all $x \in \mathbb{R}$ such that $F(x+) - F(x-) \ge 1/n$ is finite or void, for each $n \ge 1$. Conclude that the set $D_F$ of discontinuity points of $F$ is, at most, countable.
2°) Show likewise that the set $D_G$ of discontinuity points of $G$ is, at most, countable.
P. Deheuvels
1.2.2. Generalized inverses of nondecreasing functions. The distribution and quantile functions are mutual inverses in a sense made precise below. Consider a non-decreasing function $\{H(t) : A \le t \le B\}$, with $-\infty \le A \le B \le \infty$ and $-\infty \le H(A) \le H(B) \le \infty$. We do not make continuity assumptions on $H$, except at end-points when $A < B$, in which case we require that
\[ H(A) = \lim_{t\downarrow A} H(t) \quad\text{and}\quad H(B) = \lim_{t\uparrow B} H(t). \tag{1.16} \]
Then, it is always possible to define a left-continuous inverse $H^{\rightarrow}(s)$ (resp. a right-continuous inverse $H^{\leftarrow}(s)$) of $H(\cdot)$, over $H(A) \le s \le H(B)$, as follows.

(i) When $H(A) < s \le H(B)$ (resp. $H(A) \le s < H(B)$), we set
\[ H^{\rightarrow}(s) = \sup\{t : H(t) < s\} \quad (\text{resp. } H^{\leftarrow}(s) = \inf\{t : H(t) > s\}). \tag{1.17} \]
(ii) When $H(A) = s < H(B)$ (resp. $H(A) < s = H(B)$), we define $H^{\rightarrow}(s)$ (resp. $H^{\leftarrow}(s)$) via (1.17) and
\[ H^{\rightarrow}(H(A)) = \lim_{v\downarrow H(A)} H^{\rightarrow}(v) \quad (\text{resp. } H^{\leftarrow}(H(B)) = \lim_{v\uparrow H(B)} H^{\leftarrow}(v)). \tag{1.18} \]
(iii) There is one possibility uncovered by (1.17)–(1.18), when $H(A) = H(B)$. In this case, we define $H^{\rightarrow}(s)$ (resp. $H^{\leftarrow}(s)$) for $H(A) = s = H(B)$ by setting
\[ H^{\rightarrow}(H(A)) = A \quad (\text{resp. } H^{\leftarrow}(H(B)) = B). \tag{1.19} \]
If, in addition to the previous assumptions, $H(\cdot)$ is right-continuous, then the definition (1.17)–(1.18)–(1.19) of the left-continuous inverse $H^{\rightarrow}(\cdot)$ of $H(\cdot)$ may be simplified into
\[ H^{\rightarrow}(s) = \inf\{t : H(t) \ge s\} \quad \text{for } H(A) \le s \le H(B). \tag{1.20} \]
Likewise, when $H(\cdot)$ is left-continuous, the definition (1.17)–(1.18)–(1.19) of the right-continuous inverse $H^{\leftarrow}(\cdot)$ of $H(\cdot)$ may be simplified into
\[ H^{\leftarrow}(s) = \sup\{t : H(t) \le s\} \quad \text{for } H(A) \le s \le H(B). \tag{1.21} \]
Under the additional assumption that $H(A) < H(B)$, the right-continuous (resp. left-continuous) inversion operators, mapping $H$ onto $H^{\rightarrow}$ (resp. $H$ onto $H^{\leftarrow}$), are involutions. The condition $H(A) < H(B)$ (which entails $A < B$) allows us to define intrinsic end-points, $A^*$ and $B^*$, pertaining to $H(\cdot)$, by setting
\[ A^* = \inf\{t \in (A, B) : H(t) > H(A)\} \quad\text{and}\quad B^* = \sup\{t \in (A, B) : H(t) < H(B)\}. \tag{1.22} \]
We have $-\infty \le A \le A^* < B^* \le B \le \infty$, and $-\infty \le H(A) = H(A^*) < H(B^*) = H(B) \le \infty$. It is readily checked that, whenever $H(\cdot)$ is right-continuous (resp. left-continuous), for all $A^* \le t \le B^*$,
\[ H(t) = \{H^{\rightarrow}\}^{\leftarrow}(t) = \{H^{\leftarrow}\}^{\leftarrow}(t) \quad (\text{resp. } H(t) = \{H^{\leftarrow}\}^{\rightarrow}(t) = \{H^{\rightarrow}\}^{\rightarrow}(t)). \tag{1.23} \]
Topics on Empirical Processes
We note that (1.23) is invalid when $H(t) = \mu$ is constant for $t \in [A, B]$. In this case (1.22) is meaningless, and the definitions (1.19) of $H^{\rightarrow}$ and $H^{\leftarrow}$ reduce to $H^{\leftarrow}(s) = A$ and $H^{\rightarrow}(s) = B$ for $s = \mu$, so that
\[ \{H^{\leftarrow}\}^{\leftarrow}(t) = \{H^{\leftarrow}\}^{\rightarrow}(t) = \mu \ \text{ for } t = A, \quad\text{and}\quad \{H^{\rightarrow}\}^{\leftarrow}(t) = \{H^{\rightarrow}\}^{\rightarrow}(t) = \mu \ \text{ for } t = B. \tag{1.24} \]
Both operators, $H \mapsto H^{\rightarrow}$ and $H \mapsto H^{\leftarrow}$, are natural extensions of the inversion operator, $H \mapsto H^{-1}$, with respect to composition of applications. The interest of $H^{\rightarrow}$ and $H^{\leftarrow}$ is that they are always defined, whereas such is not the case for $H^{-1}$. When $H(\cdot)$ is (strictly) increasing and continuous on $[A, B]$, with a properly defined inverse mapping, $H^{-1}(\cdot)$, (strictly) increasing and continuous on $[H(A), H(B)]$, all three definitions coincide, since then $A = A^*$, $B = B^*$, and, for each $s \in [A, B]$ (resp. $t \in [H(A), H(B)]$),
\[ H(H^{-1}(t)) = t, \qquad H^{-1}(H(s)) = s, \qquad H^{-1}(t) = H^{\rightarrow}(t) = H^{\leftarrow}(t). \tag{1.25} \]
Proposition 1.1 below provides a variant of (1.25), tailored to the case of distribution and quantile functions. Its proof can be adapted to show that continuity of $H$ and $H^{-1}$ is sufficient to imply (1.25).

1.2.3. Some useful properties of distribution and quantile functions. As follows from (1.7) and (1.20)–(1.21)–(1.23), in the special case where $G$ is the qf of the df $F$, we have
\[ G(u) = F^{\rightarrow}(u) \quad\text{and}\quad G(u+) = F^{\leftarrow}(u) \quad \text{for } 0 \le u \le 1, \tag{1.26} \]
\[ F(x) = G^{\leftarrow}(x) \quad\text{and}\quad F(x-) = G^{\rightarrow}(x) \quad \text{for } \omega_- \le x \le \omega_+. \tag{1.27} \]
In view of (1.26)–(1.27) and (1.20)–(1.21), we obtain that, for each $x$ fulfilling $\omega_- \le x \le \omega_+$,
\[ F(x) = \sup\{u : G(u) \le x\} = \sup\{u : G(u+) \le x\}, \tag{1.28} \]
\[ F(x-) = \inf\{u : G(u) \ge x\} = \inf\{u : G(u+) \ge x\}. \tag{1.29} \]
As follows from §1.2.2, the relations (1.28)–(1.29) become invalid when $x \notin [\omega_-, \omega_+]$. We note, further, that the definition (1.7) of $G(\cdot)$ entails that, for each $u \in [0, 1]$ such that $-\infty < G(u) < \infty$,
\[ F(G(u) - \varepsilon) < u \le F(G(u) + \varepsilon) \quad \forall \varepsilon > 0. \tag{1.30} \]
This implies that the image set of $[0, 1]$ by either $G(\cdot)$ or $G(\cdot+)$ is nothing else but the support of the distribution of $X$. The latter, denoted by $\mathrm{supp}(X)$, is a closed subset of $\mathbb{R}$ (see, e.g., p. 277 in [10]), collecting all $x \in \mathbb{R}$ such that $P(X \in V_x) > 0$ for each open neighborhood $V_x$ of $x$ in $\mathbb{R}$. It is noteworthy that $P(X \in \mathrm{supp}(X)) = 1$. The complement in $\mathbb{R}$ of the support of $X$, namely $\mathbb{R} - \mathrm{supp}(X)$, is the union of all open subsets $O$ of $\mathbb{R}$ such that $P(X \in O) = 0$.
An argument similar to that used to infer (1.30) from (1.7) shows that, for each $0 < u < 1$ (this implying that $-\infty < G(u) \le G(u+) < \infty$) and $\varepsilon > 0$, we have
\[ F(G(u) - \varepsilon) < u \le F(G(u) + \varepsilon) \quad\text{and}\quad F(G(u+) - \varepsilon) \le u < F(G(u+) + \varepsilon). \tag{1.31} \]
By letting $\varepsilon \downarrow 0$ in (1.31), we obtain readily the inequalities
\[ F(G(u)-) \le F(G(u+)-) \le u \le F(G(u)) \le F(G(u+)). \tag{1.32} \]
Another application of (1.7) shows readily that, for each $\omega_- < x < \omega_+$,
\[ G(F(x-)) \le G(F(x)) \le x \le G(F(x-)+) \le G(F(x)+). \tag{1.33} \]
A simple and useful consequence of these inequalities is stated in the next proposition.

Proposition 1.1. Assume that $F(\cdot)$ is continuous on $(\omega_-, \omega_+)$, and that $G(\cdot)$ is continuous on $(0, 1)$. Then, for each $x \in [\omega_-, \omega_+]$ and $u \in [0, 1]$, one has
\[ G(F(x)) = x \quad\text{and}\quad F(G(u)) = u. \tag{1.34} \]

Proof. The assumption that $F(\cdot)$ and $G(\cdot)$ are continuous is equivalent to the identities $F(\cdot-) = F(\cdot)$ and $G(\cdot) = G(\cdot+)$. We have therefore $F(x-) = F(x)$ and $G(u) = G(u+)$ in (1.32)–(1.33), which, in turn, entail (1.34) for $x \in (\omega_-, \omega_+)$ and $u \in (0, 1)$. To conclude, we observe that (1.34) holds for $x = \omega_\pm$ and $u = 0$ or 1, as a direct consequence of the definitions (1.1) and (1.7) of $F$ and $G$.

Exercise 1.1. Let $X$ denote a $[0, 1]$-valued rv with density $f(\cdot)$, continuous and positive on $[0, 1]$. Denoting by $F(\cdot)$ and $G(\cdot)$ the df and qf of $X$, show that the quantile density $g(u) = \frac{d}{du}G(u)$ is continuous and positive on $[0, 1]$, and is equal to $1/f(G(u))$.

Exercise 1.2. Let $G$ be the qf of a rv $X$. Prove the equality
\[ E(X) = \int_0^1 G(u)\,du. \]
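The identity of Exercise 1.2 is easy to probe numerically before proving it. The following Python sketch, given as an illustration only, integrates two explicit quantile functions with a midpoint rule: the standard exponential, $G(u) = -\log(1-u)$ with mean 1, and the uniform distribution on $[2, 5]$, $G(u) = 2 + 3u$ with mean 3.5. The function names and the number of grid points are arbitrary choices.

```python
import math

def quantile_exponential(u):
    """Quantile function of the standard exponential distribution."""
    return -math.log(1.0 - u)

def quantile_uniform_2_5(u):
    """Quantile function of the uniform distribution on [2, 5]."""
    return 2.0 + 3.0 * u

def integral_of_quantile(G, steps=100_000):
    """Midpoint-rule approximation of the integral of G over (0, 1).
    Midpoints never touch 0 or 1, so an integrable singularity of G
    at an end point causes no evaluation error."""
    h = 1.0 / steps
    return h * sum(G((k + 0.5) * h) for k in range(steps))

if __name__ == "__main__":
    print(integral_of_quantile(quantile_exponential))   # close to the mean 1
    print(integral_of_quantile(quantile_uniform_2_5))   # close to the mean 3.5
```

The exponential case shows that the identity remains meaningful when $G$ is unbounded near $u = 1$, provided the mean exists.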
1.3. Topologies on spaces of measures and functions 1.3.1. Weak and vague convergence of measures. We first recall some basic topological facts. The notion of Polish, and of locally compact, topological space plays here an essential role. A topological space E is Polish, iff its topology can be defined by a metric δ, for which it is complete, and separable (meaning, see, e.g., p. 209 in [10], that there exists a countable dense subset of E, or, equivalently, that there exists a countable base of open sets). A locally compact space is a Hausdorff topological space (such that two distinct points have disjoint neighborhoods), for which each point has a compact neighborhood. A locally compact space E has a countable base (see, e.g., p. 209 in [10]), iff there exists a countable base of open sets of E (meaning that each open set of E is the union of some of these subsets). Both structures are closely related. In fact, the topology of an arbitrary separable metric space (E, δ) can be defined by a metric δ ∗ , such that the completion E of
E with respect to $\delta^*$ is a separable compact metric space (see, e.g., p. 72 in [83]). Conversely, a locally compact space is metrizable if and only if it has a countable base, in which case it is Polish (see, e.g., p. 225 in [10]). Because of this, and since, in the applications we consider, $E = R$ or $R^d$, we will let, unless otherwise specified, E denote a metrizable locally compact space, which, as follows from the previous arguments, is necessarily a Polish space. Working on locally compact spaces simplifies matters with respect to measure theory, since we may restrict our interest to Radon measures, which are by definition finite on compact subsets. We recall below the usual definitions and properties of vague (resp. weak) convergence in the space $M^+(E)$ of (non-negative) Radon measures (resp. $M_f^+(E)$ of (non-negative) finite Radon measures) on a metrizable separable locally compact space E. In words, $M^+(E)$ (resp. $M_f^+(E)$) consists of all non-negative measures µ defined on the Borel algebra $B_E$ of E, and such that µ(A) < ∞ for each relatively compact Borel subset $A \in B_E$ of E (resp. µ(E) < ∞). We denote by $M(E)$ (resp. $M_f(E)$) the set of all signed (Radon) measures (resp. totally bounded signed (Radon) measures) on E, and keep in mind that $\mu \in M(E)$ (resp. $\mu \in M_f(E)$) iff there exist two non-negative measures $\mu^*_+ \in M^+(E)$ and $\mu^*_- \in M^+(E)$ (resp. $\mu^*_+ \in M_f^+(E)$ and $\mu^*_- \in M_f^+(E)$), such that
$$\mu = \mu^*_+ - \mu^*_-. \tag{1.35}$$
Given $\mu \in M(E)$, the choice of $\mu^*_+$ and $\mu^*_-$ in (1.35) is not unique. However, the Hahn-Jordan decomposition theorem (see, e.g., p. 173 in [172] and pp. 178–181 in [83]) allows us to choose these two measures by setting $\mu^*_\pm = \mu_\pm$, where $\mu_+$ and $\mu_-$ are defined through a decomposition $E = E_- \cup E_+$ of E into the union of two disjoint measurable subsets $E_-$ and $E_+$, such that, for each relatively compact $A \in B_E$,
$$\mu_+(A) = \mu(A \cap E_+) \quad\text{and}\quad \mu_-(A) = -\mu(A \cap E_-), \qquad E_- \cap E_+ = \emptyset. \tag{1.36}$$
The property (1.36) characterizes the orthogonality of $\mu_+$ and $\mu_-$ (denoted hereafter by $\mu_+ \perp \mu_-$). The measures $\mu_\pm$ in (1.36) are uniquely defined for each $A \in B_E$, via the relations
$$\mu_+(A) = \sup_{B \subseteq A,\ B \in B_E} \mu(B \cap E_+) \quad\text{and}\quad \mu_-(A) = \sup_{B \subseteq A,\ B \in B_E} \{-\mu(B \cap E_-)\}. \tag{1.37}$$
In view of (1.37), we may define the total variation measure $|\mu| \in M_f(E)$ of $\mu \in M_f(E)$, via
$$|\mu|(A) = \mu_+(A) + \mu_-(A) = \sup_{B \subseteq A,\ B \in B_E} \mu(B) \quad\text{for each } A \in B_E. \tag{1.38}$$
In the special case where E = R, we may observe that the condition that $\mu \in M_f(R)$ is equivalent to the condition that the distribution function $H_\mu(x) = \mu((-\infty, x])$ of µ is of bounded variation on R, with $H_\mu(-\infty) := \lim_{x \downarrow -\infty} H_\mu(x) = 0$, and with total variation on R equal to $|dH_\mu|(R) = |\mu|(R)$. In the sequel, the set of functions f of bounded variation on a sub-interval I of R will be denoted by BV(I).
P. Deheuvels
Definition 1.1. A family of measures $\mu_\alpha \in M(E)$, indexed by an oriented net A, is said to be vaguely convergent to $\mu \in M(E)$ (resp. a net-oriented family, $\mu_\alpha \in M_f(E)$, $\alpha \in A$, is said to be weakly convergent to $\mu \in M_f(E)$), when
$$\int f\,d\mu_\alpha \to \int f\,d\mu, \tag{1.39}$$
for all f, continuous on E with compact support (resp. continuous and bounded on E).
Vague convergence of $\mu_\alpha \in M^+(E)$ to $\mu \in M^+(E)$ is equivalent to the property that
$$\mu_\alpha(K) \to \mu(K), \tag{1.40}$$
for each compact subset K of E, with boundary ∂K such that µ(∂K) = 0. It is noteworthy (see, e.g., p. 235 in [10]) that weak convergence of $\mu_\alpha \in M_f^+(E)$ to $\mu \in M_f^+(E)$ is equivalent to vague convergence, with the additional condition that
$$\mu_\alpha(E) \to \mu(E). \tag{1.41}$$
Remark 1.1. The weak convergence of probability measures on a separable metric space can be defined by the Prohorov metric (see, e.g., pp. 393–399 in [82]). When E is locally compact, the vague topology on $M^+(E)$ is metrizable iff E has a countable base, or, equivalently, iff E is Polish (see, e.g., p. 243 in [10]).
Remark 1.2. Neither of the nice characterizations, (1.40) and (1.41), of vague and weak convergence holds when $M^+(E)$ and $M_f^+(E)$ are replaced, respectively, by $M(E)$ and $M_f(E)$. The vague and weak topologies are rather nasty to handle for signed measures, since they are not metrizable on $M(E)$. This (far from obvious) fact can be shown through the following arguments. First, the space $C_b(E)$ of bounded continuous functions on E, endowed with the sup-norm, is a Banach space. Its topological dual $C_b^*(E)$, endowed with the weak* topology, coincides with the set $M_f(E)$ of finite signed measures on E, endowed with the weak topology (see, e.g., p. 16 in [19]). Second, it turns out (see, e.g., p. 68 in [173]) that the topological dual $X^*$ of an infinite-dimensional Banach space X, endowed with the weak* topology, is not metrizable (and such is the case, therefore, in general for $(M_f(E), W)$). On the other hand, it can be shown (see, e.g., p. 68 in [173]), as a consequence of the Banach-Alaoglu theorem, that any weak*-compact subset of $X^*$ is metrizable. This last fact underlies some of the results of the forthcoming §1.3.4.
Because of the mathematical difficulties underlined in Remark 1.2, in the remainder of this section, we will limit ourselves to the study of vague and weak convergence on the spaces of non-negative measures $M^+(E)$ and $M_f^+(E)$. In §§1.3.3–1.3.4, we will discuss again the case of signed measures, when E = [0, 1]. A subset $S \subseteq M^+(E)$ is vaguely relatively compact if and only if (see, e.g., p. 241 in [10])
$$\sup_{\mu \in S} \int f\,d\mu < \infty \quad \Big(\text{resp. } \sup_{\mu \in S} \mu(K) < \infty\Big), \tag{1.42}$$
for each bounded function f on E with compact support (resp. compact subset K of E). For the set P(E) of probability measures on E, (1.41) is automatic, so that vague and weak convergence are equivalent, for a sequence $\mu_n \in P(E)$, when the limit µ is in P(E). However, the set P(E) is not closed in $M^+(E)$ with respect to the vague topology, so that (1.42) is not sufficient to ensure weak relative compactness of a subset S of P(E). A useful characterization of this property is given by Prohorov's theorem (refer to p. 37 in [19]), which shows that weak relative compactness of $S \subseteq P(E)$ is equivalent to the tightness of S, namely, to the condition that, for each ε > 0, there exists a compact subset $K_\varepsilon$ of E such that
$$\inf_{\mu \in S} \mu(K_\varepsilon) \ge 1 - \varepsilon. \tag{1.43}$$
In general (see, e.g., pp. 237–238 in [10]), weak convergence of $\mu_n \in M_f^+(E)$ to $\mu \in M_f^+(E)$ is equivalent to the convergence of $\mu_n(Q)$ to $\mu(Q)$ for each $Q \in B_E$ with boundary ∂Q of null µ-measure. For the probability distributions on E = R, considered in §1.3.2, this last condition is equivalent to the pointwise convergence (1.45) of df's at each continuity point of the limiting df. The same property holds when $E = R^d$ for an arbitrary d ≥ 1.
1.3.2. Weak convergence and the Lévy metric. We consider briefly in the present section the topology, denoted hereafter by W, of weak convergence on the space P(R) of probability measures on R, identified with the space F(R) of the corresponding (right-continuous) distribution functions. We refer to §1.3.1 for basic definitions, to [19] for more details on weak convergence of probability measures, and to §1.3.3 (resp. §1.3.4) below, for a study of this topology on the set of bounded non-negative (resp. signed) measures on [0, 1]. One important feature of the topological space (P(R), W) is that it is conveniently metrized by the Lévy metric (or distance), defined as follows. Consider two, possibly unbounded, nondecreasing functions $H_1$ and $H_2$ on $[A, B] \subseteq R$. Extend the definition of these functions to R, by setting, for i = 1, 2,
$$\widetilde H_i(x) = \begin{cases} H_i(A) & \text{for } x \le A, \\ H_i(B) & \text{for } x \ge B. \end{cases}$$
Given this notation, the Lévy distance between $H_1$ and $H_2$ is defined by
$$\delta_L(H_1, H_2) = \begin{cases} \inf\big\{\theta \ge 0 : \widetilde H_1(x - \theta) - \theta \le \widetilde H_2(x) \le \widetilde H_1(x + \theta) + \theta \ \ \forall x \in R\big\} & \text{whenever such a } \theta \ge 0 \text{ exists}, \\ \infty & \text{otherwise}. \end{cases} \tag{1.44}$$
When $H_1$ and $H_2$ are bounded, (1.44) implies that $\delta_L(H_1, H_2) < \infty$. It is readily checked that $\delta_L$ defines a metric on the set F(R) of all probability df's on R. Denoting, as usual, by $C_F$ the set of continuity points of F, a necessary and sufficient condition for a sequence $\{F_n : n \ge 1\}$ of df's to converge weakly to the limiting df F is that (see, e.g., p. 18 in [19])
$$F_n \xrightarrow{W} F \iff \delta_L(F_n, F) \to 0 \iff \lim_{n \to \infty} F_n(x) = F(x) \quad \forall x \in C_F. \tag{1.45}$$
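The definition (1.44) can be explored numerically. The following is a minimal sketch, not part of the original text: it approximates the Lévy distance between two probability df's by bisection on θ, checking the two-sided inequality of (1.44) only on a finite grid. The helper names `step_df` and `levy_distance`, the grid bounds and the tolerances are illustrative assumptions. The example uses the point masses at 1/n, which converge weakly to the point mass at 0 although the sup-norm distance between the df's stays equal to 1.

```python
import bisect


def step_df(atoms, weights):
    """Right-continuous df of the discrete law putting mass weights[i] at atoms[i]."""
    pairs = sorted(zip(atoms, weights))
    xs = [a for a, _ in pairs]
    cum, total = [], 0.0
    for _, w in pairs:
        total += w
        cum.append(total)

    def F(x):
        k = bisect.bisect_right(xs, x)  # number of atoms <= x
        return cum[k - 1] if k > 0 else 0.0

    return F


def levy_distance(F1, F2, lo=-2.0, hi=2.0, n_grid=4001, tol=1e-6):
    """Grid-based approximation of the Levy distance (1.44) between two probability df's.

    The condition "for all x" is only checked on a grid over [lo, hi] (an assumption
    of this sketch), so the result is accurate up to roughly the grid spacing.
    """
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    slack = 1e-12  # absorbs floating-point rounding in the comparisons

    def admissible(theta):
        return all(
            F1(x - theta) - theta <= F2(x) + slack
            and F2(x) <= F1(x + theta) + theta + slack
            for x in grid
        )

    if admissible(0.0):
        return 0.0
    low, high = 0.0, 1.0  # theta = 1 is always admissible for probability df's
    while high - low > tol:  # admissibility is monotone in theta, so bisection applies
        mid = 0.5 * (low + high)
        if admissible(mid):
            high = mid
        else:
            low = mid
    return high


point_mass_at_0 = step_df([0.0], [1.0])
for n in (2, 10, 100):
    Fn = step_df([1.0 / n], [1.0])
    print(n, levy_distance(Fn, point_mass_at_0))
```

Bisection is legitimate here because, if some θ satisfies the inequalities in (1.44), then so does every larger θ.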
Consider now two quantile functions, $G_1$ and $G_2$, pertaining to the two distribution functions $F_1 \in F$ and $F_2 \in F$. One may extend the definition of $G_1$ and $G_2$ to R, by setting, for i = 1, 2,
$$\widetilde G_i(x) = \begin{cases} G_i(0) & \text{for } x \le 0, \\ G_i(1) & \text{for } x \ge 1. \end{cases}$$
Given this notation, one may define the Lévy distance $\delta_L(G_1, G_2)$ between $G_1$ and $G_2$, by the formal replacement of $\widetilde H_1, \widetilde H_2$ in (1.44) by $\widetilde G_1, \widetilde G_2$. We have the following useful proposition.
Proposition 1.2. When $G_1, G_2$ are the quantile functions of the distribution functions $F_1, F_2$, we have
$$\delta_L(F_1, F_2) = \delta_L(G_1, G_2). \tag{1.46}$$
Proof. Combine (1.32)–(1.33) with (1.45).
An interesting application of (1.45) and (1.46) is provided by the following characterization of weak convergence. Let $G_n = F_n^{\rightarrow}$ denote the quantile function of $F_n$ for n = 1, 2, . . ., and let $G = F^{\rightarrow}$ denote the quantile function of F. Denote by $C_G \subseteq [0, 1]$ the set of all continuity points of G. Then,
$$F_n \xrightarrow{W} F \iff \delta_L(G_n, G) \to 0 \iff \lim_{n \to \infty} G_n(u) = G(u) \quad \forall u \in C_G. \tag{1.47}$$
1.3.3. Weak and uniform topologies for df's of non-negative measures on [0, 1]. We now specialize to the study of (non-negative bounded) Radon measures with support in a bounded closed interval of R. For convenience, and without loss of generality, we will limit ourselves to the case where this interval is E = [0, 1]. As follows from Definition 1.1, the vague and weak topologies coincide on M[0, 1]. Let $M_f^+[0, 1]$ denote the set of all bounded non-negative measures µ on R, with support in [0, 1], namely, fulfilling $\mu(R - [0, 1]) = 0$ and µ(A) ≥ 0 for each $A \in B_R$. We denote by $I_+[0, 1]$ (resp. $I_-[0, 1]$) the set of all right-continuous (resp. left-continuous) distribution functions H(·+) (resp. H(·−)) of measures $\mu \in M_f^+[0, 1]$, of the form
$$H_\mu(x+) = \mu([0, x]) \quad\text{and}\quad H_\mu(x-) = \mu([0, x)) \quad\text{for } x \in R. \tag{1.48}$$
We let $I_\pm[0, 1]$ denote either one of the sets $I_+[0, 1]$ or $I_-[0, 1]$. The correspondence $\mu \leftrightarrow H_\mu(\cdot+) \leftrightarrow H_\mu(\cdot-)$ being one-to-one, we may endow $I_\pm[0, 1]$ with the weak topology W of convergence of measures in $M_f^+[0, 1]$. This topology is metrized by the Lévy metric $\delta_L(\cdot, \cdot)$ defined via (1.44), and we let $(I_\pm[0, 1], W)$ denote the set $I_\pm[0, 1]$ endowed with the weak topology W. Likewise, we let $(I_\pm[0, 1], U)$ (resp. $(M_f^+[0, 1], U)$) denote the set $I_\pm[0, 1]$ (resp. $M_f^+[0, 1]$), endowed with the uniform topology U. The latter topology is induced by the sup-norm distance, defined, for $H_\mu, H_\nu \in I_\pm[0, 1]$, with $\mu, \nu \in M_f^+[0, 1]$, via
$$\delta_U(\mu, \nu) = \|H_\mu - H_\nu\| = \sup_{x \in R} |H_\mu(x) - H_\nu(x)|. \tag{1.49}$$
In view of (1.42), for any (finite) constant C ≥ 0, the set $M_{f,C}^+[0, 1]$ of all $\mu \in M_f^+[0, 1]$ such that $\mu([0, 1]) = H_\mu(1+) \le C$ is weakly compact. The corresponding set of distribution functions is
$$I_\pm^C[0, 1] := \{H \in I_\pm[0, 1] : H(1+) \le C\} = \{H_\mu(\cdot\pm) : \mu([0, 1]) \le C\}. \tag{1.50}$$
The Lebesgue decomposition of $\mu \in M_f^+[0, 1]$ shows that the df $H_\mu(\cdot+) \in I_+[0, 1]$ (resp. $H_\mu(\cdot-) \in I_-[0, 1]$) of µ has a unique decomposition into the sum of two components, as follows. For all t ∈ R,
$$H_\mu(t-) = \int_{[0,t]} h(u)\,du + H_{\mu_S}(t-) \quad \Big(\text{resp. } H_\mu(t+) = \int_{[0,t]} h(u)\,du + H_{\mu_S}(t+)\Big). \tag{1.51}$$
Here, h is a non-negative function, Lebesgue-integrable on [0, 1], defined uniquely up to an almost everywhere [a.e.] equivalence, and $\mu_S := dH_{\mu_S}$ is a singular non-negative measure on [0, 1], with support of null Lebesgue measure, and such that
$$\mu([0, 1]) = H_\mu(1+) = \int_0^1 h(u)\,du + H_{\mu_S}(1+). \tag{1.52}$$
The function h in (1.51)–(1.52) is the Lebesgue derivative of $H = H_\mu$, denoted hereafter by $\dot H = \frac{d}{dx} H$. By definition, $H_\mu$ is absolutely continuous (with respect to Lebesgue measure) iff $H_{\mu_S}(1+) = \mu_S([0, 1]) = 0$. In the sequel, the set of non-decreasing, absolutely continuous distribution functions $H_\mu$ of measures $\mu \in M_f^+[0, 1]$ will be denoted by $ACI_0[0, 1]$. Naturally, we may combine the Lebesgue and Hahn-Jordan decompositions of a totally bounded signed measure $\mu \in M_f[0, 1]$, to write
$$H_\mu(t) = \mu([0, t]) = \int_{[0,t]} h(u)\,du + H_{\mu_S^+}(t) - H_{\mu_S^-}(t), \tag{1.53}$$
where $\mu_S^+ \perp \mu_S^-$ are singular non-negative measures on [0, 1], with supports of null Lebesgue measure, and $h = \dot H_\mu = \frac{d}{dx} H_\mu$ is the Lebesgue derivative of $H_\mu$. The set of all absolutely continuous functions $H_\mu$ on [0, 1], fulfilling $H_{\mu_S^\pm}(1+) = 0$, will be denoted by $AC_0[0, 1]$. A function $H \in AC_0[0, 1]$ (resp. $ACI_0[0, 1]$) is such that H(0) = 0. If we let H(0) be arbitrary, we obtain a general absolutely continuous (resp. absolutely continuous nondecreasing) function on [0, 1]. The corresponding sets of functions will be denoted by AC[0, 1] (resp. ACI[0, 1]).
1.3.4. Weak convergence of signed measures on [0, 1]. We now concentrate on the study of the weak topology W, on the space Mf [0, 1] of totally bounded signed measures µ on R, with support in [0, 1]. Recalling the Hahn-Jordan decomposition
$\mu = \mu_+ - \mu_-$ of µ in (1.36), we see that $\mu \in M_f[0, 1]$ and $|\mu| = \mu_+ + \mu_- \in M_f^+[0, 1]$ fulfill, in view of (1.38),
$$|\mu|([0, 1]) = \mu_+([0, 1]) + \mu_-([0, 1]) < \infty \quad\text{and}\quad \operatorname{supp} \mu := \operatorname{supp} |\mu| \subseteq [0, 1]. \tag{1.54}$$
The right-continuous distribution function [df] $H_\mu(x) = \mu([0, x])$ of $\mu \in M_f[0, 1]$ fulfills $H_\mu(0-) = 0$, and is of bounded variation $|dH_\mu|(R) = |\mu|([0, 1]) < \infty$ on R. Conversely, via the Lebesgue-Stieltjes integral, any right-continuous function H of bounded variation on R, with H(x) = 0 for x < 0, and H(x) = H(1) for x ≥ 1, is the df of some $\mu \in M_f[0, 1]$. Here, and elsewhere, we set
$$BV_0[0, 1] = \{H_\mu : \mu \in M_f[0, 1]\}, \tag{1.55}$$
and, for an arbitrary C ≥ 0,
$$M_{f,C}[0, 1] = \{\mu \in M_f[0, 1] : |\mu|([0, 1]) \le C\} \quad\text{and}\quad BV_{0,C}[0, 1] = \{H_\mu : \mu \in M_{f,C}[0, 1]\}. \tag{1.56}$$
Introduce now the metric (see, e.g., Högnäs [113]), defined, for $\mu, \nu \in M_f[0, 1]$, by
$$\delta_H(\mu, \nu) = \delta_H(H_\mu, H_\nu) = \int_0^1 |H_\mu(t) - H_\nu(t)|\,dt + |H_\mu(1) - H_\nu(1)|. \tag{1.57}$$
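For discrete signed measures on [0, 1], the distance (1.57) can be computed exactly, since the df's are piecewise constant. The sketch below is an illustration under assumptions that are not in the original text: a measure is represented as a pair `(atoms, weights)` with atoms in [0, 1], and the function names are hypothetical. It evaluates the distance between $\mu_n = \delta_{1/n} - \delta_0$ and the null measure; the value 1/n tends to 0 while the total variation of $\mu_n$ stays equal to 2.

```python
def df_value(atoms, weights, t):
    """H_mu(t) = mu([0, t]) for the discrete signed measure sum_i weights[i] * delta_{atoms[i]}."""
    return sum(w for a, w in zip(atoms, weights) if a <= t)


def hognas_distance(mu, nu):
    """Exact value of (1.57) for two discrete signed measures with atoms in [0, 1].

    Each measure is a pair (atoms, weights); this representation is an assumption
    made for the sake of the example.
    """
    (atoms_mu, w_mu), (atoms_nu, w_nu) = mu, nu
    cuts = sorted({0.0, 1.0, *atoms_mu, *atoms_nu})
    integral = 0.0
    for left, right in zip(cuts, cuts[1:]):
        # Both df's are constant on [left, right), so the integrand is constant there.
        diff = df_value(atoms_mu, w_mu, left) - df_value(atoms_nu, w_nu, left)
        integral += abs(diff) * (right - left)
    endpoint = abs(df_value(atoms_mu, w_mu, 1.0) - df_value(atoms_nu, w_nu, 1.0))
    return integral + endpoint


null_measure = ([], [])
for n in (2, 10, 100):
    mu_n = ([1.0 / n, 0.0], [1.0, -1.0])  # delta_{1/n} - delta_0
    print(n, hognas_distance(mu_n, null_measure))
```

In this example $\int f\,d\mu_n = f(1/n) - f(0) \to 0$ for every continuous f, so the measures do converge weakly to the null measure.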
As mentioned in Remark 1.2, the weak topology on M(R) is not metrizable. For this reason, one should discuss the weak convergence of $\mu_\alpha \in M[0, 1]$ to $\mu \in M[0, 1]$ along an oriented net A, rather than along a sequence. We have the following characterization of weak convergence in $M_f[0, 1]$, due to Högnäs [113].
Proposition 1.3. A net-oriented family $\mu_\alpha \in M_f[0, 1]$, with α ∈ A, of totally bounded signed measures on [0, 1] is weakly convergent to $\mu \in M_f[0, 1]$, if and only if:
1°) There exists a constant 0 < C < ∞ such that, ultimately along the net A, $\mu_\alpha \in M_{f,C}[0, 1]$;
2°) We have, along the net A, $\delta_H(\mu_\alpha, \mu) \to 0$.
Proof. See, e.g., Högnäs [113].
A direct consequence of Proposition 1.3 is that, for each C ∈ [0, ∞), the restriction of the weak topology W to $M_{f,C}[0, 1]$ is metrizable. This is captured in the following corollary of Proposition 1.3.
Corollary 1.1. For each 0 < C < ∞, the metric $\delta_H(\cdot, \cdot)$ endows $M_{f,C}[0, 1]$ with the weak topology.
We infer readily from the above arguments the following proposition.
Proposition 1.4. For each 0 < C < ∞, the set $M_{f,C}[0, 1]$ is weakly compact.
Proof. Consider any sequence $\{\mu_n : n \ge 1\} \subseteq M_{f,C}[0, 1]$ such that $\delta_H(\mu_m, \mu_n) \to 0$ as m ∧ n → ∞. For each n ≥ 1, denote by $\mu_n = \mu_n^+ - \mu_n^-$ the Hahn-Jordan decomposition of $\mu_n$. Observe, via (1.38), that, for each n ≥ 1, $\mu_n^\pm \in M_{f,C}^+[0, 1]$.
By (1.42), $M_{f,C}^+[0, 1]$ is a weakly compact metrizable space. Therefore, for each increasing sequence $\{n_j : j \ge 1\}$ of positive integers, there exists an increasing subsequence, along which $\mu_n^\pm \to_W \mu^\pm$, for some $\mu^+$ and $\mu^- \in M_{f,C}^+[0, 1]$. This, in turn, entails that, along this subsequence, $\delta_H(\mu_n^\pm, \mu^\pm) \to 0$, and hence, $\delta_H(\mu_n, \mu) \to 0$, where $\mu := \mu^+ - \mu^-$. Since the so-defined µ is necessarily unique, we infer from this argument that $(M_{f,C}[0, 1], \delta_H)$ is a complete metric space. The same argument shows readily that $M_{f,C}[0, 1]$ is sequentially compact, and therefore compact, with respect to W.
Remark 1.3. As mentioned above in §1.3.1, the fact that $M_{f,C}[0, 1]$ is metrizable with respect to the weak topology W is a general consequence of the Banach-Alaoglu theorem (refer to pp. 67–68 in [173]). The Högnäs metric (1.57) provides here a convenient example of a metric endowing $M_{f,C}[0, 1]$ with W.
1.3.5. Compact sets based upon rate functions. We will consider below a series of examples of compact subsets of $M_f[0, 1]$, endowed, either with the weak topology W, or with the uniform topology U, of the corresponding distribution functions. Throughout, we will identify $\mu \in M_f[0, 1]$ with its distribution function $H_\mu(t) = \mu([0, t])$, so that the sets we will consider will be defined equivalently, in terms of measures, or in terms of functions. In particular, the uniform topology U will be defined on $M_f[0, 1]$, by setting, for $\mu, \nu \in M_f[0, 1]$,
$$d_U(\mu, \nu) = \|H_\mu - H_\nu\| = \sup_{t \in R} |H_\mu(t) - H_\nu(t)|. \tag{1.58}$$
The compact sets we shall consider will be defined through rate functions, which, in the applications discussed later on, will be chosen as Chernoff functions, of the form (2.8). However, in the present section, this restriction is not necessary, and we will work in a more general setup. By rate function is meant a non-negative convex (possibly infinite) function {Ψ(α) : α ∈ R}, with the following properties, holding for some specified constant m ∈ R.
(Ψ.1) 0 ≤ Ψ(α) ≤ ∞;
(Ψ.2) Ψ(m) = 0 and Ψ is convex on R.
We set further, for Ψ fulfilling (Ψ.1–2),
$$-\infty \le t_1 := \lim_{\alpha \to -\infty} \frac{\Psi(\alpha)}{\alpha} \le 0 \le t_0 := \lim_{\alpha \to \infty} \frac{\Psi(\alpha)}{\alpha} \le \infty. \tag{1.59}$$
Example 1.2. The following three examples of rate functions deserve special attention.
1°) Ψ(α) = α², with m = 0, $t_1 = -\infty$ and $t_0 = \infty$;
2°) Ψ(α) = h(α), with m = 1, $t_1 = -\infty$ and $t_0 = \infty$, where h is defined by
$$h(\alpha) = \begin{cases} \alpha \log \alpha - \alpha + 1 & \text{for } \alpha > 0; \\ 1 & \text{for } \alpha = 0; \\ \infty & \text{for } \alpha < 0. \end{cases} \tag{1.60}$$
3°) Ψ(α) = ℓ(α), with m = 1, $t_1 = -\infty$ and $t_0 = 1$, where ℓ is defined by
$$\ell(\alpha) = \begin{cases} \alpha - \log \alpha - 1 & \text{for } \alpha > 0; \\ \infty & \text{for } \alpha \le 0. \end{cases} \tag{1.61}$$
The following theorems will have useful consequences. For each H ∈ AC[0, 1] (the set of absolutely continuous functions on [0, 1], see §1.3.3), we denote by $\dot H(t) = \frac{d}{dt} H(t)$ the Lebesgue derivative of H. By (AC[0, 1], U) is meant the set AC[0, 1] endowed with the uniform topology U. The set $BV_0[0, 1]$ collects all distribution functions $H_\mu$ of totally bounded signed measures $\mu \in M_f[0, 1]$ with support in [0, 1].
Theorem 1.1. Let Ψ, fulfilling (Ψ.1–2), be such that $t_1 = -\infty$ and $t_0 = \infty$. Introduce the set
$$\Delta_{\Psi,c} = \Big\{H \in AC[0, 1] : H(0) = 0 \text{ and } \int_0^1 c\,\Psi\big(c^{-1} \dot H(u)\big)\,du \le 1\Big\}. \tag{1.62}$$
Then $\Delta_{\Psi,c}$ is a compact subset of (AC[0, 1], U).
Proof. We limit ourselves to c = 1, and set $\Delta_\Psi = \Delta_{\Psi,1}$. By the Arzelà-Ascoli theorem (see, e.g., p. 369 in [173]), it is enough to show that $\Delta_\Psi$ is closed, uniformly equicontinuous and bounded. To establish this property, we fix an arbitrary ε > 0. The assumption that Ψ(α)/|α| → ∞ as |α| → ∞ ensures the existence of an $M_\varepsilon$ such that $\Psi(\alpha) \ge \frac{1}{\varepsilon} |\alpha|$ for all $|\alpha| \ge M_\varepsilon$. Now, this ensures that, for each 0 ≤ a < b ≤ 1 and $H \in \Delta_\Psi$,
$$\frac{|H(b) - H(a)|}{b - a} \ge M_\varepsilon \ \Rightarrow\ \frac{|H(b) - H(a)|}{b - a} \le \varepsilon\,\Psi\Big(\frac{H(b) - H(a)}{b - a}\Big) \le \frac{\varepsilon}{b - a} \int_a^b \Psi\big(\dot H(u)\big)\,du \le \frac{\varepsilon}{b - a},$$
where we have used the convexity of Ψ (through Jensen's inequality). Thus, we have $|H(b) - H(a)| \le \varepsilon \vee \{M_\varepsilon (b - a)\}$, whence $|b - a| \le \varepsilon / M_\varepsilon \Rightarrow |H(b) - H(a)| \le \varepsilon$. This establishes the uniform equicontinuity of $\Delta_\Psi$, the boundedness of $\Delta_\Psi$ following trivially, in turn, from the fact that H(0) = 0 for all $H \in \Delta_\Psi$. Finally, the fact that $\Delta_\Psi$ is closed follows from the observation (see, e.g., Lynch and Sethuraman [144]) that the mapping
$$H \in AC[0, 1] \mapsto \int_0^1 \Psi\big(\dot H(u)\big)\,du \tag{1.63}$$
is lower semi-continuous.
Theorem 1.2. Let Ψ, fulfilling (Ψ.1–2), be such that
$$\Psi(\alpha) = \infty \ \text{ for } \alpha < 0, \qquad t_1 = -\infty \quad\text{and}\quad t_0 < \infty.$$
For each c > 0, introduce the set
$$\Delta_{\Psi,c} = \Big\{H \in I_\pm[0, 1] : H(0) = 0,\ \int_0^1 c\,\Psi\big(c^{-1} \dot H(u)\big)\,du + t_0 H_S(1+) \le 1\Big\}. \tag{1.64}$$
Then $\Delta_{\Psi,c}$ is a compact subset of $(I_\pm[0, 1], W)$.
Proof. See, e.g., Lynch and Sethuraman [144].
Example 1.3.
1°) The Strassen set of functions (see, e.g., Strassen [199] and (2.34) in the sequel), defined by
$$S = \Big\{f \in AC[0, 1] : f(0) = 0 \text{ and } \int_0^1 \dot f(u)^2\,du \le 1\Big\}, \tag{1.65}$$
is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology U. This follows from Theorem 1.1, taken with Ψ(u) = u². We note further that $\Psi_X(u) = \frac{1}{2} u^2$ is the Chernoff function, (2.8), of a rv X following a standard normal N(0, 1) law (see, e.g., §2.3.3).
2°) The Finkelstein set of functions (refer to Finkelstein (1971)) defined by
$$F = \Big\{f \in AC[0, 1] : f(0) = f(1) = 0 \text{ and } \int_0^1 \dot f(u)^2\,du \le 1\Big\}, \tag{1.66}$$
is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology. This follows from 1°) and the observation that F = {f ∈ S : f(1) = 0} is closed in (C[0, 1], U).
3°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for c > 0 by
$$\Delta_c = \Big\{f \in AC[0, 1] : f(0) = 0 \text{ and } \int_0^1 c\,h\big(c^{-1} \dot f(u)\big)\,du \le 1\Big\}, \tag{1.67}$$
is a compact subset of the set (C[0, 1], U) of continuous functions on [0, 1], endowed with the uniform topology U. This follows from Theorem 1.1, taken with Ψ(u) = h(u), which is nothing else but the Chernoff function, (2.8), of a Poisson rv with expectation equal to 1 (see, e.g., (2.60)).
4°) The set of functions (refer to Deheuvels and Mason (1990, 1992)) defined for c > 0 by
$$\Gamma_c = \Big\{f \in I_\pm[0, 1] : f(0) = 0 \text{ and } \int_0^1 c\,\ell\big(c^{-1} \dot f(u)\big)\,du + f_S(1+) \le 1\Big\}, \tag{1.68}$$
is a compact subset of the set $(I_\pm[0, 1], W)$ of non-decreasing non-negative functions on [0, 1], endowed with the weak topology. This follows from Theorem 1.2, taken with Ψ(u) = ℓ(u). This last function is the Chernoff function, (2.8), of an exponentially distributed rv with mean 1.
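The rate functionals appearing in Example 1.3 are easy to evaluate for piecewise-linear functions. The sketch below is only an illustration; the names `h`, `ell` and `rate_integral` are hypothetical. It assumes the Poisson Chernoff function $h(\alpha) = \alpha \log \alpha - \alpha + 1$ and takes the exponential Chernoff function to be $\ell(\alpha) = \alpha - 1 - \log \alpha$ for α > 0, which vanishes at m = 1 as (Ψ.2) requires. It computes $\int_0^1 c\,\Psi(c^{-1} \dot f(u))\,du$ for a function f given by its values at equally spaced nodes.

```python
import math


def h(alpha):
    """Chernoff function of a Poisson rv with mean 1 (assumed form of (1.60))."""
    if alpha < 0:
        return math.inf
    if alpha == 0:
        return 1.0
    return alpha * math.log(alpha) - alpha + 1.0


def ell(alpha):
    """Chernoff function of an exponential rv with mean 1, taken as alpha - 1 - log(alpha)."""
    if alpha <= 0:
        return math.inf
    return alpha - 1.0 - math.log(alpha)


def rate_integral(values, Psi, c=1.0):
    """int_0^1 c * Psi(f'(u) / c) du for the piecewise-linear f through
    (k/n, values[k]), k = 0..n, with values[0] = 0 (the derivative is piecewise constant)."""
    n = len(values) - 1
    slopes = [(values[k + 1] - values[k]) * n for k in range(n)]
    return sum(c * Psi(s / c) for s in slopes) / n


identity = [k / 10 for k in range(11)]            # f(u) = u
print(rate_integral(identity, lambda a: a * a))   # Strassen functional of f(u) = u
print(rate_integral(identity, h))                 # h(1) = 0, so f(u) = u lies in Delta_1
print(rate_integral([0.0, 0.5, 2.0], ell))        # slopes 1 and 3 on the two half-intervals
```

With these conventions the identities $\alpha h(1/\alpha) = \ell(\alpha)$ and $\alpha \ell(1/\alpha) = h(\alpha)$ hold for α > 0, which can be checked numerically with the same functions.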
The next theorem extends Theorems 1.1 and 1.2 to the case where $t_0$ and $t_1$ are arbitrary.
Theorem 1.3. Let Ψ, fulfilling (Ψ.1–2), be such that $-\infty \le t_1 \le t_0 \le \infty$. Recall (1.53), and set $BV_0[0, 1] = \{H_\mu : \mu \in M_f[0, 1]\}$. For each c > 0, introduce the set
$$\Delta_{\Psi,c} = \Big\{H_\mu \in BV_0[0, 1] : \int_0^1 c\,\Psi\big(c^{-1} \dot H_\mu(u)\big)\,du + t_0 H_{\mu_S^+}(1+) - t_1 H_{\mu_S^-}(1+) \le 1\Big\}. \tag{1.69}$$
Then $\Delta_{\Psi,c}$ is a compact subset of $(BV_0[0, 1], W)$.
Proof. See, e.g., Deheuvels [51].
In the forthcoming §2.1, we will give some examples of rate functions based upon large deviation theory.
Exercise 1.3.
1°) Let h(·) and ℓ(·) be as in (1.60)–(1.61). Check the relations
$$\alpha\,h(1/\alpha) = \ell(\alpha) \quad\text{and}\quad \alpha\,\ell(1/\alpha) = h(\alpha) \quad\text{for } \alpha > 0.$$
2°) Recall the results of Exercise 1.1. Let F denote the set of df's F(x) of rv's X with support in [0, 1], having density $f(x) = \frac{d}{dx} F(x)$ continuous and positive on [0, 1]. Let likewise G denote the set of qf's of random variables, with density quantile function $g(u) = \frac{d}{du} G(u)$ continuous and positive on [0, 1]. Show that the mapping $F \in F \to G = F^{\rightarrow} \in G$ defines a one-to-one mapping of F onto G, continuous with respect to the weak topology W. Show likewise that the mapping $G \in G \to F = G^{\leftarrow} \in F$ defines a one-to-one mapping of G onto F, continuous with respect to the weak topology W. Check that $\{F^{\rightarrow}\}^{\leftarrow} = F$ and $\{G^{\leftarrow}\}^{\rightarrow} = G$, for F ∈ F and G ∈ G.
3°) Show that the sets $F \cap \Delta_1$ and $G \cap \Gamma_1$ are homeomorphic (for additional details, see, e.g., [65, 66]).
1.4. The quantile transform
1.4.1. The univariate quantile transform.
Theorem 1.4. For any random variable X ∈ R, with distribution function F and quantile function G, the following properties hold.
(i) Whenever F is continuous, the random variable U := F(X) is uniformly distributed on (0, 1);
(ii) When F is arbitrary, and V is a uniform (0, 1) rv, one has the distributional identity
$$X \stackrel{d}{=} G(V). \tag{1.70}$$
Proof. The implications F(X) < F(x) ⇒ X < x and X ≤ x ⇒ F(X) ≤ F(x) entail that
$$P(F(X) < F(x)) \le P(X < x) = F(x-) \le F(x) = P(X \le x) \le P(F(X) \le F(x)). \tag{1.71}$$
When F is continuous, u = F(x) reaches all possible values in (0, 1) when x varies in $(\omega_-, \omega_+)$. Thus, by (1.71), the df's, H(u) := P(F(X) ≤ u) and H(u−) := P(F(X) < u), of F(X) fulfill, for all u ∈ (0, 1), H(u−) ≤ u ≤ H(u). This, in turn, readily implies that H(u) = u for all $u \in C_H$, the set of continuity points of H. Since this, in turn, implies that H(u) = u for all u ∈ [0, 1], we obtain readily Assertion (i) of the theorem.
To establish Assertion (ii), we define a random variable $Y \stackrel{d}{=} N(0, 1)$, following a standard normal law, and independent of X (in fact, any rv with a continuous and positive density on R will do). For each n ≥ 1, we observe that the rv $X_n := X + n^{-1} Y \in R$ has a density $f_n$, continuous and positive on R. This, in turn, implies that the restriction of the df $F_n$ of $X_n$ to R, and the restriction of the qf $G_n$ of $X_n$ to (0, 1), are continuous and increasing. By the just-proven Assertion (i), this implies that, for each n ≥ 1, $U_n := F_n(X_n)$ is uniformly distributed on (0, 1). Moreover, an application of (1.34) in Proposition 1.1 shows that $X_n = G_n(F_n(X_n)) = G_n(U_n)$ for each n ≥ 1. Now, as n → ∞, $X_n = X + n^{-1} Y \to X$, whence $F_n \xrightarrow{W} F$ (recall that almost sure [a.s.] convergence implies convergence in probability, and hence, weak convergence of the underlying probability measures). By (1.47), this implies, in turn, that $G_n(u) \to G(u)$ for each $u \in C_G$, the set of continuity points of G.
We next observe that $X_n \stackrel{d}{=} G_n(U)$, where U is a fixed random variable, uniformly distributed on (0, 1). Since $D_G := [0, 1] - C_G$ is, at most, countable, we have $P(U \in C_G) = 1$. By all this, it follows that, as n → ∞, $X_n \stackrel{d}{=} G_n(U) \to G(U)$ a.s., and hence, $X_n \xrightarrow{W} G(U)$. Since $X_n \to X$ a.s., and hence, $X_n \xrightarrow{W} X$, as n → ∞, it follows that $G(U) \stackrel{d}{=} X$, which was to be proved.
The quantile transform Theorem 1.4 plays an essential role in empirical process theory. By this theorem, it is always possible without loss of generality [w.l.o.g.] to define an arbitrary sequence X1 , X2 , . . . of independent and identically distributed [iid] random variables [rv’s], with df F (·) and qf G(·), on a probability space (Ω, A, P), carrying an iid sequence U1 , U2 , . . . of uniform (0, 1) rv’s, such that Xn = G(Un ) for each n ≥ 1.
(1.72)
Because of (1.72), it is, most of the time, more convenient to work directly on the sequence U1 , U2 , . . ., rather than on X1 , X2 , . . ., given that any result which can be obtained in this case has a counterpart for X1 , X2 , . . ., which often follows by book-keeping arguments based upon the mapping u → G(u). An inconvenience of Theorem 1.4 is that it is specific to dimension 1, being only valid for real-valued random variables. In the next section, we show that this result can be yet extended to Rd -valued random vectors, at the price of some additional technicalities.
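The representation (1.72) is what is commonly called inverse-transform sampling. The sketch below illustrates it under stated assumptions: the helper names are hypothetical, the quantile function of a discrete law is taken as $G(u) = \inf\{x : F(x) \ge u\}$, and the unit exponential quantile function $G(u) = -\log(1 - u)$ is used as a continuous example. Both samples are built from the same uniform sequence.

```python
import bisect
import math
import random


def discrete_quantile(atoms, probs):
    """Quantile function G(u) = inf{x : F(x) >= u} of the law putting mass probs[i] at atoms[i]."""
    pairs = sorted(zip(atoms, probs))
    xs = [a for a, _ in pairs]
    cum, total = [], 0.0
    for _, p in pairs:
        total += p
        cum.append(total)

    def G(u):
        k = bisect.bisect_left(cum, u)   # first index with cum[k] >= u
        return xs[min(k, len(xs) - 1)]   # guard against rounding in the last cumulative sum

    return G


def exponential_quantile(u):
    """Quantile function of the exponential law with mean 1."""
    return -math.log1p(-u)


rng = random.Random(0)                       # fixed seed, for reproducibility only
uniforms = [rng.random() for _ in range(20000)]

G = discrete_quantile([0, 1, 2], [0.2, 0.5, 0.3])
sample = [G(u) for u in uniforms]            # X_n = G(U_n), as in (1.72)
freq_one = sample.count(1) / len(sample)     # should be close to P(X = 1) = 0.5
exp_mean = sum(exponential_quantile(u) for u in uniforms) / len(uniforms)
print(freq_one, exp_mean)
```

Working with the uniforms first and applying G afterwards mirrors the book-keeping argument described above: any statement about $U_1, U_2, \ldots$ is transported to $X_1, X_2, \ldots$ through the map u → G(u).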
1.4.2. The multivariate quantile transform. First, we need to recall some classical (but far from obvious) properties of conditional distributions. Recall (see, e.g., p. 209 in [10]) that a set E is a Polish space, when there is a complete metric defining its topology, with a countable dense subset in E. We will be concerned here with the case of $R^d$, which is Polish with respect to the usual topology. Consider now two E-valued rv's, X and Y, defined on the same probability space (Ω, A, P). Denote by $B_E$ the σ-algebra of all Borel subsets of E (the smallest σ-algebra generated by open (and closed) subsets of E). We assume that X and Y are measurable in the sense that the events $X^{-1}(A) = \{X \in A\}$ and $Y^{-1}(A) = \{Y \in A\}$ belong to A for all $A \in B_E$. We denote by $A_Y$ the sub-σ-algebra of A generated by the sets $Y^{-1}(A)$, for $A \in B_E$. In words, $A_Y$ is the smallest sub-σ-algebra of A such that $Y^{-1}(A) = \{Y \in A\} = \{\omega \in \Omega : Y(\omega) \in A\} \in A_Y$ for all $A \in B_E$. Then, it is possible to define a regular conditional probability P(X ∈ · | Y = y), with the following properties.
(i) For each y ∈ E, $A \in B_E \to P(X \in A \mid Y = y)$ defines a probability measure on $(E, B_E)$;
(ii) For each $A \in B_E$, the mapping ω ∈ Ω → y = Y(ω) ∈ E → P(X ∈ A | Y = y) is $A_Y$-measurable;
(iii) The conditional probability P(X ∈ A | Y = y), of $A \in B_E$ given y ∈ E, is defined uniquely, with respect to y, up to an a.e. equivalence over a set $B \in B_E$ such that P(Y ∈ B) = 0;
(iv) For each $A \in B_E$ and $B \in B_E$,
$$\int_B P(X \in A \mid Y = y)\,dP(Y = y) = P(X \in A, Y \in B). \tag{1.73}$$
Defining the support, denoted by supp(Y), of Y as the set of all y ∈ E such that $P(Y \in V_y) > 0$ for all open neighborhoods $V_y$ of y, we see that the definition of P(X ∈ A | Y = y) is meaningful only for y ∈ supp(Y). When y ∉ supp(Y), the conditional probability P(X ∈ · | Y = y) is, therefore, arbitrary.
Let $X = (X_1, \ldots, X_d) \in R^d$ be a multivariate random vector. The distribution, $P_X = P(X \in \cdot)$, of X on $B_{R^d}$ is then (see, e.g., p. 17 in [19]) uniquely determined by its distribution function
$$F(x) = F(x_1, \ldots, x_d) = P(X \le x) = P(X_1 \le x_1, \ldots, X_d \le x_d) \quad\text{for } x = (x_1, \ldots, x_d) \in R^d. \tag{1.74}$$
We set here $x = (x_1, \ldots, x_d) \le y = (y_1, \ldots, y_d)$, for $x, y \in R^d$, when $x_j \le y_j$ for j = 1, . . . , d. Of course, F(·) in (1.74) fulfills, for x ≤ y,
$$0 \le F(x) \le F(y) \le 1. \tag{1.75}$$
Moreover, setting y ↓ x (resp. y ↑ x) when $y_j \downarrow x_j$ (resp. $y_j \uparrow x_j$) for j = 1, . . . , d, we have
$$\lim_{y \downarrow x} F(y) = F(x) \quad\text{for } x \in R^d; \tag{1.76}$$
$$F(-\infty, \ldots, -\infty) = 0 = \lim_{x \downarrow (-\infty, \ldots, -\infty)} F(x); \qquad F(\infty, \ldots, \infty) = 1 = \lim_{x \uparrow (\infty, \ldots, \infty)} F(x). \tag{1.77}$$
When d = 1, the relations (1.75)–(1.76)–(1.77) reduce to (1.2)–(1.3)–(1.4), and are sufficient to define the distribution function of a probability measure on R. On the other hand, this property does not hold for d ≥ 2. Introduce the following notation. For j = 1, . . . , d and $u = (u_1, \ldots, u_d) \in R^d$, set
$$\Delta_{j;u} F(x) = F(x_1, \ldots, x_{j-1}, x_j + u_j, x_{j+1}, \ldots, x_d) - F(x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_d). \tag{1.78}$$
A necessary and sufficient condition for F to define the df of a multivariate distribution is that, in addition to (1.75), (1.76) and (1.77), we have the $2^d - d + 1$ inequalities
$$\Delta_{j_1;u} \circ \cdots \circ \Delta_{j_k;u} F(x) \ge 0, \tag{1.79}$$
which must hold for all $x \in R^d$, $u \in R^d$, with $u_1 \ge 0, \ldots, u_d \ge 0$, and $1 \le j_1 < \cdots < j_k \le d$.
Theorem 1.5 below gives a multivariate version of Theorem 1.4, by combining results of Rosenblatt [169], Einmahl [92] (see his Theorem 6), and Deheuvels and Mason [68] (see their Lemma 3.1). The following notation will be needed. Let $U = (U_1, \ldots, U_d)$ be a uniform $(0, 1)^d$ random vector. Let $F_1(x_1) = P(X_1 \le x_1)$ denote the distribution function of $X_1$, and let, for j = 2, . . . , d,
$$F_j(x_j \mid x_1, \ldots, x_{j-1}) = P(X_j \le x_j \mid X_1 = x_1, \ldots, X_{j-1} = x_{j-1})$$
denote a regular conditional distribution function of $X_j$ given $X_1 = x_1, \ldots, X_{j-1} = x_{j-1}$. Define a function $H(x) \in [0, 1]^d$ of $x = (x_1, \ldots, x_d) \in R^d$, by setting
$$H(x) = \big(F_1(x_1), F_2(x_2 \mid x_1), \ldots, F_d(x_d \mid x_1, \ldots, x_{d-1})\big). \tag{1.80}$$
For t ∈ (0, 1), set $H_1(t) = \inf\{x : F_1(x) \ge t\}$, and, for t ∈ (0, 1), j = 2, . . . , d, and $x \in R^d$, set
$$H_j(t \mid x_1, \ldots, x_{j-1}) = \inf\{x : F_j(x \mid x_1, \ldots, x_{j-1}) \ge t\}.$$
Also, set for $u = (u_1, \ldots, u_d) \in (0, 1)^d$, $G_1(u_1) = H_1(u_1)$, and, for j = 2, . . . , d,
$$G_j(u_1, \ldots, u_j) = H_j\big(u_j \mid G_1(u_1), \ldots, G_{j-1}(u_1, \ldots, u_{j-1})\big).$$
Define a function $G(u) \in R^d$ of $u = (u_1, \ldots, u_d) \in (0, 1)^d$, by setting
$$G(u) = \big(G_1(u_1), G_2(u_1, u_2), \ldots, G_d(u_1, \ldots, u_d)\big). \tag{1.81}$$
Theorem 1.5. Under the above assumptions and notation, we have
$$X \stackrel{d}{=} G(U). \tag{1.82}$$
(i) If, in addition, H(·) is continuous on $R^d$, then, we have
$$U \stackrel{d}{=} H(X). \tag{1.83}$$
(ii) If H(·) is continuously (resp. twice continuously) differentiable in a neighborhood of $x \in R^d$, then G(·) is continuously (resp. twice continuously) differentiable in a neighborhood of t = H(x), and such that
$$|G(u) - G(t) - DG(t)(u - t)| = o(|u - t|) \quad\text{as } |u - t| \to 0. \tag{1.84}$$
Moreover,
$$DG(t) = \big[DH(x)\big]^{-1}, \tag{1.85}$$
where DG(t) (resp. DH(x)) denotes the differential of G(·) at t = H(x) (resp. of H(·) at x).
(iii) If X has a density f(·), continuous and positive in a neighborhood of $x \in R^d$, then the Jacobian of G(·) at t = H(x) is equal to
$$1 / f\big(G(t)\big). \tag{1.86}$$
Proof. See, e.g., Deheuvels and Mason [68].
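The sequential construction of H and G is explicit for a centered bivariate normal vector with unit variances and correlation ρ, since the conditional law of $X_2$ given $X_1 = x_1$ is then normal with mean $\rho x_1$ and variance $1 - \rho^2$. The sketch below relies on that standard fact; the function names and the finite-difference step are illustrative assumptions, not part of the original text. It implements H and G for d = 2 and evaluates the Jacobian determinant of G numerically, for comparison with $1/f(G(t))$.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal law: provides cdf and inv_cdf


def H(x1, x2, rho):
    """H(x) = (F1(x1), F2(x2 | x1)) for a centered bivariate normal, unit variances, correlation rho."""
    s = math.sqrt(1.0 - rho * rho)
    return STD.cdf(x1), STD.cdf((x2 - rho * x1) / s)


def G(u1, u2, rho):
    """G(u) = (G1(u1), G2(u1, u2)): invert F1, then the conditional df of X2 given X1 = G1(u1)."""
    s = math.sqrt(1.0 - rho * rho)
    x1 = STD.inv_cdf(u1)
    x2 = rho * x1 + s * STD.inv_cdf(u2)
    return x1, x2


def density(x1, x2, rho):
    """Density of the centered bivariate normal law with unit variances and correlation rho."""
    s2 = 1.0 - rho * rho
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / s2
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(s2))


def jacobian_det(u1, u2, rho, eps=1e-6):
    """Central finite-difference approximation of the Jacobian determinant of G at (u1, u2)."""
    a_plus, a_minus = G(u1 + eps, u2, rho), G(u1 - eps, u2, rho)
    b_plus, b_minus = G(u1, u2 + eps, rho), G(u1, u2 - eps, rho)
    d11 = (a_plus[0] - a_minus[0]) / (2 * eps)
    d21 = (a_plus[1] - a_minus[1]) / (2 * eps)
    d12 = (b_plus[0] - b_minus[0]) / (2 * eps)
    d22 = (b_plus[1] - b_minus[1]) / (2 * eps)
    return d11 * d22 - d12 * d21


u, rho = (0.3, 0.8), 0.6
x = G(*u, rho)
print(H(*x, rho))                                    # recovers u up to rounding
print(jacobian_det(*u, rho), 1.0 / density(*x, rho))
```

The triangular structure of G, with $G_1$ depending on $u_1$ only, is what makes the Jacobian a product of one-dimensional derivatives.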
The next theorem gives a useful local version of the quantile transformation in Theorem 1.5.
Theorem 1.6. Assume that X has a density f, continuously (resp. twice continuously) differentiable in a neighborhood of $x \in R^d$, and that f(x) > 0. Then, there exist a $t \in (0, 1)^d$, a neighborhood $V_x$ of x, a neighborhood $V_t$ of t, and a one-to-one mapping $G^*$ of $V_t$ onto $V_x$ such that x = G(t). Moreover, on a suitable probability space, we may define X jointly with a random vector U uniformly distributed on $(0, 1)^d$, and such that
$$X\,\mathbb{1}_{\{X \in V_x\}} = G^*(U)\,\mathbb{1}_{\{U \in V_t\}}. \tag{1.87}$$
The function $G^*(\cdot)$ is of the form
$$G^*(u) = \big(G_1^*(u_1), G_2^*(u_1, u_2), \ldots, G_d^*(u_1, \ldots, u_d)\big), \tag{1.88}$$
where, for j = 1, . . . , d, $G_j^*(u_1, \ldots, u_j)$ is an increasing function of $u_j$. In addition, $G^*(\cdot)$ is continuously (resp. twice continuously) differentiable on $V_t$, and such that
$$\big|G^*(u) - G^*(t) - f(x)^{-1/d}(u - t)\big| = o(|u - t|) \quad\text{as } |u - t| \to 0. \tag{1.89}$$
Proof. See, e.g., Deheuvels and Mason [68].
Theorems 1.5–1.6 allow us to identify, without loss of generality, an arbitrary iid sequence $X_1, X_2, \ldots$ of $R^d$-valued random vectors with $G(U_1), G(U_2), \ldots$, where $U_1, U_2, \ldots$ are iid random vectors, uniformly distributed on $(0, 1)^d$, for some appropriate invertible mapping G of $(0, 1)^d$ onto $R^d$. This is, of course, the best possible result of the kind, since there is no homeomorphism (a continuous mapping with a continuous inverse) of an open subset of $R^d$ onto an open subset of $R^q$ unless d = q (see, e.g., [114]). On the other hand (see Remark 1.4 below), if we let G be non-invertible, the dimension argument breaks down. Unfortunately, Theorems 1.5–1.6 are difficult to apply for d ≥ 2, because of the inherent technicalities due to the conditioning used for the construction of G in (1.82), making it difficult to relate the smoothness properties of F and G. Moreover, the proof of these theorems relies on the choice of a basis in $R^d$, which leads to technical difficulties. Because of this, we will limit ourselves in the sequel to d = 1. The extension of the presently available results from dimension 1 to higher dimensions yields a series of challenging problems.
Remark 1.4.
1°) The multivariate quantile theorem was generalized as follows by Skorohod [188] (see, e.g., Dudley and Philipp [81]). Consider two Polish spaces $X_1$ and $X_2$, and let L denote a probability law on $X_1 \times X_2$, with marginal distribution $L_1$ on $X_1$. Then, there exists a measurable map $\Psi : [0, 1] \times X_1 \to X_2$, such that, if $V \in X_1$ is a rv with distribution $L_1$, and if U is a uniform (0, 1) rv independent of V, then $(V, \Psi(U, V)) \in X_1 \times X_2$ has distribution L.
2°) The following argument (see, e.g., Lemma A1 in Berkes and Philipp [14]) is often useful in the application of strong invariance principles. Consider three Polish spaces $X_1$, $X_2$ and $X_3$. We consider a probability law $L_{1,2}$ on $X_1 \times X_2$, and a probability law $L_{2,3}$ on $X_2 \times X_3$. We assume that the marginals of $L_{1,2}$ and $L_{2,3}$ on $X_2$ coincide. Then, there exists a probability space on which sits a rv $(X_1, X_2, X_3)$, where $(X_1, X_2)$ has law $L_{1,2}$, and $(X_2, X_3)$ has law $L_{2,3}$.
Exercise 1.4. Show that, for any probability law L on a Polish space X, there exists a measurable map Ψ : [0, 1] → X, such that Ψ(U) has probability law L when U is uniformly distributed on (0, 1).
Exercise 1.5. Let d ≥ 2 be an integer. A copula, $C(u) = C(u_1, \ldots, u_d)$, in $R^d$ is, by definition, the df of a random vector $U = (U_1, \ldots, U_d)$ with marginals uniformly distributed on [0, 1]. Given any random vector $X = (X_1, \ldots, X_d)$ with df F(x) = P(X ≤ x) and marginal df's $F_j(x) = P(X_j \le x)$, for j = 1, . . . , d, a copula C(·) is associated with X whenever the identity
$$F(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big) \tag{1.90}$$
holds for all $x = (x_1, \ldots, x_d)$.
1°) Show that, whenever $F_1, \ldots, F_d$ are continuous, the copula C(·) of F(·) is defined uniquely by (1.90).
2°) Show that the set $C_d$ of copulas in $R^d$ is weakly compact. Adapt the proof of Theorem 1.4 to show that there exists a copula C(·) for each df F(·). [This is known as Sklar's theorem (see, e.g., [187, 42, 48]).]
114
P. Deheuvels
2. Fluctuations of partial sums

2.1. Some large deviations theory

Let X denote a random variable with df $F(x) = P(X \le x)$. The corresponding Laplace transform, or moment-generating function [mgf], denoted by $\psi = \psi_X$, is defined, for $s \in \mathbb{R}$, by
$$\psi(s) = E(e^{sX}) = \int_{-\infty}^{\infty} e^{sx}\,dF(x) \in (0, \infty]. \tag{2.1}$$
The mgf $\psi(s)$ of X is always finite for $s = 0$, and such that $\psi(0) = 1$. To characterize the domain of finiteness $I_\psi := \{s \in \mathbb{R} : \psi(s) < \infty\}$ of $\psi$, set
$$-\infty \le t_1 = \inf\{t \in \mathbb{R} : \psi(t) < \infty\} \le 0 \le t_0 = \sup\{t \in \mathbb{R} : \psi(t) < \infty\} \le \infty. \tag{2.2}$$
The following proposition has a more or less straightforward analytic proof.

Proposition 2.1. The mgf ψ(·) of a random variable X is always finite and $C^\infty$ in the neighborhood of each $s \in (t_1, t_0)$. Moreover, when $s \in (t_1, t_0)$, for each $m = 1, 2, \dots$, we have
$$\psi^{(m)}(s) = \frac{d^m}{ds^m}\psi(s) = E\big(X^m e^{sX}\big) = \int_{-\infty}^{\infty} x^m e^{sx}\,dF(x) \in \mathbb{R}. \tag{2.3}$$
When $\psi(s) < \infty$ in a right (resp. left) neighborhood of 0, or equivalently, when $t_0 > 0$ (resp. $t_1 < 0$), then, the finiteness of the mth moment, $E(X^m)$, of X, is equivalent to the existence of a finite limit
$$\lim_{s \downarrow 0} \psi^{(m)}(s) = E(X^m) \quad \Big(\text{resp. } \lim_{s \uparrow 0} \psi^{(m)}(s) = E(X^m)\Big). \tag{2.4}$$
Proof. Omitted.
Remark 2.1. Recall the definition (2.2) of $t_0$ and $t_1$. When $t_0 < \infty$, the value of $\psi_X(t_0)$ may be either finite or infinite, depending upon the distribution of X. The same is true for $\psi_X(t_1)$. The domain of finiteness of $\psi = \psi_X$, denoted by
$$I_\psi = \{s \in \mathbb{R} : \psi(s) < \infty\}, \tag{2.5}$$
is, depending upon the distribution of X, one among the four intervals $(t_1, t_0)$, $(t_1, t_0]$, $[t_1, t_0)$, $[t_1, t_0]$.

Proposition 2.2. The mgf ψ(·) of X is always convex on its domain of finiteness. Moreover, the convexity of ψ is always strict, with the exception of the case where X = 0 almost surely.

Proof. The result is trivial for $t_1 = t_0 = 0$. When $t_1 < t_0$, observe, via (2.3), that, for each $s \in (t_1, t_0)$,
$$\psi''(s) = E\big(X^2 e^{sX}\big) \ge 0, \tag{2.6}$$
with equality only possible when X = 0 a.s.

The mgf ψ of X satisfies a stronger property than convexity, being log-convex (meaning that log ψ is convex), as shown in the next proposition.
Proposition 2.3. The mgf ψ(·) is always log-convex on its domain of finiteness. Moreover, the log-convexity of ψ(·) is always strict with the exception of the case where the distribution of X is degenerate.

Proof. When X is degenerate (see, e.g., (1.10)) with X = c a.s., we have $\psi(s) = e^{cs}$, and $\frac{d^2}{ds^2}\log\psi(s) = 0$. In the other cases, the property is trivial for $t_1 = t_0 = 0$, and we may limit ourselves to giving a proof when $t_1 < t_0$. We then make use of the Schwarz inequality, to observe that, for each $s \in (t_1, t_0)$,
$$\frac{d^2}{ds^2}\log\psi(s) = \frac{E\big(X^2 e^{sX}\big)\,E\big(e^{sX}\big) - \big(E\big(X e^{sX}\big)\big)^2}{\big(E\big(e^{sX}\big)\big)^2} \ge 0, \tag{2.7}$$
with equality only possible when X is a.s. constant.
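The log-convexity statement can be checked numerically on a small discrete law. The sketch below is only an illustration: the helper names (`mgf`, `log_mgf_second_derivative`, `finite_difference`) and the three-point distribution are assumptions made for the example, not part of the text. It evaluates the right-hand side of (2.7) directly and compares it with a central finite difference of $\log\psi$.

```python
import math


def mgf(values, probs, s):
    """psi(s) = E[exp(sX)] for a discrete law given by (values, probs)."""
    return sum(p * math.exp(s * x) for x, p in zip(values, probs))


def log_mgf_second_derivative(values, probs, s):
    """Right-hand side of (2.7): (E[X^2 e^{sX}] E[e^{sX}] - E[X e^{sX}]^2) / E[e^{sX}]^2."""
    e0 = mgf(values, probs, s)
    e1 = sum(p * x * math.exp(s * x) for x, p in zip(values, probs))
    e2 = sum(p * x * x * math.exp(s * x) for x, p in zip(values, probs))
    return (e2 * e0 - e1 ** 2) / e0 ** 2


def finite_difference(values, probs, s, step=1e-4):
    """Central second difference of log psi at s (numerical cross-check only)."""
    def log_psi(t):
        return math.log(mgf(values, probs, t))
    return (log_psi(s + step) - 2.0 * log_psi(s) + log_psi(s - step)) / step ** 2


if __name__ == "__main__":
    values, probs = [0.0, 1.0, 3.0], [0.5, 0.3, 0.2]  # hypothetical example law
    for s in (-1.0, 0.0, 0.5):
        print(s, log_mgf_second_derivative(values, probs, s),
              finite_difference(values, probs, s))
```

For a nondegenerate law the computed quantity is strictly positive, and for a point mass it vanishes, in line with Proposition 2.3.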
The Chernoff function of X (or Legendre transform of the distribution function of X) is defined by
$$\Psi(\alpha) = \Psi_X(\alpha) = \sup_{s:\,\psi(s)<\infty} \{s\alpha - \log\psi(s)\}. \tag{2.8}$$
[...] we have $L_m'(0) = m - \frac{d}{ds}\log\psi(s)\big|_{s=0} = 0$, $L_m'(s) > 0$ for $s < 0$ and $L_m'(s) < 0$ for $s > 0$. This, in turn, implies that
$$\Psi(m) = \sup_{s \in I_\psi} L_m(s) = L_m(0) = 0,$$
which is (2.11). The remainder of the part of the proof of Proposition 2.4 relative to (2.10) is technical and omitted. We refer to Lemma 2.1, p. 266 in Deheuvels [51] for details.

In the remainder of this section, we consider a sequence $X_1, X_2, \dots$ of independent replicæ of the rv X with mgf ψ and Chernoff function Ψ as defined above. We set $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for $n \ge 1$. The following hypotheses will be assumed, at times (recall (1.10)).

(C.1) $t_0 = \sup\{s : \psi(s) < \infty\} > 0$;
(C.2) $t_1 = \inf\{s : \psi(s) < \infty\} < 0$;
(C.3) The distribution of X is nondegenerate, or, equivalently, $\sup_{x \in \mathbb{R}} P(X = x) < 1$;
(C.4) $m = E(X) \in \mathbb{R}$ exists.

Under (C.4) only, Chernoff's theorem (see, e.g., [9, 30]) may be stated as follows.

Theorem 2.1. Under (C.4), we have
$$P(S_n \ge n\alpha) \le \exp(-n\Psi(\alpha)) \quad \text{for each } \alpha \ge m; \tag{2.13}$$
$$P(S_n \le n\alpha) \le \exp(-n\Psi(\alpha)) \quad \text{for each } \alpha \le m. \tag{2.14}$$
Moreover,
$$\lim_{n\to\infty} \frac{1}{n}\log P(S_n \ge n\alpha) = -\Psi(\alpha) \quad \text{for each } \alpha \ge m; \tag{2.15}$$
$$\lim_{n\to\infty} \frac{1}{n}\log P(S_n \le n\alpha) = -\Psi(\alpha) \quad \text{for each } \alpha \le m. \tag{2.16}$$

Proof. The proof of (2.13)–(2.14) relies on the simple, but nevertheless powerful, Markov inequality. Below, we establish this result in a slightly more general form. Consider a random variable Y for which $\psi_Y(t) = E(\exp(tY)) < \infty$ for some $t \ge 0$. The inequality $e^{ty} \le e^{tu}$ for $y \le u$ entails that
$$\psi_Y(t) = E(\exp(tY)) = \int_{-\infty}^{\infty} e^{tu}\,dP(Y \le u) \ge \int_{[y,\infty)} e^{tu}\,dP(Y \le u) \ge e^{ty}\int_{[y,\infty)} dP(Y \le u) = e^{ty}\,P(Y \ge y),$$
or, equivalently,
$$P(Y \ge y) \le \exp\Big(-\sup_{t \ge 0:\,\psi_Y(t)<\infty}\{ty - \log\psi_Y(t)\}\Big). \tag{2.17}$$
A similar argument, which we omit, shows that
$$P(Y \le y) \le \exp\Big(-\sup_{t \le 0:\,\psi_Y(t)<\infty}\{ty - \log\psi_Y(t)\}\Big). \tag{2.18}$$
[...] $\{a_T : T > 0\}$ such that $0 < a_T < T$ for $T > 0$. For each $T > 0$ and $t \ge 0$, consider the function of $s \in [0, 1]$ defined by
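$$\zeta_{T,t}(s) = \frac{W(t + a_T s) - W(t)}{\sqrt{2a_T\{\log(T/a_T) + \log_2 T\}}}, \tag{2.45}$$
where $\log_2 T = \log_+\log_+ T$, and $\log_+ u = \log(u \vee e)$.

Theorem 2.5. Let $a_T \uparrow$ and $T^{-1}a_T \downarrow$. Then, with probability 1,
$$\lim_{T\to\infty}\,\sup_{0 \le t \le T - a_T}\,\inf_{f \in S}\,\|\zeta_{T,t} - f\| = \sup_{f \in S}\,\liminf_{T\to\infty}\,\inf_{0 \le t \le T - a_T}\,\|\zeta_{T,t} - f\| = 0. \tag{2.46}$$
If, in addition to the above assumptions, $(\log(T/a_T))/\log_2 T \to \infty$ as $T \to \infty$, then, with probability 1,
$$\lim_{T\to\infty}\,\sup_{f \in S}\,\inf_{0 \le t \le T - a_T}\,\|\zeta_{T,t} - f\| = 0. \tag{2.47}$$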
Proof. Following the lines of Exercises 2.2 and 2.3 in the sequel, it is not too difficult to infer the above results from Schilder's Theorem 2.3. A more refined proof of Theorem 2.5, valid for possibly non-uniform norms, and based on the isoperimetric inequality (see, e.g., [20]) is to be found in [61].

Exercise 2.1. 1°) Show that the conclusion of Theorem 2.4 may be reformulated into the following two complementary statements.
(i) For each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$ such that $\eta_T \in S^\varepsilon$ for all $T \ge T_\varepsilon$;
(ii) For each $f \in S$ and $\varepsilon > 0$, there exists a.s. a sequence $T_n \to \infty$, such that $\eta_{T_n} \in N_\varepsilon(f)$ for each n.
2°) Assume that $a_T \uparrow$, $T^{-1}a_T \downarrow$ and $(\log(T/a_T))/\log_2 T \to \infty$ as $T \to \infty$. Set $F_T = \{\zeta_{T,t} : 0 \le t \le T\}$. Show that, under these assumptions and notation, the conclusion of Theorem 2.5 may be reformulated as follows. For each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$ such that, for all $T \ge T_\varepsilon$, $F_T \subseteq S^\varepsilon$ and $S \subseteq F_T^\varepsilon$.

Exercise 2.2. 1°) Fix any $\gamma > 0$, and set $T_n = (1 + \gamma)^n$ for $n \ge 0$. Let $\{W(t) : t \ge 0\}$ denote a standard Wiener process, and set $\eta_T(u) = W(Tu)/\sqrt{2T\log_2 T}$ for $0 \le u \le 1$ and $T > 0$. Here and elsewhere, $\log_2 u = \log_+\log_+ u$ and $\log_+ u = \log(u \vee e)$. Let S be as in (2.34). Making use of (2.41) and (2.39), show that, for each $\varepsilon > 0$,
$$\sum_{n=0}^{\infty} P\big(\eta_{T_n} \notin S^\varepsilon\big) < \infty.$$
2°) Infer from the results of (1°) that, with probability 1, for each $0 \le \rho \le 1$,
$$\limsup_{n\to\infty}\|\eta_{T_n}\| \le 1 \quad\text{and}\quad \limsup_{n\to\infty}\,\sup_{0 \le u,v \le 1,\ |u-v| \le \rho} |\eta_{T_n}(u) - \eta_{T_n}(v)| \le \sqrt{\rho}.$$
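3°) Recall the notation (2.35). Show that, for all n sufficiently large, uniformly over $T_{n-1} \le T \le T_n$,
$$1 \le \sqrt{\frac{2T_n\log_2 T_n}{2T\log_2 T}} \le \sqrt{\frac{2T_n\log_2 T_n}{2T_{n-1}\log_2 T_{n-1}}} \le 1 + \gamma.$$
4°) Establish the inequality, for all large n,
$$\sup_{T_{n-1} \le T \le T_n} \|\eta_T - \eta_{T_n}\| \le A_n + B_n,$$
where
$$A_n = (1 + \gamma)\sup_{\frac{1}{1+\gamma} \le \theta \le 1}\,\sup_{0 \le u \le 1} |\eta_{T_n}(u) - \eta_{T_n}(\theta u)| \quad\text{and}\quad B_n = \gamma\,\|\eta_{T_n}\|.$$
5°) Show that, for each $\gamma > 0$, with probability 1,
$$\limsup_{n\to\infty}\,\sup_{T_{n-1} \le T \le T_n} \|\eta_T - \eta_{T_n}\| \le 2\gamma.$$
Show that, for each $\varepsilon > 0$, there exists a.s. a $T_\varepsilon < \infty$, such that, for all $T \ge T_\varepsilon$, $\eta_T \in S^\varepsilon$.
6°) Show that, a.s. for each $0 \le \rho \le 1$,
$$\limsup_{T\to\infty}\|\eta_T\| \le 1 \quad\text{and}\quad \limsup_{T\to\infty}\,\sup_{0 \le u,v \le 1,\ |u-v| \le \rho} |\eta_T(u) - \eta_T(v)| \le \sqrt{\rho}. \tag{2.48}$$

Exercise 2.3. Recall the notation (2.32) and (2.41). Fix any $\gamma > 0$, and set $T_n = (1 + \gamma)^n$ for $n \ge 0$. Let $\eta_T(\cdot)$ be as in Exercise 2.2, and set, for $n \ge 1$,
$$\eta^*_{T_n}(u) = \frac{W(T_{n-1} + (T_n - T_{n-1})u) - W(T_{n-1})}{\sqrt{2T_n\log_2 T_n}}.$$
1°) Making use of (2.48), show that the processes $\{\eta^*_{T_n}(u) : 0 \le u \le 1\}$ are independent, and such that
$$\limsup_{n\to\infty}\|\eta^*_{T_n} - \eta_{T_n}\| \le \frac{2}{\sqrt{1+\gamma}} \quad\text{a.s.}$$
2°) Let f and $\varepsilon \in (0, 1)$ be such that $|f|_H < 1 - \varepsilon$. Show that there exists a $\gamma_\varepsilon$ such that, for all $\gamma \ge \gamma_\varepsilon$,
$$\sum_{n=1}^{\infty} P\big(\eta^*_{T_n} \in N_\varepsilon(f)\big) = \infty.$$
3°) Making use of the fact that S is compact in (C[0, 1], U), show that, for each $f \in S$,
$$\liminf_{T\to\infty}\|\eta_T - f\| = 0 \quad\text{a.s.}$$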
Exercise 2.4. 1°) Check that, whenever W(·) is a standard Wiener process, then the process defined by $W^*(0) = 0$ and $W^*(t) = tW(1/t)$ for $t > 0$ is, again, a Wiener process.
2°) Making use of Corollary 2.1, show that Strassen's Theorem 2.4 implies the law of the iterated logarithm [LIL] for the Wiener process (due to Lévy [137])
$$\limsup_{T\to 0}\ \pm W(T)/\sqrt{2T\log_2(1/T)} = \limsup_{T\to\infty}\ \pm W(T)/\sqrt{2T\log_2 T} = 1 \quad\text{a.s.} \tag{2.49}$$
3°) Let $\{\varphi(t) : t > 0\}$ be a measurable positive function such that $\varphi(t)/\sqrt{2t\log_2(1/t)} \to \infty$ as $t \downarrow 0$. Define a norm on C[0, 1] by setting $|f|_0 = \sup_{0}$ [...] $> 0$, assume that $\lambda^{-1}X(\lambda t)$, for $t \in [0, M]$, takes values in the càdlàg space (D[0, M], S) of right-continuous functions with left-hand limits, endowed with the Skorohod topology S (see, e.g., Ch. 3 in [19]). Set $\psi(s) = E(\exp(sX(1)))$, and, for each $\alpha \in \mathbb{R}$, define the associated Chernoff function (refer to (2.8)), via
$$\Psi(\alpha) = \sup_{s:\,\psi(s)<\infty}\{s\alpha - \log\psi(s)\}. \tag{2.50}$$
[...] $> 0$, by
$$\Delta_{\Psi,c} = \Big\{f \in AC[0, M] : f(0) = 0 \text{ and } \int_0^1 c\,\Psi\big(c^{-1}\dot f(u)\big)\,du \le 1\Big\}, \tag{2.52}$$
is, under (C), a compact subset of (C[0, 1], U). For any function f of D[0, M], define a functional $J_{X,M}(f)$ by setting
$$J_{X,M}(f) = \begin{cases} \int_0^M \Psi\big(\dot f(u)\big)\,du, & \text{if } f \in AC[0, M] \text{ and } f(0) = 0,\\ \infty & \text{otherwise.}\end{cases} \tag{2.53}$$
Define further, for each $A \subseteq D[0, M]$,
$$J_{X,M}(A) = \begin{cases} \inf_{f \in A} J_{X,M}(f) & \text{if } A \ne \emptyset,\\ \infty & \text{if } A = \emptyset.\end{cases} \tag{2.54}$$

Theorem 2.6. Under (C), for each closed subset F of (D[0, M], S), we have
$$\limsup_{\lambda\to\infty}\frac{1}{\lambda}\log P\Big(\frac{X(\lambda\,\cdot)}{\lambda} \in F\Big) \le -J_{X,M}(F), \tag{2.55}$$
and, for each open subset G of (D[0, M], S), we have
$$\liminf_{\lambda\to\infty}\frac{1}{\lambda}\log P\Big(\frac{X(\lambda\,\cdot)}{\lambda} \in G\Big) \ge -J_{X,M}(G). \tag{2.56}$$

Proof. See, e.g., Varadhan [206] and Lynch and Sethuraman [144].
Exercise 2.5. 1°) Let $\{\Pi(t) : t \ge 0\}$ denote a standard Poisson process (see, e.g., §2.3.1). Making use of (2.60), show that this process has stationary and independent increments, with Chernoff function Ψ (refer to (2.50)), given by $\Psi(\alpha) = h(\alpha)$, and fulfilling (C).
2°) Prove that the set of functions defined by
$$\Delta(M) = \Big\{f \in C[0, M] : f(0) = 0,\ \int_0^M h\big(\dot f(u)\big)\,du \le 1\Big\},$$
is a compact subset of (C[0, M], U). Recalling (2.37), show that, for each $\varepsilon > 0$, there exists an $\eta > 0$ such that, for all T sufficiently large,
$$P\Big(\frac{\Pi(\log T\,\cdot)}{\log T} \notin \Delta(T)^\varepsilon\Big) \le \frac{1}{T^{1+\eta}}. \tag{2.57}$$
[Consult Lemma 2.8 in [65] for an example of application of (2.57).]
2.3. Some useful examples of large deviation inequalities

2.3.1. Inequalities for the Poisson process. By a standard Poisson process (see, e.g., §3.4) $\{\Pi(t) : t \ge 0\}$ is meant a homogeneous right-continuous Poisson process on $\mathbb{R}_+$ with $E(\Pi(t)) = t$ for $t \ge 0$. Set $X = \Pi(1)$ to denote a Poisson random variable with expectation $\mu = 1$. We have
$$P(X = k) = \frac{e^{-1}}{k!} \quad\text{for } k = 0, 1, 2, \dots, \tag{2.58}$$
so that
$$\psi_X(t) = E(e^{tX}) = \exp\big(e^t - 1\big) \quad\text{for } t \in \mathbb{R}. \tag{2.59}$$
The corresponding Chernoff function is denoted by $h(\alpha) := \Psi_X(\alpha)$. Some calculations show that
$$h(\alpha) = \begin{cases} \alpha\log\alpha - \alpha + 1 & \text{when } \alpha > 0,\\ 1 & \text{when } \alpha = 0,\\ \infty & \text{when } \alpha < 0.\end{cases} \tag{2.60}$$
A direct application of (2.23)–(2.24), in combination with (2.60), shows that, for an arbitrary $T \ge 0$,
$$P(\Pi(T) \ge T\alpha) \le \exp(-T h(\alpha)) \quad\text{for } \alpha \ge 1, \tag{2.61}$$
$$P(\Pi(T) \le T\alpha) \le \exp(-T h(\alpha)) \quad\text{for } \alpha \le 1. \tag{2.62}$$
Since $\{\Pi(t) - t : t \ge 0\}$ is a martingale, we infer readily from (2.31) that, for an arbitrary $T \ge 0$,
$$P\Big(\sup_{0 \le t \le T}\{\Pi(t) - t\} \ge T\alpha\Big) \le \exp(-T h(1 + \alpha)) \quad\text{for } \alpha \ge 0, \tag{2.63}$$
$$P\Big(\inf_{0 \le t \le T}\{\Pi(t) - t\} \le -T\alpha\Big) \le \exp(-T h(1 - \alpha)) \quad\text{for } \alpha \ge 0. \tag{2.64}$$
The following two propositions are simple consequences of (2.63)–(2.64).

Proposition 2.7. For each $T \ge 0$ and each $\alpha \ge 0$, we have
$$P\Big(\sup_{0 \le t \le T}|\Pi(t) - t| \ge T\alpha\Big) \le 2\exp(-T h(1 + \alpha)). \tag{2.65}$$

Proof. Observe that, for $0 \le \alpha \le 1$,
$$h(1 + \alpha) = (1 + \alpha)\log(1 + \alpha) - \alpha = \sum_{k=2}^{\infty}(-1)^k\frac{\alpha^k}{k(k-1)}, \tag{2.66}$$
$$h(1 - \alpha) = (1 - \alpha)\log(1 - \alpha) + \alpha = \sum_{k=2}^{\infty}\frac{\alpha^k}{k(k-1)}. \tag{2.67}$$
By combining (2.66)–(2.67) with the observation that $h(1 - \alpha) = \infty$ for $\alpha > 1$, we obtain the inequality
$$h(1 + \alpha) \le h(1 - \alpha) \quad\text{for all } \alpha \ge 0. \tag{2.68}$$
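By combining (2.68) with (2.63)–(2.64), we see that, for all $\alpha \ge 0$,
$$P\Big(\sup_{0 \le t \le T}|\Pi(t) - t| \ge T\alpha\Big) \le \exp(-T h(1 + \alpha)) + \exp(-T h(1 - \alpha)) \le 2\exp(-T h(1 + \alpha)),$$
which is (2.65).

Proposition 2.8. For each $T \ge 0$ and each $0 \le \alpha \le 1$, we have
$$P\Big(\sup_{0 \le t \le T}|\Pi(t) - t| \ge T\alpha\Big) \le 2\exp\Big(-\frac{T\alpha^2}{2}\Big(1 - \frac{\alpha}{3}\Big)\Big). \tag{2.69}$$

Proof. By (2.66), we see that, for each $0 \le \alpha \le 1$,
$$h(1 + \alpha) = \sum_{k=2}^{\infty}(-1)^k\frac{\alpha^k}{k(k-1)} \ge \frac{\alpha^2}{2} - \frac{\alpha^3}{6}, \tag{2.70}$$
which, in view of (2.65), suffices for (2.69).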
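The Poisson bounds above lend themselves to a quick numerical sanity check. The following is a minimal sketch, not part of the original development: the function names `h`, `h_series` and `poisson_upper_tail` are chosen here for illustration. It evaluates $h$ from (2.60), compares $h(1+\alpha)$ with a partial sum of the series (2.66), and compares an exact Poisson tail with the bound (2.61).

```python
import math


def h(alpha):
    """Chernoff function of the Poisson(1) law, as in (2.60)."""
    if alpha < 0:
        return math.inf
    if alpha == 0:
        return 1.0
    return alpha * math.log(alpha) - alpha + 1.0


def h_series(alpha, terms=200):
    """Partial sum of the series (2.66) for h(1 + alpha), 0 <= alpha <= 1."""
    return sum((-1) ** k * alpha ** k / (k * (k - 1)) for k in range(2, terms + 2))


def poisson_upper_tail(mean, threshold):
    """P(N >= threshold) for N Poisson with the given mean > 0 (direct summation)."""
    k = max(0, math.ceil(threshold))
    total = 0.0
    while True:
        term = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
        total += term
        # Past the mode the terms decrease; stop once they no longer matter.
        if k > mean and term <= 1e-17 * total:
            return total
        k += 1


if __name__ == "__main__":
    T, alpha = 10.0, 1.5
    print("exact tail :", poisson_upper_tail(T, T * alpha))
    print("bound (2.61):", math.exp(-T * h(alpha)))
```

The exact tail is noticeably smaller than the exponential bound for moderate T, which is expected: (2.61) captures the exponential rate, not the polynomial prefactor.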
2.3.2. Inequalities for binomial distributions. Let X follow a Bernoulli Be(p) distribution, namely,
$$P(X = 1) = 1 - P(X = 0) = p \in (0, 1). \tag{2.71}$$
The moment generating function of X is then given by
$$\psi(t) = E(\exp(tX)) = 1 + p\big(e^t - 1\big) \quad\text{for } t \in \mathbb{R}. \tag{2.72}$$

Proposition 2.9. The Chernoff function of the Bernoulli Be(p) distribution is given by
$$\Psi(\alpha) = \begin{cases} -\log(1 - p) & \text{if } \alpha = 0,\\ \alpha\log\dfrac{\alpha}{p} + (1 - \alpha)\log\dfrac{1 - \alpha}{1 - p} & \text{if } 0 < \alpha < 1,\\ -\log p & \text{if } \alpha = 1,\\ \infty & \text{if } \alpha \notin [0, 1].\end{cases} \tag{2.73}$$

Proof. Straightforward.

The following theorem turns out to be quite useful. Recall the definition (2.60) of h(·).

Theorem 2.7. Let $S_n$ follow a binomial B(n, p) distribution. Then
$$P(S_n \ge n\alpha) \le \exp\Big(-np\,h\Big(\frac{\alpha}{p}\Big)\Big) \quad\text{for } \alpha \ge p, \tag{2.74}$$
whereas
$$P(S_n \le n\alpha) \le \exp\Big(-np\,h\Big(\frac{\alpha}{p}\Big)\Big) \quad\text{for } \alpha \le p. \tag{2.75}$$
Proof. Let $X = X_1, \dots, X_n$ be iid Bernoulli Be(p) distributed rv's. Then, $S_n = X_1 + \cdots + X_n$ follows a binomial B(n, p) distribution, and, via an application of (2.73) and (2.13)–(2.14), we obtain that
$$P(S_n \ge n\alpha) \le \exp(-n\Psi(\alpha)) \quad\text{for } \alpha \ge p,$$
$$P(S_n \le n\alpha) \le \exp(-n\Psi(\alpha)) \quad\text{for } \alpha \le p,$$
where $\Psi = \Psi_X$ is as in (2.73). Therefore, the proof of (2.74)–(2.75) boils down to showing that, for any $0 < \alpha < 1$, we have the inequality
$$p\,h\Big(\frac{\alpha}{p}\Big) = \alpha\log\frac{\alpha}{p} - \alpha + p \le \Psi(\alpha) = \alpha\log\frac{\alpha}{p} + (1 - \alpha)\log\frac{1 - \alpha}{1 - p}. \tag{2.76}$$
Now, (2.76) is a straightforward consequence of the inequality
$$h\Big(\frac{1 - \alpha}{1 - p}\Big) = \frac{1 - \alpha}{1 - p}\log\frac{1 - \alpha}{1 - p} - \frac{1 - \alpha}{1 - p} + 1 \ge 0.$$
This completes the proof of the theorem.
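To see how the three quantities in the proof compare in practice, one may compute the exact binomial tail, the Chernoff bound $\exp(-n\Psi(\alpha))$ and the Poisson-type bound $\exp(-np\,h(\alpha/p))$ side by side. The sketch below is an illustration only; the names `h`, `bernoulli_chernoff` and `binomial_upper_tail`, and the parameter values, are assumptions made for the example.

```python
import math


def h(x):
    """Poisson Chernoff function (2.60)."""
    if x < 0:
        return math.inf
    if x == 0:
        return 1.0
    return x * math.log(x) - x + 1.0


def bernoulli_chernoff(alpha, p):
    """Chernoff function (2.73) of the Bernoulli Be(p) law, 0 < p < 1."""
    if alpha < 0 or alpha > 1:
        return math.inf
    if alpha == 0:
        return -math.log(1.0 - p)
    if alpha == 1:
        return -math.log(p)
    return (alpha * math.log(alpha / p)
            + (1.0 - alpha) * math.log((1.0 - alpha) / (1.0 - p)))


def binomial_upper_tail(n, p, k_min):
    """Exact P(S_n >= k_min) for S_n binomial B(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(k_min, n + 1))


if __name__ == "__main__":
    n, p, alpha = 50, 0.3, 0.5  # hypothetical parameters
    print("exact tail          :", binomial_upper_tail(n, p, math.ceil(n * alpha)))
    print("exp(-n Psi(alpha))  :", math.exp(-n * bernoulli_chernoff(alpha, p)))
    print("exp(-n p h(alpha/p)):", math.exp(-n * p * h(alpha / p)))
```

The bound (2.74) is weaker than the Chernoff bound it is derived from, by (2.76), but it has the advantage of involving only the single function h.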
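Corollary 2.2. Let $\alpha_n(p) \overset{d}{=} n^{-1/2}(S_n - np)$, where, for $0 < p < 1$, $S_n$ follows a binomial B(n, p) distribution. Then, for each $0 \le u \le \sqrt{np}$,
$$P\big(|\alpha_n(p)| \ge u\sqrt{p}\big) = P\big(|S_n - np| \ge u\sqrt{np}\big) \le 2\exp\Big(-np\,h\Big(1 + \frac{u}{\sqrt{np}}\Big)\Big) \le 2\exp\Big(-\frac{u^2}{2}\Big(1 - \frac{u}{3\sqrt{np}}\Big)\Big). \tag{2.77}$$

Proof. Combine (2.74)–(2.75) with (2.70).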
2.3.3. Inequalities for Wiener processes. Let $\{W(t) : t \ge 0\}$ denote a standard Wiener process. Obviously, $\{\pm W(t) : t \in [0, 1]\}$ defines a martingale. Moreover, if we set $X = W(1)$, then $\psi(u) = E(\exp(uX)) = \exp(\tfrac12 u^2)$, and the corresponding Chernoff function (refer to (2.9)) is then equal to $\Psi(\alpha) = \Psi_X(\alpha) = \tfrac12\alpha^2$ for all $\alpha \in \mathbb{R}$.

Proposition 2.10. For each $T > 0$ and each $u \ge 0$, we have
$$P\Big(\sup_{0 \le t \le T} \pm W(t) \ge u\sqrt{T}\Big) \le \exp\big(-\tfrac12 u^2\big), \tag{2.78}$$
$$P\Big(\sup_{0 \le t \le T} |W(t)| \ge u\sqrt{T}\Big) \le 2\exp\big(-\tfrac12 u^2\big). \tag{2.79}$$
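In the “+” case, the left-hand side of (2.78) has the classical closed form $2(1 - \Phi(u))$ given by the reflection principle, so the quality of the exponential bound can be inspected directly. The sketch below is an illustration under that assumption; the function names are chosen here and are not part of the text. It uses the identity $2(1 - \Phi(u)) = \operatorname{erfc}(u/\sqrt{2})$.

```python
import math


def exact_sup_tail(u):
    """P(sup_{0<=t<=T} W(t) >= u sqrt(T)) = 2(1 - Phi(u)) for u >= 0 (reflection principle)."""
    return math.erfc(u / math.sqrt(2.0))


def exponential_bound(u):
    """Right-hand side of (2.78)."""
    return math.exp(-0.5 * u * u)


if __name__ == "__main__":
    for u in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(u, exact_sup_tail(u), exponential_bound(u))
```

The two sides agree at u = 0 and the bound becomes increasingly conservative, by a factor of order u, as u grows.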
Proof. It is well known (see, for example, Csörgő and Révész (1981)), that
$$P\Big(\sup_{0 \le t \le T} \pm W(t) \ge u\sqrt{T}\Big) = 2(1 - \Phi(u)) = \frac{2}{\sqrt{2\pi}}\int_u^{\infty} e^{-t^2/2}\,dt. \tag{2.80}$$
Instead of using (2.80) for proving (2.78), we apply (2.31) to obtain directly that
$$P\Big(\sup_{0 \le t \le T} W(t) \ge u\sqrt{T}\Big) \le \exp\Big(-T\,\Psi\Big(\frac{u}{\sqrt{T}}\Big)\Big) = \exp\big(-\tfrac12 u^2\big).$$
We so obtain (2.78) in the “+” case. The proof of (2.78) in the “−” case is identical, and omitted. Finally, (2.79) is a trivial consequence of (2.78).

2.4. Strong approximations of partial sums of i.i.d. random variables

We limit ourselves here to a few results which are relevant in our discussion, and refer to the monograph [39] of Csörgő and Révész for additional details and references. In particular, we will concentrate on the case where the random variables have exponential moments. Set $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for $n \ge 1$, where $X_1, X_2, \dots$ are iid random replicæ of a rv X. We assume below that the mgf $\psi(t) = E(\exp(tX))$ of X fulfills the condition (C) = (C.1–2), where

(C.1) $t_0 = \sup\{s : \psi(s) < \infty\} > 0$; and
(C.2) $t_1 = \inf\{s : \psi(s) < \infty\} < 0$.

The assumption (C) = (C.1–2) implies the existence of $E(X^k)$ for each $k \in \mathbb{N}$. We will set, in particular, $m = E(X)$ and $\sigma^2 = \mathrm{Var}(X) = E\big((X - m)^2\big)$. We set, further, $S(t) = S_{\lfloor t\rfloor}$ for $t \ge 0$, where $\lfloor t\rfloor \le t < \lfloor t\rfloor + 1$ denotes the integer part of t. The following Theorem 2.8 (1°) states a variant of the celebrated Komlós, Major and Tusnády strong invariance principle ([129, 130]), with a refinement due to Major [146] for (2°). Part (3°) of the theorem is due to Strassen [199] (see also [147, 148]). We set $\log_+ u = \log(u \vee e)$, and $\log_2 u = \log_+(\log_+ u)$, for $u \in \mathbb{R}$.

Theorem 2.8. 1°) Under (C), it is possible to define $\{S(t) : t \ge 0\}$ jointly with a standard Wiener process $\{W(t) : t \ge 0\}$, in such a way that, for suitable constants $a_1 > 0$, $a_2 > 0$ and $a_3 > 0$ (depending upon the distribution of X), we have, for all $T \ge 0$ and $x \in \mathbb{R}$,
$$P\Big(\sup_{0 \le t \le T}|S(t) - mt - \sigma W(t)| \ge a_1\log_+ T + x\Big) \le a_2\exp(-a_3 x). \tag{2.81}$$
2°) If we only assume that $E(|X|^r) < \infty$ for some $r > 2$, then we may write, as $T \to \infty$,
$$|S(T) - mT - \sigma W(T)| = o(T^{1/r}). \tag{2.82}$$
3°) Finally, under the assumption that $E(|X|^2) < \infty$ only, we have, as $T \to \infty$,
$$|S(T) - mT - \sigma W(T)| = o\big(T^{1/2}(\log_2 T)^{1/2}\big). \tag{2.83}$$
Proof. The original papers of Komlós, Major and Tusnády [129, 130] were short on some important technical arguments, useful for the understanding of the proofs. The details were provided (in a multivariate framework) in the monograph of Einmahl [90] (see also [90, 92]). A routine argument (see, e.g., Exercises 2.6 and 2.7) allows one to write $\sup_{0 \le t \le T}(\cdot)$ in (2.81), and to let T vary in $\mathbb{R}$ in (2.82)–(2.83), instead of restricting T to the integers 1, 2, . . . . The conclusion of the theorem also holds when S(t) is taken as a process with independent increments (such as a Poisson process). The necessary details are given in Lemma 2.1, p. 84 in [69].
By combining (2.81) with the Borel-Cantelli lemma (see Exercises 2.6 and 2.7 below), we obtain readily that, under (C),
$$|S(T) - mT - \sigma W(T)| = O(\log T) \quad\text{a.s. as } T \to \infty. \tag{2.84}$$
Let $\{a_T : T > 0\}$ be such that $0 < a_T \le T$ for $T > 0$, $a_T \uparrow$ and $T^{-1}a_T \downarrow$. Set, for each $T > 0$ and $t \ge 0$,
$$\xi_{T,t}(s) = \frac{S(t + sa_T) - S(t) - msa_T}{\sigma\sqrt{2a_T\{\log(T/a_T) + \log_2 T\}}} \quad\text{for } 0 \le s \le 1. \tag{2.85}$$
Recall the definition (1.65) of the Strassen set S.

Theorem 2.9. Under (E.1–2), whenever $a_T/\log T \to \infty$, we have, with probability 1,
$$\lim_{T\to\infty}\,\sup_{0 \le t \le T - a_T}\,\inf_{f \in S}\,\|\xi_{T,t} - f\| = \sup_{f \in S}\,\liminf_{T\to\infty}\,\inf_{0 \le t \le T - a_T}\,\|\xi_{T,t} - f\| = 0. \tag{2.86}$$
If, in addition to the above assumptions, $(\log(T/a_T))/\log_2 T \to \infty$ as $T \to \infty$, then, with probability 1,
$$\lim_{T\to\infty}\,\sup_{f \in S}\,\inf_{0 \le t \le T - a_T}\,\|\xi_{T,t} - f\| = 0. \tag{2.87}$$

Proof. Combine Theorem 2.5 with (2.84) and Theorem 2.8.
A direct consequence of Theorem 2.9 is that, whenever $a_T/\log T \to \infty$, we have, a.s. as $T \to \infty$,
$$\sup_{0 \le t \le T - a_T}\Big|a_T^{-1}\{S(t + a_T) - S(t)\} - m\Big| = O\Big(\sqrt{a_T^{-1}\{\log(T/a_T) + \log_2 T\}}\Big) = o(1). \tag{2.88}$$
The limit law (2.88) breaks down when $a_T/\log T = O(1)$. This range is covered by the Erdős-Rényi new law of large numbers (see, e.g., [97]) discussed in §2.5 below. For further refinements, we refer to Deheuvels and Steinebach [71]. When combined, the above results yield readily Strassen's proof ([199]) of the classical law of the iterated logarithm, due originally to Hartman and Wintner [112].

Corollary 2.3. (The Hartman-Wintner-Strassen LIL) Under the assumption that $\mathrm{Var}(X) = \sigma^2 < \infty$, the sequence $(S_n - nm)/\sqrt{2n\log_2 n}$ is almost surely relatively compact in $\mathbb{R}$ (endowed with the usual topology), with limit set equal to the interval $[-\sigma, \sigma]$. We have
$$\frac{S_n - nm}{\sqrt{2n\log_2 n}} \rightsquigarrow [-\sigma, \sigma] \quad\text{and}\quad \limsup_{n\to\infty}\ \pm\frac{S_n - nm}{\sqrt{2n\log_2 n}} = \sigma \quad\text{a.s.} \tag{2.89}$$

Proof. Combine (2.49) with (2.83).

We will use the convenient notation
$$x_n \rightsquigarrow L, \tag{2.90}$$
to express that the sequence $\{x_n\}$ is relatively compact (in an appropriate topological space) with limit set equal to L.

Remark 2.4. 1°) Some basic knowledge of topology is needed to understand the notion of limit set for a relatively compact sequence. The following facts should be kept in mind. A topological space is Hausdorff if two distinct points have nonintersecting neighborhoods (see, e.g., p. 137 in [84]). A metric space is always Hausdorff. A Hausdorff space is compact, iff it is paracompact, meaning that each open cover has a finite sub-cover (see, e.g., pp. 162 & 222 in [84]). A topological space is sequentially compact (resp. countably compact), whenever each sequence contains an infinite convergent subsequence (resp. each countable open cover has a finite sub-cover). Sequential compactness (as well as compactness) implies countable compactness, but, in general, the converse is not true (refer to p. 20 in [194]). However, the notions of compactness, sequential compactness and countable compactness, are equivalent on a separable metric space (see, e.g., p. 209 in [10]). Thus, a separable metric space is compact iff it is sequentially compact and complete.
2°) One should distinguish the conditions of paracompactness and precompactness on a metric space. A metric space is said to be precompact or totally bounded, if, for each $\varepsilon > 0$, there exists a finite covering with balls of radius less than ε. A metrizable space is compact, iff there exists a metric for which it is both complete and precompact.
3°) A part of a topological space is said to be relatively compact whenever it is included in a compact set. By definition, the limit set of a relatively compact sequence in a metric space is the set of all limits of infinite convergent subsequences.

Exercise 2.6. Let $\{W(t) : t \ge 0\}$ denote a (standard) Wiener process. Making use of (2.79), show that, as $n \to \infty$,
$$\sup_{0 \le t \le 1}|W(t + n) - W(n)| = O\big(\sqrt{\log n}\big) \quad\text{a.s.} \tag{2.91}$$

Exercise 2.7. 1°) Let $X_1, X_2, \dots$ be iid replicæ of a random variable X fulfilling (C). Show that there exist constants $c_1 > 0$ and $c_2 > 0$, such that $P(|X| \ge t) \le c_1\exp(-c_2 t)$ for all $t \ge 0$.
2°) Show that, as $n \to \infty$,
$$|X_n| = O(\log n) \quad\text{a.s.} \tag{2.92}$$
Exercise 2.8. Let $\{x_n : n \ge 1\}$ be a relatively compact sequence in a metric space. Show that the limit set of $\{x_n : n \ge 1\}$ (that is, the set of all limits of convergent subsequences) is compact.

2.5. The Erdős-Rényi theorem

2.5.1. The classical Erdős-Rényi theorem. Set $S_0 = 0$ and $S_n = X_1 + \cdots + X_n$ for $n \ge 1$, where $X_1, X_2, \dots$ are iid random observations. The classical Erdős-Rényi [ER] theorem (see, e.g., [97]), stated in Theorem 2.10 below, plays an important role in the general theory of strong invariance principles for partial sums. In particular, it allows one to give a lower bound to the best rates of strong approximations of partial sums by Gaussian processes (see Exercise 2.10 below and, e.g., §2.4, pp. 97–101 in [40]). The ER theorem describes the fluctuations of order $\lfloor c\log n\rfloor$ of the partial sum process $\{S_i : 0 \le i \le n\}$, in a sense made precise below. The following assumptions will be made, at times, on the mgf $\psi(t) = E(\exp(tX))$ of $X = X_1$.

(C.1) $t_0 = \sup\{s : \psi(s) < \infty\} > 0$;
(C.2) $t_1 = \inf\{s : \psi(s) < \infty\} < 0$;
(C.3) The distribution of X is nondegenerate, or, equivalently, $\sup_x P(X = x) < 1$;
(C.4) $m = E(X) \in \mathbb{R}$ exists.

Denote by (C) the joint assumption (C.1–2), and set
$$\Psi(\alpha) = \sup_{t:\,\psi(t)<\infty}\{t\alpha - \log\psi(t)\},$$
[...] $> 0$ as $\alpha \to \infty$. This, in turn, readily implies the first half of (2.94). The proof of the second half, under (C.2–3–4), is similar and omitted.
For each integer k such that $0 \le k \le n$, consider the maximal and minimal increments
$$I_n^+(k) := \max_{0 \le i \le n-k}\{S_{i+k} - S_i\} \quad\text{and}\quad I_n^-(k) := \min_{0 \le i \le n-k}\{S_{i+k} - S_i\}. \tag{2.95}$$
Fix a $c > 0$, and set $k_n = \lfloor c\log n\rfloor$ for $n \ge 1$, where $\lfloor u\rfloor \le u < \lfloor u\rfloor + 1$ denotes the integer part of u.

Theorem 2.10. (Erdős and Rényi) We have, almost surely, under (C.1–4),
$$\lim_{n\to\infty} k_n^{-1} I_n^+(k_n) = \alpha_c^+, \tag{2.96}$$
and, under (C.2–4),
$$\lim_{n\to\infty} k_n^{-1} I_n^-(k_n) = \alpha_c^-. \tag{2.97}$$
Proof. In view of Remark 2.5, the statements (2.96)–(2.97) hold trivially when X is degenerate. Therefore, we may limit ourselves to establishing these statements in the remaining case, where (C.3) is satisfied.

Step 1. (Outer bounds) Assume (C.1–3–4). We define an integer sequence $\{n_k : k \ge k_0\}$, where
$$k_0 := \inf\big\{k \ge 1 : e^{\frac{k+1}{c}} - e^{\frac{k}{c}} \ge 1\big\},$$
and, for each $k \ge k_0$, by setting $n_k := \sup\{n \ge 1 : \lfloor c\log n\rfloor = k\}$. The inequalities $c\log n - 1 < \lfloor c\log n\rfloor = k_n \le c\log n$ readily imply that, for each $k \ge k_0$,
$$e^{\frac{k}{c}} \le n_k < e^{\frac{k+1}{c}}. \tag{2.98}$$
nk−1 0. c +ε − c
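By all this, we see that, for $k \ge k_0$, $P_k(\alpha_c^+ + \varepsilon) \le e^{\frac{1}{c}}\exp(-k\eta)$, and hence, $\sum_{k \ge k_0} P_k(\alpha_c^+ + \varepsilon) < \infty$. The Borel-Cantelli lemma implies therefore that, for each $\varepsilon > 0$,
$$\limsup_{n\to\infty} k_n^{-1} I_n^+(k_n) \le \alpha_c^+ + \varepsilon \quad\text{a.s.}$$
Since this relation holds for each specified $\varepsilon > 0$, it implies, in turn, that
$$\limsup_{n\to\infty} k_n^{-1} I_n^+(k_n) \le \alpha_c^+ \quad\text{a.s.} \tag{2.100}$$

Step 2. (Inner bounds) Assuming (C.1–3–4), we set $L_n^+(k) = \max\{S_k, S_{2k} - S_k, \dots, S_{kN} - S_{k(N-1)}\}$, where $N = N(n, k) := \lfloor n/k\rfloor - 1$ and $k = k_n = \lfloor c\log n\rfloor$. We have, for each $\alpha > m = E(X)$,
$$Q_{n,k}(\alpha) := P\big(L_n^+(k) < k\alpha\big) = \big(1 - P(S_k \ge k\alpha)\big)^N \le \exp\big(-N\,P(S_k \ge k\alpha)\big).$$
We next make use of Chernoff's theorem (refer to (2.15)) to show that, for any $\varepsilon_1 \in (m, \alpha)$,
$$\lim_{k\to\infty}\frac{1}{k}\log P\big(S_k \ge k(\alpha - \varepsilon_1)\big) = -\Psi(\alpha - \varepsilon_1),$$
so that, ultimately as $k \to \infty$, for each $\varepsilon_2 > 0$,
$$P\big(S_k \ge k(\alpha - \varepsilon_1)\big) \ge \exp\big(-k\{\Psi(\alpha - \varepsilon_1) + \varepsilon_2\}\big). \tag{2.101}$$
Now, letting $\alpha = \alpha_c^+$, with $\varepsilon_1 \in (m, \alpha_c^+)$, we infer from (2.94) and the definition (2.93) of $\alpha_c^+$ that
$$2\theta := \frac{1}{c} - \Psi(\alpha_c^+ - \varepsilon_1) > 0.$$
By all this, if we set $\varepsilon_2 = \theta$ and $k = k_n = \lfloor c\log n\rfloor \le c\log n$ in (2.101), we obtain that the inequality
$$P\big(S_{k_n} \ge k_n(\alpha_c^+ - \varepsilon_1)\big) \ge \exp\Big(-k_n\Big(\frac{1}{c} - \theta\Big)\Big) \ge \frac{n^{\theta c}}{n},$$
holds for all large n. This, in turn, readily entails that, for all n sufficiently large,
$$Q_{n,k_n}(\alpha_c^+ - \varepsilon_1) \le \exp\Big(-N(n, k_n)\,\frac{n^{\theta c}}{n}\Big) \le \exp\big(-n^{\theta c/2}\big),$$
which is summable in n. Keeping in mind that $L_n^+(k) \le I_n^+(k)$, an application of the Borel-Cantelli lemma shows that, almost surely,
$$\liminf_{n\to\infty} k_n^{-1} I_n^+(k_n) \ge \liminf_{n\to\infty} k_n^{-1} L_n^+(k_n) \ge \alpha_c^+ - \varepsilon_1.$$
Since this relation holds for an arbitrary $\varepsilon_1 > 0$, we obtain therefore that
$$\liminf_{n\to\infty} k_n^{-1} I_n^+(k_n) \ge \alpha_c^+ \quad\text{a.s.} \tag{2.102}$$
By combining (2.100) with (2.102), we conclude (2.96). The proof of (2.97) is similar, and omitted.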
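The increments in (2.95) are easy to compute from the partial sums, which makes Theorem 2.10 convenient to explore by simulation. The sketch below is only illustrative. The function names are hypothetical, and `gaussian_limit` rests on an assumption: that, for standard normal summands with $\Psi(\alpha) = \tfrac12\alpha^2$, the constant $\alpha_c^+$ is the positive solution of $\Psi(\alpha) = 1/c$, namely $\sqrt{2/c}$. This is consistent with the way $\alpha_c^+$ is used in the proof above, but the definition (2.93) itself should be consulted.

```python
import math
import random
from itertools import accumulate


def max_increment(steps, k):
    """I_n^+(k) = max over 0 <= i <= n-k of S_{i+k} - S_i, where S is the partial-sum sequence."""
    partial = [0.0] + list(accumulate(steps))
    n = len(steps)
    return max(partial[i + k] - partial[i] for i in range(n - k + 1))


def min_increment(steps, k):
    """I_n^-(k) = min over 0 <= i <= n-k of S_{i+k} - S_i."""
    partial = [0.0] + list(accumulate(steps))
    n = len(steps)
    return min(partial[i + k] - partial[i] for i in range(n - k + 1))


def window_length(n, c):
    """k_n = floor(c log n)."""
    return math.floor(c * math.log(n))


def gaussian_limit(c):
    """Assumed value of alpha_c^+ for N(0,1) summands: the solution of alpha^2 / 2 = 1/c."""
    return math.sqrt(2.0 / c)


if __name__ == "__main__":
    rng = random.Random(2024)          # fixed seed, for reproducibility only
    n, c = 200_000, 2.0
    steps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    k = window_length(n, c)
    print("k_n =", k)
    print("I_n^+(k_n)/k_n =", max_increment(steps, k) / k)
    print("assumed limit  =", gaussian_limit(c))
```

Convergence in the Erdős-Rényi law is slow (the window length grows only logarithmically), so for moderate n the simulated ratio should be expected to sit near, not at, the limiting constant.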
Exercise 2.9. Assume that (C.1–4) is satisfied, and that $m = E(X) = 0$. Set
$$J_n^+(k) = \max_{0 \le i \le j \le i+k \le n}\{S_j - S_i\} \quad\text{and}\quad J_n^-(k) = \min_{0 \le i \le j \le i+k \le n}\{S_j - S_i\}.$$
1°) Show that the conclusion (2.96)–(2.97) of Theorem 2.10 holds, with the formal replacements of $I_n^+(k)$ and $I_n^-(k)$ by $J_n^+(k)$ and $J_n^-(k)$, respectively.
2°) Show that these results fail to hold when $m = E(X) \ne 0$ [for additional details, see, e.g., [58, 59]].

Exercise 2.10. Let $\{W(t) : t \ge 0\}$ be a Wiener process. Set
$$I_n^+(k) = \max_{0 \le i \le n-k}\{W(i + k) - W(i)\} \quad\text{and}\quad I_n^-(k) = \min_{0 \le i \le n-k}\{W(i + k) - W(i)\}.$$
1°) Letting $k_n = \lfloor c\log n\rfloor$, for $c > 0$, find the a.s. limits
$$\lim_{n\to\infty} k_n^{-1} I_n^+(k_n) \quad\text{and}\quad \lim_{n\to\infty} k_n^{-1} I_n^-(k_n).$$
2°) Show that if, on a suitable probability space, $|S_n - W(n)| = o(\log n)$ a.s., as $n \to \infty$, then the Chernoff function of X is given by $\Psi(\alpha) = \tfrac12\alpha^2$ for $\alpha \in \mathbb{R}$. [As follows from a result of Bartfai [12] (see, e.g., p. 101 in [40]), this implies that $X \overset{d}{=} N(0, 1)$, so that the relation $|S_n - W(n)| = o(\log n)$ is impossible, unless $S_n$ coincides initially with $W^*(n)$, for some Wiener process $W^*(\cdot)$.]

2.5.2. Functional Erdős-Rényi laws. The original Erdős-Rényi law [97] has given rise to a number of refinements and variants in the literature (see, e.g., [58, 59] and the references therein). Below, we give a functional version of this theorem due to Deheuvels [51]. We will work in the framework and assumptions of §2.5.1, assuming throughout that (C.3–4) hold. We introduce the following additional notation. We set, for each $t \ge 0$, $S(t) = S_{\lfloor t\rfloor}$ (which is right-continuous). For an arbitrary function $0 < a_T \le T$ of $T \ge 0$, we set, for $x \ge 0$,
$$\upsilon_{x,T}(s) = a_T^{-1}\{S(x + sa_T) - S(x)\} \quad\text{for } 0 \le s \le 1, \tag{2.103}$$
and set, for each $T > 0$,
$$F_T = \{\upsilon_{x,T} : 0 \le x \le T - a_T\}. \tag{2.104}$$
It is noteworthy that, for each $T > 0$, $F_T \subset BV_0[0, 1]$, where $BV_0[0, 1]$ is as in (1.55). Letting Ψ and $t_0$, $t_1$ be as in (2.8) and (2.10), we set, for each $c > 0$,
$$\Delta_{\Psi,c} = \Big\{H \in BV_0[0, 1] : \int_0^1 c\,\Psi\big(c^{-1}\dot H(u)\big)\,du + t_0 H_{S-}(1+) - t_1 H_{S+}(1+) \le 1\Big\}. \tag{2.105}$$
Note that, when $t_0 = \infty$ and $t_1 = -\infty$, the set defined by (2.105) reduces to
$$\Delta_{\Psi,c} = \Big\{H \in AC[0, 1] : H(0) = 0 \text{ and } \int_0^1 c\,\Psi\big(c^{-1}\dot H(u)\big)\,du \le 1\Big\}. \tag{2.106}$$
In view of Theorems 1.1, 1.2 and 1.3, we see that:
– When $t_0$, $t_1$ are arbitrary, $\Delta_{\Psi,c} \subset BV_0[0, 1]$ is compact with respect to the topology W of weak convergence of distribution functions of signed measures in $M_f[0, 1]$, defined via the Högnäs metric (1.57). The set $\Delta_{\Psi,c}$ may be identified, through the mapping $\mu \leftrightarrow H_\mu$, to a weakly compact metrizable (via $d_H$) subset of the (non-metrizable) topological space $(M_f[0, 1], W)$;
– When $t_1 = -\infty$ and $t_0$ is arbitrary, $\Delta_{\Psi,c} \subset I_+[0, 1]$ is compact with respect to the topology of weak convergence of distribution functions, defined by the Lévy distance $\Delta_L(\cdot, \cdot)$. This set of functions may be identified, through the mapping $\mu \leftrightarrow H_\mu$, to a weakly compact subset of the metrizable (via $\Delta_L$) topological space $(M_f^+[0, 1], W)$;
– When $t_0 = \infty$ and $t_1 = -\infty$, $\Delta_{\Psi,c} \subset AC_0[0, 1]$ is compact with respect to the uniform topology.

Recall the definition (1.56) of $BV_{0,C}[0, 1]$. We have the following result (see, e.g., [51]).

Theorem 2.11. Fix any $c > 0$, and assume that $a_T/\log T \to c$ as $T \to \infty$. Then, under either (C.1–3–4) or (C.2–3–4), there exists a $0 < C < \infty$ such that $\Delta_{\Psi,c} \subseteq BV_{0,C}[0, 1]$, and, a.s. ultimately as $T \to \infty$, $F_T \subseteq BV_{0,C}[0, 1]$. Moreover, if we set, for each $\varepsilon > 0$ and $A \subseteq BV_{0,C}[0, 1]$,
$$A^{[\varepsilon];C} = \{f \in BV_{0,C}[0, 1] : d_H(f, g) < \varepsilon \text{ for some } g \in A\}, \tag{2.107}$$
then, there exists a.s. a $T_\varepsilon < \infty$ such that, for all $T \ge T_\varepsilon$, we have
$$F_T \subseteq \Delta_{\Psi,c}^{[\varepsilon];C} \quad\text{and}\quad \Delta_{\Psi,c} \subseteq F_T^{[\varepsilon];C}. \tag{2.108}$$

Exercise 2.11. Show that, under (C), (2.108) may be replaced by
$$F_T \subseteq \Delta_{\Psi,c}^{\varepsilon} \quad\text{and}\quad \Delta_{\Psi,c} \subseteq F_T^{\varepsilon}. \tag{2.109}$$
3. Empirical Processes

3.1. Uniform empirical distribution and quantile functions

Let $U_1, U_2, \dots$ be a sequence of independent and identically distributed [iid] uniform (0, 1) random variables. For each $n \ge 1$, the empirical distribution based upon $U_1, \dots, U_n$ is defined by
$$P_n(B) = \frac{1}{n}\sum_{i=1}^{n} 1\!\mathrm{I}_{\{U_i \in B\}}, \tag{3.1}$$
for each Borel set $B \in \mathcal{B}_{\mathbb{R}}$. Denote by #E the cardinality of E. The corresponding uniform empirical distribution function will be denoted hereafter by
$$U_n(t) = P_n((-\infty, t]) = \frac{1}{n}\sum_{i=1}^{n} 1\!\mathrm{I}_{\{U_i \le t\}} = \frac{1}{n}\#\{U_i \le t : 1 \le i \le n\} \quad\text{for } t \in \mathbb{R}. \tag{3.2}$$
Remark 3.1. In general, each sequence $X_1, \dots, X_n$ of random variables, with values in a locally compact separable metric space X, defines an empirical measure by $\mu_n = n^{-1}\sum_{i=1}^{n}\delta_{X_i}$, where $\delta_x$ stands for the Dirac measure at x. In this framework, $\mu_n$ is a Radon measure on X. Among other problems of interest, one may consider convergence properties, as $n \to \infty$, of the restriction of $\mu_n$ to a class of subsets of $\mathcal{B}_X$, or to a class of Borel-measurable functions on X. One of the main issues to be considered is then the central limit theorem, namely, the asymptotic convergence of such empirical processes, indexed by sets or functions, to a limiting Gaussian process. Also, it is of interest to characterize the cases where this empirical process fulfills a Glivenko-Cantelli-type law of large numbers, or a functional law of the iterated logarithm. We refer to [6, 82, 99, 164, 204] for advanced courses on the subject. In what follows, we shall limit ourselves to the case of $X = \mathbb{R}$, which is of interest in and of itself. In particular, the notion of empirical quantile process (see, e.g., §3.2 below) gives rise to developments which are very much specific to $X = \mathbb{R}$, and difficult to extend properly in higher dimensions.

Remark 3.2. Let $X_1, \dots, X_n$ be iid replicæ of a random variable X with continuous df $F(x) = P(X \le x)$ and quantile function $G(t) = \inf\{x : F(x) \ge t\}$. By Theorem 1.4, the rv's $U_1 = F(X_1), \dots, U_n = F(X_n)$ are iid uniform (0, 1) rv's. The empirical measure $\mu_n$ based upon $X_1, \dots, X_n$ has a df given by $F_n(x) = U_n(F(x))$, for $x \in \mathbb{R}$, and a qf given by $G_n(t) = G(V_n(t))$, for $t \in (0, 1)$. For this reason, the study of the limiting behavior of $F_n$ and $G_n$ can be reduced to the study of the limiting behavior of $U_n$ and $V_n$. In the sequel, we will deal mostly with uniform empirical and quantile processes. We refer to [40, 38, 85, 183], and the references therein for details on the general case.
For each $n \ge 0$, set $U_{0,n} = 0$ and $U_{n+1,n} = 1$, and, for each $n \ge 1$, denote by $U_{1,n} \le \cdots \le U_{n,n}$ the order statistics of $U_1, \ldots, U_n$, obtained by sorting these random variables increasingly.

Proposition 3.1. For each $n \ge 0$, we have, with probability 1,

$$0 = U_{0,n} < U_{1,n} < \cdots < U_{n,n} < U_{n+1,n} = 1. \tag{3.3}$$

Proof. Obviously, for each $i \ge 1$ and $j \ge 1$ with $i \ne j$, we have $P(U_i = 0 \text{ or } 1) = 0$ and $P(U_i = U_j) = 0$, so that $U_1, \ldots, U_n$ are a.s. distinct and in $(0, 1)$. The conclusion (3.3) is then straightforward.

Unless otherwise specified, we will assume, without loss of generality, that (3.3) holds on the probability space $(\Omega, \mathcal{A}, P)$ on which sit $U_1, U_2, \ldots$. In agreement with (3.2), this allows us to write

$$U_n(t) = \begin{cases} 0 & \text{for } t < U_{1,n}, \\ \dfrac{i}{n} & \text{for } U_{i,n} \le t < U_{i+1,n},\ i = 1, \ldots, n-1, \\ 1 & \text{for } t \ge U_{n,n}. \end{cases} \tag{3.4}$$
P. Deheuvels
The uniform empirical quantile function is defined by

$$V_n(t) = \begin{cases} U_{1,n} & \text{for } t = 0, \\ U_{i,n} & \text{for } \dfrac{i-1}{n} < t \le \dfrac{i}{n},\ i = 1, \ldots, n. \end{cases} \tag{3.5}$$

Let $I(t) = t$ denote identity, and, for each bounded function $f$ on $[0, 1]$, set

$$\|f\| = \sup_{0 \le t \le 1} |f(t)|. \tag{3.6}$$

Proposition 3.2. We have, with probability 1,

$$\|U_n(V_n) - I\| = \frac{1}{n}. \tag{3.7}$$
Proof. For each $i = 1, \ldots, n$ and $\frac{i-1}{n} < t \le \frac{i}{n}$, we have $U_n(V_n(t)) = U_n(U_{i,n}) = \frac{i}{n}$, so that $\sup_{\frac{i-1}{n} < \,\cdot}$ […] $> t) = e^{-t}$ for $t \ge 0$. (3.9)

Set $T_0 = 0$ and $T_m = \omega_1 + \cdots + \omega_m$ for $m \ge 1$. Then, for each $n \ge 0$, $T_{n+1}$ and the random vector $\{T_i/T_{n+1} : i = 0, \ldots, n+1\}$ are independent. Moreover, one has the distributional identity

$$\{U_{i,n} : i = 0, \ldots, n+1\} \stackrel{d}{=} \{T_i/T_{n+1} : i = 0, \ldots, n+1\}. \tag{3.10}$$
Proof. The joint density of $\omega_1, \ldots, \omega_{n+1}$ is $f(u_1, \ldots, u_{n+1}) = e^{-s}$, with $s = u_1 + \cdots + u_{n+1}$. The linear change of variables $(u_1, \ldots, u_{n+1}) \to (u_1, \ldots, u_n, s)$ has Jacobian equal to 1, so that the joint density of $(u_1, \ldots, u_n, s)$ is equal to $f(u_1, \ldots, u_n, s) = e^{-s}$. Now, the change of variables $(u_1, \ldots, u_n, s) \to (u_1/s, \ldots, u_n/s, s)$ has Jacobian equal to $s^{-n}$, so that the joint density of $(v_1 := u_1/s, \ldots, v_n := u_n/s, s)$ equals

$$f(v_1, \ldots, v_n, s) = n! \times \frac{s^n e^{-s}}{n!}.$$

We obtain therefore that $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ and $T_{n+1}$ are independent. Moreover,
(i) $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ is uniformly distributed (with constant density, equal to $n!$) on the set $\{(v_1, \ldots, v_n) : 0 \le v_1 \le \cdots \le v_n \le 1\}$;
(ii) $T_{n+1}$ follows a $\Gamma(n+1)$ distribution on $\mathbb{R}_+$.
To conclude (3.10) we observe that the joint distributions of the random vectors $(T_1/T_{n+1}, \ldots, T_n/T_{n+1})$ and $(U_{1,n}, \ldots, U_{n,n})$ coincide.
Theorem 3.2. For each $n \ge 1$, the random variables

$$\left(\frac{U_{i,n}}{U_{i+1,n}}\right)^{i}, \quad i = 1, \ldots, n, \tag{3.11}$$

are independent and uniformly distributed on $(0, 1)$.

Proof. We change variables by setting $Y_i = -\log U_i$, for $i = 1, 2, \ldots$, so that $Y_1, Y_2, \ldots$ defines an iid sequence of exponentially distributed random variables. For each $n \ge 1$, set $Y_{0,n} = 0$, and denote by $0 < Y_{1,n} < \cdots < Y_{n,n}$ the order statistics of $Y_1, \ldots, Y_n$. Keeping in mind that

$$Y_{i,n} = -\log U_{n-i+1,n} \quad \text{for } i = 0, \ldots, n,$$

we see that (3.11) is equivalent to the property that

$$(n - i + 1)\{Y_{i,n} - Y_{i-1,n}\}, \quad i = 1, \ldots, n, \tag{3.12}$$

are independent and exponentially distributed. To establish this property, we recall from the proof of Theorem 3.1 that the joint density of $(U_{1,n}, \ldots, U_{n,n})$ is constant and equal to $n!$ on the set $\{(v_1, \ldots, v_n) : 0 \le v_1 \le \cdots \le v_n \le 1\}$. The change of variables $(v_1, \ldots, v_n) \to (y_1 := -\log v_n, \ldots, y_n := -\log v_1)$ has Jacobian $1/(v_1 \cdots v_n) = \exp(y_1 + \cdots + y_n)$, so that the joint density of $(Y_{1,n}, \ldots, Y_{n,n})$ equals $f(y_1, \ldots, y_n) = n!\, e^{-(y_1 + \cdots + y_n)}$, on the domain $\{(y_1, \ldots, y_n) : 0 \le y_1 \le \cdots \le y_n\}$. We now make the change of variables $(y_1, \ldots, y_n) \to (t_1 := n y_1,\ t_2 := (n-1)(y_2 - y_1),\ \ldots,\ t_n := y_n - y_{n-1})$. Since the corresponding Jacobian equals $n!$, and since $t_1 + \cdots + t_n = y_1 + \cdots + y_n$, we conclude that the joint density of $(nY_{1,n}, (n-1)(Y_{2,n} - Y_{1,n}), \ldots, Y_{n,n} - Y_{n-1,n})$ is equal to

$$f(t_1, \ldots, t_n) = \prod_{i=1}^{n} e^{-t_i} \quad \text{for } t_i \ge 0,\ i = 1, \ldots, n.$$

This completes the proof of (3.12), and hence, of (3.11).
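The two representations above lend themselves to a quick numerical sanity check. The sketch below is only an illustration under stated assumptions: the function names, the sample sizes and the seed are ours, and only the Python standard library is used. It draws the uniform order statistics once as $T_i/T_{n+1}$, following (3.10), and once by sorting, and compares the Monte Carlo mean of $U_{k,n}$ with the value $k/(n+1)$ (the mean of the corresponding Beta law). It also averages the power ratios of (3.11), which should be close to $1/2$ if these are uniform on $(0,1)$.

```python
import random

def order_stats_from_exponentials(n, rng):
    """Draw (U_{1,n}, ..., U_{n,n}) as T_i / T_{n+1}, following (3.10)."""
    partial_sums = []
    total = 0.0
    for _ in range(n + 1):
        total += rng.expovariate(1.0)  # omega_i: unit-mean exponential
        partial_sums.append(total)
    # total now equals T_{n+1}; drop the last ratio, which is identically 1
    return [t / total for t in partial_sums[:-1]]

def order_stats_from_sorting(n, rng):
    """Draw (U_{1,n}, ..., U_{n,n}) by sorting n uniform (0, 1) variables."""
    return sorted(rng.random() for _ in range(n))

def power_ratios(order_stats):
    """Return (U_{i,n} / U_{i+1,n})^i for i = 1..n, with U_{n+1,n} = 1, as in (3.11)."""
    extended = list(order_stats) + [1.0]
    return [(extended[i - 1] / extended[i]) ** i
            for i in range(1, len(order_stats) + 1)]

def monte_carlo_means(n=5, k=2, reps=20000, seed=12345):
    """Monte Carlo means of U_{k,n} (two constructions) and of the k-th power ratio."""
    rng = random.Random(seed)
    mean_exp = sum(order_stats_from_exponentials(n, rng)[k - 1]
                   for _ in range(reps)) / reps
    mean_sort = sum(order_stats_from_sorting(n, rng)[k - 1]
                    for _ in range(reps)) / reps
    mean_ratio = sum(power_ratios(order_stats_from_sorting(n, rng))[k - 1]
                     for _ in range(reps)) / reps
    return mean_exp, mean_sort, mean_ratio
```

Agreement of the first two means is of course only a consistency check on first moments, not a verification of the full distributional identity.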
Define the uniform spacings by

$$S_{i,n} = U_{i,n} - U_{i-1,n} \quad \text{for } i = 1, \ldots, n+1. \tag{3.13}$$
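Exercise 3.1. Denote the maximal and minimal uniform spacings, respectively, by

$$M_n^+ = \max_{1 \le i \le n+1} S_{i,n} \quad \text{and} \quad M_n^- = \min_{1 \le i \le n+1} S_{i,n}.$$

Establish the limit laws, for $x \in \mathbb{R}$ and $y \ge 0$,

$$\lim_{n \to \infty} P\big(n M_n^+ - \log n \le x\big) = e^{-e^{-x}} \quad \text{and} \quad \lim_{n \to \infty} P\big(n^2 M_n^- \le y\big) = 1 - e^{-y}.$$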
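Before attempting Exercise 3.1, it may help to see the two limit laws emerge numerically. The following sketch is an illustration only: the function names, the sample size $n$, the number of replications and the seed are arbitrary choices of ours. It estimates $P(nM_n^+ - \log n \le x)$ and $P(n^2 M_n^- \le y)$ by simulation, so that the output can be compared with $e^{-e^{-x}}$ and $1 - e^{-y}$.

```python
import math
import random

def uniform_spacings(n, rng):
    """Return the n + 1 spacings S_{i,n} = U_{i,n} - U_{i-1,n}, i = 1..n+1."""
    points = [0.0] + sorted(rng.random() for _ in range(n)) + [1.0]
    return [b - a for a, b in zip(points, points[1:])]

def spacing_frequencies(n, reps, x, y, seed=0):
    """Estimate P(n M_n^+ - log n <= x) and P(n^2 M_n^- <= y) by simulation."""
    rng = random.Random(seed)
    hits_max = 0
    hits_min = 0
    for _ in range(reps):
        spacings = uniform_spacings(n, rng)
        if n * max(spacings) - math.log(n) <= x:
            hits_max += 1
        if n * n * min(spacings) <= y:
            hits_min += 1
    return hits_max / reps, hits_min / reps

def limit_values(x, y):
    """Limiting probabilities stated in Exercise 3.1."""
    return math.exp(-math.exp(-x)), 1.0 - math.exp(-y)
```

For instance, `spacing_frequencies(200, 2000, 0.0, 1.0)` may be compared with `limit_values(0.0, 1.0)`. The agreement is only approximate for moderate $n$, since the convergence of the maximal spacing is rather slow.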
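Exercise 3.2. Fix a constant $c > 0$, and set $k_n = \lfloor c \log n \rfloor$. Letting $\ell(\cdot)$ be as in (1.61), set

$$\gamma_c^+ = \inf\{x \ge 1 : \ell(x) \ge 1/c\} \quad \text{and} \quad \gamma_c^- = \sup\{x \le 1 : \ell(x) \ge 1/c\}.$$

Consider the statistics, for $1 \le k \le n+1$,

$$M_{n,k}^+ = \max_{1 \le i \le n-k+2} \{U_{i+k-1,n} - U_{i-1,n}\} \quad \text{and} \quad M_{n,k}^- = \min_{1 \le i \le n-k+2} \{U_{i+k-1,n} - U_{i-1,n}\}.$$

Show that, in probability, as $n \to \infty$,

$$\frac{n M_{n,k_n}^+}{k_n} \to \gamma_c^+ \quad \text{and} \quad \frac{n M_{n,k_n}^-}{k_n} \to \gamma_c^-.$$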
3.2. Uniform empirical and quantile processes

The uniform empirical process is defined by

$$\alpha_n(t) = n^{1/2}(U_n(t) - t) \quad \text{for } t \in \mathbb{R}, \tag{3.14}$$

and the uniform quantile process is defined by

$$\beta_n(t) = n^{1/2}(V_n(t) - t) \quad \text{for } t \in [0, 1]. \tag{3.15}$$

It is convenient to set $\beta_n(t) = 0$ for $t \notin [0, 1]$. The following results are well known, and will be discussed in the sequel.

Theorem 3.3. We have, for each $n \ge 1$ and $t \ge 0$,

$$P(\|\beta_n\| \ge t) = P(\|\alpha_n\| \ge t) \le 2e^{-2t^2}. \tag{3.16}$$
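The bound (3.16) is easy to probe numerically. The sketch below is only an illustration, with function names, sample sizes and seed chosen by us: it computes $\|\alpha_n\|$ exactly from the order statistics, using the fact that the supremum of $|U_n(t) - t|$ is approached at a jump point of $U_n$, and it estimates the exceedance probability by simulation so that it can be compared with $2e^{-2t^2}$.

```python
import math
import random

def sup_norm_alpha(sample):
    """Return ||alpha_n|| = sqrt(n) * sup_t |U_n(t) - t| for a sample in (0, 1)."""
    u = sorted(sample)
    n = len(u)
    # U_n jumps from (i-1)/n to i/n at U_{i,n}; the sup is reached at such a jump,
    # either just before it (U_{i,n} - (i-1)/n) or at it (i/n - U_{i,n}).
    gap = max(max(i / n - u[i - 1], u[i - 1] - (i - 1) / n)
              for i in range(1, n + 1))
    return math.sqrt(n) * gap

def dkw_bound(t):
    """Right-hand side of (3.16)."""
    return 2.0 * math.exp(-2.0 * t * t)

def exceedance_frequency(n, t, reps, seed=2024):
    """Monte Carlo estimate of P(||alpha_n|| >= t) for uniform (0, 1) samples."""
    rng = random.Random(seed)
    hits = sum(sup_norm_alpha([rng.random() for _ in range(n)]) >= t
               for _ in range(reps))
    return hits / reps
```

One may, for example, compare `exceedance_frequency(50, 1.0, 2000)` with `dkw_bound(1.0)`. Up to Monte Carlo error, the former should not exceed the latter.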
Proof. Dvoretzky, Kiefer and Wolfowitz [86] showed that $P(\|\alpha_n\| \ge t) \le Ce^{-2t^2}$, for some universal constant $C \ge 2$, proved equal to 2 by Massart [157]. We invoke (3.8) for the equality $\|\alpha_n\| = \|\beta_n\|$.

Remark 3.3. The extension of (3.16) to dimension $d \ge 2$ is delicate. For $x = (x_1, \ldots, x_d)$ and $y = (y_1, \ldots, y_d) \in \mathbb{R}^d$, write $x \le y$ when $x_j \le y_j$ for $j = 1, \ldots, d$. Next, given an iid sequence $U_1, U_2, \ldots$ of random vectors, uniformly distributed on $[0, 1]^d$, define the $d$-dimensional uniform empirical process by

$$\alpha_n(x) = n^{-1/2} \sum_{i=1}^{n} \Big( \mathbb{1}_{\{U_i \le x\}} - \prod_{j=1}^{d} x_j \Big), \tag{3.17}$$

for $x = (x_1, \ldots, x_d) \in [0, 1]^d$. An example of Dvoretzky-Kiefer-Wolfowitz-type bound which can be obtained in this framework is given by Devroye [75, 76], who established the inequality

$$P(\|\alpha_n\| \ge t) \le 2e^2 (2n)^d e^{-2t^2}, \tag{3.18}$$

for all $n \ge 1$ and $t \ge n^{-1/2} d^2$.

Theorem 3.4. (Chung) We have, with probability 1,

$$\limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\alpha_n\| = \limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\beta_n\| = \frac{1}{2}. \tag{3.19}$$
Proof. This result is due to Chung [31]. Below, we give a simplified proof of (3.19) to illustrate the use of some important technical tools. We first observe that, for each specified $t \in [0, 1]$, we may write

$$n^{1/2} \alpha_n(t) = \sum_{i=1}^{n} \big( \mathbb{1}_{\{U_i \le t\}} - t \big) = \sum_{i=1}^{n} (Y_i - E(Y_i)),$$

where $Y_i \stackrel{d}{=} Be(t)$, $i = 1, 2, \ldots$, is an iid sequence of random variables following Bernoulli distributions (denoted by $Be(t)$) with probability of success $P(Y_i = 1) = t$. Since $\mathrm{Var}(Y_i) = t(1-t)$, the law of the iterated logarithm [LIL] (see, e.g., (2.89)) implies that (for each specified $t \in [0, 1]$)

$$\limsup_{n \to \infty} \frac{\pm \alpha_n(t)}{\sqrt{2 \log_2 n}} = \sqrt{t(1-t)} \quad \text{a.s.} \tag{3.20}$$

By choosing $t = 1/2$ in (3.20), and recalling (3.8), we obtain readily that

$$\limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\alpha_n\| = \limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\beta_n\| \ge \limsup_{n \to \infty} \frac{\pm \alpha_n(1/2)}{\sqrt{2 \log_2 n}} = \frac{1}{2}. \tag{3.21}$$

To obtain the reverse inequality, we select $\gamma > 0$ and $\varepsilon > 0$, and set $n_k = \lfloor (1+\gamma)^k \rfloor$ for $k = 0, 1, 2, \ldots$, where $\lfloor u \rfloor \le u < \lfloor u \rfloor + 1$ denotes the integer part of $u$. Consider the events, for $k = 1, 2, \ldots$,

$$A_k = \Big\{ \exists n \in (n_{k-1}, n_k] : n^{1/2} \|\alpha_n\| \ge (1 + 2\varepsilon) \sqrt{\tfrac{1}{2} n_k \log_2 n_k} \Big\}, \tag{3.22}$$

$$B_k = \Big\{ n_k^{1/2} \|\alpha_{n_k}\| \ge (1 + \varepsilon) \sqrt{\tfrac{1}{2} n_k \log_2 n_k} \Big\}. \tag{3.23}$$

Making use of Ottaviani's lemma, stated in a general framework in Lemma 4.2 in the sequel, we obtain readily that, for all $k$ sufficiently large,

$$P(A_k) \le 2 P(B_k). \tag{3.24}$$

In view of (3.16) and (3.23), we obtain, in turn, that, for all large $k$,

$$P(B_k) \le 2 \exp\big( -(1+\varepsilon)^2 \log_2 n_k \big) = \frac{2}{(\log n_k)^{(1+\varepsilon)^2}}. \tag{3.25}$$

Since, as $k \to \infty$, $\log n_k = (1 + o(1))\, k \log(1+\gamma)$, we infer from (3.24) and (3.25) that

$$\sum_{k=1}^{\infty} P(A_k) < \infty. \tag{3.26}$$

The Borel-Cantelli lemma, when combined with (3.26), implies that the events $A_k$ hold finitely often with probability 1. This, in turn, implies that, a.s., ultimately for all large $n$,

$$\|\alpha_n\| \le (1 + 2\varepsilon) \left\{ \frac{n_k \log_2 n_k}{n_{k-1} \log_2 n_{k-1}} \right\}^{1/2} \sqrt{\tfrac{1}{2} \log_2 n} \le (1 + 2\varepsilon)(1 + \gamma) \sqrt{\tfrac{1}{2} \log_2 n}. \tag{3.27}$$

Here, we have used the observation that, ultimately for all large $k$,

$$\frac{n_k \log_2 n_k}{n_{k-1} \log_2 n_{k-1}} = (1 + o(1))(1 + \gamma) \le (1 + \gamma)^2.$$

Since $\varepsilon > 0$ and $\gamma > 0$ in (3.27) may be chosen as small as desired, it follows readily from this limiting statement that, with probability 1,

$$\limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\alpha_n\| = \limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \|\beta_n\| \le \frac{1}{2}. \tag{3.28}$$

We conclude (3.19), by combining (3.21) with (3.28).

Exercise 3.3. Fix any $0 \le T \le 1$. Show that, almost surely,

$$\limsup_{n \to \infty} (2 \log_2 n)^{-1/2} \sup_{0 \le t \le T} |\alpha_n(t)| = \sup_{0 \le t \le T} \sqrt{t(1-t)}.$$

Exercise 3.4. Making use of (3.64) and (3.65), show that the constant $C$ in the Dvoretzky, Kiefer and Wolfowitz [86] inequality, $P(\|\alpha_n\| \ge t) \le Ce^{-2t^2}$, cannot be chosen less than 2.

3.3. Some further martingale inequalities

We have the following useful results concerning empirical processes and Brownian bridges.

Proposition 3.4. For each specified $n \ge 1$, the process

$$\{\alpha_n(t)/(1-t) : 0 \le t < 1\} \tag{3.29}$$

defines a martingale.
Proof. It is equivalent to prove that the process $\{(nU_n(t) - nt)/(1-t) : 0 \le t < 1\}$ defines a martingale. Consider any $0 < s < t < 1$, and set $A = nU_n(s)$ and $B = nU_n(t) - nU_n(s)$. We obtain readily that

$$\begin{aligned}
P\big(B = b \,\big|\, \sigma\{U_n(u) : 0 \le u \le s\}\big) &= P(B = b \mid A = a) \\
&= \frac{\dfrac{n!}{a!\, b!\, (n-a-b)!}\, s^a (t-s)^b (1-t)^{n-a-b}}{\dfrac{n!}{a!\, (n-a)!}\, s^a (1-s)^{n-a}} \\
&= \frac{(n-a)!}{b!\, (n-a-b)!} \cdot \frac{(t-s)^b (1-t)^{n-a-b}}{(1-s)^{n-a}} \\
&= \frac{(n-a)!}{b!\, (n-a-b)!} \left( \frac{t-s}{1-t} \right)^{b} \left( \frac{1-t}{1-s} \right)^{n-a}.
\end{aligned}$$

Next, we write

$$E(B \mid A = a) = \left( \frac{1-t}{1-s} \right)^{n-a} \sum_{b=0}^{n-a} b\, \frac{(n-a)!}{b!\, (n-a-b)!} \left( \frac{t-s}{1-t} \right)^{b} = (n-a)\, \frac{t-s}{1-s},$$

whence

$$E(B + A - nt \mid A = a) = \frac{n(t-s) + a(1-t)}{1-s} - nt = n(1-t) - (n-a)\, \frac{1-t}{1-s},$$

and, finally,

$$E\left( \frac{B + A - nt}{1-t} \,\Big|\, A = a \right) = \frac{a - ns}{1-s},$$

which suffices for our needs.

Proposition 3.5. Let $\{B(t) : 0 \le t \le 1\}$ define a Brownian bridge. Then, the process

$$\{B(t)/(1-t) : 0 \le t < 1\} \tag{3.30}$$

defines a martingale.

Proof. Omitted.
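Proposition 3.4 has some interesting consequences. First, we obtain the following useful corollary.

Corollary 3.1. For each $n \ge 1$, $0 < T < 1$ and $\lambda > 0$, we have

$$P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda \sqrt{T(1-T)} \Big) \le \frac{1}{\lambda^2}. \tag{3.31}$$

Proof. As follows from Proposition 3.4 and (2.30) taken with $r = 2$, we obtain that, for $u > 0$ and $0 < T < 1$,

$$\begin{aligned}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge u(1-T)\sqrt{T} \Big) &\le P\Big( \sup_{0 \le t \le T} \Big| \frac{\alpha_n(t)}{1-t} \Big| \ge u\sqrt{T} \Big) \\
&\le \frac{1}{u^2 T}\, E\Big( \Big( \frac{\alpha_n(T)}{1-T} \Big)^2 \Big) = \frac{1}{u^2 T} \times \frac{T(1-T)}{(1-T)^2} = \frac{1}{u^2 (1-T)}.
\end{aligned}$$

We obtain readily (3.31) by setting $u = \lambda/\sqrt{1-T}$ in this last inequality.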
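Corollary 3.2. For each $n \ge 1$, $0 < T < 1$ and $\lambda > 0$, we have

$$P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) \le 2 \exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big). \tag{3.32}$$

Moreover, if $0 < \lambda \le \sqrt{nT}/(1-T)$, we have

$$\begin{aligned}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le 2 \exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big) \\
&\le 2 \exp\Big( -\frac{\lambda^2}{2} (1-T)^2 \Big( 1 - \frac{\lambda(1-T)}{3\sqrt{nT}} \Big) \Big).
\end{aligned} \tag{3.33}$$

Proof. First note that, since $0 < 1 - t \le 1$,

$$P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) \le P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) + P\Big( \sup_{0 \le t \le T} \frac{-\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big).$$

Next, we combine Proposition 3.4 with (2.31). We so obtain that

$$\begin{aligned}
P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge u \Big) &\le \exp\Big( -\sup_{s} \Big[ su - \log E \exp\Big( \frac{s\,\alpha_n(T)}{1-T} \Big) \Big] \Big) \\
&= \exp\Big( -\sup_{s} \Big[ \frac{s}{1-T}\,(u(1-T)) - \log E \exp\Big( \frac{s}{1-T}\, \alpha_n(T) \Big) \Big] \Big) \\
&= \exp\Big( -\sup_{r} \Big[ r\,(u(1-T)) - \log E \exp\big( r\, \alpha_n(T) \big) \Big] \Big).
\end{aligned}$$

We combine this last inequality with Theorem 2.7, and (2.70), to obtain that

$$P\Big( \sup_{0 \le t \le T} \frac{\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) \le \exp\Big( -nT\, h\Big( 1 + \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big).$$

A similar argument allows us to show that

$$P\Big( \sup_{0 \le t \le T} \frac{-\alpha_n(t)}{1-t} \ge \lambda\sqrt{T} \Big) \le \exp\Big( -nT\, h\Big( 1 - \frac{\lambda(1-T)}{\sqrt{nT}} \Big) \Big).$$

In view of (2.68) and (2.70), we conclude readily (3.33).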
Remark 3.4. We will obtain in the forthcoming Proposition 3.9 a refinement of the inequality (3.33), through a completely different argument.

3.4. Relations between empirical and Poisson processes

The empirical process is related to the Poisson process by the following general principle. The latter provides a construction of a Poisson process on a general locally compact separable metric space $\mathbb{X}$ (see, e.g., §1.3.1). Consider a sequence $X = X_1, X_2, \ldots$ of iid $\mathbb{X}$-valued random variables with common distribution $P_X(B) = P(X \in B)$, for $B$ belonging to the set $\mathcal{B}_{\mathbb{X}}$ of all Borel subsets of $\mathbb{X}$. Let further $K$ denote a random variable, independent of $X_1, X_2, \ldots$, and following a Poisson distribution with expectation $E(K) = \lambda$. We have namely

$$P(K = k) = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, \ldots. \tag{3.34}$$

By definition, the Poisson process on $\mathbb{X}$ with mean measure $\nu(\cdot) = \lambda P_X(\cdot)$ is the random measure $\Pi(\cdot)$, defined on the set $\mathcal{B}_{\mathbb{X}}$ of Borel subsets of $\mathbb{X}$ by

$$\Pi(B) = \sum_{i=1}^{K} \mathbb{1}_{\{X_i \in B\}}, \tag{3.35}$$

where, by convention, $\sum_{\emptyset} (\cdot) := 0$. Among the important features of the Poisson process $\Pi$ are the following:
1°) For each $B \in \mathcal{B}_{\mathbb{X}}$, $\Pi(B)$ follows a Poisson distribution with expectation $\nu(B)$;
2°) For each sequence $B_1, B_2, \ldots \in \mathcal{B}_{\mathbb{X}}$, with $\nu(B_i \cap B_j) = 0$ for all $i \ne j$, $\Pi(B_1), \Pi(B_2), \ldots$ are independent;
3°) Whenever $\Pi_1, \Pi_2, \ldots$ are independent Poisson processes with mean measures $\nu_1, \nu_2, \ldots$, the superposed process $\Pi_1 + \Pi_2 + \cdots$ is Poisson with mean measure $\nu_1 + \nu_2 + \cdots$.

Strictly speaking, the definition (3.35) is valid only for Poisson processes with bounded mean measures $\nu \in \mathcal{M}_f^+(\mathbb{X})$, namely such that $\nu(\mathbb{X}) = \lambda P_X(\mathbb{X}) = \lambda < \infty$. To obtain a more general form of this definition, compatible with the above property (3°), we use the convention that a Poisson process with a possibly unbounded mean measure $\nu \in \mathcal{M}^+(\mathbb{X})$ is defined as the superposition $\Pi_1 + \Pi_2 + \cdots$ of an infinite sequence of independent Poisson processes (in the sense of (3.35)), with $\nu_j(\mathbb{X}) < \infty$ for $j = 1, 2, \ldots$, and such that $\nu = \nu_1 + \nu_2 + \cdots$. At this point, we see the usefulness of assuming that $\mathbb{X}$ is a locally compact separable metric space, to render such a construction possible for an arbitrary nonnegative Radon measure $\nu$ on $\mathbb{X}$.

In the particular case where $\mathbb{X} = \mathbb{R}_+$ and $\nu$ is the Lebesgue measure on $\mathbb{R}_+$, we obtain the standard Poisson process, commonly defined through its right-continuous distribution function

$$N(t) = \Pi\big( (0, t] \big) \quad \text{for } t \ge 0. \tag{3.36}$$

The standard Poisson process $\{N(t) : t \ge 0\}$ has the following important property, which yields an alternative construction of this process.

Proposition 3.6. Let $T_k = \inf\{t : N(t) \ge k\}$ for $k = 0, 1, \ldots$. Then, $T_0 = 0$ and the random variables $\omega_k = T_k - T_{k-1}$, $k = 1, 2, \ldots$ are independent and exponentially distributed with unit mean.

Proof. Omitted.
For each $\lambda > 0$, the homogeneous Poisson process defined by $N_\lambda(t) = N(\lambda t)$ for $t \ge 0$, with $N(\cdot)$ as in (3.36), is such that $E(N_\lambda(t)) = \lambda t$ for $t \ge 0$. The next proposition is a straightforward consequence of the above definitions. Let $U_n(t)$ be, as in (3.2), the uniform empirical distribution function pertaining to the first $n \ge 1$ observations from an iid sequence $U_1, U_2, \ldots$ of uniform $(0, 1)$ random variables.

Proposition 3.7. For an arbitrary $\lambda > 0$ and $n \ge 1$, we have the distributional equality

$$\mathcal{L}\big( \{nU_n(t) : 0 \le t \le 1\} \big) \stackrel{d}{=} \mathcal{L}\big( \{N_\lambda(t) : 0 \le t \le 1\} \,\big|\, N_\lambda(1) = n \big), \tag{3.37}$$

between the distribution $\mathcal{L}(\{nU_n(t) : 0 \le t \le 1\})$ of $\{nU_n(t) : 0 \le t \le 1\}$, and the conditional distribution $\mathcal{L}(\{N_\lambda(t) : 0 \le t \le 1\} \mid N_\lambda(1) = n)$ of $\{N_\lambda(t) : 0 \le t \le 1\}$, given that $N_\lambda(1) = n$.
Proof. Omitted.
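The conditioning statement (3.37) can be illustrated by a small simulation. The sketch below rests on several choices of ours (function names, the rejection-sampling scheme, the seed, the parameter values) and is not meant as a proof. It builds $N_\lambda$ from unit exponential gaps as in Proposition 3.6, keeps only the paths with $N_\lambda(1) = n$, and averages $N_\lambda(t)$ over these paths. Under (3.37), this average should be close to $E(nU_n(t)) = nt$.

```python
import random

def poisson_points(rate, rng):
    """Jump times in [0, 1] of N_rate(t) = N(rate * t), built from unit exponential gaps."""
    points = []
    arrival = rng.expovariate(1.0)        # T_1 for the standard process N
    while arrival <= rate:
        points.append(arrival / rate)     # N(rate * t) jumps at t = T_k / rate
        arrival += rng.expovariate(1.0)   # T_{k+1} = T_k + omega_{k+1}
    return points

def conditional_mean_count(n, t, rate, reps, seed=7):
    """Average of N_rate(t) over simulated paths with N_rate(1) = n (rejection sampling).

    Returns the average together with the number of accepted paths.
    """
    rng = random.Random(seed)
    total = 0
    accepted = 0
    for _ in range(reps):
        pts = poisson_points(rate, rng)
        if len(pts) == n:
            total += sum(p <= t for p in pts)
            accepted += 1
    if accepted == 0:
        raise ValueError("no simulated path satisfied N_rate(1) = n")
    return total / accepted, accepted
```

Taking `rate` close to `n` keeps the acceptance probability reasonably large. The identity (3.37) itself holds for every $\lambda > 0$, so the choice only affects the efficiency of the rejection step.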
The fact that a homogeneous Poisson process has independent increments renders this process more tractable than the empirical process. The next proposition gives a useful trick to replace empirical processes by Poisson processes in a series of calculations.

Proposition 3.8. For each $0 < T < 1$, there exists a universal constant $C_T$ with the following property. For any measurable subset $B$ of the set $(I_+[0, T], \mathcal{W})$ of nondecreasing right-continuous nonnegative functions on $[0, T]$, endowed with the weak topology, and each $n \ge 1$, we have

$$P\big( \{nU_n(t) : 0 \le t \le T\} \in B \big) \le C_T\, P\big( \{N_n(t) : 0 \le t \le T\} \in B \big). \tag{3.38}$$

In addition, for each $\varepsilon > 0$, there exists an $n_\varepsilon < \infty$ such that (3.38) holds for all $n \ge n_\varepsilon$ with

$$C_T = (1 + \varepsilon)/\sqrt{1 - T}. \tag{3.39}$$

Proof. This elementary result has been used repeatedly in a series of papers in the literature (see, e.g., Lemma 2.7 in Deheuvels [54]). Let $N_\lambda(\cdot)$ be as in (3.37), and set $\lambda = n$. Introduce the events $E_1 = \{\{nU_n(t) : 0 \le t \le T\} \in B\}$ and $E_2 = \{\{N_n(t) : 0 \le t \le T\} \in B\}$. Set $R' = N_n(T)$ and $R'' = N_n(1) - N_n(T)$. By (3.37), and the independence of $(E_2, R')$ and $R''$, we have

$$\begin{aligned}
P(E_1) = P(E_2 \mid N_n(1) = n) &= P(E_2 \mid R' + R'' = n) = \frac{P(E_2 \cap \{R' + R'' = n\})}{P(R' + R'' = n)} \\
&= \frac{1}{P(R' + R'' = n)} \sum_{k=0}^{n} P(E_2 \cap \{R' = n - k\})\, P(R'' = k) \\
&\le \frac{1}{P(R' + R'' = n)} \Big\{ \sup_{k \ge 0} P(R'' = k) \Big\} \sum_{k=0}^{n} P(E_2 \cap \{R' = n - k\}) \\
&\le \frac{1}{P(R' + R'' = n)} \Big\{ \sup_{k \ge 0} P(R'' = k) \Big\}\, P(E_2).
\end{aligned}$$

Next, we make use of the observation that, if $K$ follows a Poisson distribution with expectation $\lambda$, then

$$\sup_{k \ge 0} P(K = k) = P(K = \lfloor \lambda \rfloor), \tag{3.40}$$

where $\lfloor \lambda \rfloor \le \lambda < \lfloor \lambda \rfloor + 1$ is the integer part of $\lambda$. To establish this property, observe that, if

$$p_k := P(K = k) = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, \ldots,$$

then

$$\frac{p_k}{p_{k-1}} = \frac{\lambda}{k} \quad \begin{cases} \ge 1 & \text{for } k \le \lambda \Leftrightarrow k \le \lfloor \lambda \rfloor, \\ < 1 & \text{for } k > \lambda \Leftrightarrow k > \lfloor \lambda \rfloor, \end{cases}$$

from where (3.40) is straightforward. By all this, we infer from the Stirling formula, $N! = (1 + o(1))(N/e)^N \sqrt{2\pi N}$ as $N \to \infty$, that, as $n \to \infty$,

$$\frac{1}{P(R' + R'' = n)} \sup_{k \ge 0} P(R'' = k) = \frac{n!}{n^n e^{-n}} \cdot \frac{(n(1-T))^{\lfloor n(1-T) \rfloor}}{\lfloor n(1-T) \rfloor !}\, e^{-n(1-T)} = \frac{1 + o(1)}{\sqrt{1-T}}.$$

This, in turn, suffices to show that

$$C_T := \sup_{n \ge 1} \Big\{ \frac{1}{P(R' + R'' = n)} \sup_{k \ge 0} P(R'' = k) \Big\} < \infty.$$

Given this relation, we obtain readily (3.38) and (3.39), as sought.
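The Stirling argument in the proof can be checked numerically without simulation. The sketch below (function names are ours, and the computation is carried out on the log scale with `math.lgamma` to avoid overflow) evaluates the ratio of the modal Poisson probability with mean $n(1-T)$ to the probability that a Poisson variable with mean $n$ equals $n$. For large $n$, this should be close to $1/\sqrt{1-T}$.

```python
import math

def log_poisson_pmf(k, lam):
    """Logarithm of P(K = k) for K Poisson with mean lam > 0."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poissonization_ratio(n, T):
    """sup_k P(R'' = k) / P(R' + R'' = n), with R'' ~ Poisson(n(1 - T)) and R' + R'' ~ Poisson(n).

    The supremum is taken at the integer part of n(1 - T), as in (3.40).
    """
    lam = n * (1.0 - T)
    mode = math.floor(lam)
    return math.exp(log_poisson_pmf(mode, lam) - log_poisson_pmf(n, n))

def limiting_constant(T):
    """Limit of the ratio as n tends to infinity, according to the proof above."""
    return 1.0 / math.sqrt(1.0 - T)
```

Evaluating `poissonization_ratio(n, T)` for increasing `n` and fixed `T` shows how quickly the ratio settles near `limiting_constant(T)`, which is what (3.39) requires for large $n$.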
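The next proposition gives an example of how (3.38) may be applied.

Proposition 3.9. For any $0 < T_0 < 1$ and $\varepsilon > 0$, there exists an $n_\varepsilon < \infty$ such that, for all $n \ge n_\varepsilon$, we have, uniformly over $0 < T \le T_0$ and $0 \le \lambda \le \sqrt{nT}$,

$$\begin{aligned}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -nT\, h\Big( 1 + \frac{\lambda}{\sqrt{nT}} \Big) \Big) \\
&\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -\frac{\lambda^2}{2} \Big( 1 - \frac{\lambda}{3\sqrt{nT}} \Big) \Big).
\end{aligned} \tag{3.41}$$

Proof. Let $\{\Pi(t) : t \ge 0\}$ denote a standard Poisson process. By combining (3.38) and (3.39) with (2.65), taken with $\alpha = \lambda/\sqrt{nT}$, we obtain readily that, for all $n$ sufficiently large,

$$\begin{aligned}
P\Big( \sup_{0 \le t \le T} |\alpha_n(t)| \ge \lambda\sqrt{T} \Big) &\le \frac{1+\varepsilon}{\sqrt{1 - T_0}}\, P\Big( \sup_{0 \le t \le nT} |\Pi(t) - t| \ge nT\, \frac{\lambda}{\sqrt{nT}} \Big) \\
&\le \frac{2(1+\varepsilon)}{\sqrt{1 - T_0}} \exp\Big( -nT\, h\Big( 1 + \frac{\lambda}{\sqrt{nT}} \Big) \Big),
\end{aligned}$$

which, in view of (2.70), readily yields (3.41).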
3.5. Strong approximations of empirical and quantile processes

The uniform empirical process $\alpha_n(\cdot)$ fulfills

$$E(\alpha_n(t)) = 0 \quad \text{and} \quad E\big( \{m^{1/2} \alpha_m(s)\}\{n^{1/2} \alpha_n(t)\} \big) = \{m \wedge n\}\{s \wedge t - st\},$$

for all $0 \le s, t \le 1$ and $m, n \ge 1$. For each $n \ge 1$, the Gaussian process with the same covariance structure as $\{\alpha_n(t) : 0 \le t \le 1\}$ is the Brownian bridge, conveniently defined by $B(t) = W(t) - tW(1)$, where $W(\cdot)$ is a (standard one-dimensional) Wiener process. The two-parameter Gaussian process with the same covariance structure as $\{n^{1/2} \alpha_n(t) : 0 \le t \le 1,\ n \ge 0\}$ (with the convention that $n^{1/2} \alpha_n(t) = 0$ for $n = 0$) is the Kiefer process $\{K(t, s) : s \ge 0,\ t \ge 0\}$ (refer to Kiefer [127]). For integer values of $n \ge 0$, the Kiefer process $K(t, n)$ is conveniently defined by setting

$$K(t, n) = \sum_{i=1}^{n} B_i(t) \quad \text{for } 0 \le t \le 1, \tag{3.42}$$

where $B_1, B_2, \ldots$ is an iid sequence of Brownian bridges. Given a (standard two-dimensional) Wiener process $\{W(t, s) : t \ge 0,\ s \ge 0\}$, such that, for $s, t, s', t', s'', t'' \ge 0$,

$$E(W(t, s)) = 0 \quad \text{and} \quad E\big( W(t', s')\, W(t'', s'') \big) = \{t' \wedge t''\}\{s' \wedge s''\}, \tag{3.43}$$

one defines a Kiefer process, via

$$K(t, s) = W(t, s) - tW(1, s) \quad \text{for } t \ge 0 \text{ and } s \ge 0. \tag{3.44}$$
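The representation $B(t) = W(t) - tW(1)$ of the Brownian bridge is straightforward to simulate on a grid, which gives a quick check of the covariance $s \wedge t - st$. The sketch below is only an illustration: the function names, the grid size, the number of replications and the seed are our own choices. At the grid points the Gaussian random walk has exactly the law of a Wiener process, so that only Monte Carlo error is involved.

```python
import math
import random

def brownian_bridge_on_grid(m, rng):
    """Return B(j/m), j = 0..m, with B(t) = W(t) - t W(1) and W sampled on the grid."""
    w = [0.0]
    step_sd = math.sqrt(1.0 / m)          # increments of W over a step of length 1/m
    for _ in range(m):
        w.append(w[-1] + rng.gauss(0.0, step_sd))
    return [w[j] - (j / m) * w[m] for j in range(m + 1)]

def empirical_covariance(j, k, m, reps, seed=99):
    """Monte Carlo estimate of E(B(j/m) B(k/m)); recall that E(B(t)) = 0."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        bridge = brownian_bridge_on_grid(m, rng)
        acc += bridge[j] * bridge[k]
    return acc / reps

def bridge_covariance(s, t):
    """Covariance s ∧ t - st of the Brownian bridge."""
    return min(s, t) - s * t
```

Summing $n$ independent such bridges gives, in the same spirit, a grid version of $K(\cdot, n)$ in (3.42), whose covariance at a fixed $n$ is $n$ times the one above.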
The Doob-Donsker weak invariance principle (see, e.g., Billingsley [19], and the references therein) establishes the weak convergence

$$\alpha_n \stackrel{W}{\to} B, \tag{3.45}$$

which holds for the probability measures induced by $\alpha_n$ and $B$ on the set $D[0, 1]$ endowed with the Skorohod topology (see, e.g., §2.2.2 and Ch. 3 in [19]). In many applications, it is more appropriate and convenient to make use of strong invariance principles, where the empirical processes and Brownian bridges are defined on the same probability space. Below, we cite the most useful theorems of the kind, in the framework of our study. We start with the celebrated Komlós, Major and Tusnády [129, 130] invariance principle for the uniform empirical process (stated in Theorem 3.5), and the Csörgő and Révész variant for the empirical quantile process (stated in Theorem 3.6).

Theorem 3.5. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_1 > 0$, $a_2 > 0$ and $a_3 > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,

$$P\Big( \|\alpha_n - B_n\| \ge \frac{a_1 \log n + t}{\sqrt{n}} \Big) \le a_2 \exp(-a_3 t). \tag{3.46}$$

Proof. The original papers of Komlós, Major and Tusnády [129, 130] give hardly any proof for (3.46), and the first self-contained proof of this inequality was given by Mason and van Zwet [155]. Some details are also given in Csörgő and Révész [39], Bretagnolle and Massart [26] and Csörgő and Horváth [38].

Remark 3.5. The optimal choice of $a_1, a_2, a_3$ in (3.46) (as well as that of $a_4, a_5, a_6$ in (3.47) below) remains an open problem. The best presently known result in this direction is due to Bretagnolle and Massart [26], who showed that (3.46) holds for $n \ge 2$ with $a_1 = 12$, $a_2 = 2$ and $a_3 = 1/6$.

Theorem 3.6. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n^* : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_4 > 0$, $a_5 > 0$ and $a_6 > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,

$$P\Big( \|\beta_n - B_n^*\| \ge \frac{a_4 \log n + t}{\sqrt{n}} \Big) \le a_5 \exp(-a_6 t). \tag{3.47}$$

Proof. See, e.g., Csörgő and Révész [39].
A very useful refinement of Theorem 3.5, due to Mason and van Zwet [155], is cited in the next theorem. Introduce the following notation. Set

$$\log_+ u = \log_1 u = \log(u \vee e) \quad \text{for } u \in \mathbb{R}, \tag{3.48}$$

and define recursively, for each integer $p \ge 2$,

$$\log_p u = \log_+\big( \log_{p-1}(u) \big) \quad \text{for } u \in \mathbb{R}. \tag{3.49}$$
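The truncated iterated logarithm is simple to implement, and doing so makes clear why the truncation at $e$ is convenient: $\log_p u$ is then defined and at least 1 for every real $u$. The sketch below reads (3.49) as the $p$-fold iterate of $\log_+$, which is our reading of the recursion, and the function names are ours.

```python
import math

def log_plus(u):
    """log_+ u = log(max(u, e)), as in (3.48); never smaller than 1."""
    return math.log(max(u, math.e))

def log_p(u, p):
    """p-fold iterate: log_1 = log_+ and log_p u = log_+(log_{p-1} u) for p >= 2."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    value = u
    for _ in range(p):
        value = log_plus(value)
    return value
```

With this convention, `log_p(n, 2)` agrees with $\log \log n$ as soon as $\log n \ge e$, which is the regime relevant for the normalizing sequences $(2 \log_2 n)^{1/2}$ used throughout.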
Theorem 3.7. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that, for appropriate constants $a_7 > 0$, $a_8 > 0$ and $a_9 > 0$, we have, for all $n \ge 1$, $t \in \mathbb{R}$ and $0 < d \le n$,

$$P\Big( \sup_{0 \le u \le d/n} |\alpha_n(u) - B_n(u)| \ge \frac{a_7 \log_+ d + t}{\sqrt{n}} \Big) \le a_8 \exp(-a_9 t). \tag{3.50}$$

Proof. See, e.g., Mason and van Zwet [155].

The following two theorems provide (very likely non-optimal) versions of Theorems 3.5 and 3.7 with respect to the approximation of the empirical process by Kiefer processes.

Theorem 3.8. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{\widetilde{B}_n : n \ge 1\}$ of independent Brownian bridges, in such a way that, for appropriate constants $a_{10} > 0$, $a_{11} > 0$ and $a_{12} > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,

$$P\Big( \Big\| \alpha_n - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| \ge \frac{(\log_+ n)(a_{10} \log_+ n + t)}{\sqrt{n}} \Big) \le a_{11} \exp(-a_{12} t). \tag{3.51}$$

Proof. See, e.g., Komlós, Major and Tusnády [129, 130].

Theorem 3.9. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{\widetilde{B}_n : n \ge 1\}$ of independent Brownian bridges, in such a way that, for appropriate constants $a_{10} > 0$, $a_{11} > 0$ and $a_{12} > 0$, we have, for all $n \ge 1$ and $t \in \mathbb{R}$,

$$P\Big( \sup_{0 \le u \le d/n} \Big| \alpha_n(u) - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i(u) \Big| \ge \frac{(\log_+ n)(a_{10} \log_+ n + t)}{\sqrt{n}} \Big) \le a_{11} \exp(-a_{12} t). \tag{3.52}$$

Proof. See, e.g., Castelle and Laurent-Bonvalot [27] and Castelle [28].
Theorem 3.10. On a suitable probability space, we have, almost surely as $n \to \infty$,

$$\Big\| \alpha_n - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| = O\Big( \frac{(\log n)^2}{\sqrt{n}} \Big), \tag{3.53}$$

and

$$\Big\| \beta_n + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widetilde{B}_i \Big\| = O\big( n^{-1/4} (\log n)^{1/2} (\log_2 n)^{1/4} \big), \tag{3.54}$$

where $\{\widetilde{B}_n : n \ge 1\}$ is a sequence of independent Brownian bridges. In addition, there does not exist a sequence $\{\widehat{B}_n : n \ge 1\}$ of independent Brownian bridges such that, almost surely,

$$\Big\| \beta_n + \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \widehat{B}_i \Big\| = o\big( n^{-1/4} (\log n)^{1/2} (\log_2 n)^{1/4} \big). \tag{3.55}$$

Proof. By combining Theorem 3.9 with the Borel-Cantelli lemma we obtain (3.53). This, when combined with the Bahadur-Kiefer representation (see, e.g., Theorem 3.4 in the sequel) yields (3.54), first stated by Csörgő and Révész [39]. The fact that (3.54) is optimal was proved by Deheuvels [55], who showed that (3.55) cannot hold a.s., for any possible choice of the iid sequence $\{\widehat{B}_n : n \ge 1\}$.

The following theorem is due to Csörgő, Csörgő, Horváth and Mason [41], and Csörgő and Horváth [37].

Theorem 3.11. On a suitable probability space, it is possible to define $\{U_n : n \ge 1\}$ jointly with a sequence $\{B_n : n \ge 1\}$ of Brownian bridges, in such a way that the following limiting property holds. For any $0 \le \nu < \frac{1}{2}$, $0 \le \tau < \frac{1}{4}$ and $c > 0$, we have, as $n \to \infty$,

$$n^{\nu} \sup_{\frac{c}{n} \le t \le 1 - \frac{c}{n}} \frac{|\alpha_n(t) - B_n(t)|}{\{t(1-t)\}^{\frac{1}{2} - \nu}} = O_P(1) \quad \text{and} \quad n^{\tau} \sup_{\frac{c}{n} \le t \le 1 - \frac{c}{n}} \frac{|\beta_n(t) + B_n(t)|}{\{t(1-t)\}^{\frac{1}{2} - \tau}} = O_P(1). \tag{3.56}$$
Exercise 3.5. Show that, under the assumptions of Theorems 3.5 and 3.6, we have, almost surely,

$$\|\alpha_n - B_n\| = O\Big( \frac{\log n}{\sqrt{n}} \Big) \quad \text{and} \quad \|\beta_n - B_n^*\| = O\Big( \frac{\log n}{\sqrt{n}} \Big). \tag{3.57}$$

3.6. Some results for weighted processes

Below, we mention some useful results on weighted empirical processes. We refer to Einmahl and Mason [93, 94] for similar results for quantile processes and multivariate empirical processes. The standardized empirical process is defined by

$$\gamma_n(t) = \frac{\alpha_n(t)}{\sqrt{t(1-t)}} \quad \text{for } 0 < t < 1, \tag{3.58}$$

and has the property that, for each $0 < t < 1$,

$$\gamma_n(t) \stackrel{d}{\to} N(0, 1) \quad \text{as } n \to \infty.$$

The following limit laws are available for functionals of $\alpha_n$ and $\gamma_n$. Let

$$T_n = \sup |\gamma_n(t)|, \qquad A_n = \int_0^1 \gamma_n^2(t)\, dt,$$

[…] $1 : \ell(x) \ge 1/c\}$,

$$\delta_c^- = \sup\{x < 1 : h(x) \ge 1/c\}, \tag{3.91}$$

$$\gamma_c^- = \sup\{x < 1 : \ell(x) \ge 1/c\}. \tag{3.92}$$
Theorem 3.15. Under (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$, we have, almost surely,

$$\limsup_{n \to \infty} \pm (2 \log_2 n)^{-1/2} \xi_n(h_n) = \limsup_{n \to \infty} \pm (2 h_n \log_2 n)^{-1/2} \alpha_n(h_n) = 1, \tag{3.93}$$

$$\limsup_{n \to \infty} \pm (2 \log_2 n)^{-1/2} \zeta_n(h_n) = \limsup_{n \to \infty} \pm (2 h_n \log_2 n)^{-1/2} \beta_n(h_n) = 1. \tag{3.94}$$

On the other hand, when $nh_n/\log_2 n \to c$, we have, almost surely,

$$\limsup_{n \to \infty} \pm (2 \log_2 n)^{-1/2} \xi_n(h_n) = \limsup_{n \to \infty} \pm (2 h_n \log_2 n)^{-1/2} \alpha_n(h_n) = \pm (\delta_c^{\pm} - 1)(c/2)^{1/2}, \tag{3.95}$$

$$\limsup_{n \to \infty} \pm (2 \log_2 n)^{-1/2} \zeta_n(h_n) = \limsup_{n \to \infty} \pm (2 h_n \log_2 n)^{-1/2} \beta_n(h_n) = \pm (\gamma_c^{\pm} - 1)(c/2)^{1/2}. \tag{3.96}$$

Proof. See, e.g., Kiefer [126], and Deheuvels [49].
Mason [156] established a functional version of (3.94), stated in Theorem 3.16 below. This result is a consequence of the following invariance principle, when combined with a general LIL due to Lai [134].
Proposition 3.10. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, on a suitable probability space, there exists a sequence $W_k(\cdot)$, $k = 1, 2, \ldots$ of independent Wiener processes such that, almost surely as $n \to \infty$,

$$\Big\| \alpha_n(h_n \cdot) - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} W_i(h_n \cdot) \Big\| = o\big( \sqrt{h_n \log_2 n} \big). \tag{3.97}$$

Proof. See, e.g., Mason [156].

As a consequence of Proposition 3.10, we obtain the following result.

Theorem 3.16. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, as $n \to \infty$, the sequences

$$(2 \log_2 n)^{-1/2} \xi_n(h_n \cdot) = (2 h_n \log_2 n)^{-1/2} \alpha_n(h_n \cdot) \rightsquigarrow S, \tag{3.98}$$

$$(2 \log_2 n)^{-1/2} \zeta_n(h_n \cdot) = (2 h_n \log_2 n)^{-1/2} \beta_n(h_n \cdot) \rightsquigarrow S, \tag{3.99}$$

are almost surely relatively compact in $(B[0, 1], \mathcal{U})$, with limit set equal to $S$.

Proof. See, e.g., Mason [156], and Einmahl and Mason [94].
When $0 < t_0 < 1$, the limiting behavior of the local empirical process, and that of the local quantile process, become distinct. The following theorem holds.

Theorem 3.17. 1°) Let $0 < t_0 < 1$. Assume (H.1) and (H.4) with $nh_n/\log_2 n \to \infty$. Then, as $n \to \infty$, the sequence

$$(2 \log_2 n)^{-1/2} \xi_{n,t_0}(h_n \cdot) = (2 h_n \log_2 n)^{-1/2} \{\alpha_n(t_0 + h_n \cdot) - \alpha_n(t_0)\} \rightsquigarrow S,$$

is almost surely relatively compact in $(B[-1, 1], \mathcal{U})$, with limit set equal to $S$.

2°) Let $0 < t_0 < 1$. Assume (H.1), together with:
(H.4) $nh_n/\log_+\big( 1/(h_n \sqrt{n}) \big) \to \infty$;
(H.5) $\log_+\big( 1/(h_n \sqrt{n}) \big)/\log_2 n \to d \in [-\infty, \infty]$.
Then, the sequence

$$\Big\{ \big( 2h_n \big\{ \log_+\big( 1/(h_n \sqrt{n}) \big) + \log_2 n \big\} \big)^{-1/2} \{\beta_n(t_0 + h_n \cdot) - \beta_n(t_0)\} : n \ge 1 \Big\},$$

is almost surely relatively compact in $(B[-1, 1], \mathcal{U})$, with limit set equal to $S$.

Proof. See, e.g., Deheuvels and Mason [68], Mason [156], and Deheuvels [54].
3.9. Modulus of continuity of αn and βn Let {hn : n ≥ 1} denote a sequence of constants fulfilling the conditions (H.1) hn ↓ 0, nhn ↑ ∞; (H.2) nhn / log n → ∞; (H.3) (log(1/hn ))/ log2 n → ρ ∈ [0, ∞]. The following theorem collects a series of limiting results concerning the modulus of continuity of αn and βn . Below, we let 0 ≤ c < d ≤ 1 denote specified constants.
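Theorem 3.18. Under (H.1–2–3), we have, almost surely for each $0 \le c < d \le 1$,

$$\begin{aligned}
&\limsup_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| && (3.100) \\
&\quad = \limsup_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| && (3.101) \\
&\quad = \limsup_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| && (3.102) \\
&\quad = \limsup_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = 1, && (3.103)
\end{aligned}$$

and, setting $\rho/(\rho + 1) = 1$ when $\rho = \infty$,

$$\begin{aligned}
&\liminf_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| && (3.104) \\
&\quad = \liminf_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| && (3.105) \\
&\quad = \liminf_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| && (3.106) \\
&\quad = \liminf_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = \Big( \frac{\rho}{\rho + 1} \Big)^{1/2}. && (3.107)
\end{aligned}$$

In addition, we have, in probability,

$$\begin{aligned}
&\lim_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\alpha_n(u) - \alpha_n(v)| && (3.108) \\
&\quad = \lim_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\alpha_n(u + h_n) - \alpha_n(u)| && (3.109) \\
&\quad = \lim_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{\substack{c \le u, v \le d \\ |u - v| \le h_n}} |\beta_n(u) - \beta_n(v)| && (3.110) \\
&\quad = \lim_{n \to \infty}\, (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \sup_{c \le u \le d} |\beta_n(u + h_n) - \beta_n(u)| = \Big( \frac{\rho}{\rho + 1} \Big)^{1/2}. && (3.111)
\end{aligned}$$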
Proof. To establish (3.100)–(3.111) combine results of Stute [200], Mason, Shorack and Wellner [154], Mason [153], Deheuvels and Mason [66], Deheuvels [52], and Deheuvels and Einmahl [60].
Theorem 3.18 turns out to be a direct consequence of functional limit laws due to Deheuvels and Mason [66], Deheuvels [52], Deheuvels [53], and Deheuvels and Einmahl [60]. Let $S$ denote the Strassen set, as defined in (2.34). Consider the random sets of functions of $u \in [0, 1]$,

$$E_n = \Big\{ (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \{\alpha_n(t + h_n u) - \alpha_n(t)\} : c \le t \le d \Big\} =: \big\{ \upsilon_{n,t}(u) : c \le t \le d \big\}, \tag{3.112}$$

$$F_n = \Big\{ (2h_n \{\log(1/h_n) + \log_2 n\})^{-1/2} \{\beta_n(t + h_n u) - \beta_n(t)\} : c \le t \le d \Big\} =: \big\{ \rho_{n,t}(u) : c \le t \le d \big\}. \tag{3.113}$$

Denote the Hausdorff distance (pertaining to the uniform topology $\mathcal{U}$) between subsets of $B[0, 1]$ by setting, for arbitrary $A, B \subseteq B[0, 1]$,

$$\delta_{\mathcal{U}}(A, B) = \inf\big\{ \varepsilon > 0 : A \subseteq B^{\varepsilon} \text{ and } B \subseteq A^{\varepsilon} \big\}, \tag{3.114}$$

whenever such an $\varepsilon > 0$ exists, and $\delta_{\mathcal{U}}(A, B) = \infty$ otherwise.

Theorem 3.19. Under (H.1–2–3) with $\rho = \infty$, we have, for each $0 \le c < d \le 1$,

$$\lim_{n \to \infty} \delta_{\mathcal{U}}(E_n, S) = \lim_{n \to \infty} \delta_{\mathcal{U}}(F_n, S) = 0 \quad \text{a.s.} \tag{3.115}$$

Proof. See, e.g., Deheuvels and Mason [66].

Fix now a $T > 0$, set $h_n = n^{-1} \log n$, and consider the random sets of functions of $u \in [0, T]$ defined by

$$G_n = \Big\{ \frac{n}{\log n} \{U_n(t + u n^{-1} \log n) - U_n(t)\} : c \le t \le d \Big\}, \tag{3.116}$$

$$H_n = \Big\{ \frac{n}{\log n} \{V_n(t + u n^{-1} \log n) - V_n(t)\} : c \le t \le d \Big\}. \tag{3.117}$$

Introduce the sets of functions of $u \in [0, T]$ defined by

$$\Delta_T = \Big\{ f \in AC[0, T] : f(0) = 0,\ \int_0^T h(\dot{f}(u))\, du \le 1 \Big\}, \tag{3.118}$$

$$\Gamma_T = \Big\{ f \in I[0, T] : f(0) = 0,\ \int_0^T \ell(\dot{f}(u))\, du + f_S(T+) \le 1 \Big\}. \tag{3.119}$$

Here, $I[0, T]$ denotes the restriction to $[0, T]$ of left-continuous nondecreasing functions on $\mathbb{R}$. Each of these functions, say $f \in I[0, 1]$ with $f(0) = 0$, defines, via the Lebesgue-Stieltjes integral, a nonnegative measure on $[0, T]$. We may therefore decompose this measure into an absolutely continuous component with Lebesgue derivative $\dot{f}$, and a singular component $df_S$. We then define $f_S(T+)$ by

$$f_S(T+) = \int_{[0, T]} df_S. \tag{3.120}$$

Recall the definition (1.44) of the Lévy distance $\Delta_L(f, g)$ between two functions $f, g \in I[0, T]$. We set, for any $f \in I[0, T]$,

$$N_{[\varepsilon]}(f) = \big\{ g \in I[0, T] : \Delta_L(f, g) < \varepsilon \big\}, \tag{3.121}$$

and, for any subset $A \subseteq I[0, T]$,

$$A^{[\varepsilon]} = \bigcup_{f \in A} N_{[\varepsilon]}(f), \tag{3.122}$$

when $A \ne \emptyset$, and $A^{[\varepsilon]} = \emptyset^{[\varepsilon]} = \emptyset$ otherwise. Denote the Hausdorff distance (pertaining to the weak topology) between subsets of $I[0, T]$ by setting, for arbitrary $A, B \subseteq I[0, T]$,

$$\delta_{\mathcal{W}}(A, B) = \inf\big\{ \varepsilon > 0 : A \subseteq B^{[\varepsilon]} \text{ and } B \subseteq A^{[\varepsilon]} \big\}, \tag{3.123}$$

whenever such an $\varepsilon > 0$ exists, and $\delta_{\mathcal{W}}(A, B) = \infty$ otherwise.

Theorem 3.20. We have, for each $T > 0$ and $0 \le c < d \le 1$,

$$\lim_{n \to \infty} \delta_{\mathcal{U}}(G_n, \Delta_T) = \lim_{n \to \infty} \delta_{\mathcal{W}}(H_n, \Gamma_T) = 0 \quad \text{a.s.} \tag{3.124}$$
Proof. See, e.g., Deheuvels and Mason [66].
3.10. The Bahadur-Kiefer representation

The classical Bahadur-Kiefer representation, due to Bahadur [8], and Kiefer [124, 125], is stated in the following Theorems 3.21 and 3.22. It allows one to replace the quantile process $\beta_n$ by $(-1)$ times the empirical process $\alpha_n$ with an almost sure uniform error of order $n^{-1/4 + o(1)}$. At times (see, e.g., Deheuvels [55, 57]), this yields the best possible rates of approximation. The strange constants in (3.125) and (3.132) below are generated in a quite natural way by functional limit theorems, and their derivation illustrates well the main topic of the present course.

Theorem 3.21. We have, with probability 1,

$$\limsup_{n \to \infty} n^{1/4} (\log n)^{-1/2} (\log_2 n)^{-1/4} \|\alpha_n + \beta_n\| = 2^{-1/4}. \tag{3.125}$$

Proof. We refer to Shorack [182] and Deheuvels and Mason [64] for details. Below, we follow the elegant proof of Shorack [182] to establish (3.125). First, we combine Propositions 3.2 and 3.3, to obtain that, almost surely,

$$\sup_{0 \le t \le 1} |\beta_n(t) + \alpha_n(V_n(t))| = \sup_{0 \le t \le 1} \big| \beta_n(t) - n^{1/2} \{V_n(t) - U_n(V_n(t))\} \big| = \frac{1}{\sqrt{n}},$$

and hence, since $V_n(t) = t + n^{-1/2} \beta_n(t)$,

$$\big\| \{\alpha_n + \beta_n\} - \{\alpha_n - \alpha_n(I + n^{-1/2} \beta_n)\} \big\| = \frac{1}{\sqrt{n}} \quad \text{a.s.} \tag{3.126}$$

In view of (3.126), we see that (3.125) is equivalent to

$$\limsup_{n \to \infty} n^{1/4} (\log n)^{-1/2} (\log_2 n)^{-1/4} \big\| \alpha_n(I + n^{-1/2} \beta_n) - \alpha_n \big\| = 2^{-1/4} \quad \text{a.s.} \tag{3.127}$$
P. Deheuvels
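The proof of (3.127) is achieved in two complementary parts.

Part 1. (Outer bounds). Fix any $\varepsilon > 0$. By Chung's Theorem 3.4, we have, a.s. for all $n$ sufficiently large, $n^{-1/2}\|\beta_n\| \le h_n := (1+\varepsilon)\,2^{-1/2} n^{-1/2}(\log_2 n)^{1/2}$. This, when combined with (3.100), shows that, a.s.,
$$\limsup_{n\to\infty} \big(2h_n\{\log(1/h_n) + \log_2 n\}\big)^{-1/2}\,\|\alpha_n(I + n^{-1/2}\beta_n) - \alpha_n\|$$
$$\le \limsup_{n\to\infty} \big(2h_n\{\log(1/h_n) + \log_2 n\}\big)^{-1/2} \sup_{\substack{0\le u,v\le 1\\ |u-v|\le h_n}} |\alpha_n(u) - \alpha_n(v)| = 1.$$
Since $\log(1/h_n) + \log_2 n = (1+o(1))\,\tfrac12 \log n$, we obtain readily from this last result that, a.s.,
$$\limsup_{n\to\infty} n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I + n^{-1/2}\beta_n) - \alpha_n\| \le (1+\varepsilon)^{1/2}\,2^{-1/4}.$$
Since $\varepsilon > 0$ may be chosen as small as desired, it follows that, a.s.,
$$\limsup_{n\to\infty} n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I + n^{-1/2}\beta_n) - \alpha_n\| \le 2^{-1/4}. \tag{3.128}$$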
Part 2. (Inner bounds). In this part, we invoke Finkelstein's Theorem 3.14. As follows from (3.83), the function $f(t) = \min\{t, 1-t\}$ for $t \in [0,1]$ belongs to the Finkelstein set $F$. Thus, a.s. for each $\varepsilon > 0$, there exists an increasing sequence $\{n_j : j \ge 1\}$ along which $\|(2\log_2 n)^{-1/2}\beta_n - f\| \le \tfrac14\varepsilon$. Set now $c = \tfrac12 - \tfrac14\varepsilon$ and $d = \tfrac12 + \tfrac14\varepsilon$. Obviously, for any $t \in [c,d]$, we have $|f(t) - \tfrac12| \le \tfrac14\varepsilon$. Therefore, along $\{n_j : j \ge 1\}$, we have, uniformly over $t \in [c,d]$, $|(2\log_2 n)^{-1/2}\beta_n(t) - \tfrac12| \le \tfrac12\varepsilon$. Setting $h_n = 2^{-1/2}n^{-1/2}(\log_2 n)^{1/2}$, we have therefore, along $\{n_j : j \ge 1\}$, and uniformly over $t \in [c,d]$, $|n^{-1/2}\beta_n(t) - h_n| \le \varepsilon h_n$. Now, by an application of (3.101) and (3.105), we have
$$\lim_{n\to\infty} (2h_n\log(1/h_n))^{-1/2} \sup_{c\le u\le d} |\alpha_n(u + h_n) - \alpha_n(u)| = 1 \quad \text{a.s.,} \tag{3.129}$$
whereas we infer from (3.101) that
$$\limsup_{n\to\infty} (2h_n\log(1/h_n))^{-1/2} \sup_{\substack{c\le u\le d\\ |v|\le \varepsilon h_n}} |\alpha_n(u + h_n) - \alpha_n(u + h_n + v)| = \sqrt{\varepsilon} \quad \text{a.s.} \tag{3.130}$$
Making use of (3.129)–(3.130), we obtain therefore that
$$\liminf_{n\to\infty} (2h_n\log(1/h_n))^{-1/2} \sup_{c\le u\le d} \big|\alpha_n(u + n^{-1/2}\beta_n(u)) - \alpha_n(u)\big| \ge 1 - \sqrt{\varepsilon} \quad \text{a.s.}$$
Since $\varepsilon > 0$ may be chosen as small as desired, it follows readily from this last statement that, a.s.,
$$\liminf_{n\to\infty} n^{1/4}(\log n)^{-1/2}(\log_2 n)^{-1/4}\,\|\alpha_n(I + n^{-1/2}\beta_n) - \alpha_n\| \ge 2^{-1/4}. \tag{3.131}$$
The proof of (3.127) is achieved by combining (3.128) with (3.131).
The pointwise Bahadur-Kiefer representation is stated in the following theorem.

Theorem 3.22. For each specified $t_0 \in [0,1]$, we have, with probability 1,
$$\limsup_{n\to\infty} n^{1/4}(2\log_2 n)^{-3/4}\,|\alpha_n(t_0) + \beta_n(t_0)| = \{t_0(1-t_0)\}^{1/4}\,2^{1/2}\,3^{-3/4}. \tag{3.132}$$

Proof. See, e.g., Kiefer [124]. Below, we give a simple proof of this statement, given in Deheuvels and Mason [67], and §3.4 of Deheuvels and Mason [68]. We need the following two auxiliary results (see, e.g., Lemmas 3.3 and 3.4 in [68]). Set, for $0 < t_0 < 1$ and $-1 \le u \le 1$,
$$x_n = (2\log_2 n)^{-1/2}\alpha_n(t_0), \tag{3.133}$$
$$f_n(u) = \Big\{2\Big(\frac{2\log_2 n}{n}\Big)^{1/2}\log_2 n\Big\}^{-1/2} \times \Big\{\alpha_n(t_0) - \alpha_n\Big(t_0 - u\Big(\frac{2\log_2 n}{n}\Big)^{1/2}\Big)\Big\}. \tag{3.134}$$
Then, we have, almost surely, as $n \to \infty$,
$$\big|n^{1/4}(2\log_2 n)^{-3/4}\{\alpha_n(t_0) + \beta_n(t_0)\} - f_n(x_n)\big| \to 0. \tag{3.135}$$
Moreover, the sequence $(x_n, f_n)$ is almost surely relatively compact in $\mathbb{R}\times B[-1,1]$, where $B[-1,1]$ denotes the set of bounded functions on $[-1,1]$, endowed with the uniform topology. The limit set is given by
$$(x_n, f_n) \rightsquigarrow K_1 := \Big\{(x,f) \in \mathbb{R}\times AC_0[-1,1] : \frac{x^2}{t_0(1-t_0)} + \int_{-1}^{1} \dot f(u)^2\,du \le 1\Big\}. \tag{3.136}$$
In view of (3.133), (3.134), (3.135) and (3.136), the left-hand side of (3.132) equals, almost surely,
$$\sup_{(x,f)\in K_1} |f(x)| = \{t_0(1-t_0)\}^{1/4} \sup_{0\le s\le 1} \sqrt{s(1-s^2)} = \{t_0(1-t_0)\}^{1/4}\,2^{1/2}\,3^{-3/4}, \tag{3.137}$$
which yields the theorem.
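The last equality in (3.137) rests on a one-dimensional maximisation: $s(1-s^2)$ is maximal at $s = 1/\sqrt3$, where it equals $2\cdot 3^{-3/2}$. The following Python sketch checks this numerically by a crude grid search; the function name `g` and the grid size `N` are illustrative choices, not part of the original argument.

```python
import math

def g(s: float) -> float:
    """Objective in (3.137): sqrt(s * (1 - s^2)) for s in [0, 1]."""
    return math.sqrt(s * (1.0 - s * s))

# Crude grid search over [0, 1]; N is an arbitrary (assumed) resolution.
N = 200_000
s_best = max((i / N for i in range(N + 1)), key=g)
grid_sup = g(s_best)

# Constant appearing on the right-hand side of (3.132), without the t0 factor.
closed_form = 2 ** 0.5 * 3 ** (-0.75)

print("maximiser:", s_best, "vs 1/sqrt(3) =", 1 / math.sqrt(3))
print("supremum :", grid_sup, "vs 2^(1/2) 3^(-3/4) =", closed_form)
```

The grid maximiser and the grid supremum should agree with $1/\sqrt3$ and $2^{1/2}3^{-3/4}$ up to the grid resolution.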
Exercise 3.9. 1°) Letting $x_n$ be as in (3.133), show, as a consequence of Finkelstein's Theorem 3.14, that, almost surely as $n \to \infty$,
$$x_n \rightsquigarrow \big[-\sqrt{t_0(1-t_0)},\ \sqrt{t_0(1-t_0)}\,\big],$$
and show that this result is in agreement with (3.136).
2°) Letting $f_n$ be as in (3.134), show, as a consequence of Theorem 3.17, that, almost surely as $n \to \infty$, $f_n \rightsquigarrow S$, and show that this result is in agreement with (3.136).
3.11. Application to density estimation

The theorems of the previous sections provide a number of applications to nonparametric functional estimation (see, e.g., [4, 13, 45, 46, 47, 15, 16, 17, 18, 23, 25, 29, 33, 34, 35, 45, 46, 50, 52, 53, 56, 72, 74, 77, 75, 78, 79, 87, 96, 105, 107, 108, 109, 110, 111, 115, 117, 118, 131, 133, 143, 149, 158, 159, 160, 161, 162, 165, 168, 169, 170, 171, 175, 176, 177, 178, 180, 196, 197, 198, 200, 202, 203, 208, 209, 210, 212]). We will limit ourselves to one example, showing how the preceding functional limit laws can be applied. Consider an iid sequence $X_1, X_2, \ldots$ of rv's with common distribution function $F(x) = P(X_1 \le x)$, having density $f(x)$, assumed to be positive and continuous on $J = (c', d') \subseteq \mathbb{R}$. We estimate $f(x)$, for $x \in I := [c, d] \subset J$, by the Akaike-Parzen-Rosenblatt kernel estimator (refer to [4, 163, 170])
$$f_{n,h}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big). \tag{3.138}$$
Here, $h > 0$ is a bandwidth, and $K(\cdot)$ a kernel, namely a function of bounded variation on $\mathbb{R}$, such that, for some $0 < M < \infty$,
(i) $K(t) = 0$ for $|t| \ge M$;
(ii) $\int_{\mathbb{R}} K(t)\,dt = 1$.
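As a concrete illustration of (3.138), here is a minimal Python sketch of the kernel estimator. The Epanechnikov kernel, the uniform sample, the seed and the bandwidth are all assumptions made for this example only; the text does not single out any of them. The kernel satisfies conditions (i) and (ii) with $M = 1$.

```python
import random

def epanechnikov(t: float) -> float:
    """Illustrative kernel: K(t) = 0 for |t| >= 1 and K integrates to 1."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0

def kde(x: float, sample, h: float, kernel=epanechnikov) -> float:
    """Kernel estimator (3.138): (n h)^{-1} * sum_i K((x - X_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

# Assumed toy data: uniform(0, 1) observations, so the true density is f = 1.
random.seed(0)
sample = [random.random() for _ in range(20_000)]
h = 0.05  # assumed bandwidth; points x below satisfy x +/- h inside (0, 1)

estimates = {x: kde(x, sample, h) for x in (0.25, 0.5, 0.75)}
for x, value in estimates.items():
    print(f"f_n,h({x}) = {value:.4f}  (true density 1)")
```

With these choices the estimates should fluctuate around 1, with a standard deviation of order $\{\int K^2 /(nh)\}^{1/2} \approx 0.025$.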
The following theorem was proved by Stute [200], under slightly more restrictive conditions. As stated, it is due to Deheuvels and Mason [66] (see also Deheuvels [52] and Deheuvels and Einmahl [60]).

Theorem 3.23. Assume that $\{h_n : n \ge 1\}$ is a sequence of positive constants, fulfilling
$$h_n \downarrow 0; \qquad nh_n \uparrow \infty; \qquad \frac{nh_n}{\log n} \to \infty; \qquad \frac{\log(1/h_n)}{\log_2 n} \to \infty. \tag{3.139}$$
Then, we have, almost surely,
$$\lim_{n\to\infty} \Big\{\frac{nh_n}{2\log(1/h_n)}\Big\}^{1/2} \sup_{x\in I} \pm\{f_{n,h_n}(x) - E f_{n,h_n}(x)\} = \Big\{\sup_{x\in I} f(x) \int_{\mathbb{R}} K^2(t)\,dt\Big\}^{1/2}. \tag{3.140}$$
Proof. We will establish (3.140) in the special case where $J = (0,1)$, $f(x) = 1$ for $0 < x < 1$, and $K(t) = 0$ for $t \notin [\tfrac14, \tfrac34]$. The arguments of the proof in the general case are essentially identical, with the addition of minor technicalities, which can be omitted. We let $n_0$ be such that $t \pm h_n \in J$ for all $t \in I$, when $n \ge n_0$. Recall the definition (3.112) of $\upsilon_{n,x}(u)$. Our assumptions imply that $X_n = U_n$ is uniformly distributed on $(0,1)$ for $n = 1, 2, \ldots$, so that we may write, for $n \ge n_0$,
$$f_{n,h_n}(x) - E f_{n,h_n}(x) = \frac{1}{h_n}\int_0^1 K\Big(\frac{x-t}{h_n}\Big)\,d\{U_n(t) - t\}$$
$$= \frac{1}{h_n\sqrt n}\int_0^1 K(u)\,d\{\alpha_n(x - h_n u) - \alpha_n(x)\}$$
$$= -\frac{1}{h_n\sqrt n}\int_0^1 \{\alpha_n(x - h_n u) - \alpha_n(x)\}\,dK(u)$$
$$= -\Big\{\frac{2(\log(1/h_n) + \log_2 n)}{nh_n}\Big\}^{1/2}\int_0^1 \upsilon_{n,x}(-u)\,dK(u)$$
$$= -(1+o(1))\Big\{\frac{2\log(1/h_n)}{nh_n}\Big\}^{1/2}\int_0^1 \upsilon_{n,x}(-u)\,dK(u),$$
after integrating by parts. Now, as follows from Theorem 3.19, we have $\delta_U(\{\upsilon_{n,x} : x \in I\}, S) \to 0$ a.s. It is easily checked that, whenever $f(u) \in S$, we have $f(-u) \in S$. Therefore, an easy argument shows that, almost surely as $n \to \infty$,
$$\Big\{\frac{nh_n}{2\log(1/h_n)}\Big\}^{1/2}\sup_{x\in I} \pm\{f_{n,h_n}(x) - E f_{n,h_n}(x)\} = -(1+o(1))\sup_{f\in S}\int_0^1 f(-u)\,dK(u)$$
$$\to -\sup_{f\in S}\int_0^1 f(u)\,dK(u) = \sup_{f\in S}\int_0^1 \dot f(u)K(u)\,du,$$
after integrating by parts. Finally, the Schwarz inequality shows that
$$\sup_{f\in S}\int_0^1 \dot f(u)K(u)\,du = \Big\{\int_0^1 K^2(u)\,du\Big\}^{1/2},$$
which suffices for proving (3.140) under our (simplified) assumptions.
At this point, we leave it to the reader to adapt the arguments used above in order to establish similar results for other nonparametric estimators of interest.

Exercise 3.10. Making use of Theorem 3.17, show that, whenever
$$h_n \downarrow 0; \qquad nh_n \uparrow \infty; \qquad \frac{nh_n}{\log_2 n} \to \infty,$$
we have, for each specified $x_0$ in the neighborhood of which $f$ is continuous, almost surely,
$$\limsup_{n\to\infty} \pm\Big\{\frac{nh_n}{2\log_2 n}\Big\}^{1/2}\{f_{n,h_n}(x_0) - E f_{n,h_n}(x_0)\} = \Big\{f(x_0)\int_{\mathbb{R}} K^2(t)\,dt\Big\}^{1/2}.$$
4. Auxiliary results

4.1. Some Gaussian process theory

Let $Z$ be a centered Gaussian random vector taking values in a separable Banach space $X$, with norm denoted by $|\cdot|_X$. Throughout, we will work under the assumption that

(G.1) $P(|Z|_X < \infty) = 1$.

In the sequel, we will mainly consider the case where $X = C[0,1]$, the set of all continuous functions on $[0,1]$, endowed with the sup-norm uniform topology $U$, defined by $|\cdot|_X = \|\cdot\|$, and, either, $Z = W$ is (the restriction to $[0,1]$ of) a standard Wiener process; or $Z = B$ is a Brownian bridge. The assumption (G.1) is clearly satisfied in either of these cases. Denoting by $\mathcal{B}_X$ the $\sigma$-algebra of Borel subsets of $X$, the distribution of $Z$ is defined on $\mathcal{B}_X$ via
$$P_Z(B) = P(Z \in B) \quad \forall B \in \mathcal{B}_X. \tag{4.1}$$
Denoting by $X^*$ the topological dual of $X$, that is, the space of all continuous linear forms $h^* : X \to \mathbb{R}$, one may define a linear mapping $J : X^* \to X$ by the Bochner integral
$$h^* \in X^* \mapsto J h^* = E(Z h^*(Z)) = \int_X z\,h^*(z)\,P_Z(dz). \tag{4.2}$$
The image space $H^* := J(X^*)$ is pre-Hilbertian with inner product defined by
$$\langle J g^*, J h^*\rangle_H = E(g^*(Z)h^*(Z)) = \int_X g^*(z)h^*(z)\,P_Z(dz), \quad \text{for } g^*, h^* \in X^*. \tag{4.3}$$
The reproducing kernel Hilbert space [RKHS] of $Z$ is then defined as the closure $H$ in $X$ of $H^*$ with respect to the Hilbert norm $|\cdot|_H = \sqrt{\langle\cdot,\cdot\rangle_H}$. Of special interest here is the unit ball of $H$, denoted by
$$K = \{h \in H : |h|_H \le 1\}. \tag{4.4}$$
The following assumption will be needed, in addition to (G.1). Throughout, we assume that

(G.2) $K$ is a compact subset of $(X, |\cdot|_X)$.

We will make use of a generalized version of the Cameron-Martin formula stated in the following theorem.

Theorem 4.1. For each $h \in H$, there exists a measurable linear form $\tilde h$ on $X$, fulfilling the equalities
$$P_Z(B + h) = \int_B \exp\big(\tilde h(x) - \tfrac12|h|_H^2\big)\,P_Z(dx) \quad \text{for each } B \in \mathcal{B}_X, \tag{4.5}$$
$$E\big(\tilde h^2(Z)\big) = \int_X \tilde h^2(x)\,P_Z(dx) = |h|_H^2. \tag{4.6}$$
Proof. See, e.g., Borell (1976).
The next proposition (see, e.g., Borell (1977)) states a useful consequence of Theorem 4.1.

Proposition 4.1. Let $h \in H$ and let $A \in \mathcal{B}_X$ denote a symmetric Borel subset of $X$. Then,
$$P_Z(A + h) \ge \exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A). \tag{4.7}$$

Proof. There is nothing to prove if $P_Z(A) = 0$. Assuming that $P_Z(A) > 0$, we set $B = A$ in (4.5), then make use of the Jensen inequality to obtain, by symmetry of $A$ and linearity of $\tilde h$ (which together imply that $\int_A \tilde h(x)\,P_Z(dx) = 0$), that
$$P_Z(A + h) = \exp\big(-\tfrac12|h|_H^2\big)\int_A \exp\big(\tilde h(x)\big)\,P_Z(dx)$$
$$= \exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A)\int_A \exp\big(\tilde h(x)\big)\,\frac{P_Z(dx)}{P_Z(A)}$$
$$\ge \exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A)\exp\Big(\int_A \tilde h(x)\,\frac{P_Z(dx)}{P_Z(A)}\Big)$$
$$\ge \exp\big(-\tfrac12|h|_H^2\big)\,P_Z(A),$$
which is (4.7).
The next statement is the celebrated isoperimetric inequality due to Borell (1975) and Sudakov and Tsyrelson (1978). Denote by
$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt, \tag{4.8}$$
the df of the standard $N(0,1)$ law, and by $\Phi^{-1} = \Phi^{\leftarrow}$ the corresponding quantile function, fulfilling
$$\Phi\big(\Phi^{-1}(u)\big) = u \quad \text{for all } 0 < u < 1. \tag{4.9}$$

Theorem 4.2. For each $A, B \in \mathcal{B}_X$ such that $B \cap (A + rK) = \emptyset$, we have
$$P_Z(B) \le 1 - \Phi\big(\Phi^{-1}(P_Z(A)) + r\big). \tag{4.10}$$

The following simple inequalities will be useful to evaluate the right-hand side of (4.10).

Lemma 4.1. We have, for every $t > 0$,
$$\frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1 - \frac{1}{t^2}\Big) \le 1 - \Phi(t) \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}}. \tag{4.11}$$
Moreover, for every $t \ge 0$, we have
$$1 - \Phi(t) \le \frac{2}{\sqrt{2\pi}}\,e^{-t^2/2} \le e^{-t^2/2}. \tag{4.12}$$
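Proof. By integrating by parts, we first observe that
$$1 - \Phi(t) = \frac{1}{\sqrt{2\pi}}\int_t^\infty e^{-x^2/2}\,dx = -\frac{1}{\sqrt{2\pi}}\int_t^\infty \frac{1}{x}\,d\{e^{-x^2/2}\}$$
$$= \frac{e^{-t^2/2}}{t\sqrt{2\pi}} + \frac{1}{\sqrt{2\pi}}\int_t^\infty \frac{1}{x^3}\,d\{e^{-x^2/2}\} \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}}.$$
By integrating once more by parts, we obtain likewise that
$$1 - \Phi(t) = \frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1 - \frac{1}{t^2}\Big) - \frac{3}{\sqrt{2\pi}}\int_t^\infty \frac{1}{x^5}\,d\{e^{-x^2/2}\} \ge \frac{e^{-t^2/2}}{t\sqrt{2\pi}}\Big(1 - \frac{1}{t^2}\Big).$$
The above inequalities readily yield (4.11). To establish (4.12), we infer from (4.11) that the inequality
$$1 - \Phi(t) \le \frac{e^{-t^2/2}}{t\sqrt{2\pi}} \le \frac{2}{\sqrt{2\pi}}\,e^{-t^2/2}$$
always holds when $t \ge 1$. On the other hand, when $0 \le t \le 1$, we may write
$$1 - \Phi(t) = \frac{1}{\sqrt{2\pi}}\Big\{\int_t^1 e^{-x^2/2}\,dx + \int_1^\infty e^{-x^2/2}\,dx\Big\} \le \frac{2}{\sqrt{2\pi}}\,e^{-t^2/2}.$$
By combining these two cases, we obtain readily (4.12) as sought.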
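The bounds (4.11) and (4.12) of Lemma 4.1 are easy to check numerically. The sketch below does so in Python, computing $1 - \Phi(t)$ through the complementary error function of the standard `math` module; the function names and the sample values of $t$ are illustrative assumptions.

```python
import math

def upper_tail(t: float) -> float:
    """1 - Phi(t), computed as erfc(t / sqrt(2)) / 2."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def mills_bounds(t: float):
    """Lower and upper bounds of (4.11); only meaningful for t > 0."""
    base = math.exp(-0.5 * t * t) / (t * math.sqrt(2.0 * math.pi))
    return base * (1.0 - 1.0 / (t * t)), base

def crude_bound(t: float) -> float:
    """Middle term of (4.12): (2 / sqrt(2 pi)) * exp(-t^2 / 2), for t >= 0."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * t * t)

for t in (0.5, 1.0, 2.0, 4.0, 8.0):
    lower, upper = mills_bounds(t)
    print(f"t={t}: {lower:.3e} <= {upper_tail(t):.3e} <= {upper:.3e}"
          f"  and <= {crude_bound(t):.3e}")
```

For small $t$ the lower bound in (4.11) is negative and therefore trivial; the two-sided bound becomes sharp as $t$ grows, the relative gap being $1/t^2$.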
4.2. A functional LIL for superpositions of Gaussian processes

To illustrate the Gaussian process methodology of §4.1, we give below a proof of an elementary version of the functional law of the iterated logarithm [FLIL] for superpositions of independent Gaussian processes. We refer to [93], [103], [134] for more refined results of this kind. We keep the notation of §4.1, and consider a sequence $Z_1, Z_2, \ldots$ of independent replicas of $Z$. We set further
$$Y_n = \frac{1}{\sqrt n}\sum_{i=1}^{n} Z_i. \tag{4.13}$$
The following notation will be needed. For each $\varepsilon > 0$ and $x \in X$, we set
$$V_{\varepsilon,x} = \{z \in X : |z - x|_X < \varepsilon\}. \tag{4.14}$$
For each $\varepsilon > 0$ and $A \in \mathcal{B}_X$, $A \ne \emptyset$, we set
$$A^{\varepsilon} = \{z \in X : \exists x \in A,\ |z - x|_X < \varepsilon\} = \bigcup_{x\in A} V_{\varepsilon,x}. \tag{4.15}$$
For convenience, we set $\emptyset^{\varepsilon} = \emptyset$.

Theorem 4.3. Under (G.1–2), with probability 1 for any $\varepsilon > 0$, there exists an $n_\varepsilon$ such that, for all $n \ge n_\varepsilon$,
$$(2\log_2 n)^{-1/2}\,Y_n \in K^{\varepsilon}. \tag{4.16}$$
Moreover, for each $z \in K$, we have, with probability 1,
$$\liminf_{n\to\infty} \big|(2\log_2 n)^{-1/2}\,Y_n - z\big|_X = 0. \tag{4.17}$$
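Prior to the proof of Theorem 4.3, we will establish the following useful result, usually referred to as Ottaviani's lemma.

Lemma 4.2. Let $\zeta_0$ and $\xi_1, \ldots, \xi_N$ denote independent random vectors taking values in the Banach space $(X, |\cdot|_X)$. Set $\zeta_i = \zeta_0 + \xi_1 + \cdots + \xi_i$ for $i = 1, \ldots, N$. Consider a Borel subset $A$ of $X$ and select an $\varepsilon > 0$. Suppose that, for each $1 \le i \le N$, $P(|\zeta_N - \zeta_i|_X \le \varepsilon) \ge 1/2$. Then, we have
$$P\big(\exists i = 0, \ldots, N : \zeta_i \notin A^{2\varepsilon}\big) \le 2P\big(\zeta_N \notin A^{\varepsilon}\big). \tag{4.18}$$

Proof. Introduce the events
$$A_i(\varepsilon) = \{\zeta_i \notin A^{\varepsilon}\} \quad \text{for } i = 0, \ldots, N,$$
$$B_i(\varepsilon) = \{|\zeta_N - \zeta_i|_X \le \varepsilon\} \quad \text{for } i = 0, \ldots, N.$$
Denote by $\bar A = \Omega - A$ the complement of an event $A$, and set $A_{-1}(2\varepsilon) = \emptyset$. We obtain readily that
$$P\Big(\bigcup_{i=0}^{N} A_i(2\varepsilon)\Big) = \sum_{i=0}^{N} P\big(A_i(2\varepsilon)\cap \bar A_{i-1}(2\varepsilon)\big)$$
$$\le 2\sum_{i=0}^{N} P\big(A_i(2\varepsilon)\cap \bar A_{i-1}(2\varepsilon)\big)\,P\big(|\zeta_N - \zeta_i|_X \le \varepsilon\big)$$
$$= 2\sum_{i=0}^{N} P\big(A_i(2\varepsilon)\cap \bar A_{i-1}(2\varepsilon)\cap B_i(\varepsilon)\big)$$
$$\le 2\sum_{i=0}^{N} P\big(A_i(2\varepsilon)\cap \bar A_{i-1}(2\varepsilon)\cap A_N(\varepsilon)\big) \le 2P\big(A_N(\varepsilon)\big),$$
which is (4.18).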
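Proof of Theorem 4.3. Part 1 – Outer Bounds. We select a $\gamma > 0$ and an $\varepsilon > 0$. Introduce the sequence of integers $n_k = \lfloor(1+\gamma)^k\rfloor$ for $k = 0, 1, \ldots$, and set $a_k = \varepsilon\sqrt{2n_k\log_2 n_k}$. Consider the events
$$C_k = \Big\{\exists n \in (n_{k-1}, n_k] : n^{1/2}Y_n \notin \big\{n_k^{1/2}\sqrt{2\log_2 n_k}\,K\big\}^{2a_k}\Big\}, \tag{4.19}$$
$$D_k = \Big\{n_k^{1/2}Y_{n_k} \notin \big\{n_k^{1/2}\sqrt{2\log_2 n_k}\,K\big\}^{a_k}\Big\}. \tag{4.20}$$
Our assumptions imply that, ultimately as $k \to \infty$, for each $n \in (n_{k-1}, n_k]$,
$$P\big(\big|n^{1/2}Y_n - n_k^{1/2}Y_{n_k}\big|_X \ge a_k\big) = P\big((n_k - n)^{1/2}|Z|_X \ge \varepsilon\sqrt{2n_k\log_2 n_k}\big)$$
$$\le P\Big(|Z|_X \ge \varepsilon\Big(\frac{n_k}{n_{k-1}}\Big)^{1/2}\sqrt{2\log_2 n_k}\Big)$$
$$\le P\big(|Z|_X \ge \varepsilon(1 + \tfrac12\gamma)\sqrt{2\log_2 n_k}\big) \le \frac12.$$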
We may therefore apply Ottaviani's lemma (Lemma 4.2) to show that, for all $k$ sufficiently large,
$$P(C_k) \le 2P(D_k). \tag{4.21}$$
We now turn to the evaluation of $P(D_k)$. Toward this end, we make use of the Isoperimetric Inequality (Theorem 4.2) with choices of $A$, $B$ and $r$ which are specified below. We set, for a constant $m > 0$ which will be specified later on,
$$A = D(0,m) := \{x \in X : |x|_X < m\},$$
$$B = X - \{A + rK\} = X - \{x + h : |x|_X < m,\ |h|_H \le r\}.$$
In view of (G.1), we may now choose $m > 0$ so large that $P_Z(A) = P(|Z|_X < m) > 1/2$, which, in turn, implies that $\delta := \Phi^{-1}(P_Z(A)) > 0$. This, when combined with (4.10) and (4.12), shows that, for all $r > 0$,
$$P_Z(B) \le 1 - \Phi\big(\Phi^{-1}(P_Z(A)) + r\big) \le \exp\big(-\tfrac12(\delta + r)^2\big). \tag{4.22}$$
We now choose $r = \sqrt{2\log_2 n_k}$ in (4.22). In view of the definition (4.20) of $D_k$, we observe that, ultimately as $k \to \infty$,
$$P(D_k) = P\Big(Z \notin \big\{\sqrt{2\log_2 n_k}\,K\big\}^{\varepsilon\sqrt{2\log_2 n_k}}\Big) \le P\Big(Z \notin \big\{\sqrt{2\log_2 n_k}\,K\big\}^{m}\Big)$$
$$= P_Z(B) \le \exp\Big(-\tfrac12\big(\delta + \sqrt{2\log_2 n_k}\big)^2\Big)$$
$$= \frac{1}{\log n_k}\exp\Big(-\frac{\delta^2}{2} - \delta\sqrt{2\log_2 n_k}\Big) = o\Big(\frac{1}{(\log n_k)(\log_2 n_k)^2}\Big).$$
Since $\log n_k = (1+o(1))\,k\log(1+\gamma)$ as $k \to \infty$, we infer from this last expression, when combined with (4.21), that
$$\sum_{k=1}^{\infty} P(C_k) < \infty.$$
The Borel-Cantelli lemma implies in turn that, with probability 1 for all $k$ sufficiently large, we have, uniformly over all $n$ such that $n_{k-1} < n \le n_k$,
$$\frac{Y_n}{\sqrt{2\log_2 n}} \in \Big\{\Big(\frac{n_k\log_2 n_k}{n_{k-1}\log_2 n_{k-1}}\Big)^{1/2}K\Big\}^{2\varepsilon\sqrt{n_k\log_2 n_k}/\sqrt{n_{k-1}\log_2 n_{k-1}}} \subseteq \big\{(1+\gamma)K\big\}^{2\varepsilon(1+\gamma)}. \tag{4.23}$$
Now, we make use of (G.2) to show that, for any fixed $\varepsilon' > 0$, we may select $\gamma > 0$ and $\varepsilon > 0$ so small that
$$\big\{(1+\gamma)K\big\}^{2\varepsilon(1+\gamma)} \subseteq K^{\varepsilon'}.$$
This, when combined with (4.23), suffices for (4.16).
Part 2 – Inner Bounds. We first note that the assumption (G.2) implies that
$$M := \sup_{g\in K} |g|_X < \infty. \tag{4.24}$$
Let, as in Part 1, $n_k = \lfloor(1+\gamma)^k\rfloor$, for $k = 0, 1, \ldots$. Under the assumptions of the theorem, it is enough to show that, for each specified $\varepsilon > 0$ and $h \in K$ with $1 - 2\eta := |h|_H^2 < 1$, there exists a $\gamma > 0$ such that, almost surely,
$$\liminf_{k\to\infty} \big|(2\log_2 n_k)^{-1/2}Y_{n_k} - h\big|_X < \varepsilon. \tag{4.25}$$
In view of (4.24), and making use of Part 1, we obtain readily that, almost surely,
$$\limsup_{k\to\infty} \big|(2n_k\log_2 n_k)^{-1/2}\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big|_X$$
$$= \lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{(2n_{k-1}\log_2 n_{k-1})^{-1/2}}\ \limsup_{k\to\infty}\big|(2\log_2 n_{k-1})^{-1/2}Y_{n_{k-1}} - h\big|_X$$
$$\le (1+\gamma)^{-1/2}\sup_{g\in K}|g - h|_X \le 2M(1+\gamma)^{-1/2}.$$
Therefore, there exists a $\gamma_1 > 0$ such that, whenever $\gamma \ge \gamma_1$, almost surely,
$$\limsup_{k\to\infty} \big|(2n_k\log_2 n_k)^{-1/2}\sqrt{n_{k-1}}\,Y_{n_{k-1}}\big|_X < \tfrac12\varepsilon. \tag{4.26}$$
Consider next the mutually independent events
$$E_k := \Big\{\Big|\big(2(n_k - n_{k-1})\log_2(n_k - n_{k-1})\big)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k} - \sqrt{n_{k-1}}\,Y_{n_{k-1}}\big) - h\Big|_X < \tfrac18\varepsilon\Big\}. \tag{4.27}$$
As follows from (4.7) in combination with (G.1), we obtain therefore that, ultimately as $k \to \infty$,
$$P(E_k) = P\Big(\big|Z - \big(2\log_2(n_k - n_{k-1})\big)^{1/2}h\big|_X < \tfrac18\varepsilon\big(2\log_2(n_k - n_{k-1})\big)^{1/2}\Big)$$
$$\ge \exp\Big(-\tfrac12|h|_H^2 \times 2\log_2(n_k - n_{k-1})\Big)\,P\Big(|Z|_X < \tfrac18\varepsilon\big(2\log_2(n_k - n_{k-1})\big)^{1/2}\Big)$$
$$\ge \tfrac12\exp\Big(-\tfrac12(1 - 2\eta)\times 2\log_2(n_k - n_{k-1})\Big) \ge \frac{1}{k^{1-\eta}}.$$
Since this entails that
$$\sum_{k=1}^{\infty} P(E_k) = \infty,$$
the Borel-Cantelli lemma implies, in turn, that $P(E_k \text{ i.o.}) = 1$. Now, by the definition (4.27) of $E_k$, it follows that, almost surely,
$$\liminf_{k\to\infty}\Big|\big(2n_k\log_2 n_k\big)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k} - \sqrt{n_{k-1}}\,Y_{n_{k-1}}\big) - h\Big|_X$$
$$\le \tfrac18\varepsilon\,\lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{\big(2(n_k - n_{k-1})\log_2(n_k - n_{k-1})\big)^{-1/2}} + |h|_X\Big\{1 - \lim_{k\to\infty}\frac{(2n_k\log_2 n_k)^{-1/2}}{\big(2(n_k - n_{k-1})\log_2(n_k - n_{k-1})\big)^{-1/2}}\Big\}$$
$$\le \tfrac14\varepsilon\sqrt{\frac{\gamma}{1+\gamma}} + M\Big\{1 - \sqrt{\frac{\gamma}{1+\gamma}}\Big\}. \tag{4.28}$$
In view of (4.28), we see that there exists a $\gamma_2 > 0$ such that, whenever $\gamma \ge \gamma_2$, we have, almost surely,
$$\liminf_{k\to\infty}\Big|\big(2n_k\log_2 n_k\big)^{-1/2}\big(\sqrt{n_k}\,Y_{n_k} - \sqrt{n_{k-1}}\,Y_{n_{k-1}}\big) - h\Big|_X < \tfrac12\varepsilon. \tag{4.29}$$
By combining (4.26) with (4.29), we obtain readily that (4.25) holds for all $\gamma \ge \gamma_1 \vee \gamma_2$. This concludes the proof of Theorem 4.3.

Exercise 4.1. Prove (3.87) and (3.88).

4.3. Karhunen-Loève expansions

We recall the following facts about Karhunen-Loève [KL] expansions (see, e.g., [3], [7], [119] and [122]). Denote by $\{Z(t) : 0 < t < 1\}$ a centered Gaussian process with covariance function $\{R(s,t) = E(Z(s)Z(t)) : 0 < s, t < 1\}$, fulfilling
$$0 < \int_0^1 R(t,t)\,dt < \infty. \tag{4.30}$$
Then, there exist nonnegative constants $\{\lambda_k : k \ge 1\}$, $\lambda_k \downarrow$, together with functions $\{e_k(t) : k \ge 1\} \subset L^2(0,1)$ of $t \in (0,1)$ such that the properties (K.1–2–3–4) below hold.

(K.1) For all $i \ge 1$ and $k \ge 1$,
$$\int_0^1 e_i(t)e_k(t)\,dt = \begin{cases} 1 & \text{if } i = k,\\ 0 & \text{if } i \ne k.\end{cases}$$

(K.2) The $\{\lambda_k, e_k(\cdot) : k \ge 1\}$ form a complete set of solutions of the Fredholm equation in $(\lambda, e(\cdot))$, $\lambda \ne 0$,
$$\lambda e(t) = \int_0^1 R(s,t)e(s)\,ds \quad \text{for } 0 < t < 1 \quad \text{and} \quad \int_0^1 e^2(t)\,dt = 1. \tag{4.31}$$
The $\lambda_k$ (resp. $e_k(\cdot)$) are the eigenvalues (resp. eigenfunctions) of the Fredholm transformation
$$f \in L^2(0,1) \mapsto Tf \in L^2(0,1) : \quad Tf(t) = \int_0^1 R(s,t)f(s)\,ds, \quad t \in (0,1).$$

(K.3) The series expansion
$$R(s,t) = \sum_{k\ge 1} \lambda_k e_k(s)e_k(t) \quad \text{for } 0 < s, t < 1, \tag{4.32}$$
is convergent in $L^2\big((0,1)^2\big)$.

(K.4) There exist independent and identically distributed [i.i.d.] $N(0,1)$ random variables $\{\omega_k : k \ge 1\}$ such that the Karhunen-Loève [KL] expansion
$$Z(t) = \sum_{k\ge 1} \sqrt{\lambda_k}\,\omega_k e_k(t), \quad 0 < t < 1, \tag{4.33}$$
of $Z(\cdot)$ holds, with the series (4.33) converging a.s., and in integrated mean square on $(0,1)$.

Remark 4.1. 1°) The sequence $\{\lambda_k, e_k(\cdot) : k \ge 1\}$ in (K.1–2–3–4) may very well be finite. Below, we will implicitly exclude this case and specialize in infinite KL expansions with $k$ ranging through $\mathbb{N}^* = \{1, 2, \ldots\}$, with $\lambda_1 > \lambda_2 > \cdots > 0$.
2°) If, in addition to (4.30), $Z(\cdot)$ is a.s. continuous on $[0,1]$ with covariance function $R(\cdot,\cdot)$ continuous on $[0,1]^2$, then we may choose the functions $\{e_k(\cdot) : k \ge 1\}$ in the KL expansion (4.33) to be continuous on $[0,1]$. The series (4.32) is then absolutely and uniformly convergent on $[0,1]^2$, and the series (4.33) is a.s. uniformly convergent on $[0,1]$ (see, e.g., [3]).

There are few Gaussian processes of interest with respect to statistics for which the KL expansion is known through explicit values of $\{\lambda_k : k \ge 1\}$, and with simple forms of the functions $\{e_k(\cdot) : k \ge 1\}$ (see, e.g., [152] for a review). It is useful to have a precise knowledge of the $\lambda_k$'s, since we infer from (4.33) that
$$D^2 = \int_0^1 Z^2(t)\,dt = \sum_{k\ge 1} \lambda_k\omega_k^2. \tag{4.34}$$
This readily implies (see, e.g., (6.23), p. 200 in [119]) that the moment-generating function of the distribution of $D^2$ is given by
$$\psi_{D^2}(z) = E(\exp(zD^2)) = \prod_{k=1}^{\infty}\Big\{\frac{1}{1 - 2z\lambda_k}\Big\}^{1/2} \quad \text{for } \mathrm{Re}(z) < \frac{1}{2\lambda_1}. \tag{4.35}$$
We have $|E(\exp(zD^2))| < \infty$ for all $z \in \mathbb{C}$ with $\mathrm{Re}(z) < \frac{1}{2\lambda_1}$, subject to the additional conditions that
$$\text{(i)}\quad \lambda_1 > \lambda_2 > \cdots > 0, \qquad \text{and} \qquad \text{(ii)}\quad \sum_{k=1}^{\infty}\lambda_k = \sum_{k=1}^{\infty}\frac{1}{\gamma_k} < \infty, \tag{4.36}$$
where $\gamma_k = 1/\lambda_k$ for $k \ge 1$.
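To make (4.34)–(4.35) concrete, the following Python sketch evaluates the truncated product in (4.35) for one explicit spectrum. It assumes the classical Brownian-bridge eigenvalues $\lambda_k = 1/(k\pi)^2$, for which the infinite product has the closed form $\{\sqrt{2z}/\sin\sqrt{2z}\}^{1/2}$ (a consequence of the product formula for the sine), and for which $\sum_k \lambda_k = \int_0^1 t(1-t)\,dt = 1/6$. The truncation level and the value of $z$ are arbitrary choices for illustration.

```python
import math

def bridge_eigenvalues(n_terms: int):
    """Assumed spectrum: lambda_k = 1 / (k pi)^2 (Brownian bridge)."""
    return [1.0 / (k * math.pi) ** 2 for k in range(1, n_terms + 1)]

def mgf_truncated(z: float, lambdas) -> float:
    """Truncated product (4.35); requires real z < 1 / (2 lambda_1)."""
    log_psi = -0.5 * sum(math.log1p(-2.0 * z * lam) for lam in lambdas)
    return math.exp(log_psi)

lambdas = bridge_eigenvalues(200_000)

# E(D^2) equals the sum of the eigenvalues, i.e. the integral of R(t, t).
print("sum of eigenvalues:", sum(lambdas), "vs 1/6 =", 1.0 / 6.0)

z = 1.0  # admissible, since 1 / (2 lambda_1) = pi^2 / 2 > 1
closed_form = math.sqrt(math.sqrt(2.0 * z) / math.sin(math.sqrt(2.0 * z)))
print("psi(z) truncated:", mgf_truncated(z, lambdas), "closed form:", closed_form)
```

Working with logarithms (`math.log1p`) avoids the loss of accuracy that a direct product of factors close to 1 would cause.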
Since $D^2$ is a weighted sum of independent $\chi_1^2$ components, its distribution is easy to compute under (4.36) via the Smirnov formula (see, e.g., [43], [150], [151], [189], [192]). For $t > 0$,
$$P\big(D^2 > t\big) = \frac{1}{\pi}\sum_{k=1}^{\infty}(-1)^{k+1}\int_{\gamma_{2k-1}}^{\gamma_{2k}}\frac{e^{-tu/2}\,du}{u\sqrt{|F(u)|}} \tag{4.37}$$
$$= \frac{1+o(1)}{\pi}\int_{\gamma_1}^{\gamma_2}\frac{e^{-tu/2}\,du}{u\sqrt{|F(u)|}} \quad \text{as } t \to \infty,$$
where $F(u)$ is the Fredholm determinant defined, under (4.36), by
$$F(u) = \prod_{k=1}^{\infty}\big(1 - u\lambda_k\big) = \prod_{k=1}^{\infty}\Big(1 - \frac{u}{\gamma_k}\Big) \quad \text{for } u \ge 0. \tag{4.38}$$
In view of (4.35)–(4.38), we note that
$$\psi_{D^2}(z) = \{F(2z)\}^{-1/2} \quad \text{for } \mathrm{Re}(z) < \frac{1}{2\lambda_1}. \tag{4.39}$$
We refer to Martynov ([152], [150]) for a study of the convergence of the series (4.37), together with versions of this formula holding when some of the consecutive terms of the sequence $\lambda_1 \ge \lambda_2 \ge \cdots > 0$ are equal.
0. 1°) Show that $y(\cdot)$ is continuous on $[0,1]$, then, by a recursion, that $y$ is twice continuously differentiable on $(0,1)$ and such that $y(0) = 0$ and $y'(1) = 0$.
2°) Show that $y$ is a solution of the differential equation $\lambda y'' + y = 0$, and that the only possible values of $\lambda$ are of the form $\lambda = 1/\{(k - \tfrac12)\pi\}^2$ for $k = 1, 2, \ldots$. Conclude that (4.40) holds.
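The two steps of this exercise can be sanity-checked numerically. The Python sketch below takes the candidate solutions $y_k(t) = \sin((k-\tfrac12)\pi t)$ with $\lambda_k = 1/\{(k-\tfrac12)\pi\}^2$ (the functional form of $y_k$ is an assumption of this example, chosen to match the boundary conditions), and verifies by finite differences that $\lambda y'' + y = 0$, $y(0) = 0$ and $y'(1) = 0$. It also checks that $\sum_k \lambda_k = 1/2$, the value of $\int_0^1 t\,dt$ one expects from (4.30) when $R(s,t) = s \wedge t$.

```python
import math

def lam(k: int) -> float:
    """Candidate eigenvalue 1 / ((k - 1/2) pi)^2."""
    return 1.0 / ((k - 0.5) * math.pi) ** 2

def y(k: int, t: float) -> float:
    """Candidate solution sin((k - 1/2) pi t) (assumed form)."""
    return math.sin((k - 0.5) * math.pi * t)

def ode_residual(k: int, t: float, eps: float = 1e-4) -> float:
    """Central finite-difference value of lambda * y'' + y at t."""
    second = (y(k, t + eps) - 2.0 * y(k, t) + y(k, t - eps)) / (eps * eps)
    return lam(k) * second + y(k, t)

def derivative_at_one(k: int, eps: float = 1e-5) -> float:
    """Central finite-difference value of y'(1)."""
    return (y(k, 1.0 + eps) - y(k, 1.0 - eps)) / (2.0 * eps)

for k in (1, 2, 3):
    worst = max(abs(ode_residual(k, t / 10.0)) for t in range(1, 10))
    print(k, "max ODE residual:", worst, " y'(1):", derivative_at_one(k))

eigen_sum = sum(lam(k) for k in range(1, 200_001))
print("sum of eigenvalues:", eigen_sum, "vs 1/2")
```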
4.5. KL expansions for weighted Wiener processes and Brownian bridges

In this section, we provide KL expansions for weighted Wiener processes and Brownian bridges due to Deheuvels and Martynov [63]. Throughout, $\{W(t) : t \ge 0\}$ and $\{B(t) : 0 \le t \le 1\}$ denote, respectively, a Wiener process and a Brownian bridge. These processes are centered, with covariance functions
$$E(W(s)W(t)) = s \wedge t \quad \text{for } s, t \ge 0, \qquad \text{and} \qquad E(B(s)B(t)) = s \wedge t - st \quad \text{for } 0 \le s, t \le 1.$$
Denote by $\{\psi(t) : 0 < t < 1\}$ a positive and continuous function on $(0,1)$, whose definition will, at times, be extended by continuity to $(0,1]$ or $[0,1]$. Below, we will work under additional conditions taken among the following.

(C.1) $\psi(\cdot)$ is continuous on $(0,1]$;
(C.2) (i) $\lim_{t\downarrow 0} t\psi(t) = 0$; (ii) $\int_0^1 t\psi^2(t)\,dt < \infty$;
(C.3) (i) $\lim_{t\downarrow 0} t\psi(t) = \lim_{t\uparrow 1}(1-t)\psi(t) = 0$; (ii) $\int_0^1 t(1-t)\psi^2(t)\,dt < \infty$.

It is readily checked that (C.2)(ii) (resp. (C.3)(ii)) is the version of (4.30) corresponding to $Z(t) = Z_1(t)$ (resp. $Z(t) = Z_2(t)$), where
$$Z_1(t) = \psi(t)W(t) \quad \text{and} \quad Z_2(t) = \psi(t)B(t) \quad \text{for } 0 < t < 1.$$
To obtain the KL expansions of $Z_1(\cdot)$, $Z_2(\cdot)$, we will use the following theorems, in the spirit of Kac and Siegert ([121], [122]), and Kac (see, e.g., pp. 199-200 in [119] and Section 2 in [120]).

Theorem 4.4. Assume (C.1–2). Set $Z(t) = Z_1(t) = \psi(t)W(t)$ for $0 < t \le 1$. Then, the $\{(\lambda_k, e_k(\cdot)) : k \ge 1\}$ in the KL expansion of $Z(\cdot)$ are obtained by setting $\lambda = 1/\gamma$ and $e(t) = y(t)\psi(t)$, where $y(\cdot)$ is a solution, continuous on $[0,1]$ and twice continuously differentiable on $(0,1]$, of the differential equation
$$y''(t) + \gamma\psi^2(t)y(t) = 0, \tag{4.48}$$
subject to $\gamma > 0$ and with limit conditions
$$y(0) = 0 \quad \text{and} \quad y'(1) = 0. \tag{4.49}$$

Theorem 4.5. Assume (C.3). Set $Z(t) = Z_2(t) = \psi(t)B(t)$ for $0 < t < 1$. Then, the $\{(\lambda_k, e_k(\cdot)) : k \ge 1\}$ in the KL expansion of $Z(\cdot)$ are obtained by setting $\lambda = 1/\gamma$ and $e(t) = y(t)\psi(t)$, where $y(\cdot)$ is a solution, continuous on $[0,1]$ and twice continuously differentiable on $(0,1)$, of the differential equation
$$y''(t) + \gamma\psi^2(t)y(t) = 0, \tag{4.50}$$
subject to $\gamma > 0$ and with limit conditions
$$y(0) = 0 \quad \text{and} \quad y(1) = 0. \tag{4.51}$$
Proof. The details of the proofs of Theorems 4.4 and 4.5 are given in Deheuvels and Martynov [63].

In the sequel, we will concentrate on the particular case where, for some constant $\beta \in \mathbb{R}$,
$$\psi(t) = t^{\beta} \quad \text{for } 0 < t \le 1. \tag{4.52}$$
We note that (C.1–2–3) hold under (4.52) iff $\beta > -1$. In particular,
$$\beta > -1 \iff \int_0^1 t\psi^2(t)\,dt < \infty \iff \int_0^1 t(1-t)\psi^2(t)\,dt < \infty.$$
For $\nu > -1$, consider the Bessel function $J_\nu(\cdot)$ of the first kind and index $\nu$ (see §4.6 below for details on the definition and properties of $J_\nu(\cdot)$). For $\nu > -1$, the positive zeros of $J_\nu(\cdot)$ (solutions of $J_\nu(z) = 0$) form an infinite sequence, denoted hereafter by $0 < z_{\nu,1} < z_{\nu,2} < \cdots$. These zeros are interlaced with the zeros $0 < z_{\nu+1,1} < z_{\nu+1,2} < \cdots$ of $J_{\nu+1}(\cdot)$ (see, e.g., [207], p. 479), in such a way that
$$0 < z_{\nu,1} < z_{\nu+1,1} < z_{\nu,2} < z_{\nu+1,2} < z_{\nu,3} < \cdots. \tag{4.53}$$
The next theorems provide KL expansions for $\{t^{\beta}W(t) : 0 \le t \le 1\}$ and $\{t^{\beta}B(t) : 0 \le t \le 1\}$.

Theorem 4.6. Let $\{W(t) : t \ge 0\}$ denote a Wiener process. Then, for each $\beta = \frac{1}{2\nu} - 1 > -1$, or equivalently, for each $\nu = 1/(2(1+\beta)) > 0$, the Karhunen-Loève expansion of $\{t^{\beta}W(t) : 0 < t \le 1\}$ is given by
$$t^{\beta}W(t) = t^{\frac{1}{2\nu}-1}W(t) = \sum_{k=1}^{\infty}\sqrt{\lambda_k}\,\omega_k e_k(t), \tag{4.54}$$
where $\{\omega_k : k \ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k = 1, 2, \ldots$,
$$\lambda_k = \Big\{\frac{2\nu}{z_{\nu-1,k}}\Big\}^2, \qquad e_k(t) = t^{\frac{1}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu-1,k}\,t^{\frac{1}{2\nu}}\big)}{\sqrt{\nu}\,J_\nu(z_{\nu-1,k})} \quad \text{for } 0 < t \le 1. \tag{4.55}$$

Theorem 4.7. Let $\{B(t) : 0 \le t \le 1\}$ denote a Brownian bridge. Then, for each $\beta = \frac{1}{2\nu} - 1 > -1$, or equivalently, for each $\nu = 1/(2(1+\beta)) > 0$, the Karhunen-Loève expansion of $\{t^{\beta}B(t) : 0 < t \le 1\}$ is given by
$$t^{\beta}B(t) = t^{\frac{1}{2\nu}-1}B(t) = \sum_{k=1}^{\infty}\sqrt{\lambda_k}\,\omega_k e_k(t), \tag{4.56}$$
where $\{\omega_k : k \ge 1\}$ are i.i.d. $N(0,1)$ random variables, and, for $k = 1, 2, \ldots$,
$$\lambda_k = \Big\{\frac{2\nu}{z_{\nu,k}}\Big\}^2, \qquad e_k(t) = t^{\frac{1}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu,k}\,t^{\frac{1}{2\nu}}\big)}{\sqrt{\nu}\,J_{\nu-1}(z_{\nu,k})} \quad \text{for } 0 < t \le 1. \tag{4.57}$$
The KL expansion of $\{t^{\beta}B(t) : 0 < t \le 1\}$ in Theorem 4.7 is known for $\beta = 0$ ($\Leftrightarrow \nu = 1/2$) (see, e.g., [5] and §4.4, pp. 30-32 in [85]). For $\beta = -1/2$ ($\Leftrightarrow \nu = 1$), we refer to Scott [179]. In the general case where $\beta > -1/2$ ($\Leftrightarrow 0 < \nu < 1$), this KL expansion turns out to be equivalent to a KL expansion given in Li [140].
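As a check on the formulas (4.55) and (4.57), one may specialise them to $\nu = 1/2$ (that is, $\beta = 0$), where the Bessel functions and their zeros are elementary: $J_{1/2}(x) = \sqrt{2/(\pi x)}\sin x$, $J_{-1/2}(x) = \sqrt{2/(\pi x)}\cos x$, $z_{-1/2,k} = (k-\tfrac12)\pi$ and $z_{1/2,k} = k\pi$. The Python sketch below evaluates both formulas in this case and compares them with the familiar sine eigenfunctions of the Wiener process and of the Brownian bridge; the function names are illustrative.

```python
import math

NU = 0.5  # corresponds to beta = 1/(2 nu) - 1 = 0, i.e. no weight

def j_half(x: float) -> float:
    """J_{1/2}(x) = sqrt(2 / (pi x)) sin x."""
    return math.sqrt(2.0 / (math.pi * x)) * math.sin(x)

def j_minus_half(x: float) -> float:
    """J_{-1/2}(x) = sqrt(2 / (pi x)) cos x."""
    return math.sqrt(2.0 / (math.pi * x)) * math.cos(x)

def wiener_pair(k: int, t: float):
    """(lambda_k, e_k(t)) from (4.55) with nu = 1/2, z = z_{-1/2,k}."""
    z = (k - 0.5) * math.pi
    power = 1.0 / (2.0 * NU)
    e = t ** (power - 0.5) * j_half(z * t ** power) / (math.sqrt(NU) * j_half(z))
    return (2.0 * NU / z) ** 2, e

def bridge_pair(k: int, t: float):
    """(lambda_k, e_k(t)) from (4.57) with nu = 1/2, z = z_{1/2,k}."""
    z = k * math.pi
    power = 1.0 / (2.0 * NU)
    e = t ** (power - 0.5) * j_half(z * t ** power) / (math.sqrt(NU) * j_minus_half(z))
    return (2.0 * NU / z) ** 2, e

for k in (1, 2, 3):
    t = 0.3
    print(k, wiener_pair(k, t), (-1) ** (k + 1) * math.sqrt(2) * math.sin((k - 0.5) * math.pi * t))
    print(k, bridge_pair(k, t), (-1) ** k * math.sqrt(2) * math.sin(k * math.pi * t))
```

Up to the sign $(-1)^{k+1}$ (resp. $(-1)^k$), which is immaterial for an eigenfunction, one recovers $e_k(t) = \sqrt2\sin((k-\tfrac12)\pi t)$ with $\lambda_k = 1/\{(k-\tfrac12)\pi\}^2$ for $W$, and $e_k(t) = \sqrt2\sin(k\pi t)$ with $\lambda_k = 1/(k\pi)^2$ for $B$.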
The simple observation that the $\{\sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho}) : k \ge 1\}$ are orthonormal in $L^2(0,1)$ whenever such is the case for the $\{e_k(t) : k \ge 1\}$ (see, e.g., [145]) allows us to derive, via (4.58)–(4.59)–(4.60) below, through the change of scale $t \to t^{\rho}$, a series of variants of the KL expansions of Theorems 4.6–4.7. We will make use of the convention of writing these KL expansions under the form
$$Z(t) = \sum_{k=1}^{\infty}\{\lambda_k\}^{1/2}\,\omega_k e_k(t), \quad 0 < t \le 1, \tag{4.58}$$
where, as in (4.33), $\lambda_k$ and $e_k(\cdot)$ for $k \ge 1$ are, respectively, the eigenvalues and eigenfunctions pertaining to $\{Z(t) : 0 < t \le 1\}$. The arguments above show that, whenever the KL expansion (4.58) holds, then, for each choice of $\rho > 0$, the KL expansion of $t^{\frac12(\rho-1)}Z(t^{\rho})$ on $(0,1]$ is given by
$$t^{\frac12(\rho-1)}Z(t^{\rho}) = \sum_{k=1}^{\infty}\{\lambda_k/\rho\}^{1/2}\,\omega_k\,\big\{\sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho})\big\}, \quad 0 < t \le 1, \tag{4.59}$$
with eigenvalues $\lambda_k^*$ and eigenfunctions $e_k^*(\cdot)$ for $k = 1, 2, \ldots$, given by
$$\lambda_k^* = \lambda_k/\rho \quad \text{and} \quad e_k^*(t) = \sqrt{\rho}\,t^{\frac12(\rho-1)}e_k(t^{\rho}). \tag{4.60}$$
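The orthonormality claim behind (4.60) follows from the change of variable $s = t^{\rho}$, since $\rho t^{\rho-1}\,dt = ds$. It can also be checked numerically, as in the following Python sketch. The starting family $e_k(t) = \sqrt2\sin(k\pi t)$, the value $\rho = 3$ and the midpoint quadrature rule are assumptions chosen for illustration.

```python
import math

def e(k: int, t: float) -> float:
    """An orthonormal family in L^2(0, 1): sqrt(2) sin(k pi t)."""
    return math.sqrt(2.0) * math.sin(k * math.pi * t)

def e_star(k: int, t: float, rho: float) -> float:
    """Rescaled function of (4.60): sqrt(rho) t^((rho - 1)/2) e_k(t^rho)."""
    return math.sqrt(rho) * t ** (0.5 * (rho - 1.0)) * e(k, t ** rho)

def inner(f, g, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the L^2(0, 1) inner product of f and g."""
    step = 1.0 / n
    return step * sum(f((i + 0.5) * step) * g((i + 0.5) * step) for i in range(n))

rho = 3.0
indices = (1, 2, 3)
gram = [[inner(lambda t: e_star(i, t, rho), lambda t: e_star(j, t, rho))
         for j in indices] for i in indices]

for row in gram:
    print(["%.6f" % value for value in row])
```

The printed Gram matrix should be close to the identity, up to quadrature error.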
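By combining (4.58)–(4.59)–(4.60) with Theorems 4.6–4.7, we obtain readily in (4.61)–(4.62) below the KL expansions of $t^{\theta}W(t^{\rho})$ and $t^{\theta}B(t^{\rho})$ (under the form (4.58)). For any $\rho > 0$, $\theta > -\tfrac12(\rho+1)$, and $\nu = \rho/(2\theta + \rho + 1) > 0$, we obtain
$$t^{\theta}W(t^{\rho}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{2\nu}{z_{\nu-1,k}\sqrt{\rho}}\Big\}\Big\{\sqrt{\rho}\,t^{\frac{\rho}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu-1,k}\,t^{\frac{\rho}{2\nu}}\big)}{\sqrt{\nu}\,J_\nu(z_{\nu-1,k})}\Big\}, \tag{4.61}$$
$$t^{\theta}B(t^{\rho}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{2\nu}{z_{\nu,k}\sqrt{\rho}}\Big\}\Big\{\sqrt{\rho}\,t^{\frac{\rho}{2\nu}-\frac12}\,\frac{J_\nu\big(z_{\nu,k}\,t^{\frac{\rho}{2\nu}}\big)}{\sqrt{\nu}\,J_{\nu-1}(z_{\nu,k})}\Big\}. \tag{4.62}$$
In particular, by setting $\rho = 1$, $\theta = \beta$ and $\nu = 1/(2(\beta+1))$ in (4.61)–(4.62), we get (4.54)–(4.55) and (4.56)–(4.57).

Of special interest here is the choice of $\rho = 2\nu$ and $\theta = \tfrac12 - \nu$ in (4.61)–(4.62). For these values of the constants $\rho$, $\theta$, we obtain that, for each $\nu > 0$, the KL expansions of $t^{\frac12-\nu}W(t^{2\nu})$ and $t^{\frac12-\nu}B(t^{2\nu})$ are given by
$$t^{\frac12-\nu}W(t^{2\nu}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{\sqrt{2\nu}}{z_{\nu-1,k}}\Big\}\Big\{\sqrt2\,t^{\frac12}\,\frac{J_\nu(z_{\nu-1,k}\,t)}{J_\nu(z_{\nu-1,k})}\Big\}, \tag{4.63}$$
$$t^{\frac12-\nu}B(t^{2\nu}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{\sqrt{2\nu}}{z_{\nu,k}}\Big\}\Big\{\sqrt2\,t^{\frac12}\,\frac{J_\nu(z_{\nu,k}\,t)}{J_{\nu-1}(z_{\nu,k})}\Big\}. \tag{4.64}$$
By multiplying both sides of (4.63) by $t^{-\frac12}$, we get a Dini series expansion of $t^{-\nu}W(t^{2\nu})$ on $(0,1)$. Proceeding likewise with (4.64), one obtains the Fourier-Bessel series expansion of $t^{-\nu}B(t^{2\nu})$ on $(0,1)$ (see, e.g., pp. 96–103 in [132]). Recall that, under suitable conditions on the functions $f(\cdot)$ and $g(\cdot)$ on $(0,1)$, it is possible to expand $f(\cdot)$ into the Fourier-Bessel expansion $f(t) = \sum_{k=1}^{\infty}a_k J_\nu(z_{\nu,k}t)$, with
$$a_k = \frac{2}{J_{\nu-1}^2(z_{\nu,k})}\int_0^1 tf(t)J_\nu(z_{\nu,k}t)\,dt, \quad k \ge 1, \tag{4.65}$$
and $g(\cdot)$ into a Dini expansion $g(t) = \sum_{k=1}^{\infty}b_k J_\nu(z_{\nu-1,k}t)$, with
$$b_k = \frac{2}{J_\nu^2(z_{\nu-1,k})}\int_0^1 tg(t)J_\nu(z_{\nu-1,k}t)\,dt, \quad k \ge 1. \tag{4.66}$$
By setting $f(t) = t^{-\nu}B(t^{2\nu})$ and $g(t) = t^{-\nu}W(t^{2\nu})$ in (4.65)–(4.66), and reading off the coefficients from (4.63)–(4.64), we get
$$a_k = \frac{2\sqrt{\nu}\,\omega_k}{z_{\nu,k}J_{\nu-1}(z_{\nu,k})} \quad \text{and} \quad b_k = \frac{2\sqrt{\nu}\,\omega_k}{z_{\nu-1,k}J_\nu(z_{\nu-1,k})} \quad \text{for } k \ge 1. \tag{4.67}$$
Put $\theta = 0$ and $\nu = \rho/(\rho+1)$ in (4.61)–(4.62). Set, for notational simplicity, $z_{\nu,k} = z_{\rho/(\rho+1),k}$ and $z_{\nu-1,k} = z_{\rho/(\rho+1)-1,k}$. We so obtain the KL expansions
$$W(t^{\rho}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{2\sqrt{\rho}}{z_{\nu-1,k}(\rho+1)}\Big\}\Big\{\sqrt{\rho+1}\,t^{\frac{\rho}{2}}\,\frac{J_{\rho/(\rho+1)}\big(z_{\nu-1,k}\,t^{\frac{\rho+1}{2}}\big)}{J_{\rho/(\rho+1)}(z_{\nu-1,k})}\Big\}, \tag{4.68}$$
$$B(t^{\rho}) = \sum_{k=1}^{\infty}\omega_k\Big\{\frac{2\sqrt{\rho}}{z_{\nu,k}(\rho+1)}\Big\}\Big\{\sqrt{\rho+1}\,t^{\frac{\rho}{2}}\,\frac{J_{\rho/(\rho+1)}\big(z_{\nu,k}\,t^{\frac{\rho+1}{2}}\big)}{J_{\rho/(\rho+1)-1}(z_{\nu,k})}\Big\}. \tag{4.69}$$
The KL expansion (4.69) has been obtained by Li ([140]) (see the proof of Theorem 1.6, pp. 24-25 in [140]), up to the normalizing factor, for $k = 1, 2, \ldots$,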
$$c_k = \sqrt{\rho+1}\,\big/\,J_{\rho/(\rho+1)-1}(z_{\nu,k}),$$
of the eigenfunction in (4.69) (with the notation (4.58))
$$e_k(t) = c_k\,t^{\rho/2}\,J_{\rho/(\rho+1)}\big(z_{\nu,k}\,t^{\frac{\rho+1}{2}}\big),$$
left implicit in his work. Although it is possible to reverse the previous arguments, starting with (4.69), in order to obtain an alternative proof of Theorem 4.6 based on [140], this only works for the values of $\nu = \rho/(\rho+1)$ with $0 < \nu < 1$ (since we must have $\rho > 0$).

4.6. Bessel functions

4.6.1. Definition of Bessel functions. For each real constant $\nu \in \mathbb{R}$, we denote by $J_\nu(\cdot)$ the Bessel function of the first kind of index $\nu$. For our needs, it will be useful to recall some important properties of these functions (refer to [132] and [207] for details). The second order homogeneous differential equation
$$x^2y'' + xy' + (x^2 - \nu^2)y = 0, \tag{4.70}$$
has a fundamental set of solutions on $(0,\infty)$ of the form $y = Cx^{\nu}\sum_{k=0}^{\infty}a_k x^k$, where $C$ is a constant. These solutions are proportional to the Bessel function of the first kind (see, e.g., 9.1.69 in [1]), explicitly defined, for an arbitrary $\nu \in \mathbb{R}$, by
$$J_\nu(x) = \frac{(\tfrac12 x)^{\nu}}{\Gamma(\nu+1)}\,{}_0F_1\big(\nu+1; -\tfrac14 x^2\big) = (\tfrac12 x)^{\nu}\sum_{k=0}^{\infty}\frac{(-\tfrac14 x^2)^k}{\Gamma(\nu+k+1)\,\Gamma(k+1)}. \tag{4.71}$$
When $\nu = -n$ is a negative integer, $\Gamma(\nu+k+1) = \Gamma(-n+k+1) = \infty$ for $k = 0, \ldots, n-1$, so that, making use of the convention $a/\infty = 0$ when $a \in \mathbb{R}$, the first $n$ terms in the series (4.71) vanish. In this case, we have the relation
$$J_{-n}(x) = (-1)^n J_n(x). \tag{4.72}$$
In (4.71), we made use of the generalized hypergeometric function
$${}_0F_1(b; z) = \sum_{k=0}^{\infty}\frac{1}{(b)_k}\,\frac{z^k}{k!} \quad \text{for } z \in \mathbb{C}, \tag{4.73}$$
where the Pochhammer symbol $(b)_k$ is defined for $k \in \mathbb{N}$ by $(b)_k = \Gamma(b+k)/\Gamma(b)$ when $b \ne 0, -1, -2, \ldots$, and, for an arbitrary $b \in \mathbb{R}$, by
$$(b)_0 = 1 \quad \text{and} \quad (b)_k = b(b+1)\cdots(b+k-1) \quad \text{for } k \ge 1. \tag{4.74}$$
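The series (4.71), together with the convention $1/\Gamma = 0$ at the poles of the Gamma function, translates directly into code. The Python sketch below sums a fixed number of terms (the default of 60 terms is an arbitrary choice, adequate for moderate $x$ and $\nu$), and compares the result with the elementary expressions of $J_{\pm 1/2}$ and with the relation (4.72). The helper names `inv_gamma` and `bessel_j` are illustrative.

```python
import math

def inv_gamma(x: float) -> float:
    """1 / Gamma(x), with the convention 1 / Gamma(x) = 0 at x = 0, -1, -2, ..."""
    if x <= 0 and x == math.floor(x):
        return 0.0
    return 1.0 / math.gamma(x)

def bessel_j(nu: float, x: float, terms: int = 60) -> float:
    """Partial sum of the power series (4.71) for J_nu(x), x > 0."""
    half = 0.5 * x
    total = 0.0
    for k in range(terms):
        total += (-half * half) ** k * inv_gamma(nu + k + 1) * inv_gamma(k + 1)
    return half ** nu * total

x = 2.0
print("J_{1/2}(2) :", bessel_j(0.5, x), math.sqrt(2 / (math.pi * x)) * math.sin(x))
print("J_{-1/2}(2):", bessel_j(-0.5, x), math.sqrt(2 / (math.pi * x)) * math.cos(x))
print("J_{-2}(2) vs J_2(2):", bessel_j(-2, x), bessel_j(2, x))
```

The power series is convenient for small and moderate arguments; for large $x$ the alternating terms cancel heavily, and a dedicated library routine would be preferable.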
The roots (or zeros) of $J_\nu(\cdot)$ have the following properties, in addition to (4.53) (see, e.g., Ch. XV, pp. 478–521 in [207], p. 96 in [132]). For any $\nu > -1$, $J_\nu(\cdot)$ has only real roots. Moreover, in this case, the positive roots of $J_\nu(\cdot)$ are isolated and form an increasing sequence
$$0 < z_{\nu,1} < z_{\nu,2} < \cdots, \tag{4.75}$$
such that, for any fixed $k \ge 1$, $z_{\nu,k}$ is a continuous and increasing function of $\nu > -1$. In addition, for any specified $\nu > -1$, as $k \to \infty$,
$$z_{\nu,k} = \big\{k + \tfrac12(\nu - \tfrac12)\big\}\pi - \frac{4\nu^2 - 1}{8\big\{k + \tfrac12(\nu - \tfrac12)\big\}\pi} + O\Big(\frac{1}{k^3}\Big). \tag{4.76}$$

Remark 4.2. For $\nu = -\tfrac12$ and $\nu = \tfrac12$, we have (see (4.79) below)
$$z_{-\frac12,k} = \big(k - \tfrac12\big)\pi \quad \text{and} \quad z_{\frac12,k} = k\pi \quad \text{for } k = 1, 2, \ldots, \tag{4.77}$$
so that, in either of these cases, $z_{\nu,k}$ reduces to the first term in (4.76).

An alternative definition of the Bessel function $J_\nu(\cdot)$ makes use of Euler's formula (see, e.g., (2)–(3) p. 498 in [207])
$$J_\nu(z) = \frac{(\tfrac12 z)^{\nu}}{\Gamma(\nu+1)}\prod_{k=1}^{\infty}\Big\{1 - \frac{z^2}{z_{\nu,k}^2}\Big\} \quad \text{for } z > 0. \tag{4.78}$$
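The quality of the two-term approximation (4.76) can be gauged numerically. The Python sketch below uses the first two terms of (4.76) as a starting guess and refines it by bisection on a power-series evaluation of $J_\nu$. The bracket half-width of 0.3 and the number of series terms are assumptions that happen to work for the small orders and indices tried here; they are not general-purpose choices.

```python
import math

def bessel_j(nu: float, x: float, terms: int = 80) -> float:
    """Power-series partial sum for J_nu(x), valid here for nu > -1, x > 0."""
    half = 0.5 * x
    total = sum((-half * half) ** k / (math.gamma(nu + k + 1) * math.gamma(k + 1))
                for k in range(terms))
    return half ** nu * total

def two_term_zero(nu: float, k: int) -> float:
    """First two terms of the expansion (4.76) of z_{nu,k}."""
    b = (k + 0.5 * (nu - 0.5)) * math.pi
    return b - (4.0 * nu * nu - 1.0) / (8.0 * b)

def refine_zero(nu: float, guess: float, half_width: float = 0.3, iters: int = 60) -> float:
    """Bisection on [guess - half_width, guess + half_width] (assumed to bracket the zero)."""
    lo, hi = guess - half_width, guess + half_width
    f_lo = bessel_j(nu, lo)
    if f_lo * bessel_j(nu, hi) > 0:
        raise ValueError("no sign change in the bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = bessel_j(nu, mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

for nu, k in ((0.0, 1), (0.0, 2), (1.0, 1), (0.5, 3), (-0.5, 2)):
    guess = two_term_zero(nu, k)
    print(f"nu={nu}, k={k}: two-term {guess:.6f}, refined {refine_zero(nu, guess):.6f}")
```

For $\nu = \pm\tfrac12$ the correction term in (4.76) vanishes and the guess is already exact, in agreement with (4.77).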
4.6.2. Some special cases. The expression (4.71) of the Bessel function of the first kind $J_\nu(\cdot)$ can be simplified when ν = m + 1/2 for an integer m = −1, 0, 1, . . .. In particular, for m = −1 and m = 0,
$$J_{-\frac12}(x) = \sqrt{\frac{2}{\pi x}}\,\cos x \quad\text{and}\quad J_{\frac12}(x) = \sqrt{\frac{2}{\pi x}}\,\sin x. \tag{4.79}$$
For m ≥ 0, we get
$$J_{m+\frac12}(x) = (-1)^m x^{m+1} \sqrt{\frac{2}{\pi x}}\,\Big(\frac{d}{x\,dx}\Big)^m \frac{\sin x}{x}. \tag{4.80}$$
In general, for an arbitrary integer m ≥ −1, $J_{m+\frac12}(\cdot)$ is of the form
$$\sqrt{\frac{\pi x}{2}}\;J_{m+\frac12}(x) = Q_m\Big(\frac1x\Big)\sin x - P_m\Big(\frac1x\Big)\cos x, \tag{4.81}$$
where $P_m(\cdot)$ and $Q_m(\cdot)$ are polynomials. The first terms of the sequence are
$$P_{-1}(u) = -1, \quad Q_{-1}(u) = 0, \quad P_0(u) = 0, \quad Q_0(u) = 1. \tag{4.82}$$
Lemma 4.3. For an arbitrary m ≥ 0, we have the recurrence formulas
$$Q_{m+1}(u) = (2m+1)\,u\,Q_m(u) - Q_{m-1}(u), \tag{4.83}$$
$$P_{m+1}(u) = (2m+1)\,u\,P_m(u) - P_{m-1}(u). \tag{4.84}$$
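Proof. By the recurrence $J_{\nu+1}(x) = (2\nu/x)\,J_\nu(x) - J_{\nu-1}(x)$, applied with ν = m + 1/2, we have
$$\sqrt{\frac{\pi x}{2}}\;J_{m+\frac32}(x) = \frac{2m+1}{x}\,\sqrt{\frac{\pi x}{2}}\;J_{m+\frac12}(x) - \sqrt{\frac{\pi x}{2}}\;J_{m-\frac12}(x),$$
so that (4.83)–(4.84) is straightforward.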
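By combining (4.81)–(4.82) with (4.83)–(4.84), we get
$$J_{\frac32}(x) = \sqrt{\frac{2}{\pi x}}\,\Big(\frac{\sin x}{x} - \cos x\Big), \tag{4.85}$$
$$J_{\frac52}(x) = \sqrt{\frac{2}{\pi x}}\,\Big(\frac{3\sin x}{x^2} - \frac{3\cos x}{x} - \sin x\Big). \tag{4.86}$$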
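The recurrences (4.83)–(4.84), started from (4.82), are easy to run mechanically. The following Python sketch stores each polynomial as a list of coefficients in increasing powers of u; the helper names are hypothetical and chosen here only for illustration.

```python
import math

def _trim(coeffs):
    """Drop trailing zero coefficients, keeping at least one entry."""
    while len(coeffs) > 1 and coeffs[-1] == 0:
        coeffs.pop()
    return coeffs

def _next_poly(cur, prev, m):
    """Return (2m+1)*u*cur(u) - prev(u), as in (4.83)-(4.84)."""
    out = [0] + [(2 * m + 1) * c for c in cur]
    out += [0] * (len(prev) - len(out))  # pad if prev happens to be longer
    for i, c in enumerate(prev):
        out[i] -= c
    return _trim(out)

def half_integer_polys(m_max):
    """Build P_m and Q_m for m = -1, ..., m_max from the initial terms (4.82)."""
    P = {-1: [-1], 0: [0]}
    Q = {-1: [0], 0: [1]}
    for m in range(m_max):
        P[m + 1] = _next_poly(P[m], P[m - 1], m)
        Q[m + 1] = _next_poly(Q[m], Q[m - 1], m)
    return P, Q

def bessel_half_integer(m, x):
    """Evaluate J_{m+1/2}(x) for x > 0 through the representation (4.81)."""
    P, Q = half_integer_polys(max(m, 0))
    u = 1.0 / x
    p = sum(c * u ** i for i, c in enumerate(P[m]))
    q = sum(c * u ** i for i, c in enumerate(Q[m]))
    return math.sqrt(2.0 / (math.pi * x)) * (q * math.sin(x) - p * math.cos(x))

if __name__ == "__main__":
    P, Q = half_integer_polys(3)
    print(Q[2], P[2])  # coefficients of 3u^2 - 1 and 3u, matching (4.86)
    print(Q[3], P[3])  # coefficients of 15u^3 - 6u and 15u^2 - 1
```

With m = 1 and m = 2 the polynomials reproduce (4.85) and (4.86); higher orders follow from the same loop without further algebra.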
References
[1] Abramowitz, M. and Stegun, I.A. (1965). Handbook of Mathematical Functions. Dover, New York. [2] Acosta, A. de and Kuelbs, J. (1983). Limit theorems for moving averages of independent random vectors. Z. Wahrscheinlichkeitstheor. Verw. Geb. 64 67–123. [3] Adler, R.J. (1990). An Introduction to Continuity, Extrema and Related Topics for General Gaussian Processes. IMS Lecture Notes-Monograph Series 12. Institute of Mathematical Statistics, Hayward, California. [4] Akaike, H. (1954). An approximation of the density function. Ann. Inst. Statist. Math. 6 127–132. [5] Anderson, T.W. and Darling, D.A. (1952). Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Statist. 23 193–212.
180
P. Deheuvels
[6] Araujo, A. and Gin´e, E. (1980). The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York. [7] Ash, R.B. and Gardner, M.F. (1975). Topics in Stochastic Processes. Academic Press, New York. [8] Bahadur, R.R. (1967). A note on quantiles in large samples. Ann. Math. Statist. 37 577–580. [9] Bahadur, R.R. (1971). Some Limit Theorems in Statistics. Regional Conference Series in Applied Mathematics. 4. S.I.A.M., Philadelphia. [10] Bauer, H. (1981). Probability Theory and Elements of Measure Theory. Academic Press, New York. [11] del Barrio, E., Cuesta-Albertos, J.A. and Matran, C. (2000). Contributions of empirical and quantile processes to the asymptotic theory of goodness-of-fit tests. Test 9 1–96. [12] B´ artfai, P. (1966). Die Bestimmung der zu einem wiederkehrenden Prozess geh¨ orenden Verteilungsfunktion aus den mit Fehlern behafteten Daten einer einzigen Relation. Studia Sci. math. Hung. 1 161–168. [13] Bartlett, M.S. (1963). Statistical estimation of density functions. Sankhy¯ a. Ser. A 25 245–254. [14] Berkes, I. and Philipp, W. (1979). Approximation theorems for independent and weakly dependent random vectors. Ann. Probab. 7 29–54. [15] Berlinet, A. (1993). Hierarchies of higher-order kernels. Prob. Theor. Rel. Fields 94 489–504. [16] Berlinet, A. and Devroye, L. (1994). A comparison of kernel density estimates. Publ. Inst. Statist. Univ. Paris 38 3–59. [17] Bickel, P. and Rosenblatt, M. (1973). On some global measures of the deviation of density function estimates. Ann. Statist. 1 1071–1095. [18] Bickel, P. and Rosenblatt, M. (1975). Corrections to “On some global measures of the deviation of density function estimates”. Ann. Statist. 3 1370. [19] Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York. [20] Borell, C. (1975). The Brunn-Minkovski inequality in Gauss space. Invent. Math. 30 207–216. [21] Borell, C. (1976). Gaussian Radon measures on locally convex spaces. Math. Scand. 
38 265–284. [22] Borell, C. (1977). A note on Gauss measures which agree on balls. Ann. Inst. H. Poincaré Ser. B 13 231–238. [23] Bosq, D. and Lecoutre, J.-P. (1987). Théorie de l’Estimation Fonctionnelle. Economica, Paris. [24] Bowman, F. (1958). Introduction to Bessel Functions. Dover, New York. [25] Bowman, A., Hall, P. and Prvan, T. (1998). Bandwidth selection for the smoothing of distribution functions. Biometrika 85 799–808. [26] Bretagnolle, J. and Massart, P. (1989). Hungarian constructions from the nonasymptotic viewpoint. Ann. Probab. 17 239–256.
[27] Castelle, N., and Laurent-Bonvalot, F. (1998). Strong approximations of bivariate uniform empirical processes. Ann. Inst. Henri Poincar´e. 34 425–480. [28] Castelle, N. (2002). Approximation fortes pour des processus bivari´es. Canad. J. Math. 54 533–553. [29] Cheng, P.E., and Bai, Z. (1995). Optimal strong convergence rates in nonparametric regression. Math. Meth. of Statist. 4 405–420. [30] Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sums of observations. Ann. Math. Statist. 23 493–507. [31] Chung, K.L. (1949). An estimate concerning the Kolmogorov limit distributions. Trans. Amer. Math. Soc. 67 36–50. [32] Ciesielski, Z. and Taylor, S.J. (1962). First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Trans Amer. Math. Soc. 434–450. [33] Collomb, G. (1977). Quelques propri´et´es de la m´ethode du noyau pour l’estimation non-param´etrique de la r´egression en un point fix´e. C. R. Acad. Sci. Paris 285 A 289–292. [34] Collomb, G. (1979). Conditions n´ecessaires et suffisantes de convergence uniforme d’un estimateur de la r´egression, estimation des d´eriv´ees de la regr´ession. C. R. Acad. Sci . Paris 288 161–163. [35] Collomb, G. (1981). Estimation non-param´etrique de la r´egression: revue bibliographique. Int. Statist. Rev. 49 75–93. [36] Cs´ aki, E. (1980). On the standardized empirical distribution function. Coll. Math. Soc. J´ anos Bolyai. 32 123–138. Nonparametric Statistical Inference. Akademiai Kiado, Budapest. [37] Cs¨ org˝ o, M. and Horv´ ath, L. (1986). Approximations of weighted empirical and quantile processes. Statist. and Probab. Letters. 4 275–280. [38] Cs¨ org˝ o, M. and Horv´ ath, L. (1993). Weighted Approximations in Probability and Statistics. Wiley, New York. [39] Cs¨ org˝ o, M. and R´ev´esz, P. (1978). Strong approximation of the quantile process. Ann. Statist. 6 882–894. [40] Cs¨ org˝ o, M. and R´ev´esz, P. (1981). 
Strong Approximations in Probability and Statistics. Academic Press, New York. [41] Csörgő, M., Csörgő, S., Horváth, L. and Mason, D.M. (1986). Weighted empirical and quantile processes. Ann. Probab. 14 86–118. [42] Dall’Aglio, G., Kotz, S. and Salinetti, G. (1991). Advances in Probability Distributions with Given Marginals. Kluwer, Dordrecht. [43] Darling, D.A. (1957). The Kolmogorov-Smirnov, Cramér-von Mises tests. Ann. Math. Statist. 28 823–838. [44] David, H.A. (1981). Order Statistics. 2nd Ed. Wiley, New York. [45] Deheuvels, P. (1974). Conditions nécessaires et suffisantes de convergence presque sûre et uniforme presque sûre des estimateurs de la densité. C. R. Acad. Sci. Paris. Ser. A 278 1217–1220.
[46] Deheuvels, P. (1977a). Estimation non param´etrique de la densit´e par histogrammes g´en´eralis´es. Publications de l’Institut de Statistique de l’Universit´e de Paris. 22 1–24. [47] Deheuvels, P. (1977b). Estimation non param´etrique de la densit´e par histogrammes g´en´eralis´es (II). Revue de Statistique Appliqu´ee 25 5–42. [48] Deheuvels, P. (1979). Propri´et´es d’existence et propri´et´es topologiques des fonctions de d´ependance. C. R. Acad. Sci. Paris, Ser. A 288 217–220. [49] Deheuvels, P. (1986). Strong laws for the k-th order statistic when k ≤ c log2 n. Probab. Theory Related Fields. 72 133–154. [50] Deheuvels, P. (1990). Laws of the iterated logarithm for density estimators. In: Nonparametric Functional Estimation and Related Topics. Roussas, G. (Ed.) 19– 20. Kluwer, Dordrecht. [51] Deheuvels, P. (1991). Functional Erd˝ os-R´enyi laws. Studia Sci. Math. Hungar. 26 261–295. [52] Deheuvels, P. (1992). Functional laws of the iterated logarithm for large increments of empirical and quantile processes. Stochastic Processes Appl. 43 133–163. [53] Deheuvels, P. (1996). Functional laws of the iterated logarithm for small increments of empirical processes. Statistica Neerlandica. 50 261–280. [54] Deheuvels, P. (1997). Strong laws for local quantile processes. Ann. Probab. 25 2007–2054. [55] Deheuvels, P. (1998). On the approximation of quantile processes by Kiefer processes. J. Theor. Probab. 11 997–1018. [56] Deheuvels, P. (2000). Limit laws for kernel density estimators for kernels with unbounded supports. Asymptotics in Statistics and Probability. M.L. Puri (Ed.) 117– 132. VSP. International Science Publishers, Amsterdam. [57] Deheuvels, P. (2000). Strong approximations of quantile processes by iterated Kiefer processes. Ann. Probab. 28 909–945. [58] Deheuvels, P., Devroye, L. and Lynch, J. (1986). Exact convergence rates in the limit theorems of Erd˝ os-R´enyi and Shepp. Ann. Probab. 14 209–223. [59] Deheuvels, P. and Devroye, L. (1987). 
Limit laws of Erdős-Rényi-Shepp type. Ann. Probab. 15 1363–1386. [60] Deheuvels, P. and Einmahl, J.H.J. (2000). Functional limit laws for the increments of Kaplan-Meier product-limit processes and applications. Ann. Probab. 28 1301–1335. [61] Deheuvels, P. and Lifshits, M.A. (1993). Strassen-type functional laws for strong topologies. Probab. Theory Related Fields. 97 151–167. [62] Deheuvels, P. and Lifshits, M.A. (1994). Necessary and sufficient conditions for the Strassen law of the iterated logarithm in nonuniform topologies. Ann. Probab. 22 1838–1856. [63] Deheuvels, P. and Martynov, G. (2003). Karhunen-Loève expansions for weighted Wiener processes and Brownian bridges via Bessel functions. Progress in Probability. 55 57–93. [64] Deheuvels, P. and Mason, D.M. (1990a). Bahadur-Kiefer-type processes. Ann. Probab. 18 669–697.
[65] Deheuvels, P. and Mason, D.M. (1990b). Nonstandard Functional laws of the iterated logarithm for tail empirical and quantile processes. Ann. Probab. 18 1693–1722. [66] Deheuvels, P. and Mason, D.M. (1992a). Functional laws of the iterated logarithm for the increments of empirical and quantile processes. Ann. Probab. 20 1248–1287. [67] Deheuvels, P. and Mason, D.M. (1992b). A functional L.I.L. approach to pointwise Bahadur-Kiefer theorems. In Probability in Banach Spaces. (R.M. Dudley, M. Hahn and J. Kuelbs, eds.) 8 255–266. Birkh¨ auser, Boston. [68] Deheuvels, P. and Mason, D.M. (1994a). Functional laws of the iterated logarithm for local empirical processes indexed by sets. Ann. Probab. 22 1619–1661. [69] Deheuvels, P. and Mason, D.M. (1994b). Random fractals generated by oscillations of processes with stationary and independent increments. Probability in Banach Spaces. 73–90, 9 Hoffman-Jørgensen, J., Kuelbs, J. and Marcus, M.B. Eds. Birkh¨ auser, Boston. [70] Deheuvels, P. and Mason, D.M. (2004). General asymptotic confidence bands based on kernel-type function estimators. Statist. Inference for Stoch. Processes. 7 225– 277. [71] Deheuvels, P. and Steinebach, J. (1987). Exact convergence rates in strong approximation laws for large increments of partial sums. Probab. Theor. Related Fields. 76 369–393. [72] Derzko, G. and Deheuvels, P. (2002). Estimation non-param´etrique de la r´egression dichotomique - application biom´edicale. C. R. Acad. Sci. Paris Ser. I 334 59–63. [73] Deuschel, J.D. and Stroock, D.W. (1989). Large Deviations. Academic Press, New York. [74] Devroye, L. (1978). The uniform convergence of the Nadaraya-Watson regression function estimate. Canad. J. Statist. 6 179–191. [75] Devroye, L. (1977). A uniform bound for the deviation of empirical distribution functions. J. Multivariate Analysis. 7 594–597. [76] Devroye, L. (1982). Bounds for the uniform deviation of empirical measures. J. Multivariate Analysis. 12 72–79. [77] Devroye, L. (1987). 
A Course in Density Estimation. Birkhäuser-Verlag, Boston. [78] Devroye, L. and Lugosi, G. (2001). Combinatorial Methods in Density Estimation. Springer, New York. [79] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York. [80] Doob, J.L. (1953). Stochastic Processes. Wiley, New York. [81] Dudley, R.M. and Philipp, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes indexed by sets. Z. Wahrsch. Verw. Gebiete. 62 509–552. [82] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge University Press, Cambridge. [83] Dudley, R.M. (2002). Real Analysis and Probability. Cambridge University Press, Cambridge. [84] Dugundji, J. (1966). Topology. Allyn and Bacon, Boston.
[85] Durbin, J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function. Regional Conference Series in Applied Mathematics, 9 S.I.A.M., Philadelphia. [86] Dvoretzky, A., Kiefer, J. and Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 33 642–669. [87] Eggermont, P.P.B. and La Riccia, V.N. (2001). Maximum Penalized Likelihood Estimation. Springer, New York. [88] Eicker, F. (1979). The asymptotic distribution of the suprema of the standardized empirical processes. Ann. Statist. 7 116–138. [89] Einmahl, J.H.J. (1987). Multivariate Empirical Processes. C.W.I. Tract 32. Mathematisch Centrum, Amsterdam. [90] Einmahl, U. (1986). A refinement of the KMT-inequality for partial sum strong approximation. Technical Report Series of the Laboratory for Research in Statistics and Probability – Carleton University. 88, Ottawa, Canada. [91] Einmahl, U. (1988). Strong approximations for partial sums of i.i.d. B-valued r.v.’s in the domain of attraction of a Gaussian law. Probab. Theor. Rel. Fields. 77 65–85. [92] Einmahl, U. (1989). Extensions of results of Koml´ os, Major and Tusn´ ady to the multivariate case. J. Multivariate Anal. 28 20–68. [93] Einmahl, J.H.J. and Mason, D.M. (1985). Bounds for weighted multivariate empirical distribution functions. Z. Wahrsch. Verw. Gebiete. 70 563–571. [94] Einmahl, J.H.J. and Mason, D.M. (1988). Strong limit theorems for weighted quantile processes. Ann. Probab. 16 1623–1643. [95] Einmahl, U. and Mason, D.M. (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. J. Theoretical Prob., 13, 1–37. [96] Epanechnikov, V.A. (1969). Nonparametric estimation of a multivariate probability density. Theor. Probab. Appl. 14 153–158. [97] Erd˝ os, P. and R´enyi, A. (1970). On a new law of large numbers. J. Analyse Math. 23 103–111. [98] Finkelstein, H. (1971). 
The law of the iterated logarithm for empirical distributions. Ann. Math. Statist. 42 607–615. [99] Gaenssler, P. (1983). Empirical Processes. Vol. 3, IMS Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward. [100] Gaenssler, P. and Stute, W. (1979). Empirical process: a survey of results for independent and identically distributed random variables. Ann. Probab. 7 193–243. [101] Gasser, T., Müller, H.G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. J. R. Statist. Soc. Ser. B 47 238–252. [102] Gikhman, I.I. (1957). On a nonparametric criterion of homogeneity for k samples. Theor. Probab. Appl. 2 369–373. [103] Goodman, V., Kuelbs, J. and Zinn, J. (1981). Some results on the LIL in Banach space with application of weighted empirical process. Ann. Probab. 8 713–752. [104] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
[105] Hall, P. (1991). On iterated logarithm laws for linear arrays and nonparametric regression estimators. Ann. Probab. 19 740–757. [106] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Application. Academic Press, New York. [107] Hall, P. and Marron, J.S. (1991). Lower bounds for bandwidth selection in density estimation. Probab. Theor. Rel. Fields. 90 149–173. [108] Hall, P., Marron, J.S. and Park, B.U. (1992). Smoothed cross-validation. Probab. Theor. Rel. Fields 92 1–20. [109] H¨ ardle, W. (1984). A law of the iterated logarithm for nonparametric regression function estimators. Ann. Statist. 12 624–635. [110] H¨ ardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge. [111] H¨ ardle, W., Janssen, P. and Serfling, R. (1988). Strong uniform consistency rates of estimators of conditional functionals. Ann. Statist. 16 1428–1449. [112] Hartman, P. and Wintner, A. (1941). On the law of iterated logarithm. Amer. J. Math. 63 169–176. [113] H¨ ogn¨ as, G. (1977). Characterization of weak convergence of signed measures on [0, 1]. Math. Scand. 41 175–184. [114] Hurevicz, W. and Wallman, G. (1948). Dimension Theory. Princeton University Press, Princeton. [115] Izenman, A.J. (1991). Recent developments in nonparametric density estimation. J. Amer. Statist. Assoc. 86 205–224. [116] Jaeschke, D. (1979). The asymptotic distribution of the supremum of the standardized empirical distribution on subintervals. Ann. Statist. 7 108–115. [117] Jones, M.C., Marron, J.S. and Park, B.U. (1991). A simple root n bandwidth selector. Ann. Statist. 19 1919–1932. [118] Jones, M.C., Marron, J.S. and Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc. 91 401–407. [119] Kac, M. (1951). On some connections between probability theory and differential and integral equations. Proc.Second Berkeley Sympos. Math. Statist. Probab. 180– 215. [120] Kac, M. (1980). 
Integration in Function Spaces and Some of its Applications. Lezioni Fermiane, Accademia Nazionale dei Lincei. Pisa. [121] Kac, M. and Siegert, A.J.F. (1947). On the theory of noise in radio receivers with square law detectors. J. Appl. Physics. 18 383–397. [122] Kac, M. and Siegert, A.J.F. (1947). An explicit representation of a stationary Gaussian process. Ann. Math. Statist. 18 438–442. [123] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests. Ann. Math. Statist. 30 420–447. [124] Kiefer, J. (1967). On Bahadur’s representation of sample quantiles. Ann. Math. Statist. 38 1323–1342. [125] Kiefer, J. (1970). Deviations between the sample quantile process and the sample d.f. In Nonparametric Techniques in Statistical Inference. (M. Puri, ed.) 299–319. Cambridge Univ. Press.
[126] Kiefer, J. (1972a). Iterated logarithm analogues for sample quantiles when pn ↓ 0. Proc. Sixth Berkeley Symp. Math. Statist. Probab. 1 227–244. Univ. California Press, Berkeley. [127] Kiefer, J. (1972b). Skorohod embedding of multivariate rv’s and the sample df. Z. Wahrsch. Verw. Gebiete. 24 1–35. [128] Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Inst. Ital. Attauri. 4 83–91. [129] Koml´ os, J., Major, P. and Tusn´ ady, G. (1975). An approximation of partial sums of independent r.v.’s and the sample df. I. Z. Wahrsch. Verw. Gebiete. 32 111–131. [130] Koml´ os, J., Major, P. and Tusn´ ady, G. (1975). An approximation of partial sums of independent r.v.’s and the sample df. II. Z. Wahrsch. Verw. Gebiete. 34 33–58. [131] Konakov, V.D. and Piterbarg, V.I. (1984). On the convergence rate of maximal deviation distribution of kernel regression estimates. J. Mult. Appl. 15 279–294. [132] Korenev, B.G. (2002). Bessel Functions and their Applications. Taylor & Francis, London. [133] Krzy˙zak, A., and Pawlak, M. (1984). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Trans. on Information Theory. 30 78–81. [134] Lai, T.L. (1974). Reproducing kernel Hilbert spaces and the law of the iterated logarithm for Gaussian processes. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 29 7–19. [135] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin. [136] Ledoux, M. (1996). On Talagrand’s deviation inequalities for product measures. ESAIM: Probab. Statist. 1 63–87. http//www.emath.fr/ps/ [137] L´evy, P. (1937). Th´eorie de l’Addition des Variables Al´eatoires. Gauthier-Villars, Paris. [138] L´evy, P. (1951). Wiener random functions and other Laplacian random functions. Proc. 6th Berkeley Sympos. Probab. Theory Math. Statist. 2 171–186. [139] L´evy, P. (1953). La mesure de Hausdorff de la courbe du mouvement brownien. Giorn. 
ist. Ital. Attuari. 16 1–37. [140] Li, W.V. (1992a). Comparison results for the lower tail of Gaussian seminorms. J. Theor. Probab. 5 1–31. [141] Li, W.V. (1992b). Limit theorems for the square integral of Brownian motion and its increments. Stoch. Processes Appl. 41 223–239. [142] Li, W.V. (1992c). Lim inf results for the Wiener process and its increments under the L2 -norm. Prob. Th. Rel. Fields. 92 69–90. [143] Loader, C.R. (1999). Bandwidth selection: classical or plug-in? Ann. Statist. 27 415–438. [144] Lynch, J. and Sethuraman, J. (1987). Large deviations for processes with independent increments. Ann. Probab. 15 610–627. [145] Maccone, C. (1984). Eigenfunctions and energy for time-rescaled Gaussian processes. Boll. Un. Mat. ital. 6 213–219.
[146] Major, P. (1976a). The approximation of partial sums of independent r.v.’s. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 35 213–220. [147] Major, P. (1976b). The approximation of partial sums of i.i.d.r.v.’s. when the summands have only two moments. Z. Wahrscheinlichkeitstheorie Verw. Gebiete. 35 221–230. [148] Major, P. (1979). An improvement of Strassen’s invariance principle. Ann. Probab. 7 55–61. [149] Marron, J.S. and Nolan, D. (1988). Canonical kernels for density estimation. Statist. Probab. Letters. 7 195–199. [150] Martynov, G.V. (1975). Computation of the distribution function of quadratic forms in normal random variables. Theor. Probab. Appl. 20 797–809. [151] Martynov, G.V. (1977). Generalization of the Smirnov formula for the quadratic forms distributions. Theor. Probab. Appl. 22 614–620. [152] Martynov, G.V. (1992). Statistical tests based on empirical processes and related questions. J. Soviet. Math. 61 2195–2275. [153] Mason, D.M. (1984). A strong limit theorem for the oscillation modulus of the uniform empirical quantile process. Stoch. Processes Appl. 17 126–136. [154] Mason, D.M., Shorack, G. and Wellner, J.A. (1983). Strong limit theorems for oscillation moduli of the empirical process. Z. Wahrscheinlichkeit. verw. Gebiete. 61 369–373. [155] Mason, D.M. and van Zwet, W.R. (1987). A refinement of the KMT inequality for the uniform empirical process. Ann. Probab. 15 871–884. [156] Mason, D.M. (1988). A strong invariance theorem for the tail empirical process. Ann. Inst. H. Poincar´e Probab. Statist. 24 491–506. [157] Massart, P. (1990). About the constant in the DKW inequality. Ann. Probab. 18 1269–1283. [158] M¨ uller, H.G. (1988). Nonparametric Regression Analysis of Longitudinal Data. Springer, New York. [159] Nadaraya, E.A. (1964). On estimating regression. Theor. Probab. Appl. 9 141–142. [160] Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density function and regression curves. Theor. Probab. Appl. 15 134–137. 
[161] Nadaraya, E.A. (1989). Nonparametric Estimation of Probability Densities and Regression Curves. Kluwer, Dordrecht. [162] Park, B.U. and Marron, J.S. (1990). Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc. 85 66–72. [163] Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist. 33 1065–1076. [164] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York. [165] Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic Press, New York. [166] R´ev´esz, P. (1979a). On nonparametric estimation of the regression function. Problems of Control and Information Theory. 8 297–302.
[167] R´ev´esz, P. (1979b). A generalization of Strassen’s functional law of the iterated logarithm for Gaussian processes. Z. Warhscheinlichkeitstheor. Verw. Geb. 50 257– 264. [168] Rice, J. (1984). Bandwidth choice for nonparametric regression. Ann. Statist. 12 1215–1230. [169] Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math. Statist. 23 470–472. [170] Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist. 27 832–837. [171] Roussas, G. (1990). Nonparametric Functional Estimation and Related Topics. NATO ASI Series 355. Kluwer, Dordrecht. [172] Rudin, W. (1979). Real and Complex Analysis. 3rd ed., McGraw Hill, New York. [173] Rudin, W. (1973). Functional Analysis. Tata McGraw-Hill, New Delhi. [174] Schilder, M. (1966). Asymptotic formulas for Wiener integrals. Trans. Amer. Math. Soc. 125 63–85. [175] Schucany, W.R. (1989a). On nonparametric regression with higher-order kernels. J. Stat. Plann. Inference. 23 145–151. [176] Schucany, W.R. (1989b). Locally optimal window widths for kernel density estimation with large samples. Statist. Probab. Letters 7 401–405. [177] Schucany, W.R. and Sommers, J.P. (1977). Improvements of kernel type density estimators. J. Amer. Statist. Assoc. 72 420–423. [178] Scott, D.W. (1992). Multivariate Density Estimation – Theory, Practice and Visualization. Wiley, New York. [179] Scott, W.F. (1999). A weighted Cram´er-von Mises statistic, with some applications to clinical trials. Commun. Statist. Theor. Methods. 28 3001–3008. [180] Scott, D.W. and Terrel, G.R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc. 82 1131–1146. [181] Sheather, S.J. and Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. J. Roy. Statist. Soc. Ser.B 53 683–690. [182] Shorack, G.R. (1982). Kiefer’s theorem via the Hungarian construction. Z. Wahrsch. Verw. Gebiete. 34 33–58. [183] Shorack, G.R. 
and Wellner, J.A. (1986). Empirical Processes with Applications to Statistics. Wiley, New York. [184] Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London. [185] Singh, R.S. (1979). Mean squared errors of estimates of a density and its derivatives. Biometrika 66 177–180. [186] Singh, R.S. (1987). MISE of kernel estimates of a density and its derivatives. Stat. and Probab. Letters 5 153–159. [187] Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris. 8 229–231. [188] Skorohod, A.V. (1976). On a representation of random variables. Theory Probab. Appl. 21 628–632. [189] Smirnov, N.V. (1936). Sur la distribution de ω². C.R. Acad. Sci. Paris. 202 449–452.
[190] Smirnov, N.V. (1937). On the distribution of the ω 2 criterion. Rec. Math. (Mat. Sbornik). 6 3–26. [191] On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. de l’Universit´e de Moscou. 2. [192] Smirnov, N.V. (1948). Table for estimating the goodness of fit of empirical distributions. Ann. Math. Statist. 19 279–281. [193] Spiegelman, J. and Sacks, J. (1980). Consistent window estimation of nonparametric regression. Ann. Statist. 8 240–246. [194] Steen, L.A. and Seebach, J.A. (1978). Counterexamples in Topology. 2nd Ed. Springer, New York. [195] Stein, E.M. (1970). Singular Integrals and Differentiability Properties of Functions, Princeton University Press, Princeton, New Jersey. [196] Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5 595–645. [197] Stone, C.J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8 1348–1360. [198] Stone, C.J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053. [199] Strassen, V. (1964). An invariance principle for the law of the iterated logarithm. Z. Wahrsch. Verw. Gebiete. 3 211–226. [200] Stute, W. (1982). The oscillation behaviour of empirical processes. Ann. Probab. 10 86–107. [201] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probability 22, 28–76. [202] Tapia, R.A. and Thompson, J.R. (1978). Nonparametric Probability Density Estimation. Johns Hopkins University Press, Baltimore. [203] Terrel, G.R. (1990). The maximal smoothing principle in density estimation. J. Amer. Statist. Assoc. 85 470–477. [204] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer Verlag, New York. [205] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London. [206] Varadhan, S.R.S. (1966). Asymptotic probabilities and differential equations. Comm. Pure Appl. Math. 19 261–286. 
[207] Watson, G.N. (1952). A Treatise on the Theory of Bessel Functions. Cambridge University Press, Cambridge. [208] Watson, G.S. (1964). Smooth regression analysis. Sankhyā A26 359–372. [209] Watson, G.S. and Leadbetter, M.R. (1963). On the estimation of probability density, I. Ann. Math. Statist. 34 480–491. [210] Wertz, W. (1972). Fehlerabschätzung für eine Klasse von nichtparametrischen Schätzfolgen. Metrika. 19 132–139. [211] Wertz, W. (1978). Statistical Density Estimation: A Survey. Vandenhoeck & Ruprecht, Göttingen. [212] Woodroofe, M. (1966). On the maximum deviation of the sample density. Ann. Math. Statist. 41 1665–1671.
[213] Zolotarev, V.M. (1961). Concerning a certain probability problem. Theor. Probab. Appl. 6 201–204. [214] Zolotarev, V.M. (1983). Probability metrics. Theory Probab. Appl. 28 278–302.
Paul Deheuvels
L.S.T.A., Université Paris VI
Oracle Inequalities and Regularization Sara van de Geer Keywords. Classification, complexity penalties, complexity regularization, density estimation, empirical processes, empirical risk minimization, M-estimation, model selection, nonparametric regression, oracle inequalities.
1. Statistical models
In this chapter, the construction of a statistical model is discussed. We reflect on deviations from the model on the one hand, and on the simplicity of a model on the other. We introduce the concepts of approximation error and estimation error. The idea of complexity regularization is illustrated in two situations: histograms in density estimation and smoothing splines in regression. Here is a brief sketch of the contents of the other chapters. In Chapter 2, we introduce penalized M-estimators, or penalized empirical risk estimators. These are obtained by minimizing a loss function (e.g., least squares loss, minus log-likelihood, or, in classification, support vector machine loss). A roughness penalty is added to the loss function to avoid overfitting. We study the behavior of the estimators in a general context. The excess risk of an estimator is a global measure of its performance. We consider so-called oracle inequalities for the excess risk. These inequalities relate the performance of the estimator to that of the procedure that chooses the optimal model by trading off bias and variance (or, more generally, approximation error and estimation error). In Chapter 2, we highlight the role of empirical process theory in this context. As an important particular case, we investigate high-dimensional linear spaces. The approximation error then comes from approximating curves or images by elements of a high-dimensional parameter space. Chapter 3 studies oracle inequalities in a regression framework, using the least squares estimators of the coefficients. That setup has the advantage that everything can be calculated explicitly. It serves as a preparation for more complicated situations.
Chapter 4 considers general least squares estimators. In Chapter 5, we look at robust regression estimators, density estimation, and binary classification. We consider there a penalty of $\ell_1$-type. Chapter 6 summarizes some tools from empirical process theory. Each chapter ends with bibliographical remarks. Most of the work will be on estimation theory and on where this theory can make use of inequalities for empirical processes. Approximation theory (for example the approximating properties of truncated series expansions) will be touched upon only briefly.
We consider a data set consisting of n observations on a variable, say X, with values in some space $\mathcal{X}$. These observations are denoted by $X_1, \ldots, X_n$. We assume that the observations are independent, and that each observation follows the same probability law as X, say P (independent, non-identically distributed observations will also be considered). The probability distribution P is in whole or in part unknown. Our aim is to estimate P or certain aspects of it. A statistical model is a set $\mathcal{P}$ of candidate distributions for P. If “nothing” is known, one might want to choose $\mathcal{P}$ as the set of “all” distributions on $\mathcal{X}$. However, if one has some idea about the form of P, one may want to incorporate this information into the model set $\mathcal{P}$. In that way the estimation problem is made easier, i.e., the accuracy is greater. If $P \notin \mathcal{P}$, the model is misspecified. In that case one usually has a systematic error (bias) in the estimator.
1.1. Parametric models
A parametric model is of the form $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where the parameter space Θ is a subset of Euclidean space $\mathbb{R}^N$. For example, if the $X_i$ are yes/no answers to a certain question (the binary case), we know that P allows only two possibilities, say 1 and 0 (yes=1, no=0). There is only one parameter, say the probability θ of a yes answer. Here is another example. Example 1.1.
Vilfredo Pareto (1897) noticed that the number of people whose income exceeds level x is often approximately proportional to x−θ , where θ is a parameter which differs from country to country. Therefore, as a model for the distribution of incomes, one may propose the Pareto distribution function 1 Pθ (X ≤ x) = 1 − θ , x > 1, x with density θ fθ (x) = θ+1 , x > 1. x More generally, in a parametric model, there are only a finite number of parameters θ = (θ1 , . . . , θN ). When the model is well specified, one has P = Pθ0 for some θ0 ∈ Θ. However, a low-dimensional, parametric model is often just a mathematical idealization. For example, there seems to be little physical reason for incomes to follow the Pareto law. I.e., a model is only an approximation.
Oracle Inequalities and Regularization
[Figure 1. The trade off between inaccuracy and systematic error. Horizontal axis: complexity; curves: inaccuracy and systematic error; the oracle choice is marked.]
When there are infinitely many parameters in the model, it is called nonparametric. We will however not be very strict in our distinction between parametric and nonparametric, for the following reason. Throughout, the number of observations $n$ is assumed to be "large". We will allow that the choice of the model $\mathcal{P}$ depends on $n$, and is in fact more "rich" for larger $n$. This is only natural, since when we have many observations, we may want to use more flexible models and get more information out of the data. Thus, in a parametric model, $\Theta$ may depend on $n$, and in particular its dimension $N$ may depend on $n$, and in fact grow without limit as $n \to \infty$. This means, strictly speaking, that we deal with a sequence of parametric models with a nonparametric limiting model. We think of such a situation as a nonparametric one. Parametric models (with $N$ "small") are in a sense less rich than nonparametric models, and there is also a range in the complexity of various nonparametric models. The more complex a model, the larger the inaccuracy will be. On the other hand, models that are too simple have large systematic error. (Here, we use a generic terminology. We will be more precise in our definitions later on, e.g., in Section 2.3.) Both inaccuracy and systematic error depend on the model, and on the truth $P$. The optimal model trades off the inaccuracy and systematic error (see Figure 1). However, since $P$ is unknown, it is also not known which model this will be. Only an oracle can tell you that. Our aim will be to mimic this oracle. To evaluate the inaccuracy of a model, we will use empirical process theory. Empirical process theory is about comparing the theoretical distribution $P$ with its empirical counterpart, the empirical distribution $P_n$, introduced in the next section.
1.2. The empirical distribution

The unknown $P$ can be estimated from the data in the following way. Suppose first that we are interested in the probability that an observation falls in $A$, where $A \subset \mathcal{X}$ is a certain set chosen by the researcher. We denote this probability by $P(A)$. It can be estimated by the frequency of $A$, i.e., by
$$P_n(A) = \frac{\text{number of times an observation } X_i \text{ falls in } A}{\text{total number of observations}} = \frac{\text{number of } X_i \in A}{n}.$$
In view of the law of large numbers, $P_n(A)$ should be close to $P(A)$ for $n$ large. We now define the empirical distribution $P_n$ as the probability law that assigns to a set $A$ the probability $P_n(A)$. We regard $P_n$ as an estimator of the unknown $P$. More generally, for a function $f : \mathcal{X} \to \mathbb{R}$, we denote the expectation of $f(X)$ by $P(f) = E f(X)$. Its empirical counterpart is denoted by $P_n(f) = \sum_{i=1}^n f(X_i)/n$. Thus when $f$ is the indicator function of the set $A$, which is denoted by $\mathrm{l}_A$, we have $P(f) = P(A)$, and $P_n(f) = P_n(A)$.

Example 1.2. The empirical distribution function. Suppose that $\mathcal{X} \subset \mathbb{R}$. The distribution function of $X$ is defined as $F_0(x) = P(-\infty, x]$, and the empirical distribution function is
$$F_n(x) = \frac{\text{number of } X_i \le x}{n}.$$
(Here, we have a slight abuse of notation: $F_0$ is the truth. It is not the empirical distribution having 0 observations, a nonsensical object.) Figure 2 plots the distribution function $F_0(x) = 1 - 1/x^2$, $x \ge 1$ (smooth curve) and the empirical distribution function $F_n$ (stair function) of a sample from $X$ with sample size $n = 200$.
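The setting of Example 1.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`pareto_sample`, `empirical_cdf`, `true_cdf`), the seed, and the evaluation grid are hypothetical choices, and the sample is drawn by inverse-transform sampling from the Pareto law with parameter 2, which has the distribution function $F_0(x) = 1 - 1/x^2$ used in the example.

```python
import random

def pareto_sample(n, theta, rng):
    # Inverse-transform sampling: if U is uniform on (0, 1], then
    # U ** (-1 / theta) has distribution function 1 - x ** (-theta), x >= 1.
    return [(1.0 - rng.random()) ** (-1.0 / theta) for _ in range(n)]

def empirical_cdf(sample, x):
    # F_n(x) = (number of X_i <= x) / n
    return sum(1 for xi in sample if xi <= x) / len(sample)

def true_cdf(x, theta=2.0):
    # F_0(x) = 1 - 1 / x**theta for x >= 1, and 0 otherwise
    return 1.0 - x ** (-theta) if x >= 1.0 else 0.0

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
sample = pareto_sample(200, 2.0, rng)
grid = [1.0 + 0.5 * k for k in range(21)]  # x = 1, 1.5, ..., 11
max_gap = max(abs(empirical_cdf(sample, x) - true_cdf(x)) for x in grid)
print(f"max |F_n - F_0| on the grid: {max_gap:.3f}")
```

The printed gap is the largest deviation between the stair function and the smooth curve on the chosen grid; by the law of large numbers it should shrink as the sample size grows.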
[Figure 2. Theoretical and empirical distribution function: $F_0$ and $F_n$ plotted against $x$.]

1.3. Regularization

Many (nonparametric) estimation procedures involve the choice of a tuning parameter, also called regularization parameter or smoothing parameter. Here are two examples.

Example 1.3. Histograms. Suppose $X \in \mathbb{R}$ has density, say $f_0$, with respect to Lebesgue measure. Our aim is to estimate $f_0$. The density $f_0(x)$ at $x$ is defined as the derivative of the distribution function $F_0$ at $x$:
$$f_0(x) = \lim_{h \downarrow 0} \frac{F_0(x+h) - F_0(x-h)}{2h} = \lim_{h \downarrow 0} \frac{P(x-h, x+h]}{2h}.$$
Unfortunately, replacing $P$ by $P_n$ here does not work, as for $h$ small enough, $P_n(x-h, x+h]$ will be either zero or one. Therefore, instead of taking the limit as $h \downarrow 0$, we fix $h$ at a (small) positive value, called the bandwidth. The estimator of $f_0(x)$ becomes
$$\hat f_n(x) = \frac{P_n(x-h, x+h]}{2h} = \frac{\text{number of } X_i \in (x-h, x+h]}{2nh}.$$
A histogram is a plot of this estimator at points $x \in \{x_0, x_0 + 2h, x_0 + 4h, \ldots\}$. Figure 3 shows the histogram, with bandwidth $h = 0.25$, for the sample of size $n = 200$ from the Pareto distribution with parameter $\theta_0 = 2$ (i.e., with some abuse of notation, $f_0 = f_{\theta_0}$). The solid line is the density of this distribution.

The bandwidth $h$ is an example of a tuning parameter. Choosing a value for it is a complicated matter, as it leads to considering variance, bias, and related concepts. Such considerations will be the major topic in these notes. The variance, $\mathrm{var}(\hat f_n(x))$, quantifies the inaccuracy of the histogram at the point $x$. The systematic error can be measured by the squared bias
$$\mathrm{bias}^2(\hat f_n(x)) = (E \hat f_n(x) - f_0(x))^2.$$
The mean square error is
$$\mathrm{MSE}(\hat f_n(x)) = \mathrm{var}(\hat f_n(x)) + \mathrm{bias}^2(\hat f_n(x)).$$
The integrated mean square error is
$$\mathrm{IMSE}(\hat f_n) = \int \mathrm{MSE}(\hat f_n(x))\,dx.$$
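The histogram estimator of Example 1.3 is simple enough to write out directly. The following is a minimal sketch, assuming a Pareto sample with $\theta_0 = 2$, $n = 200$ and $h = 0.25$ as in the text; the function names, the seed, and the evaluation points are hypothetical choices made only for this illustration.

```python
import random

def hist_density(sample, x, h):
    # f_hat_n(x) = #{X_i in (x - h, x + h]} / (2 n h)
    count = sum(1 for xi in sample if x - h < xi <= x + h)
    return count / (2.0 * len(sample) * h)

def pareto_density(x, theta=2.0):
    # f_theta(x) = theta / x**(theta + 1) for x > 1
    return theta / x ** (theta + 1.0) if x > 1.0 else 0.0

rng = random.Random(1)  # fixed seed, for reproducibility only
# Pareto sample with theta = 2 via inverse-transform sampling
sample = [(1.0 - rng.random()) ** (-0.5) for _ in range(200)]
h = 0.25
for x in (1.5, 2.0, 3.0):
    print(x, round(hist_density(sample, x, h), 3), round(pareto_density(x), 3))
```

Rerunning the loop with a smaller or larger `h` gives a feel for the trade off: small bandwidths make the estimate noisy, large ones flatten it.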
The optimal – in the sense of IMSE – choice of the bandwidth h minimizes the integrated mean square error, i.e., trades off variance and squared bias. However,
to carry out this trade off, one needs to know certain aspects of $f_0$. But $f_0$ is unknown! In an attempt to mimic an oracle, one often uses part of the data to estimate $f_0$ with various choices of the bandwidth $h$, and then uses the rest of the data to decide on the choice for $h$. One may also estimate $\mathrm{IMSE}(\hat f_n)$, applying for example least squares cross validation. We will not present the details here. Instead of bandwidth selection, we will study regularization using complexity penalties, as illustrated in Example 1.4. In the intermezzo following this example, it is shown that the two approaches can be closely related.

[Figure 3. True density $f_0$ and a histogram.]

Exercise 1.1. Suppose $f_0(x) = 2/x^3$, $x > 1$. Let $h > 0$ be the bandwidth. At a given $x > 1 + h$, calculate bias and variance of the histogram
$$\hat f_n(x) = \frac{P_n(x-h, x+h]}{2h}.$$
Show that for $x$ fixed, and $h \to 0$ and $nh \to \infty$, the bias is of order $h^2$ ($\mathrm{bias}(\hat f_n(x)) = O(h^2)$) and the variance is of order $1/(nh)$ ($\mathrm{var}(\hat f_n(x)) = O(1/(nh))$). The optimal (in the sense of MSE) choice for $h$ is thus $h_{\mathrm{opt}} = O(n^{-1/5})$. (For a definition of order symbols, see Section 3.5.)

Example 1.4. Penalized least squares. Consider the regression
$$E Y_i = f_0(x_i), \quad i = 1, \ldots, n,$$
where $Y_i$ is a response variable, $x_i$ is a co-variable ($i = 1, \ldots, n$), and where $f_0$ is an unknown function. We examine here the case where $x_i = i/n \in [0, 1]$, $i = 1, \ldots, n$
and $f_0$ is defined on $[0, 1]$. We suppose $f_0$ is not changing too much, in the sense that the squared first derivative of $f_0$ is small, say in terms of the average $\int_0^1 |f_0'(x)|^2\,dx$. As estimator of $f_0$ we propose
$$\hat f_n = \arg\min_f \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 \int_0^1 |f'(x)|^2\,dx \right\}.$$
Here, "arg" stands for "argument", i.e., the location where the minimum is attained. Moreover, $\lambda$ is a tuning (or regularization) parameter. If $\lambda = 0$, the estimator $\hat f_n$ will just interpolate the data. On the other hand, if $\lambda = \infty$, $\hat f_n$ will be a constant function (namely, constantly equal to the average $\sum_{i=1}^n Y_i/n$ of the observations). To the least squares loss function, we have thus added a penalty for choosing a function that is too wiggly. This is called (complexity) regularization.

Figure 4 below plots the true $f$ (which is $f_0$) in blue together with the data (red). The aim is to recover $f_0$ from the data.

[Figure 4. Panels: "True f" and "Noise added, noise level = 0.01".]

[Figure 5. Panels: "Denoised, lambda=0.2, Fit=9.0531e-04" and "Denoised, lambda=0.1, Fit=3.4322e-04".]

Figure 5 shows the estimator $\hat f_n$ (in green) for two choices of the tuning parameter $\lambda$. The fit of $\hat f_n$ is defined as
$$\sum_{i=1}^n |Y_i - \hat f_n(x_i)|^2 / n.$$
Obviously, the smaller value of $\lambda$ gives a better fit. Figure 6 plots the estimator $\hat f_n$ together with $f_0$, for two values of $\lambda$. The error (or "excess risk", see Chapter 2), which is defined here as
$$\sum_{i=1}^n |\hat f_n(x_i) - f_0(x_i)|^2 / n,$$
turns out to be smaller for the smaller value of $\lambda$. Now, in real life situations, it is not possible to make the plots of Figure 6 and/or calculate the error, since the true $f$ is then unknown. Thus, again, we need an oracle to tell us which $\lambda$ to choose. In Section 4.5, we show that by penalizing small values of $\lambda$ one may arrive at an oracle inequality.
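The penalized least squares estimator of Example 1.4 can be computed on the design grid $x_i = i/n$. The sketch below is not the code behind Figures 4 to 6; it rests on the following assumptions. The penalty $\int_0^1 |f'(x)|^2 dx$ is evaluated for the piecewise linear interpolant of the values $f(x_i)$, which gives $n \sum_i (f(x_{i+1}) - f(x_i))^2$, so that the first-order conditions form the tridiagonal system $(I + \lambda^2 n^2 D^T D) f = Y$ with $D$ the first-difference matrix. The true function, the seed, and all helper names are hypothetical.

```python
import math
import random

def penalized_ls(y, lam):
    # Solve (I + c * D^T D) f = y with c = (lam * n)**2 by the Thomas
    # algorithm for tridiagonal systems (assumes len(y) >= 2).
    n = len(y)
    c = (lam * n) ** 2
    diag = [1.0 + c * (1 if i in (0, n - 1) else 2) for i in range(n)]
    off = -c  # constant sub- and super-diagonal entry
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = off / diag[0]
    dp[0] = y[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - off * cp[i - 1]
        cp[i] = off / denom
        dp[i] = (y[i] - off * dp[i - 1]) / denom
    f = [0.0] * n
    f[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        f[i] = dp[i] - cp[i] * f[i + 1]
    return f

def mean_sq(u, v):
    # average squared difference, used for both "fit" and "error"
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

n = 100
rng = random.Random(2)  # fixed seed, for reproducibility only
f0 = [-0.1 * math.sin(math.pi * i / n) for i in range(1, n + 1)]  # hypothetical truth
y = [f + rng.gauss(0.0, 0.01) for f in f0]  # noise level 0.01

for lam in (0.2, 0.1, 0.05):
    f_hat = penalized_ls(y, lam)
    print(f"lambda={lam}: fit={mean_sq(y, f_hat):.3e}, error={mean_sq(f_hat, f0):.3e}")
```

The fit can only improve as $\lambda$ decreases, whereas the error depends on the unknown truth, which is exactly why an oracle would be needed to pick $\lambda$.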
[Figure 6. Panels: "Denoised, lambda=0.05, Error=7.8683e-05" and "Denoised, lambda=0.1, Error=2.8119e-04".]

Intermezzo. As a continuous version of the problem studied in Example 1.4, consider
$$\hat f = \arg\min_f \left\{ \int_0^1 |y(x) - f(x)|^2\,dx + \lambda^2 \int_0^1 |f'(x)|^2\,dx \right\}.$$
In fact, let us formulate an extension, namely a continuous version corresponding to the so-called white noise model
$$dY(x) = f(x)\,dx + \sigma\,dW(x),$$
where $W$ is standard Brownian motion. In that case, the derivative $y(x) = dY(x)/dx$ does not exist, as Brownian motion is nowhere differentiable. We therefore use a formulation avoiding this derivative:
$$\hat f = \arg\min_f \left\{ -2 \int_0^1 f(x)\,dY(x) + \int_0^1 f^2(x)\,dx + \lambda^2 \int_0^1 |f'(x)|^2\,dx \right\}.$$
We show in Lemma 1.1 below that the solution $\hat f$ can be explicitly calculated (using variational calculus). This solution reveals that the tuning parameter $\lambda$ plays the role of a bandwidth parameter.

Lemma 1.1. The solution of the continuous version is
$$\hat f(x) = \frac{C}{\lambda} \cosh\left(\frac{x}{\lambda}\right) + \frac{1}{\lambda} \int_0^x \sinh\left(\frac{u-x}{\lambda}\right) dY(u),$$
where
$$C = \left( Y(1) + \frac{1}{\lambda} \int_0^1 Y(u) \sinh\left(\frac{1-u}{\lambda}\right) du \right) \Big/ \sinh\left(\frac{1}{\lambda}\right).$$

Proof. Partial integration shows that we have to minimize
$$2 \int_0^1 Y f' + \int_0^1 f^2 - 2Y(1)f(1) + \lambda^2 \int_0^1 |f'|^2.$$
Now, replace $f'$ by the function $g$. We then have the restriction $f' = g$. We can express this restriction by adding a Lagrangian term to the function to be minimized, with Lagrange parameters in the function $h$, say. We then arrive at the problem of minimizing
$$L(f, g) = 2 \int_0^1 Y g + \int_0^1 f^2 - 2Y(1)f(1) + \lambda^2 \int_0^1 g^2 - 2 \int_0^1 h(g - f').$$
Invoking again partial integration, rewrite this as
$$L(f, g) = 2 \int_0^1 Y g + \int_0^1 f^2 - 2Y(1)f(1) + \lambda^2 \int_0^1 g^2 - 2 \int_0^1 g h - 2 \int_0^1 f h' + 2h(1)f(1).$$
The Euler equations (see, e.g., Goldstein (1980)) now become
$$f - h' = 0 \quad \text{and} \quad Y - h + \lambda^2 g = 0.$$
Since $g = f'$, this gives the equation
$$h'' = (h - Y)/\lambda^2,$$
with boundary condition $h(1) = Y(1)$. Solving this differential equation yields the result of the lemma.
1.4. Bibliographical remarks

Histograms are special cases of kernel estimators. There are many books on kernel density estimation. We refer to Wand and Jones (1995). The penalized least squares estimator we have considered in Example 1.4 was chosen for ease of exposition. A penalized smoothing spline with penalty on the second derivative, instead of the first, is used more frequently. The penalty is then
$$\lambda^2 \int_0^1 |f''(x)|^2\,dx.$$
We refer to Silverman (1985) and Wahba (1990). A nice discussion of more general (Sobolev) penalties (involving for instance higher derivatives) can be found in the book of Green and Silverman (1994). The proof of Lemma 1.1 is a simple example of variational calculus, which is generally developed in the context of classical mechanics (Goldstein (1980)).
2. M-estimators

We will start with two examples, and then present the general definition of an M-estimator. We furthermore give definitions of excess risk, estimation error and approximation error. Next, we highlight how empirical process theory can be invoked to assess the excess risk of an M-estimator.

2.1. Some examples

Examples of M-estimators are least squares estimators, maximum likelihood estimators, and estimators in binary classification using 0/1 loss. Let us consider the latter two.

Example 2.1. Maximum likelihood. Let $\mathcal{P}$ be a collection of probability measures dominated by a $\sigma$-finite measure $\mu$. Write
$$\mathcal{F} = \{f = d\tilde P/d\mu : \tilde P \in \mathcal{P}\}$$
for the collection of densities. Using the model class $\mathcal{F}$, the maximum likelihood estimator is
$$\hat f_n = \arg\max_{f \in \mathcal{F}} \sum_{i=1}^n \log f(X_i).$$
Recall that "arg" stands for "argument", i.e., $\hat f_n$ is the density in $\mathcal{F}$ where the likelihood of the observations is maximal. Note that we may write
$$\sum_{i=1}^n \log f(X_i)/n = P_n(\log f).$$
This is the empirical counterpart of $\int \log f\,dP = P(\log f)$. The maximum likelihood estimator $\hat f_n$ maximizes $P_n(\log f)$ over all $f \in \mathcal{F}$.
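As a concrete instance of Example 2.1, one may take as model class the Pareto densities $f_\theta(x) = \theta/x^{\theta+1}$ of Example 1.1. The sketch below is an illustration under that assumption: the function names and the seed are hypothetical, and the closed-form maximizer follows from setting the derivative $1/\theta - \sum_i \log X_i / n$ of $P_n(\log f_\theta)$ to zero.

```python
import math
import random

def pareto_log_lik(theta, sample):
    # P_n(log f_theta) with f_theta(x) = theta / x**(theta + 1), x > 1
    return sum(math.log(theta) - (theta + 1.0) * math.log(x) for x in sample) / len(sample)

def pareto_mle(sample):
    # Maximizer of theta -> P_n(log f_theta): theta_hat = n / sum_i log X_i
    return len(sample) / sum(math.log(x) for x in sample)

rng = random.Random(3)  # fixed seed, for reproducibility only
# Pareto sample with theta_0 = 2 via inverse-transform sampling
sample = [(1.0 - rng.random()) ** (-0.5) for _ in range(200)]
theta_hat = pareto_mle(sample)
print(f"theta_hat = {theta_hat:.3f}, P_n(log f) at theta_hat = {pareto_log_lik(theta_hat, sample):.4f}")
```

Evaluating `pareto_log_lik` on a grid of values of `theta` is a simple way to confirm that no grid point beats `theta_hat`.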
Exercise 2.1. Verify that the true density $f_0 = dP/d\mu$ is a maximizer of $P(\log f)$ over all densities $f$.

Exercise 2.2. Check that a histogram with cells chosen a priori (see Example 1.3) is a maximum likelihood estimator for the density on $\mathbb{R}$, with model class
$$\mathcal{F} = \left\{ f = \sum_{k=1}^N \alpha_k \mathrm{l}_{(a_{k-1}, a_k]},\ \alpha_k \ge 0,\ k = 1, \ldots, N,\ \int f(x)\,dx = 1 \right\}.$$
Here, the cell boundaries $a_0 < a_1 < \cdots < a_N$ are fixed.

Example 2.2. Minimum empirical risk estimation in classification. Let $(X, Y)$ be random variables, with $X \in \mathcal{X}$ a co-variable and $Y \in \{0, 1\}$ a label. For example, $X$ may be the features of a mushroom (size, shape, color), and $Y$ indicates whether it is edible or not. A classifier is a function $f : \mathcal{X} \to \{0, 1\}$. So $f$ is the indicator function of a set $G \subset \mathcal{X}$, and instances $x \in G$ are classified with the label 1, whereas $x \notin G$ are classified with the label 0. We will usually identify sets with indicator functions. Define the regression
$$\eta_0(x) = P(Y = 1 | X = x).$$
Bayes rule $f_0$ is to classify observations $X$ with $\eta_0(X) > 1/2$ in the class with label 1 and those with $\eta_0(X) \le 1/2$ in the class with label 0, i.e.,
$$f_0 = \mathrm{l}_{G_0}, \quad G_0 = \{x : \eta_0(x) > 1/2\}.$$
Identifying sets and indicator functions, we also call the set $G_0$ Bayes rule. Let $\{(X_i, Y_i)\}_{i=1}^n$ be a sample from $(X, Y)$. One may estimate Bayes rule in the following way. The number of misclassifications using the classifier $G \subset \mathcal{X}$ is
$$\#\{X_i \in G, Y_i = 0\} + \#\{X_i \notin G, Y_i = 1\} = \sum_{i=1}^n |Y_i - \mathrm{l}_G(X_i)| := n R_n(G),$$
where $R_n(G)$ is the proportion of misclassifications. The prediction error of a classifier $G$ is the theoretical counterpart $R(G)$ of $R_n(G)$:
$$R(G) = P(X \in G, Y = 0) + P(X \notin G, Y = 1) = E R_n(G).$$
The empirical risk minimizer using the model class $\mathcal{G}$ is defined as
$$\hat G_n = \arg\min_{G \in \mathcal{G}} R_n(G).$$

Exercise 2.3. Verify that Bayes rule $G_0$ minimizes $R(G)$ over all subsets $G \subset \mathcal{X}$.
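Empirical risk minimization as in Example 2.2 can be sketched for a very small model class. The assumptions here are not from the text: $\mathcal{X} = [0, 1]$, the model class consists of the half-lines $G = \{x : x > t\}$, and the data are simulated with the hypothetical regression $\eta_0(x) = x$, so that Bayes rule is $G_0 = \{x : x > 1/2\}$. All names and the seed are illustrative.

```python
import random

def empirical_risk(threshold, xs, ys):
    # R_n(G) for G = {x : x > threshold}: proportion of misclassified points
    return sum(abs(y - (1 if x > threshold else 0)) for x, y in zip(xs, ys)) / len(xs)

def erm_threshold(xs, ys):
    # R_n is piecewise constant in the threshold, so it suffices to try
    # -infinity (label everything 1) and each observed x.
    candidates = [float("-inf")] + sorted(xs)
    return min(candidates, key=lambda t: empirical_risk(t, xs, ys))

rng = random.Random(4)  # fixed seed, for reproducibility only
xs = [rng.random() for _ in range(500)]
ys = [1 if rng.random() < x else 0 for x in xs]  # P(Y = 1 | X = x) = x (hypothetical)
t_hat = erm_threshold(xs, ys)
print(f"estimated threshold {t_hat:.3f}, empirical risk {empirical_risk(t_hat, xs, ys):.3f}")
print(f"empirical risk of Bayes rule (t = 0.5): {empirical_risk(0.5, xs, ys):.3f}")
```

By construction the empirical risk of the minimizer is never larger than that of Bayes rule on the same sample, which is the inequality $R_n(\hat G_n) \le R_n(G_0)$ when $G_0$ belongs to the model class.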
2.2. General framework

Examples 2.1 and 2.2 are two examples of M-estimation. The general principle is as follows. We have observed i.i.d. copies $\{X_i\}_{i=1}^n$ of a random variable $X$ with distribution $P$. We choose a model class $\mathcal{F} \subset \bar{\mathcal{F}}$ for the (possibly infinite-dimensional) parameter of interest $f_0 \in \bar{\mathcal{F}}$. Here $\bar{\mathcal{F}}$ is a large class of candidate $f$'s, and we refer to it as the class of all $f$'s. Let, for all $f$, $\gamma_f : \mathcal{X} \to \mathbb{R}$ be a loss function. Define the theoretical risk $R(f) = P(\gamma_f)$. The loss function $\gamma$ is chosen in such a way that the risk is smallest at the parameter of interest $f_0$:
$$f_0 = \arg\min_{\text{all } f} R(f).$$
Let the empirical risk be $R_n(f) = P_n(\gamma_f)$. Let $\mathcal{F} \subset \bar{\mathcal{F}}$ be a model class. The M-estimator $\hat f_n$ ("M" stands for Minimization (or Maximization)) is then
$$\hat f_n = \arg\min_{f \in \mathcal{F}} R_n(f). \tag{2.1}$$
A model class $\mathcal{F}$ is chosen here in the empirical risk minimization, because minimizing the empirical risk over all $f$ generally leads to overfitting. The class $\mathcal{F}$ should on the other hand not be chosen too small, as that may result in a large systematic error. This problem is exactly our object of study, and it will be presented more formally in Section 2.7. Complexity regularization will be used as a procedure for choosing $\mathcal{F}$ "not too large and not too small". In Example 2.1, the loss function is $\gamma_f = -\log f$, with $f$ a density. In Example 2.2 or a more general regression set-up, with i.i.d. co-variables (random design), the situation is within our framework, but we use a different notation. We let $\{(X_i, Y_i)\}_{i=1}^n$ be i.i.d. copies of a random variable $(X, Y)$ with distribution $P$, and consider a loss function $\gamma_f : \mathcal{X} \times \mathbb{R} \to \mathbb{R}$, $f \in \bar{\mathcal{F}}$. (For instance, in Example 2.2, the loss function is $\gamma_f(x, y) = |y - f(x)|$, with $f = \mathrm{l}_G$ the indicator function of a subset $G \subset \mathcal{X}$.) The distribution of the co-variable $X$ is denoted by $Q$, and the empirical distribution of $X_1, \ldots, X_n$ is denoted by $Q_n$. When the co-variables are fixed (non-random design), the regression model concerns a situation with independent but not identically distributed observations. However, this will not really alter the general theory. We formulated this introduction for the i.i.d. case merely for ease of exposition.

2.3. Estimation and approximation error

We will now be somewhat more precise in our definitions of inaccuracy and systematic error. We will refer to the inaccuracy as estimation error and the systematic error as approximation error. Let $\gamma_f$ be a loss function, $R(f) = P(\gamma_f)$, and $f_0$ the overall minimizer of $R(f)$. Let $\mathcal{F}$ be the model class, and $f_*$ the minimizer of $R(f)$ over $\mathcal{F}$.
The empirical risk is $R_n(f) = P_n(\gamma_f)$. Using the M-estimator (or empirical risk minimizer) $\hat f_n$ given in (2.1), its estimation error is defined as $R(\hat f_n) - R(f_*)$. Note that this is a random variable. Below, we will extend the concept, and for example refer to the average value of $R(\hat f_n) - R(f_*)$ (or a quantity of this order) as the estimation error. The approximation error is defined as $R(f_*) - R(f_0)$. This is a non-random quantity. The difference $R(f) - R(f_0)$ is called the excess risk at $f$. Note that it is always nonnegative. We will investigate the behavior of the excess risk $R(\hat f_n) - R(f_0)$. This will lead to probability inequalities for $R(\hat f_n) - R(f_0)$, or bounds for, e.g., the average excess risk $E R(\hat f_n) - R(f_0)$.

Exercise 2.4. Consider a regression model $E(Y | X = x) = f_0(x)$. Let $\gamma_f(x, y) = (y - f(x))^2$ be the least squares loss function. Consider first the case of fixed design (i.e., work conditionally on $X_i = x_i$, $i = 1, \ldots, n$). Let
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2$$
and
$$R(f) = E R_n(f)$$
(where the expectation is conditional given $x_1, \ldots, x_n$). Check that the excess risk $R(f) - R(f_0)$ is equal to $\|f - f_0\|_n^2$, where $\|\cdot\|_n$ is the $L_2(Q_n)$-norm. Suppose that $\mathcal{F}$ is a (finite-dimensional) linear space. Show that $f_*$ is the projection of $f_0$ (in $L_2(Q_n)$) on $\mathcal{F}$ and that hence
$$\|\hat f_n - f_0\|_n^2 = \|\hat f_n - f_*\|_n^2 + \|f_* - f_0\|_n^2.$$
Conclude similarly for the random design case, where $\|\cdot\|_n$ is to be replaced by the $L_2(Q)$-norm $\|\cdot\|$.

2.4. Where empirical process theory comes in

To derive bounds for the excess risk of the M-estimator $\hat f_n$ given in (2.1), we use the inequalities
$$R(\hat f_n) - R(f_*) \ge 0 \quad \text{and} \quad R_n(\hat f_n) - R_n(f_*) \le 0.$$
These inequalities are true because $f_*$ minimizes the theoretical risk over $\mathcal{F}$ and $\hat f_n$ minimizes the empirical risk over $\mathcal{F}$.
Thus we may write
$$0 \le R(\hat f_n) - R(f_*) = -\left[ (R_n(\hat f_n) - R(\hat f_n)) - (R_n(f_*) - R(f_*)) \right] + R_n(\hat f_n) - R_n(f_*)$$
$$\le -\left[ (R_n(\hat f_n) - R(\hat f_n)) - (R_n(f_*) - R(f_*)) \right].$$
We write for $f \in \bar{\mathcal{F}}$,
$$\nu_n(f) = \sqrt{n}\,(R_n(f) - R(f)) = \sqrt{n}\,(P_n(\gamma_f) - P(\gamma_f)).$$
Let $\mathcal{F}_0$ be some subset of $\bar{\mathcal{F}}$. The empirical process indexed by $\mathcal{F}_0$ is $\{\nu_n(f) : f \in \mathcal{F}_0\}$. From the above, we know that
$$0 \le R(\hat f_n) - R(f_*) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n}.$$
Adding $R(f_*) - R(f_0)$ to both sides of this inequality yields moreover
$$R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n} + [R(f_*) - R(f_0)] = I + II, \tag{2.2}$$
with
$$I = -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n}$$
and with $II$ the approximation error
$$II = [R(f_*) - R(f_0)].$$
This inequality reveals the two components $I$ and $II$ when studying the excess risk at $\hat f_n$. The expression $I$ is a bound for the estimation error. Empirical process theory is invoked to examine this term. Handling the approximation error $II$ involves approximation theory. We refer to inequalities like (2.2) as basic inequalities, because such inequalities will be the starting point for deriving bounds for the excess risk at $\hat f_n$.

Exercise 2.5. In Exercise 2.4, define the noise variables $\epsilon_i = Y_i - f_0(x_i)$, $i = 1, \ldots, n$. Verify that for the model with fixed design
$$\nu_n(f) - \nu_n(f_*) = -\frac{2}{\sqrt{n}} \sum_{i=1}^n \epsilon_i (f(x_i) - f_*(x_i)).$$
In the model with random design, it becomes
$$\nu_n(f) - \nu_n(f_*) = -\frac{2}{\sqrt{n}} \sum_{i=1}^n \epsilon_i (f(X_i) - f_*(X_i)) + \sqrt{n} \left[ (\|f - f_0\|_n^2 - \|f - f_0\|^2) - (\|f_* - f_0\|_n^2 - \|f_* - f_0\|^2) \right].$$
The two main tools for studying M-estimators are empirical process theory and approximation theory. Empirical process theory is used to investigate the estimation error of an estimator. There is no straight answer on how to invoke which parts of empirical process theory. In Lemma 2.1 below, we give an example, which highlights the main idea. Empirical process theory supplies us with inequalities for suprema of empirical processes indexed by functions. We will in fact need the behavior of increments of the empirical process. This concerns the question: if two functions $f$ and $f_*$ are "close", then how "small" is the difference $|\nu_n(f) - \nu_n(f_*)|$? We welcome an answer that holds uniformly over $f \in \mathcal{F}$ in a neighborhood of $f_*$, because we can then apply it to a random function (an estimator) in this neighborhood. Empirical process theory gives us probability or moment bounds for the suprema of empirical processes. These can be "directly" derived inequalities. However, a strong tool is based on so-called concentration inequalities. Concentration inequalities are (exponential, or even sub-Gaussian) probability inequalities for the concentration of a random variable around its mean. One can apply them to the random variable representing the supremum of an empirical process. Then, the only task left is to find good bounds for the mean of this supremum. This is in some cases as easy as Cauchy-Schwarz, perhaps preceded by a so-called symmetrization and/or a contraction inequality. These concepts (concentration, symmetrization and contraction) will be discussed in more detail in Chapter 6. Approximation theory will be used to illustrate how the behavior of estimators depends on how well the model approximates the "truth". Since in real life we will actually never find out what the truth is, these illustrations are purely theoretical.

2.5. Some first results, assuming ready-to-use empirical process theory

We will assume two inequalities. The first one, the margin condition, is purely deterministic. It depends very much on the problem under consideration. The second one, the empirical process condition, is an assumption on the increments of the empirical process $\{\nu_n(f) : f \in \mathcal{F}\}$. It is in a ready-to-use form. In the literature, such empirical process inequalities have indeed been established (see also Section 6.6). A very simple example is established in Exercise 3.2. We assume that the two inequalities hold for some metric $d$ on $\mathcal{F}$. The inequalities are tied up in a technical condition on the parameters involved.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$ and for all $f \in \mathcal{F}$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2.$$
In the margin condition, $\kappa$ is an identifiability parameter. If it is small, then $f_0$ is well-identified for the metric $d$. Frequently, the condition only holds for those $f$ with $d(f, f_0)$ bounded by some constant. We skip this issue here to avoid digressions.

Exercise 2.6. In Exercise 2.4, check that the margin condition is met, with $\kappa = 2$ and $d(f, f_0) = \|f - f_0\|_n$ (fixed design) or $d(f, f_0) = \|f - f_0\|$ (random design).
Empirical process condition. For some constants $c_1$, $r > 0$, and $0 \le \beta \le 1$, we have
$$E \left( \sup_{f \in \mathcal{F}} \frac{|\nu_n(f) - \nu_n(f_*)|}{d^\beta(f, f_*)} \right)^r \le c_1^r.$$
In the empirical process condition, $\beta$ is a complexity parameter. If it is small, the class $\mathcal{F}$ is complex, and the increments of the empirical process can be large. When $f = f_*$, the weighted empirical process $(\nu_n(f) - \nu_n(f_*))/d^\beta(f, f_*)$ is defined to be zero. We remark now that meeting the empirical process condition frequently means that the class $\mathcal{F}$ does not contain $f$'s with distance $d(f, f_*)$ not zero but very small. This may not be true for our original $\mathcal{F}$. But actually, as we want to prove that $d(\hat f, f_*)$ is small, this generally turns out to be only a minor technical issue.

Technical condition. We have $\kappa > \beta$ and $r \ge \frac{\kappa}{\kappa - \beta}$.

In Lemma 2.1, we first consider the well specified case, i.e., the case where $f_0 \in \mathcal{F}$. We then show that in the misspecified case, the approximation error occurs as an additional term.

Lemma 2.1. Let $\hat f_n$ be the M-estimator defined in (2.1). Suppose the margin condition, the empirical process condition, and the technical condition are met. If $f_0 \in \mathcal{F}$ (well specified case), we have
$$E R(\hat f_n) - R(f_0) \le c_1^{\frac{\kappa}{\kappa - \beta}} c_2^{\frac{\beta}{\kappa - \beta}} n^{-\frac{\kappa}{2(\kappa - \beta)}}. \tag{2.3}$$
More generally, if possibly $f_0 \notin \mathcal{F}$ (misspecified case), we have for any $0 < \delta < 1$,
$$E R(\hat f_n) - R(f_0) \le \left( \frac{1 + \delta}{1 - \delta} \right) \{V_n + R(f_*) - R(f_0)\}, \tag{2.4}$$
where
$$V_n = 2 c_1^{\frac{\kappa}{\kappa - \beta}} (c_2/\delta)^{\frac{\beta}{\kappa - \beta}} n^{-\frac{\kappa}{2(\kappa - \beta)}}. \tag{2.5}$$

Proof. Define $Z_n = |\nu_n(\hat f_n) - \nu_n(f_*)|/d^\beta(\hat f_n, f_*)$. From basic inequality (2.2) it follows that
$$R(\hat f_n) - R(f_0) \le Z_n d^\beta(\hat f_n, f_*)/\sqrt{n} + R(f_*) - R(f_0). \tag{2.6}$$
Suppose first that $f_0 \in \mathcal{F}$, so that $f_0 = f_*$. By the margin condition, we know that
$$d^\beta(\hat f_n, f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}}.$$
Hence, from (2.6)
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n}.$$
This implies
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa - \beta}} Z_n^{\frac{\kappa}{\kappa - \beta}} n^{-\frac{\kappa}{2(\kappa - \beta)}}.$$
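Result (2.3) now follows from
$$E Z_n^{\frac{\kappa}{\kappa - \beta}} \le (E Z_n^r)^{\frac{\kappa}{r(\kappa - \beta)}} \le c_1^{\frac{\kappa}{\kappa - \beta}}. \tag{2.7}$$
More generally, in the possibly misspecified case, we use that
$$d^\beta(\hat f_n, f_*) \le d^\beta(\hat f_n, f_0) + d^\beta(f_*, f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} + c_2^{\frac{\beta}{\kappa}} (R(f_*) - R(f_0))^{\frac{\beta}{\kappa}}.$$
Now, use the Technical Lemma below. Use it twice. We find
$$R(\hat f_n) - R(f_0) \le c_2^{\frac{\beta}{\kappa}} (R(\hat f_n) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n} + c_2^{\frac{\beta}{\kappa}} (R(f_*) - R(f_0))^{\frac{\beta}{\kappa}} Z_n/\sqrt{n} + R(f_*) - R(f_0)$$
$$\le \delta (R(\hat f_n) - R(f_0)) + \delta (R(f_*) - R(f_0)) + 2 (c_2/\delta)^{\frac{\beta}{\kappa - \beta}} Z_n^{\frac{\kappa}{\kappa - \beta}} n^{-\frac{\kappa}{2(\kappa - \beta)}} + R(f_*) - R(f_0).$$
Conclude invoking (2.7).

The next lemma was used in the proof of Lemma 2.1. It will in fact also be of help in proofs in Chapter 5.

Technical Lemma. We have for all positive $v$, $t$, and $\delta$, and $\kappa > \beta$,
$$v t^{\frac{\beta}{\kappa}} \le \delta t + v^{\frac{\kappa}{\kappa - \beta}} \delta^{-\frac{\beta}{\kappa - \beta}}.$$

Proof. Suppose first that $v/\delta \le t^{\frac{\kappa - \beta}{\kappa}}$. Then obviously
$$v t^{\frac{\beta}{\kappa}} = \frac{v}{\delta}\, \delta t^{\frac{\beta}{\kappa}} \le \delta t.$$
Conversely, if $v/\delta \ge t^{\frac{\kappa - \beta}{\kappa}}$, then $t \le (v/\delta)^{\frac{\kappa}{\kappa - \beta}}$. So then
$$v t^{\frac{\beta}{\kappa}} \le v \left( \frac{v}{\delta} \right)^{\frac{\beta}{\kappa - \beta}} = v^{\frac{\kappa}{\kappa - \beta}} \delta^{-\frac{\beta}{\kappa - \beta}}.$$

Recall that we defined the estimation error as $R(\hat f_n) - R(f_*)$. Now, in a general context, the estimation error and approximation error cannot be studied separately. This is primarily because the margin condition is assumed at $f_0$, not at $f_*$. Also, we will generally bound the estimation error by the term $I$ in (2.2) involving the empirical process. For this term we will in turn consider a (probability) bound. To understand more clearly the meaning of the formulas, we will be flexible in our use of terminology, and refer to such upper bounds as estimation error. For example, in the context of Lemma 2.1, $V_n$ in (2.5) is (an upper bound for) the (average) estimation error, and we simply refer to it as estimation error.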
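The Technical Lemma used in the proof of Lemma 2.1 can be sanity-checked numerically. The sketch below is no substitute for the proof; it merely evaluates both sides of $v t^{\beta/\kappa} \le \delta t + v^{\kappa/(\kappa-\beta)} \delta^{-\beta/(\kappa-\beta)}$ at randomly drawn parameter values. The sampling ranges, the seed, and the function names are hypothetical choices.

```python
import random

def lemma_lhs(v, t, beta, kappa):
    # left-hand side: v * t^(beta / kappa)
    return v * t ** (beta / kappa)

def lemma_rhs(v, t, delta, beta, kappa):
    # right-hand side: delta * t + v^(kappa/(kappa-beta)) * delta^(-beta/(kappa-beta))
    e = kappa - beta
    return delta * t + v ** (kappa / e) * delta ** (-beta / e)

rng = random.Random(5)  # fixed seed, for reproducibility only
violations = 0
for _ in range(10000):
    v = rng.uniform(0.01, 10.0)
    t = rng.uniform(0.01, 10.0)
    delta = rng.uniform(0.01, 10.0)
    beta = rng.uniform(0.0, 0.9)   # 0 <= beta < kappa
    kappa = rng.uniform(1.0, 3.0)  # kappa >= 1
    # small relative tolerance for floating-point rounding
    if lemma_lhs(v, t, beta, kappa) > lemma_rhs(v, t, delta, beta, kappa) * (1.0 + 1e-12):
        violations += 1
print("violations found:", violations)
```

In the proof of Lemma 2.1 the lemma is applied with $v = c_2^{\beta/\kappa} Z_n/\sqrt{n}$ and $t$ equal to the relevant excess risk, which is how the factor $(c_2/\delta)^{\beta/(\kappa-\beta)}$ in (2.5) arises.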
2.6. Balancing estimation and approximation error

Lemma 2.1 gives an upper bound for the average excess risk, involving the (bound for the) estimation error and the approximation error. More generally, suppose we have shown that for some constants $C$ and $V_n$
$$E R(\hat f_n) - R(f_0) \le C \{V_n + R(f_*) - R(f_0)\}.$$
Suppose that the constant $C$ does not depend on $P$ or on the model class $\mathcal{F}$. Clearly, the estimation error $V_n$ depends on the model $\mathcal{F}$, for example in (2.5) through the parameter $\beta$. Moreover, $V_n$ may also depend on $P$, for example as in (2.5) through the parameter $\kappa$. It is clear that also the approximation error, which we write for short as
$$B_n^2 = R(f_*) - R(f_0),$$
depends on $\mathcal{F}$ and $P$. To express these dependencies, let us write $V_n = V_n(\mathcal{F}, P)$ and $B_n^2 = B_n^2(\mathcal{F}, P)$. Given a collection of models $\{\mathcal{F}\}$, the optimal model $\mathcal{F}_*$ would now be
$$\mathcal{F}_* = \arg\min_{\{\mathcal{F}\}} \left\{ V_n(\mathcal{F}, P) + B_n^2(\mathcal{F}, P) \right\}.$$
But since the optimal model $\mathcal{F}_*$ depends on $P$, only an oracle knows what $\mathcal{F}_*$ is. The oracle is mimicked if we can construct an estimator with excess risk at most that of the estimator when $\mathcal{F}_*$ were known. In our theory, we will not be able to arrive at exact oracle behavior. Instead, a trade off up to constants independent of the sample size $n$, or possibly up to $(\log n)$-factors, is established. Having essentially the large-$n$ situation in mind, we will not be too much concerned about such constants and $(\log n)$-factors. In conclusion, we look for an estimator $\hat f_n$ satisfying up to $(\log n)$-terms
$$E R(\hat f_n) - R(f_0) = O\left( V_n(\mathcal{F}_*, P) + B_n^2(\mathcal{F}_*, P) \right).$$
The approach to this end will be to add a penalty to the empirical risk, so that complicated models are penalized. Such a method is called penalized empirical risk minimization, and it is a form of complexity regularization.

2.7. Bibliographical remarks

For the application of empirical process theory to evaluate M-estimators, we refer to van der Vaart and Wellner (1996), and van de Geer (2000) and the references in these books. We remark that much recent work uses refined concentration inequalities (e.g., Massart (2000b), see also Section 6.1). The empirical process condition is a condition on the weighted empirical process. See Section 6.4 for techniques. The margin condition is from Tsybakov (2004) (see also Mammen and Tsybakov (1999)). It in particular plays an important role in the classification problem (see Example 2.2), but also in other contexts (see van de Geer (2003) and Chapter 5). The Technical Lemma is as in Tsybakov and van de Geer (2005), albeit with a larger constant but a simpler proof.
3. The sequence space formulation

The aim of this chapter is to reveal the main ingredients of complexity regularization while keeping technicalities to a minimum. We will consider a situation where the average excess risk can be computed exactly and without much effort. We examine orthogonal series expansions of a regression function. The coefficients are estimated by least squares. The question we address is: which coefficients should we keep? Keeping many coefficients results in an estimator of the regression curve with large variance. On the other hand, killing many coefficients might lead to a large bias. An oracle can choose the optimal trade off between bias and variance. Hard- and soft-thresholding are presented as estimation methods that mimic this oracle. In Section 3.2 the concept of sparseness is introduced as motivation for considering the collection of models given in Section 3.3. Section 3.4 presents the model an oracle would select from this collection. Section 3.5 states that hard- and soft-thresholding estimators mimic the oracle, and presents a proof of this for the soft-thresholding case, using the empirical process result of Section 3.6. The consequences of the theory for sparse signals are left as exercises (Exercises 3.4 and 3.5).

3.1. Reformulation of the regression problem

Consider the regression
$$Y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \ldots, n,$$
where $\epsilon_1, \ldots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal{X}$. Our objective is to put forward the basic idea of an oracle inequality. We want to facilitate the exposition as much as possible. In this spirit, we make the simplifying assumption of normally distributed errors, that is, $\epsilon_1, \ldots, \epsilon_n$ are assumed to be i.i.d. $N(0, \sigma^2)$-distributed. We may collect the observations $Y = (Y_1, \ldots, Y_n)$ in a (random) vector in $\mathbb{R}^n$. The regression model takes the mean of $Y$ as a (possibly partly) unknown vector in $\mathbb{R}^n$. The co-variables $x_1, \ldots, x_n$ are considered as fixed (nonrandom design) in this chapter. Recall that $Q_n$ denotes the empirical distribution of $x_1, \ldots, x_n$. Let $(\psi_1, \ldots, \psi_n)$ be certain given functions, chosen in such a way that they form an orthonormal basis in $L_2(Q_n)$. Thus $\psi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \ldots, n$, and
$$\frac{1}{n} \sum_{i=1}^n \psi_j^2(x_i) = 1, \quad \forall\, j,$$
i.e., the functions have length 1, and
$$\frac{1}{n} \sum_{i=1}^n \psi_j(x_i) \psi_k(x_i) = 0, \quad \forall\, k \ne j,$$
i.e., the functions are orthogonal to each other.
S. van de Geer

Define now the inner products
$$\tilde Y_j = \frac{1}{n} \sum_{i=1}^n Y_i \psi_j(x_i), \quad j = 1, \dots, n,$$
and
$$\theta_{j,0} = \frac{1}{n} \sum_{i=1}^n f_0(x_i)\psi_j(x_i), \quad j = 1, \dots, n.$$
We then have
$$\tilde Y_j = \theta_{j,0} + \tilde\epsilon_j, \quad j = 1, \dots, n, \tag{3.1}$$
where
$$\tilde\epsilon_j = \frac{1}{n} \sum_{i=1}^n \epsilon_i \psi_j(x_i), \quad j = 1, \dots, n.$$

Exercise 3.1. Show that $\tilde\epsilon_1, \dots, \tilde\epsilon_n$ are i.i.d. and $N(0, \sigma^2/n)$-distributed.

We call (3.1) the sequence space formulation. If the regression function $f_0$ is completely unknown, the expectation $\theta_0 = (\theta_{1,0}, \dots, \theta_{n,0})$ of the random vector $\tilde Y = (\tilde Y_1, \dots, \tilde Y_n)$ is completely unknown. So then there are $n$ unknowns, that is, as many unknowns as there are observations. Nevertheless, when the signal is sparse, one can estimate the vector $\theta_0$ (in a global sense). By sparseness, we mean that most of the elements in $\theta_0$ are zero or almost zero. In the literature, the sparseness of a representation is often defined as the number of zero coefficients, so that one representation is sparser than another iff it has fewer non-zero coefficients. Section 3.2 gives a formally somewhat different definition, giving the possibility of a more refined comparison.

We end this section with some remarks. In practice, one has to make a choice for the orthonormal basis $\{\psi_j\}$. This is like choosing a language to interpret a given (noisy) text. Here, the special properties of the data set are of importance. For example, some signals are well described by taking a Fourier basis, others by wavelets or trigonometric series. The problem is closely related to data compression. One hopes to choose a basis such that $\theta_0$ is indeed sparse, i.e., that one has an economical approximate representation with only a few non-zero coefficients. In the literature, most systems of basis functions are orthonormal for Lebesgue measure instead of $Q_n$. We assume orthonormality in $L_2(Q_n)$ because it makes it possible to avoid technical calculations. We note moreover that in many applications, sparse representations can only be obtained using non-orthogonal (and perhaps overcomplete) representations (for example for representing sharp edges in a picture). The mapping $x \mapsto \psi(x)$ from $\mathcal{X}$ to $\mathbb{R}^n$ is sometimes called the feature mapping.
For example, if f0 is a picture, then ψ may represent features like angles, directions, and shapes. In some other situations ψ is of the form ψj (xi ) = K(|xi − xj |/h) where | · | is some metric on X , K is some kernel and h a regularization parameter, often called
the width. In that case, sparseness may be less of an issue. It is rather the choice of the parameter $h$ that is of importance here. If there are many sets of basis functions to choose from, we call them dictionaries. One may use a data dependent method to choose a dictionary, but we will not consider this issue. In the next section, we take the sequence space formulation as a starting point, and we will omit the "tilde" in our notation.

3.2. Estimating the mean of a normal vector

In this section we study the estimation of the vector $\theta_0 \in \mathbb{R}^n$ from observations
$$Y_j = \theta_{j,0} + \epsilon_j, \quad j = 1, \dots, n,$$
with $\epsilon_1, \dots, \epsilon_n$ independent $N(0, \sigma^2/n)$-distributed. Denote the Euclidean norm on $\mathbb{R}^n$ by
$$\|\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2, \quad \theta \in \mathbb{R}^n.$$
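As an illustration of the sequence space formulation of Section 3.1, the following sketch maps regression data to coefficients. The cosine basis, the equispaced design $x_i = (i - 1/2)/n$, the test function and all helper names are assumptions made for this example only; the text leaves the basis unspecified.

```python
import math
import random


def cosine_basis(n):
    """Return psi[j][i] = psi_j(x_i) for x_i = (i + 1/2)/n, i = 0..n-1.

    This discrete cosine system is orthonormal in L2(Q_n) for that design.
    """
    psi = [[1.0] * n]
    for j in range(1, n):
        psi.append([math.sqrt(2.0) * math.cos(math.pi * j * (i + 0.5) / n)
                    for i in range(n)])
    return psi


def to_sequence_space(values, psi):
    """Inner products (1/n) * sum_i values[i] * psi_j(x_i), one per basis function."""
    n = len(values)
    return [sum(v * p for v, p in zip(values, row)) / n for row in psi]


def simulate(n=64, sigma=0.5, seed=0):
    """Return (theta_0, Y-tilde) for an illustrative regression function."""
    rng = random.Random(seed)
    x = [(i + 0.5) / n for i in range(n)]
    f0 = [math.cos(math.pi * xi) + 0.5 * math.cos(3.0 * math.pi * xi) for xi in x]
    y = [f + rng.gauss(0.0, sigma) for f in f0]
    psi = cosine_basis(n)
    return to_sequence_space(f0, psi), to_sequence_space(y, psi)
```

In this example only two entries of `theta_0` are non-zero, so the signal is sparse in the chosen basis, while every entry of the transformed observations carries noise of variance $\sigma^2/n$.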
[Figure 7: A sparse signal.]

Now, we want to find an estimator $\hat\theta_n$ such that $\|\hat\theta_n - \theta_0\|_n$ is small. In general, one cannot guarantee that a good estimator exists. But if we believe that most of the coefficients $\theta_{j,0}$ are zero or at least not too big, we can construct an estimator
that exploits this as much as possible. A signal with only few large coefficients is called sparse. It is like the starry night sky: if you consider the sky as a set of pixels, then at most pixels there is no star (no signal), or it is too far away, and the light you see there is mainly background noise. As mathematical description of sparseness, we take:

Definition. The signal $\theta_0 \in \mathbb{R}^n$ is called sparse if for some $0 \le r < 2$,
$$\sum_{j=1}^n |\theta_{j,0}|^r \le 1. \tag{3.2}$$
Note that sparseness depends on the parameter $r$, which is a roughness parameter. Very small values of $r$ correspond to very sparse signals. The extreme case is $r = 0$, in which case there is at most one non-zero coefficient (using the convention $0^0 = 0$ and $z^0 = 1$ for $z > 0$). The constant 1 in the right-hand side of (3.2) is chosen for ease of exposition. It could be replaced by any other constant by a rescaling argument. Lemma 3.1 below states that our definition of sparseness implies that a sparse signal has only a few large coefficients. In Lemma 3.2, it is shown that such a signal can be well approximated by one with only a few non-zero coefficients.

Lemma 3.1. Suppose $\theta_0$ is sparse. Then for all $\lambda > 0$,
$$\#\{j : |\theta_{j,0}| > \lambda\} \le \lambda^{-r}.$$

Proof. By the definition of sparseness,
$$\sum_{j=1}^n |\theta_{j,0}|^r \le 1.$$
On the other hand,
$$\sum_{j=1}^n |\theta_{j,0}|^r \ge \sum_{|\theta_{j,0}| > \lambda} |\theta_{j,0}|^r \ge \#\{j : |\theta_{j,0}| > \lambda\}\,\lambda^r.$$
Combining the two displays gives the result.
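Lemma 3.2. Let
$$\theta_{j,*} = \begin{cases} \theta_{j,0} & \text{if } |\theta_{j,0}| > \lambda, \\ 0 & \text{if } |\theta_{j,0}| \le \lambda. \end{cases}$$
If $\theta_0$ is sparse we have
$$\|\theta_* - \theta_0\|_n^2 \le \lambda^{2-r}.$$

Proof. By the definition of $\theta_*$, we have
$$\|\theta_* - \theta_0\|_n^2 = \sum_{|\theta_{j,0}| \le \lambda} |\theta_{j,0}|^2 \le \lambda^{2-r} \sum_{j=1}^n |\theta_{j,0}|^r \le \lambda^{2-r},$$
using in the last inequality the assumed sparseness of $\theta_0$.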
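Lemmas 3.1 and 3.2 can be checked numerically on a small example. The sketch below is only an illustration; the vector `THETA0`, the choice $r = 1$ and the helper names are assumptions, not part of the text.

```python
def is_sparse(theta0, r):
    """Check sum_j |theta_j|^r <= 1, using the convention 0^0 = 0."""
    return sum(abs(t) ** r for t in theta0 if t != 0.0) <= 1.0


def count_large(theta0, lam):
    """Number of coefficients with absolute value above lam (Lemma 3.1)."""
    return sum(1 for t in theta0 if abs(t) > lam)


def truncation_error(theta0, lam):
    """Squared distance between theta0 and its thresholded version (Lemma 3.2)."""
    return sum(t * t for t in theta0 if abs(t) <= lam)


# Illustrative signal: four non-zero entries, sum of absolute values 0.9375 <= 1.
R = 1.0
THETA0 = [0.5, -0.25, 0.125, 0.0625] + [0.0] * 96


def lemmas_hold(theta0, r, lam):
    """True if both bounds lam^(-r) and lam^(2-r) hold for this threshold."""
    return (count_large(theta0, lam) <= lam ** (-r)
            and truncation_error(theta0, lam) <= lam ** (2.0 - r))
```

For instance, with `lam = 0.1` three coefficients exceed the threshold (the bound allows ten), and the truncation error is $0.0625^2$, well below $0.1$.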
3.3. A collection of models

Recall that $Y_j = \theta_{j,0} + \epsilon_j$, $j = 1, \dots, n$, with $\epsilon_1, \dots, \epsilon_n$ i.i.d. $N(0, \sigma^2/n)$. Now, we have in mind the situation where we believe that most coefficients in $\theta_0$ are small. But we do not know which coefficients are small, and we certainly do not know whether the signal is sparse with a given roughness parameter $r$. We stress that here the purpose of some definition of sparseness is to motivate our choice of model class. In the rest of this chapter, sparseness is only used as illustration of the implications of an oracle inequality, see Exercises 3.4 and 3.5. So we believe and hope most coefficients can be best estimated as being zero, but we don't know which ones. Suppose we decide to set all coefficients with index $j \notin \mathcal{J}$ ($\mathcal{J} \subset \{1, \dots, n\}$) to zero. In the language of the previous chapter, our model class is then
$$\mathcal{F} = \Theta(\mathcal{J}) = \{\theta \in \mathbb{R}^n : \theta_j = 0 \ \forall\, j \notin \mathcal{J}\}.$$
As empirical risk, we take the least squares loss
$$R_n(\theta) = \sum_{j=1}^n (Y_j - \theta_j)^2$$
with theoretical counterpart
$$R(\theta) = E R_n(\theta) = \|\theta - \theta_0\|_n^2 + \sigma^2.$$
Note thus that $R(\theta_0) = \sigma^2$, and that the excess risk is equal to $R(\theta) - R(\theta_0) = \|\theta - \theta_0\|_n^2$. The least squares estimator over the model $\Theta(\mathcal{J})$ is
$$\hat\theta_n(\mathcal{J}) = \arg\min_{\theta \in \Theta(\mathcal{J})} R_n(\theta).$$
It is clear that this estimator has coefficients
$$\hat\theta_{j,n}(\mathcal{J}) = \begin{cases} Y_j & \text{if } j \in \mathcal{J}, \\ 0 & \text{if } j \notin \mathcal{J}. \end{cases}$$
We will now show that the (average) estimation error of the estimator is in this case its variance and its approximation error equals its squared bias. The excess risk is therefore the mean square error. Define the best approximation within the class
$$\theta_*(\mathcal{J}) = \arg\min_{\theta \in \Theta(\mathcal{J})} R(\theta).$$
This vector has coefficients
$$\theta_{j,*}(\mathcal{J}) = \begin{cases} \theta_{j,0} & \text{if } j \in \mathcal{J}, \\ 0 & \text{if } j \notin \mathcal{J}. \end{cases}$$
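The estimation error is
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_*(\mathcal{J})) = E\|\hat\theta_n(\mathcal{J}) - \theta_*(\mathcal{J})\|_n^2 = \sum_{j \in \mathcal{J}} E(Y_j - \theta_{j,0})^2 = \frac{\sigma^2 |\mathcal{J}|}{n}.$$
Indeed, this is the variance $E\|\hat\theta_n(\mathcal{J}) - E\hat\theta_n(\mathcal{J})\|_n^2$. The approximation error is
$$R(\theta_*(\mathcal{J})) - R(\theta_0) = \|\theta_*(\mathcal{J}) - \theta_0\|_n^2 = \sum_{j \notin \mathcal{J}} \theta_{j,0}^2.$$
Here, we indeed recognize the squared bias $\|E\hat\theta_n(\mathcal{J}) - \theta_0\|_n^2$. The average excess risk is thus the mean square error
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_0) = \frac{\sigma^2 |\mathcal{J}|}{n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2.$$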
Exercise 3.2. Let us compare this result with the one of Lemma 2.1. With the notation used there, one has model class $\mathcal{F} = \Theta(\mathcal{J})$, and empirical process
$$\nu_n(\theta) = -2\sqrt{n} \sum_{j=1}^n \epsilon_j \theta_j.$$
The metric we take is $d(\theta, \tilde\theta) = \|\theta - \tilde\theta\|_n$. Check that the margin condition holds with $\kappa = 2$ and $c_2 = 1$. The empirical process condition is met for $r = 2$, and for $\beta = 1$ and $c_1^2 = 4\sigma^2 |\mathcal{J}|$. The estimation error $V_n$ defined in (2.5) becomes with these values
$$V_n = \frac{8\sigma^2 |\mathcal{J}|}{\delta n}.$$
Now, verify the technical condition. The result (2.3) of Lemma 2.1 becomes
$$E R(\hat\theta_n(\mathcal{J})) - R(\theta_0) \le \frac{1+\delta}{1-\delta} \left\{ \frac{8\sigma^2 |\mathcal{J}|}{\delta n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 \right\}.$$
Thus, up to constants, Lemma 2.1 produces the same answer as direct calculations.

3.4. The model an oracle would select

An oracle chooses $\mathcal{J}$ as the set $\mathcal{J}_*$ which minimizes the mean square error, i.e.,
$$\mathcal{J}_* = \arg\min_{\mathcal{J} \subset \{1, \dots, n\}} \left\{ \frac{\sigma^2 |\mathcal{J}|}{n} + \sum_{j \notin \mathcal{J}} \theta_{j,0}^2 \right\}.$$

Exercise 3.3. Show that the index set an oracle would select is
$$\mathcal{J}_* = \{j : |\theta_{j,0}| > \sqrt{\sigma^2/n}\}.$$
(You can use similar arguments as in the proof of Lemma 3.5.)
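The oracle's choice can be sanity-checked numerically for a small $n$ by comparing the threshold rule with a brute-force search over all $2^n$ index sets. This is a sketch under assumed values (the vector `THETA0`, $\sigma^2 = 1$ and the function names are hypothetical); it is a check, not a proof of Exercise 3.3.

```python
from itertools import combinations


def mean_square_error(theta0, J, sigma2):
    """sigma^2 |J| / n + sum over j not in J of theta_{j,0}^2, with n = len(theta0)."""
    n = len(theta0)
    variance = sigma2 * len(J) / n
    squared_bias = sum(t * t for j, t in enumerate(theta0) if j not in J)
    return variance + squared_bias


def oracle_set(theta0, sigma2):
    """Indices whose coefficient exceeds sqrt(sigma^2 / n) in absolute value."""
    threshold = (sigma2 / len(theta0)) ** 0.5
    return {j for j, t in enumerate(theta0) if abs(t) > threshold}


def brute_force_minimum(theta0, sigma2):
    """Smallest mean square error over all subsets J (feasible only for small n)."""
    indices = range(len(theta0))
    return min(mean_square_error(theta0, set(J), sigma2)
               for k in range(len(theta0) + 1)
               for J in combinations(indices, k))


# Illustrative coefficients (0-based indices); threshold is sqrt(1/8), about 0.354.
THETA0 = [1.0, -0.6, 0.4, 0.3, -0.2, 0.1, 0.05, 0.0]
```

Here the oracle keeps the first three coefficients, and no other subset achieves a smaller mean square error.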
Exercise 3.4. Suppose that $\theta_0$ is sparse. Show that for the model $\Theta(\mathcal{J}_*)$ chosen by the oracle,
$$E\|\hat\theta_n(\mathcal{J}_*) - \theta_0\|_n^2 \le 2 \left( \frac{\sigma^2}{n} \right)^{\frac{2-r}{2}}.$$

3.5. Hard- and soft-thresholding

The oracle estimates only the coefficients $\theta_{j,0}$ which are in absolute value bigger than $\sqrt{\sigma^2/n}$. The idea is now to replace the unknown coefficients $\theta_{j,0}$ by the observations $Y_j$, $j = 1, \dots, n$. First of all, it should then be noted that the noise level $\sigma^2$ is generally unknown. However, this problem is minor, as one may construct estimators of $\sigma^2$. Here, we assume for simplicity that $\sigma^2$ is known. A more severe problem is that we now estimate the $\theta_{j,0}$ in order to construct an estimator of the $\theta_{j,0}$! It actually turns out that this procedure works if we make the threshold a bit larger than $\sqrt{\sigma^2/n}$, namely, it should be chosen $\asymp \sqrt{2(\log n)\sigma^2/n}$. Then the oracle is "almost" mimicked (see Lemma 3.3). We now first present the definitions of the hard-thresholding and soft-thresholding estimators. Lemmas 3.3 and 3.4 give the oracle inequality for these estimators, and Lemmas 3.5 and 3.6 put them in the framework of penalized M-estimation. We then have a clearer picture of the type of oracle inequalities we may expect for more general M-estimators.

Let $\lambda_n \ge 0$ be some threshold. This threshold will be the regularization parameter in the present context.

Definition of the hard-thresholding estimator.
$$\hat\theta_{j,n}(\mathrm{hard}) = \begin{cases} Y_j & \text{if } |Y_j| > \lambda_n, \\ 0 & \text{if } |Y_j| \le \lambda_n, \end{cases} \quad j = 1, \dots, n.$$

Definition of the soft-thresholding estimator.
$$\hat\theta_{j,n}(\mathrm{soft}) = \begin{cases} Y_j - \lambda_n & \text{if } Y_j > \lambda_n, \\ Y_j + \lambda_n & \text{if } Y_j < -\lambda_n, \\ 0 & \text{if } |Y_j| \le \lambda_n, \end{cases} \quad j = 1, \dots, n.$$

It is shown in Lemmas 3.3 and 3.4 that the estimators $\hat\theta_n(\mathrm{hard})$ and $\hat\theta_n(\mathrm{soft})$ have similar oracle properties, i.e., they have up to $(\log n)$-terms the same mean square error as when using the model $\Theta(\mathcal{J}_*)$ given by the oracle. Lemmas 3.3 and 3.4 are proved in Donoho and Johnstone (1994b), by direct calculations. We will reconsider the oracle properties of the soft-thresholding estimator in Lemma 3.7, in a fashion that allows extension to other M-estimation contexts. Lemma 3.3 for the hard-thresholding estimator is stated in an asymptotic sense. We use the following notation and terminology. Let $\{z_n\}$ and $\{\delta_n\}$ be sequences of positive numbers. We say that $z_n = o(\delta_n)$ ($z_n$ is of smaller order than $\delta_n$) if $z_n/\delta_n \to 0$ as $n \to \infty$. Moreover, $z_n = O(\delta_n)$ ($z_n$ is of order $\delta_n$) means that $\limsup_{n \to \infty} z_n/\delta_n < \infty$, and $z_n \asymp \delta_n$ ($z_n$ is asymptotically equal to $\delta_n$) means
$z_n/\delta_n \to 1$. If $\{\hat\vartheta_n\}$ is a sequence of estimators of some parameter $\vartheta_0$ in some metric space with metric $d$, we write $d(\hat\vartheta_n, \vartheta_0) = O_P(\delta_n)$ ($\hat\vartheta_n$ converges (to $\vartheta_0$) with rate $\delta_n$) if $d(\hat\vartheta_n, \vartheta_0)/\delta_n$ remains bounded in probability. Note that sufficient for the latter is $E d(\hat\vartheta_n, \vartheta_0) = O(\delta_n)$.

Lemma 3.3. Take for some $0 < \alpha < 1$,
$$(1 - \alpha)(\log\log n)\sigma^2/n \le \lambda_n^2 - 2(\log n)\sigma^2/n \le o((\log n)\sigma^2/n).$$
Then
$$E\|\hat\theta_n(\mathrm{hard}) - \theta_0\|_n^2 \le L_n \left\{ \frac{\sigma^2(|\mathcal{J}_*| + 1)}{n} + \sum_{j \notin \mathcal{J}_*} \theta_{j,0}^2 \right\},$$
where $L_n \asymp 2\log n$.

Lemma 3.4. Take $\lambda_n^2 = 2(\log n)\sigma^2/n$. Then
$$E\|\hat\theta_n(\mathrm{soft}) - \theta_0\|_n^2 \le (2\log n + 1) \left\{ \frac{\sigma^2(|\mathcal{J}_*| + 1)}{n} + \sum_{j \notin \mathcal{J}_*} \theta_{j,0}^2 \right\}.$$

Exercise 3.5. Suppose that $\theta_0$ is sparse. Show that when the threshold is chosen as in Lemma 3.3 (Lemma 3.4), the squared rate of convergence for the hard-thresholding (soft-thresholding) estimator is
$$\left( \frac{\sigma^2 \log n}{n} \right)^{\frac{2-r}{2}}.$$
Compare with Exercise 3.4.

Donoho and Johnstone (1994b) also prove that the soft-thresholding estimator can be improved by choosing the threshold $\lambda_n$ more carefully. We will not cite that result here, because our focus is not so much on the constants. In fact, in Lemma 3.7, we will treat the soft-thresholding estimator again, using an indirect method. This gives worse constants, but the approach has the advantage that the method is applicable to other situations as well, and in particular does not rely on the assumption of normally distributed errors.

The hard- and soft-thresholding estimators are penalized M-estimators, as is shown in Lemmas 3.5 and 3.6. This point of view allows one to define hard- and soft-thresholding type estimators for other loss functions as well. The hard-thresholding estimator comes from a penalty on the number of non-zero coefficients, $\#\{\theta_j \neq 0\} = \sum_{j=1}^n |\theta_j|^0$, which we refer to as the $\ell_0$-penalty. The soft-thresholding case corresponds to a penalty on the $\ell_1$-norm $\sum_{j=1}^n |\theta_j|$, and we refer to it as the $\ell_1$-penalty.

Lemma 3.5. The hard-thresholding estimator $\hat\theta_n(\mathrm{hard})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2 \#\{\theta_j \neq 0\}.$$
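Proof. Let $\theta_{\cdot,n}$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + \lambda_n^2 \#\{\theta_j \neq 0\}.$$
We can carry out the minimization coefficient-by-coefficient. So (for $j = 1, \dots, n$), $\theta_{j,n}$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + \lambda_n^2\, 1\{\theta_j \neq 0\}.$$
Clearly, if $\theta_{j,n} \neq 0$, we must take it equal to $Y_j$, and then $S_j^2(\theta_{j,n}) = \lambda_n^2$. If $\theta_{j,n} = 0$, we have $S_j^2(\theta_{j,n}) = Y_j^2$. So
$$S_j^2(\theta_{j,n}) = \begin{cases} \lambda_n^2 & \text{if } \theta_{j,n} = Y_j, \\ Y_j^2 & \text{if } \theta_{j,n} = 0. \end{cases}$$
Since $S_j^2(\theta_j)$ is minimized over $\theta_j$, we have $S_j^2(\theta_{j,n}) = \min(\lambda_n^2, Y_j^2)$. Hence, $\theta_{j,n} = \hat\theta_{j,n}(\mathrm{hard})$.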
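The coefficient-wise argument of Lemma 3.5 can be illustrated with a minimal sketch. The function names and the random-competitor check are assumptions added for illustration; the check compares the hard-thresholding estimator with randomly drawn alternatives rather than proving optimality.

```python
import random


def hard_threshold(y, lam):
    """Keep y_j when |y_j| > lam, otherwise set the coefficient to zero."""
    return [yj if abs(yj) > lam else 0.0 for yj in y]


def l0_criterion(y, theta, lam):
    """Least squares loss plus lam^2 times the number of non-zero coefficients."""
    loss = sum((yj - tj) ** 2 for yj, tj in zip(y, theta))
    return loss + lam ** 2 * sum(1 for tj in theta if tj != 0.0)


def beats_random_competitors(y, lam, trials=500, seed=0):
    """True if no random competitor achieves a smaller penalized criterion."""
    rng = random.Random(seed)
    best = l0_criterion(y, hard_threshold(y, lam), lam)
    for _ in range(trials):
        # Each competitor coordinate is either zero or a perturbation of y_j.
        competitor = [0.0 if rng.random() < 0.5 else yj + rng.uniform(-1.0, 1.0)
                      for yj in y]
        if l0_criterion(y, competitor, lam) < best - 1e-12:
            return False
    return True
```

The minimal value of the criterion equals $\sum_j \min(\lambda_n^2, Y_j^2)$, as in the proof.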
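Lemma 3.6. The soft-thresholding estimator $\hat\theta_n(\mathrm{soft})$ minimizes
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j|.$$

Proof. Let $\theta_{\cdot,n}$ be the minimizer of
$$\sum_{j=1}^n (Y_j - \theta_j)^2 + 2\lambda_n \sum_{j=1}^n |\theta_j|.$$
The minimization can be done coefficient-by-coefficient, as in the previous lemma. So (for $j = 1, \dots, n$), $\theta_{j,n}$ minimizes
$$S_j^2(\theta_j) = (Y_j - \theta_j)^2 + 2\lambda_n |\theta_j|.$$
If $\theta_{j,n} > 0$, the function $S_j^2(\theta_j)$ is differentiable near $\theta_{j,n}$, with derivative $-2(Y_j - \lambda_n) + 2\theta_j$. Setting this derivative to zero shows that $\theta_{j,n} = Y_j - \lambda_n$. So we conclude that $\theta_{j,n} > 0$ can only happen when $Y_j - \lambda_n > 0$, and in that case $\theta_{j,n} = Y_j - \lambda_n$. Similarly when $\theta_{j,n} < 0$. When $\theta_{j,n} = 0$, we find $S_j^2(\theta_{j,n}) = Y_j^2$. Hence
$$S_j^2(\theta_{j,n}) = \begin{cases} 2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n} = Y_j - \lambda_n > 0, \\ -2\lambda_n Y_j - \lambda_n^2 & \text{if } \theta_{j,n} = Y_j + \lambda_n < 0, \\ Y_j^2 & \text{if } \theta_{j,n} = 0. \end{cases}$$
Obviously, we always have $\pm 2\lambda_n Y_j - \lambda_n^2 \le Y_j^2$, so $\theta_{j,n} = 0$ only when $|Y_j| \le \lambda_n$, that is, when neither non-zero candidate is admissible. In other words, $\theta_{j,n} = \hat\theta_{j,n}(\mathrm{soft})$.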
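The statement of Lemma 3.6 can likewise be illustrated coordinate by coordinate. In the sketch below the closed-form soft-thresholding rule is compared with a crude grid search over the one-dimensional criterion $(Y_j - \theta_j)^2 + 2\lambda_n|\theta_j|$; the grid, its resolution and the function names are assumptions chosen for illustration.

```python
import math


def soft_threshold(y, lam):
    """Shrink each y_j towards zero by lam; values with |y_j| <= lam become zero."""
    return [math.copysign(max(abs(yj) - lam, 0.0), yj) for yj in y]


def l1_criterion_1d(yj, tj, lam):
    """One-coordinate version of the l1-penalized least squares criterion."""
    return (yj - tj) ** 2 + 2.0 * lam * abs(tj)


def grid_minimizer(yj, lam, half_width=5.0, steps=20001):
    """Approximate minimizer of the one-coordinate criterion on a uniform grid."""
    grid = [-half_width + 2.0 * half_width * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda t: l1_criterion_1d(yj, t, lam))
```

Because the criterion is strictly convex in each coordinate, the grid minimizer lies within one grid step of the soft-thresholded value.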
Now, we will reprove Lemma 3.4, albeit with less economic constants. The idea is writing down a basic inequality, similar to (2.1) but now for the penalized case. The basic inequality with penalty takes the form
$$\|\hat\theta_n - \theta_0\|_n^2 \le -[\nu_n(\hat\theta_n) - \nu_n(\theta_*)]/\sqrt{n} + \|\theta_* - \theta_0\|_n^2 + \mathrm{pen}(\theta_*) - \mathrm{pen}(\hat\theta_n).$$
Here, $\theta_*$ is the best penalized approximation of $\theta_0$. It means that $\theta_*$ is defined as in Lemma 3.2 with an appropriate choice of the threshold $\lambda$. The empirical process takes the simple form
$$\nu_n(\theta) = -2\sqrt{n} \sum_{j=1}^n \epsilon_j \theta_j.$$
We will bound the increments by
$$|\nu_n(\theta) - \nu_n(\theta_*)| \le 2\sqrt{n} \sum_{j=1}^n |\theta_j - \theta_{j,*}| \max_{j=1,\dots,n} |\epsilon_j|.$$
Finally, we only consider the soft-thresholding estimator, that is, the penalty considered is the $\ell_1$-penalty
$$\mathrm{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j|.$$
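For the regularization parameter $\lambda_n$, a value proportional to $\sqrt{\sigma^2 \log n/n}$ can be taken (see Lemma 3.8). Lemma 3.7 then states that the estimator with $\ell_1$-penalty satisfies an oracle inequality, where the oracle concerns the $\ell_0$-penalty.

Lemma 3.7. Let $\hat\theta_n = \hat\theta_n(\mathrm{soft})$. Let $0 < \delta < 1$ be arbitrary. Set $V_n(\theta) = 16\lambda_n^2 \#\{\theta_j \neq 0\}/\delta$. On the set $\Xi_n = \{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$, we have
$$\|\hat\theta_n - \theta_0\|_n^2 \le \left( \frac{2+\delta}{2-\delta} \right) \min_\theta \left\{ V_n(\theta) + \|\theta - \theta_0\|_n^2 \right\}.$$

Proof. At first, let $\theta_*$ be arbitrary. Write
$$\mathrm{pen}(\theta) = 2\lambda_n \sum_{j=1}^n |\theta_j| = \mathrm{pen}_1(\theta) + \mathrm{pen}_2(\theta),$$
with
$$\mathrm{pen}_1(\theta) = 2\lambda_n \sum_{\theta_{j,*} \neq 0} |\theta_j|, \qquad \mathrm{pen}_2(\theta) = 2\lambda_n \sum_{\theta_{j,*} = 0} |\theta_j|.$$
We use the shorthand notation $N(\theta) = \#\{\theta_j \neq 0\}$.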
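Then, on $\Xi_n$,
$$\begin{aligned}
\|\hat\theta_n - \theta_0\|_n^2 &\le -[\nu_n(\hat\theta_n) - \nu_n(\theta_*)]/\sqrt{n} + \mathrm{pen}(\theta_*) - \mathrm{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&= 2\sum_{j=1}^n \epsilon_j(\hat\theta_{j,n} - \theta_{j,*}) + \mathrm{pen}(\theta_*) - \mathrm{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&\le 2\lambda_n \sum_{j=1}^n |\hat\theta_{j,n} - \theta_{j,*}| + \mathrm{pen}(\theta_*) - \mathrm{pen}(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&= \mathrm{pen}_1(\hat\theta_n - \theta_*) + \mathrm{pen}_2(\hat\theta_n) + \mathrm{pen}_1(\theta_*) - \mathrm{pen}_1(\hat\theta_n) - \mathrm{pen}_2(\hat\theta_n) + \|\theta_* - \theta_0\|_n^2 \\
&\le 2\,\mathrm{pen}_1(\hat\theta_n - \theta_*) + \|\theta_* - \theta_0\|_n^2 \\
&\le 4\lambda_n \sqrt{N(\theta_*)}\, \|\hat\theta_n - \theta_*\|_n + \|\theta_* - \theta_0\|_n^2 \\
&\le 4\lambda_n \sqrt{N(\theta_*)}\, \|\hat\theta_n - \theta_0\|_n + 4\lambda_n \sqrt{N(\theta_*)}\, \|\theta_* - \theta_0\|_n + \|\theta_* - \theta_0\|_n^2 .
\end{aligned}$$
Now, we proceed as in Lemma 2.1. Since for $a$ and $b$ non-negative, $\sqrt{ab} \le (a + b)/2$ (compare with the technical lemma at the end of Chapter 2), we have
$$4\lambda_n \sqrt{N(\theta_*)}\, \|\hat\theta_n - \theta_0\|_n \le \frac{\delta}{2} \|\hat\theta_n - \theta_0\|_n^2 + \frac{8\lambda_n^2 N(\theta_*)}{\delta}.$$
Here, we may also replace $\|\hat\theta_n - \theta_0\|_n$ by $\|\theta_* - \theta_0\|_n$. So we find on $\Xi_n$,
$$\|\hat\theta_n - \theta_0\|_n^2 \le \frac{\delta}{2} \|\hat\theta_n - \theta_0\|_n^2 + \frac{\delta}{2} \|\theta_* - \theta_0\|_n^2 + \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \|\theta_* - \theta_0\|_n^2,$$
or
$$\Bigl(1 - \frac{\delta}{2}\Bigr) \|\hat\theta_n - \theta_0\|_n^2 \le \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \Bigl(1 + \frac{\delta}{2}\Bigr) \|\theta_* - \theta_0\|_n^2 \le \Bigl(1 + \frac{\delta}{2}\Bigr) \left\{ \frac{16\lambda_n^2 N(\theta_*)}{\delta} + \|\theta_* - \theta_0\|_n^2 \right\}.$$
To conclude the proof, take
$$\theta_* = \arg\min_\theta \left\{ 16\lambda_n^2 N(\theta)/\delta + \|\theta - \theta_0\|_n^2 \right\}.$$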
We thus see that for $\lambda_n$ proportional to $\sqrt{\sigma^2 \log n/n}$, we arrive at the same oracle rates as in Lemmas 3.3 and 3.4, provided that $P(\max_j |\epsilon_j| > \lambda_n) \to 0$ for such a choice of $\lambda_n$. Indeed, this is shown to hold in Lemma 3.8.
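Lemma 3.7 is a deterministic statement on the event $\Xi_n$, so it can be checked on any noise vector with $\max_j |\epsilon_j| \le \lambda_n$. The sketch below draws bounded (uniform, hence non-Gaussian) noise purely to force that event; this choice, the example signal and the function names are assumptions for illustration. The minimum over $\theta$ is computed coordinate-wise as $\sum_j \min(16\lambda_n^2/\delta,\ \theta_{j,0}^2)$.

```python
import math
import random


def soft_threshold(y, lam):
    """Soft-thresholding estimator applied coordinate-wise."""
    return [math.copysign(max(abs(yj) - lam, 0.0), yj) for yj in y]


def oracle_bound(theta0, lam, delta):
    """(2 + delta)/(2 - delta) * min over theta of {V_n(theta) + ||theta - theta0||^2}."""
    inner = sum(min(16.0 * lam ** 2 / delta, t * t) for t in theta0)
    return (2.0 + delta) / (2.0 - delta) * inner


def loss_and_bound(theta0, lam, delta, seed):
    """Squared error of soft-thresholding and the bound of Lemma 3.7, on the event Xi_n."""
    rng = random.Random(seed)
    eps = [rng.uniform(-lam, lam) for _ in theta0]  # guarantees max |eps_j| <= lam
    y = [t + e for t, e in zip(theta0, eps)]
    estimate = soft_threshold(y, lam)
    loss = sum((a - b) ** 2 for a, b in zip(estimate, theta0))
    return loss, oracle_bound(theta0, lam, delta)
```

Coordinates with $\theta_{j,0} = 0$ contribute nothing to either side on $\Xi_n$, which is why the bound adapts to sparsity.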
3.6. A probability inequality for the empirical process

Recall that we proved Lemma 3.7 on the set $\{\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n\}$. In order to finish our oracle inequality, we need to show that for appropriate thresholds $\lambda_n$, this set has large probability.

Lemma 3.8. Let $Z$ be $N(0, 1)$-distributed. Then for all $a > 0$,
$$P(Z > a) \le \exp\bigl[-\tfrac{1}{2} a^2\bigr].$$
Moreover, if $Z_1, \dots, Z_N$ are $N(0, 1)$-distributed (and not necessarily independent), then for all $a \ge 2\sqrt{\log N}$,
$$P\Bigl(\max_{1 \le j \le N} Z_j > a\Bigr) \le \exp\bigl[-\tfrac{1}{4} a^2\bigr].$$

Proof. We have for all $a > 0$ and $\gamma > 0$, by Chebyshev's inequality,
$$P(Z > a) \le \frac{E \exp[\gamma Z]}{\exp[\gamma a]} = \exp\bigl[\tfrac{1}{2}\gamma^2 - \gamma a\bigr].$$
Take $\gamma = a$ to arrive at
$$P(Z > a) \le \exp\bigl[-\tfrac{1}{2} a^2\bigr].$$
To prove the second assertion of the lemma, we note that
$$P\Bigl(\max_{1 \le j \le N} Z_j > a\Bigr) \le N \exp\bigl[-\tfrac{1}{2} a^2\bigr].$$
Take $a \ge 2\sqrt{\log N}$ to get
$$P\Bigl(\max_{1 \le j \le N} Z_j > a\Bigr) \le \exp\bigl[-\tfrac{1}{4} a^2\bigr].$$

It clearly follows from Lemma 3.8 that if $\epsilon_1, \dots, \epsilon_n$ are independent $N(0, \sigma^2/n)$, we have for all $t \ge 2$,
$$P\Bigl(\max_{1 \le j \le n} |\epsilon_j| > t\sqrt{\sigma^2 \log n/n}\Bigr) \le 2\exp\Bigl[-\frac{t^2 \log n}{4}\Bigr].$$

Exercise 3.6. Let $Z_1, Z_2, \dots$ be independent $N(0, 1)$-distributed, and $\epsilon_j = \sigma Z_j/\sqrt{n}$. Take $\lambda_n = t\sqrt{\sigma^2 \log n/n}$, with $t > 2$. Verify that $\max_{1 \le j \le n} |\epsilon_j| \le \lambda_n$ almost surely for all $n$ sufficiently large.
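The two bounds of Lemma 3.8 can be compared with the exact Gaussian tail, which is available through the complementary error function. The sketch below is illustrative only; the function names are assumptions, and the union bound `N * gaussian_tail(a)` is used as an upper bound for the probability that the maximum exceeds `a`.

```python
import math


def gaussian_tail(a):
    """Exact P(Z > a) for a standard normal Z."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def single_bound(a):
    """Bound exp(-a^2 / 2) from the first part of Lemma 3.8."""
    return math.exp(-0.5 * a * a)


def union_bound(a, N):
    """Upper bound N * P(Z > a) for P(max_j Z_j > a); no independence is needed."""
    return N * gaussian_tail(a)


def max_bound(a):
    """Bound exp(-a^2 / 4), valid for a >= 2 * sqrt(log N)."""
    return math.exp(-0.25 * a * a)
```

Evaluating both sides shows that the exponential bounds are not tight, which is consistent with the remark that the constants are not the focus here.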
3.7. Bibliographical remarks

In Donoho and Johnstone (1994a), one can find minimax theory for $\ell_p$-spaces ($0 < p < \infty$). Donoho and Johnstone (1994b) also covers the relation with wavelet shrinkage: it explicitly takes the step from the original regression problem via wavelets to the sequence space formulation. Roughly speaking, when using an appropriate wavelet basis to expand a function $f_0$ with coefficients $\theta_0$, the roughness parameter $r$ corresponds to the effective smoothness $s$ of $f_0$ via the relation $r = 2/(2s + 1)$. For more details we refer to Donoho and Johnstone (1996) and wavelet theory (e.g., Edmunds and Triebel (1992), Härdle, Kerkyacharian, Picard and Tsybakov (1998)). The oracle terminology is from Donoho and Johnstone (1994b). Soft-thresholding is also studied in Donoho (1995). The $\ell_1$-penalty corresponding to soft-thresholding is called the LASSO in Tibshirani (1996). It is used for instance for regularization of least squares estimators in linear models with many co-variables, possibly with linear dependencies between the variables. We refer to the book of Hastie, Tibshirani and Friedman (2001). Indeed, linearly dependent systems can often provide better sparse approximations, for instance wedgelets (see Donoho (1999)) or curvelets (see Candès and Donoho (2004)), which are sparse systems for representing edges or other singularities. The similarity between the behavior of hard- and soft-thresholding type penalties is investigated in Donoho (2004a,b). There, it is shown that the $\ell_1$-penalized solution of an overdetermined least squares problem is often also (an approximation of) the solution of the $\ell_0$-penalized problem.
4. Overruling the variance

We revisit the regression problem of the previous chapter. One has observations $\{(x_i, Y_i)\}_{i=1}^n$, with $x_1, \dots, x_n$ fixed co-variables, and $Y_1, \dots, Y_n$ response variables, satisfying the regression
$$Y_i = f_0(x_i) + \epsilon_i, \quad i = 1, \dots, n,$$
where $\epsilon_1, \dots, \epsilon_n$ are independent and centered noise variables, and $f_0$ is an unknown function on $\mathcal{X}$. The errors are assumed to be $N(0, \sigma^2)$-distributed. Let $\bar{\mathcal{F}}$ be a collection of regression functions. The penalized least squares estimator is
$$\hat f_n = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \mathrm{pen}(f) \right\}.$$
Here $\mathrm{pen}(f)$ is a penalty on the complexity of the function $f$. Let $Q_n$ be the empirical distribution of $x_1, \dots, x_n$ and $\|\cdot\|_n$ be the $L_2(Q_n)$-norm. Define
$$f_* = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \|f - f_0\|_n^2 + \mathrm{pen}(f) \right\}.$$
Our aim is to show that
$$E\|\hat f_n - f_0\|_n^2 \le \mathrm{const.} \left\{ \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) \right\}. \tag{4.1}$$
When this aim is indeed reached, we loosely say that $\hat f_n$ satisfies an oracle inequality. In fact, what (4.1) says is that $\hat f_n$ behaves as the noiseless version $f_*$. That means, so to speak, that we "overruled" the variance of the noise. In Section 4.1, we put our objectives of this chapter in the framework of Chapter 2. In particular, we recall there the definitions of estimation and approximation error. Section 4.2 calculates the estimation error when one employs least squares estimation, without penalty, over a finite model class. The estimation error turns out to behave as the log-cardinality of the model class. Section 4.3 shows that when considering a collection of nested finite models, a penalty $\mathrm{pen}(f)$ proportional to the log-cardinality of the smallest class containing $f$ will indeed mimic the oracle over this collection of models. In Section 4.4, we consider general penalties. It turns out that the (local) entropy of the model classes plays a crucial role. Recall that in Chapter 3, the hard-thresholding estimator corresponds to a penalty proportional to the dimension (number of non-zero coefficients) of the model class. Indeed, the local entropy of a finite-dimensional space is proportional to its dimension. For a finite class, the entropy is (bounded by) its log-cardinality. Whether or not (4.1) holds true depends on the choice of the penalty. In Section 4.4, we show that when the penalty is taken "too small" there will appear an additional term showing that not all variance was "killed". Section 4.5 presents an example. Throughout this chapter, we assume the noise level $\sigma > 0$ to be known. In that case, by a rescaling argument, one can assume without loss of generality that $\sigma = 1$. In general, one needs a good estimate of an upper bound for $\sigma$, because the penalties considered in this chapter depend on the noise level. When one replaces the unknown noise level $\sigma$ by an estimated upper bound, the penalty in fact becomes data dependent.

4.1. Estimation and approximation error

Let $\mathcal{F}$ be a model class. Consider the least squares estimator without penalty
$$\hat f_n(\cdot, \mathcal{F}) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
The excess risk $\|\hat f_n(\cdot, \mathcal{F}) - f_0\|_n^2$ of this estimator is the sum of estimation error and approximation error. Now, if we have a collection of models $\{\mathcal{F}\}$, a penalty is usually some measure of the complexity of the model class $\mathcal{F}$. With some abuse of notation, write this penalty as $\mathrm{pen}(\mathcal{F})$. The corresponding penalty on the functions $f$ is then
$$\mathrm{pen}(f) = \min_{\mathcal{F} :\, f \in \mathcal{F}} \mathrm{pen}(\mathcal{F}).$$
We may then write
$$\hat f_n = \arg\min_{\hat f_n(\cdot, \mathcal{F}) :\, \mathcal{F} \in \{\mathcal{F}\}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - \hat f_n(x_i, \mathcal{F})|^2 + \mathrm{pen}(\mathcal{F}) \right\},$$
where $\hat f_n(\cdot, \mathcal{F})$ is the least squares estimator over $\mathcal{F}$. Similarly,
$$f_* = \arg\min_{f_*(\cdot, \mathcal{F}) :\, \mathcal{F} \in \{\mathcal{F}\}} \left\{ \|f_*(\cdot, \mathcal{F}) - f_0\|_n^2 + \mathrm{pen}(\mathcal{F}) \right\},$$
where $f_*(\cdot, \mathcal{F})$ is the best approximation of $f_0$ in the model $\mathcal{F}$. As we will see, taking $\mathrm{pen}(\mathcal{F})$ proportional to (an estimate of) a bound for the estimation error of $\hat f_n(\cdot, \mathcal{F})$ will (up to constants and possibly $(\log n)$-factors) balance estimation error and approximation error.

Exercise 4.1. Let $\{\psi_j\}_{j=1}^n$ be an orthonormal basis in $L_2(Q_n)$. Define $f_\theta = \sum_{j=1}^n \theta_j \psi_j$, where $\theta = (\theta_1, \dots, \theta_n)$ is a vector in $\mathbb{R}^n$. Check that
$$\|f_\theta\|_n^2 = \sum_{j=1}^n |\theta_j|^2 =: \|\theta\|_n^2.$$
Let $\bar{\mathcal{F}} = \{f_\theta : \theta \in \mathbb{R}^n\}$, and $\mathcal{F} \subset \bar{\mathcal{F}}$ be a linear subspace, so that, as in Exercise 2.4,
$$\|\hat f_n(\cdot, \mathcal{F}) - f_0\|_n^2 = \|\hat f_n(\cdot, \mathcal{F}) - f_*(\cdot, \mathcal{F})\|_n^2 + \|f_*(\cdot, \mathcal{F}) - f_0\|_n^2.$$
Write $\hat f_n(\cdot, \mathcal{F}) = f_{\hat\theta_n(\mathcal{F})}$, and likewise $f_* = f_{\theta_*(\mathcal{F})}$. Then by Pythagoras' rule, the excess risk at $\hat f_n(\cdot, \mathcal{F})$ can be written as
$$\|\hat\theta_n(\mathcal{F}) - \theta_0\|_n^2 = \|\hat\theta_n(\mathcal{F}) - \theta_*(\mathcal{F})\|_n^2 + \|\theta_*(\mathcal{F}) - \theta_0\|_n^2.$$
Consider now the hard-thresholding estimator $f_{\hat\theta_n}$ with $\hat\theta_n = \hat\theta_n(\mathrm{hard})$ defined in Section 3.5. Recall that in that case, $\mathrm{pen}(f_\theta) = \lambda_n^2 \times \#\{\theta_j \neq 0\}$, with $\lambda_n$ the threshold (see Lemma 3.5). Compare the oracle inequality (4.1) with the oracle inequality of Lemma 3.3.

Exercise 4.1 highlights again that the hard-thresholding estimator, which is based on an $\ell_0$-penalty overruling the variance, has oracle behavior. We remark here that the $\ell_1$-penalty is of a different nature, yet can still yield oracle behavior. Indeed, we will see in Chapter 5 that $\ell_1$-type penalties can lead to the right trade off between estimation error and approximation error. The $\ell_1$-penalty "kills" the variance in a rather implicit way. It is not based on (an estimate of) the estimation error. This is very useful in contexts other than penalized least squares, because generally, the estimation error can depend on the unknown underlying probability measure $P$ in a rather complicated way.

In this chapter, the empirical process takes the form
$$\nu_n(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \epsilon_i f(x_i),$$
with the function $f$ ranging over (some subclass of) $\bar{\mathcal{F}}$. Probability inequalities for the empirical process are derived using Lemma 3.8. The latter is for normally distributed random variables. It is exactly at this place where our assumption of normally distributed noise comes in. Relaxing the normality assumption is straightforward, provided a proper probability inequality, an inequality of sub-Gaussian
type, goes through. In fact, at the cost of additional, essentially technical, assumptions, an inequality of exponential type on the errors is sufficient as well (see van de Geer (2000)).

4.2. Finite models

Let $\mathcal{F}$ be a finite collection of functions, with cardinality $|\mathcal{F}| \ge 2$. Consider the least squares estimator over $\mathcal{F}$
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2.$$
In this section, $\mathcal{F}$ is fixed, and we do not explicitly express the dependency of $\hat f_n$ on $\mathcal{F}$. Define $f_*$ by
$$\|f_* - f_0\|_n = \min_{f \in \mathcal{F}} \|f - f_0\|_n.$$
The dependence of $f_*$ on $\mathcal{F}$ is also not expressed in the notation of this section. Alternatively stated, we take here
$$\mathrm{pen}(f) = \begin{cases} 0 & \forall\, f \in \mathcal{F}, \\ \infty & \forall\, f \in \bar{\mathcal{F}} \setminus \mathcal{F}. \end{cases}$$
The result of Lemma 4.1 below implies that the estimation error is proportional to $\log|\mathcal{F}|/n$, i.e., it is logarithmic in the number of elements in the parameter space. We present the result in terms of a probability inequality. An inequality for, e.g., the average excess risk follows from this (see Exercise 4.2).

Lemma 4.1. We have for all $t > 0$ and $0 < \delta < 1$,
$$P\left( \|\hat f_n - f_0\|_n^2 \ge \Bigl(\frac{1+\delta}{1-\delta}\Bigr) \left\{ \|f_* - f_0\|_n^2 + \frac{4\log|\mathcal{F}|}{n\delta} + \frac{4t^2}{\delta} \right\} \right) \le \exp[-nt^2].$$

Proof. We have the basic inequality
$$\|\hat f_n - f_0\|_n^2 \le \frac{2}{n} \sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f_*(x_i)) + \|f_* - f_0\|_n^2.$$
By Lemma 3.8, for all $t > 0$,
$$P\left( \max_{f \in \mathcal{F},\ \|f - f_*\|_n > 0} \frac{\frac{1}{n} \sum_{i=1}^n \epsilon_i (f(x_i) - f_*(x_i))}{\|f - f_*\|_n} > \bigl(2\log|\mathcal{F}|/n + 2t^2\bigr)^{1/2} \right) \le |\mathcal{F}| \exp[-(\log|\mathcal{F}| + nt^2)] = \exp[-nt^2].$$
If $\frac{1}{n} \sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f_*(x_i)) \le (2\log|\mathcal{F}|/n + 2t^2)^{1/2} \|\hat f_n - f_*\|_n$, we have, using $2\sqrt{ab} \le a + b$ for all non-negative $a$ and $b$,
$$\begin{aligned}
\|\hat f_n - f_0\|_n^2 &\le 2\bigl(2\log|\mathcal{F}|/n + 2t^2\bigr)^{1/2} \|\hat f_n - f_*\|_n + \|f_* - f_0\|_n^2 \\
&\le \delta \|\hat f_n - f_0\|_n^2 + 4\log|\mathcal{F}|/(n\delta) + 4t^2/\delta + (1 + \delta) \|f_* - f_0\|_n^2 .
\end{aligned}$$
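The basic inequality at the start of the proof of Lemma 4.1 holds for every realization of the noise, which makes it easy to check numerically. The following sketch does so for a small finite class; the representation of functions as vectors of values at the design points, the example class and the helper names are assumptions for illustration.

```python
import random


def sq_norm(u):
    """Squared L2(Q_n)-norm of a vector of function values: (1/n) * sum u_i^2."""
    return sum(v * v for v in u) / len(u)


def closest_index(target, models):
    """Index of the model closest to target in the L2(Q_n)-norm."""
    return min(range(len(models)),
               key=lambda k: sq_norm([t - f for t, f in zip(target, models[k])]))


def basic_inequality(models, f0, sigma=1.0, seed=0):
    """Return (lhs, rhs) of ||f_hat - f0||^2 <= (2/n) sum eps_i (f_hat - f_*)(x_i) + ||f_* - f0||^2."""
    rng = random.Random(seed)
    n = len(f0)
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]
    y = [f + e for f, e in zip(f0, eps)]
    f_hat = models[closest_index(y, models)]    # least squares estimator over the class
    f_star = models[closest_index(f0, models)]  # best approximation of f0 in the class
    cross = 2.0 * sum(e * (a - b) for e, a, b in zip(eps, f_hat, f_star)) / n
    lhs = sq_norm([a - b for a, b in zip(f_hat, f0)])
    rhs = cross + sq_norm([a - b for a, b in zip(f_star, f0)])
    return lhs, rhs
```

The random part of the bound is the cross term, which is exactly what Lemma 3.8 is used to control uniformly over the finite class.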
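Exercise 4.2. Using Lemma 4.1, and the formula
$$EZ = \int_0^\infty P(Z \ge t)\,dt$$
for a non-negative random variable $Z$, derive bounds for the average excess risk $E\|\hat f_n - f_0\|_n^2$.

4.3. Nested, finite models

Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be a collection of nested, finite models, and let $\bar{\mathcal{F}} = \cup_{m=1}^\infty \mathcal{F}_m$. We assume $\log|\mathcal{F}_1| > 1$. As indicated in Section 4.1, it is a good strategy to take the penalty proportional to the estimation error. In the present context, this works as follows. Define
$$\mathcal{F}(f) = \mathcal{F}_{m(f)}, \quad m(f) = \arg\min\{m : f \in \mathcal{F}_m\},$$
and for some $0 < \delta < 1$,
$$\mathrm{pen}(f) = \frac{16\log|\mathcal{F}(f)|}{n\delta}.$$
In coding theory, a similar penalty is used. When encoding a message using an encoder from $\mathcal{F}_m$, one needs to send, in addition to the encoded message, $\log_2|\mathcal{F}_m|$ bits to tell the receiver which encoder was applied. Let
$$\hat f_n = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \mathrm{pen}(f) \right\},$$
and
$$f_* = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \|f - f_0\|_n^2 + \mathrm{pen}(f) \right\}.$$

Lemma 4.2. We have, for all $t > 0$ and $0 < \delta < 1$,
$$P\left( \|\hat f_n - f_0\|_n^2 > \Bigl(\frac{1+\delta}{1-\delta}\Bigr) \left\{ \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) + 4t^2/\delta \right\} \right) \le \exp[-nt^2].$$

Proof. Write down the basic inequality
$$\|\hat f_n - f_0\|_n^2 + \mathrm{pen}(\hat f_n) \le \frac{2}{n} \sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f_*(x_i)) + \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*).$$
Define $\bar{\mathcal{F}}_j = \{f \in \bar{\mathcal{F}} : 2^j < \log|\mathcal{F}(f)| \le 2^{j+1}\}$, $j = 0, 1, \dots$. We have for all $t > 0$, using Lemma 3.8,
$$\begin{aligned}
&P\left( \exists\, f \in \bar{\mathcal{F}} : \frac{1}{n} \sum_{i=1}^n \epsilon_i (f(x_i) - f_*(x_i)) > \bigl(8\log|\mathcal{F}(f)|/n + 2t^2\bigr)^{1/2} \|f - f_*\|_n \right) \\
&\quad \le \sum_{j=0}^\infty P\left( \exists\, f \in \bar{\mathcal{F}}_j : \frac{1}{n} \sum_{i=1}^n \epsilon_i (f(x_i) - f_*(x_i)) > \bigl(2^{j+3}/n + 2t^2\bigr)^{1/2} \|f - f_*\|_n \right) \\
&\quad \le \sum_{j=0}^\infty \exp[2^{j+1} - (2^{j+2} + nt^2)] = \sum_{j=0}^\infty \exp[-(2^{j+1} + nt^2)] \\
&\quad \le \sum_{j=0}^\infty \exp[-(j + 1 + nt^2)] \le \int_0^\infty \exp[-(x + nt^2)]\,dx = \exp[-nt^2].
\end{aligned}$$
But if $\sum_{i=1}^n \epsilon_i (\hat f_n(x_i) - f_*(x_i))/n \le (8\log|\mathcal{F}(\hat f_n)|/n + 2t^2)^{1/2} \|\hat f_n - f_*\|_n$, the basic inequality gives
$$\begin{aligned}
\|\hat f_n - f_0\|_n^2 &\le 2\bigl(8\log|\mathcal{F}(\hat f_n)|/n + 2t^2\bigr)^{1/2} \|\hat f_n - f_*\|_n + \|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) \\
&\le \delta \|\hat f_n - f_0\|_n^2 + 16\log|\mathcal{F}(\hat f_n)|/(n\delta) - \mathrm{pen}(\hat f_n) + 4t^2/\delta + (1 + \delta)\|f_* - f_0\|_n^2 + \mathrm{pen}(f_*) \\
&= \delta \|\hat f_n - f_0\|_n^2 + 4t^2/\delta + (1 + \delta)\|f_* - f_0\|_n^2 + \mathrm{pen}(f_*),
\end{aligned}$$
by the definition of $\mathrm{pen}(f)$.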
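To make the penalty $16\log|\mathcal{F}(f)|/(n\delta)$ concrete, the sketch below builds a small nested family and selects the penalized least squares estimator. The family (cosines with a few amplitudes, ordered by frequency), the regression function and all names are assumptions for illustration. The returned pair checks the penalized basic inequality, which holds for every noise realization.

```python
import math
import random


def sq_norm(u):
    """Squared L2(Q_n)-norm of a vector of function values."""
    return sum(v * v for v in u) / len(u)


def diff(u, v):
    return [a - b for a, b in zip(u, v)]


def build_nested_models(x):
    """funcs[:sizes[m]] plays the role of F_{m+1}; log|F_1| = log 4 > 1."""
    funcs, sizes = [], []
    for freq in range(4):
        for amp in (-1.0, -0.5, 0.5, 1.0):
            funcs.append([amp * math.cos(math.pi * freq * xi) for xi in x])
        sizes.append(len(funcs))
    return funcs, sizes


def penalties(sizes, n, delta):
    """pen(f) = 16 log|F(f)| / (n delta), with F(f) the smallest model containing f."""
    pens, m = [], 0
    for k in range(sizes[-1]):
        while k >= sizes[m]:
            m += 1
        pens.append(16.0 * math.log(sizes[m]) / (n * delta))
    return pens


def penalized_index(target, funcs, pens):
    """Index minimizing ||target - f||_n^2 + pen(f) over the whole family."""
    return min(range(len(funcs)),
               key=lambda k: sq_norm(diff(target, funcs[k])) + pens[k])


def penalized_basic_inequality(n=200, delta=0.5, seed=0):
    rng = random.Random(seed)
    x = [(i + 0.5) / n for i in range(n)]
    f0 = [0.8 * math.cos(2.0 * math.pi * xi) for xi in x]
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [a + b for a, b in zip(f0, eps)]
    funcs, sizes = build_nested_models(x)
    pens = penalties(sizes, n, delta)
    k_hat = penalized_index(y, funcs, pens)    # penalized least squares estimator
    k_star = penalized_index(f0, funcs, pens)  # noiseless version f_*
    cross = 2.0 * sum(e * (a - b) for e, a, b in zip(eps, funcs[k_hat], funcs[k_star])) / n
    lhs = sq_norm(diff(funcs[k_hat], f0)) + pens[k_hat]
    rhs = cross + sq_norm(diff(funcs[k_star], f0)) + pens[k_star]
    return lhs, rhs
```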
4.4. General penalties

In the general case with possibly infinite model classes $\mathcal{F}$, we may replace the log-cardinality of a class by its entropy.

Definition. Let $u > 0$ be arbitrary and let $N(u, \mathcal{F}, \|\cdot\|_n)$ be the minimum number of balls with radius $u$ necessary to cover $\mathcal{F}$. Then $\{H(u, \mathcal{F}, \|\cdot\|_n) = \log N(u, \mathcal{F}, \|\cdot\|_n) : u > 0\}$ is called the entropy of $\mathcal{F}$ (for the metric induced by the norm $\|\cdot\|_n$).

Recall the definition of the estimator
$$\hat f_n = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \mathrm{pen}(f) \right\},$$
and of the noiseless version
$$f_* = \arg\min_{f \in \bar{\mathcal{F}}} \left\{ \|f - f_0\|_n^2 + \mathrm{pen}(f) \right\}.$$
We moreover define
$$\mathcal{F}(t) = \{f \in \bar{\mathcal{F}} : \|f - f_*\|_n^2 + \mathrm{pen}(f) \le t^2\}, \quad t > 0.$$
Consider the entropy $H(\cdot, \mathcal{F}(t), \|\cdot\|_n)$ of $\mathcal{F}(t)$. Suppose it is finite for each $t$, and in fact that the square root of the entropy is integrable, i.e., that for some continuous upper bound $\bar H(\cdot, \mathcal{F}(t), \|\cdot\|_n)$ of $H(\cdot, \mathcal{F}(t), \|\cdot\|_n)$ one has
$$\Psi(t) = \int_0^t \sqrt{\bar H(u, \mathcal{F}(t), \|\cdot\|_n)}\,du < \infty, \quad \forall\, t > 0. \tag{4.2}$$
This means that near $u = 0$, the entropy $H(u, \mathcal{F}(t), \|\cdot\|_n)$ is not allowed to grow faster than $1/u^2$. Assumption (4.2) is related to asymptotic continuity of the empirical process $\{\nu_n(f) : f \in \mathcal{F}(t)\}$. If (4.2) does not hold, one can still prove inequalities for the excess risk. To avoid digressions we will skip that issue here.
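As an illustration of the entropy integral in (4.2), the sketch below evaluates $\Psi(t)$ numerically for a polynomial entropy bound $\bar H(u) = A u^{-\alpha}$ and compares it with the closed form $\sqrt{A}\, t^{1 - \alpha/2}/(1 - \alpha/2)$, which is finite exactly when $\alpha < 2$. The polynomial form (taken independent of $t$ for simplicity), the constants and the function names are assumptions for this example.

```python
import math


def entropy_integral(h_bar, t, steps=100000):
    """Midpoint-rule approximation of the integral of sqrt(h_bar(u)) over (0, t)."""
    h = t / steps
    return sum(math.sqrt(h_bar((k + 0.5) * h)) for k in range(steps)) * h


def polynomial_entropy(A, alpha):
    """Illustrative entropy bound u -> A * u^(-alpha)."""
    return lambda u: A * u ** (-alpha)


def closed_form(A, alpha, t):
    """Exact value of the entropy integral for the polynomial bound, alpha < 2."""
    return math.sqrt(A) * t ** (1.0 - alpha / 2.0) / (1.0 - alpha / 2.0)
```

The midpoint rule avoids evaluating the integrand at $u = 0$, where it is unbounded; the approximation is therefore slightly too small but close for the step count used here.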
Lemma 4.3. Suppose that Ψ(t)/t2 does not increase as t increases. There exists constants c and c such that for √ 2 (4.3) ntn ≥ c (Ψ(tn ) ∨ tn ) , we have c E fˆn − f0 2n + pen(fˆn ) ≤ 2 f∗ − f0 2n + pen(f∗ ) + t2n + . n Lemma 4.3 is from van de Geer (2001). Comparing it to, e.g., Lemma 4.2, one sees that there is no arbitrary 0 < δ < 1 involved in the statement of Lemma 4.3. This is just because van de Geer (2001) has fixed δ at δ = 1/3 for simplicity. √ When Ψ(t)/t2 ≤ n/C for all t, for some constant C, condition (4.3) is fulfilled if tn ≥ cn−1/2 , and, in addition, C ≥ c. Thus, by choosing the penalty carefully, one can indeed ensure that the variance is overruled. 4.5. Application to the “classical” penalty of Chapter 1 We now come to an extension of Example 1.4. Suppose X = [0, 1]. Let F¯ be the class of functions on [0, 1] which have derivatives of all orders. The sth derivative of a function f ∈ F¯ on [0, 1] is denoted by f (s) . Define for a given 1 ≤ p < ∞, and given smoothness s ∈ {1, 2, . . .}, 1 p I (f ) = |f (s) (x)|p dx, f ∈ F¯ . 0
We consider two cases. In Subsection 4.5.1, we fix a smoothing parameter $\lambda > 0$ and take the penalty $\mathrm{pen}(f) = \lambda^2 I^p(f)$. After some calculations, we then show that in general the variance has not been "overruled", i.e., we do not arrive at an estimator that behaves as a noiseless version, because there still is an additional term. However, this additional term can now be "killed" by including it in the penalty. It all boils down in Subsection 4.5.2 to a data dependent choice for $\lambda$, or alternatively viewed, a penalty of the form $\mathrm{pen}(f) = \tilde\lambda^2 I^{\frac{2}{2s+1}}(f)$, with $\tilde\lambda > 0$ depending on $s$ and $n$. This penalty allows one to adapt to small values for $I(f_0)$.

4.5.1. Fixed smoothing parameter. For a function $f \in \bar{\mathcal{F}}$, we define the penalty $\mathrm{pen}(f) = \lambda^2 I^p(f)$, with a given $\lambda > 0$.

Lemma 4.4. The entropy integral $\Psi$ can be bounded by
$$\Psi(t) \le c_0 \Big( t^{\frac{2ps+2-p}{2ps}} \lambda^{-\frac{1}{ps}} + t \sqrt{\log\big(\tfrac{1}{\lambda} \vee 1\big)} \Big), \quad t > 0.$$
Here, $c_0$ is a constant depending on $s$ and $p$.
S. van de Geer
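Proof. This follows from the fact that
$$H_\infty\big(u, \{ f \in \bar{\mathcal{F}} : I(f) \le 1, \ |f| \le 1 \}\big) \le A u^{-1/s}, \quad u > 0,$$
where the constant $A$ depends on $s$ and $p$ (see Birman and Solomjak (1967)). Here, $H_\infty$ denotes the entropy for the sup-norm (see Section 6.6 for a definition). For $f \in \mathcal{F}(t)$, we have
$$I(f) \le \Big( \frac{t}{\lambda} \Big)^{\frac{2}{p}},$$
and
$$\|f - f_*\|_n \le t.$$
We therefore may write $f \in \mathcal{F}(t)$ as $f_1 + f_2$, with $|f_1| \le I(f_1) = I(f)$ and
$$\|f_2 - f_*\|_n \le t + I(f).$$
It is now not difficult to show that for some constant $C_1$,
$$H(u, \mathcal{F}(t), \|\cdot\|_n) \le C_1 \Big( \Big(\frac{t}{\lambda}\Big)^{\frac{2}{ps}} u^{-\frac{1}{s}} + \log\Big( \frac{t}{(\lambda \wedge 1) u} \Big) \Big), \quad 0 < u < t.$$

Corollary 4.5. By applying Lemma 4.3, we find that for some constant $c_1$,
$$E\big\{ \|\hat f_n - f_0\|_n^2 + \lambda^2 I^p(\hat f_n) \big\} \le 2 \min_f \big\{ \|f - f_0\|_n^2 + \lambda^2 I^p(f) \big\} + c_1 \Big( \Big( \frac{1}{n \lambda^{\frac{2}{ps}}} \Big)^{\frac{2ps}{2ps+p-2}} + \frac{\log(\frac{1}{\lambda} \vee 1)}{n} \Big).$$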
4.5.2. Overruling the variance in this case. For choosing the smoothing parameter $\lambda$, the above suggests the penalty
$$\mathrm{pen}(f) = \min_{\lambda} \Big\{ \lambda^2 I^p(f) + \Big( \frac{C_0'}{n \lambda^{\frac{2}{ps}}} \Big)^{\frac{2ps}{2ps+p-2}} \Big\},$$
with $C_0'$ a suitable constant. The minimization within this penalty yields
$$\mathrm{pen}(f) = C_0\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(f),$$
where $C_0$ depends on $C_0'$ and $s$.

From the computational point of view (in particular, when $p = 2$), it may be convenient to carry out the penalized least squares as in the previous subsection, for all values of $\lambda$, yielding the estimators
$$\hat f_n(\cdot, \lambda) = \arg\min_f \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - f(x_i)|^2 + \lambda^2 I^p(f) \Big\}.$$
Then the estimator with the penalty of this subsection is $\hat f_n(\cdot, \hat\lambda_n)$, where
$$\hat\lambda_n = \arg\min_{\lambda > 0} \Big\{ \frac{1}{n} \sum_{i=1}^n |Y_i - \hat f_n(x_i, \lambda)|^2 + \Big( \frac{C_0'}{n \lambda^{\frac{2}{ps}}} \Big)^{\frac{2ps}{2ps+p-2}} \Big\}.$$
From the same calculations as in the proof of Lemma 4.4, one arrives at the following corollary.
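The minimization over $\lambda$ can be checked numerically. The sketch below is an illustration only: the function names, the argument `I_p` (standing for the value $I^p(f) > 0$) and the default constant `C0_prime = 1` are assumptions made for the example. It evaluates the objective $\lambda^2 I^p(f) + (C_0'/(n\lambda^{2/(ps)}))^{2ps/(2ps+p-2)}$ at its stationary point, so that the scaling $n^{-2s/(2s+1)} I^{2/(2s+1)}(f)$ of the minimum can be verified.

```python
def objective(lam, I_p, n, p=2.0, s=1, C0_prime=1.0):
    """lambda**2 * I^p(f) plus the extra variance term of Subsection 4.5.2."""
    a = 2 * p * s / (2 * p * s + p - 2)
    return lam ** 2 * I_p + (C0_prime / (n * lam ** (2 / (p * s)))) ** a

def penalty_from_lambda_min(I_p, n, p=2.0, s=1, C0_prime=1.0):
    """Return (lambda_star, minimum) of the objective over lambda > 0.

    Writing the second term as K * lambda**(-b), the stationary point solves
    2 * lambda * I_p = b * K * lambda**(-b - 1).
    """
    a = 2 * p * s / (2 * p * s + p - 2)
    b = 2 * a / (p * s)
    K = (C0_prime / n) ** a
    lam = (b * K / (2 * I_p)) ** (1 / (b + 2))
    return lam, objective(lam, I_p, n, p, s, C0_prime)

if __name__ == "__main__":
    # For p = 2, s = 1 the minimum should scale like n**(-2/3).
    _, m1 = penalty_from_lambda_min(3.0, 100)
    _, m2 = penalty_from_lambda_min(3.0, 1600)
    print(m1 / m2, 16 ** (2 / 3))
```

Multiplying $n$ by 16 shrinks the minimized penalty by the factor $16^{2s/(2s+1)}$, and multiplying $I^p(f)$ by 8 (with $p = 2$, $s = 1$) doubles it, in line with the exponents stated above.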
Corollary 4.6. For an appropriate, large enough, choice of $C_0$ (or $C_0'$), depending on $c$, $p$ and $s$, we have for a constant $c_0'$ depending on $c$, $c'$, $C_0$ ($C_0'$), $p$ and $s$,
$$E\big\{ \|\hat f_n - f_0\|_n^2 + C_0\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(\hat f_n) \big\} \le 2 \min_f \big\{ \|f - f_0\|_n^2 + C_0\, n^{-\frac{2s}{2s+1}} I^{\frac{2}{2s+1}}(f) \big\} + \frac{c_0'}{n}.$$

Thus, the estimator adapts to small values of $I(f_0)$. For example, when $s = 1$ and $I(f_0) = 0$ (i.e., when $f_0$ is the constant function), the excess risk of the estimator converges with parametric rate $1/n$. If we knew that $f_0$ is constant, we would of course use the average $\sum_{i=1}^n Y_i/n$ as estimator. Thus, this penalized estimator mimics an oracle.

4.6. Bibliographical remarks

When $\{\mathcal{F}\}$ is a collection of finite-dimensional linear models, a classical penalty is the dimension of $\mathcal{F}$, properly normalized. This also has information-theoretic meaning as code length. More generally, one may think of taking the number of parameters in the penalty, or some other measure of "degrees of freedom". Important ideas go back to Akaike (1973) and Schwarz (1978). There is much literature on penalized least squares and other loss functions. We mention in particular Barron, Birgé and Massart (1999) for very general results.
5. The $\ell_1$-penalty

Generally, using almost exactly the same arguments, the results of the previous chapter for least squares estimation can be extended to other M-estimation procedures, provided the margin behavior is known. However, when the margin behavior is not known (i.e., when the parameters $c_2$ and especially $\kappa$ in the margin condition of Section 2.5 are not known), a simple extension is no longer possible. The reason is that the estimation error depends on this margin behavior. "Overruling the variance" is thus not straightforward, as this "variance" is not known. In this chapter, we propose an $\ell_1$-penalty (i.e., a soft-thresholding type penalty) to overcome the margin problem.

Let $\mathcal{F} \subset \bar{\mathcal{F}}$ be a class of functions on $\mathcal{X}$. We assume that each $f \in \mathcal{F}$ can be written as a linear combination
$$f(x) = \sum_{j=1}^m \theta_j \psi_j(x), \quad x \in \mathcal{X}.$$
Here, $\{\psi_j\}_{j=1}^m$ is a given system of $m$ functions on $\mathcal{X}$, and $\theta = (\theta_1, \ldots, \theta_m)$ is an (in whole or in part) unknown vector. We consider the situation where the number of parameters $m$ is large, possibly larger than the number of observations $n$.
The $\ell_1$-penalty on $f_\theta = \sum_{j=1}^m \theta_j \psi_j$ is
$$\mathrm{pen}(f_\theta) = \lambda_n \sum_{j=1}^m |\theta_j|, \tag{5.1}$$
with $\lambda_n$ a regularization parameter. We study M-estimation with this penalty. We will denote the $\ell_1$-norm of a vector $\theta \in \mathbb{R}^m$ by
$$I(\theta) = \sum_{j=1}^m |\theta_j|.$$
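As a small illustration of why the $\ell_1$-penalty is called a soft-thresholding type penalty, the following Python sketch computes the penalty (5.1) and the soft-thresholding operator. It relies on the standard one-dimensional fact that $\arg\min_\theta \{\tfrac12 (y - \theta)^2 + \lambda|\theta|\}$ is $y$ shrunk towards zero by $\lambda$; this squared-error setting and the function names are chosen for the example and are not the M-estimation setting of this chapter.

```python
def l1_penalty(theta, lam):
    """pen(f_theta) = lam * I(theta), with I(theta) the l1-norm of theta."""
    return lam * sum(abs(t) for t in theta)

def soft_threshold(y, lam):
    """Minimizer over theta of 0.5 * (y - theta)**2 + lam * |theta|."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0  # small observations are set exactly to zero

if __name__ == "__main__":
    print(l1_penalty([1.0, -2.0, 0.0], 0.5))
    print([soft_threshold(y, 0.5) for y in (2.0, -0.3, 0.7)])
```

Coefficients whose unpenalized value is below the threshold are set exactly to zero, which is what makes the number of nonzero coefficients a natural quantity in the bounds that follow.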
To simplify the exposition, we assume all coefficients $\theta_j$ are penalized. In practice, the first few coefficients are often left free (for instance the "constant term" or "constant + linear term", etc.).

In Section 5.1, robust estimation procedures in regression are studied, such as least absolute deviations. There, we consider the case of fixed design. Section 5.2 investigates exponential family approximations of a density. Section 5.3 considers the classification problem with random design, and we apply support vector machine loss.

We recall the notation of Chapter 2. The loss function considered is denoted by $\gamma_f$, $f \in \bar{\mathcal{F}}$. We write for each $f \in \bar{\mathcal{F}}$, $R_n(f) = P_n(\gamma_f)$ and $R(f) = E P_n(\gamma_f)$. We define
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \{ R_n(f) + \mathrm{pen}(f) \}, \tag{5.2}$$
and
$$f_0 = \arg\min_{f \in \bar{\mathcal{F}}} R(f).$$
Note that our notation in this chapter differs somewhat from the previous chapter, in the sense that in the definition of $\hat f_n$, we minimize over $f \in \mathcal{F}$, instead of $f \in \bar{\mathcal{F}}$. (It can be brought in line with that notation by taking $\mathrm{pen}(f) = \infty$ for $f \in \bar{\mathcal{F}} \setminus \mathcal{F}$.) The empirical process, indexed by some subset $\mathcal{F}_0$ of $\bar{\mathcal{F}}$, is
$$\{ \nu_n(f) = \sqrt{n}\,(P_n(\gamma_f) - P(\gamma_f)) : f \in \mathcal{F}_0 \}.$$
Throughout, we fix some (arbitrary) $f_* \in \mathcal{F}$. In the results (Lemmas 5.1, 5.2 and 5.4) one may choose $f_*$ as being an (almost) oracle. Because the exact definition of this (almost) oracle is somewhat involved, we leave the choice of $f_*$ open in our general exposition. One can see a particular choice explained after the statement of the result for general $f_*$ (see below Lemma 5.1). We use the notation
$$\Xi_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt{n}\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2] \big\}. \tag{5.3}$$
The set $\Xi_n$ will play exactly the same role as in Lemma 3.7.

Recall that in this chapter, $\mathrm{pen}(f_\theta) = \lambda_n I(\theta)$. We will choose the smoothing parameter $\lambda_n$ in such a way that (for all $f_* \in \mathcal{F}$) the probability of the set $\Xi_n$ is very large (tending to 1 as $n$ tends to $\infty$). Under the given conditions on the system $\{\psi_j\}$, this is true with the choice
$$\lambda_n = c \sqrt{\frac{\log m}{n}}.$$
But what is the value of $c$ here? Let
$$|f|_\infty = \sup_x |f(x)|.$$
In Sections 5.1 and 5.3, the constant $c$ depends on an assumed finite upper bound $K$ for $\sup_{f \in \mathcal{F}} |f|_\infty$ and, in Sections 5.2 and 5.3, on an upper bound for the density of $X$. To show that with such a choice for $c$, the set $\Xi_n$ has large probability, we need empirical process theory. Chapter 6 is devoted to that. In fact, Section 6.5 uses Theorem 6.2 and then presents all details for (robust) regression or classification with random design. In Section 5.1, we consider fixed design. One may then argue similarly, replacing Theorem 6.2 by Theorem 6.1. This is elaborated on in Loubes and van de Geer (2002) and van de Geer (2003). We remark here that in the latter two papers, the dependency of $c$ on $K$ is removed using convexity arguments. This however necessitates more subtle formulations. We skip the issue here to keep the exposition transparent.

The margin condition of Section 2.5 also appears in all three sections. It is a lower bound for the excess risk $R(f) - R(f_0)$, $f \in \mathcal{F}$, in terms of a metric $d$ on $\bar{\mathcal{F}}$. Let us repeat this margin condition here.

Margin condition. For some constants $c_2$ and $\kappa \ge 1$, we have for all $f \in \mathcal{F}$,
$$R(f) - R(f_0) \ge d^\kappa(f, f_0)/c_2.$$

The metric $d$ is taken as the one induced by the $L_2(Q_n)$-norm in Section 5.1 (robust regression with fixed design). In Section 5.2, $\bar{\mathcal{F}}$ is a class of densities with respect to a $\sigma$-finite measure $\mu$, and the metric $d$ will be the one induced by the $L_2(\mu)$-norm. Section 5.3 takes the metric $d$ induced by the $L_1(Q)$-norm. In general, for $\nu$ some measure on $\mathcal{X}$, and $1 \le p < \infty$, we denote the $L_p(\nu)$-norm by
$$\|f\|_{p,\nu} = \Big( \int |f|^p \, d\nu \Big)^{1/p}, \quad f \in L_p(\nu),$$
and we write
$$\|\cdot\| = \|\cdot\|_{2,Q}, \qquad \|\cdot\|_n = \|\cdot\|_{2,Q_n}.$$
The values of $\kappa$ and $c_2$ are not assumed to be known, although in Section 5.2 (density estimation), $\kappa$ is shown to be generally equal to 2. The latter means that in density estimation, we get oracle inequalities very similar to those for least squares estimators in regression. The situation is then completely analogous to the one of Lemma 3.7. The $\ell_1$-penalty allows one to adapt to other margin behavior as well (i.e., to the situation where the excess risk is bounded from below by some more general increasing function of the appropriate metric).
The proofs of the oracle inequalities all follow from one single argument, which was basically already used in Lemma 3.7. We give this argument in Section 5.4. It makes use of an assumed relation between the $\ell_1$-norm $I(\theta) = \sum_{j=1}^m |\theta_j|$ and the metric $d$ in the margin condition.

Condition on the metrics. For some constants $c_1$, $\alpha \ge 0$ and $0 \le \beta < \kappa$, we have for all $J \subset \{1, \ldots, m\}$,
$$\sum_{j \in J} |\theta_j - \tilde\theta_j| \le c_1 |J|^\alpha d^\beta(f_\theta, f_{\tilde\theta}), \quad \forall f_\theta, f_{\tilde\theta} \in \mathcal{F}.$$

In all of Sections 5.1, 5.2 and 5.3, the condition on the metrics is met with $\alpha = 1/2$. Sections 5.1 and 5.2 have $\beta = 1$, and Section 5.3 has $\beta = 1/2$.

5.1. Robust regression

Consider independent observations $Y_1, \ldots, Y_n$, satisfying, for $i = 1, \ldots, n$, the regression
$$Y_i = f_0(x_i) + \varepsilon_i,$$
where
$$f_0(x_i) = \arg\min_{c \in \mathbb{R}} E \bar\gamma(Y_i - c),$$
with $\bar\gamma : \mathbb{R} \to [0, \infty)$ a given loss function. The error is defined as $\varepsilon_i = Y_i - f_0(x_i)$, $i = 1, \ldots, n$. We examine the estimator
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n \bar\gamma(Y_i - f(x_i)) + \mathrm{pen}(f) \Big\},$$
where
$$\mathcal{F} \subset \{ f_\theta : \theta \in \mathbb{R}^m \},$$
with $f_\theta = \sum_{j=1}^m \theta_j \psi_j$, and where
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j|.$$
For technical reasons, we assume
$$\sup_{f \in \mathcal{F}} |f|_\infty \le K,$$
where $K$ is finite (possibly depending on $n$). We need this condition to handle the empirical process part of the problem.

Throughout this section, we assume that $\bar\gamma$ is Lipschitz:
$$|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2 \in \mathbb{R}.$$
This is related to a certain robustness of the estimator. We call the estimator $\hat f_n$ of this section the $\ell_1$-penalized robust regression estimator. The Lipschitz assumption also makes it possible to apply the contraction principle (see Section 6.3). This means there is enough machinery to obtain a good bound for the empirical process.
Example 5.1. Least absolute deviations. If
$$\bar\gamma(z) = |z|, \quad z \in \mathbb{R},$$
then $f_0(x_i)$ is the median of $Y_i$ (whenever it exists). Hence, in that case $\varepsilon_i$ has median zero. We call $\hat f_n$ the $\ell_1$-penalized least absolute deviations estimator.

Example 5.2. Quantile regression. More generally, if for a given $0 < \alpha < 1$,
$$\bar\gamma(z) = \alpha |z| \, \mathrm{l}\{z \ge 0\} + (1 - \alpha) |z| \, \mathrm{l}\{z < 0\}, \quad z \in \mathbb{R},$$
then $f_0(x_i)$ is the $\alpha$-quantile of the distribution of $Y_i$. The estimator $\hat f_n$ is then called the $\ell_1$-penalized quantile estimator.

As usual, we employ the notation
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \bar\gamma(Y_i - f(x_i)), \quad R(f) = E R_n(f),$$
$$\nu_n(f) = \sqrt{n}\,[R_n(f) - R(f)],$$
and
$$\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^n f^2(x_i).$$
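The two loss functions of Examples 5.1 and 5.2 can be explored with a short Python sketch. The weights `w_neg` and `w_pos` (applied to negative and nonnegative residuals) and the helper names are hypothetical, introduced only for this illustration. Minimizing the empirical risk over constants returns an empirical quantile of level `w_pos / (w_neg + w_pos)`, and the loss is Lipschitz with constant 1 whenever both weights are at most 1.

```python
def check_loss(z, w_neg, w_pos):
    """Piecewise-linear loss: w_neg * |z| for z < 0, w_pos * |z| for z >= 0."""
    return w_neg * abs(z) if z < 0 else w_pos * abs(z)

def empirical_risk_constant(ys, c, w_neg, w_pos):
    """R_n of the constant function c: average loss of the residuals y - c."""
    return sum(check_loss(y - c, w_neg, w_pos) for y in ys) / len(ys)

def best_constant(ys, w_neg, w_pos):
    """Minimize the risk over constants; a piecewise-linear convex function
    attains its minimum at one of the data points, so we search only those."""
    return min(sorted(ys), key=lambda c: empirical_risk_constant(ys, c, w_neg, w_pos))

if __name__ == "__main__":
    print(best_constant([1, 2, 3, 4, 100], 0.5, 0.5))      # least absolute deviations
    print(best_constant(list(range(1, 12)), 0.1, 0.9))     # a high quantile
```

Equal weights reproduce least absolute deviations (up to the factor 1/2), and the outlier 100 in the first sample does not move the fitted constant, which illustrates the robustness mentioned above.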
Margin condition. For some constants $c_2$ and $\kappa > 1$ and for all $f \in \mathcal{F}$,
$$R(f) - R(f_0) \ge \|f - f_0\|_n^\kappa / c_2.$$
The value of $\kappa$ and $c_2$ may depend on the density of the errors $\varepsilon_i$, $i = 1, \ldots, n$. Typically $\kappa = 2$, as is shown in Exercise 5.1.

Exercise 5.1. Consider least absolute deviations estimation, i.e., the case $\bar\gamma(z) = |z|$. Suppose that $\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. and that their common distribution has density $g$ with respect to Lebesgue measure. Suppose also that $g(u) > t > 0$ for all $|u| \le K$. Show that when also $|f_0| \le K$, the margin condition holds with $\kappa = 2$ and $c_2 = 2/t$.

Positive eigenvalue condition on the system $\{\psi_j\}$. Define
$$\Sigma_n = \frac{1}{n} \sum_{i=1}^n \psi(x_i) \psi(x_i)^\top,$$
with $\psi(x_i) = (\psi_1(x_i), \ldots, \psi_m(x_i))^\top$, $i = 1, \ldots, n$. Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma_n$. Assume that for some constant $0 < c_1 < \infty$,
$$\lambda_{\min} \ge 1/c_1.$$
(Note that this condition on the system $\{\psi_j\}_{j=1}^m$ implies $m \le n$.)

Let for $f = \sum_{j=1}^m \theta_j \psi_j$,
$$N(f) = \#\{\theta_j \ne 0\}.$$
Define for $0 < \delta < 1$,
$$V_n(f) = 2 (c_2/\delta)^{\frac{1}{\kappa-1}} \big( 4 c_1^2 \lambda_n^2 N(f) \big)^{\frac{\kappa}{2(\kappa-1)}}.$$
We will show in Lemma 5.1 that on the set
$$\Xi_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt{n}\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2] \big\},$$
the excess risk of the estimator $\hat f_n$ satisfies an inequality involving the estimation error $V_n(f_*)$ and approximation error $R(f_*) - R(f_0)$.

Lemma 5.1. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on the set $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big( \frac{1+\delta}{1-\delta} \Big) \big\{ V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2 \big\}.$$

Proof. Use the positive eigenvalue condition on the system $\{\psi_j\}$ to see that the condition on the metrics holds with $d(f, \tilde f) = \|f - \tilde f\|_n$, $\alpha = 1/2$, and $\beta = 1$. Now, invoke the margin condition and apply Lemma 5.5.

In Lemma 5.1, we now choose
$$f_* = \arg\min_{f \in \mathcal{F}} \{ V_n(f) + R(f) - R(f_0) \}$$
(where we assume for simplicity that the minimum is attained). This choice balances the estimation error $V_n(f_*)$ and approximation error $R(f_*) - R(f_0)$, so that one arrives at an oracle inequality on $\Xi_n$.

It can be shown that for any $f_*$, the set $\Xi_n$ has large probability for the choice $\lambda_n = c \sqrt{\log n / n}$, with $c$ a large enough constant depending only on $K$. This can be done using the tools from Chapter 6 (the concentration inequality of Theorem 6.1, symmetrization (Theorem 6.3), the contraction principle (Theorem 6.4), and the peeling device described in Section 6.4). The details are in Loubes and van de Geer (2002) and van de Geer (2003).

Thus, a good choice for $\lambda_n$ does not require knowledge of the constants $\kappa$ or $c_2$ appearing in the margin condition. Since the estimation error $V_n(f_*)$ does depend on these constants, one may conclude that the $\ell_1$-penalty yields a trade off between estimation error and approximation error, without (directly) estimating the estimation error.

Exercise 5.2. Suppose that for all integers $N$,
$$\min_{f \in \mathcal{F},\, N(f) \le N} [R(f) - R(f_0)] \le c_3 N^{-\kappa s}.$$
Here $s > 0$ is a "smoothness parameter". Verify that the trade off between estimation error and approximation error gives
$$R(\hat f_n) - R(f_0) \le c_4 \lambda_n^{\frac{2\kappa s}{2(\kappa-1)s+1}}.$$
The constant $c_4$ is determined by the values of $c_1$, $c_2$ and $c_3$ and those of $\delta$, $\kappa$ and $s$.
5.2. Density estimation

Let $X_1, \ldots, X_n$ be i.i.d., with distribution $P$ on $\mathcal{X}$. Let $p_0 = dP/d\mu$ be the density with respect to some given $\sigma$-finite measure $\mu$. Define
$$f_0 = \log p_0 - \int \log p_0 \, d\mu.$$
Take
$$\bar{\mathcal{F}} = \Big\{ f : \int f \, d\mu = 0, \ \int e^f \, d\mu < \infty \Big\}.$$
Define for $f \in \bar{\mathcal{F}}$,
$$p_f = \exp[f - b(f)], \quad b(f) = \log \int e^f \, d\mu.$$
Then for each $f \in \bar{\mathcal{F}}$, the function $p_f$ is a density with respect to $\mu$. We examine the penalized maximum likelihood estimator $\hat f_n$ over the class $\mathcal{F} \subset \bar{\mathcal{F}}$:
$$\hat f_n = \arg\max_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n \log p_f(X_i) - \mathrm{pen}(f) \Big\}.$$
This means we take the loss function
$$\gamma_f(x) = -f(x) + b(f).$$
The empirical risk is
$$R_n(f) = -P_n(f) + b(f),$$
and the theoretical counterpart is
$$R(f) = -P(f) + b(f).$$

Exercise 5.3. Verify that
$$R(f) - R(f_0) = K(p_f | p_{f_0}),$$
where $K(p | p_0)$ is the Kullback-Leibler information between the densities $p$ and $p_0$:
$$K(p | p_0) = \int \log(p_0/p)\, p_0 \, d\mu.$$

As stated in the beginning of this chapter, we assume that each $f \in \mathcal{F}$ can be written in the form
$$f = \sum_{j=1}^m \theta_j \psi_j,$$
for some coefficients $\theta_j$, $j = 1, \ldots, m$. In this section this means we are considering an $m$-parameter exponential family. To have an identifiable representation, the functions $\psi_j$ can be assumed to be centered with respect to $\mu$:
$$\int \psi_j \, d\mu = 0, \quad j = 1, \ldots, m.$$
The penalty is again taken to be the $\ell_1$-penalty
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j|.$$
Define now
$$\|f\|_{2,\mu}^2 = \int f^2 \, d\mu.$$

Margin condition. For some $\kappa > 1$ and $c_2$, and for all $f \in \mathcal{F}$,
$$K(p_f | p_0) \ge \|f - f_0\|_{2,\mu}^\kappa / c_2.$$
In fact, the value $\kappa = 2$ is quite natural in this case, as can be seen in the next exercise.

Exercise 5.4. Let $\{p_f : f \in \mathcal{F}\}$ be a class of densities, with respect to Lebesgue measure on $[0, 1]$, and suppose that $p_0 \equiv 1$ is the density of the uniform distribution. Note that $f_0 = 0$ and $b(f_0) = 0$ in this case. Suppose that $\sup_{f \in \mathcal{F}} |f|_\infty \le K$. Show that
$$R(f) - R(f_0) = \int f^2 p_1 \, d\mu - \Big[ \int f p_1 \, d\mu \Big]^2,$$
where $p_1 = \exp[f_1 - b(f_1)]$, and $|f_1| \le |f|$. Moreover
$$\int f p_1 \, d\mu = \int f f_1 p_2 \, d\mu - \int f p_2 \, d\mu \int f_1 p_2 \, d\mu,$$
where $p_2 = \exp[f_2 - b(f_2)]$, and $|f_2| \le |f_1|$. Conclude that
$$R(f) - R(f_0) \ge e^{-2K} \int f^2 \, d\mu - 2 e^{4K} \Big( \int f^2 \, d\mu \Big)^2.$$
The margin condition is thus met with $\kappa = 2$ when $\mathcal{F}$ satisfies the requirement that $\|f\|_{2,\mu}$ is sufficiently small for all $f \in \mathcal{F}$. (Convexity arguments can remove this requirement on $\mathcal{F}$.)

Positive eigenvalue condition on the system $\{\psi_j\}$. Let
$$\Sigma = \int \psi \psi^\top \, d\mu.$$
Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that for some constant $0 < c_1 < \infty$,
$$\lambda_{\min} \ge 1/c_1.$$

Now, we proceed exactly as in the previous section. Let for $f = \sum_{j=1}^m \theta_j \psi_j$,
$$N(f) = \#\{\theta_j \ne 0\},$$
and for $0 < \delta < 1$,
$$V_n(f) = 2 (c_2/\delta)^{\frac{1}{\kappa-1}} \big( 4 c_1^2 \lambda_n^2 N(f) \big)^{\frac{\kappa}{2(\kappa-1)}}.$$
Moreover, let $f_*$ be any function in $\mathcal{F}$. Recall the set
$$\Xi_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt{n}\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2] \big\}.$$
Lemma 5.2. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big( \frac{1+\delta}{1-\delta} \Big) \big\{ V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2 \big\}.$$

Proof. This is again a straightforward application of Lemma 5.5, as in the proof of Lemma 5.1.

It is in this case easy to handle the set $\Xi_n$. Note that
$$-\big\{ [R_n(f_\theta) - R(f_\theta)] - [R_n(f_{\theta_*}) - R(f_{\theta_*})] \big\} = P_n(f_\theta - f_{\theta_*}) - P(f_\theta - f_{\theta_*}) \le \max_j |P_n(\psi_j) - P(\psi_j)| \, I(\theta - \theta_*).$$

Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1, \ldots, m$,
$$|\psi_j|_\infty \le \sqrt{\frac{n}{\log m}}$$
(where we recall the notation $|\psi_j|_\infty = \sup_x |\psi_j(x)|$), and
$$\int |\psi_j|^2 \, d\mu \le 1.$$
Also assume that $p_0 \le c_0$, where $c_0 \ge 1$ is given.

Lemma 5.3. Suppose the normalization condition on $\{\psi_j\}$. Then for all $t \ge 1$,
$$P\Big( \max_j |P_n(\psi_j) - P(\psi_j)| \ge 3 t c_0 \sqrt{\frac{\log m}{n}} \Big) \le 2 \exp\Big[ -\frac{2}{7} t c_0 \log m \Big].$$

Proof. This follows from Bernstein's inequality (Bernstein (1924), Bennett (1962)), which says that for each $a > 0$,
$$P\big( |P_n(\psi_j) - P(\psi_j)| > a \big) \le 2 \exp\Big[ -\frac{n a^2}{2 a |\psi_j|_\infty + \sigma_j^2} \Big],$$
where $\sigma_j^2 = P|\psi_j|^2 - |P\psi_j|^2$.

In view of Lemma 5.3, we now know that for sequences of sets $\Xi_n$ with $\lambda_n = 3 c_0 \sqrt{\log m / n}$, with $m = m_n \to \infty$ as $n \to \infty$, we have
$$\lim_{n \to \infty} P(\Xi_n) = 1.$$
5.3. Binary classification

Let $(X, Y)$ be random variables, with $X \in \mathcal{X}$ a feature and $Y \in \{-1, +1\}$ a label. A classifier is a function $f : \mathcal{X} \to \mathbb{R}$. Using the classifier $f$, we predict the label $+1$ when $f(X) \ge 0$, and the label $-1$ when $f(X) < 0$. Thus, a classification error occurs when $Y f(X) < 0$. We regard $(X, Y)$ as random variables with distribution $P$, and denote the distribution of $X$ by $Q$. Moreover, we write
$$\eta_0(x) = P(Y = 1 | X = x), \quad x \in \mathcal{X}.$$
Bayes rule is the classifier
$$f_0 = \begin{cases} +1 & \text{if } \eta_0 \ge 1/2, \\ -1 & \text{if } \eta_0 < 1/2. \end{cases}$$
In Exercise 2.3, we have seen that $f_0$ minimizes the probability of a classification error.

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be observed i.i.d. copies of $(X, Y)$. These observations are called the training set. Let $\mathcal{F}$ be a collection of classifiers. In empirical risk minimization (see Vapnik (1995, 1998)), one chooses the classifier in $\mathcal{F}$ that has the smallest number of classification errors. However, if $\mathcal{F}$ is a rich set, this classifier will be hard to compute. Indeed, we will again consider a very high-dimensional class $\mathcal{F}$ in this section.

We will use support vector machine (SVM) loss instead of the number of misclassifications, to overcome computational problems. The empirical SVM loss function is
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \bar\gamma(Y_i f(X_i)),$$
where $\bar\gamma(z) = (1 - z)_+$ is the hinge function, with $z_+$ denoting the positive part of $z \in \mathbb{R}$. Thus, our loss function $\gamma_f$ is now $\gamma_f(x, y) = (1 - y f(x))_+$.

The $\ell_1$-penalized SVM loss estimator $\hat f_n$ is
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n} \sum_{i=1}^n (1 - Y_i f(X_i))_+ + \mathrm{pen}(f) \Big\},$$
where $\mathcal{F} \subset \{ f_\theta : \theta \in \mathbb{R}^m \}$, and
$$\mathrm{pen}(f_\theta) = \lambda_n I(\theta), \quad I(\theta) = \sum_{j=1}^m |\theta_j|.$$
As in Section 5.1, we need to assume, for technical reasons, that
$$\sup_{f \in \mathcal{F}} |f|_\infty \le K,$$
where $K$ is a given finite constant.
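The objective that the $\ell_1$-penalized SVM loss estimator minimizes can be written out in a few lines of Python. This is a sketch of the criterion only, not of an optimization algorithm; the function names and the representation of the system $\{\psi_j\}$ as a list of Python callables are assumptions made for the example.

```python
def hinge(z):
    """Hinge function (1 - z)_+."""
    return max(0.0, 1.0 - z)

def f_theta(theta, psi, x):
    """Linear combination sum_j theta_j * psi_j(x)."""
    return sum(t * p(x) for t, p in zip(theta, psi))

def penalized_svm_risk(theta, psi, xs, ys, lam):
    """Empirical SVM loss R_n(f_theta) plus the l1-penalty lam * I(theta)."""
    n = len(xs)
    emp = sum(hinge(y * f_theta(theta, psi, x)) for x, y in zip(xs, ys)) / n
    return emp + lam * sum(abs(t) for t in theta)

if __name__ == "__main__":
    psi = [lambda x: 1.0, lambda x: x]        # hypothetical system: constant and linear term
    xs, ys = [-1.0, 1.0], [-1, 1]
    print(penalized_svm_risk([0.0, 1.0], psi, xs, ys, lam=0.1))
    print(penalized_svm_risk([0.0, 0.0], psi, xs, ys, lam=0.1))
```

In the toy data the classifier $f(x) = x$ has margin $Y_i f(X_i) = 1$ at both points, so its hinge loss vanishes and only the penalty remains, whereas the zero function pays hinge loss 1 at every observation.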
Define the theoretical SVM loss
$$R(f) = E R_n(f) = \int (1 - y f(x))_+ \, dP(x, y).$$

Exercise 5.5. Verify that SVM loss is consistent, in the sense that Bayes rule $f_0$ satisfies
$$f_0 = \arg\min_{\text{all } f} R(f).$$
To handle the empirical process, note first that $\bar\gamma(z) = (1 - z)_+$ is Lipschitz. This allows us to again apply the contraction principle of Section 6.3. It means that we can invoke the tools of Chapter 6 (in particular, the concentration inequality of Theorem 6.2, symmetrization, the contraction principle and the peeling device) to handle the empirical process part of the problem.

Margin condition. For some constants $\kappa \ge 1$ and $c_2$, we have for all $f \in \mathcal{F}$,
$$R(f) - R(f_0) \ge \|f - f_0\|_{1,Q}^\kappa / c_2.$$
Note that we assumed the margin condition with an $L_1$-norm, instead of the $L_2$-norm. It turns out that under some mild assumptions on $\eta_0$ and $Q$, the margin condition is indeed met with the $L_1(Q)$-norm.

Positive eigenvalue condition on the system $\{\psi_j\}$. Let
$$\Sigma = \int \psi \psi^\top \, dQ.$$
Let $\lambda_{\min}^2$ be the smallest eigenvalue of $\Sigma$. Assume that $\lambda_{\min} > 0$.

The rest is as in the previous two sections, albeit that the condition on the metrics is now met with the value $\beta = 1/2$ (instead of the value $\beta = 1$ of Sections 5.1 and 5.2). Let for $f = \sum_{j=1}^m \theta_j \psi_j$,
$$N(f) = \#\{\theta_j \ne 0\},$$
and for $0 < \delta < 1$,
$$V_n(f) = 2 (c_2/\delta)^{\frac{1}{2\kappa-1}} \big( 8 K \lambda_n^2 N(f) / \lambda_{\min}^2 \big)^{\frac{\kappa}{2\kappa-1}}.$$
Moreover, let $f_*$ be any function in $\mathcal{F}$. Recall the set
$$\Xi_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt{n}\,[\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2] \big\}.$$
Assuming a proper normalization condition on the system $\{\psi_j\}$ (see Section 6.5), this set has large probability for the choice $\lambda_n = c \sqrt{\log n / n}$, with $c$ a suitable constant (see the conclusion at the end of Section 6.5).

Lemma 5.4. Assume the margin condition, and the positive eigenvalue condition on the system $\{\psi_j\}$. Then on $\Xi_n$ we have
$$R(\hat f_n) - R(f_0) \le \Big( \frac{1+\delta}{1-\delta} \Big) \big\{ V_n(f_*) + R(f_*) - R(f_0) + \lambda_n^2 \big\}.$$
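Proof. This follows from the fact that for $f_\theta \in \mathcal{F}$ and $f_{\tilde\theta} \in \mathcal{F}$,
$$\sum_{j \in J} |\theta_j - \tilde\theta_j| \le |J|^{1/2} \Big( \sum_{j=1}^m |\theta_j - \tilde\theta_j|^2 \Big)^{1/2} \le |J|^{1/2} \|f_\theta - f_{\tilde\theta}\|_{2,Q} / \lambda_{\min} \le |J|^{1/2} \sqrt{2K}\, \|f_\theta - f_{\tilde\theta}\|_{1,Q}^{1/2} / \lambda_{\min}.$$
So the condition on the metrics holds with
$$d(f, \tilde f) = \|f - \tilde f\|_{1,Q},$$
and with $\alpha = \beta = 1/2$ and $c_1 = \sqrt{2K}/\lambda_{\min}$. Apply Lemma 5.5.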
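5.4. The behavior on $\Xi_n$

We use straightforward calculus to show the inequality for the excess risk on $\Xi_n$. The arguments follow exactly those in the proof of Lemma 3.7.

Lemma 5.5. Assume the margin condition and the condition on the metrics, formulated in the beginning of this chapter. Then on the set $\Xi_n$, we have for any $0 < \delta < 1$,
$$R(\hat f_n) - R(f_0) \le \Big( \frac{1+\delta}{1-\delta} \Big) \big\{ V_n + R(f_*) - R(f_0) + \lambda_n^2 \big\},$$
where
$$V_n = 2 (c_2/\delta)^{\frac{\beta}{\kappa-\beta}} (2 c_1)^{\frac{\kappa}{\kappa-\beta}} (|J_*|^\alpha \lambda_n)^{\frac{\kappa}{\kappa-\beta}},$$
with $J_* = \{ j : \theta_{j,*} \ne 0 \}$.

Proof. The basic inequality says that
$$R(\hat f_n) - R(f_0) \le -[\nu_n(\hat f_n) - \nu_n(f_*)]/\sqrt{n} + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) + R(f_*) - R(f_0),$$
so that on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le \mathrm{pen}(\hat f_n - f_*) + \lambda_n^2 + \mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) + R(f_*) - R(f_0).$$
Now, use the same arguments as in the proof of Lemma 3.7. Let
$$\mathrm{pen}_1(f_\theta) = \lambda_n \sum_{j \in J_*} |\theta_j|,$$
and
$$\mathrm{pen}_2(f_\theta) = \lambda_n \sum_{j \notin J_*} |\theta_j|.$$
Then $\mathrm{pen}(f_\theta) = \mathrm{pen}_1(f_\theta) + \mathrm{pen}_2(f_\theta)$, and one easily sees
$$\mathrm{pen}(\hat f_n - f_*) = \mathrm{pen}_1(\hat f_n - f_*) + \mathrm{pen}_2(\hat f_n)$$
and
$$\mathrm{pen}(f_*) - \mathrm{pen}(\hat f_n) \le \mathrm{pen}_1(\hat f_n - f_*) - \mathrm{pen}_2(\hat f_n).$$
So on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2\, \mathrm{pen}_1(\hat f_n - f_*) + \lambda_n^2 + R(f_*) - R(f_0).$$
By the condition on the metrics, on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2 c_1 \lambda_n |J_*|^\alpha d^\beta(\hat f_n, f_*) + \lambda_n^2 + R(f_*) - R(f_0).$$
After application of the triangle inequality
$$d(\hat f_n, f_*) \le d(\hat f_n, f_0) + d(f_*, f_0),$$
the margin condition gives us that on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2 c_1 \lambda_n |J_*|^\alpha c_2^{\frac{\beta}{\kappa}} \big( R(\hat f_n) - R(f_0) \big)^{\frac{\beta}{\kappa}} + 2 c_1 \lambda_n |J_*|^\alpha c_2^{\frac{\beta}{\kappa}} \big( R(f_*) - R(f_0) \big)^{\frac{\beta}{\kappa}} + \lambda_n^2 + R(f_*) - R(f_0).$$
Next, invoke the Technical Lemma of Chapter 2. Then, on $\Xi_n$,
$$R(\hat f_n) - R(f_0) \le 2 (c_2/\delta)^{\frac{\beta}{\kappa-\beta}} (2 c_1)^{\frac{\kappa}{\kappa-\beta}} (|J_*|^\alpha \lambda_n)^{\frac{\kappa}{\kappa-\beta}} + \delta \big( R(\hat f_n) - R(f_0) \big) + \lambda_n^2 + (1 + \delta) \big( R(f_*) - R(f_0) \big).$$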
5.5. Bibliographical remarks

For quantile regression, we refer to Koenker and Bassett (1978) and, in the nonparametric case, Koenker, Ng and Portnoy (1992, 1994) and Portnoy (1997). A nice comparison of least squares and least absolute deviations is in Portnoy and Koenker (1997). The result of Section 5.1 is from Loubes and van de Geer (2002) and van de Geer (2003). There, also convexity is used to show that one may take $\lambda_n$ not depending on the bound $K$ for the functions $f$ in $\mathcal{F}$. See also van de Geer (2002) for convexity arguments in density estimation (and high-dimensional exponential families).

A good reference for classification is the book by Devroye, Györfi and Lugosi (1996). Also the book of Hastie, Tibshirani and Friedman (2001) has classification as one of its subjects, including SVM's, but also other methods and algorithms. Adaptive results for empirical risk minimizers in classification, using for instance Rademacher complexities, are in, e.g., Koltchinskii (2001), Koltchinskii and Panchenko (2002), Koltchinskii (2003), and Lugosi and Wegkamp (2004).

Support vector machines (SVM's) have been introduced by Boser, Guyon and Vapnik (1992). An important book on SVM's is Schölkopf and Smola (2002). Lin (2002) shows that the SVM is consistent. See Bartlett, Jordan and McAuliffe (2003) for results for more general loss functions. The $\ell_1$-penalty is often referred to as the LASSO (see Tibshirani (1996)). Adaptivity of the SVM with $\ell_1$-penalty is studied in Tarigan and van de Geer (2005).
6. Tools from empirical process theory

Let $X_1, \ldots, X_n$ be independent random variables with values in $\mathcal{X}$ and let $\Gamma$ be some family of real-valued functions on $\mathcal{X}$. Define
$$P_n(\gamma) = \sum_{i=1}^n \gamma(X_i)/n, \quad P(\gamma) = E P_n(\gamma).$$
We review, in Sections 6.1–6.4, some general results for the empirical process
$$\{ \sqrt{n}\,(P_n - P)(\gamma) : \gamma \in \Gamma \}.$$
In Section 6.5, we apply the results of Sections 6.1–6.4 to arrive at a uniform probability bound for the case where the empirical process is indexed by a class $\{ \gamma_f = \bar\gamma \circ f : f \in \mathcal{F} \}$ with $\bar\gamma : \mathbb{R} \to \mathbb{R}$ and $\mathcal{F}$ a subset of a collection of linear functions $\{ f_\theta = \sum_j \theta_j \psi_j \}$. There, we moreover assume the Lipschitz condition
$$|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2 \in \mathbb{R}.$$
The probability bound of Section 6.5 can be used to handle the set $\Xi_n$ introduced in Chapter 5.

Section 6.6 studies empirical processes indexed by functions in a function class $\Gamma$ satisfying an entropy bound. It considers the modulus of continuity of the empirical process. The result says that a version of the empirical process condition of Section 2.5 holds.

6.1. Concentration inequalities

Concentration inequalities are exponential probability inequalities for the concentration of the supremum of the empirical process around its mean. Remarkably, the amount of concentration does not depend on the dimensionality of the problem. Define
$$Z = \sup_{\gamma \in \Gamma} |P_n(\gamma) - P(\gamma)|.$$
We start out with a Hoeffding-type concentration inequality.

Theorem 6.1 (Massart (2000a)). Suppose $\Gamma$ is a finite collection, satisfying $a_{i,\gamma} \le \gamma(X_i) \le b_{i,\gamma}$ for some real numbers $a_{i,\gamma}$ and $b_{i,\gamma}$ and for all $1 \le i \le n$ and $\gamma \in \Gamma$. Define
$$L^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^n (b_{i,\gamma} - a_{i,\gamma})^2 / n.$$
Then for any positive $t$,
$$P(Z \ge EZ + t) \le \exp\Big[ -\frac{n t^2}{2 L^2} \Big].$$
The next theorem is a Bernstein-type concentration inequality.
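Theorem 6.2 (Massart (2000a)). Let $\Gamma$ be a countable family of functions, satisfying $|\gamma|_\infty \ (= \sup_x |\gamma(x)|) \le b < \infty$ for every $\gamma \in \Gamma$. Let
$$\tau^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^n \mathrm{var}(\gamma(X_i)) / n.$$
Then for any positive $t$,
$$P\big( Z \ge 2 EZ + \tau t \sqrt{8} + 69 b t^2 / 2 \big) \le \exp[-n t^2].$$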
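6.2. Symmetrization

Definition. A Rademacher sequence $\varepsilon_1, \ldots, \varepsilon_n$ is a sequence of i.i.d. random variables with values in $\{\pm 1\}$, and with $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$ ($i = 1, \ldots, n$).

Theorem 6.3 (van der Vaart and Wellner (1996)). Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence independent of $X_1, \ldots, X_n$. Then
$$E\Big( \sup_{\gamma \in \Gamma} |P_n(\gamma) - P(\gamma)| \Big) \le 2 E\Big( \sup_{\gamma \in \Gamma} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \gamma(X_i) \Big| \Big).$$

6.3. Contraction principle

Theorem 6.4 (Ledoux and Talagrand (1991)). Let $x_1, \ldots, x_n$ be non-random elements of $\mathcal{X}$, and $\bar\gamma : \mathbb{R} \to \mathbb{R}$ be Lipschitz, i.e.,
$$|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2 \in \mathbb{R}.$$
Furthermore, let $\mathcal{F}$ be a class of functions on $\mathcal{X}$. Then for any $f_* \in \mathcal{F}$,
$$E\Big( \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^n \varepsilon_i [\bar\gamma(f(x_i)) - \bar\gamma(f_*(x_i))] \Big| \Big) \le 2 E\Big( \sup_{f \in \mathcal{F}} \Big| \sum_{i=1}^n \varepsilon_i (f(x_i) - f_*(x_i)) \Big| \Big).$$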
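The symmetrization inequality of Theorem 6.3 can be verified exactly on a toy example by enumerating all outcomes. In the sketch below, the $X_i$ are assumed to be uniform on $\{0, 1\}$ and $\Gamma$ is a small finite class; these choices, and the function name, are made only for illustration.

```python
from itertools import product

def symmetrization_sides(gammas, n):
    """Return (lhs, rhs) of the symmetrization inequality, computed exactly
    for X_1, ..., X_n i.i.d. uniform on {0, 1} and a finite class `gammas`."""
    xs_all = list(product([0, 1], repeat=n))      # all samples, equally likely
    eps_all = list(product([-1, 1], repeat=n))    # all Rademacher sign vectors
    P = [(g(0) + g(1)) / 2 for g in gammas]       # P(gamma) = E gamma(X)
    lhs = sum(
        max(abs(sum(g(x) for x in xs) / n - Pg) for g, Pg in zip(gammas, P))
        for xs in xs_all
    ) / len(xs_all)
    rhs = 2 * sum(
        max(abs(sum(e * g(x) for e, x in zip(eps, xs)) / n) for g in gammas)
        for xs in xs_all
        for eps in eps_all
    ) / (len(xs_all) * len(eps_all))
    return lhs, rhs

if __name__ == "__main__":
    gammas = [lambda x: x, lambda x: 1 - x, lambda x: 0.5 * x]
    print(symmetrization_sides(gammas, 3))
```

The left-hand side is the expected supremum of the centered empirical process; the right-hand side replaces centering by random signs, which is what later allows the contraction principle to be applied conditionally on the observations.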
6.4. Weighted empirical processes

Let $\Gamma = \{ \gamma_f : f \in \mathcal{F} \}$ be a class of functions indexed by a set $\mathcal{F}$. Define the empirical process
$$\nu_n(f) = \sqrt{n}\,(P_n - P)(\gamma_f), \quad f \in \mathcal{F}.$$
Consider a function $d_*$ on $\mathcal{F}$ taking non-negative values. Suppose we are interested in the weighted empirical process
$$\frac{\nu_n(f)}{d_*^\beta(f)},$$
where $\beta > 0$ is given. We have for all $a > 0$ and $t > 0$,
$$P\Big( \sup_{f \in \mathcal{F},\ d_*(f) > t} \frac{|\nu_n(f)|}{d_*^\beta(f)} > a \Big) \le \sum_{j=1}^\infty P\Big( \sup_{f \in \mathcal{F},\ d_*(f) \le 2^j t} |\nu_n(f)| > a\, 2^{\beta(j-1)} t^\beta \Big).$$
Each summand on the right-hand side can be studied using results for the non-weighted empirical process. Because the index set $\mathcal{F}$ is "peeled" into annuli $\{ f \in \mathcal{F} : 2^{j-1} t < d_*(f) \le 2^j t \}$, this method for obtaining probability bounds is sometimes referred to as the peeling device (van de Geer (2000)). We apply the peeling device in the proof of Lemma 6.4. It is also handy for obtaining the modulus of continuity of the empirical process indexed by functions satisfying an entropy bound (see Section 6.6).

6.5. The case of a Lipschitz transformation of a linear space

Let $\bar\gamma : \mathbb{R} \to \mathbb{R}$ be a Lipschitz function, i.e.,
$$|\bar\gamma(z_1) - \bar\gamma(z_2)| \le |z_1 - z_2|, \quad z_1, z_2 \in \mathbb{R}.$$
Let $\mathcal{F}$ be a collection of functions on $\mathcal{X}$ and define
$$\gamma_f = \bar\gamma \circ f, \quad f \in \mathcal{F}.$$
Recall the notation for the empirical process
$$\nu_n(f) = \sqrt{n}\,(P_n - P)(\gamma_f), \quad f \in \mathcal{F}.$$

Regression and classification example. In regression and classification, we replace $X$ with values $x \in \mathcal{X}$ by the pair $(X, Y)$ with values $(x, y) \in \mathcal{X} \times \mathbb{R}$. In robust regression, the loss function is $\bar\gamma(y - f(x))$ with $\bar\gamma$ a given Lipschitz function. In binary classification, one has $y \in \{\pm 1\}$ and the SVM loss is $\bar\gamma(y f(x))$, where $\bar\gamma(z) = (1 - z)_+$ is the hinge loss, which is clearly also Lipschitz.

Suppose now that $\mathcal{F}$ is a bounded subset of a collection of linear functions
$$\mathcal{F} = \Big\{ f_\theta = \sum_{j=1}^n \theta_j \psi_j : |f_\theta| \le K_0/2 \Big\},$$
where $\{\psi_j\}_{j=1}^n$ is a given system of functions. The constant $K_0$ is assumed to be finite and satisfy $K_0 \ge 1$. We assume there are exactly $n$ functions in the system to facilitate the exposition. Denote the $\ell_1$-norm of a vector $\theta \in \mathbb{R}^n$ by
$$I(\theta) = \sum_{j=1}^n |\theta_j|.$$
Normalization condition on the system $\{\psi_j\}$. Assume that for all $j = 1, \ldots, n$,
$$\max_x |\psi_j(x)| \le \sqrt{\frac{n}{\log n}},$$
and
$$P|\psi_j|^2 \le 1.$$

Fix an arbitrary $f_* = f_{\theta_*}$ in $\mathcal{F}$. Define for all $M > 0$,
$$\mathcal{F}_M = \{ f_\theta \in \mathcal{F} : I(\theta - \theta_*) \le M \},$$
and
$$Z_M = \sup_{f_\theta \in \mathcal{F}_M} |\nu_n(f_\theta) - \nu_n(f_*)| / \sqrt{n}.$$
We first apply, in Lemma 6.1, the concentration inequality of Theorem 6.2 to $Z_M$, with $M$ fixed. The result is a probability inequality for the (one-sided) deviation of $Z_M$ from its mean $E Z_M$. The next task is to obtain an upper bound for $E Z_M$. This is done in Lemmas 6.2 and 6.3. The combination of Lemmas 6.1–6.3 yields the probability inequality (6.1) (below Lemma 6.3) for $Z_M$ with $M$ fixed. Lemma 6.4 finally uses this result and the peeling device of Section 6.4, to derive a probability bound for the weighted empirical process.

We use the notation
$$K_0 \vee M = \max(K_0, M), \quad K_0 \wedge M = \min(K_0, M).$$

Lemma 6.1. Assume the normalization condition on the system $\{\psi_j\}$. For all $M$ satisfying
$$K_0 \sqrt{\frac{\log n}{n}} \le M \le K_0 \sqrt{\frac{n}{\log n}},$$
we have
$$P\Big( Z_M \ge 2 E Z_M + 36 K_0^2 M \sqrt{\frac{\log n}{n}} \Big) \le \exp[-(K_0 \vee M)^2 \log n].$$

Proof. By the second part of the normalization condition on the system $\{\psi_j\}$, we know that
$$P|f_\theta - f_*|^2 \le I^2(\theta - \theta_*).$$
In the concentration inequality of Theorem 6.2, we may take $\tau^2 \le (K_0 \wedge M)^2$. We moreover take there $t^2 = (K_0 \vee M)^2 \log n / n$, and $b = K_0$. Then
$$\tau t \sqrt{8} + 69 b t^2 / 2 \le 36 K_0^2 M \sqrt{\log n / n}.$$
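Lemma 6.2. We have for all $M > 0$,
$$EZ_M \le 4M\, E\Big( \max_{1 \le j \le n} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \psi_j(X_i) \Big| \Big),$$
where $\varepsilon_1, \ldots, \varepsilon_n$ is a Rademacher sequence (see Section 6.2 for a definition), independent of $X_1, \ldots, X_n$.

Proof. By the definition of $Z_M$,
$$EZ_M = E\Big( \sup_{f_\theta \in \mathcal{F}_M} |(P_n - P)(\gamma_{f_\theta} - \gamma_{f_*})| \Big).$$
Use the symmetrization inequality (Theorem 6.3) and then the contraction principle (Theorem 6.4). Then we get
$$EZ_M \le 2E\Big( \sup_{f_\theta \in \mathcal{F}_M} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (\gamma_{f_\theta}(X_i) - \gamma_{f_*}(X_i)) \Big| \Big) \le 4E\Big( \sup_{f_\theta \in \mathcal{F}_M} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f_\theta(X_i) - f_*(X_i)) \Big| \Big).$$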
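But obviously,
$$E\Big( \sup_{f_\theta \in \mathcal{F}_M} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i (f_\theta(X_i) - f_*(X_i)) \Big| \Big) \le M\, E\Big( \max_{1 \le j \le n} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \psi_j(X_i) \Big| \Big),$$
because $f_\theta - f_* = \sum_{j=1}^n (\theta_j - \theta_{*,j}) \psi_j$ and $|\sum_j (\theta_j - \theta_{*,j}) a_j| \le I(\theta - \theta_*) \max_j |a_j|$ for any numbers $a_1, \ldots, a_n$.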
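From now on we assume $\log n \ge 1$.

Lemma 6.3. Assume the normalization condition on the system $\{\psi_j\}$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence, independent of $X_1, \ldots, X_n$. We have
$$E\Big( \max_{1 \le j \le n} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \psi_j(X_i) \Big| \Big) \le 24 \sqrt{\frac{\log n}{n}}.$$

Proof. Define
$$Z_\psi = \max_{1 \le j \le n} \Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \psi_j(X_i) \Big| \Big/ \Big( 3 \sqrt{\frac{\log n}{n}} \Big).$$
Using the same argument as in Lemma 5.2, we find for all $t \ge 1$,
$$P(Z_\psi \ge t) \le 2 \exp\Big[ -\frac{2}{7}\, t \log n \Big].$$
Hence,
$$E(Z_\psi) = \int_0^\infty P(Z_\psi \ge t)\, dt \le 1 + 2 \int_1^\infty \exp\Big[ -\frac{2}{7}\, t \log n \Big] dt = 1 + \frac{7}{\log n} \exp\Big[ -\frac{2}{7} \log n \Big] \le 8.$$
(Here, we used that $\log n \ge 1$.) Multiplying by $3\sqrt{\log n / n}$ gives the bound with constant $3 \cdot 8 = 24$.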
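To get a feeling for how conservative the constant 24 in Lemma 6.3 is, one can simulate the left-hand side for a concrete system. The sketch below is only an illustration under stated assumptions: it takes $X_i$ uniform on $[0, 1]$ and the cosine system $\psi_j(x) = \sqrt{2}\cos(\pi j x)$, which satisfies $P\psi_j^2 = 1$ and $\max_x |\psi_j(x)| = \sqrt{2} \le \sqrt{n/\log n}$ for the chosen $n$. Neither the design distribution nor the system is specified in the text.

```python
import math
import random

def psi(j, x):
    """Hypothetical system (assumption): sqrt(2) cos(pi j x); P psi_j^2 = 1 under Uniform[0, 1]."""
    return math.sqrt(2.0) * math.cos(math.pi * j * x)

def max_rademacher_average(n, rng):
    """One draw of max_{1<=j<=n} |(1/n) sum_i eps_i psi_j(X_i)|."""
    xs = [rng.random() for _ in range(n)]
    eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return max(abs(sum(e * psi(j, x) for e, x in zip(eps, xs))) / n
               for j in range(1, n + 1))

n, reps = 100, 20
rng = random.Random(2)

# Monte Carlo estimate of the expectation in Lemma 6.3.
estimate = sum(max_rademacher_average(n, rng) for _ in range(reps)) / reps
bound = 24.0 * math.sqrt(math.log(n) / n)

# Analytic step in the proof: 1 + (7 / log n) exp(-(2/7) log n) <= 8 when log n >= 1.
tail_term = 1.0 + 7.0 / math.log(n) * math.exp(-2.0 / 7.0 * math.log(n))
```

In such a run `estimate` should fall well below `bound`; the constant in the lemma is chosen for convenience of the proof rather than for sharpness.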
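The combination of Lemmas 6.1, 6.2 and 6.3 yields that under the normalization condition on the system $\{\psi_j\}$,
$$P\Big( Z_M \ge 228 K_0^2 M \sqrt{\frac{\log n}{n}} \Big) \le \exp[-(K_0 \vee M)^2 \log n], \qquad (6.1)$$
for $K_0 \sqrt{\log n / n} \le M \le K_0 \sqrt{n / \log n}$.

We now invoke the technique of Section 6.4 to study the weighted empirical process
$$\frac{\nu_n(f_\theta) - \nu_n(f_*)}{I(\theta - \theta_*) + 456 K_0^2 \sqrt{\log n / n}}, \quad f_\theta \in \mathcal{F}.$$

Lemma 6.4. Define $\lambda_n = 456 K_0^2 \sqrt{\log n / n}$. Under the normalization condition on the system $\{\psi_j\}$ we have
$$P\Big( \sup_{f_\theta \in \mathcal{F}} \frac{|\nu_n(f_\theta) - \nu_n(f_*)|}{I(\theta - \theta_*) + \lambda_n} \ge \sqrt{n}\, \lambda_n \Big) \le 3 \exp[-K_0^2 \log n / 2].$$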
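Proof. To simplify the notation, we assume that $f_* \equiv 0$. We split the class $\mathcal{F}$ into three subclasses, namely
$$\mathcal{F}(I) = \Big\{ f_\theta : I(\theta) \le K_0 \sqrt{\frac{\log n}{n}} \Big\},$$
$$\mathcal{F}(II) = \Big\{ f_\theta : K_0 \sqrt{\frac{\log n}{n}} < I(\theta) \le K_0 \Big\},$$
and
$$\mathcal{F}(III) = \{ f_\theta : I(\theta) > K_0 \}.$$
Then
$$P\Big( \sup_{f_\theta \in \mathcal{F}} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\, \lambda_n \Big) \le P\Big( \sup_{f_\theta \in \mathcal{F}(I)} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\, \lambda_n \Big) + P\Big( \sup_{f_\theta \in \mathcal{F}(II)} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\, \lambda_n \Big) + P\Big( \sup_{f_\theta \in \mathcal{F}(III)} \frac{|\nu_n(f_\theta)|}{I(\theta) + \lambda_n} \ge \sqrt{n}\, \lambda_n \Big)$$
$$= P_I + P_{II} + P_{III}.$$
Apply (6.1) to $P_I$ to see that
$$P_I \le P\Big( \sup_{f_\theta \in \mathcal{F}(I)} |\nu_n(f_\theta)| \ge \sqrt{n}\, \lambda_n^2 \Big) = P\Big( Z_{K_0 \sqrt{\log n / n}} \ge (456)^2 K_0^4\, \frac{\log n}{n} \Big) \le P\Big( Z_{K_0 \sqrt{\log n / n}} \ge 228 K_0^3\, \frac{\log n}{n} \Big) \le \exp[-K_0^2 \log n].$$
Now,
$$\mathcal{F}(II) \subset \bigcup_{j=0}^{j_0} \{ f_\theta \in \mathcal{F} : M_{j+1} < I(\theta) \le M_j \},$$
where $M_j = 2^{-j} K_0$, $j = 0, \ldots, j_0$, and where $2^{j_0} \le \sqrt{n}$. Thus
$$P_{II} \le \sum_{j=0}^{j_0} P\big( Z_{M_j} \ge \lambda_n M_j / 2 \big) \le \sum_{j=0}^{j_0} \exp[-K_0^2 \log n] \le \exp[-K_0^2 \log n / 2].$$
Finally,
$$\mathcal{F}(III) = \bigcup_{j=1}^{\infty} \{ f_\theta \in \mathcal{F} : \bar{M}_{j-1} < I(\theta) \le \bar{M}_j \},$$
where $\bar{M}_j = 2^j K_0$, $j = 0, 1, 2, \ldots$. So
$$P_{III} \le \sum_{j=1}^{\infty} P\big( Z_{\bar{M}_j} \ge \lambda_n \bar{M}_j / 2 \big) \le \sum_{j=1}^{\infty} \exp[-\bar{M}_j^2 \log n] \le \exp[-K_0^2 \log n].$$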
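The bookkeeping in this proof can be traced numerically. The sketch below recomputes the constant 228 of (6.1) from Lemmas 6.1–6.3, builds the dyadic levels $M_j = 2^{-j} K_0$ that cover $\mathcal{F}(II)$, and compares the union bound over the shells with the claimed bound $\exp[-K_0^2 \log n / 2]$. The function name `peeling_shells` and the values $n = 10000$, $K_0 = 1$ are assumptions for illustration, as is the stopping rule, which takes $j_0$ as the first index with $M_{j_0+1} \le K_0\sqrt{\log n / n}$.

```python
import math

def peeling_shells(n, K0):
    """Return lambda_n, the dyadic levels [M_0, ..., M_{j0+1}] with M_j = 2^{-j} K0, and j0."""
    rate = math.sqrt(math.log(n) / n)
    lam = 456.0 * K0 ** 2 * rate
    lower = K0 * rate                      # boundary between F(I) and F(II)
    levels = [K0]                          # M_0 = K0
    while levels[-1] > lower:              # stop once M_{j0+1} <= K0 sqrt(log n / n)
        levels.append(levels[-1] / 2.0)
    j0 = len(levels) - 2                   # levels = [M_0, ..., M_{j0+1}]
    return lam, levels, j0

# The constant 228 in (6.1): 2 * (4 * 24) from Lemmas 6.2-6.3 plus 36 from Lemma 6.1
# (using K0 >= 1 to absorb the first term into K0^2).
constant_61 = 2 * 4 * 24 + 36

n, K0 = 10_000, 1.0
lam, levels, j0 = peeling_shells(n, K0)

# Union bound over the j0 + 1 shells versus the claimed bound exp[-K0^2 log n / 2].
union_bound = (j0 + 1) * math.exp(-K0 ** 2 * math.log(n))
claimed = math.exp(-K0 ** 2 * math.log(n) / 2.0)
```

On each shell $M_{j+1} < I(\theta) \le M_j$ the denominator $I(\theta) + \lambda_n$ is at least $M_j / 2$, so the threshold $\lambda_n M_j / 2 = 228 K_0^2 M_j \sqrt{\log n / n}$ is exactly the one appearing in (6.1); this is why $\lambda_n$ carries the constant $456 = 2 \cdot 228$.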
Let us conclude this section by putting the result of Lemma 6.4 in the context of Chapter 5. Following the notation of that chapter, let
$$\Xi_n = \big\{ -[\nu_n(\hat f_n) - \nu_n(f_*)] \le \sqrt{n}\, [\mathrm{pen}(\hat f_n - f_*) + \lambda_n^2] \big\},$$
where $\hat f_n$ is any random function in $\mathcal{F}$, $\lambda_n = 456 K_0^2 \sqrt{\log n / n}$ and $\mathrm{pen}(f_\theta) = \lambda_n I(\theta)$. Then by Lemma 6.4, under the normalization condition on the system $\{\psi_j\}$, one has
$$P(\Xi_n) \ge 1 - 3 \exp[-K_0^2 \log n / 2].$$
This, in combination with the result of Lemma 5.4, completes, for the case $m = n$, the proof of adaptivity of the $\ell_1$-penalized SVM loss estimator. (When $m$ is larger than $n$, say $m = n^D$, the result goes through with a suitably adjusted normalization condition on the system $\{\psi_j\}$, or alternatively, with suitably adjusted constants depending on $D$.) The result of Lemma 5.1 for the $\ell_1$-penalized robust regression estimator can be completed analogously. It suffices to replace, in the arguments of this section, the concentration inequality of Theorem 6.2 by the one of Theorem 6.1.

6.6. Modulus of continuity of the empirical process

Let $\mathcal{F}$ be a class of functions on $\mathcal{X}$.

Definition. Let $u > 0$. The $u$-entropy $H(u, \mathcal{F}, d)$ (for the metric $d$) is the logarithm of the minimum number of balls with radius $u$ necessary to cover $\mathcal{F}$.

We define
$$\|f\|^2 = P|f|^2 = \sum_{i=1}^n E f^2(X_i) / n.$$
Let moreover
$$|f|_\infty = \sup_{x \in \mathcal{X}} |f(x)|,$$
and suppose that the entropy of $\mathcal{F}$ for the metric induced by $|\cdot|_\infty$ is finite. We denote this entropy by $H_\infty(\cdot, \mathcal{F})$. Consider again a transformation $\{\gamma_f = \bar\gamma \circ f : f \in \mathcal{F}\}$, where $\bar\gamma$ is Lipschitz.

Theorem 6.5 (van de Geer (2000)). Let $\mathcal{F}$ be a class of functions with $|f - f_*|_\infty \le 1$ for all $f, f_* \in \mathcal{F}$. Suppose that for some constants $A$ and $s > 1/2$,
$$H_\infty(u, \mathcal{F}) \le A u^{-1/s}, \quad u > 0.$$
Then there exists a constant $c$, depending only on $A$ and $s$, such that for all $f_* \in \mathcal{F}$ and all $t \ge c$,
$$P\Big( \sup_{f \in \mathcal{F},\ \|f - f_*\| > n^{-\frac{s}{2s+1}}} \frac{|\nu_n(f) - \nu_n(f_*)|}{\|f - f_*\|^{1 - \frac{1}{2s}}} \ge t \Big) \le c \exp\Big[ -\frac{t}{c} \Big].$$
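The two exponents in Theorem 6.5 are simple functions of the smoothness parameter $s$. As a small aid for reading the statement, the sketch below tabulates them in exact arithmetic: $\beta = 1 - 1/(2s)$ is the power of $\|f - f_*\|$ in the denominator, and $\rho = s/(2s+1)$ gives the radius $n^{-\rho}$ below which the supremum is not taken. The function name and the symbols `beta`, `rho` for these quantities are assumptions introduced here for illustration.

```python
from fractions import Fraction

def modulus_exponents(s):
    """For entropy exponent 1/s (s > 1/2), return (beta, rho) with
    beta = 1 - 1/(2s), the power of ||f - f*|| in the denominator, and
    rho = s/(2s + 1), so that the supremum runs over ||f - f*|| > n^{-rho}."""
    s = Fraction(s)
    if s <= Fraction(1, 2):
        raise ValueError("Theorem 6.5 requires s > 1/2")
    return 1 - 1 / (2 * s), s / (2 * s + 1)

# Exponents for one, two and three derivatives.
table = {s: modulus_exponents(s) for s in (1, 2, 3)}
```

As $s$ grows, $\beta$ approaches 1 and $\rho$ approaches 1/2, so smoother classes give a modulus of continuity closer to linear and a localization radius closer to the parametric rate $n^{-1/2}$.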
Example 6.1. Let $\mathcal{X} = [0, 1]$ and
$$\mathcal{F} = \Big\{ f : |f|_\infty \le 1,\ \int_0^1 |f^{(s)}(x)|^2\, dx \le 1 \Big\}.$$
Here $f^{(s)}$ is the $s$-th derivative of $f$. Then (Kolmogorov and Tikhomirov (1959)), the entropy condition of Theorem 6.5 holds.

Exercise 6.1. We first study the regression setup with fixed design. Suppose, for $i = 1, \ldots, n$, we have observations $(x_i, Y_i) \in \mathcal{X} \times \mathbb{R}$, where $x_i$ is a fixed covariable and $Y_i$ a response variable. In this setup, we have used throughout the notation
$$\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^n f^2(x_i),$$
for a function $f : \mathcal{X} \to \mathbb{R}$. Note that for such a function, $\|f\| = \|f\|_n$. Consider the M-estimator over the class $\mathcal{F}$ defined in Example 6.1:
$$\hat f_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \bar\gamma(Y_i - f(x_i)).$$
Assume the margin condition of Section 2.5 holds, with $d$ the metric corresponding to the $\|\cdot\|_n$-norm, and with margin parameter $\kappa \ge 2$. By applying Theorem 6.5 (with $\mathcal{F}$ there replaced by $\tilde{\mathcal{F}} = \{(x, y) \mapsto y - f(x) : f \in \mathcal{F}\}$), show that the bound as given in (2.4) for the average risk becomes
$$E R(\hat f_n) - R(f_0) \le \Big( \frac{1 + \delta}{1 - \delta} \Big) \Big\{ c_0\, n^{-\frac{2\kappa s}{2(\kappa - 1)s + 1}} + R(f_*) - R(f_0) \Big\},$$
as $\beta = 1 - 1/(2s)$. Here, $c_0$ is a constant depending on the constant $c$ of Theorem 6.5, and on $\delta$, $\kappa$ and $s$. Compare with the result of Exercise 5.2. This illustrates that, indeed, up to $\log n$-terms, the estimator with $\ell_1$-penalty of Section 5.1 adapts to the smoothness (= number of derivatives in this case) $s$.

Exercise 6.2. Generalize the situation of Exercise 6.1 to the case of random design, assuming the margin condition of Section 2.5 is met with $d$ the metric corresponding to the $\|\cdot\|$-norm.

6.7. Bibliographical remarks

Hoeffding's inequality and Bernstein's inequality can be found throughout the literature. Original references are Hoeffding (1963), Bernstein (1924) and Bennett (1962). Concentration inequalities have been derived by Talagrand (e.g., Talagrand (1995)) and further developed by Ledoux (e.g., Ledoux (1996)). Massart (2000a) studies the constants in these inequalities. Symmetrization is an important randomization technique in empirical process theory that goes back to Vapnik and Chervonenkis (1971). The method described in Section 6.4 for weighted empirical processes is from Alexander (1985), and is referred to as the peeling device in van de Geer (2000). A recent paper on weighted empirical processes is Giné and Koltchinskii (2004). The results in Section 6.5 are from Tarigan and van de Geer
(2005). In van de Geer (2000), moduli of continuity are derived that are more general than the one cited in Section 6.6. There, their implications for rates of convergence in regression and maximum likelihood estimation are also explained. Entropy results for subsets of Besov spaces, which are extensions of the space $\mathcal{F}$ studied in Example 6.1, can be found in Birgé and Massart (2000).
References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Proceedings 2nd International Symposium on Information Theory, P.N. Petrov and F. Csaki, Eds., Akademia Kiado, Budapest, 267–281.
Alexander, K.S. (1985). Rates of growth for weighted empirical processes. In: Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2 (L. LeCam and R.A. Olshen eds.) 475–493. University of California Press, Berkeley.
Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Prob. Theory Rel. Fields 113 301–413.
Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2003). Convexity, classification and risk bounds. Techn. Report 638, University of California at Berkeley.
Bennett, G. (1962). Probability inequalities for sums of independent random variables. Journ. Amer. Statist. Assoc. 57 33–45.
Bernstein, S. (1924). Sur une modification de l'inégalité de Tchebichef. Ann. Sci. Inst. Sav. Ukraine Sect. Math. I (Russian, French summary).
Birgé, L. and Massart, P. (2000). An adaptive compression algorithm in Besov spaces. Journ. Constr. Approx. 16 1–36.
Birman, M.Š. and Solomjak, M.Z. (1967). Piecewise polynomial approximation of functions in the classes $W_p^\alpha$. Math. USSR Sbornik 73 295–317.
Boser, B., Guyon, I. and Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. Fifth Annual Conf. on Comp. Learning Theory, Pittsburgh ACM 142–152.
Candès, E.J. and Donoho, D.L. (2004). New tight frames of curvelets and optimal representations of objects with piecewise $C^2$ singularities. Comm. Pure and Applied Math. LVII 219–266.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York, Berlin, Heidelberg.
Donoho, D.L. and Johnstone, I.M. (1994a). Minimax risk for $\ell_1$-losses over $\ell_p$-balls. Prob. Th. Related Fields 99, 277–303.
Donoho, D.L. and Johnstone, I.M. (1994b). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Donoho, D.L. (1995). De-noising via soft-thresholding. IEEE Transactions in Information Theory 41, 613–627.
Donoho, D.L. and Johnstone, I.M. (1996). Neo-classical minimax problems, thresholding and adaptive function estimation. Bernoulli 2, 39–62.
Donoho, D.L. (1999). Wedgelets: nearly minimax estimation of edges. Ann. Statist. 27 859–897.
Donoho, D.L. (2004a). For most large underdetermined systems of equations, the minimal $\ell_1$-norm near-solution approximates the sparsest near-solution. Techn. Report, Stanford University.
Donoho, D.L. (2004b). For most large underdetermined systems of linear equations, the minimal $\ell_1$-norm solution is also the sparsest solution. Techn. Report, Stanford University.
Edmunds, E. and Triebel, H. (1992). Entropy numbers and approximation numbers in function spaces. II. Proceedings of the London Mathematical Society (3) 64, 153–169.
Giné, E. and Koltchinskii, V. (2004). Concentration inequalities and asymptotic results for ratio type empirical processes. Working paper.
Goldstein, H. (1980). Classical Mechanics. 2nd edition, Reading, MA, Addison-Wesley.
Green, P.J. and Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.
Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics, vol. 129. Springer, New York, Berlin, Heidelberg.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journ. Amer. Statist. Assoc. 58 13–30.
Koenker, R. and Bassett Jr., G. (1978). Regression quantiles. Econometrica 46, 33–50.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1992). Nonparametric estimation of conditional quantile functions. L1 Statistical Analysis and Related Methods, Ed. Y. Dodge, Elsevier, Amsterdam, 217–229.
Koenker, R., Ng, P.T. and Portnoy, S.L. (1994). Quantile smoothing splines. Biometrika 81, 673–680.
Kolmogorov, A.N. and Tikhomirov, V.M. (1959). $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspekhi Mat. Nauk 14 3–86. (English transl. in Americ. Math. Soc. Transl. (2) 17 (1961) 277–364).
Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914.
Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
Koltchinskii, V. (2003). Local Rademacher complexities and oracle inequalities in risk minimization. Manuscript.
Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer Verlag, New York.
Ledoux, M. (1996). Talagrand deviation inequalities for product measures. ESAIM: Probab. Statist. 1 63–87. Available at: www.emath.fr/ps/
Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 6 259–275.
Loubes, J.-M. and van de Geer, S. (2002). Adaptive estimation in regression, using soft thresholding type penalties. Statistica Neerlandica 56 453–478.
Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. To appear in Ann. Statist.
Mammen, E. and Tsybakov, A.B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
Massart, P. (2000a). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28 863–884.
Massart, P. (2000b). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse 9, 245–303.
Pareto, V. (1897). Cours d'Économie Politique, Rouge, Lausanne et Paris.
Portnoy, S. (1997). Local asymptotics for quantile smoothing splines. Ann. Statist. 25, 414–434.
Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: computability of squared error versus absolute-error estimators, with discussion. Stat. Science 12, 279–300.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels, MIT Press, Cambridge.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Silverman, B.W. (1985). Some aspects of the smoothing spline approach to nonparametric regression curve fitting (with discussion). Journ. Royal Statist. Soc. B 47, 1–52.
Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'I.H.E.S 81 73–205.
Tarigan, B. and van de Geer, S.A. (2005). Support vector machines with $\ell_1$-complexity regularization. Submitted.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal Royal Statist. Soc. B 58, 267–288.
Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32, 135–166.
Tsybakov, A.B. and van de Geer, S.A. (2005). Square root penalty: adaptation to the margin in classification and in edge estimation. To appear in Ann. Statist. 33.
van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
van de Geer, S. (2001). Least squares estimation with complexity penalties. Mathematical Methods of Statistics 10, 355–374.
van de Geer, S. (2002). M-estimation using penalties or sieves. J. Statist. Planning Inf. 108, 55–69.
van de Geer, S. (2003). Adaptive quantile regression. In: Recent Advances and Trends in Nonparametric Statistics (Eds. M.G. Akritas and D.N. Politis), Elsevier, 235–250.
van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer, New York.
Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Th. Probab. Appl. 16, 264–280.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: Soc. for Ind. and Appl. Math.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman and Hall, London.

Sara van de Geer