The Bootstrap in Hypothesis Testing

Peter J. Bickel and Jian-Jian Ren
… the m out of n bootstrap, with m → ∞ and m/n → 0, is in principle usable. In particular, Bickel and Ren (1996) study the following situation: testing goodness of fit with doubly censored data, where the usual bootstrap fails and where finding a distribution approximating the truth under H₀ is difficult. They propose using the m out of n bootstrap to set the critical value of the test

Research partially supported by NSF Grant DMS 9504955.
Research partially supported by NSF Grants DMS 9510376 and 9626535/9796229.
and show that the proposed testing procedure is asymptotically consistent and has correct power against √n-alternatives. There have been a number of papers in the literature detailing modifications of the bootstrap for correct use in testing; see Beran (1986), Beran and Millar (1987), Hinkley (1987, 1988) and Romano (1988, 1989), among others. In particular, Hinkley indicated quite generally that bootstrapping from a distribution obeying the constraints of the hypothesis which is closest in some metric to the empirical distribution should give asymptotically correct critical values. Unfortunately, this requires an exercise in ingenuity in most cases and, as has been frequently noted (see Shao and Tu, 1995, for example), it may in practice be very difficult to construct such a distribution. Romano showed that in an interesting class of situations, including testing goodness of fit to a composite parametric hypothesis and testing independence, there is a natural definition of a distribution in the null hypothesis H₀ closest to the empirical, and that, for natural test statistics, bootstrapping from this distribution yields asymptotically appropriate critical values. In a prescient paper, Beran (1986) gave two general principles for the construction of tests of abstract hypotheses in the presence of abstract nuisance parameters and for the estimation of the power functions of such tests. In Section 2, we propose a unifying principle which identifies a very broad class of hypotheses and statistics, including all those considered by Romano (1988), for which a suitable application of the n out of n bootstrap yields asymptotically correct critical values, power for contiguous alternatives, and consistency under mild conditions. We state a general theorem and apply it in eight examples, including all those of Romano, those of Bickel and Ren (1996), a test for a change-point with censored data (Matthews, Farewell and Pyke, 1985), and a number of others.
This result, Theorem 2.1, applies only to test statistics which are regular in the sense of stabilizing on the n^{−1/2} scale under the hypothesis. In Theorem 2.2 we then extend Theorem 2.1 to a broader class of statistics based on estimates of irregular parameters such as densities. Moreover, we show that our proposed unifying principle can fail in situations with which the m out of n bootstrap can deal. Our unifying principle, though not our point of view, can be viewed as a particular case of one of Beran's two approaches, even as Hinkley's work corresponds to the other. However, the part of Beran's formulation relevant to the principle we state emphasized construction of tests from confidence regions for abstract parameters in the presence of nuisance parameters, rather than the setting of critical values for natural test statistics. Perhaps because this abstract point of view obscured both the rather simple, geometrically based special case we focus on and the general conditions whose checking is usually the heart of the matter, the broad applicability of his argument was not appreciated (even by us, until a referee
brought his paper to our attention). We focus here on checkable conditions and examples. In Section 3, we state and prove a theorem showing that the m out of n bootstrap is an approach that generally provides correct significance level, asymptotic power under contiguous alternatives, and consistency. This is essentially a formalization of the discussion of BGvZ (Bickel, Götze and van Zwet, 1997). We close with simulations and a brief appendix indicating where the regularity conditions for the examples can be found.
2
A general approach to defining semiparametric hypotheses
For simplicity, we start this section with the case where the data X₁, …, X_n are independent and identically distributed (i.i.d.), taking values in 𝒳, usually R^k, with an unknown distribution function (d.f.) F ∈ ℱ. However, it should be apparent from our discussion that our approach is more generally applicable. Suppose that we want to test

(2.1)  H₀ : F ∈ ℱ₀  vs.  H₁ : F ∉ ℱ₀.
We begin by considering the case where X takes on only the k + 1 values x₀, x₁, …, x_k. Thus ℱ is parametrized by

p ∈ 𝒫 = {p ∈ R^k : p_j > 0, Σ_{j=1}^k p_j < 1},

where p_j = P(X = x_j). If ℱ₀ is given by p = h(θ), θ ∈ R^q, q < k, with h 1–1, h(θ) continuously differentiable and (∂h_i(θ)/∂θ_j)_{k×q} of full rank, then h is an embedding of R^q in R^k (Vaisman, 1984, pages 11, 13 and 15). This means that for any p₀ ∈ ℱ₀ = h(R^q) ∩ 𝒫, there exist open sets U_{p₀} and U₀ in R^k and a differentiable function η₀ : U₀ → R^{k−q} such that p₀ ∈ U_{p₀} and U_{p₀} ∩ ℱ₀ = {p ∈ U₀ : η₀(p) = 0}. The map ("atlas") η₀ can in many cases of interest be pieced together consistently into a single η₀ such that

(2.2)  ℱ₀ = {p ∈ 𝒫 : η₀(p) = 0}.³

Thus, if the random sample X₁, …, X_n is from some p ∈ 𝒫 in a neighborhood U_{p₀} of some p₀ ∈ ℱ₀, and if η̂ = η₀(p̂), where p̂ = (N₁/n, N₂/n, …, N_k/n)ᵀ is the empirical distribution (vector of frequencies) of X₁, …, X_n, then typical tests are of the form (or are asymptotically equivalent to tests of the form): Reject if τ(√n η̂) is large, where the function
³ Even if the "atlas" cannot be reduced to a single function, we can still naturally base a test on η₀(p̂), for p̂ as above and p₀ the member of ℱ₀ closest to p̂.
τ : R^{k−q} → R⁺ is continuous with τ(t) = 0 iff t = 0. Typically, τ is equivalent to a norm on R^{k−q}. For instance, the usual Wald test uses n η̂ᵀ Σ̂⁻¹ η̂, where Σ̂ is an estimate of the covariance matrix Σ(p) of η̂; this is equivalent to using τ(x) = xᵀ Σ₀⁻¹ x. In this situation, we can bootstrap parametrically in one of two ways:
(a) Estimate θ by θ̂₀, the maximum likelihood estimator (MLE) under H₀ : η₀ = 0, then use the appropriate percentile of the distribution of τ(√n η̂*) as the critical value, where η̂* = η₀(p̂*) and p̂* is the empirical distribution of X₁*, …, X_n* i.i.d. p(θ̂₀);

(b) Note that √n(η̂ − η₀) and √n η̂ have the same distribution under H₀, and use the appropriate percentile of the distribution of τ(√n(η̂* − η̂)) as the critical value, where now X₁*, …, X_n* may be obtained from the 'nonparametric' bootstrap, i.e., i.i.d. p̂.

If θ̂₀ is uniformly consistent on Θ, it follows, for example, from a theorem of Rao (1973, pages 360–362) that both bootstraps are valid. (Note that p(θ̂₀) can be used instead of p̂ in case (b).)

If X does not have finite support, the corresponding conditions for the characterization of an embedding in Hilbert space are more involved. Nonetheless, as we shall see by example below, the equivalence (2.2) holds quite broadly. Sufficient conditions for use of bootstrap (b) are easily given. Suppose that for hypothesis (2.1) there exists T : ℱ → 𝒯, where 𝒯 is a Banach space, possibly R^p but often a function space such as D[R^p], such that

(2.3)  ℱ₀ = {F : T(F) = 0}.
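As an illustration (ours, not from the paper), here is a minimal sketch of scheme (a) for a three-cell multinomial with the single constraint η₀(p) = p₁ − p₂; the function name and the particular constraint are our choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def multinomial_test_a(counts, B=2000, alpha=0.05):
    """Scheme (a) for H0: p1 = p2 in a 3-cell multinomial, i.e.
    eta0(p) = p1 - p2.  Resample from p(theta0_hat), the MLE under H0,
    and use the bootstrap (1 - alpha)-quantile of
    tau(sqrt(n) eta0(p*)) = sqrt(n)|p1* - p2*| as the critical value."""
    n = counts.sum()
    Tn = abs(counts[0] - counts[1]) / np.sqrt(n)  # sqrt(n)|p1_hat - p2_hat|
    p0 = counts / n                               # MLE under H0 pools cells 1, 2
    p0[0] = p0[1] = (counts[0] + counts[1]) / (2 * n)
    boot = np.empty(B)
    for b in range(B):
        c = rng.multinomial(n, p0)                # parametric resample under H0
        boot[b] = abs(c[0] - c[1]) / np.sqrt(n)
    return Tn > np.quantile(boot, 1 - alpha)
```

Scheme (b) would instead resample i.i.d. from p̂ itself and bootstrap √n(η̂* − η̂).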
It is often convenient to think of both ℱ and 𝒯 as subsets of spaces of finite signed measures defined on spaces of bounded functions H_X, H_Y on 𝒳 and another space 𝒴, and to identify F as a member of ℓ∞(H_X) and T(F) = G as a member of ℓ∞(H_Y) via

(2.4)  F(h) = ∫ h(x) dF(x),  G(r) = ∫ r(y) dG(y).

We shall throughout assume that measurability technicalities are dealt with by the Hoffmann–Jørgensen approach — see van der Vaart and Wellner (1996). Let F_n denote the empirical distribution of X₁, …, X_n and let τ : 𝒯 → R⁺ be continuous with τ(t) = 0 iff t = 0. Tests for (2.1) are naturally based on rejecting H₀ for large τ(√n T(F_n)) (provided that √n(T(F_n) − T(F)) is well behaved). In analogy to the multinomial situation, it seems natural to use either
(a) the quantiles of the distribution of τ(√n T(F_n*)), where X₁*, …, X_n* are i.i.d. from F̂₀ ∈ ℱ₀, a 'uniformly consistent' estimate of F under H₀; or

(b) the quantiles of the nonparametric n out of n bootstrap distribution of τ(√n(T(F_n*) − T(F_n))), where F_n* is the empirical distribution of X₁*, …, X_n*, a sample from F_n;

as critical values for τ(√n T(F_n)). In the framework of Beran (1986), this can be viewed as using the test: Accept iff 0 ∈ C(F_n), where C(F_n) is the asymptotic 1 − α confidence region {t : τ(√n(T(F_n) − t)) ≤ d_n(F_n)} and d_n(F) is the 1 − α quantile of the distribution of τ(√n(T(F_n) − T(F))). We shall give sufficient conditions for the validity of alternative (b) in this abstract framework below, but before doing so we give some examples where (2.3) applies.

Example 1 Goodness of fit to a single hypothesis. Here ℱ₀ = {F₀}, ℱ is all distributions, and we can clearly take 𝒯 = ℓ∞(H_X), the finite signed measures on 𝒳, and T(F) = F − F₀. Possible τ's are τ(μ) = ‖μ‖∞, where ‖μ‖∞ = sup{|μ(h)| : h ∈ H_X} for suitable H_X. For example, 𝒳 = R, H_X = {1_{(−∞,t)} : t ∈ R} gives the Kolmogorov–Smirnov test. Another possibility is weighted averages of μ²(h) over H_X: thus τ(μ) = ∫ (μ(−∞, x))² dF₀(x) leads to the Cramér–von Mises test. Option (b) corresponds to using the bootstrap distribution of τ(√n(F_n* − F_n)), while (a) leads to simulating from F₀. □

Example 2 Goodness of fit to a composite hypothesis. Here ℱ₀ = {F_θ : θ ∈
Θ}, θ ∈ R^d, say, and ℱ₀ is a regular parametric model. Suppose that θ(F_n) ∈ Θ is a regular estimate of θ in the sense of Bickel, Klaassen, Ritov and Wellner (1993) (BKRW), where θ : ℱ → Θ is a parameter. For instance, θ(F) = argmin_θ ‖F − F_θ‖∞ may be a possibility. Again we can take 𝒯 ⊂ ℓ∞(H_X) and T(F) = F − F_{θ(F)}. Note that we could take θ(F) as any parameter defined on ℱ such that θ(F_θ) = θ.

This example figures prominently in Romano (1988). There he considered ℱ₀ describable by ℱ₀ = {F : F = γ(F)} and recommended scheme (a), resampling from γ(F_n) for the statistic √n‖F_n − γ(F_n)‖, with γ(F) = F_{θ(F)}. Our scheme simply rewrites F = γ(F) as F − γ(F) = 0. However, we prescribe bootstrapping from the empirical distribution for this statistic; that is, we use the bootstrap distribution of √n‖F_n* − γ(F_n*) − F_n + γ(F_n)‖. □

Example 3 Tests of location. Suppose 𝒳 = R^q and T is a location parameter:

(2.5)  T(F(· − θ₀)) = T(F) + θ₀
for all θ₀ ∈ R^q. Let ℱ₀ = {F : T(F) = 0}. Thus if T(F) = ∫ x dF, this is the hypothesis that the population mean of F is 0; if T(F) = F⁻¹(½), it is the hypothesis that the population median is 0. We can similarly consider trimmed means, etc. In fact, our discussion applies to scale parameters and, more generally, transformation parameters — see BKRW (1993) — but we do not pursue this. In this case our prescription is to use the bootstrap distribution of √n(T(F_n*) − T(F_n)). Here prescriptions (a) and (b) coincide, since the distribution of √n(T(F_n*) − T(F_n)) under H₀ is, by (2.5), the same as that of √n T(F̃_n*), where F̃_n* is the empirical distribution of a sample from F_n(· − T(F_n)). Equivalently, in the case of the mean, say, the bootstrap distribution of X̄_n* − X̄_n is the same as the distribution of the mean of a resample from the residuals X₁ − X̄, …, X_n − X̄. The latter form, (a), is the prescription of Freedman (1981) and Romano (1988), and the special case of the mean is Example 2 of Beran (1986). □

We now turn to some simple results. Suppose 𝒯 is a subset of a space of finite signed measures, with 𝒯 viewed as a subset of the Banach space ℓ∞(H_Y) as above. Suppose T is extendable to ℱ̄ and:

(A1) T is Hadamard differentiable at all F₀ ∈ ℱ₀ as a map from (ℱ̄, ‖·‖∞) to (𝒯, ‖·‖∞), with derivative Ṫ(F₀) : ℱ̄ → ℓ∞(H_Y) a continuous linear transformation and ℱ̄ a closed linear space containing ℱ. That is,

(2.6)  sup{‖T(F₀ + λΔ) − T(F₀) − λṪ(F₀)Δ‖ : Δ ∈ K} = o(λ)

as λ → 0, where K is any compact subset of ℓ∞(H_X).

(A2) √n(F_n − F₀) ⇒ Z_{F₀} in the sense of weak convergence for probabilities on ℓ∞(H_X) given by Hoffmann–Jørgensen, and P{Z_{F₀} ∈ ℱ̄} = 1 for all F₀ ∈ ℱ₀.
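Before stating the theorem, here is a minimal sketch (ours, not the paper's) of option (b) in Example 1, the bootstrapped Kolmogorov–Smirnov test; evaluating the supremum only at jump points of the sample is a simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_stat(sample, cdf):
    """sup_t |F_n(t) - F(t)|, evaluated at the sample's order statistics."""
    x = np.sort(sample)
    n = len(x)
    F = cdf(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - F), np.max(F - (i - 1) / n))

def bootstrap_ks(x, cdf0, B=500, alpha=0.05):
    """Option (b): compare sqrt(n)||F_n - F0||_inf with the
    (1 - alpha)-quantile of sqrt(n)||F_n* - F_n||_inf, where F_n* is the
    empirical d.f. of a resample drawn from F_n."""
    n = len(x)
    xs = np.sort(x)
    Fn = lambda t: np.searchsorted(xs, t, side="right") / n
    Tn = np.sqrt(n) * ks_stat(x, cdf0)
    boot = np.array([np.sqrt(n) * ks_stat(rng.choice(x, n), Fn)
                     for _ in range(B)])
    return Tn, np.quantile(boot, 1 - alpha)
```

Under H₀ the statistic exceeds the bootstrap critical value with probability about α, while under a fixed alternative it does so with probability tending to 1, in line with Corollary 2.1 below.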
Theorem 2.1 Under (A1) and (A2), for all F₀ ∈ ℱ₀,

(2.7)  √n(T(F_n) − T(F₀)) ⇒ Ṫ(F₀)Z_{F₀}

and, with probability 1,

(2.8)  √n(T(F_n*) − T(F_n)) ⇒ Ṫ(F₀)Z_{F₀}.
Proof By Giné and Zinn (1990), (A2) implies that

(2.9)  √n(F_n* − F_n) ⇒ Z_{F₀}

with probability 1. Now apply a standard argument. By Hadamard differentiability,

(2.10)  √n(T(F_n) − T(F₀)) = Ṫ(F₀)√n(F_n − F₀) + o_p(1),

(2.11)  √n(T(F_n*) − T(F₀)) = Ṫ(F₀)√n(F_n* − F₀) + o_p(1).

Then (2.10) yields (2.7), and subtracting (2.10) from (2.11) yields (2.8). □

Now, letting 𝓛₀ be the distribution of τ(Ṫ(F₀)Z_{F₀}), we have the following corollary.
Corollary 2.1 Under the assumptions of Theorem 2.1, if 𝓛₀ is continuous and, respectively, C_α(F_n) and C⁰_α are the (1 − α)-quantiles of the bootstrap distribution of τ(√n[T(F_n*) − T(F_n)]) and of 𝓛₀, then as n → ∞,

(2.12)  P{τ(√n T(F_n)) > C_α(F_n) | H₀} → α.

In fact, as n → ∞,

(2.13)  P{[τ(√n T(F_n)) > C_α(F_n)] Δ [τ(√n T(F_n)) > C⁰_α] | H₀} → 0.

If {F_n} is a sequence of alternatives contiguous to F₀ ∈ ℱ₀, then (2.13) continues to hold with P replaced by P_n corresponding to F_n, and hence the power functions of the tests using C_α(F_n) and C⁰_α are the same. If (A1) and (A2) hold for all F ∈ ℱ, not just ℱ₀, and τ(t) → ∞ as ‖t‖∞ → ∞, then the test based on C_α(F_n) is consistent for all F ∉ ℱ₀.
Proof (2.12) and (2.13) follow from C_α(F_n) →_p C⁰_α for all F₀ ∈ ℱ₀, an immediate consequence of the theorem and Pólya's theorem. Contiguity preserves convergence in probability to constants, so the equivalence of the power functions follows. Finally, consistency follows since, under the assumptions, C_α(F_n) converges in probability under F to the (1 − α)-quantile of the law of τ(Ṫ(F)Z_F), while τ(√n T(F_n)) →_p ∞, since the first term in the norm is tight while the second term has norm of order √n because T(F) ≠ 0. □

Examples 1–3 cited above all satisfy our assumptions essentially under the mild regularity conditions needed to justify that the test statistics in question have a limit law under H₀. We discuss the conditions briefly in the appendix. Now we turn to some further examples and a mild extension. Our next example falls outside of the Romano domain.

Example 4 Goodness of fit test of a lifetime distribution under censoring. Suppose that for a desired observation Tᵢ there are a right censoring variable Cᵢ and a left censoring variable Bᵢ such that Tᵢ is independent of (Bᵢ, Cᵢ), and that the available observations are of the form Xᵢ = (Yᵢ, δᵢ), where in
the right censored sample case we have Yᵢ = min{Tᵢ, Cᵢ}, δᵢ = I{Tᵢ ≤ Cᵢ}, and in the doubly censored sample case (Turnbull, 1974) we have

Yᵢ = max{min{Tᵢ, Cᵢ}, Bᵢ},  δᵢ = I{Bᵢ ≤ Tᵢ ≤ Cᵢ} + 2I{Tᵢ > Cᵢ} + 3I{Tᵢ < Bᵢ},

with P{Bᵢ < Cᵢ} = 1. Let G be the distribution function of Tᵢ; in this framework the goodness of fit test H₀ : G = G₀, for a given G₀, is important. We write F for the distribution of X = (Y, δ). Then, if G is identifiable, we have G = φ(F), with G_n = φ(F_n) the nonparametric maximum likelihood estimate (NPMLE) of G (see Bickel and Ren, 1996). Thus we can take T(F) = φ(F) − G₀ = G − G₀. Although T(·) is not Hadamard differentiable here, prescription (b) says to use the bootstrap distribution of τ(√n(G_n* − G_n)). As Bickel and Ren (1996) point out, it is difficult to fulfill prescription (a) in this case for doubly censored data, since it is not clear what to use as the member of ℱ₀ from which to resample. We will return to this example in Section 4. □

Example 5 U statistics. A natural generalization of Example 3 is testing H₀ : T(F) = 0, where T(F) = E_F φ(X₁, …, X_k), k > 1. The statistic we would be led to is the V statistic
(2.14)  V_n = T(F_n) = n^{−k} Σ_{i₁=1}^n ⋯ Σ_{i_k=1}^n φ(X_{i₁}, …, X_{i_k}).

Typically, however, one considers the equivalent U statistic

(2.15)  U_n = (n choose k)^{−1} Σ_{1 ≤ i₁ < ⋯ < i_k ≤ n} φ(X_{i₁}, …, X_{i_k}). …
… θ > 0 is the unknown changepoint parameter for the hazard rate, which changes from λ to (1 − ξ)λ at time θ. If θ is confined to a finite interval [θ₁, θ₂], the following test statistic was proposed on maximum likelihood grounds by Matthews, Farewell and Pyke (1985) for the irregular hypothesis H₀ : ξ = 0 vs. H₁ : ξ ≠ 0:

(2.17)  T_n = sup_{θ₁ ≤ θ ≤ θ₂} |Z_n(θ)|.

…, and τ is subadditive. Suppose also that there exist scalars {a_n}, {b_n}, possibly depending on F, such that

(2.32)  a_n τ(Ṫ_n(F)n^{−1/2}Z_F) + b_n ⇒ 𝓛_F

and a_n = o(n^{1/2}). Then a_n τ(T_n(F_n)) + b_n ⇒ 𝓛_F and, in probability, a_n τ(T_n(F_n*) − T_n(F_n)) + b_n ⇒ 𝓛_F.

Proof By our previous argument and (A1′),

(2.33)  T_n(F_n*) − T_n(F_n) = Ṫ_n(F)(F_n* − F_n) + o_p(n^{−1/2}),

and the corresponding statement holds for T_n(F_n) − T_n(F). Under (A2′), and since τ is a seminorm,

a_n τ(Ṫ_n(F)(F_n − F)) + b_n = a_n τ(Ṫ_n(F)n^{−1/2}Z_F) + b_n + O_p(a_n ‖Ṫ_n(F)‖∞ n^{−1/2})
                            = a_n τ(Ṫ_n(F)Z_F n^{−1/2}) + b_n + o_p(1) ⇒ 𝓛_F.

The last identity uses (2.31) and a_n = o(n^{1/2}). But, under H₀,

a_n τ(T_n(F_n)) + b_n = a_n τ(T_n(F_n) − T_n(F)) + b_n,

so that a_n τ(T_n(F_n)) + b_n ⇒ 𝓛_F. The same argument applies to a_n τ(T_n(F_n*) − T_n(F_n)) + b_n, and the theorem follows. □

We close this section with an old example in which, although our formalism applies, conditions (A1) or (A1′) of our theorems fail and our solution is incorrect.

Example 8 Test of distribution support. Suppose ℱ = {F : F has support [0, b] with unknown b, continuous density f, and f(b−) > 0}. Then, as is well known, if X₍₁₎ ≤ ⋯ ≤ X₍ₙ₎ are the ordered Xᵢ's, n(b − X₍ₙ₎) has limiting distribution (f(b−))⁻¹ Exp(1), where Exp(μ) denotes the exponential distribution with mean μ. Thus the natural test statistic for H₀ : b = b₀ is T_n = n(X₍ₙ₎ − b₀). If we let T(F) = F⁻¹(1) − b₀, we have put the hypothesis in our framework, and under H₀, −T_n = −nT(F_n) = n(b₀ − X₍ₙ₎) has the limiting distribution just noted.
However, the bootstrap distribution of n(T(F_n*) − T(F_n)) does not converge, as was already noted by Bickel and Freedman (1981) — see also BGvZ. Although Putter and van Zwet (1996) gave a method for repairing bootstrap inconsistency in a similar case (their Example 3.2), there is a much more general solution to this problem, discussed in BGvZ, which we recapitulate and discuss briefly in the next section. □
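The failure is easy to see numerically: the n out of n bootstrap of the maximum puts an atom of mass about 1 − e⁻¹ ≈ 0.632 at zero, so it cannot reproduce the continuous exponential limit. A small illustration (ours), taking F = U(0, 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# Under F = U(0,1), n(b - X_(n)) => Exp(1), a continuous law.  But
# P*(max of a resample equals X_(n)) = 1 - (1 - 1/n)**n -> 1 - 1/e,
# so the bootstrap law of n(X*_(n) - X_(n)) keeps an atom at 0.
n, B = 1000, 4000
x = rng.uniform(size=n)
xmax = x.max()
atom = np.mean([rng.choice(x, n).max() == xmax for _ in range(B)])
print(atom)  # close to 1 - exp(-1)
```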
3
The m out of n bootstrap hypothesis tests
This method, presented generally in Bickel, Götze and van Zwet (1997) (BGvZ) and in an alternative form by Politis and Romano (1996), is based on the assumption that under H₀ : F ∈ ℱ₀ the test statistic T_n = T_n(F_n) is such that

(3.1)  T_n ⇒ 𝓛_F,

which is nondegenerate. The m out of n bootstrap prescribes that the appropriate quantile of the bootstrap distribution of T_m(F_m*) be used, that is, of the distribution of the statistic based on m observations resampled from X₁, …, X_n. The history of this approach, which goes back to Bickel and Freedman (1981) and Bretagnolle (1981), is partially reviewed in BGvZ. If m → ∞ and m/n → 0, the prescription succeeds in giving an asymptotically correct level under very mild conditions, which we detail below. Politis and Romano (1996) argue that by resampling without replacement this conclusion holds with no conditions. Bickel and Ren (1996) checked the regularity condition for the applicability of this method in Example 4 when data are doubly censored. BGvZ show its applicability in Example 8. Here is a formal theorem. Let T_n be as above, let T_m* = T_m(F_m*), and let C*_α be given by

(3.2)  P*{T_m* > C*_α} = α.
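A minimal sketch (ours) of prescription (3.2) applied to the support test of Example 8, where the n out of n bootstrap fails; the function name is hypothetical, and m = n^{1/3} follows the choice made in Section 4:

```python
import numpy as np

rng = np.random.default_rng(3)

def support_test_moon(x, b0, alpha=0.05, B=2000):
    """m out of n bootstrap test of H0: b = b0 in Example 8,
    with T_n = n(X_(n) - b0).  The critical value C*_alpha is the
    upper-alpha quantile of T_m* = m(max of an m-resample - b0)."""
    n = len(x)
    m = max(2, round(n ** (1 / 3)))
    Tn = n * (x.max() - b0)
    Tm_star = np.array([m * (rng.choice(x, m).max() - b0) for _ in range(B)])
    return Tn > np.quantile(Tm_star, 1 - alpha)
```

Because m/n → 0, the resampled maximum no longer sticks to X₍ₙ₎, which is what restores consistency of the bootstrap law here.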
Let ℋ = {h : R → R ; |h(x) − h(y)| ≤ |x − y|, ‖h‖∞ ≤ 1} be the bounded Lipschitz functions …

Theorem 3.1 Suppose that … uniformly on bounded Lipschitz compacts contained in ℱ₀ if m → ∞, m/n → 0, and that

(c) sup_{h ∈ ℋ} |θ_{m,n}(F) − θ_m(F)| = o(1) for all F ∈ ℱ.

Then:

(i) lim_{n→∞} P{T_n > C*_α | H₀} = P{W_F > C⁰_α | …};

(ii) for alternatives H_n : F = F_n such that, for some F₀ ∈ ℱ₀, {F_n} are contiguous to F₀, we have under H_n that C*_α → C⁰_α as n → ∞, and hence the tests based on the critical values C*_α and C⁰_α have the same asymptotic power functions;

(iii) for a fixed alternative H₁ : F = F₁ ∉ ℱ₀, P{T_n > C*_α | H₁} → 1 as n → ∞.

Remark Assumption (c) essentially says that T_m is not really perturbed by o(√n) ties in its arguments — see BGvZ.

Proof Theorem 2 in BGvZ shows that (c), m → ∞ and m/n → 0 imply that

(3.3)  sup_{h ∈ ℋ} |E* h(T_m*) − E h(W_F)| →_p 0,

where W_F ~ 𝓛_F. But bounded Lipschitz convergence is equivalent to weak convergence. Thus, noting that under H₀ we have the identity T_n(X₁, …, X_n; F) = T_n(X₁, …, X_n), (3.3) and Pólya's theorem imply that C*_α →_p C⁰_α, and another application of Pólya's theorem yields (i). Assertion (ii) follows from the definition of contiguity. To argue for (iii), note that Theorem 2 of BGvZ implies that for all F,

(3.4)  T_m* − μ_m(F_n) ⇒ 𝓛_F.

Therefore, under F ∉ ℱ₀,

(3.5)  C*_α − μ_m(F_n) = O_p(1).

But T_n − μ_m(F_n) = T_n − μ_n(F_n) + (μ_n(F_n) − μ_m(F_n)) →_p ∞. Therefore, rejecting iff T_n > C*_α is equivalent to rejecting iff T_n − μ_m(F_n) > C*_α − μ_m(F_n), and the result follows from (3.5). □

The proof shows that C*_α → ∞ if F ∉ ℱ₀, and thus we expect that the power of this test is less than that of the tests proposed in Section 2 where
these are valid. We give some simulations showing that this is indeed the case. The question naturally presents itself: is there a way of correcting the m out of n bootstrap to give results comparable to those we obtain by simulating the tests of Section 2? A systematic answer is given in Bickel and Sakov (1999) (in preparation) and in the 1998 thesis of Sakov. We note that the m out of n bootstrap has the additional advantage of computational savings; see Bickel and Yahav (1988), for instance. In fact, the computational savings can be garnered in the context of Section 2 also. Specifically, it is clear that the conclusions of Theorem 2.1 continue to hold if, in calculating the critical value, the bootstrap distribution of τ(√n(T(F_n*) − T(F_n))) is replaced by that of τ(√m(T(F_m*) − T(F_n))), as long as m → ∞. It is intuitively clear that m ≪ n may give poor critical values, but in practice the effect seems small in the simulations we have conducted, as long as m is moderate. Further investigation is necessary.

4
Simulations
In this section we present some simulation results exhibiting the success of the method given in Theorem 2.1 and Corollary 2.1 in a number of our examples, and the inferior behavior of the m out of n bootstrap in all cases but Example 8. We give simulations for Example 3 (the median test), Example 4 (the goodness of fit test with doubly censored data), Example 7 and Example 8. In our studies, the following power curves are compared:

P₀ :  P₀(θ) = P{T_n > C⁰_α | θ},
P_n :  P_n(θ) = P{T_n > C_α(F_n) | θ},
P_m :  P_m(θ) = P{T_n > C*_α | θ},

where α = 0.05, T_n is the test statistic, C⁰_α is the true critical value obtained by the Monte Carlo method, C_α(F_n) is the critical value based on the adjusted n out of n bootstrap as in Corollary 2.1, C*_α is the critical value based on the m out of n bootstrap as in Theorem 3.1, and θ is the parameter used to compute the power of the test. For each simulation run, C_α(F_n) and C*_α are based on 400 bootstrap samples, and P₀(θ), P_n(θ), P_m(θ) are computed from 400 random samples for each θ.

(I) In Example 3, we consider the test H₀ : θ = 0 vs. H₁ : θ > 0, where θ is the median of the distribution F from which X₁, …, X_n is drawn. Figure 1 compares the power curves P₀, P_n and P_m, where n = 400, F is the normal distribution with mean θ and variance 25, and all power curves are the average of 500 simulation runs.
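A minimal sketch (ours) of the adjusted n out of n bootstrap critical value for the one-sided median test in (I); the one-sided form and the reuse of 400 bootstrap samples are our choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def median_test(x, B=400, alpha=0.05):
    """One-sided test of H0: median = 0 vs H1: median > 0.
    T_n = sqrt(n) * med(x); the critical value is the (1 - alpha)-quantile
    of the bootstrap law of sqrt(n)(med(x*) - med(x)), as in Corollary 2.1."""
    n = len(x)
    med = np.median(x)
    Tn = np.sqrt(n) * med
    boot = np.array([np.sqrt(n) * (np.median(rng.choice(x, n)) - med)
                     for _ in range(B)])
    return Tn > np.quantile(boot, 1 - alpha)
```

Repeating this over many samples drawn at each θ, and averaging the rejections, would trace out a curve like P_n in Figure 1.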
Figure 1. Power curves of median test with complete sample of size 400 (x-axis: median of normal distribution with variance 25).
Here m = √n = 20 is used since, for the median, an Edgeworth expansion to terms of order 1/√n is valid if the density of F has a finite derivative, and the Edgeworth expansion for the m out of n bootstrap is valid for m = O(√n) under the same conditions, with the same leading term and error terms of order 1/√m and √(m/n) (Sakov and Bickel, 1998). The "optimal" rate of m then balances 1/√m and √(m/n) to give m = √n.

(II) In Example 4, we consider the goodness of fit test H₀ : G = G₀ vs. H₁ : G ≠ G₀ for doubly censored data, using the Cramér–von Mises test statistic:
T_n = n ∫ (G_n − G₀)² dG₀.

Denoting by Exp(θ) the exponential distribution with mean θ, for n = 200, m = √n, G₀ = Exp(1), C = Exp(3) and B = (2/3)C − 2.5 (which, under H₀, gives 55.7% uncensored, 25.2% right censored and 19.1% left censored observations), Figure 2 compares the power curves P₀, P_n and P_m, which are the average of 100 simulation runs.

(III) In Example 7, we consider the test H₀ : F = F₀ vs. H₁ : F ≠ F₀, with F₀ = Exp(1). For the test statistic T_n given by (2.26) with n = 400, h_n = n^{−1/5}, M = 3, K = U(−1, 1), and θ the mean of the exponential distribution, Figure 3 compares the power curves P₀, P_n and P_m, which are the average of 100 simulation runs. Here, for m = √n, the power curve P_m for the m out of n bootstrap uses the critical value based on
Figure 2. Power curves of GOF test with doubly censored sample of size 200 (x-axis: mean of exponential distribution).
… {…; |x| ≤ M}, which coincides with T_n* given by (2.27) if m = n.

(IV) In Example 8, we consider the test H₀ : b = 1 vs. H₁ : b > 1, with F = U(0, b). For n = 400, m = n^{1/3} and θ = b, Figure 4 compares the power curves P₀, P_n and P_m, which are the average of 1000 simulation runs. In this case the power function P_n(θ), when the adjusted n out of n bootstrap is used, breaks down totally under H₀. One should note that in Figure 4 the power P_n(θ) under H₀, i.e., when θ = 1, is always 0 while α = 0.05, although P_n(θ) and P₀(θ) seem quite close overall. Here heuristics based on the asymptotic expansion of the distribution of the maximum, whose first error term is of order 1/n, together with heuristics discussed in BGvZ, suggest m of order n^{1/3}; in this case m = 7.

5
Appendix
We give brief arguments for the validity of the application of Theorem 2.1 in our examples.

Figure 3. Power curves of GOF test using density with complete sample of size 400 (x-axis: mean of exponential distribution).

Figure 4. Power curves of support test with complete sample of size 400 (x-axis: b of uniform distribution (0, b)).

Example 1 Taking H_X = H_Y = the indicators of rays in R, the statistics τ corresponding to the Kolmogorov–Smirnov and Cramér–von Mises tests are covered by Corollary 2.1, as are the analogous tests when one takes H_X to be a universal Donsker class in higher dimensions (van der Vaart and Wellner, 1996).

Example 2 Suppose the model ℱ₀ is regular and θ(F_n) is a regular estimate in the sense of BKRW. Suppose also that θ : ℱ → R^d is Hadamard differentiable with respect to ‖·‖∞ (in ℓ∞(H_X)) with derivative θ̇ : ℱ → R^d. Then F → F_{θ(F)} is Hadamard differentiable, since θ → F_θ is Hadamard differentiable from R^d to ℱ₀ by the regularity of the model, and thus the composition F → θ(F) → F_{θ(F)} is also.

Example 3 The satisfaction of the conditions here on the sets {F : E_F|X|^{2+δ} < ∞, δ > 0} and {F : f′ > 0} is well known.

Example 4 The appropriateness of the conditions for right censored data may be obtained from Andersen, Borgan, Gill and Keiding (ABGK) (1993), and for the doubly censored case from Bickel and Ren (1996).

Example 5 Appropriate references are cited in the example.

Example 6 The arguments for the uncensored case are in Matthews et al. (1985). The censored case modifications are clear from the theory of the Kaplan–Meier estimator for right censored data or of the NPMLE for doubly censored data (see Bickel and Ren, 1996).

Example 7 The arguments, based on Bickel and Rosenblatt (1973), are sketched in the example.

Example 8 The arguments are given in BGvZ.
REFERENCES
Anderson, P. K., Borgan O., φ, Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer-Verlag. Arcones, M. and Gine, E. (1993). Limit theorems for U processes. of Probability 4, 1449-1452. Beran, R. (1986). Simulated power functions. Ann. Statist
Annals
14, 151-173.
Beran, R. and Millar, P. W. (1987). Stochastic estimation and testing. Ann. Statist 15, 1131-1154. Bickel, P. J. and Preedman, D. A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist Vol. 9, 1196-1217. Bickel, P. J., Gδtze, F. and Van Zwet, W. R. (1997). Resampling fewer than n observations: gains, losses and remedies for losses. Statistica Sinica. (To appear). Bickel, P. J., Klaassen, C., Ritov, Y. and Wellner, J. (1993,1998). Efficient and Adaptive Estimation in Semiparametric Models. Johns Hopkins Press, Baltimore, Springer, New York.
110
Peter J. Bickel and Jiaπ-Jian Ren
Bickel, P. J. and Ren, J. (1996). The m out of n bootstrap and goodness of fit tests with doubly censored data. Robust Statistics, Data Analysis, and Computer Intensive Methods. Lecture Notes in Statistics. Springer-Verlag, 35-47. Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviations of density function estimates. An. Statist. 1,1071-1095. Bickel, P. J. and Sakov, A. (1998). On the choice of m in the m out of n bootstrap. (Preprint) Bickel, P. J. and Yahav, J. A. (1988). Richardson extrapolation and the bootstrap. J. Amer. Statist. Assoc. 83, 387-393. Bretagnolle, J. (1981). Lois limites du bootstrap de certaines fonctionelles. Ann. Inst. H. Poincare, Ser. B 19, 281-296. Csόrgδ, M. and Revesz, P. (1981). Strong Approximations in Probability and Statistics. Academic Press. New York. Freedman, D. A. (1981). Bootstrapping regression models. Ann. 12, 1218-1228.
Statist.
Galambos, J. (1978). The Asymptotic Theory of Extreme Order Statistics. John Wiley &; Sons, New York. Gehan, E. A. (1965). A generalized two-sample Wilcoxon test for doubly censored data. Biometrika 52, 650-653. Gill, R. D. (1983). Large sample behavior of the product-limit estimator on the whole line. Ann. Statist. 11, 49-58. Gine, E. and Zinn, J. (1990). Bootstrapping general empirical measures. Ann. Prob. 18, 852-869. Gu, M. G. and Zhang, C. H. (1993). Asymptotic properties of self-consistent estimators based on doubly censored data. Ann. Statist. Vol. 21, No. 2, 611-624. Hardle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Ann. Statist. 21, 1926-1947. Hawkins, D. L., Kochar, S. and Loader, C. (1992). Testing exponentially against IDMRL distributions with unknown change point. Ann. Statist. Vol. 20, 280-290. Hinkley, D. V. (1987). Bootstrap significance tests. Proceedings of the J^Ίth Session of International Statistical Institute, Paris, 65-74. Hinkley, D. V. (1988). 321-337.
Bootstrap methods. J. R. Statist.
Soc.
B, 50,
Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist Assoc. 53, 457-481.
The Bootstrap in Hypothesis Testing
111
Komlόs, J., Maior, P. and Tusnady, K. (1976). An approximation of partial sums of independent r.v.s. and the sample df. Z. Wαhrsch. Verw. Gebiete 32, 33-58. Mammen, E. (1992). When Does Bootstrap Work? Springer-Verlag, New York. Massart, P. (1989). Strong approximations for the multivariate empirical and related processes by KMT construction. Ann. Probab. 17, 266291. Matthews, D. E., Farewell, V. T. and Pyke, R. (1985). Asymptotic scorestatistic processes and tests for constant hazard against a changepoint alternative. Ann. Statist Vol. 13, 583-591. Mykland, P. and Ren, J. (1996). Algorithms for computing the selfconsistent and maximum likelihood estimators with doubly censored data. Ann. Statist. 24, 1740-1764. Politis, D. N. and Romano, J. P. (1996). A general theory for large sample confidence regions based on subsamples under minimal assumptions. Ann. Statist 22, 2031-2050. Putter, H. and van Zwet, W.R. (1996). Resampling: consistency of substitution estimators. Ann. Statist. 24, 2297-2318. Rao, C. R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.
Reeds, J. A. (1976). On the Definition of von Mises Functionals. Ph.D. Dissertation, Harvard University, Cambridge, Massachusetts.
Ren, J. (1995). Generalized Cramér-von Mises tests of goodness of fit for doubly censored data. Ann. Inst. Statist. Math. 47, 525-549.
Romano, J. P. (1988). A bootstrap revival of some nonparametric distance tests. J. Amer. Statist. Assoc. 83, 698-708.
Romano, J. P. (1989). Bootstrap and randomization tests of some nonparametric hypotheses. Ann. Statist. 17, 141-159.
Sakov, A. (1998). Ph.D. Thesis, University of California, Berkeley.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. Springer, New York.
Shorack, G. R. and Wellner, J. A. (1986). Empirical Processes with Applications to Statistics. John Wiley & Sons, Inc.
Turnbull, B. W. (1974). Nonparametric estimation of a survivorship function with doubly censored data. J. Amer. Statist. Assoc. 69, 169-173.
Vaisman, I. (1984). A First Course in Differential Geometry. Marcel Dekker, Inc., New York.
Peter J. Bickel and Jian-Jian Ren
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer, New York.

PETER J. BICKEL
DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA
BERKELEY, CA 94720
USA
[email protected]

JIAN-JIAN REN
DEPARTMENT OF MATHEMATICS
TULANE UNIVERSITY
NEW ORLEANS, LA 70118
USA
[email protected]
AN ALTERNATIVE POINT OF VIEW ON LEPSKI'S METHOD

LUCIEN BIRGÉ
Université Paris VI

Lepski's method is a method for choosing a "best" estimator (in an appropriate sense) among a family of estimators, under suitable restrictions on this family. The subject of this paper is to give a nonasymptotic presentation of Lepski's method in the context of Gaussian regression models, for a collection of projection estimators on some nested family of finite-dimensional linear subspaces. It is also shown that a suitable tuning of the method allows one to recover asymptotically the best possible risk in the family.

AMS subject classifications: Primary 62G07, Secondary 62G20.
Keywords and phrases: Adaptation, Lepski's method, Mallows' C_p, optimal selection of estimators.

1 Introduction
The aim of this paper is threefold. First we want to emphasize the importance of what is now called "Lepski's method", which appeared in a series of papers by Lepski (see Lepskii, 1990, 1991, 1992a and b). Then we shall present this method from an alternative point of view, different from the one initially developed by Lepski. Finally we shall introduce a generalization of the method and use it to prove some nice properties of it which, as far as we know, have not yet been considered, even by its initiator.

Let us first give a brief and simplified account of the classical method of Lepski. This method has been described in its general form and in great detail in Lepskii (1991), and the interested reader should of course have a look at that milestone paper. Here we shall content ourselves with considering the problem within the very classical "Gaussian white noise model". According to Ibragimov and Has'minskii (1981, p. 5), it was initially introduced as a statistical model by Kotel'nikov (see Kotel'nikov, 1959). Since then, it has been extensively studied by many authors from the former Soviet Union (see for instance Ibragimov and Has'minskii, 1981, Pinsker, 1980, Efroimovich and Pinsker, 1984) and more recently by Donoho and Johnstone (1994a and b, 1995, 1996) and Birgé and Massart (1999), among many other references. Although not at all confined to this framework, the method has often been considered in the context of the Gaussian white noise model for the sake of simplicity. This model can be described by a stochastic differential equation of the form (1.1)
dY_ε(t) = s(t) dt + ε dW(t),  0 ≤ t ≤ 1,  ε > 0,
where s ∈ L₂([0,1]) and W is a standard Brownian motion starting from 0. One wants to estimate the unknown function s using estimators ŝ(ε), i.e. measurable functions of Y_ε and ε. By "estimator", Lepski actually means a family {ŝ(ε)} of estimators depending on the parameter ε, which is assumed to be small enough. In order to measure the performances of such estimators, a classical way is to fix some distance d on L₂([0,1]) (or some pseudo-distance if d(s,t) = 0 does not necessarily imply that s = t in L₂([0,1])) and some number q ≥ 1, and to define the risk of the estimator at s as E_s[d^q(s, ŝ(ε))]. The point of view chosen by Lepski is then definitely minimax and asymptotic. He considers a family of parameter sets {S_θ}_{θ∈Θ} and uniform rates of convergence of estimators over those parameter sets. For a given estimator ŝ, he defines its rate r[ŝ, θ] on S_θ and the minimax rate r_M[θ] on S_θ, given respectively by
r[ŝ, θ](ε) = sup_{s∈S_θ} E_s[d^q(s, ŝ(ε))]  and  r_M[θ] = inf_ŝ r[ŝ, θ],
where the infimum is taken over all possible estimators. Comparing estimators then amounts to comparing their rates, the rate r being better than the rate r' (r ≼ r') if and only if limsup_{ε→0} r(ε)/r'(ε) < +∞, two rates being equivalent (r ≍ r') if r ≼ r' and r' ≼ r. An estimator ŝ is "rate asymptotically minimax" on S_θ, and therefore optimal from this point of view, if r[ŝ, θ] ≍ r_M[θ]. The problem that Lepski considers in his papers is the following: starting from a family of rate asymptotically minimax estimators {ŝ_θ}_{θ∈Θ}, how can one get adaptation over the family {S_θ}_{θ∈Θ}, i.e. build a new estimator ŝ which is simultaneously rate asymptotically minimax over all the sets S_θ, i.e. satisfies r[ŝ, θ] ≍ r_M[θ] for all θ ∈ Θ? Let us give a brief and rough account of his solution, rephrasing and simplifying his assumptions in the following way (see Lepskii, 1991 for the precise ones). Lepski's assumptions are essentially equivalent to:
1. Θ is a bounded subset of ℝ₊;
2. the family {S_θ}_{θ∈Θ} is nondecreasing with respect to θ;
3. the minimax rates r_M[θ] are, in a suitable sense, continuous with respect to θ;
4. for each θ ∈ Θ, one has available a rate asymptotically minimax estimator ŝ_θ on S_θ;
5. for ε small enough and each θ ∈ Θ, d^q(s, ŝ_θ(ε)) is suitably concentrated around its expectation.
Lepski then chooses, for each ε, a suitable finite discretization θ_1 < ... < θ_{n(ε)} of Θ and, given some large enough constant K, defines θ̂(ε) = θ_ĵ, where
ĵ = inf{j ≤ n(ε) | d^q(ŝ_{θ_j}(ε), ŝ_{θ_k}(ε)) ≤ K r[ŝ_{θ_k}, θ_k](ε) for all k ∈ (j, n(ε)]}.
He shows that ŝ = ŝ_{θ̂} is simultaneously rate asymptotically minimax over all the sets S_θ.

This problem of asymptotic adaptation can also be considered from a quite different point of view: if s ∈ S = ∪_{θ∈Θ} S_θ, there exists a smallest value θ(s) of θ such that s ∈ S_θ and, since we therefore have no idea of the behaviour of the risk E_s[d^q(s, ŝ_θ(ε))] for θ < θ(s), ŝ_{θ(s)} can be considered as the best estimator for estimating s among the family of estimators {ŝ_θ}_{θ∈Θ}. From this point of view, the problem to be solved is to find a best estimator in a family of such estimators, and it still makes sense without any reference to the minimax point of view, or even to the family {S_θ}_{θ∈Θ}. It can also be considered from a purely nonasymptotic point of view and set up as follows. Given Model (1.1) with a known value of ε and an unknown value of s, a family of estimators {ŝ_θ(ε)}_{θ∈Θ} and some loss function ℓ, is it possible to design a method for selecting an "almost best" estimator in the family? More precisely, assuming that s ∈ S ⊂ L₂([0,1]), does there exist a constant C, independent of ε and s ∈ S, and a random selection procedure θ̂ based on Y_ε, such that the estimator ŝ = ŝ_{θ̂} satisfies
(1.2) E_s[ℓ(s, ŝ(ε))] ≤ C inf_{θ∈Θ} E_s[ℓ(s, ŝ_θ(ε))]  for all s ∈ S and ε > 0?
This is precisely the problem we shall deal with in this paper, by a suitable modification of Lepski's initial recipe. In order to allow an easier understanding of our method and to avoid technicalities, we shall stick to the Gaussian white noise model and restrict our study to the case of a family {ŝ_θ(ε)}_{θ∈Θ} of projection estimators over a nested family of finite-dimensional linear subspaces S_θ of L₂([0,1]), indexed by some subset Θ of ℕ. We shall show that (1.2) actually holds with S = L₂([0,1]) and that one can even take C arbitrarily close to 1 when ε goes to zero, under suitable restrictions on s. The framework we use here is just the one we considered in Birgé and Massart (1999) for studying penalized least squares estimators. Since penalization can also be viewed as a method for selecting estimators, this allows us to make a parallel between these two methods. Indeed, under the assumptions we use here, they are essentially equivalent. A discussion of the relative merits of the two methods within a more general framework is beyond the scope of this paper. Let us merely mention that Lepski's method allows one to handle more general loss functions, while penalization allows one to deal with more general families of estimators.
Lepski's method has been put to use in various contexts and by several authors. Let us mention here the papers by Efroimovich and Low (1994), Lepski and Spokoiny (1995), Juditsky (1997), Lepski, Mammen and Spokoiny (1997), Lepski and Levit (1998), Tsybakov (1998) and Butucea (1999). Recently, Lepski has substantially improved his method by relaxing the monotonicity assumptions he previously imposed, which were in particular inadequate for dealing with estimation of multidimensional functions with anisotropic smoothness. His new method, which he explained in a series of lectures (Lepski, 1998), could analogously be carried out in the context we use below. In order to keep our presentation simple and short, we shall dispense with this extension and content ourselves with presenting our point of view derived from the initial method of Lepski (1991). The procedure for selecting an estimator among some family that we develop below is actually not exactly the original procedure proposed by Lepski, but rather a modification of it which is better suited to our nonasymptotic approach and avoids any reference to minimaxity. Nevertheless, the ideas underlying our construction definitely belong to Lepski.
2 Preliminary considerations

2.1 The problem at hand
The problem we want to deal with is the estimation of some unknown function s ∈ L₂([0,1]) in the Gaussian white noise model (1.1). In order to accomplish this task, we have at our disposal a family of projection estimators {ŝ_m}_{m∈M} corresponding to some nested family {S_m}_{m∈M} of finite-dimensional linear subspaces of L₂([0,1]) with respective positive dimensions D_m. Here M ⊂ ℕ is either ℕ* = ℕ \ {0} or finite and equal to [1; M] ∩ ℕ, and the sequence (D_m)_{m∈M} is strictly increasing. We recall that the projection estimator ŝ_m onto S_m is derived from Y_ε by the formula
ŝ_m = Σ_{j=1}^{D_m} ( ∫ ψ_j(t) dY_ε(t) ) ψ_j,
where (ψ_1, ..., ψ_{D_m}) is an arbitrary orthonormal basis of S_m. Our purpose is then as follows: starting from the family of estimators {ŝ_m}_{m∈M}, build a new one, denoted by ŝ, a function of those, of ε and of the sequence (D_m)_{m∈M}, such that
E[||ŝ − s||²] ≤ C inf_{m∈M} E[||ŝ_m − s||²],
with a constant C independent of s and ε. Here and in the sequel, E denotes the expectation with respect to the distribution of the process Y_ε, as defined by (1.1).
Since the family {S_m}_{m∈M} is nested, one can always find an orthonormal basis ψ_1, ..., ψ_j, ... of L₂([0,1]) such that S_m is the linear span of (ψ_1, ..., ψ_{D_m}) for each m ∈ M. Then, if s = Σ_{j≥1} β_j ψ_j, it follows from (1.1) that, for all m ∈ M,
(2.3) ŝ_m = Σ_{j=1}^{D_m} β̂_j ψ_j  with  β̂_j = ∫ ψ_j(t) dY_ε(t) = β_j + εZ_j,
where the random variables Z_j, j ≥ 1, are i.i.d. with distribution N(0,1).
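To make the sequence-space formulation concrete, here is a small numerical sketch (ours, not the paper's): we simulate the observed coefficients β̂_j = β_j + εZ_j of (2.3) for an assumed signal β_j = 1/j and form a projection estimator through its coefficient vector.

```python
import numpy as np

def simulate_coefficients(beta, eps, rng):
    """Observed coefficients beta_hat_j = beta_j + eps * Z_j, Z_j i.i.d. N(0,1), cf. (2.3)."""
    return beta + eps * rng.standard_normal(beta.shape)

def projection_estimator(beta_hat, D_m):
    """Coefficients of the projection estimator s_m: keep the first D_m, zero out the rest."""
    s_m = np.zeros_like(beta_hat)
    s_m[:D_m] = beta_hat[:D_m]
    return s_m

rng = np.random.default_rng(0)
beta = 1.0 / np.arange(1, 201)          # assumed signal coefficients, for illustration only
eps = 0.05
beta_hat = simulate_coefficients(beta, eps, rng)
s_10 = projection_estimator(beta_hat, 10)
# the squared loss splits as estimation error on the first D_m coordinates plus the bias tail
loss = float(np.sum((s_10 - beta) ** 2))
```

The decomposition of `loss` mirrors (2.4) below: as D_m grows, the estimation term grows like ε²D_m while the bias tail shrinks.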
2.2 Some properties of projection estimators
In order to describe some elementary properties of the projection estimators, it will be useful to introduce some notation. Setting D_0 = 0 and D_∞ = +∞, we consider for all pairs (m, q) with 0 ≤ m < q ≤ +∞ the quantities B_m^q, V_m^q and U_m^q given by
B_m^q = ε^{−2} Σ_{j=D_m+1}^{D_q} β_j²,  V_m^q = Σ_{j=D_m+1}^{D_q} Z_j²  and  U_m^q = ε^{−2} Σ_{j=D_m+1}^{D_q} β̂_j²,
with the convention that Σ_{j=k}^{l} = 0 when k > l. Since the variables Z_j are i.i.d. N(0,1), it then follows that V_m^q has the distribution χ²(D_q − D_m) of a chi-square with D_q − D_m degrees of freedom, and U_m^q the distribution χ'²(D_q − D_m, (B_m^q)^{1/2}) of a noncentral chi-square with D_q − D_m degrees of freedom and noncentrality parameter (B_m^q)^{1/2}. Therefore
E[V_m^q] = D_q − D_m  and  E[U_m^q] = D_q − D_m + B_m^q.
One then derives from (2.3) that
(2.4) ||ŝ_m − s||² = ε²(V_0^m + B_m^∞)  and  E[||ŝ_m − s||²] = ε²(D_m + B_m^∞),
and, for any pair (j, m) ∈ M² with j < m,
(2.5) ||ŝ_m − ŝ_j||² = ε² U_j^m  and  E[||ŝ_m − ŝ_j||²] = ε²(B_j^m + D_m − D_j).

2.3 Optimal projection estimators
Given any sequence (x_m)_{m∈M} such that lim_{m→+∞} x_m = +∞ when M is infinite, one defines in a unique way
(2.6) argmin{x_m, m ∈ M} = inf{j ∈ M | x_j = inf_{m∈M} x_m} = inf{j ∈ M | x_j ≤ x_m for all m > j}.
Then, given s, a best estimator in the family {ŝ_m, m ∈ M}, i.e. one minimizing the quadratic risk E[||ŝ_m − s||²] = ε²(D_m + B_m^∞) at s, is ŝ_m̄ with m̄ = argmin{D_m + B_m^∞, m ∈ M}. More generally, given some number γ > 0, one can define
(2.7) J̄ = argmin{γD_m + B_m^∞, m ∈ M} = inf{j | B_j^∞ − B_m^∞ ≤ γ(D_m − D_j) for all m > j}.
Since B_j^∞ − B_m^∞ = B_j^m, it follows from (2.5) that
(2.8) J̄ = inf{j | E[||ŝ_m − ŝ_j||²] ≤ (1 + γ) ε² (D_m − D_j) for all m > j}.
On the other hand, by (2.7),
(2.9) γD_J̄ + B_J̄^∞ ≤ γD_m̄ + B_m̄^∞ ≤ (1 ∨ γ)(D_m̄ + B_m̄^∞).
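The tie-breaking convention (2.6), namely that the argmin is the first index lying below every later value, can be checked numerically. The following sketch (our illustration, with an assumed coefficient sequence) computes the oracle index of (2.7) both ways:

```python
import numpy as np

def argmin_first(x):
    """Smallest global minimizer; np.argmin already returns the first one."""
    return int(np.argmin(x))

def inf_formulation(x):
    """inf{ j : x_j <= x_m for all m > j }, the last expression in (2.6) (0-based)."""
    n = len(x)
    for j in range(n):
        if all(x[j] <= x[m] for m in range(j + 1, n)):
            return j
    return n - 1

# x_m = gamma * D_m + B_m^inf for an assumed signal beta_j = 1/j and D_m = m
gamma, eps = 1.0, 0.05
beta = 1.0 / np.arange(1, 101)
tail = np.append(np.cumsum((beta ** 2)[::-1])[::-1][1:], 0.0)  # sum_{j > m} beta_j^2
x = gamma * np.arange(1, 101) + tail / eps ** 2
J_bar = argmin_first(x)
```

Both formulations return the same index, as the uniqueness claim in (2.6) requires.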
Proposition 4.1 If a > 3, m_0 ≥ 1 and
(4.19) λ_m ≥ a log D_m for m ≥ m_0,
then (4.17) holds.
Proof. Recalling from (3.16) that Σ_j = Σ_{m>j} exp(−λ_m), we consider some integer p ≥ 2 such that a > 3 + 2/(p − 1). We want to prove that Σ_{j≥1} D_j Σ_j^{1−1/p} < +∞. By (4.19) and the convexity of the function x ↦ x^{−a}, one gets for j ≥ m_0
Σ_j ≤ Σ_{m>j} D_m^{−a} ≤ c D_j^{−a+1},
and it follows that
D_j Σ_j^{1−1/p} ≤ c' D_j^{1−(a−1)(1−1/p)}  for j ≥ m_0.
Since a > 3 + 2/(p − 1), one has (a − 1)(1 − 1/p) > 2, so the series Σ_{j≥1} D_j Σ_j^{1−1/p} converges and (4.17) is satisfied. □
Let us observe that (4.19) is in particular compatible with a choice of numbers K_{m,j} satisfying, for some positive constants A ≥ a > 0,
(4.20) a D_m ≤ K_{m,j} ≤ A D_m  for all m ≥ 2, 0 ≤ j < m,
which ensures that (4.18) holds. In particular, the original method of Lepski is based on the choice
Ĵ = inf{j ∈ M | ||ŝ_j − ŝ_m||² ≤ K ε² D_m for all m > j}
with a suitably large constant K. Choosing K > 1 and 0 < γ < K − 1 then leads to
(K − 1 − γ) D_m ≤ 2K_{m,j} = (K − 1 − γ) D_m + (1 + γ) D_j ≤ K D_m,
which is (4.20). Such a choice is therefore compatible with (3.12) and (4.19) for suitable values of the parameters λ_m. In particular, the classical method of Lepski with a choice of K > 1 satisfies our assumptions. This is not true anymore if K < 1, and one could prove, in the same way that we proved lower bounds for the penalty term in Birgé and Massart (1999), that K < 1 could lead to inconsistent estimators when ε converges to zero. We shall not insist on this here. On the other hand, if K > 1, the following theorem applies.

Theorem 4.1 Under the above assumptions, there exists some constant C, depending only on the various parameters involved in the construction of the estimator, but neither on ε nor on s, such that
(4.21) E[||ŝ − s||²] ≤ C inf_{m∈M} E[||ŝ_m − s||²].
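To illustrate the pairwise-comparison selection rule with K_{m,j} proportional to D_m, here is a small Monte Carlo sketch (our own illustration; the signal, the constant K and ε are arbitrary assumptions, and D_m = m). It selects Ĵ as the first j with ||ŝ_j − ŝ_m||² ≤ Kε²D_m for all m > j, and compares its average loss with the best single projection estimator, in the spirit of (4.21):

```python
import numpy as np

def select_lepski(beta_hat, eps, K=3.0):
    """First j such that ||s_j - s_m||^2 <= K * eps^2 * D_m for all m > j (D_m = m, 1-based)."""
    n = len(beta_hat)
    csum = np.concatenate([[0.0], np.cumsum(beta_hat ** 2)])  # ||s_m - s_j||^2 = csum[m] - csum[j]
    for j in range(1, n + 1):
        if all(csum[m] - csum[j] <= K * eps ** 2 * m for m in range(j + 1, n + 1)):
            return j
    return n

rng = np.random.default_rng(2)
beta = 1.0 / np.arange(1, 301) ** 1.5     # assumed signal, for illustration
eps = 0.05
tail = np.append(np.cumsum((beta ** 2)[::-1])[::-1][1:], 0.0)
sel_losses, all_losses = [], []
for _ in range(200):
    beta_hat = beta + eps * rng.standard_normal(beta.shape)
    J = select_lepski(beta_hat, eps)
    losses = np.cumsum((beta_hat - beta) ** 2) + tail   # loss of s_m for m = 1..300
    all_losses.append(losses)
    sel_losses.append(losses[J - 1])
oracle_risk = np.mean(all_losses, axis=0).min()         # best fixed m, cf. inf_m E||s_m - s||^2
ratio = float(np.mean(sel_losses) / oracle_risk)        # plays the role of C in (4.21)
```

In this run the ratio stays of order one, as Theorem 4.1 predicts for K > 1; this is a sanity check, not a proof.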
If we fix the values of the various parameters involved in the construction of our estimator, C can then be taken as a universal constant. For instance, the particular choice of λ_m = 4 log(D_m + 1), p = 4, γ = 1 and K_{m,j} = [λ_m D_m]^{1/2} + λ_m, together with the assumption that D_{m+1} ≤ … for all m, satisfies our requirements and, although this particular choice of the parameters has nothing special, it can cope with almost all practical situations.

5 Proof of Theorem 4.1

For the sake of simplicity we shall prove the theorem below only under the assumption that M = ℕ*. Only minor modifications in Section 5.5 below are needed to handle the finite case.

5.1 Basic inequality
It follows from (2.4) and the monotonicity of the sequence B_m^∞ that
ε^{−2} ||ŝ − s||² = V_0^Ĵ + B_Ĵ^∞ = (V_0^Ĵ + B_Ĵ^∞) 1_{Ĵ<J̄} + (V_0^Ĵ + B_Ĵ^∞) 1_{Ĵ≥J̄}
≤ (B_Ĵ^∞ + D_Ĵ) 1_{Ĵ<J̄} + |V_0^Ĵ − D_Ĵ| 1_{Ĵ<J̄} + B_J̄^∞ + V_0^J̄ 1_{Ĵ≥J̄} + V_J̄^Ĵ 1_{Ĵ>J̄},
and therefore after integration
(5.22) ε^{−2} E[||ŝ − s||²] ≤ B_J̄^∞ + E[(B_Ĵ^∞ + D_Ĵ) 1_{Ĵ<J̄}] + E[|V_0^Ĵ − D_Ĵ| 1_{Ĵ<J̄}] + E[V_0^J̄ 1_{Ĵ≥J̄}] + E[V_J̄^Ĵ 1_{Ĵ>J̄}].
We shall now bound successively each of the four expectations in the right-hand side of (5.22).

5.2
Control of the first expectation

Recalling that the set A_j is defined by (3.14), we see that it only depends on the random variables U_j^m for m > j, and therefore on the variables β̂_m for m > j, which implies that A_j is independent of V_0^j. Therefore, by (3.16),
E[V_0^J̄ 1_{Ĵ>J̄}] ≤ E[V_0^J̄ 1_{A_J̄}] = D_J̄ P[A_J̄] ≤ D_J̄ Σ_J̄.

5.3 Control of the second expectation

For t > 0, we define the set
E_t = ∩_{0<m<J̄} { U_m^J̄ ≥ B_m^J̄ − 2[(D_J̄ − D_m + 2B_m^J̄) t]^{1/2} }.
Recalling that U_m^J̄ has the distribution χ'²(D_J̄ − D_m, (B_m^J̄)^{1/2}), we derive from (8.35) of Lemma 8.1 that
(5.23) P[E_t^c] ≤ (J̄ − 1) e^{−t}.
Since U_Ĵ^J̄ ≤ (1 + γ)(D_J̄ − D_Ĵ) + 2K_{J̄,Ĵ} for Ĵ < J̄, by the definition of Ĵ, we derive that, on the set E_t ∩ {Ĵ < J̄},
U_Ĵ^J̄ ≥ B_Ĵ^J̄ − 2[(D_J̄ − D_Ĵ + 2B_Ĵ^J̄) t]^{1/2}.
Using the fact that −2(xy)^{1/2} = (x^{1/2} − y^{1/2})² − x − y, we then derive that, on the set E_t ∩ {Ĵ < J̄}, one has
[(D_J̄ − D_Ĵ)/2 + B_Ĵ^J̄] − 2[2t((D_J̄ − D_Ĵ)/2 + B_Ĵ^J̄)]^{1/2} ≤ (γ + 1/2)(D_J̄ − D_Ĵ) + 2K_{J̄,Ĵ} + 2t,
and therefore
(5.25) B_Ĵ^∞ + γD_Ĵ ≤ (γ + 2a_J̄) D_J̄ + Δt  with  Δ = 8 + [(2γ + 1 + 4a_J̄) D_J̄]^{1/2}
on the set E_t ∩ {Ĵ ≤ J̄}, since this inequality clearly also holds if Ĵ = J̄. One then derives from (5.23) that
P[(B_Ĵ^∞ + γD_Ĵ) 1_{Ĵ≤J̄} > (γ + 2a_J̄) D_J̄ + Δt] ≤ (J̄ − 1) e^{−t}  for t ≥ 1.
Integration with respect to t finally leads, for J̄ ≥ 2, to
E[(B_Ĵ^∞ + γD_Ĵ) 1_{Ĵ≤J̄}] ≤ (γ + 2a_J̄) D_J̄ + [1 + (1 ∨ log(J̄ − 1))] Δ,
which clearly remains true when J̄ = 1. Therefore, whatever J̄ ≥ 1,
E[(B_Ĵ^∞ + γD_Ĵ) 1_{Ĵ≤J̄}] ≤ (γ + 2a_J̄) D_J̄ + (8 + [(2γ + 1 + 4a_J̄) D_J̄]^{1/2}) log(3J̄ + 5).

5.4
Control of the third expectation
Since V_0^m has a χ² distribution with D_m degrees of freedom, one can use Lemmas 8.3 and 8.4 to bound E[|V_0^Ĵ − D_Ĵ| 1_{Ĵ<J̄}] in the following way:
E[|V_0^Ĵ − D_Ĵ| 1_{Ĵ<J̄}] ≤ Σ_{m=1}^{J̄−1} E[|V_0^m − D_m| 1_{Ĵ=m}] ≤ Σ_{m=1}^{J̄−1} E[|V_0^m − D_m|^p]^{1/p} P[Ĵ = m]^{1−1/p}.
5.5
Control of the fourth expectation
In order to control E[V_J̄^Ĵ 1_{Ĵ>J̄}], we introduce an increasing sequence (I_k)_{k≥0} of elements of M, starting with I_0 = J̄, to be defined below. One can therefore write, using the monotonicity of the sequence (V_0^m)_{m≥0}, that
E[V_J̄^Ĵ 1_{Ĵ>J̄}] ≤ Σ_{k≥0} E[V_J̄^{I_{k+1}} 1_{I_k<Ĵ≤I_{k+1}}].
Since V_J̄^{I_k} has a χ²(D_{I_k} − D_J̄) distribution and p ≥ 2, it follows from Lemma 8.4 that E[(V_J̄^{I_k})^p] ≤ (D_{I_k} − D_J̄ + p − 1)^p for k ≥ 0, and we then derive from the Hölder inequality and (3.16) that
(5.26) E[V_J̄^Ĵ 1_{Ĵ>J̄}] ≤ Σ_{k≥0} (D_{I_{k+1}} − D_J̄ + p − 1) P[Ĵ > I_k]^{1−1/p} ≤ Σ_{k≥0} (D_{I_{k+1}} − D_J̄ + p − 1) Σ_{I_k}^{1−1/p}.
We now have to specify how we choose I_k for k ≥ 1. Let us introduce
K = inf{j | B_j^∞ ≤ γ(D_j − p + 2)},
which exists since D_m tends to infinity with m, and define
I_1 = argmin{B_j^∞ + γD_j | j ≥ (J̄ + 1) ∨ K}  and  I_{k+1} = argmin{B_j^∞ + γD_j | j ≥ I_k + 1} for k ≥ 1.
Now, if I_k ≥ K − 1, according to the definitions of I_{k+1} and K,
B_{I_{k+1}}^∞ + γD_{I_{k+1}} ≤ B_{I_k+1}^∞ + γD_{I_k+1} ≤ γ(2D_{I_k+1} − p + 2),
and therefore
(5.27) D_{I_{k+1}} + p − 1 ≤ 2D_{I_k+1} ≤ 2δ_J̄ D_{I_k}  with  δ_J̄ = sup_{m≥J̄} D_{m+1}/D_m.
This inequality holds for all k ≥ 0 if I_0 = J̄ ≥ K − 1. Otherwise K − 1 > J̄ ≥ 0 and (5.27) only holds for k ≥ 1. Nevertheless, the same arguments then show that B_{I_1}^∞ + γD_{I_1} ≤ γ(2D_K − p + 2). Therefore, using the definition of K, one gets
D_{I_1} + p − 1 ≤ 2D_K ≤ 2δ_J̄ D_{K−1} ≤ 2δ_J̄ (γ^{−1} B_{K−1}^∞ + p − 2) ≤ 2δ_J̄ (γ^{−1} B_J̄^∞ + p − 2).
Therefore in both cases D_{I_1} + p − 1 ≤ 2δ_J̄ [(γ^{−1} B_J̄^∞ + p − 2) ∨ D_J̄]. It then follows from (5.26) that
E[V_J̄^Ĵ 1_{Ĵ>J̄}] ≤ 2δ_J̄ [(γ^{−1} B_J̄^∞ + p − 2) ∨ D_J̄] Σ_J̄^{1−1/p} + 2δ_J̄ γ^{−1}(p − 1) Σ_{k≥1} D_{I_k} Σ_{I_k}^{1−1/p}.
On the other hand, it follows from (2.9) that γD_J̄ + B_J̄^∞ ≤ (1 ∨ γ)(D_m̄ + B_m̄^∞), and finally from the definition of m̄ that
(5.30) E[||ŝ − s||²] ≤ C_J̄ inf_{m∈M} E[||ŝ_m − s||²],
where the constant C_J̄ remains bounded as long as the sequences (Σ_j)_{j≥1} and (δ_j)_{j≥1} are bounded and Σ_{m≥0} D_m Σ_m^{1−1/p} < +∞. This completes the proof of Theorem 4.1. □

6
Asymptotic optimality of a modified Lepski's method
Our purpose is now to understand what is going on when J̄ goes to infinity. Since D_J̄ ≥ J̄ and Σ_{m≥0} D_m Σ_m^{1−1/p} < +∞, all the terms in C_J̄ then converge to zero, except for the first one, which involves a_J̄, defined by (5.24). The assumption (4.18) only implies that a_J̄ is bounded, but one could strengthen it to
(6.31) limsup_{m→+∞} D_m^{−1} ( sup_{1≤j<m} K_{m,j} ) = 0.
This is clearly compatible with (4.19), and one can choose the numbers K_{m,j} in such a way that they satisfy (3.12) together with sup_{1≤j<m} K_{m,j} = O(D_m/log m), which implies (6.31). It is now easy to prove the following corollary.

Corollary 6.1 Let us assume that M is infinite, as well as the set {m ∈ M | ∫ s ψ_m ≠ 0}. Choose the parameters λ_m and K_{m,j} in order to satisfy the assumptions of Section 4 with (4.18) replaced by (6.31), and define the estimator ŝ as before. Then
(6.32) limsup_{ε→0} ( E[||ŝ − s||²] / inf_{m∈M} E[||ŝ_m − s||²] ) ≤ 1 ∨ γ^{−1}.

Proof. We already noticed that all the terms in C_J̄, as given by (5.29), converge to zero when J̄ goes to infinity, except for the first one, which tends to 1 ∨ γ^{−1} since a_J̄ → 0 by (6.31). We then remark that our assumption on s implies that ε²B_m^∞ is bounded away from 0 independently of ε, whatever m ∈ M. Then E[||ŝ_m − s||²] = ε²(B_m^∞ + D_m) remains bounded away from 0 for fixed m when ε → 0, while it can be made arbitrarily small provided that both ε and m are suitably chosen. This implies that J̄ → +∞ when ε → 0, and therefore C_J̄ → 1 ∨ γ^{−1} when ε → 0. The conclusion then follows from (5.30). □

One should observe here that (6.31) rules out Lepski's initial choice of the parameters K_{m,j}, for which (4.20) holds.
7 Conclusion

In the framework we have chosen here, an older and very popular method for choosing an optimal estimator in our family is Mallows' C_p, which actually amounts to choosing ŝ = ŝ_Ĵ with
(7.33) Ĵ = argmin{−||ŝ_j||² + 2ε²D_j, j ∈ M} = inf{j ∈ M | ||ŝ_m − ŝ_j||² ≤ 2ε²(D_m − D_j) for all m > j},
since ||ŝ_m − ŝ_j||² = ||ŝ_m||² − ||ŝ_j||² for m ≥ j. One should observe that it is also the estimator derived from our extension of Lepski's method with
0 < γ < 1 and 2K_{m,j} = (1 − γ)(D_m − D_j). Unfortunately, such a choice of K_{m,j} does not always satisfy (3.12) when j = m − 1 and m is large, since λ_m goes to infinity with m while D_m − D_{m−1} may remain bounded. Nevertheless, (3.12) will be satisfied with λ_m = a log D_m, as in Proposition 4.1, provided that D_m ≥ D_{m−1} + c log D_m for some large enough c. In any case, it has been proved in Shibata (1981) that the estimator ŝ = ŝ_Ĵ with Ĵ given by (7.33) satisfies (6.32) with γ = 1, and by Birgé and Massart (1999) that it also satisfies (4.21).

As to the consequences of Theorem 4.1, they have been developed at length in Birgé and Massart (1999), where an analogue of this result has been proved for penalized estimators. We therefore refer the interested reader to this paper for applications of this result, just mentioning here the following one. Assume that M = ℕ* and that D_m = m, which implies that S_m is the linear span of {ψ_1, ..., ψ_m}. Given a nonincreasing sequence a = (a_m)_{m≥1} of numbers in [0, +∞] such that a_1 > 0 and a_m → 0 when m → +∞, we denote by E(a) the ellipsoid defined by
E(a) = { s = Σ_{j≥1} β_j ψ_j | Σ_{j≥1} β_j²/a_j² ≤ 1 },
with the convention that 0/0 = 0, x/0 = +∞ and x/(+∞) = 0 for x > 0. Let ŝ be any estimator satisfying (4.21); then it follows from Section 7.2 of Birgé and Massart (1999) that ŝ is minimax, up to constants, over all such ellipsoids. More precisely, there exists some constant K such that, whatever the sequence a satisfying the above requirements,
sup_{s∈E(a)} E[||ŝ − s||²] ≤ K [1 ∨ (ε/a_1)²] inf_{s̃} sup_{s∈E(a)} E[||s̃ − s||²],
where the infimum is taken over all possible estimators.
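The two expressions in (7.33) coincide for any data, given the first-minimizer convention (2.6) and the nestedness identity ||ŝ_m − ŝ_j||² = ||ŝ_m||² − ||ŝ_j||². A quick deterministic check (our sketch, with D_j = j and arbitrary assumed coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
beta_hat = rng.standard_normal(50) * np.exp(-0.05 * np.arange(50))  # assumed observed coefficients
eps = 0.3
csum = np.concatenate([[0.0], np.cumsum(beta_hat ** 2)])            # ||s_j||^2 = csum[j]

# Mallows' C_p as an argmin, first formulation of (7.33); np.argmin picks the first minimizer, cf. (2.6)
crit = -csum[1:] + 2 * eps ** 2 * np.arange(1, 51)
J_argmin = int(np.argmin(crit)) + 1

# second formulation of (7.33): pairwise comparisons between nested projection estimators
J_inf = next(
    j for j in range(1, 51)
    if all(csum[m] - csum[j] <= 2 * eps ** 2 * (m - j) for m in range(j + 1, 51))
)
```

The agreement of `J_argmin` and `J_inf` is exact, not approximate, because the comparison form is just the argmin unrolled.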
8 Appendix

The following lemma is a generalization of Lemma 1 of Laurent and Massart (1998).
Lemma 8.1 Let X be a noncentral χ² variable with D degrees of freedom and noncentrality parameter B^{1/2} ≥ 0. Then for all x > 0,
(8.34) P[X ≥ (D + B) + 2((D + 2B)x)^{1/2} + 2x] ≤ exp(−x),
and
(8.35) P[X ≤ (D + B) − 2((D + 2B)x)^{1/2}] ≤ exp(−x).
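Before turning to the proof, the tail bounds (8.34) and (8.35) can be checked by simulation. The sketch below (ours, with assumed values of D, B and x) uses the representation X = (B^{1/2} + U)² + V from the proof:

```python
import numpy as np

rng = np.random.default_rng(4)
D, B, x = 5, 4.0, 1.5                       # assumed parameters for the check
n = 200_000
U = rng.standard_normal(n)                  # N(0,1)
V = rng.chisquare(D - 1, n)                 # chi^2(D - 1)
X = (np.sqrt(B) + U) ** 2 + V               # noncentral chi^2 with D d.f., noncentrality B

upper = (D + B) + 2 * np.sqrt((D + 2 * B) * x) + 2 * x   # threshold in (8.34)
lower = (D + B) - 2 * np.sqrt((D + 2 * B) * x)           # threshold in (8.35)
p_upper = float(np.mean(X >= upper))
p_lower = float(np.mean(X <= lower))
bound = float(np.exp(-x))
```

Both empirical tail probabilities fall below e^{−x} with room to spare; the lemma's bounds are not tight, but they hold for every D and B simultaneously, which is what the proof of Theorem 4.1 uses.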
Proof. Since we can write X as (B^{1/2} + U)² + V, where U and V are independent with respective distributions N(0,1) and χ²(D − 1), the Laplace transform of X can be written as
E[e^{tX}] = (1 − 2t)^{−D/2} exp[Bt/(1 − 2t)]  for t < 1/2,
which implies that, for 0 ≤ t < 1/2,
(8.36) log E[e^{t(X−D−B)}] = −(D/2) log(1 − 2t) − t(D + B) + Bt/(1 − 2t) ≤ (D + 2B) t²/(1 − 2t).
The standard Chernoff argument applied to (8.36) then yields, for all x > 0,
P[X ≥ (D + B) + 2((D + 2B)x)^{1/2} + 2x] ≤ exp(−x),
and the same bound used with t < 0 gives, for all x > 0,
P[X ≤ (D + B) − 2((D + 2B)x)^{1/2}] ≤ exp(−x).
The one-dimensional case is of course just the standard fluctuation of a random walk. One sees that the behavior of the random field in higher dimensions is quite different from that of the standard one-dimensional random walk. Properties of the above type have been proved for more general cases, and not just for the harmonic one; see for instance [3]. It should be emphasized that although the random surface is localized for d ≥ 3, in the sense that the fluctuations are of order 1 (which is in striking contrast to the one-dimensional situation), the surface continues to have long range correlations: from the random walk representation it is evident that for points i, j which are not close to the boundary ∂V_n (i, j ∈ ½V_n, say), the covariances E_n(X_iX_j) are of order |i − j|^{−d+2}. For dimensions d ≥ 3 it is easy to see, and well known, that there exists a limit P_∞, a measure on ℝ^{ℤ^d}, which is just the centered Gaussian measure whose covariances are given by the Green's function of the discrete Laplacian, G(i, j) = E_i^{rw}(Σ_{k=0}^{∞} 1_{η_k=j}). An important development in recent years led to the discovery that more general gradient fields with Φ convex also possess random walk representations; see [18], [23], [9]. However, the random walks which have to be used in these cases are random walks in random environments, where the random environment is generated by an auxiliary diffusion process on ℝ^{V_n}. Although many important properties have been extended in this way from the Gaussian case to more general ones, the fine properties of this more complicated random walk are difficult to discuss and many questions remain open. We will not pursue that line here. In mathematical physics, one has investigated questions about such surfaces which are quite natural in this context, but which have not attracted much attention in the standard literature on random walks.
Some of these problems center around the interaction of the surface with {x ∈ ℝ^{V_n} : x_i = 0 for all i ∈ V_n}, which is sometimes called a "wall". For instance, one asks how the properties of the surface change if there is a small attraction to this wall, or if one considers only random surfaces lying on one side of the wall. Often there appear qualitative transitions of the "macroscopic" behavior when some of the parameters are changed smoothly. Examples are so-called "wetting transitions", where the surface, at specific values of the parameters, ceases to cling to the wall. We briefly discuss the wetting transition in the last section.
Pinned Lattice Free Field
There is a survey paper by Michael Fisher [14], his lectures on the occasion of the Boltzmann prize, where he introduced many of these problems and discussed them for the random walk case. During the rest of this paper, we entirely stick to the harmonic case Φ(x) = x²/2.

A relatively simple question is how the random surface is influenced by the presence of a local attraction to the above wall. There are several ways to build in such an attraction. The standard one is to modify the Hamiltonian H_n by adding a potential Σ_{i∈V_n} V(x_i), where V is a function which is symmetric and has its minimum at 0. If the potential V itself is quadratic, V(x) = (μ/2)x², μ > 0, say, we arrive at what in physical jargon is called the massive free field. We define the so-called "massive Hamiltonian":
(3) H_n^{(μ)}(x) = (1/8d) Σ_{i,j∈V_n∪∂V_n, |i−j|=1} (x_i − x_j)² + (μ/2) Σ_{i∈V_n} x_i²,
and then the probability measure
(4) P_{n,μ}(dx) = (1/Z_{n,μ}) exp[−H_n^{(μ)}(x)] Π_{i∈V_n} dx_i,
where
(5) Z_{n,μ} = ∫ exp[−H_n^{(μ)}(x)] Π_{i∈V_n} dx_i.
P_{n,μ} is still Gaussian. The random walk representation needs only a simple modification: we have to replace the standard random walk (η_k)_{k≥0} with one having a constant death rate. More precisely, the random walk has probability μ/(1 + μ) of disappearing into a "graveyard" at every step it makes. We can also formalize that by introducing a geometrically distributed random variable ζ, and replacing τ_{V_n} in the formula (2) by τ_{V_n} ∧ ζ. From this, one easily checks that this massive field has exponentially decaying correlations, uniformly in n. Considerably more delicate is the case where V is flat at infinity, and does not grow. For instance, one can consider
(6) e^{−V(x)} = 1 + ε 1_{[−a,a]}(x),  ε, a > 0,
in which case the "pinned" probability measure is no longer Gaussian:
(7) P̂_n(dx) = (1/Ẑ_n) exp[ −(1/8d) Σ_{i,j∈V_n∪∂V_n, |i−j|=1} (x_i − x_j)² ] Π_{i∈V_n} e^{−V(x_i)} dx_i.
Erwin Bolthausen and David Brydges
(We will always write P̂ for such a locally pinned measure.) It is much less clear, but true, that also in the case of a local pinning the field is localized in a strong sense, meaning that
(8) sup_n Ê_n(X_i²) < ∞,
and there exist constants c, C > 0, depending on ε, a, such that
(9) sup_n Ê_n(X_iX_j) ≤ C exp(−c |i − j|),
uniformly in the μ → 0 limit. For us, the "mass" parameter μ serves only to replace the 0-boundary condition, which can no longer be implemented in the periodic case, and we want to have results which are uniform in μ after taking the thermodynamic limit n → ∞. The random walk for which (2) is correct is then simply a random walk on the torus with killing rate μ/(1 + μ). We can then perform the thermodynamic limit n → ∞ and obtain a measure P_{∞,μ} on ℝ^{ℤ²}. This measure evidently is a translation invariant centered Gaussian measure with exponentially decaying correlations, where the decay depends on the parameter μ and disappears when μ → 0, and where lim_{μ→0} Var_{∞,μ}(X_0) = ∞. What we will prove here is that if we introduce an additional pinning which acts only locally, as described in Section 1, we get localization, i.e. estimates (8) and (9), which are uniform in μ. We stick to the two-dimensional case, which is the most delicate. (In three and higher dimensions there is actually a simple domination argument, as has been remarked by Dima Ioffe, which does not work in the two-dimensional case.) We however change the pinning slightly, to make it purely local. Our measures will be
(10) P_{n,μ,J}(dx) = (1/Z_{n,μ,J}) exp[−H_μ(x)] Π_{i∈T_n} (dx_i + e^J δ_0(dx_i)),
where
(11) Z_{n,μ,J} = ∫ e^{−H_μ(x)} Π_{i∈T_n} (dx_i + e^J δ_0(dx_i)).
The parameter J ∈ ℝ regulates the strength of the pinning. The interesting case is when exp(J) is small, which means that the pinning is weak.

Theorem 2.1 Assume d = 2. For all J ∈ ℝ the field is localized in the sense that
(12) sup_{μ>0} limsup_{n→∞} E_{n,μ,J} |X_0|² < ∞,
and
(13) sup_{μ>0} limsup_{n→∞} E_{n,μ,J}(X_i X_j) ≤ C_J exp[−c_J |i − j|]
for some positive constants C_J, c_J. Furthermore, for sufficiently large negative J,
(14) sup_{μ>0} limsup_{n→∞} E_{n,μ,J} |X_0|² ≤ K |J|
for some positive constant K.
3 Random Walk Expansion
The main advantage of using the "delta-pinned" measure (10) is that it has a more elementary random walk representation of the correlations than in the case of a general (symmetric) V in (7). It would however be only technically a bit more complicated to consider, for instance, the case (6) in combination with the random walk representation of [5]. On the other hand, it should be emphasized that the periodic boundary conditions are crucial for applying the chessboard estimates we are using here. For A ⊂ T_n and A^c = T_n \ A, let
P_{A,μ}(dx) = (1/Z_{A,μ}) exp[−H_μ(x)] Π_{k∈A} dx_k Π_{k∈A^c} δ_0(dx_k).
This is a probability measure on ℝ^{T_n}, but restricted to ℝ^A it is just the free field on A. In particular, we have the random walk expansions for the covariances under P_{A,μ} exactly as in (2), where we only have to replace τ_{V_n} by τ_A ∧ ζ = min(τ_A, ζ), ζ being the geometrically distributed killing coming from the positivity of μ. The measure we are interested in, namely P_{n,μ,J}, can now easily be expanded in terms of these Gaussian measures: expanding the product in (10) we get
(15) Z_{n,μ,J} = Σ_{A⊂T_n} e^{J|A^c|} Z_{A,μ}.
The covariance is therefore
E_{n,μ,J}(X_i X_j) = Σ_{A⊂T_n} (e^{J|A^c|} Z_{A,μ} / Z_{n,μ,J}) E_{A,μ}(X_i X_j).
We insert the random walk expansion for the Gaussian expectation
E_{A,μ}(X_i X_j) = E_i^{RW}[ Σ_{k=0}^{∞} 1_{η_k=j} 1_{τ_A∧ζ>k} ].
Resumming over A under the random walk expectation, noting that the constraint 1_{τ_A∧ζ>k} is the same as requiring A^c to be disjoint from the range of the walk η_{[0,k]} and the walk not being killed by the clock, we get
(16) E_{n,μ,J}(X_i X_j) = Σ_{k=0}^{∞} E_i^{RW}[ 1_{η_k=j} 1_{ζ>k} F(η_{[0,k]}) ],
where, for any set B ⊂ T_n,
F(B) = (1/Z_{n,μ,J}) ∫ e^{−H_μ(x)} Π_{k∈B} dx_k Π_{k∉B} (dx_k + e^J δ_0(dx_k)).
This is the random walk representation. It is essentially a special case of [5, Theorem 2.2], which applies to any even potential V. From this expression one also sees that the variables are always positively correlated.
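The killed-walk representation underlying the discussion above can be verified numerically for the massive field itself. In the sketch below (ours; the torus size n and the mass μ are arbitrary assumptions), the covariance matrix [(1+μ)I − P]^{-1} coming from a quadratic Hamiltonian of type (3) is matched against the Neumann series Σ_k P^k/(1+μ)^{k+1}, i.e. the Green's function of a simple random walk killed with probability μ/(1+μ) per step:

```python
import numpy as np

def torus_adjacency(n):
    """Nearest-neighbour adjacency matrix of the two-dimensional torus T_n."""
    N = n * n
    A = np.zeros((N, N))
    for x in range(n):
        for y in range(n):
            i = x * n + y
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                A[i, ((x + dx) % n) * n + (y + dy) % n] = 1.0
    return A

n, mu = 10, 0.5
P = torus_adjacency(n) / 4.0                       # random walk transition kernel, 2d = 4
C = np.linalg.inv((1.0 + mu) * np.eye(n * n) - P)  # covariance of the massive field on the torus
# Green's function of the walk killed with probability mu/(1+mu) at each step:
S, Pk = np.zeros_like(C), np.eye(n * n)
for k in range(200):
    S += Pk / (1.0 + mu) ** (k + 1)
    Pk = Pk @ P
match = bool(np.allclose(C, S, atol=1e-8))
far = (n // 2) * n + n // 2                        # site at maximal torus distance from 0
```

`match` comes out True, and C[0, far] is much smaller than C[0, 0], reflecting the exponential decay of correlations of the massive field; the decay rate degenerates as μ → 0, which is exactly why the uniformity in μ claimed in Theorem 2.1 is nontrivial.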
4 Reduction to a pressure estimate
We define
(17) δ_J = inf_{μ>0} liminf_{n→∞} (1/|T_n|) log(Z_{n,μ,J}/Z_{n,μ}),
where Z_{n,μ} has been defined in (5). We prove in the next section the following

Proposition 4.1 For all J ∈ ℝ, we have δ_J > 0 and
liminf_{J→−∞} (1/J) log δ_J ≥ 1.
As a special case of the estimates in Section 4 of [5], it follows that δ_J > 0 implies that the variance of X_0 is bounded in the thermodynamic limit. Exponential decay of the covariance is also an easy consequence. For the convenience of the reader, we prove here how the theorem follows from this proposition, in particular as the argument is simpler in our "delta-pinning" case than for a potential V of the type (6). Following [5], we use Osterwalder-Schrader positivity in the form of the chessboard estimate [17], [15], [24], [16],
so that from (16) we get the estimate
(19)
$$a(0,0) \;\le\; c\,\Big[\delta_J \sum_{N=1}^{\infty} \log(\delta_J N)\, e^{-\delta_J (N-1)} \;-\; \delta_J \log \delta_J\Big].$$
By the proposition, δ_J > 0, while δ_J → 0 for J → −∞. As δ_J → 0, the two sums on the right-hand side of the estimate are Riemann sum approximations to integrals of the form
$$\int_0^{\infty} \log t\; e^{-t}\, dt,$$
and therefore
a(0,0) < ∞. This together with (18) proves (14) of the theorem. The decay of the correlations (13) follows by a simple modification. By translation invariance, we only have to consider a(j,0). Then in the above estimates (19) we can restrict the summation over N to N ≥ |j_1| + |j_2|, where j = (j_1, j_2). Therefore we get
$$a(j,0) \;\le\; c\,\Big[\delta_J \sum_{N=|j_1|+|j_2|}^{\infty} \log(\delta_J N)\, e^{-\delta_J (N-1)} \;-\; \delta_J \log \delta_J\Big],$$
from which (13) is immediate.
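The Riemann-sum limit invoked above can be made explicit. Assuming the sums have the stated form, for δ ↓ 0,

$$\delta \sum_{N=1}^{\infty} \log(\delta N)\, e^{-\delta N} \;\longrightarrow\; \int_0^{\infty} \log t\; e^{-t}\, dt \;=\; \Gamma'(1) \;=\; -\gamma,$$

since the left-hand side is a Riemann sum of mesh δ for log t · e^{−t}, which is integrable at t = 0 and decays exponentially at infinity (γ denotes Euler's constant). In particular, the sums remain bounded as δ_J → 0, which is all the argument needs.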
5 Estimates on δ_J: proof of the proposition
We finish the proof of our main theorem by proving the proposition of the last section. In the sequel, c > 0 is a generic constant, not necessarily the same at different occurrences. We subdivide T_n into boxes B of side length 2L, where for convenience we assume that L divides n. The partition function Z_{n,μ,J} is expanded as in (15). A lower bound for Z_{n,μ,J} is then obtained by restricting the sum over A in the expansion to a special class 𝔄_c of sets, defined by: A ∈ 𝔄_c if, for every box B, the set B ∩ A^c contains exactly one lattice point and this point lies within distance L/2 of the center of B. Thus
$$Z_{n,\mu,J} \;\ge\; \Big(\frac{L^2}{2}\Big)^{(n/L)^2}\, e^{J (n/L)^2}\, \inf_{A\in\mathfrak{A}_c} Z_{A,\mu}.$$
The proof is now easily finished using the following result:
Lemma 5.1 There exists L_0 ∈ N and, depending on μ > 0, an n_0(μ) ∈ N, such that for n ≥ n_0(μ) and L ≥ L_0,
$$\inf_{A\in\mathfrak{A}_c} \frac{Z_{A,\mu}}{Z_{n,\mu}} \;\ge\; e^{-(n/L)^2\,(\log\log L + c)}.$$
We postpone the proof of this lemma for a moment, and proceed with the proof of the proposition. Evidently, the right-hand side in the above lemma is μ-independent, but we have to remember that we have the restriction n ≥ n_0(μ), which does not bother us, as the claimed μ-uniformity is required only after taking the thermodynamic limit. From the lemma we get
$$\delta_J \;\ge\; L^{-2}\,\big(J - \log\log L - c + \log L^2\big).$$
The key point is that the entropy of the sets 𝔄_c has given rise to the last term, which dominates at large L; so, optimizing over L, with L² = e^{−J+o(|J|)} as J → −∞, we achieve a strictly positive
$$\delta_J \;\ge\; e^{J+o(|J|)},$$
and this implies the statement in the proposition.
Proof of Lemma 5.1 By definition of 𝔄_c, every box B contains exactly one point, call it k, which is not in B ∩ A. In the ratio Z_{B∩A,μ}/Z_{B,μ} of partition functions below, we first integrate out all the variables except x = x_k, which leads to a Gaussian law with variance a^{−1} = G_B(k, k), and therefore
$$\frac{Z_{B\cap A,\mu}}{Z_{B,\mu}} \;=\; \big(2\pi\, G_B(k,k)\big)^{-1/2} \;\ge\; e^{-\log\log L - c},$$
where in the last inequality we estimate the Green's function G_B(k, k) using [20, Theorem 1.6.6]. This we do for every box B, and therefore we get
$$\prod_B \frac{Z_{B\cap A,\mu}}{Z_{B,\mu}} \;\ge\; e^{-(n/L)^2\,(\log\log L + c)}.$$
Noting that this is the right-hand side of the expression in Lemma 5.1, we see that it suffices to prove that Q, defined by (20),
satisfies
Q ≤ c (n/L)². These Gaussian partition functions can be expressed in terms of determinants of lattice Laplacians, which in turn can be written via a random walk representation. This is reviewed in detail in [1], Section 4.1. The outcome is that
$$\log \frac{Z_{n,\mu}}{Z_{A,\mu}} \;=\; \sum_{k}\ \sum_{m=1}^{\infty}\ \frac{1}{2m}\; P^k\big(\eta_{2m} = k,\ \tau_A < 2m,\ \zeta > 2m\big).$$
Formally, this is just coming from expanding the logarithm in the identity det[I + A] = exp(Tr log[I + A]) and using a random walk representation for the resulting terms. Implementing the above expression into (20), we arrive at
$$Q \;=\; \sum_{k}\ \sum_{m=1}^{\infty}\ \frac{1}{2m}\; P^k\big(\eta_{2m} = k,\ \tau_B < 2m,\ \zeta > 2m\big),$$
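The determinant identity used here can be checked on eigenvalues (a standard observation, not specific to [1]): if I + A has eigenvalues 1 + λ_i with |λ_i| < 1, then

$$\det(I+A) \;=\; \prod_i (1+\lambda_i) \;=\; \exp\Big(\sum_i \log(1+\lambda_i)\Big) \;=\; \exp\big(\operatorname{Tr}\log(I+A)\big),$$

and expanding log(1+λ) = Σ_{m≥1} (−1)^{m+1} λ^m/m termwise turns Tr log(I + A) into a sum over closed walks, one power of A per step.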
because the partition function Z_{A,μ}/Z_{n,μ} involves the sum over all paths in T_n that leave A. Amongst these are paths that stay inside some box B but leave A at the single point in B ∩ A^c. These are divided out by the denominator in Q, so we are left with paths that exit A and whichever box B they started in. We can replace the random walk on the torus by the free random walk, making an error in the above expression of order n² exp[−cμn] Σ_{m=1}^∞ P(ζ > 2m)/2m, which for any μ > 0 is at most 1 if n is large enough, n ≥ n_0(μ) say, and can therefore be neglected. After having made this replacement, we write Q = Q_≤ + Q_> corresponding to the ranges m ≤ L² and m > L² of the summation index. Thus
$$Q_> \;=\; \sum_{k\in T_n}\ \sum_{m > L^2}\ \frac{1}{2m}\; P^k\big(\eta_{2m} = k,\ \tau_B < 2m,\ \zeta > 2m\big),$$
The passage to increasing dimension (N → ∞) allows for a substantial reduction to the class of tests of a specific structure, viz., tests with ellipsoidal acceptance regions, provided the priors satisfy a certain uniform negligibility condition. To explain the nature of the result, we state in this section a corollary to the main theorem having a more transparent form. Thus we observe the random vector
(1.1)
X = (X_1, …, X_N)
having the normal distribution N(μ_N, I_N), with μ_N = (μ_{N1}, …, μ_{NN}) ∈ R^N and I_N the N×N identity matrix. We test the hypothesis H_{N0}: μ_N = 0 against H_{N1}: μ_N ≠ 0. In the Bayesian setup we assume that μ_N under H_{N1} has a prior distribution π_N, which is the product of N coordinate distributions,
(1.2) π_N(dμ_N) = ×_{i=1}^N π_{Ni}(dμ_{Ni}),
so that μ_N is a random vector with independent components having distributions π_{N1}, …, π_{NN}. We assume throughout that π_{Ni}, i = 1, …, N, are symmetric about the origin. By (1.1), for a given μ_N the distribution of X has Lebesgue density
(1.3)
$$p_{N,\mu_N}(x) = \prod_{i=1}^{N} \varphi(x_i - \mu_{Ni}),$$
D. M. Chibisov
where φ(·) denotes the density of the standard normal distribution. We denote this distribution by P_{N,μ} and the corresponding expectation by E_{N,μ}. In particular, the distribution of X under H_{N0} has density p_{N,0}(x) = ∏_{i=1}^N φ(x_i). For a test φ_N, write
(1.4) β_N(μ_N; φ_N) = E_{N,μ_N} φ_N(X),
(1.5) p_N(x) = ∫ p_{N,μ_N}(x) π_N(dμ_N),
(1.6) β_N(π_N; φ_N) = ∫ β_N(μ_N; φ_N) π_N(dμ_N).
We will refer to β_N(μ_N; φ_N) given by (1.4) as the power function of the test φ_N and to β_N(π_N; φ_N) given by (1.6) as the average power. For a preassigned size α, the Bayes test maximizing β_N(π_N; φ_N) over size-α tests φ_N rejects H_{N0} for large values of the likelihood ratio (LR)
(1.7) h_N(x) = p_N(x)/p_{N,0}(x) = ∫ ∏_{i=1}^N [φ(x_i − μ_{Ni})/φ(x_i)] π_N(dμ_N).
More precisely, the level-α Bayes test has critical function (1.8) with
ψ_N(x) = 1, 0, or γ_N according as h_N(x) > c_N, h_N(x) < c_N, or h_N(x) = c_N, with c_N and γ_N defined so that E_{N,0} ψ_N(X) = α. The level α > 0 will be kept fixed as N → ∞. In Theorem 2.4 we state conditions on the priors π_N under which the LR h_N is asymptotically approximated by
(1.9)
$$g_N(x) = \exp\Big[\frac12 \sum_{i=1}^{N} b_{Ni}^2\,(x_i^2 - 1) - \frac14 B_N\Big],$$
where b²_{Ni} > 0 are certain characteristics of π_{Ni} and B_N = Σ_{i=1}^N b⁴_{Ni} (which is assumed to be bounded as N → ∞). Namely, g_N approximates h_N in L₁-norm w.r.t. P_{N,0}, i.e.,
(1.10)
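A one-coordinate sanity check (an illustration, not taken from the text): for a normal prior N(0, b²) the likelihood ratio is explicit,

$$h(x) = \int \frac{\varphi(x-\mu)}{\varphi(x)}\, dN(0,b^2)(\mu) = (1+b^2)^{-1/2}\exp\Big(\frac{b^2 x^2}{2(1+b^2)}\Big),$$

and a Taylor expansion in b² gives log h(x) = ½b²(x² − 1) − ¼b⁴(2x² − 1) + O(b⁶), whose leading terms have exactly the shape of (1.9); under the null, E(2X² − 1) = 1, matching the −¼B_N normalization.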
$$E_{N,0}\,|h_N - g_N| \;\to\; 0 \quad\text{as}\quad N \to \infty.$$
Quadratic Statistics
It follows from (1.10) that the test ψ^g_N(x), defined for g_N similarly to (1.8), has asymptotically the same average power as the Bayes test ψ_N(x), i.e.,
β_N(π_N; ψ^g_N) − β_N(π_N; ψ_N) → 0 as N → ∞.
To illustrate Theorem 2.4 we state here a special case. Suppose that the distributions π_{Ni} in (1.2) are scale transforms of one and the same distribution π on R with scale factors b_{Ni}:
Π_{Ni}(μ) = Π(μ/b_{Ni}), i = 1, …, N,
where Π_{Ni}(μ) and Π(μ), μ ∈ R, denote the distribution functions corresponding to π_{Ni} and π. Let π and {b_{Ni}} satisfy the following conditions:
(Π1) π is symmetric, i.e., Π(μ) = 1 − Π(−μ), μ ∈ R;
(Π2) ∫ μ² π(dμ) = 1, ∫ μ⁴ π(dμ) < ∞;
(B1) b_{Ni} > 0, b_{N,max} := max_{1≤i≤N} b_{Ni} → 0 as N → ∞;
(B2) Σ_{i=1}^N b⁴_{Ni} → B > 0 as N → ∞.
Note that the first condition in (Π2) is merely a normalization of π, under which b²_{Ni} is the variance of μ_{Ni}.
Corollary 1.1 Let Conditions (Π1), (Π2), (B1), (B2) be fulfilled and let g_N be defined by (1.9) with B_N = B. Then (1.10) holds.
Consider the particular case where b_{N1} = ⋯ = b_{NN}. Obviously, (B1) and (B2) are satisfied for b_{Ni} = (B/N)^{1/4}, i = 1, …, N. Then Corollary 1.1 says that, under the independence assumption on the components of μ_N, Conditions (Π1) and (Π2) are sufficient for the Bayes test to be asymptotically chi-squared. It is well known that for any spherically symmetric prior distribution the Bayes test is exactly chi-squared. Under the independence assumption, spherical symmetry holds only for normal π. Corollary 1.1 says, however, that Bayes tests become approximately chi-squared for large dimension under an arbitrary symmetric π, unless π is too heavy-tailed (the second condition in (Π2)). Note that in the setup of Corollary 1.1, when the prior distribution has independent symmetric components differing only by scale factors, (Π2), (B1), and (B2) are exactly conditions for asymptotic normality of Σ_{i=1}^N μ²_{Ni}.
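For the equal-scales case just discussed, the conditions can be verified in one line (with (B2) read, as above, as Σ_i b⁴_{Ni} → B):

$$b_{Ni} = (B/N)^{1/4}: \qquad b_{N,\max} = (B/N)^{1/4} \to 0, \qquad \sum_{i=1}^{N} b_{Ni}^4 = N \cdot \frac{B}{N} = B.$$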
The same is true in the general case (see Remark 2.2). In the literature, Bayes tests in the normal shift model of increasing dimension are used in asymptotically minimax nonparametric hypothesis testing; see Ingster (1993, 1997) and Spokoiny (1998), where further references
can be found. In these studies the original problem of signal detection or goodness of fit reduces, by a suitable orthogonal decomposition, to a testing problem in the normal shift model (possibly infinite-dimensional). Typically, in the minimax setting this is the problem of testing for zero mean against the set of alternatives specified by a "big" ball or ellipsoid in a certain norm, say the ℓ_q-norm, with a "small" ball or ellipsoid in, say, the ℓ_p-norm around the origin removed. The problem is treated asymptotically as the size of these domains varies and/or the common variance of the X_i's tends to zero. For some particular prior distributions used in those papers the asymptotically ellipsoidal form of the Bayes tests was established directly. For example, Ingster (1993) uses "Bernoulli priors" specified by symmetric two-point prior distributions of components. These distributions obviously satisfy Conditions (Π1) and (Π2). The choice of the prior distribution depends on the shape of the parameter set, specifically on the degrees p and q of the norms. If the normal shift model originates from, say, a signal detection problem, these degrees are related, qualitatively, to smoothness properties of the least favorable signals and restrictions on their "energy". In this respect Spokoiny (1998) distinguishes four types of alternative sets. Apparently the type of alternatives treated here fits in one of those classes, viz., that of "smooth" signals. Another type of prior distribution, used by Ingster (1993) and Spokoiny (1998) for other types of alternatives, has three-point component distributions π_{Ni} with masses p_N at the points ±1 (up to scale factors) and mass 1 − 2p_N at 0, with p_N → 0 as N → ∞. Note that the ratio of the fourth moment to the squared variance is here of order 1/p_N → ∞. For this prior distribution the conditions and the conclusion of Theorem 2.4 fail. We state the main Theorem 2.4 in Section 2 and give its proof in Section 3.
Section 4 contains the proofs of auxiliary results and of Corollary 1.1.
2 Main Theorem
Recall that we consider testing the hypothesis H_0: μ_N = 0 based on the observed N-variate random vector X = (X_1, …, X_N) with normal distribution N(μ_N, I_N). Under the alternative, μ_N has the prior distribution (cf. (1.2))
π_N(dμ_N) = ×_{i=1}^N π_{Ni}(dμ_{Ni}).
Thus under this prior {μ_{Ni}} form a triangular array of r.v.'s independent within each row (for each N) with corresponding distributions π_{Ni}, i = 1, …, N.
Assumption ( A l ) . The distributions TΓ^, i = 1,...,JV, JV E N = {1,2,...}, are symmetric, i.e., πjVi(A) = fκ^i{—A) for any Borel set A.
In terms of the corresponding distribution functions this assumption means that Π_{Ni}(μ) = 1 − Π_{Ni}(−μ), μ ∈ R (cf. (Π1) in Section 1). For a > 0, denote
(2.1) γ_{Ni}(a) = 1 − π_{Ni}([−a, a]) = 2 π_{Ni}((a, ∞)).
Assumption (A2). For any a > 0, Σ_{i=1}^N γ_{Ni}(a) → 0 as N → ∞.
For a measure Q and a measurable function f (on the same space) we will write
(2.2) Q(f) = ∫ f(x) Q(dx).
For a > 0, denote by π^{(a)}_{Ni} the measure π_{Ni} restricted to the interval [−a, a]:
(2.3) π^{(a)}_{Ni}(A) = π_{Ni}(A ∩ [−a, a]).
Define the corresponding truncated moments as
(2.4) ν_{k,N,i}(a) = π^{(a)}_{Ni}(μ^k), k = 0, 1, 2, …
Note that, due to the symmetry of π^{(a)}_{Ni} (see (A1) and (2.3)), ν_{k,N,i}(a) = 0 for odd k. Obviously, ν_{k,N,i}(a) for any even k is a nondecreasing function of a.
Lemma 2.1 Under Assumptions (A1), (A2),
(2.5) Σ_{i=1}^N [ν_{k,N,i}(a_2) − ν_{k,N,i}(a_1)] → 0 as N → ∞
for any fixed a_1, a_2 > 0 and any even k > 0.
Proof For 0 < a_1 < a_2 the left-hand side of (2.5) is nonnegative and bounded by a_2^k Σ_i γ_{Ni}(a_1), which tends to zero by (A2). □
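Written out, the estimate behind the proof of Lemma 2.1 is the routine bound

$$0 \le \sum_{i=1}^N \big[\nu_{k,N,i}(a_2) - \nu_{k,N,i}(a_1)\big] = \sum_{i=1}^N \int_{a_1 < |\mu| \le a_2} \mu^k\, \pi_{Ni}(d\mu) \le a_2^k \sum_{i=1}^N \gamma_{Ni}(a_1),$$

valid for even k and 0 < a_1 < a_2, since |μ|^k ≤ a_2^k on the domain of integration and π_{Ni}({a_1 < |μ| ≤ a_2}) ≤ γ_{Ni}(a_1).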
Assumption (A3). For any a > 0,
lim sup_{N→∞} Σ_{i=1}^N ν_{4,N,i}(a) < ∞.
By Lemma 2.1 the requirement "for any a > 0" can be equivalently reduced to the requirement "for some a > 0". Since ν²_{2,N,i}(a) ≤ ν_{4,N,i}(a), Assumption (A3) implies
Corollary 2.2 Under Assumption (A3),
(2.6) lim sup_{N→∞} B_N(a) < ∞
for any a > 0, where
(2.7) B_N(a) := Σ_{i=1}^N ν²_{2,N,i}(a).
Lemma 2.3 Under Assumptions (A1)–(A3), for any fixed a_1, a_2 > 0,
Δ_N := B_N(a_2) − B_N(a_1) → 0 as N → ∞.
The proof of this lemma will be given in Section 4.
Theorem 2.4 Under Assumptions (A1)–(A3),
(2.8) E_{N,0}|h_N(X) − g_N(X; a)| → 0 as N → ∞
for any a > 0, where (see (2.4), (2.7))
(2.9)
$$g_N(x; a) = \exp\Big(\frac12 \sum_{i=1}^{N} \nu_{2,N,i}(a)\,(x_i^2 - 1) - \frac14 B_N(a)\Big).$$
Remark 2.1 The relation (2.8) implies, in particular, that the functions g_N(·; a) for different choices of a approach each other in L₁ norm. This can also be verified directly by using Lemmas 2.1 and 2.3.
Remark 2.2 Assumptions (A1)–(A3) imply asymptotic normality of the sequence Σ_i μ²_{Ni}, with mean Σ_i ν_{2,N,i}(a) and variance Σ_i (ν_{4,N,i}(a) − ν²_{2,N,i}(a)) for any a > 0; see Loève (1960), Section 22.5. In this respect Corollary 1.1 relates to Theorem 2.4 in the same way as Theorem V.1.2 in Hájek and Šidák (1967) relates to the general normal convergence theorem in Loève (1960) mentioned above.
3 Proof of Theorem 2.4
Take an a > 0. Without loss of generality we will assume that the limit B(a) := lim_{N→∞} B_N(a) exists. (Otherwise assume that (2.8) fails, select a subsequence along which the left-hand side of (2.8) stays bounded away from zero, and find by (2.6) a further subsequence along which B_N(a) converges.) The proof relies on the following one-sided version of Scheffé's Lemma (see Chibisov (1992), Lemma 3.1).
Lemma 3.1 Let, for each N ∈ N, the random variables U_N ≥ 0 and V_N ≥ 0 be defined on a probability space (Ω_N, 𝒜_N, P_N). Assume:
(i) E_N U_N → 1, E_N V_N → 1;
(ii) the V_N are uniformly integrable w.r.t. P_N or, equivalently,
E_N[V_N; A_N] := ∫_{A_N} V_N dP_N → 0 whenever P_N(A_N) → 0;
(iii) P_N(U_N < V_N − ε) → 0 for any ε > 0.
Then E_N|U_N − V_N| → 0.
We will apply this lemma with P_N := P_{N,0} = N(0, I_N), U_N := h_N, and V_N := g_N(·; a). Condition (i) for h_N holds by definition (see (1.5), (1.7)), since E_{N,0} h_N = 1. The following lemma will be used to verify Condition (i) for g_N.
Lemma 3.2 For any even k > 0 and any a > 0, max_{1≤i≤N} ν_{k,N,i}(a) → 0 as N → ∞.
generally along subsequences {n_k}_{k=1}^∞ ⊂ N, are closely connected with those of the limiting distributions of the whole untrimmed sums S_{n_k} = S_{n_k}(0,0) = Σ_{j=1}^{n_k} X_j. (Asymptotic distributions for any of the sums here and in the sequel are always meant with suitable centering and norming, and all infinite subsequences of N are assumed unbounded throughout.) Indeed, it was shown in Corollary 6 of Csörgő, Haeusler and Mason (1988a) that S_{n_k} converges in distribution along some {n_k} to a nondegenerate random variable, in other words, F is in the domain of partial attraction of some infinitely divisible distribution, if and only if S_{n_k}(l, m) converges in distribution to nondegenerate random variables for every pair (l, m), along the same {n_k}. The limiting distributions of the latter are certain "trimmed" forms of a special representation of an infinitely divisible random variable, the distribution of which is the limiting distribution of the former; the representation is given in the next section. One may conjecture that it is sufficient to require the distributional convergence of S_{n_k}(l, m) for a single pair (l, m) ∈ N² to achieve the same conclusion for S_{n_k}, and hence also for all (l, m) ∈ N², along the same {n_k}. For the whole sequence {n} = N this was proved by Kesten (1993), in which case the conclusion is that F is in the domain of attraction of a stable law. The general subsequential version is still open.
Semistable Trimmed Sums
Perhaps the most interesting case, the topic of the present note, is that of moderately trimmed sums S_n(l_n, m_n) = Σ_{j=l_n+1}^{n−m_n} X_{j,n}, where
(1.1) l_n → ∞, l_n/n → 0 and m_n → ∞, m_n/n → 0 as n → ∞.
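In concrete terms, S_n(l, m) discards the l smallest and the m largest order statistics and sums the rest; a minimal helper (illustrative only, not code from the paper):

```python
def trimmed_sum(xs, l, m):
    """Compute S_n(l, m): sum the order statistics X_{l+1,n}, ..., X_{n-m,n},
    i.e. drop the l smallest and the m largest sample values."""
    order_stats = sorted(xs)
    return sum(order_stats[l:len(order_stats) - m])
```

For example, trimmed_sum([5, 1, 4, 2, 3], 1, 1) sums 2 + 3 + 4 = 9.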
The first deeper result is due to Csörgő, Horváth and Mason (1986), who proved that if the full sums S_n have a nondegenerate asymptotic distribution along the whole {n} = N, i.e. if F is in the domain of attraction of a (normal or nonnormal) stable law, then with l_n = m_n and suitable centering and norming sequences S_n(m_n, m_n) is asymptotically normal as n → ∞. Csörgő, Haeusler and Mason (1988b) then determined the class of all possible asymptotic distributions for S_n(l_n, m_n) along all possible subsequences {n_k}, together with necessary and sufficient conditions for the convergence in distribution of S_{n_k}(l_{n_k}, m_{n_k}) as k → ∞. To formulate at least the condition for asymptotic normality, define, for 0 < s ≤ 1 − t < 1,
(1.2)
$$\sigma^2(s,t) = \int_s^{1-t}\!\!\int_s^{1-t} \big[\min(u,v) - uv\big]\, dQ(u)\, dQ(v)$$
$$= \Big\{ sQ^2(s) + tQ^2(1-t) + \int_s^{1-t} Q^2(u)\, du \Big\} - \Big\{ sQ(s) + tQ(1-t) + \int_s^{1-t} Q(u)\, du \Big\}^2,$$
a basic function in Csörgő, Haeusler and Mason (1988a,b).
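One way to read the identity above (a standard check, not spelled out in the text): σ²(s, t) is the variance of the Winsorized quantile transform. With U uniform on (0, 1) and

$$Q_{s,t}(u) = \begin{cases} Q(s), & 0 < u \le s,\\ Q(u), & s < u \le 1-t,\\ Q(1-t), & 1-t < u < 1, \end{cases}$$

one has E[Q_{s,t}²(U)] = sQ²(s) + ∫_s^{1−t} Q²(u) du + tQ²(1−t), with the analogous formula for E[Q_{s,t}(U)], so Var(Q_{s,t}(U)) is exactly the second expression for σ²(s, t).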
a basic function in Csδrgo, Haeusler and Mason (1988a,b). For given sequences {ln} and {mn} set (1.3)
an(ln,mn)
= y/nσ[ — ,l \n
), n J
and introduce the two sequences of functions 2 '
ψl,n{X) = < X < OO,
and -oo 0 such that l A-n [Sn{ln-> ran) — Cn] —> Z as n -» oo if and only if (1.4)
lim_{n→∞} ψ_{j,n}(x) = 0 for every x ∈ R, j = 1, 2,
in which case C_n = c_n(l_n, m_n) := n ∫_{l_n/n}^{1−m_n/n} Q(u) du and A_n = a_n(l_n, m_n) work. The subsequential version of this result is also true. If at least one of the functions ψ_{j,n}(·), or one of the renormalized functions a_n(l_n, m_n) ψ_{j,n}(·)/A_n for some A_n > 0 for which a_n(l_n, m_n)/A_n → 0, j = 1, 2, converges to a nonzero function either along the whole {n} or along a subsequence, then extra terms appear in the limiting random variable, so that the asymptotic distribution, typically obtained along a further subsequence, is no longer normal; it does not even have a normal component in the renormalized case. The conditions appearing are optimal; for the precise statements the reader is referred to Csörgő, Haeusler and Mason (1988b, 1991b). Griffin and Pruitt (1989) rederived this theory by a different method, obtaining the conditions and the description of limiting random variables in alternative forms, with numerous additional observations. While the "asymptotic continuity" condition (1.4) solves the problem of asymptotic normality of moderately trimmed sums completely from a general mathematical point of view, its probabilistic meaning is not so clear until it is tied to better understood conditions that govern the asymptotic distribution of the entire untrimmed sums. Indeed, it was pointed out by Csörgő, Haeusler and Mason (1988b) and then by Griffin and Pruitt (1989) that if F is stochastically compact, meaning that the full sums are stochastically compact in the sense that there exist sequences of constants b_n ∈ R and d_n > 0 such that every subsequence of N contains a further subsequence along which [S_n − b_n]/d_n converges in distribution to a nondegenerate random variable, then the sequences of functions {ψ_{j,n}(·)}_{n=1}^∞ are uniformly bounded, j = 1, 2, and hence the sequence S_n*(l_n, m_n) := [S_n(l_n, m_n) − c_n(l_n, m_n)]/a_n(l_n, m_n) of centered and normed trimmed sums is also stochastically compact for any pair (l_n, m_n) of sequences satisfying (1.1).
However, nonnormal subsequential limiting distributions do arise in this case. Thus, to date, the only explicitly determined family of underlying distributions for which S_n*(m_n, m_n) is known to be asymptotically normal along the whole N for every sequence {m_n} satisfying (1.1) is the family of those F that are in the domain of attraction of a stable law [Csörgő, Horváth and Mason (1986)], and the only explicit family for which S_n*(l_n, m_n) is known
to be asymptotically normal for every sequence {(l_n, m_n)} of pairs satisfying (1.1) is the subfamily attracted by not completely asymmetric stable laws [Griffin and Pruitt (1989)]. The question arises whether there is a probabilistically meaningful larger class of distributions, necessarily within the class of stochastically compact distributions, which would respectively contain the families above and for which the same conclusions for the asymptotic normality of trimmed sums would still hold true. A feature of the phenomenon would of course be that the full sums [S_n − b_n]/d_n would no longer converge in distribution themselves along the whole {n} = N. The aim of this paper is to show that a larger class of distributions within the class of stochastically compact distributions does indeed exist with these properties: it is a proper subfamily of the family of distributions in the domain of geometric partial attraction of semistable laws. In the next section we describe this family of distributions, while Section 3 contains the new results and their proofs.
2 Semistable distributions and their domains of geometric partial attraction
Let Ψ be the class of all non-positive, non-decreasing, right-continuous functions ψ(·) defined on the positive half-line (0, ∞) such that ∫_ε^∞ ψ²(s) ds < ∞ for all ε > 0. Let E_1^{(j)}, E_2^{(j)}, …, j = 1, 2, be two independent sequences of independent exponentially distributed random variables with mean 1. With their partial sums Y_n^{(j)} = E_1^{(j)} + ⋯ + E_n^{(j)} as jump points, n ∈ N, consider the standard left-continuous independent Poisson processes N_j(u) := Σ_{n=1}^∞ I(Y_n^{(j)} < u), 0 ≤ u < ∞, j = 1, 2, where I(·) is the indicator function. For a function ψ ∈ Ψ, consider the random variables
(2.1)
$$W_j(\psi) \;:=\; \int_0^{\infty} \big[N_j(s) - s\big]\, d\psi(s), \qquad j = 1, 2,$$
where the first integrals are almost surely well defined, by the condition that ψ ∈ Ψ, as improper Riemann integrals. For ψ_1, ψ_2 ∈ Ψ and σ ≥ 0, set V(ψ_1, ψ_2, σ) := W_1(ψ_1) − W_2(ψ_2) + σZ, with Z a standard normal random variable independent of W_1(ψ_1) and W_2(ψ_2); the characteristic function of V(ψ_1, ψ_2, σ) minus a suitable centering constant θ(ψ_1, ψ_2) is of the Lévy form (2.2), with Lévy functions given by the inversions L(x) = inf{s > 0 : ψ_1(s) > x} for x < 0 and R(x) = −inf{s > 0 : ψ_2(s) > −x} for x > 0. Here L(·) is left-continuous and non-decreasing on (−∞, 0) with L(−∞) = 0, R(·) is right-continuous and non-decreasing on (0, ∞) with R(∞) = 0, and ∫_{−ε}^0 x² dL(x) + ∫_0^ε x² dR(x) < ∞ for every ε > 0 since ψ_1, ψ_2 ∈ Ψ. Thus V(ψ_1, ψ_2, σ) is infinitely divisible by Lévy's formula [see e.g. Gnedenko and Kolmogorov (1954)]. Conversely, given the right side of (2.2) with L(·) and R(·) having the properties just listed, the variable V(ψ_1, ψ_2, σ) has this characteristic function again if we choose ψ_1(s) = inf{x < 0 : L(x) > s} and ψ_2(s) = inf{x < 0 : −R(−x) > s}, s > 0, for then ψ_1, ψ_2 ∈ Ψ. Thus the class I of all nondegenerate infinitely divisible distributions can be identified with the class {(ψ_1, ψ_2, σ) ≠ (0, 0, 0) : ψ_1, ψ_2 ∈ Ψ, σ ≥ 0} of triplets. Then F being in the domain of partial attraction of a G = G_{ψ_1,ψ_2,σ} ∈ I, written F ∈ D_p(G), means that there exists a subsequence {k_n}_{n=1}^∞ ⊂ N and centering and norming constants C_{k_n} ∈ R and A_{k_n} > 0 such that
(2.3) [S_{k_n} − C_{k_n}]/A_{k_n} converges in distribution to V(ψ_1, ψ_2, σ),
where a convergence relation is meant to hold as n → ∞ unless otherwise specified and G_{ψ_1,ψ_2,σ} is the distribution function of the random variable V(ψ_1, ψ_2, σ) from (2.1); the characteristic function of V(ψ_1, ψ_2, σ) − θ(ψ_1, ψ_2) is in (2.2). By classical theory [Gnedenko and Kolmogorov (1954) or Corollary 5* in Csörgő (1990)] this happens for {k_n} = {n} = N if and only if either (ψ_1, ψ_2, σ) = (0, 0, σ) for some σ > 0, in which case F is in the domain of attraction of the normal distribution, written F ∈ D(2), or (ψ_1, ψ_2, σ) = (m_1 ψ^α, m_2 ψ^α, 0) for some constants α ∈ (0, 2), m_1, m_2 ≥ 0, m_1 + m_2 > 0, where ψ^α(s) = −s^{−1/α}, s > 0, in which case F is in the domain of attraction of a stable distribution of exponent α, written F ∈ D(α). (The superscript α in ψ^α, and in ψ_1^α and ψ_2^α beginning with (2.4) below, is meant as a label, not as a power exponent.) The normal being the stable law of exponent 2, let S denote the class of all stable laws. Lévy (1937) introduced the class S* ⊂ I of semistable laws by extending a defining property of stable characteristic functions and, as translated into
the framework of the present description of infinitely divisible laws, showed that G_{ψ_1,ψ_2,σ} ∈ S* if and only if either (ψ_1, ψ_2, σ) = (0, 0, σ) for some σ > 0, giving the normal distribution as a semistable distribution of exponent 2, or (ψ_1, ψ_2, σ) = (ψ_1^α, ψ_2^α, 0), where
(2.4)
$$\psi_j^{\alpha}(s) = M_j(s)\, \psi^{\alpha}(s) = -M_j(s)\, s^{-1/\alpha}, \qquad s > 0, \quad j = 1, 2,$$
for some α ∈ (0, 2), defining a semistable distribution of exponent α ∈ (0, 2), where M_1 and M_2 are nonnegative bounded functions such that M_j(cs) = M_j(s), s > 0, j = 1, 2, for some constant c > 1; the latter property will be referred to as multiplicative periodicity with period c. For α ∈ (0, 2), Lévy's original description of the property in (2.4) in terms of L and R in (2.2) is that there exist nonnegative bounded functions M_L(·) on (−∞, 0) and M_R(·) on (0, ∞), one of which has a strictly positive infimum and the other one either has a strictly positive infimum or is identically zero, such that L(x) = M_L(x)/|x|^α, x < 0, is left-continuous and non-decreasing on (−∞, 0) and R(x) = −M_R(x)/x^α, x > 0, is right-continuous and non-decreasing on (0, ∞), while M_L(c^{1/α}x) = M_L(x) for all x < 0 and M_R(c^{1/α}x) = M_R(x) for all x > 0, for the same period c > 1. Because of the inversions given above, the two descriptions are equivalent. The realization of a tangible significance of S* ⊃ S starts with a remark of Doeblin (1940), without any elaboration or, for that matter, even a precise statement, to the effect that semistable laws arise in the limit in (2.3) if the normalizing constants A_{k_n} satisfy a geometric growth condition. Thirty years later, Shimizu (1970) and Pillai (1971) came close while Kruglov (1972) and Mejzler (1973) fully achieved that realization, all four of them acting independently of one another. It turned out that the following Characterization Theorem is true: If (2.3) holds along a subsequence {k_n} ⊂ N for which
(2.5a) liminf_{n→∞} k_{n+1}/k_n = c for some c ∈ (1, ∞),
then the distribution G_{ψ_1,ψ_2,σ} of V(ψ_1, ψ_2, σ) is in S* such that, in the case when the exponent of G_{ψ_1,ψ_2,σ} = G_{ψ_1^α,ψ_2^α,0} is α < 2, the multiplicative period of the functions M_1 and M_2 in (2.4) is the c from (2.5a). Conversely, for every G_{ψ_1,ψ_2,σ} ∈ S* there exists an F such that if X_1, X_2, … are independent random variables with the common distribution function F, then there exists a subsequence {k_n} ⊂ N such that
(2.5b) lim_{n→∞} k_{n+1}/k_n = c for some c ∈ [1, ∞)
Sándor Csörgő and Zoltán Megyesi
and (2.3) holds along {k_n}. An equivalent version of this theorem, in terms of the Lévy-type description of S*, was proved by Kruglov (1972) and Mejzler (1973), while the present version was obtained by Megyesi (2000) with an independent proof within the framework of the 'probabilistic' or 'quantile-transform' approach of Csörgő, Haeusler and Mason (1988a,b; 1991a,b) and Csörgő (1990) to domains of attraction and partial attraction. For G = G_{ψ_1,ψ_2,σ} ∈ S*, we say that F is in the domain of geometric partial attraction of G with rank c ≥ 1, in short F ∈ D_gp^{(c)}(G), if (2.3) holds along a subsequence {k_n} ⊂ N satisfying (2.5b). Of course, the geometric subsequence k_n = [c^n], the integer part of c^n, is unbounded and satisfies the (quasi)geometric growth condition (2.5b) if c > 1. Recalling that (ψ_1, ψ_2, σ) ≠ (0, 0, 0) for G = G_{ψ_1,ψ_2,σ} ∈ S*, define c = c(G_{0,0,σ}) = 1 for any σ > 0 and c = c(G_{ψ_1^α,ψ_2^α,0}) = inf{c > 1 : M_j(cs) = M_j(s), s > 0, j = 1, 2}, the minimal common period of the factor functions M_1 and M_2 in ψ_1^α and ψ_2^α in (2.4) for α ∈ (0, 2). Thus c = c(G) is defined for all G ∈ S*. It turns out for the whole domain D_gp(G) := ∪_{c≥1} D_gp^{(c)}(G) of geometric partial attraction of G ∈ S* that D_gp(G) = ∩_{m∈N} D_gp^{(c^m)}(G) = D_gp^{(c)}(G). Also, if c(G) = 1 for G ∈ S*, then G ∈ S and D_gp(G) = D(G), the domain of attraction of the stable G. In other words, D(S) := ∪_{G∈S} D(G) is contained among the domains D_gp(G), G ∈ S*. A characterization of these domains was obtained by Grinevich and Khokhlov (1995). However, besides the fact that it contained an error, this characterization is in terms of the norming factors A_{k_n} in (2.3) and the tails of F, and so it is not useful when trying to apply the criterion (1.4) to trimmed sums. The following alternative characterization is due to Megyesi (2000). Consider a subsequence {k_n}_{n=1}^∞ ⊂ N satisfying (2.5b). If c = 1 in (2.5b), then put γ(s) = 1 for every s ∈ (0, 1). If c > 1, then the sequence {k_n} is eventually strictly increasing to ∞.
Hence, for all s ∈ (0, 1) small enough there exists a uniquely determined n*(s) such that 1/k_{n*(s)} ≤ s < 1/k_{n*(s)−1}. For any such s we define γ(s) = s k_{n*(s)}, so that for any fixed ε > 0 and all s ∈ (0, 1) small enough we have 1 ≤ γ(s) < c + ε for the limiting c ≥ 1 from (2.5b). In particular, for any sequence s_m > 0 for which lim_{m→∞} s_m = 0, the limit points of the sequence {γ(s_m)}_{m=1}^∞ are in the interval [1, c]. Let Q_+(·) denote the right-continuous version of the quantile function Q(·) of the underlying distribution function F(·). Since D_gp(G) = D(G) for a normal G ∈ S*, we only have to describe the domain of geometric partial attraction of nonnormal semistable laws, for which the Domain Theorem is this: If G_{ψ_1^α,ψ_2^α,0} ∈ S* is semistable with exponent α ∈ (0, 2), so that ψ_1^α and ψ_2^α
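The function γ(·) becomes concrete in the idealized case k_n = c^n with the integer parts ignored (a simplification for illustration, not the paper's exact setting): n*(s) is the least n with c^{−n} ≤ s, and γ(s) = s c^{n*(s)} ∈ [1, c).

```python
import math

def gamma_fn(s, c=2.0):
    """Idealized gamma(s) for the geometric subsequence k_n = c**n:
    take the least n with c**(-n) <= s and return s * c**n, in [1, c)."""
    n_star = math.ceil(math.log(1.0 / s, c))
    return s * c ** n_star
```

For c = 2 and s = 0.1 this gives 0.1 · 2⁴ = 1.6; as s runs through a sequence tending to 0, the values γ(s) oscillate through [1, c), illustrating the limit-point statement above.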
satisfy (2.4), and F ∈ D_gp(G_{ψ_1^α,ψ_2^α,0}), then the equations in (2.6) hold at all continuity points s > 0 of M_j, j = 1, 2. Conversely, if for the quantile function pertaining to F the equations in (2.6) hold with the properties of ℓ and of h_1 and h_2 just described, for some α ∈ (0, 2) and functions M_1 and M_2 of exponent α ∈ (0, 2), so that (2.7) holds along a subsequence {k_n}_{n=1}^∞ ⊂ N satisfying (2.5b), then, for the slowly varying function ℓ(·) from (2.6) and (2.7),
(3.3)
(3.4) where →_P denotes convergence in probability, and
where the independent random variables W_1(ψ_1^α) and W_2(ψ_2^α) are given at (2.1), and so
for any two sequences {l_n}_{n=1}^∞ and {m_n}_{n=1}^∞ of positive integers satisfying (1.1). The general theory in Csörgő, Haeusler and Mason (1988a, 1991b) and Csörgő (1990) ensures the existence of sequences {l_n} and {m_n} satisfying (1.1) for which these statements hold; the point of Theorem 3.3 is that they hold for all such sequences. If M_j ≡ ψ_j^α ≡ 0, which is allowed in (2.7) and in Theorem 3.3 above for one of the j, then of course W_j(0) = 0. A more general version of Theorem 3.3, in which a fixed number of the smallest and the largest extremes may be discarded from the sums in (3.3) and (3.5), is also true; the way in which the centering sequences and the limiting random variables should be changed in (3.3) and (3.5) for this version is clear from the general scheme in Csörgő, Haeusler and Mason (1988a), Csörgő (1990),
or Megyesi (2000). The formulation of Theorem 3.3 above suits well the genuinely two-sided case. In the completely asymmetric case, when one of ψ_1^α and ψ_2^α is identically zero, a somewhat stronger statement can be made, even in the more general version with possible light trimming: see the end of the proof of Theorem 3.3 for this in the present case of full extreme sums. Turning now to the proofs and recalling the notation in (1.2) and the statement in (2.8), Theorem 3.1 requires the following
Lemma 3.4 If F ∈ D_gp(G_{ψ_1^α,ψ_2^α,0}), then, for every x ∈ R,
$$Q\Big(\frac{l_n}{n} + x\frac{\sqrt{l_n}}{n}\Big) = -\Big(\frac{l_n}{n} + x\frac{\sqrt{l_n}}{n}\Big)^{-1/\alpha} \ell\Big(\frac{l_n}{n} + x\frac{\sqrt{l_n}}{n}\Big)\, M_1\Big(\gamma\Big(\frac{l_n}{n} + x\frac{\sqrt{l_n}}{n}\Big)\Big)\Big[1 + h_1\Big(\frac{l_n}{n} + x\frac{\sqrt{l_n}}{n}\Big)\Big]$$
and
for all n large enough. We substitute these into the formula for ψ_{1,n}(x). Using then the fact that the ratios involving σ(l_n/n, 1/2) appearing there have finite and positive liminf and limsup, set s_n := l_n/n → 0 and t_n(x) := l_n n^{−1} + x √l_n n^{−1} = s_n [1 + x l_n^{−1/2}] → 0. Since also, as a result of our continuity assumption, lim_{s↓0} h_1(s) = 0 by the domain theorem at (2.6), and t_n(0) = s_n of course, for (3.9) it suffices to show that
(3.10)
v_n(x) := |M_1(γ(t_n(x))) − M_1(γ(s_n))| → 0 for each x ≠ 0.
Let c > 1 be the limit in (2.5b) for the sequence {k_n}_{n=1}^∞ which defines γ(·) preceding (2.6). We may and do assume that c > 1, since in the case c = 1, when F ∈ D(α) for the given α ∈ (0, 2) at hand and M_1(·) is a constant
function, (3.10) is trivial. Then, for all n large enough, γ(s_n), γ(t_n(x)) ∈ [1, c²], say, and one defines γ_n(x) from γ(t_n(x)) and γ(s_n) by cases according to the sign of x,
(j = 1, 2), and so
if n ≥ N*. Recalling (2.8) we see that for n ≥ N* the inequality
(3.12) |K_j(y_1) − K_j(y_2)| ≤ C(a, b, α, ε)
holds true for any choice of y_1, y_2 ∈ [a, b], y_1 < y_2, provided M*_{n,j}(y_1) ≤ M*_{n,j}(y_2), j = 1, 2. If this is indeed the case, then (3.12) in itself proves (3.11), but the following considerations apply in general. Indeed, observe that the choice of N* ensures that
(3.13) |K_j(a) − K_j(y_1)| ≤ |M_j(a) − M_j(y_1)| + ε, j = 1, 2,
for all n ≥ N*, and note also that
Here |M*_{n,j}(a) − M*_{n,j}(y_1)| ≤ C(a, b, α, ε) by (3.12) if M*_{n,j}(a) ≤ M*_{n,j}(y_1), j ∈ {1, 2}, but if this fails for some j ∈ {1, 2}, then we still have
where the first term can be estimated using (3.13) and C(a, b, α, ε) is an upper bound on the second one, provided M*_j(y_1) ≤ M*_j(b). However, if the latter inequality is not the case either, then |M*_j(a) − M*_j(y_1)| ≤ |M*_j(a) − M*_j(b)|, since M*_j(a) ≥ M*_j(y_1) ≥ M*_j(b). All this together implies (3.11). □

Proof of Theorem 3.2 We only have to deal with the case when G = G_{ψ_1,ψ_2,α} for some α ∈ (0, 2), where at least one of ψ_1 and ψ_2 is not identically 0. The other case being analogous, suppose that ψ_1 ≢ 0. Retaining the notation in the proof of Theorem 3.1, we show that a sequence {l_n} of positive integers can be chosen to satisfy both (1.1) and the first convergence relation in (3.8). The latter follows, through the same considerations as there, if {l_n} is chosen to make sure that (3.9) holds. By the monotonicity of ψ_1 we can pick a sequence of pairs (a_j, b_j), 1 < a_j < b_j < c, and constants ε_j ∈ (0, 1) such that both a_j and b_j are continuity points of M_1 and the inequalities |M_1(a_j) − M_1(b_j)| < ε_j and C(a_j, b_j, α, ε_j) < 1/j hold for all j ∈ N. Next, put N_0 := 0 and, by means of the threshold numbers of Lemma 3.5, define an increasing sequence {N_j} of positive integers by setting N_j := max{N(a_j, b_j, α, ε_j), N_{j−1} + 1}, j ∈ N. Elementary consideration shows now that for each j ∈ N there exists a
threshold number N_j* ∈ N such that for every n ≥ N_j* one can choose an l*_{n,j} ∈ N with the properties that

(3.14) … ≤ l*_{n,j}/n ≤ …

and

…
and it can clearly be stipulated that 1 < N_1* < N_2* < ⋯. By Lemma 3.5 we see that

… ≤ 1/j

for all x ∈ [−j, j] and n ≥ N_j*. Now we are ready to choose the desired sequence {l_n}. We set l_n := 1 for n < N_1* and define {l_n}_{n=N_1*}^∞ by the following algorithm, in which T ∈ N is a new auxiliary variable:

Step 1. Let the initial values of j and n be j := 1 and n := N_1*, and put T := N_1*.
Step 2. If N_j* ≤ n < N_{j+1}*, then put l_n := l*_{n,j} or l_n := l*_{n,j+1} according as l*_{n,j+1} ≤ l_T or l*_{n,j+1} > l_T, and if l*_{n,j+1} > l_T then set also j := j + 1 and T := n.
Step 3. Set n := n + 1 and go to Step 2.

Then l_n → ∞ by the choices of T and, since N_j* → ∞ as j → ∞, we also have l_n/n → 0 by (3.14). Thus (1.1) holds for the chosen sequence {l_n}, and the displayed inequality following (3.14) above shows that (3.9) is also satisfied for any fixed x ∈ R. If ψ_2 ≢ 0, then the sequence {m_n} can be chosen in a similar fashion. If ψ_2 ≡ 0, then simply put m_n := l_n for every n ∈ N, and the desired asymptotic normality follows as in the proof of part (ii) of Theorem 3.1. □

Lemma 3.6 If a function ℓ(·) on (0, 1) is slowly varying at zero and {r_n} is a sequence of positive numbers such that r_n → ∞ and r_n/n → 0, then
If, in addition, F ∈ D_gp(G_{ψ_1,ψ_2,α}), then … → M_j(y) for every continuity point y > 0 of M_j, and we have

…

at all the respective continuity points y > 0 of the limiting functions. Furthermore, Lemma 3.4(i) implies that

limsup_{n→∞} … / [k_n^{1/α} ℓ(1/k_n)] ≤ …
→ 0 and r_n/m_n → 0; in fact, these are true along the whole N again. These four pairs of facts allow a subsequential application of that variant of a two-sided version of Theorem 1 in Csörgő, Haeusler and Mason (1991a), the version alluded to on p. 789 there, in which the basic functions Q⁺(s) and Q(1 − s), 0 < s < 1, are taken right-continuous and the Poisson processes N_1(·) and N_2(·) are taken left-continuous, as in the present paper. Using the eight facts above, this variant implies that every subsequence of N contains a further subsequence such that (3.3) and (3.5) hold jointly along that subsequence. This implies that (3.3) and (3.5) hold jointly as stated. By the convergence of types theorem, (3.3) and (3.5) already imply (3.4) for the subsequence {k_n}. However, if neither of M_1 and M_2, or equivalently, neither of ψ_1 and ψ_2 is identically zero, then the left side of (3.9) is bounded by 2(D_1 + D_2) from (2.8), for both M_1 and M_2, even if they are not continuous, implying that the two sequences of functions in (3.8) are pointwise
bounded. Hence the same is true for the sequences {ψ_{j,n}(·)}, j ∈ {1, 2}. Also, setting r_n = min(l_n, m_n), we have a_n(l_n, m_n) ≤ a_n(r_n, r_n) for all n ∈ N and a_n(r_n, r_n)/[n^{1/α} ℓ(1/n)] → 0 by Lemma 3.6. Therefore, the discussion at (1.13) in Csörgő, Haeusler and Mason (1988b) yields (3.4) as stated. If, on the other hand, M_1 ≡ 0 and M_2 ≢ 0, then by the same argument

… → 0,

and, since in this case the first convergence in (3.15) takes place along the whole {n} = N with an identically zero limiting function, we also get
… → 0

for both r_n = l_n and r_n = m_n, which together prove (3.4). We see that if M_1(·) ≡ 0 and M_2(·) > 0, then in fact we have

…

along with (3.5). Similarly, if M_2(·) ≡ 0 and M_1(·) > 0, then again we have (3.4) and, in fact,

…

along with (3.3).
REFERENCES

Cheng, S.H. (1992). A complete solution for weak convergence of heavily trimmed sums. Science in China, Ser. A 35 641-656.
Csörgő, S. (1990). A probabilistic approach to domains of partial attraction. Adv. in Appl. Math. 11 282-327.
Csörgő, S. and Dodunekova, R. (1991). Limit theorems for the Petersburg game. In: Sums, Trimmed Sums and Extremes (M.G. Hahn, D.M. Mason and D.C. Weiner, eds.), pp. 285-315, Progress in Probability 23, Birkhäuser, Boston.
Csörgő, S., Haeusler, E. and Mason, D.M. (1988a). A probabilistic approach to the asymptotic distribution of sums of independent, identically distributed random variables. Adv. in Appl. Math. 9 259-333.
Csörgő, S., Haeusler, E. and Mason, D.M. (1988b). The asymptotic distribution of trimmed sums. Ann. Probab. 16 672-699.
Csörgő, S., Haeusler, E. and Mason, D.M. (1991a). The asymptotic distribution of extreme sums. Ann. Probab. 19 783-811.
Csörgő, S., Haeusler, E. and Mason, D.M. (1991b). The quantile-transform–empirical-process approach to limit theorems for sums of order statistics. In: Sums, Trimmed Sums and Extremes (M.G. Hahn, D.M. Mason and D.C. Weiner, eds.), pp. 215-267, Progress in Probability 23, Birkhäuser, Boston.
Csörgő, S., Horváth, L. and Mason, D.M. (1986). What portion of the sample makes a partial sum asymptotically stable or normal? Probab. Theory Rel. Fields 72 1-16.
Csörgő, S. and Simons, G. (1996). A strong law of large numbers for trimmed sums, with applications to generalized St. Petersburg games. Statist. Probab. Letters 26 65-73.
Doeblin, W. (1940). Sur l'ensemble de puissances d'une loi de probabilité. Studia Math. 9 71-96.
Gnedenko, B.V. and Kolmogorov, A.N. (1954). Limit Distributions for Sums of Independent Random Variables. Addison-Wesley, Reading, Massachusetts.
Grinevich, I.V. and Khokhlov, Y.S. (1995). The domains of attraction of semistable laws. Teor. veroyatn. primen. 40 417-422 (in Russian). [English version: Probab. Theory Appl. 40 361-366.]
Griffin, P.S. and Pruitt, W.E. (1989). Asymptotic normality and subsequential limits of trimmed sums. Ann. Probab. 17 1186-1219.
Hahn, M.G., Mason, D.M. and Weiner, D.C., editors (1991). Sums, Trimmed Sums and Extremes. Progress in Probability 23, Birkhäuser, Boston.
Kesten, H. (1993). Convergence in distribution of lightly trimmed and untrimmed sums are equivalent. Math. Proc. Cambridge Philos. Soc. 113 615-638.
Kruglov, V.M. (1972). On the extension of the class of stable distributions.
Teor. veroyatn. primen. 17 723-732 (in Russian). [English version: Probab. Theory Appl. 17 685-694.]
Lévy, P. (1937). Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris.
Megyesi, Z. (2000). A probabilistic approach to semistable laws and their domains of partial attraction. Acta Sci. Math. (Szeged) 66, to appear.
Mejzler, D. (1973). On a certain class of infinitely divisible distributions. Israel J. Math. 16 1-19.
Pillai, R.N. (1971). Semi stable laws as limit distributions. Ann. Math. Statist. 42 780-783.
Shimizu, R. (1970). On the domain of partial attraction of semi-stable distributions. Ann. Inst. Statist. Math. 22 245-255.
Stigler, S.M. (1973). The asymptotic distribution of the trimmed mean. Ann. Statist. 1 472-477.
DEPARTMENT OF STATISTICS
UNIVERSITY OF MICHIGAN
4062 FRIEZE BUILDING
ANN ARBOR, MICHIGAN 48109-1285
USA
scsorgo@umich.edu

BOLYAI INSTITUTE
UNIVERSITY OF SZEGED
ARADI VÉRTANÚK TERE 1
H-6720 SZEGED
HUNGARY
csorgo@math.u-szeged.hu
[email protected]
STATISTICAL PROBLEMS INVOLVING PERMUTATIONS WITH RESTRICTED POSITIONS
PERSI DIACONIS, RONALD GRAHAM AND SUSAN P. HOLMES
Stanford University; University of California and AT&T; Stanford University and INRA-Biométrie

The rich world of permutation tests can be supplemented by a variety of applications where only some permutations are permitted. We consider two examples: testing independence with truncated data and testing extra-sensory perception with feedback. We review relevant literature on permanents, rook polynomials and complexity. The statistical applications call for new limit theorems. We prove a few of these and offer an approach to the rest via Stein's method. Tools from the proof of van der Waerden's permanent conjecture are applied to prove a natural monotonicity conjecture.
AMS subject classifications: 62G09, 62G10.
Keywords and phrases: Permanents, rook polynomials, complexity, statistical test, Stein's method.

1 Introduction
Definitive work on permutation testing by Willem van Zwet, his students and collaborators has given us a rich collection of tools for probability and statistics. We have come upon a series of variations where randomization naturally takes place over a subset of all permutations. The present paper gives two examples of sets of permutations defined by restricting positions. Throughout, a permutation π is represented in two-line notation

  1     2     3    ⋯    n
π(1)  π(2)  π(3)   ⋯  π(n)

with π(i) referred to as the label at position i. The restrictions are specified by a zero-one matrix A_{ij} of dimension n, with A_{ij} equal to one if and only if label j is permitted in position i. Let S_A be the set of all permitted permutations. Succinctly put:

(1.1) S_A = {π : ∏_{i=1}^n A_{iπ(i)} = 1}.

Thus if A is a matrix of all ones, S_A consists of all n! permutations. Setting the diagonal of this A equal to zero results in derangements, permutations with no fixed points, i.e., no points i such that π(i) = i.
The literature on the enumerative aspects of such sets of permutations is reviewed in Section 2, which makes connections to permanents, rook polynomials and computational complexity. Section 3 describes statistical problems where such restricted sets arise naturally. Consider a test of independence based on paired data (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n). Suppose the data is truncated in the following way: for each x there is a known set I(x) such that the pair (X, Y) can be observed if and only if Y ∈ I(X). For example, a motion detector might only be able to detect a velocity Y which is neither too slow nor too fast. Once movement is detected the object can be measured, yielding X. Of course, such truncation usually induces dependence. Independence may be tested in the following form: does there exist a probability measure μ on the space where Y is observed such that

(1.2) P{Y_i ∈ B_i, 1 ≤ i ≤ n} = …

Given (X_1, Y_1), …, (X_n, Y_n) drawn independently from a joint distribution V with X_i ∈ X, Y_i ∈ Y, suppose that V has margins V^1 and V^2. A test of the null hypothesis of independence, V = V^1 × V^2, may be based on the empirical measure V_n. Let δ be a metric for probabilities on X × Y. One class of test statistics is (3.12)
T_n = δ(V_n, V_n^1 × V_n^2).
Extending classical work of Hoeffding (1948), Blum, Kiefer and Rosenblatt (1961), and Bickel (1969), Romano (1989) shows that under very mild regularity assumptions, the permutation distribution of the test statistic T_n gives an asymptotically consistent, locally most powerful test of independence. Consider next the truncated case explained in Section 1. The hypothesis (1.2) may be called quasi-independence, in direct analogy with the similar problem of testing independence in a contingency table with structural zeros. Clogg (1986) and Stigler (1992) review the literature and history of tests for quasi-independence, with references to the work of Caussinus and Goodman. While optimality results are not presently available in the truncated case, it is natural to consider the permutation distribution of statistics such as (3.12). This leads to a host of open problems in the world of permutations with restricted positions. We were led to the present considerations by a series of papers in the astrophysics literature dealing with the expanding universe. The red-shift data collected for these problems suffers from heavy truncation. For example, Figure 1, from Efron and Petrosian (1998), shows a scatterplot of 210 x–y pairs subject to interval truncation; the x coordinate corresponds to red-shift, the y coordinate to log-luminosity. A suggested theory of 'luminosity evolution' says that early quasars were brighter. This suggests that points on the right side of the picture are higher because high red-shift corresponds to high age. Astronomers, beginning with Lynden-Bell (1971, 1992), have developed permutation-type tests based on Kendall's tau for dealing with these problems. There is a growing statistical literature on regression in the presence of truncation; see Tsui et al. (1988) for a survey.
Independence with Truncation

Figure 1. Quasar data: 210 observed data points and their upper and lower truncation limits.
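A Monte Carlo version of a permutation test in this truncated setting can be sketched as follows. Everything concrete here, the discordant-pair statistic, the rejection sampler over admissible permutations, and the toy truncation windows, is an illustrative choice of ours rather than the authors' procedure.

```python
import random

def discordant_pairs(x, y):
    """A Kendall's-tau-type statistic: the number of discordant pairs."""
    n = len(x)
    return sum((x[i] - x[j]) * (y[i] - y[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def sample_admissible(y, lo, hi, rng):
    """Rejection-sample a permutation pi with y[pi(i)] in [lo_i, hi_i] for all i."""
    n = len(y)
    while True:
        pi = rng.sample(range(n), n)
        if all(lo[i] <= y[pi[i]] <= hi[i] for i in range(n)):
            return pi

def permutation_pvalue(x, y, lo, hi, reps=200, seed=0):
    """Compare the observed statistic to its restricted-permutation distribution."""
    rng = random.Random(seed)
    obs = discordant_pairs(x, y)
    hits = sum(
        discordant_pairs(x, [y[i] for i in sample_admissible(y, lo, hi, rng)]) <= obs
        for _ in range(reps)
    )
    return (hits + 1) / (reps + 1)

# Toy truncated data: y_i is observable only inside a window around x_i.
rng = random.Random(1)
x = [i / 10 for i in range(10)]
y = [xi + rng.uniform(-0.5, 0.5) for xi in x]
lo = [xi - 1.0 for xi in x]
hi = [xi + 1.0 for xi in x]
p = permutation_pvalue(x, y, lo, hi)
assert 0.0 < p <= 1.0
```

Rejection sampling is adequate when the truncation windows are wide; for severe truncation one would need the Markov chain samplers discussed later in the paper.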
Most previous work deals with one-sided truncation of real-valued observations. The theory and practice are easier here, as explained in Section 3.2. Efron and Petrosian (1998) have recently developed tests and estimates for the case of two-sided truncation. We develop some theory for their setup in Section 3.3. The following preliminary lemma shows that interval truncation of real-valued observations leads to restriction matrices with intervals of ones in each row.
Lemma 3.1 Let x_1, …, x_n take values in an arbitrary set. Let I(x_i) be a real interval. Let y_1, y_2, …, y_n be real numbers with y_i ∈ I(x_i). Suppose the ordering is chosen so that y_1 < y_2 < ⋯ < y_n. …

… b_i. Without loss of generality, we take b_1 ≤ b_2 ≤ ⋯ ≤ b_n in the sequel; here S_b denotes the set of permutations π with π(i) ≥ b_i for all i. Some examples:

• If all b_i = 1, S_b contains all permutations.
• If b_i = 1 for 1 ≤ i ≤ a and b_i = 2 for a + 1 ≤ i ≤ n, S_b contains all permutations with 1 in the first a places.
• If b_i = i, S_b contains only the identity permutation.
• If b_1 = b_2 = 1 and b_i = i − 1 for 3 ≤ i ≤ n, … π(i), with π uniformly chosen in S_b.

Efron and Petrosian (1992) have introduced an apparently different notion of rank test with a simple distribution theory based on Lemma 3.2. The relation between these rank tests and the distribution theory above is an open problem. While Corollary 3.1 is well known in the combinatorial literature, it is often rediscovered by statisticians; see Tsai (1990).
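With b nondecreasing, filling positions n, n − 1, …, 1 in turn leaves i − b_i + 1 free labels at position i, so |S_b| = ∏_{i=1}^n (i − b_i + 1). This product form is our reconstruction of the simple sequential count behind the examples above; a brute-force check on those examples:

```python
from itertools import permutations
from math import prod

def count_brute(b):
    """|S_b| by enumeration: permutations pi of {1,...,n} with pi(i) >= b_i."""
    n = len(b)
    return sum(all(pi[i] >= b[i] for i in range(n))
               for pi in permutations(range(1, n + 1)))

def count_product(b):
    """Sequential-choice count prod_i (i - b_i + 1), valid for nondecreasing b."""
    return prod(i - bi + 1 for i, bi in enumerate(b, start=1))

assert count_brute([1, 1, 1, 1]) == count_product([1, 1, 1, 1]) == 24  # all of S_4
assert count_brute([1, 1, 2, 2]) == count_product([1, 1, 2, 2]) == 12  # 1 in first 2 places
assert count_brute([1, 2, 3, 4]) == count_product([1, 2, 3, 4]) == 1   # identity only
assert count_brute([1, 1, 2, 3]) == count_product([1, 1, 2, 3]) == 8   # = 2^(n-1)
```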
3.3 An example with historical insights
Karl Pearson considered a natural source of censored observations in his work on what is now called quasi-independence. He considered families with one or more imbecilic children, cross-tabulating the family size against the birth order of the first such child. Clearly a family of j children can only have its first-born special child in a position between one and j, and consequently T_{ij} = 0 for i > j. Pearson carried out a test of independence with this truncated dataset in 1913! A historical report on Pearson's work and its later impact is given by Stigler (1992). It is worth beginning with an exact quote of Pearson's procedure from the article by Elderton et al. (1913). "Lastly we considered the correlation between the imbecile's place in the family and the gross size of that family. Clearly the size of the family must always be as great or greater than the imbecile's place in it, and the correlation table is accordingly one cut off at the diagonal, and there would certainly be correlation, if we proceeded to find it by the usual product
moment method, but such correlation is, or clearly may be, wholly spurious. Such tables often occur and are of considerable interest for a number of reasons. They have been treated in the Laboratory recently by the following method: one variate x is greater than or equal to the other y; let us construct a table with the same marginal totals, such that y is always equal to or less than x, but let its value be distributed according to an "urn-drawing" law, i.e. purely at random. This can be done. We now have two tables: one the actual table, the other one with the same marginal frequencies at which we would arrive if x and y were distributed by pure chance but subject to the condition that y is equal or less than x; this table we call the independent probability table. Now assume it to be the theoretical table, which is to be sampled to obtain the observed table, and to measure by χ² and P the probability that the observed result should arise as a sample from the independent probability table."

We find this paragraph remarkable as an early, clear example of the conditional permutation interpretation of the chi-square test. A careful reading reveals that Pearson is not explicit about the "urn drawing", commenting only that this can be done. In the rest of this Section we give an explicit algorithm by translating the problem into that of generating a random permutation with restrictions of the one-sided type and showing that Lemma 3.2 takes a particularly simple form. To begin with, it may be useful to give the classical justification for Fisher's exact test of independence in an uncensored table. Let T_{ij} be a table with row sums r_i and column sums c_j. Under independence the conditional distribution of T_{ij} given r_i, c_j is the multiple hypergeometric. This may be obtained and motivated as a permutation test as follows. Suppose the n individuals counted by the table have row and column indicators (X_i, Y_i).
Suppose n = Σ_{i,j} n_{ij}. The original data can be regarded as (X_k, Y_k), 1 ≤ k ≤ n. … in a matrix A with ones on the diagonal and having the ones in each row in an interval; use {i_1, i_2, …, i_ℓ} to label the rows and columns. Thus the matrix appears as

(3.18) (an n × n zero-one matrix with ones on the diagonal, the ones in each row forming an interval).
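The classical permutation justification just described can be checked mechanically: permuting the column indicators uniformly among the n individuals leaves both margins of the table fixed, so the induced distribution on tables is supported on the set with the observed margins (the multiple hypergeometric). A minimal sketch with toy data of our choosing:

```python
import random

def table_from_pairs(pairs, R, C):
    """Cross-tabulate (row, column) indicator pairs into an R x C table."""
    T = [[0] * C for _ in range(R)]
    for r, c in pairs:
        T[r][c] += 1
    return T

rng = random.Random(0)
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]   # six individuals
rows = [r for r, _ in pairs]
cols = [c for _, c in pairs]
T0 = table_from_pairs(pairs, 2, 2)
row_sums = [sum(row) for row in T0]
col_sums = [sum(T0[r][c] for r in range(2)) for c in range(2)]

for _ in range(100):
    perm = rng.sample(cols, len(cols))     # uniform permutation of the Y-labels
    T = table_from_pairs(list(zip(rows, perm)), 2, 2)
    assert [sum(row) for row in T] == row_sums
    assert [sum(T[r][c] for r in range(2)) for c in range(2)] == col_sums
```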
The proof proceeds in two cases.

Case 1: i_ℓ > i_2. Then, by the row interval property, there is a 1 in positions (i_1, i_1) and (i_ℓ, i_2), so labels i_2 and i_ℓ can be transposed.

Case 2: i_ℓ ≤ i_2. …

… X chosen from the uniform distribution on S_A and X′ one step of the chain away from X. Such an exchangeable pair forms the basis of Stein's approach to the study of Hoeffding's combinatorial limit theorem. Bolthausen (1984) and Schneller (1989) used extensions of Stein's method to get the right Berry-Esseen bound and Edgeworth corrections. Zhao, Bai, Chao and Liang (1997) give limit theorems for doubly indexed statistics (à la Daniels) of the form Σ_{i,j} a(i, j, π(i), π(j)) using Stein's method. Finally, Mann (1995) and Reinert (1998) have used Stein's method of exchangeable pairs to show that the chi-square test for independence in contingency tables has an approximate χ² distribution, conditional on the margins. We have used the exchangeable pair described above to prove normal and Poisson limit theorems for the number of fixed points in a permutation chosen randomly from the set S_A. There is a lot more work to be done. We note in particular that the limiting distribution of linear rank statistics is an open problem with even one-sided truncation. The distribution of Kendall's tau is an open problem in the case of two-sided truncation.

We close this Section with a statistical comment and a useful lemma. The widely used nonparametric measure of association Kendall's tau applied to paired data {(x_i, y_i)} can be described combinatorially as follows: sort the pairs by increasing values of x_i, then calculate the minimum number of pairwise adjacent transpositions required to bring {y_i} into the same order as {x_i}. When working with restricted positions, it is natural to ask if any admissible permutation can be brought to any other by pairwise adjacent transpositions. The following example shows that this is not so. For n = 3 consider the matrix

(1 1 1)
(0 1 0)
(1 1 1)

There are two admissible permutations,

(1 2 3)      (1 2 3)
(1 2 3)  and (3 2 1).

No pairwise adjacent transposition of the labels is allowable. The matrix has the row interval property and all transpositions connect. It is not hard to see that pairwise adjacent transpositions connect all admissible permutations in the one-sided case. The following lemma proves connectedness in the monotone case: the intervals I(x_i) = (a_i, b_i) can be arranged so that a_1 ≤ a_2 ≤ ⋯ …
(4.21)
By continuity, (4.21) also holds if some of the entries in V* or x are allowed to be zero. Proposition 4.1 follows from (4.21). By symmetry of the permanent, it is enough to prove it for e r . Consider the three matrices corresponding to αr x 6r, αr x br + e r , αr x br + 2eΓ. Move these rth blocks to the right of the full matrix. The last two columns of the full matrices with these blocks appear as ' 1 1"
' 1 1"
"1 1 '
1
1
1
1
1
1
1
1
1
1
1
1
0
1
0
0
1
1
:
1
1
0
1
1
*
αr
0
1
0
All other columns axe the same, Call the two columns from the first matrix x and y and apply (4.21). • Remark 4.2 In Chung, Diaconis, Graham and Mallows (1981) it was in fact shown that Πk = AΓ(α, b + kei) is log-concave: n\ > n/-+infc_i. Their proof was combinatorial and only worked for zero-one matrices. It is not clear if there is an analog of log-concavity for more general Lorentzian forms.
Acknowledgements. We thank Brad Efron for providing the original motivation and examples of truncated data, as well as many ideas on the relation to quasi-independence, Steve Stigler for the explanation of Pearson's work, Alistair Sinclair for help with the permanent literature, Steve Fienberg for pointers to the literature on structural zeros and Marc Coram, Mark Huber and Jim Fill for reading the manuscript carefully.
REFERENCES
Albers, W., Bickel, P.J. and van Zwet, W.R. (1976) Asymptotic expansions for the power of distribution-free tests in the one-sample problem (Corr: V6 p1170), Annals of Statistics, 4, 108-156.
Bai, Z., Chao, C.C., Liang, C. and Zhao, L. (1997) Error bounds in a central limit theorem of doubly indexed permutation statistics, Annals of Statistics, 25, 2210-2227.
Bapat, R.B. (1990) Permanents in probability and statistics, Linear Algebra and its Applications, 127, 3-25.
Barvinok, A. (1998) A simple polynomial time algorithm to approximate the permanent within a simply exponential factor. Preprint, Mathematical Sciences Research Center, Berkeley.
Bickel, P. (1969) A distribution-free version of the Smirnov two-sample test in the p-variate case, Annals of Mathematical Statistics, 40, 1-23.
Blum, J.R., Kiefer, J. and Rosenblatt, M. (1961) Distribution free tests of independence based on the sample distribution function, Annals of Mathematical Statistics, 32, 485-498.
Bolthausen, E. (1984) An estimate of the remainder in the combinatorial central limit theorem, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66, 379-386.
Brightwell, G. and Winkler, P. (1991) Counting linear extensions, Order, 8, 225-242.
Caussinus, H. (1966) Contribution à l'analyse statistique des tableaux de corrélation, Annales de la Faculté des Sciences de Toulouse (année 1965), 29, 77-182.
Chung, F., Diaconis, P., Graham, R. and Mallows, C.L. (1981) On the permanents of the complements of the direct sum of identity matrices, Advances in Applied Mathematics, 2, 121-137.
Clogg, C.C. (1986) Quasi-independence, in Encyclopedia of Statistical Sciences (9 vols. plus Supplement), 7, 460-464.
Cook, W. (1998) Combinatorial Optimization, Wiley, NY.
Daley, D.J. and Vere-Jones, D. (1988) An Introduction to the Theory of Point Processes, Springer-Verlag, NY.
Diaconis, P. (1988) Group Representations in Probability and Statistics, Institute of Mathematical Statistics, Hayward, California.
Diaconis, P. and Graham, R. (1977) Spearman's footrule as a measure of disarray, Journal of the Royal Statistical Society, Series B, 39, 262-268.
Diaconis, P. and Graham, R. (1981) The analysis of sequential experiments with feedback to the subjects, Annals of Statistics, 9, 3-23.
Diaconis, P. and Sturmfels, B. (1998) Algebraic algorithms for sampling conditional distributions, Annals of Statistics, 26, 363-397.
Efron, B. and Petrosian, V. (1992) A simple test of independence for truncated data with applications to red-shift surveys, Astrophysical Journal, 399, 345-352.
Efron, B. and Petrosian, V. (1999) Nonparametric methods for doubly truncated data, Journal of the American Statistical Association, 94, 824-834.
Elderton, E.M., Barrington, A., Jones, H.G., Lamotte, E.M., Laski, H.J. and Pearson, K. (1913) On the correlation of fertility with social value: A cooperative study, Eugenics Laboratory Memoirs XVIII, University of London.
Feller, W. (1968) An Introduction to Probability Theory and its Applications, vol. I, 3rd edition, Wiley, New York.
Goodman, L.A. (1968) The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries, Journal of the American Statistical Association, 63, 1091-1131.
Garey, M.R. and Johnson, D.S. (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman and Co., San Francisco.
Godsil, C.D. (1981) Matching behaviour is asymptotically normal, Combinatorica, 1, 369-376.
Graham, R.L., Knuth, D.E. and Patashnik, O. (1989) Concrete Mathematics, Addison-Wesley, Reading, MA.
Guo, S.W. and Thompson, E.A. (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles, Biometrics, 48, 361-372.
Hanlon, P. (1996) A random walk on the rook placements on a Ferrers board, Electronic Journal of Combinatorics, vol. 3,
http://math34.gatech.edu:8080/Journal/journalhome.html
Hoeffding, W. (1948) A non-parametric test of independence, Annals of Mathematical Statistics, 19, 546-557.
Jerrum, M.R. and Sinclair, A.J. (1989) Approximating the permanent, SIAM Journal on Computing, 18, 1149-1178.
Jerrum, M.R. and Vazirani, U. (1992) A mildly exponential approximation algorithm for the permanent, Proceedings of the 33rd Annual IEEE Conference on Foundations of Computer Science, 320-326.
Karmarkar, N., Karp, R., Lipton, R., Lovász, L. and Luby, M. (1993) A Monte-Carlo algorithm for estimating the permanent, SIAM Journal on Computing, 22, 284-293.
Lai, T.L. and Ying, Z. (1991) Rank regression methods for left-truncated and right-censored data, Annals of Statistics, 19, 531-556.
Lazzeroni, L. and Lange, K. (1997) Markov chains for Monte Carlo tests of genetic equilibrium in multidimensional contingency tables, Annals of Statistics, 25, 138-168.
Lehmer, D.H. (1970) Permutations with strongly restricted displacements, in Combinatorial Theory and its Applications II, Eds. Erdős, P., Rényi, A. and Sós, V., 755-770, North Holland, Amsterdam.
Lieb, E.H. (1966) Proofs of some conjectures on permanents, Journal of Mathematics and Mechanics, 16, 127-134.
Lovász, L. and Plummer, M.D. (1986) Matching Theory, North Holland, Amsterdam.
Lynden-Bell, D. (1971) A method of allowing for a known observational selection in small samples applied to 3CR quasars, Monthly Notices of the Royal Astronomical Society, 155, 95-118.
Lynden-Bell, D. (1993) Eddington-Malmquist bias, streaming motions, and the distribution of galaxies (Disc: p217-220), in Statistical Challenges in Modern Astronomy, ed. Babu, J. and Feigelson, E., 201-216.
Mallows, C. (1957) Non-null ranking models I, Biometrika, 44, 114-130.
Mann, B. (1995) PhD thesis, Harvard University, Cambridge, MA.
Pitman, J. (1997) Probabilistic bounds on the coefficients of polynomials with only real zeros, Journal of Combinatorial Theory, Ser. A, 77, 279-303.
Rasmussen, L.E. (1998) On approximating the permanent and other #P-complete problems, PhD Thesis, University of California at Berkeley.
Rasmussen, L.E. (1994) Approximating the permanent: a simple approach, Random Structures and Algorithms, 5, 349-361.
Read, R. (1962) Card guessing with information - a problem in probability, American Mathematical Monthly, 69, 506-511.
Reinert, G. (1998) Stein's method for the Chi-2 statistic, Technical Report.
Riordan, J. (1958) An Introduction to Combinatorial Analysis, Wiley, NY.
Romano, J.P. (1989) Bootstrap and randomization tests of some nonparametric hypotheses, Annals of Statistics, 17, 141-159.
Schneller, W. (1989) Edgeworth expansions for linear rank statistics, Annals of Statistics, 17, 1103-1123.
Sinclair, A. (1992) Algorithms for Random Generation and Counting: A Markov Chain Approach, Birkhäuser.
Stein, C. (1986) Approximate Computation of Expectations, Institute of Mathematical Statistics Lecture Notes and Monographs, Hayward, California.
Stembridge, J. (1991) Immanants of totally positive matrices are nonnegative, Bulletin of the London Mathematical Society, 23, 422-428.
Stembridge, J. (1992) Some conjectures for immanants, Canadian Journal of Mathematics, 44, 1079-1099.
Stanley, R. (1986) Enumerative Combinatorics, volume I, Wadsworth & Brooks/Cole, Monterey, CA.
Stigler, S. (1992) Studies in the history of probability and statistics XLIII. Karl Pearson and quasi-independence, Biometrika, 79, 563-575.
Tsai, W.Y. (1990) Testing the assumption of independence of truncation time and failure time, Biometrika, 77, 169-177.
Tsui, K.L., Jewell, N.P. and Wu, C.F.J. (1988) A nonparametric approach to the truncated regression problem, Journal of the American Statistical Association, 83, 785-792.
Valiant, L. (1979) The complexity of computing the permanent, Theoretical Computer Science, 8, 189-201.
Van Lint, J. and Wilson, R. (1992) A Course in Combinatorics, Cambridge University Press, Cambridge.
Zeckendorf, E. (1972) Représentation des nombres naturels par une somme de nombres de Fibonacci ou de nombres de Lucas, Bulletin de la Société Royale des Sciences de Liège, 41, 179-182.
PERSI DIACONIS
MATHEMATICS AND STATISTICS
SEQUOIA HALL
STANFORD UNIVERSITY
CA 94305-4065

RONALD GRAHAM
COMPUTER SCIENCE
UNIVERSITY OF CALIFORNIA AT SAN DIEGO
AND AT&T, FLORHAM PARK, NJ
graham@ucsd.edu

SUSAN HOLMES
STATISTICS
STANFORD UNIVERSITY
AND UNITÉ DE BIOMÉTRIE
INRA-MONTPELLIER, FRANCE
susan@stat.stanford.edu
MARKOV CHAIN CONDITIONS FOR ADMISSIBILITY IN ESTIMATION PROBLEMS WITH QUADRATIC LOSS
MORRIS L. EATON
University of Minnesota

Consider the problem of estimating a parametric function when the loss is quadratic. Given an improper prior distribution, there is a formal Bayes estimator for the parametric function. Associated with the estimation problem and the improper prior is a symmetric Markov chain. It is shown that if the Markov chain is recurrent, then the formal Bayes estimator is admissible. This result is used to provide a new proof of the admissibility of Pitman's estimator of a location parameter in one and two dimensions.
AMS subject classifications: K1232, H3789.
Keywords and phrases: estimation, quadratic loss, admissibility, Markov chain, recurrence.
1 Introduction
In this paper we consider a classical parametric estimation problem when the loss is quadratic. Here attention is restricted to the so-called formal Bayes estimators, that is, estimators obtained as minimizers of the posterior risk calculated via a formal posterior distribution. Because the loss is quadratic, admissibility questions regarding such estimators are typically attacked using the explicit representation of the estimator as the posterior mean of the function to be estimated. Examples can be found in KARLIN (1958), STEIN (1959), ZIDEK (1970), PORTNOY (1971), BERGER and SRINIVASAN (1978), BROWN and HWANG (1982), EATON (1992), and HOBERT and ROBERT (1999). To describe the problem of interest here, let P(dx|θ) be a statistical model on a sample space X where the parameter θ ∈ Θ is unknown. That is, for each θ, P(·|θ) is a probability measure on the Borel sets of X. Both X and Θ are assumed to be Polish spaces with the natural σ-algebra. Given a real-valued function φ(θ) that is to be estimated, consider the loss function

(1.1)
L{α,θ) = (α
Research was supported in part by National Science Foundation Grant DMS 96-26601. Research was supported in part by grants from CWI and NWO in the Netherlands.
In order to define a formal Bayes estimator of φ(θ), let ν be a σ-finite improper prior distribution defined on the Borel sets of Θ, so ν(Θ) = +∞. The marginal measure on X is defined by

(1.2) M(B) = ∫_Θ P(B|θ) ν(dθ)
for Borel subsets B of X. When M is σ-finite (assumed throughout this paper), a formal posterior Q(dθ|x) exists and is characterized by

(1.3) P(dx|θ) ν(dθ) = Q(dθ|x) M(dx).
The equality in (1.3) means that the measures on X × Θ defined by the left and right sides of (1.3) are equal. The formal posterior Q(·|x) is a probability measure for each x ∈ X. For a discussion of the existence of Q and its uniqueness (up to sets of M-measure zero), see JOHNSON (1991). When the loss is (1.1) and the improper prior is ν, the formal Bayes estimator of φ(θ) is defined to be the point a(x) which minimizes (over a's)

(1.4) ∫ (a − φ(θ))² Q(dθ|x).

Of course, the minimizer is

(1.5) φ̂(x) = ∫ φ(θ) Q(dθ|x).
For the present, questions concerning the existence of integrals will be ignored. The risk function of this estimator is

(1.6) R(φ̂, θ) = E_θ (φ̂(X) − φ(θ))²

where E_θ denotes expectation under P(·|θ). The main focus of this paper concerns the admissibility of φ̂ and the relationship of this admissibility to a Markov chain associated with the estimation problem. For our purposes, the relevant notion of admissibility is the following (STEIN (1965)).

Definition 1.1 For any estimator t(X) of φ(θ), let R(t, θ) = E_θ (t(X) − φ(θ))² be the risk function of t. The estimator φ̂ is almost-ν-admissible (a-ν-a) if for every estimator t which satisfies

(1.7) R(t, θ) ≤ R(φ̂, θ) for all θ ∈ Θ,

the set {θ | R(t, θ) < R(φ̂, θ)} has ν-measure zero.

To establish almost-ν-admissibility, a version of the Blyth–Stein method is used, based on comparing φ̂ with Bayes estimators obtained from proper priors. For a set C ⊆ Θ with 0 < ν(C) < +∞, let

(1.9) U(C) = {g | g is bounded, g(θ) ≥ 1 for θ ∈ C, ∫_Θ g(θ) ν(dθ) < +∞}.
For g ∈ U(C), think of g(θ) ν(dθ) as defining a proper prior distribution (it has not been normalized to integrate to one) and consider the marginal measure on X given by

(1.10) Mg(B) = ∫ P(B|θ) g(θ) ν(dθ).
Because the measure Mg is finite, we can write (as in (1.3))

(1.11) P(dx|θ) g(θ) ν(dθ) = Qg(dθ|x) Mg(dx)

where Qg(dθ|x) now is a proper posterior distribution corresponding to the proper prior c g(θ) ν(dθ), where c is the normalizing constant. Thus, the Bayes solution to the estimation problem is the Bayes estimator
(1.12) φ̂g(x) = ∫ φ(θ) Qg(dθ|x)

which is the posterior mean of φ(θ). Next, consider the integrated risk difference

(1.13) IRD(g) = ∫ [R(φ̂, θ) − R(φ̂g, θ)] g(θ) ν(dθ).
Roughly (subject to some regularity described precisely in later sections), one version of the Blyth–Stein condition is: For sufficiently many sets C,

(1.14) inf_{g ∈ U(C)} IRD(g) = 0.

When (1.14) holds, then φ̂ is a-ν-a (for example, see STEIN (1965)). In typical examples, a direct verification of (1.14) is not routine. A main result in this paper provides an upper bound for IRD(g) which allows us to use results from Markov chain theory to establish a sufficient
condition for (1.14). This result, established in Section 3 under regularity conditions, is the following: For g ∈ U(C), IRD(g) ≤ Δ(√g), where

(1.15) Δ(h) = ∫_X ∫_Θ ∫_Θ (h(θ) − h(η))² (φ(θ) − φ(η))² Q(dθ|x) Q(dη|x) M(dx)
is defined for real valued functions h. Although the function Δ(h) looks rather complicated, there is a Markov chain associated with Δ lurking in the background. To see this, recall (1.3) and let (1.16)
R{dθ\η) =
ίQ(dθ\x)P(dx\η).
x Then R{-\η) is the expected value of the formal posterior Q(-\x) when the model is P( \η). Obviously, R( \η) is a transition function (see EATON (1992, 1997) for further discussion; see HOBERT and ROBERT (1999) for some related material) and we can write
(1.17) Δ(h) = ∫_Θ ∫_Θ (h(θ) − h(η))² (φ(θ) − φ(η))² R(dθ|η) ν(dη).
Then, with

(1.18)
  ψ(η) = ∫ (φ(θ) − φ(η))² R(dθ|η),
  T(dθ|η) = ψ⁻¹(η) (φ(θ) − φ(η))² R(dθ|η),
  ξ(dη) = ψ(η) ν(dη),

it follows that

(1.19) Δ(h) = ∫_Θ ∫_Θ (h(θ) − h(η))² T(dθ|η) ξ(dη).
By definition, T(dθ|η) is a transition function and hence defines a discrete time Markov chain W = (W₀ = η, W₁, W₂, ...) whose state space is Θ and whose path space is Θ^∞. That is, under T(·|η), the chain starts at W₀ = η and the successive states W_{i+1} have distribution T(·|W_i), i = 0, 1, 2, .... Under some regularity conditions to be specified later, when
the chain W is "recurrent", it follows from results in EATON (1992, Appendix 2) that

(1.20) inf_{g ∈ U(C)} Δ(√g) = 0 for each set C with 0 < ν(C) < +∞.
Therefore, the recurrence of the chain W implies that (1.14) holds and hence a-ν-a for φ̂ obtains. In summary, the above argument runs as follows:

(i) The Blyth–Stein condition (1.14) is sufficient for a-ν-a.
(ii) The integrated risk difference is bounded above by Δ(√g) as in (1.15).
(iii) When the Markov chain associated with Δ is recurrent, then (1.20) implies (1.14) holds and we have a-ν-a.

Step (i) is a well known technique in decision theory and has appeared in many applications such as those listed at the beginning of this section. Step (iii) was used in EATON (1992) and is a direct consequence of general results concerning symmetric Markov chains. What is new in this paper is step (ii) as expressed in (1.15). Inequalities like (1.15) were used in EATON (1992), but only for bounded functions φ. Thus the advance here is the extension of the Markov chain arguments to cover cases of estimating unbounded functions such as mean values. The following is a simple, but not so trivial, example which shows how the results described above can be applied.

Example 1.1 Let f be a symmetric density on R¹ with a finite absolute third moment, and assume one observation X is made from f(x − θ)dx, where θ ∈ R¹ is an unknown translation parameter. The loss function is (a − θ)², so the parameter θ itself is to be estimated. Consider the improper prior distribution dθ, so the formal posterior is Q(dθ|x) = f(x − θ)dθ. Thus the formal Bayes estimator is

∫ θ Q(dθ|x) = x

and the risk function is just the constant E₀X², where E₀ denotes expectation when θ = 0. The Markov chain associated with this problem has transition function T given in (1.18). A routine calculation shows that the transition function R(dθ|η) of (1.16) is R(dθ|η) = r(θ − η)dθ, where

r(u) = r(−u) = ∫ f(x − u) f(x) dx
is a density on R¹. Thus

ψ(η) = ∫ (θ − η)² R(dθ|η) = ∫ θ² r(θ) dθ = c₂

is constant. From (1.18) we have

T(dθ|η) = c₂⁻¹ (θ − η)² r(θ − η) dθ = t(θ − η) dθ

and ξ(dη) = c₂ dη. Therefore T is a translation kernel with density t, so the Markov chain associated with T is a random walk on R¹. Thus the existence of a first moment for t implies this random walk is recurrent (CHUNG and FUCHS (1951)). Using the definition of t and the third moment assumption for f yields
∫ |u| t(u) du = c₂⁻¹ ∫ |u|³ r(u) du = c₂⁻¹ ∫∫ |u|³ f(x − u) f(x) du dx ≤ (8/c₂) E₀|X|³ < +∞.
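As a quick numerical sanity check of this computation, one can take f to be the standard normal density; this is an assumption made purely for illustration, since any symmetric density with a finite absolute third moment would do. Then r is the N(0, 2) density, c₂ = 2, and the first moment of t is finite, which is exactly the Chung–Fuchs hypothesis used above. A minimal sketch:

```python
import numpy as np

# Numerical check of Example 1.1 with f = standard normal (illustrative choice).
u = np.linspace(-20.0, 20.0, 200001)
du = u[1] - u[0]

# r(u) = integral of f(x - u) f(x) dx is then the N(0, 2) density.
r = np.exp(-u**2 / 4.0) / np.sqrt(4.0 * np.pi)

c2 = np.sum(u**2 * r) * du                   # second moment of r; should be 2
t = u**2 * r / c2                            # increment density t of the walk W
mass = np.sum(t) * du                        # t integrates to one
first_moment = np.sum(np.abs(u) * t) * du    # finite => Chung-Fuchs recurrence

print(c2, mass, first_moment)                # approximately 2, 1, 4/sqrt(pi)
```

The finiteness of the printed first moment is the hypothesis of the Chung–Fuchs theorem invoked above.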
Hence the random walk is recurrent and the estimator x is almost admissible (relative to Lebesgue measure). Of course, this example is just a very special case of the admissibility of Pitman's estimator on R¹ when third moments exist. This was first established by STEIN (1959) using the Blyth–Stein method directly.

Here is a brief summary of this paper. Section 2 contains the formal problem statement, basic assumptions, and a statement of the Blyth–Stein condition. The basic inequality is proved in Section 3, while Section 4 contains some background material on symmetric Markov chains. The main theorem connecting recurrence and admissibility is proved in Section 5, while some useful extensions are described in Section 6. The results are then applied in Section 7 to provide an alternative proof of the admissibility of the Pitman estimator of a location parameter in one and two dimensions. BROWN (1971) considered the problem of estimating the mean vector of a multivariate normal distribution when the loss is quadratic. Under regularity conditions, he established a close connection between admissibility and the recurrence of an associated diffusion process defined on the sample space. The relationship between Brown's work and the results here remains quite obscure. For further discussion, see EATON (1992, 1997).
2 Notation and Assumptions
Certain integrability assumptions are needed to justify the arguments that are sketched in Section 1. Some of these assumptions are stated here. The two spaces X and Θ are assumed to be Polish spaces with the natural σ-algebras. The model P(dx|θ) is a Markov kernel and the improper prior distribution ν is σ-finite. The marginal measure M(dx) defined in (1.2) is assumed to be σ-finite so that equation (1.3) holds for the formal posterior Q(dθ|x). Let φ be a real valued function defined on Θ such that

(A.1) ∫ φ²(θ) Q(dθ|x) < +∞ for all x.
Then the formal Bayes estimator φ̂(x) given in (1.5) is well defined. The risk function defined by (1.6) is assumed to satisfy the following local integrability condition:

(A.2) There exists an increasing sequence of sets {K_i} such that ∪ K_i = Θ, 0 < ν(K_i) < ∞, and ∫_{K_i} R(φ̂, θ) ν(dθ) < ∞ for each i.
Observe that if g ∈ U(K_i) (as defined in (1.9)) and g vanishes outside some K_j with j ≥ i, then the integrated risk

(2.1) ∫ R(φ̂, θ) g(θ) ν(dθ)

is finite. Now, recalling (1.9), let g ∈ U(C) and consider

(2.2) ḡ(x) = ∫ g(θ) Q(dθ|x).
Recall that the marginal measure Mg is

(2.3) Mg(B) = ∫_Θ ∫_X I_B(x) P(dx|θ) g(θ) ν(dθ).

Using (1.3), we see

(2.4) Mg(dx) = ḡ(x) M(dx)
so that ḡ is the Radon–Nikodym derivative of Mg with respect to M. Hence the set A₀ = {x | ḡ(x) = 0} has Mg measure zero. Now, define Qg(dθ|x) as follows:

(2.5) Qg(dθ|x) = g(θ) ḡ⁻¹(x) Q(dθ|x) for x ∉ A₀, and Qg(dθ|x) = Q(dθ|x) for x ∈ A₀.
It is then easy to verify that

(2.6) P(dx|θ) g(θ) ν(dθ) = Qg(dθ|x) Mg(dx).

Therefore the Bayes estimator

(2.7) φ̂g(x) = ∫ φ(θ) Qg(dθ|x)

is well defined because (A.1) and the boundedness of g imply

(2.8) ∫ φ²(θ) Qg(dθ|x) < +∞ for all x.
A rigorous statement of the Blyth–Stein Lemma follows. Given a K_i in (A.2), let

(2.9) U*(K_i) = {g | g ∈ U(K_i), ∫ R(φ̂, θ) g(θ) ν(dθ) < +∞}.

Theorem 2.1 (Blyth–Stein Lemma). For each i, assume that

(2.10) inf_{g ∈ U*(K_i)} IRD(g) = 0.

Then φ̂ is a-ν-a.

Proof The proof of this well known condition is by contradiction. The details are left to the reader. •

Theorem 2.2 For g ∈ U*(K_i),

(2.11) IRD(g) = ∫ (φ̂(x) − φ̂g(x))² ḡ(x) M(dx).
Proof The proof of (2.11) is routine algebra coupled with the earlier observation that A₀ has Mg measure zero. •

3 The Basic Inequality

In this section, the inequality described in (1.15) is established for g ∈ U*(K_i), i = 1, 2, .... Here is a basic lemma which may be of independent interest.
Lemma 3.1 Let W and Y be real valued random variables such that EW² < +∞, Y ≥ 0, and μ = EY < +∞. Also let (W̄, Ȳ) be an independent and identically distributed copy of (W, Y). Then

(3.1) [Cov(W, Y)]² ≤ μ E (W − W̄)² (√Y − √Ȳ)².
Proof A direct calculation shows that

Cov(W, Y) = ½ E (W − W̄)(Y − Ȳ).
Writing

(Y − Ȳ) = (√Y − √Ȳ)(√Y + √Ȳ)

and using the Cauchy–Schwarz inequality yields

[Cov(W, Y)]² ≤ ¼ E(√Y + √Ȳ)² E (W − W̄)² (√Y − √Ȳ)².

But (√Y + √Ȳ)² ≤ 2(Y + Ȳ), so that ¼ E(√Y + √Ȳ)² ≤ μ. This completes the proof. •

Theorem 3.1 For g ∈ U*(K_i),

(3.2) IRD(g) ≤ Δ(√g)
where Δ is defined in (1.15).

Proof For each x with ḡ(x) > 0,

(3.3) φ̂(x) − φ̂g(x) = ∫ φ(θ) Q(dθ|x) − ∫ φ(θ) Qg(dθ|x) = ḡ⁻¹(x) ∫ φ(θ)(ḡ(x) − g(θ)) Q(dθ|x) = −ḡ⁻¹(x) Cov_x(φ, g)

where Cov_x denotes covariance under the probability measure Q(·|x). The last equality follows since ḡ(x) is the mean of g(θ) under Q(·|x). Applying inequality (3.1) with W = φ and Y = g, we have

(3.4) (φ̂(x) − φ̂g(x))² ≤ ḡ⁻¹(x) ∫∫ (φ(θ) − φ(η))² (√g(θ) − √g(η))² Q(dθ|x) Q(dη|x).

Substituting this inequality into the right side of (2.11) clearly yields (3.2). This completes the proof. •

The upper bound Δ(√g) in (3.2) depends only on the three essential components of the original problem, namely the model, the improper prior and the function φ to be estimated. Of course this statement assumes that
the loss is quadratic. When the function φ is bounded, say |φ(θ)| ≤ c, then obviously

(3.5) Δ(h) ≤ 4c² Δ₁(h), where Δ₁(h) = ∫∫ (h(θ) − h(η))² R(dθ|η) ν(dη).

The function Δ₁ appeared first in EATON (1992) and was used to relate Markov chain recurrence to admissibility questions regarding the estimation of bounded functions. Not only is the argument here more general, it is far more transparent than the original in the case when φ is bounded.
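Covariance inequalities of the type in Lemma 3.1 can be checked exactly on small discrete distributions. The two-point law below uses arbitrary illustrative values and computes both sides of (3.1) in closed form:

```python
import itertools
import math

# Exact check of inequality (3.1) for a two-point distribution:
# (W, Y) = (0, 1) or (1, 4), each with probability 1/2 (illustrative values).
law = [(0.5, 0.0, 1.0), (0.5, 1.0, 4.0)]   # (probability, w, y)

EW = sum(p * w for p, w, y in law)
EY = sum(p * y for p, w, y in law)          # this is mu in Lemma 3.1
EWY = sum(p * w * y for p, w, y in law)
lhs = (EWY - EW * EY) ** 2                  # [Cov(W, Y)]^2

# mu * E (W - W')^2 (sqrt(Y) - sqrt(Y'))^2 over an iid copy (W', Y').
rhs = EY * sum(
    p1 * p2 * (w1 - w2) ** 2 * (math.sqrt(y1) - math.sqrt(y2)) ** 2
    for (p1, w1, y1), (p2, w2, y2) in itertools.product(law, repeat=2)
)

print(lhs, rhs)   # 0.5625 1.25 -- the inequality holds strictly here
```

Replacing `law` by any finite distribution with nonnegative Y gives the same one-sided comparison, as the lemma guarantees.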
4 Symmetric Markov chains
Some basic theory concerning symmetric Markov chains with values in a Polish space is described here. Of course, the emphasis is on those aspects of the theory which are most directly related to the admissibility questions under consideration. The discussion follows EATON (1992, Appendix 2) quite closely. Let (Y, B) be a measurable space where Y is Polish and B is the usual Borel σ-algebra. Consider a Markov kernel S(du|v) defined on B × Y so that S(·|v) is a probability measure for each v ∈ Y and S(B|·) is B-measurable for each B ∈ B. Let ξ be a σ-finite measure defined on B with ξ(Y) > 0.

Definition 4.1 The Markov kernel S(du|v) is ξ-symmetric if the measure

(4.1) m(du, dv) = S(du|v) ξ(dv)

defined on B × B is a symmetric measure.

In all that follows, S(du|v) is assumed to be ξ-symmetric. The assumption that ξ is σ-finite is important (see the development in Appendix 2 of EATON (1992)). The symmetry of m implies that m has marginal measures ξ; that is,

(4.2) m(Y × B) = m(B × Y) = ξ(B).

Of course, (4.2) implies that ξ is a stationary measure for S(du|v) since

(4.3) ∫_Y S(B|v) ξ(dv) = ξ(B).
Now, each Markov kernel defines a Markov chain, and conversely, to specify a Markov chain one needs, at least implicitly, a Markov kernel. A Markov chain is called symmetric if this Markov kernel is symmetric with respect
to some σ-finite measure. For finite and countable state spaces, symmetric Markov chains are also called reversible chains, but that terminology is not used here (see KELLY (1979) or LAWLER (1995)). According to the above terminology, a symmetric Markov chain on Y gives rise to a symmetric measure (as in (4.1)) on B × B, and this symmetric measure has a σ-finite marginal measure as defined in (4.2). Conversely, suppose n(du, dv) is a symmetric measure on B × B and suppose its marginal measure

(4.4) μ(B) = n(B × Y)

is σ-finite. This implies that there is a unique (up to sets of n-measure zero) Markov kernel T(du|v) such that

(4.5) n(du, dv) = T(du|v) μ(dv).
This result seems to be well known, but I do not know a reference with an explicit statement. A slightly more general result can be found in JOHNSON (1991). The above discussion shows there is a one to one correspondence between symmetric Markov chains and symmetric measures with σ-finite marginals. This observation is what allows us to associate a Markov chain with the function Δ appearing in (1.15). More about this in the next section. Now, let S(du|v) be ξ-symmetric and let Y = (Y₀ = v, Y₁, Y₂, ...) be the corresponding Markov chain with values in Y. The notation means the chain starts at v and the successive states Y_{i+1} have distribution S(·|Y_i) for i = 0, 1, .... The joint measure of the chain on Y^∞ is denoted by Prob(·|v), where Y₀ = v is the initial state of the chain. Next, we turn to a discussion of recurrence when S(du|v) is ξ-symmetric.

Definition 4.2 Let B ∈ B satisfy 0 < ξ(B) < +∞. The set B is locally-ξ-recurrent (l-ξ-r) if the set

(4.6) {v | v ∈ B, Prob(Y_j ∈ B for some j ≥ 1 | v) < 1}
has ξ measure zero.

In other words, B is l-ξ-r if, except for a set of starting values of ξ-measure zero, the chain returns to B with probability one when it starts in B. A characterization of local-ξ-recurrence can be given in terms of a quadratic form. For h ∈ L²(ξ), the linear space of ξ square integrable functions, define D(h) by

(4.7) D(h) = ∫∫ (h(u) − h(v))² m(du, dv)

where m is the symmetric measure given by (4.1). For B such that 0 < ξ(B) < +∞, let

(4.8) V(B) = {h | h ≥ 0, h ∈ L²(ξ), h(u) ≥ 1 for u ∈ B}.
Theorem 4.1 The following are equivalent:

(i) B is l-ξ-r;
(ii) inf_{h ∈ V(B)} D(h) = 0.

Proof This is a direct consequence of Theorem A.2 in EATON (1992). •
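The simplest concrete instance of Definition 4.2 is the simple random walk on the integers, which is symmetric with respect to counting measure. The sketch below is an illustration, not part of the paper's argument: it computes, by exact dynamic programming, the probability that the walk returns to B = {0} within a fixed number of steps, and this probability increases toward one, as local recurrence requires.

```python
# Exact return probabilities for the simple random walk on the integers,
# a chain that is symmetric relative to counting measure.
def return_prob(n_steps):
    # After the first step the walk sits one unit from 0 (by symmetry, take +1).
    # dist[k] = probability of being at k > 0 without having hit 0 yet.
    dist = {1: 1.0}
    returned = 0.0
    for _ in range(n_steps):
        nxt = {}
        for k, p in dist.items():
            for k2 in (k - 1, k + 1):
                if k2 == 0:
                    returned += 0.5 * p     # absorbed: the walk has returned
                else:
                    nxt[k2] = nxt.get(k2, 0.0) + 0.5 * p
        dist = nxt
    return returned

p_small, p_large = return_prob(200), return_prob(2000)
print(p_small, p_large)   # both below 1 and increasing toward 1
```

The same computation on Z³ would reveal a return probability bounded away from one, in line with the transience cited at the end of the paper.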
For our applications, a slight strengthening of Theorem 4.1 is needed. Let C ∈ B satisfy C ⊇ B and ξ(C) < +∞. Then set

(4.9) V(B, C) = {h | h ∈ V(B), h is bounded, h(u) = 0 for u ∈ Cᶜ}.

Theorem 4.2 Consider C₁ ⊆ C₂ ⊆ ... with B ⊆ C₁ and lim C_i = Y. The following are equivalent:

(i) B is l-ξ-r;
(ii) lim_{i→∞} inf_{h ∈ V(B, C_i)} D(h) = 0.

Proof This is a consequence of results in EATON (1992, Appendix 2). •
It is Theorem 4.2 which will be used to establish a connection between the Blyth–Stein condition and recurrence.

Definition 4.3 The chain Y is locally-ξ-recurrent if for each set B with 0 < ξ(B) < +∞, B is l-ξ-r.

It is not too hard to show that Y is locally-ξ-recurrent iff there exists an increasing sequence of sets C₁ ⊆ C₂ ⊆ ... with 0 < ξ(C_i) < +∞ and lim C_i = Y such that each C_i is l-ξ-r. In applications one can often choose a convenient sequence of sets C_i in order to check l-ξ-r. The quadratic form D(h) in (4.7) is well known in the theory and applications of symmetric Markov chains. In the probability literature ½D(h) is known as the Dirichlet form associated with the symmetric measure m, or the symmetric transition S in (4.1). It is typical to write ½D(h) in terms of the linear transformation S* defined on L²(ξ) as follows:

(4.10) (S*h)(v) = ∫ h(u) S(du|v).

Let (h₁, h₂) denote the standard inner product on L²(ξ) given by

(4.11) (h₁, h₂) = ∫ h₁(u) h₂(u) ξ(du).
A routine calculation shows that

(4.12) ½D(h) = (h, (I − S*)h)

where I is the identity. The operator I − S* is commonly called the Laplacian. Further discussion and some applications can be found in DIACONIS and STROOK (1991) and LAWLER (1995).
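Identity (4.12) is easy to confirm numerically on a finite state space. The kernel below is generated at random, an arbitrary illustrative choice: any symmetric nonnegative matrix m yields a ξ-symmetric kernel S after normalizing by the marginal ξ.

```python
import numpy as np

# Finite-state check of (4.12): (1/2) D(h) = (h, (I - S*) h) in L^2(xi).
rng = np.random.default_rng(0)
a = rng.random((6, 6))
m = a + a.T                      # symmetric measure m(u, v) on pairs of states
xi = m.sum(axis=0)               # marginal measure, as in (4.2)
S = m / xi                       # S[u, v] = m(u, v) / xi(v), a Markov kernel in v

h = rng.standard_normal(6)
diff = h[:, None] - h[None, :]
D = np.sum(diff**2 * m)          # quadratic form D(h) of (4.7)
Sstar_h = S.T @ h                # (S* h)(v) = sum_u h(u) S[u, v], as in (4.10)
inner = xi @ (h * (h - Sstar_h)) # (h, (I - S*) h) in L^2(xi), as in (4.11)

print(D / 2, inner)              # the two numbers agree
```

The check also confirms (4.3) in this finite setting: ξ is stationary for the normalized kernel S.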
5 Recurrence implies admissibility
It is argued here that, under an additional assumption, recurrence of the Markov chain associated with the quadratic form

(5.1) Δ(h) = ∫_X ∫_Θ ∫_Θ (h(θ) − h(η))² (φ(θ) − φ(η))² Q(dθ|x) Q(dη|x) M(dx)

will imply that the Blyth–Stein condition of Theorem 2.1 holds, so that φ̂ is a-ν-a. To carry out this argument, first observe that the measure on Θ × Θ given by

(5.2) α(dθ, dη) = ∫_X (φ(θ) − φ(η))² Q(dθ|x) Q(dη|x) M(dx)
is, by inspection, symmetric. Using (1.3) and (1.16), the measure α can be written

(5.3) α(dθ, dη) = (φ(θ) − φ(η))² R(dθ|η) ν(dη)

where R(dθ|η) is a transition function and ν is the improper prior used to define the estimator φ̂(x) in (1.5). Next, for η ∈ Θ, let

(5.4) ψ(η) = ∫ (φ(θ) − φ(η))² R(dθ|η).
The following assumption controls the behavior of ψ and is expressed in terms of the sets K_i appearing in assumption (A.2) of Section 2.

(A.3) 0 < ψ(η) < +∞ for all η ∈ Θ, and ∫_{K_i} ψ(η) ν(dη) < +∞ for all i.

Theorem 5.1 Assume (A.3) holds. Then the symmetric measure α has a σ-finite marginal measure

(5.5) ξ(dη) = ψ(η) ν(dη).
Further, with

(5.6) T(dθ|η) = ψ⁻¹(η) (φ(θ) − φ(η))² R(dθ|η),

the measure α is given by

(5.7) α(dθ, dη) = T(dθ|η) ξ(dη).
Proof That (5.7) holds is immediate from (5.3) and the definitions of ξ and T. Since T(dθ|η) is a transition function by definition, integration of (5.7) over Θ shows that α has ξ as a marginal measure. The σ-finiteness of ξ is immediate from assumption (A.3). This completes the proof. •

Now, let W = (W₀ = η, W₁, W₂, ...) be the Markov chain on Θ with transition function T. The above discussion shows that T is ξ-symmetric (i.e. W is a symmetric Markov chain). Observe that the quadratic form associated with this chain as defined in (4.7) is exactly Δ given in (5.1). In other words, for h ∈ L²(ξ),

(5.8) Δ(h) = ∫∫ (h(θ) − h(η))² α(dθ, dη)
so that the results described in Section 4 are directly applicable. Here is the main result of this paper.

Theorem 5.2 Assume (A.1), (A.2) and (A.3) hold. If the Markov chain W associated with the quadratic form Δ is locally-ξ-recurrent, then the formal Bayes estimator φ̂(x) is almost-ν-admissible.

Proof It suffices to show that condition (2.10) holds for each i, i = 1, 2, .... Fix an index j > i and consider the set V(K_i, K_j) defined in (4.9). Assumptions (A.2) and (A.3) show that if √g ... 3, since R^k (k ≥ 3) does not support any non-trivial recurrent random walks (see GUIVARC'H, KEANE and ROYNETTE (1977)). Appropriate shrinkage estimators
on R^k, k ≥ 3, provide explicit dominators of Pitman estimators in many translation problems. The results in PERNG (1970) show that in the case of k = 1, failure of the third moment assumption can lead to inadmissibility of the Pitman estimator. It is encouraging that the Markov chain arguments used here reproduce results which are known to be fairly sharp. At present, very little more is known concerning the sharpness of the Markov chain argument in Theorem 6.3. Work in this direction is underway.
REFERENCES

Berger, J.O. and Srinivasan, C. (1978). Generalized Bayes estimators in multivariate problems. Ann. Statist. 6, 783-801.
Blyth, C.R. (1951). On minimax statistical procedures and their admissibility. Ann. Math. Statist. 22, 22-42.
Brown, L. (1971). Admissible estimators, recurrent diffusions, and insolvable boundary value problems. Ann. Math. Statist. 42, 855-904.
Brown, L. and Hwang, J.T. (1982). A unified admissibility proof. In Statistical Decision Theory and Related Topics IV (S.S. Gupta and J.O. Berger, eds.) 1, 299-324. Academic Press, New York.
Chung, K.L. and Fuchs, W.H. (1951). On the distribution of values of sums of random variables. Mem. Amer. Math. Soc. 6, 1-12.
Diaconis, P. and Strook, D. (1991). Geometric bounds for eigenvalues of Markov chains. Ann. Appl. Probab. 1, 36-61.
Eaton, M.L. (1992). A statistical diptych: Admissible inferences - recurrence of symmetric Markov chains. Ann. Statist. 20, 1147-1179.
Eaton, M.L. (1997). Admissibility in quadratically regular problems and recurrence of symmetric Markov chains: Why the connection? Jour. Statist. Planning and Inference 64, 231-247.
Guivarc'h, Y., Keane, M. and Roynette, B. (1977). Marches aléatoires sur les groupes de Lie. Lecture Notes in Math. 624. Springer Verlag, New York.
Hobert, J.P. and Robert, C.P. (1999). Eaton's Markov chain, its conjugate partner and P-admissibility. To appear in Ann. Statist.
James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 361-380. University of California Press, Berkeley.
Johnson, B. (1991). On the admissibility of improper Bayes inferences in fair Bayes decision problems. Ph.D. Thesis, University of Minnesota.
Karlin, S. (1958). Admissibility for estimation with quadratic loss. Ann. Math. Statist. 29, 406-436.
Kelly, F.P. (1979). Reversibility and Stochastic Networks. Wiley, New York.
Lawler, G.F. (1995). Introduction to Stochastic Processes. Chapman-Hall, London.
Perng, S.K. (1970). Inadmissibility of various "good" statistical procedures which are translation invariant. Ann. Math. Statist. 41, 1311-1321.
Portnoy, S. (1971). Formal Bayes estimation with applications to a random effects model. Ann. Math. Statist. 42, 1379-1402.
Revuz, D. (1984). Markov Chains, Second Edition. North-Holland, Amsterdam.
Stein, C. (1955). A necessary and sufficient condition for admissibility. Ann. Math. Statist. 26, 518-522.
Stein, C. (1959). The admissibility of Pitman's estimator of a single location parameter. Ann. Math. Statist. 30, 970-979.
Stein, C. (1965). Lecture notes on decision theory. Unpublished manuscript.
Zidek, J.V. (1970). Sufficient conditions for admissibility under squared error loss of formal Bayes estimators. Ann. Math. Statist. 41, 446-456.
SCHOOL OF STATISTICS
UNIVERSITY OF MINNESOTA
224 CHURCH STREET SE
MINNEAPOLIS, MN 55455 USA
eaton@stat.umn.edu
THE SCIENTIFIC FAMILY TREE OF WILLEM R. VAN ZWET
CONSTANCE VAN EEDEN
University of British Columbia

Willem van Zwet had 16 PhD students, 15 as supervisor (promotor in Dutch) and one as co-supervisor, and what I have tried to do is to get a complete list of the members of his scientific family up to 20 February 2000. By "A is a member of Willem van Zwet's scientific family", I mean that A can trace his "scientific ancestry" back to Willem van Zwet through (co-)promotors.
1 Introduction
The family members are listed by generation (with the PhDs of W.R. van Zwet as the first generation), within each generation by PhD-thesis supervisor (promotor in Dutch) and, for the PhDs of a given supervisor, in decreasing order of age. Further, for a given generation, the supervisors are listed in decreasing order of their own supervisor's age and, in case two of them have the same supervisor, in decreasing order of their own age. In determining the age of a PhD, the date (year) of obtaining his/her PhD degree is taken as his/her birthday (birthyear).

For each person in the list, his/her thesis title as well as the date on which he/she defended his/her thesis (or the year in which the degree was granted) is given. The university mentioned in the heading of each sub-list is, unless otherwise mentioned, the one at which the students in that sub-list received their degrees.

Note that, because students may have more than one tree-member as (co-)promotor, some PhDs are listed several times - once for each of their (co-)promotors who is a tree-member. In case the several (co-)promotors are not of the same generation, these multiply-listed PhDs show up in more than one generation. For such multiply-listed PhDs who are themselves (co-)promotors, the generation number of their PhDs is taken as "the lowest generation number among their (co-)promotors + 1".

The table of contents lists all those in the tree who have been (co-)promotors, with mention of the generation number of their PhDs (i.e., their own number + 1), the number of PhD students they had up to now, as well as the numbers of the sections in which their PhDs are listed. The total number of tree members (including van Zwet) is 145, and of those, 23 have been (co-)promotors of members of van Zwet's tree. My thanks to all who helped me assemble the data.
2 Overview

Generation  Promotor              # of PhDs  Section
1           W.R. van Zwet         16         3
2           J. Oosterhoff         8          4
            W. Molenaar           60         5
            F.H. Ruymgaart        6          6
            W. Albers             6          7
            M.C.A. van Zuijlen    3          8
            R. Helmers            1          9
            C.A.J. Klaassen       5          10
            R.J.M.M. Does         5          11
            S.A. van de Geer      1          12
3           W.C.M. Kallenberg     6          13
            P. Groeneboom         8          14
            R.D. Gill             14         15
            P.D. Bezemer          3          16
            F.C. Drost            1          17
            M.N.M. van Lieshout   1          18
            K. Sijtsma            4          19
            J.A.M. van Druten     1          20
            J.H.H. Einmahl        3          21
            J.A. Beirlant         2          22
            Tj. Imbos             1          23
            L.C. van der Gaag     1          24
4           M.J. van der Laan     2          25
3 First Generation: PhD Students of W.R. van Zwet

3.1 Rijksuniversiteit Leiden, The Netherlands
J. Oosterhoff, 26/6/1969: Combination of one-sided statistical tests
W. Molenaar, 6/5/1970: Approximations to the Poisson, binomial and hypergeometric distribution functions (co-promotor with J. Hemelrijk as promotor)
F.H. Ruymgaart, 30/5/1973: Asymptotic theory of rank tests for independence
W. Albers, 27/6/1974: Asymptotic expansions and the deficiency concept in statistics
J.L. Mijnheer, 2/9/1974: Sample path properties of stable processes (promotors: J. Fabius and W.R. van Zwet)
M.C.A. van Zuijlen, 15/12/1976: Empirical distributions and rank statistics
R. Helmers, 13/12/1978: Edgeworth expansions for linear combinations of order statistics
C.A.J. Klaassen, 22/10/1980: Statistical performance of location estimators
R.J.M.M. Does, 16/6/1982: Higher order asymptotics for simple linear rank statistics
A.W. van der Vaart, 4/6/1987: Statistical estimation in large parameter spaces (co-promotor: C.A.J. Klaassen)
S.A. van de Geer, 30/9/1987: Regression analysis and empirical processes (promotors: R.D. Gill and W.R. van Zwet)
M.C.M. de Gunst, 2/3/1988: A random model for plant cell population growth (promotors: K.R. Libbenga and W.R. van Zwet)
H. Putter, 23/11/1994: Consistency of resampling methods
M. Wegkamp, 26/6/1996: Entropy methods in statistical estimation (co-promotor: S.A. van de Geer)
C.G.H. Diks, 25/9/1996: On nonlinear time series analysis (promotors: F. Takens and W.R. van Zwet)
M. Fiocco, 15/10/1997: Statistical estimation for the supercritical contact process
4 Second Generation: PhD Students of J. Oosterhoff

4.1 Vrije Universiteit Amsterdam, The Netherlands
W.C.M. Kallenberg, 14/12/1977: Asymptotic optimality of likelihood ratio tests in exponential families
P. Groeneboom, 14/6/1979: Large deviations and asymptotic efficiencies
R.D. Gill, 20/12/1979: Censoring and stochastic integrals (co-promotor: C.L. Scheffer)
P.D. Bezemer, 8/5/1981: Referentiewaarden (co-promotor with Chr.L. Rümke as promotor)
A.D.M. Kester, 22/6/1983: Some large deviation results in statistics (co-promotor: W.C.M. Kallenberg)
B.F. Schriever, 21/6/1985: Order dependence (co-promotor: R.D. Gill)
F.C. Drost, 7/10/1987: Asymptotics for generalized chi-square goodness-of-fit tests (co-promotor: W.C.M. Kallenberg)
M.N.M. van Lieshout, 7/4/1994: Stochastic geometry models in image analysis and spatial statistics (co-promotor with A.J. Baddeley as promotor)
5 Second Generation: PhD Students of W. Molenaar

5.1 Rijksuniversiteit Groningen, The Netherlands
F.P.H. Dijksterhuis, 21/6/1973: De gevangenis Bankenbos II (co-promotor with W. Buikhuisen as promotor)
H. van der Laan, 13/12/1973: Leren lezen, schrijven en rekenen (co-promotor with W.J. Bladergroen as promotor)
H.L.W. Angenent, 20/6/1974: Opvoeding, persoonlijkheid en gezinsverhoudingen in verband met kriminaliteit (co-promotor with W. Buikhuisen as promotor)
R. van der Kooy, 14/10/1974: Spelen met Spel (co-promotor with W.J. Bladergroen as promotor)
G.J.P. Visser, 8/10/1975: Luxatie-fracturen van de enkel (promotors: P.J. Kuijjer and W. Molenaar)
H. Cohen, 9/10/1975: Drugs, druggebruikers en drug-scene (co-promotor with W. Buikhuisen as promotor)
N.A.J. Lagerweij, 9/12/1976: Handleidingen in het onderwijs (co-promotor with L. van Gelder as promotor)
G.G.H. Jansen, 24/11/1977: An application of Bayesian statistical methods to a problem in educational measurement (promotors: W. Molenaar and W.K.B. Hofstee)
J.M.F. ten Berge, 15/12/1977: Optimizing factorial invariance (promotor: J.P. van de Geer; co-promotors: W.K.B. Hofstee and W. Molenaar)
L. Dijkstra, 19/1/1978: Ontvouwing. Over het afbeelden van rangordes van voorkeur in ruimtelijke modellen (promotors: I. Gadourek and W. Molenaar)
F.B. Brokken, 22/6/1978: The language of personality (co-promotor with W.K.B. Hofstee as promotor)
K.M. Stokking, 21/6/1979: Toetsend onderzoek (co-promotor: L.W. Nauta)
P. Vijn, 10/1/1980: Prior information in linear models
H.H. de Vos, 13/3/1980: Het meten van werkorientaties, een vergelijking van verschillende technieken voor het meten van houdingen (co-promotor: I. Gadourek)
Tj. Tijmstra, 24/4/1980: Sociologie en tandheelkunde (co-promotor with I. Gadourek as promotor)
H. Kuyper, 29/5/1980: About the saliency of social comparison dimensions (promotors: H.A.M. Wilke and W. Molenaar)
T.P.B.M. Suurmeijer, 16/12/1980: Kinderen met epilepsie (promotors: Th.J. IJzerman and W. Molenaar)
R. de Groot, 27/8/1981: Adolescenten met leermoeilijkheden in het LBO (co-promotor with W.J. Bladergroen as promotor)
R. Nauta, 24/9/1981: Studie politiek cultuur (promotors: P.J. van Strien and W. Molenaar)
H.H. van der Molen, 28/4/1983: Pedestrian ethology (co-promotor with J.A. Michon as promotor)
A. Boomsma, 23/6/1983: On the robustness of LISREL (maximum likelihood estimation) against small sample size and non-normality
T.J. Euverman & A.A. Vermulst, 3/11/1983: Bayesian factor analysis
R. Popping,
15/12/1983: Overeenstemmingsmaten voor nominale data (promotors: F.N. Stokman and W. Molenaar)
B.F. van der Meulen & M.R. Smrkovsky, 23/2/1984: Bayley ontwikkelingsschalen (promotors: A.F. Kalverboer and W. Molenaar)
P.F. Lourens, 24/5/1984: The formalization of knowledge by specification of subjective probability distributions (promotors: W. Molenaar and W.K.B. Hofstee)
A. de Jong, 13/9/1984: Over psychiatrische invaliditeit (promotors: R. Giel and W. Molenaar)
J. Geersing, 4/10/1984: Leiding & Participatie in het werk (promotors: P.J. van Strien, W. Molenaar and P. Veen)
H. van Schuur, 13/12/1984: Structure in political beliefs. A new model for stochastic unfolding with application to European Party activists (promotors: F.N. Stokman and W. Molenaar)
J. Ax, 9/5/1985: Planningsgedrag van leraren. Een empirisch onderzoek naar de onderwijsplanning door leraren in het voortgezet onderwijs (promotors: H.P.M. Creemers, W. Molenaar and N.A.J. Lagerweij)
H.J.A. Schouten, 20/11/1985: Statistical measurement of interobserver agreement. Analysis of agreements and disagreements between observers (promotors: R. van Strik and W. Molenaar)
F.E. Zegers, 11/12/1986: A general family of association coefficients (promotors: W. Molenaar and W.K.B. Hofstee)
M. Kompier, 31/3/1988: Arbeid en gezondheid van stadsbuschauffeurs (promotors: G. Mulder, F.J.H. van Dijk and W. Molenaar)
Constance van Eeden
Family Tree of Willem van Zwet
K. Sijtsma 2/6/1988 Contributions to Mokken's nonparametric item response theory (co-promotor: P.J.D. Drenth)
P. Terlouw 20/4/1989 Subjective probability distributions, a psychometric approach (promotors: W. Molenaar and W.K.B. Hofstee)
Y.J. Pijl 31/8/1989 Het toelatingsonderzoek in het LOM- en MLK-onderwijs (promotors: B.P.M. Creemers, W. Molenaar and J. Rispens)
T.C.W. Luijben 30/11/1989 Statistical guidance for model modification in covariance structure analysis
W. Schoonman 21/12/1989 An applied study on computerized adaptive testing (promotors: W. Molenaar and W.K.B. Hofstee)
H.J.A. Hoijtink 7/6/1990 PARELLA, Measurements of latent traits by proximity items
I. van Kamp 6/9/1990 Coping with noise and its health consequences (promotors: G. Mulder and W. Molenaar)
G.I.J.M. Kempen 25/10/1990 Thuiszorg voor ouderen (promotors: W.J.A. van den Heuvel and W. Molenaar)
J.A. Laros & P.J. Tellegen 2/1991 Construction and validation of the SON-R 5½-17, the Snijders-Oomen non-verbal intelligence test (promotors: W.K.B. Hofstee, T.A.B. Snijders and W. Molenaar)
T.F. Meijman 25/4/1991 Over vermoeidheid (promotors: G. Mulder, W. Molenaar and Hk. Thierry)
H. Meurs 19/9/1991 A panel data analysis of travel demand (promotors: T.J. Wansbeek, G. Ridder and W. Molenaar)
W.J. Post 7/5/1992 Nonparametric unfolding models. A latent structure approach (promotors: T.A.B. Snijders and W. Molenaar)
P. Cavalini 22/10/1992 It is an ill wind that brings no good (promotors: C.A.J. Vlek and W. Molenaar)
R. Festa 12/11/1992 Optimum inductive methods. A study in inductive probability, Bayesian statistics, and verisimilitude (first promotor: T.A.F. Kuipers, second promotor: W. Molenaar)
M.A.J. van Duijn 4/2/1993 Mixed models for repeated count data (promotors: T.A.B. Snijders and W. Molenaar)
J. van Lenthe 16/9/1993 ELI. The use of proper scoring rules for eliciting subjective probability distributions (promotors: W. Molenaar and W.K.B. Hofstee)
R.R. Meijer [1] 17/1/1994 Nonparametric Person Fit Analysis (promotors: W. Molenaar and P.J.D. Drenth, co-promotor: K. Sijtsma)
W.R.E.H. Mulder Hajonides van der Meulen 2/11/1994 Categories or dimensions? The use of diagnostic models in psychiatry (promotors: W. Molenaar and R.H. van den Hoofdakker)
P.M. van der Lubbe 11/12/1995 De ontwikkeling van de Groningse vragenlijst over sociaal gedrag (GVSG) (promotors: R. Giel and W. Molenaar)
E.P.W.A. Jansen 4/4/1996 Curriculumorganisatie en studievoortgang (promotors: W.T.J.G. Hoeben and W. Molenaar)
P. de Kort 22/5/1996 Neglect (promotors: J.M. Minderhoud, B.G. Deelman and W. Molenaar)
B.T. Hemker [2] 27/9/1996 Unidimensional IRT models for polytomous items, with results for Mokken scale analysis (promotors: P.G.M. van der Heijden and W. Molenaar; co-promotor: K. Sijtsma)
R.I. Gal 29/5/1997 Unreliability. Contract discipline and contract governance under economic transition (promotors: S.M. Lindenberg and W. Molenaar)
A. Camstra 7/5/1998 Cross-validation in covariance structure analysis
J.M.E. Huisman 4/2/1999 Item nonresponse: Occurrence, causes, and imputation of missing answers to test items
6
Second Generation: PhD Students of F.H. Ruymgaart
6.1 Katholieke Universiteit Nijmegen, The Netherlands
M.J.M. Jansen 25/6/1981 Equilibria and optimal threat strategies in two-person games (co-promotor: T.E.S. Raghavan)
J.A.M. van Druten 1/10/1981 A mathematical-statistical model for the analysis of cross-sectional serological data with special reference to the epidemiology of malaria (co-promotor with J.H.E.Th. Meuwissen as promotor)
J.H.J. Einmahl 16/5/1986 Multivariate empirical processes (co-promotor: D.M. Mason)
G. Nieuwenhuis 12/6/1989 Asymptotics for point processes and general linear processes (co-promotor: W. Vervaat)
6.2 Texas Technical University, Lubbock, USA
A.K. Dey 8/1994 Cross-validation for parameter selection in statistical inverse estimation problems
K. Chandrawansa 12/1997 Statistical inverse estimation of irregular input signals
7
Second Generation: PhD Students of W. Albers
7.1 Rijksuniversiteit Limburg, Maastricht, The Netherlands
L.W.G. Strijbosch 21/4/1989 Experimental design and statistical evaluation of limiting dilution assays (co-promotor: R.J.J.M. Does)
7.2 Universiteit Twente, Enschede, The Netherlands
C.A.W. Glas [3] 12/10/1989 Contributions to estimation and testing Rasch models (promotors: W.J. van der Linden and W. Albers)
A.J. Koning 8/3/1991 Stochastic integrals and goodness-of-fit tests (assistant-promotor: J.H.J. Einmahl)
G.D. Otten 23/6/1995 Statistical test limits in quality control (assistant-promotor: W.C.M. Kallenberg)
G.R.J. Arts 12/6/1998 Test limits in quality control using correlated product characteristics (assistant-promotor: W.C.M. Kallenberg)
P.C. Boon 17/12/1999 Asymptotic behavior of pre-test procedures (assistant-promotor: W.C.M. Kallenberg)
[1] Degree obtained at the Vrije Universiteit van Amsterdam, The Netherlands.
[2] Degree obtained at the Rijksuniversiteit Utrecht, The Netherlands.
[3] The degree was awarded cum laude.
8
Second Generation: PhD Students of M.C.A. van Zuijlen
8.1 Katholieke Universiteit Leuven, Belgium
J.A. Beirlant 1984 Eigenschappen van de empirische verdelingsfunctie en het empirisch proces van toevalsafstanden en asymptotische theorie van hierop gebaseerde statistische functionalen (co-promotor with E.C. van der Meulen as promotor)
8.2 Katholieke Universiteit Nijmegen, The Netherlands
T. van der Meer 12/6/1995 Applications of operators in nearly unstable models (co-promotor with W.Th.F. den Hollander as promotor)
I. Alberink 26/1/2000 Berry-Esseen bounds for arbitrary statistics (co-promotor with W.Th.F. den Hollander as promotor)
9
Second Generation: PhD Students of R. Helmers
9.1 Erasmus Universiteit Rotterdam, The Netherlands
A.L.M. Dekkers 14/11/1991 On extreme-value estimation (co-promotor with L. de Haan as promotor)
10 Second Generation: PhD Students of C.A.J. Klaassen
10.1 Rijksuniversiteit Leiden, The Netherlands
A.W. van der Vaart 4/6/1987 Statistical estimation in large parameter spaces (co-promotor with W.R. van Zwet as promotor)
10.2 Universiteit van Amsterdam, The Netherlands
S.A. Venetiaan 6/12/1994 Bootstrap bounds
E.R. van den Heuvel 21/5/1996 Bounds for statistical estimation in semiparametric models
W.J.H. Stortelder 12/3/1998 Parameter estimation in nonlinear dynamic systems (co-promotor with P.W. Hemker as promotor)
A.J. Lenstra 26/3/1998 Analyses of the nonparametric mixed proportional hazards model (promotors: C.A.J. Klaassen, G. Ridder and A.C.M. van Rooij)
11
Second Generation: PhD Students of R.J.M.M. Does
11.1 Rijksuniversiteit Limburg, Maastricht, The Netherlands
Tj. Imbos 17/2/1989 Het gebruik van einddoeltoetsen bij aanvang van de studie (co-promotor with W.H.F.W. Wijnen as promotor)
L.W.G. Strijbosch 21/4/1989 Experimental design and statistical evaluation of limiting dilution assays (co-promotor with W. Albers as promotor)
Degree awarded cum laude.
11.2 Universiteit van Amsterdam, The Netherlands
E.S. Tan 16/12/1994 A stochastic growth model for the longitudinal measurement of ability (co-promotor: Tj. Imbos)
C.B. Roes 27/11/1995 Shewhart-type charts in statistical process control
A.F. Huele 5/11/1998 Statistical robust design (co-promotor: J. Engel)
12 Second Generation: PhD Students of S.A. van de Geer
12.1 Rijksuniversiteit Leiden, The Netherlands
M. Wegkamp 26/6/1996 Entropy methods in statistical estimation (co-promotor with W.R. van Zwet as promotor)
13 Third Generation: PhD Students of P. Groeneboom
13.1 Universiteit van Amsterdam, The Netherlands
J. Praagman [4] 27/6/1986 Efficiency of change-point tests (promotors: P. Groeneboom and P.C. Sander)
A.J. van Es 2/11/1988 Aspects of nonparametric density estimation (co-promotor: P.L. Janssen)
F.A.G. Windmeyer 9/6/1992 Goodness-of-fit in linear and qualitative-choice models (promotors: P. Groeneboom and H. Neudecker)
13.2 Technische Universiteit Delft, The Netherlands
E.A.G. Weits 17/5/1990 A stochastic heat equation for freeway traffic flow
A.J. Cabo 28/6/1994 Set functionals in stochastic geometry (promotors: A.J. Baddeley and P. Groeneboom)
G. Jongbloed 9/10/1995 Three statistical inverse problems - estimators, algorithms, asymptotics
R.B. Geskus 11/2/1997 Estimation of smooth functionals with interval censored data and something completely different
P.P. de Wolf 5/10/1999 Estimating the extreme value index - tales of tails
[4] Degree obtained at the Technische Universiteit Eindhoven, The Netherlands.
14 Third Generation: PhD Students of W.C.M. Kallenberg
14.1 Vrije Universiteit Amsterdam, The Netherlands
A.D.M. Kester 22/6/1983 Some large deviation results in statistics (co-promotor with J. Oosterhoff as promotor)
F.C. Drost 7/10/1987 Asymptotics for generalized chi-squared goodness-of-fit tests (co-promotor with J. Oosterhoff as promotor)
14.2 Universiteit Twente, Enschede, The Netherlands
D.P. Kroese 22/6/1990 Stochastic models in reliability (assistant-promotor with J.H.A. de Smit as promotor)
G.D. Otten 23/6/1995 Statistical test limits in quality control (assistant-promotor with W. Albers as promotor)
G.R.J. Arts 12/6/1998 Test limits in quality control using correlated product characteristics (assistant-promotor with W. Albers as promotor)
P.C. Boon 17/12/1999 Asymptotic behavior of pre-test procedures (assistant-promotor with W. Albers as promotor)
15 Third Generation: PhD Students of R.D. Gill
15.1 Vrije Universiteit Amsterdam, The Netherlands
B.F. Schriever 21/6/1985 Order dependence (co-promotor with J. Oosterhoff as promotor)
R.H. Baayen 29/5/1989 A corpus-based approach to morphological productivity (promotors: G.E. Booij and R.D. Gill)
15.2 Rijksuniversiteit Leiden, The Netherlands
S.A. van de Geer 30/9/1987 Regression analysis and empirical processes (promotors: W.R. van Zwet and R.D. Gill)
15.3 Universiteit van Amsterdam, The Netherlands
L.C. van der Gaag 26/9/1990 Probability-based models for plausible reasoning (promotors: J.A. Bergstra and R.D. Gill)
15.4 Rijksuniversiteit Utrecht, The Netherlands
M.C.J. van Pul 24/2/1993 Statistical analysis of software reliability (co-promotor: K. Dzhaparidze)
M.J. van der Laan 13/12/1993 Efficient and inefficient estimation in semiparametric models (promotors: R.D. Gill and P.J. Bickel)
B.J. Wijers 19/1/1995 Nonparametric estimation for a windowed line-segment process
C.G.M. Oudshoorn 7/11/1996 Optimality and adaptivity in nonparametric regression (co-promotor: B.Y. Levit)
C.M.A. Schipper 15/1/1997 Sharp asymptotics in nonparametric estimation (co-promotor: B.Y. Levit)
R.W. van der Hofstad 16/6/1997 One-dimensional random polymers (promotors: W.Th.F. den Hollander and R.D. Gill)
E.N. Belitser 17/6/1997 Minimax estimation in regression and random censorship models (co-promotor: B.Y. Levit)
Y. Nishiyama 25/5/1998 Entropy methods for martingales
G.E. Krupa 2/11/1998 Limit theorems for random sets (co-promotor: E.J. Balder)
E.W. van Zwet 3/9/1999 Likelihood devices in spatial statistics
16 Third Generation: PhD Students of P.D. Bezemer
16.1 Universiteit van Amsterdam, The Netherlands
H. Walinga 30/5/1985 Varioliforme erosies - Een onderzoek naar de etiologie (promotor: G.N.J. Tijtgat with W. Dekker, P.J. Kostense and P.D. Bezemer as co-promotors)
16.2 Vrije Universiteit Amsterdam, The Netherlands
E.S.M. de Lange-de Klerk 19/5/1993 Effects of homoeopathic medicines on children with recurrent upper respiratory tract infections (co-promotor with L. Feenstra and O.S. Miettinen as promotors)
L. van den Berg 31/10/1997 Postoperatieve pijnbestrijding: Een prospectieve studie naar de kosteneffectiviteit van twee methoden (promotor: J.J. de Lange with W.W.A. Zuurmond and P.D. Bezemer as co-promotors)
17 Third Generation: PhD Students of F.C. Drost
17.1 Katholieke Universiteit Brabant, Tilburg, The Netherlands
B.J.M. Werker 14/9/1995 Statistical methods in financial econometrics (co-promotor with B.B. van der Genugten and Th.E. Nijman as promotors)
18 Third Generation: PhD Students of M.N.M. van Lieshout
18.1 University of Warwick, Coventry, United Kingdom
E. Thönnes 24/11/1998 Perfect and imperfect simulations in stochastic geometry in image analysis (co-promotor with W.S. Kendall as promotor)
19 Third Generation: PhD Students of K. Sijtsma
19.1 Vrije Universiteit Amsterdam, The Netherlands
R.R. Meijer 17/1/1994 Nonparametric person fit analysis (co-promotor with W. Molenaar and P.J.D. Drenth as promotors)
A.C. Verweij 20/12/1994 Scaling transitive inference in 7-12 year old children (co-promotor with W. Koops as promotor)
19.2 Universiteit van Amsterdam, The Netherlands
K. van Dam [5] 28/2/1996 Dansende beren. Beoordelingsprocessen bij personeelsselectie (co-promotor with G.J. Mellenbergh as first and W.M.M. Altink as second promotor)
19.3 Rijksuniversiteit Utrecht, The Netherlands
B.T. Hemker [6] 27/9/1996 Unidimensional IRT models for polytomous items, with results for Mokken scale analysis (co-promotor with P.G.M. van der Heijden and W. Molenaar as promotors)
20 Third Generation: PhD Students of J.A.M. van Druten
20.1 Universiteit van Amsterdam, The Netherlands
J.C.M. Hendriks 7/5/1999 The incubation period of AIDS (co-promotor with R.A. Coutinho and G.J.F. van Griensven as promotors)
21 Third Generation: PhD Students of J.H.J. Einmahl
21.1 Universiteit Twente, Enschede, The Netherlands
A.J. Koning 8/3/1991 Stochastic integrals and goodness-of-fit tests (assistant-promotor with W. Albers as promotor)
[5] K. van Dam received the 1996 NITPB prize for his PhD thesis.
[6] Degree awarded cum laude.
21.2 Erasmus Universiteit Rotterdam, The Netherlands
H. Xin 15/1/1992 Statistics of bivariate extreme values (co-promotor with L. de Haan as promotor)
A.K. Sinha 2/10/1997 Estimating failure probability when failure is rare: multidimensional case (co-promotor with L. de Haan as promotor)
22 Third Generation: PhD Students of J.A. Beirlant
22.1 Katholieke Universiteit Leuven, Belgium
P. Vynckier 1996 Tail estimation, quantile plots and regression diagnostics
C. Vynckier 1997 Applications of generalized quantiles in multivariate statistics
23 Third Generation: PhD Students of Tj. Imbos
23.1 Universiteit van Amsterdam, The Netherlands
E.S. Tan 16/12/1994 A stochastic growth model for the longitudinal measurement of ability (co-promotor with R.J.M.M. Does as promotor)
24 Fourth Generation: PhD Students of L.C. van der Gaag
24.1 Rijksuniversiteit Utrecht, The Netherlands
R.R. Bouckaert 13/6/1995 Bayesian belief networks: From construction to inference (co-promotor with J. van Leeuwen as promotor)
25 Fourth Generation: PhD Students of M.J. van der Laan
25.1 University of California, Berkeley, USA
A.E. Hubbard 1998 Applications of locally efficient estimation to censored data models
D.R. Peterson 1998 On nonparametric estimation and inference with censored data, bandwidth selection for local polynomial regression, and subset selection in explanatory regression analysis
Acknowledgements. The information in this paper is taken from the larger family tree in "The Scientific Family Tree of David van Dantzig", compiled by Constance van Eeden, published by the Stichting Mathematisch Centrum, September 2000.

Department of Statistics
The University of British Columbia
333-6356 Agricultural Road
Vancouver, B.C., Canada, V6T 1Z2
cve@xs4all.nl
ASYMPTOTICS IN QUANTUM STATISTICS
RICHARD D. GILL
University of Utrecht and Eurandom

Observations or measurements taken of a quantum system (a small number of fundamental particles) are inherently random. If the state of the system depends on unknown parameters, then the distribution of the outcome depends on these parameters too, and statistical inference problems result. Often one has a choice of what measurement to take, corresponding to different experimental set-ups or settings of the measurement apparatus. This leads to a design problem: which measurement is best for a given statistical problem? This paper gives an introduction to this field in the simplest of settings, that of estimating the state of a spin-half particle given n independent copies of the particle. We show how in some cases asymptotically optimal measurements can be constructed. Other cases present interesting open problems, connected to the fact that for some models quantum Fisher information is in some sense non-additive. In physical terms, we have non-locality without entanglement. AMS subject classifications: 62F12, 62P35.
Keywords and phrases: quantum statistics, information, spin half, qubit, two level system.
1
Introduction
The fields of quantum statistics and quantum probability have a reputation for being esoteric. However, in our opinion, quantum mechanics, from which they surely derive, is a fascinating source of probabilistic and statistical models, unjustly little known to 'ordinary' statisticians and probabilists. Quantum mechanics has two main ingredients: one deterministic, one random. In isolation from the outside world a quantum system evolves deterministically according to Schrödinger's equation. That is to say, it is described by a state or wave-function whose time evolution is the (reversible) solution of a differential equation. On the other hand, when this system comes into interaction with the outside world, as when for instance measurements are made of it (photons are counted by a photo-detector, tracks of particles are observed in a cloud chamber, etc.), something random and irreversible takes place. The state of the system makes a random jump, and the outside world contains a record of the jump. From the state of the system at the time of the interaction one can read off, according to certain rules, the probability distribution of the macroscopic outcomes and the new state of the system. See Penrose (1994) for an eloquent discussion of why there is something paradoxical in the peaceful coexistence of these two principles; see Percival (1998) for interesting stochastic modifications to Schrödinger's equation which might offer some reconciliation.¹

Till recently, most predictions made from quantum theory involved such large numbers of particles that the law of large numbers takes over and predictions are deterministic. However, technology is rapidly advancing to the situation that really small quantum systems can be manipulated and measured (e.g., a single ion in a vacuum chamber, or a small number of photons transmitted through an optical communication system). Then the outcomes definitely are random. The fields of quantum computing, quantum communication, and quantum cryptography are rapidly developing and depend on the ability to manipulate individual quantum systems. Especially of interest are assemblages of two-level systems, known as qubits or spin-half systems in quantum computing and information. Theory and conjecture are much further along than experiment and technology, but the latter are following steadily.

In this paper we introduce as simply as possible the model of quantum statistics and consider the problem of how best to measure the state of an unknown spin-half system. We survey some recent results, in particular from joint work with O.E. Barndorff-Nielsen and with S. Massar (Barndorff-Nielsen and Gill, 2000; Gill and Massar, 2000). This work has been concerned with the problem posed by Peres and Wootters (1991): can more information be obtained about the common state of n identical quantum systems from a single measurement on the joint system formed by bringing the n systems together, or does it suffice to combine separate measurements on the separate systems? A useful tool for our studies is the quantum Cramer-Rao bound with its companion notion of quantum information, introduced by Helstrom (1967).
Actually there are several ways to define quantum information, with different resulting Cramer-Rao type bounds (Yuen and Lax, 1973; Stratonovich, 1973; Belavkin, 1976; Holevo, 1982). Quantum statistics mainly consists of exact results in various rather special models; see the books of Helstrom (1976) and Holevo (1982). Just as in ordinary statistics, the Cramer-Rao bound on the variance of an unbiased estimator is rarely achieved exactly (only in so-called quantum exponential models). In any case, one would not want in practice to restrict attention to unbiased estimators only. There are results on optimal invariant methods, but again, not many models have the structure that makes these results applicable, and even then the restriction to invariant statistical methods is not entirely compelling. One might hope that asymptotically it would be possible to achieve the Cramer-Rao bound. However, asymptotic theory is so far very little developed in quantum statistics, one reason being that the powerful modern tools of asymptotic statistics (contiguity, local asymptotic normality, and so on) are just not available:² even if we are considering measurements of n identical quantum systems, there is no a priori reason to suppose that a particular sequence of measurements on the n systems together will satisfy these conditions. Here we make a little progress through use of the van Trees inequality (see Gill and Levit, 1995), a Bayesian Cramer-Rao bound, which allows us to make asymptotic optimality statements without assuming or proving local asymptotic normality. Another useful ingredient is the recent derivation of the quantum Cramer-Rao bound by Braunstein and Caves (1994), linking quantum information to classical expected Fisher information in a particularly neat way. We will show that, for certain problems, a new Cramer-Rao type inequality of Gill and Massar (2000) does provide an asymptotically achievable bound on the quality of an estimator of unknown parameters. For some other problems the issue remains largely open, and we identify situations where Peres and Wootters' question has an affirmative answer: there can be appreciably more information in a joint measurement of several particles than in combining separate measurements on separate particles. This clarifies an earlier affirmative answer of Massar and Popescu (1995), which turned out to improve on separate measurements only for small samples. It also clarifies the recent findings of Vidal et al. (1998).

Helstrom wrote in the epilogue to his (1976) book: "Mathematical statisticians are concerned with asymptotic properties of estimators. When the parameters of a quantum density operator are estimated on the basis of many independent observations, how does the accuracy of the estimates depend on the number of the observations as that number grows very large? Under what conditions have the estimators asymptotically normal distributions? Problems such as these, and still others that doubtless will occur to physicists and mathematicians, remain to be solved within the framework of the quantum mechanical theory." More than twenty years later this programme is still hardly touched (some of the few contributions are by Brody and Hughston, 1998, and earlier papers, and Holevo, 1983), but we feel we have made a start here.

In 20 ± ε pages (even when ±ε = +12) it is difficult to give a complete introduction to the topic, as well as a clear picture of recent results. The classic books by Helstrom and Holevo mentioned above are still the only books on quantum statistics, and they are very difficult indeed for a beginner to read. A useful resource is the survey paper by Malley and Hornstein (1993). However, the latter authors, among many distinguished writers both from physics and from mathematics, take the stance that the randomness occurring in quantum physics cannot be caught in a standard Kolmogorovian framework. We argue elsewhere (Gill, 1998), in a critique of an otherwise excellent introduction to the related field of quantum probability (Kümmerer and Maassen, 1998), that depending on what you mean by such a statement, this is at best misleading and at worst simply untrue. With more space at our disposal we would have included extensive worked examples; instead they have been replaced by exercises, so that the reader can supply some of the extra pages (but, unless you are Willem van Zwet, leave the starred exercises for later). Some references we found specially useful in getting to grips with the mathematical modelling of quantum phenomena are the books by Peres (1995) and Isham (1995). To get into quantum probability, we recommend Biane (1995) or Meyer (1986). Also highly recommended are the lecture notes of Preskill (1997), Werner (1997) and Holevo (1999).

This introductory section continues with three subsections summarizing the basic theory: first the mathematical model of states and measurements; secondly the basic facts about the simplest model, namely a two-state system; and thirdly the basic quantum Cramer-Rao bound. The third subsection finishes with a glimpse of how one might do asymptotically optimal estimation in one-parameter models: in a preliminary stage, obtain a rough estimate of the parameter from a small number of our n particles; estimate the so-called quantum score at this point; and then go on to measure it in the second stage on the remaining particles. Section 2 states a recent new version of the quantum Cramer-Rao bound which makes precise how one might trade information between different components of a parameter vector. Section 3 outlines the procedure for asymptotically optimal estimation of more than one parameter, again a two-stage procedure.

¹ Highly recommended: Sheldon Goldstein, 'Quantum mechanics without observers', Physics Today, March and April 1998; letters to the editor, Physics Today, February 1999.
² Though R. Rebolledo is working on a notion of quantum contiguity.
This is work 'in progress', so some results are conjectural, imprecise, or improvable. In a final short section we try to explain how some of our results are connected to the strange phenomenon of non-locality without entanglement, a hot topic in the theory of quantum information and computation.

1.1
The basic set-up
Quantum statistics has two basic building blocks: the mathematical specification of the state of a quantum system, to be denoted by ρ = ρ(θ) as it possibly depends on an unknown parameter θ, and the mathematical specification of the measurement, denoted by M, to be carried out on that system. We will give the recipe for the probability distribution of the observable outcome (a value x of a random variable X, say) when measurement M is carried out on a system in state ρ. Since the state ρ depends on an unknown parameter θ, the distribution of X depends on θ too, thereby setting a statistical problem of how best to estimate or test the value of θ. Since we may in practice have a choice of which measurement M to take, we have a design problem of choosing the best measurement for our purposes. There is also a recipe for the state of the system after measurement, depending on the outcome and on some further specification of the measurement; see Preskill (1997), Werner (1997), Bennett et al. (1998), or Holevo (1999). We do not need it here.

For simplicity we restrict attention to finite-dimensional quantum systems. The state of a d-dimensional quantum system can be summarized or specified by a d × d complex matrix ρ called the density matrix of the system. For instance, when we measure the spin of an electron in a particular direction, only two different values can occur, conventionally called 'up' and 'down'. This is just one example of a two-level system, requiring a d = 2 dimensional state space for its description. Similarly, if we measure whether a photon is polarized in a particular direction by passing it through a polarization filter, it either passes or does not pass the filter. Again, polarization measurements on a single photon can be discussed in terms of a two-dimensional system. If we consider the spins of n electrons, then 2^n different outcomes are possible, and the system of n electrons together (or rather, their spins) is described by a d × d matrix ρ with d = 2^n. The future quantum computer might consist of an assemblage of n atoms at very low temperature, each of which could be found in its ground state or in an excited state; interacting together they have a 2^n-dimensional state space.

Definition 1.1 (Density matrix) The density matrix ρ of a d-dimensional quantum system is a d × d self-adjoint, nonnegative matrix of trace 1.

'Self-adjoint' means that ρ* = ρ, where * denotes the complex conjugate and transpose of the matrix.
That ρ is nonnegative means that ψ*ρψ ≥ 0 for all column vectors ψ (since ρ is self-adjoint, this quadratic form is a real number). We often use the Dirac bra-ket notation, whereby |ψ⟩ (called a ket) is written for the complex column vector ψ, and ⟨ψ| (a bra) is written for its adjoint, the row vector containing the complex conjugates of its elements. The quadratic form ψ*ρψ is then written ⟨ψ|ρ|ψ⟩. The notation allows one to graphically distinguish numbers from matrices, as in ⟨ψ|φ⟩ versus |ψ⟩⟨φ|, and to specify vectors through labels or descriptions, as in |label⟩ or |description⟩. The diagonal elements of a density matrix must be nonnegative reals adding up to one. Moreover, by the eigenvalue-eigenvector decomposition of self-adjoint matrices we can write ρ = Σᵢ pᵢ |i⟩⟨i|, where the eigenvalues pᵢ are nonnegative and sum to one, and the eigenvectors |i⟩ are orthonormal.

For a spin-half system (d = 2), nonnegativity of ρ is equivalent to nonnegativity of its determinant, det ρ ≥ 0. It is convenient to write

(6) ρ = ρ(a) = (1/2)(1 + a · σ) = (1/2)(1 + a_x σ_x + a_y σ_y + a_z σ_z),

where σ = (σ_x, σ_y, σ_z) denotes the vector of Pauli matrices and a = (a_x, a_y, a_z) ∈ ℝ³ satisfies

(7) ||a||² = a_x² + a_y² + a_z² ≤ 1.

For d > 2, not all F satisfying trace(I_Q⁻¹ F) ≤ 1 are achievable, and it remains open to characterize exactly the class of achievable information matrices.
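Definition 1.1 and the parametrization (6)-(7) lend themselves to a quick numerical check. A minimal sketch (NumPy; the function names are ours, and the Pauli matrices are taken from their standard definition, which falls in a part of the paper not reproduced here):

```python
import numpy as np

# Pauli matrices (assumed standard definitions of sigma_x, sigma_y, sigma_z)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def rho_bloch(a):
    """Equation (6): rho = (1/2)(1 + a . sigma) for a Bloch vector a in R^3."""
    ax, ay, az = a
    return 0.5 * (np.eye(2) + ax * sx + ay * sy + az * sz)

def is_density_matrix(rho, tol=1e-9):
    """Definition 1.1: self-adjoint, nonnegative, trace 1."""
    return (np.allclose(rho, rho.conj().T, atol=tol)          # self-adjoint
            and np.all(np.linalg.eigvalsh(rho) >= -tol)       # nonnegative
            and abs(np.trace(rho).real - 1.0) < tol)          # trace 1

# For d = 2, det(rho) = (1 - ||a||^2)/4, so rho is a state iff ||a|| <= 1 (eq. (7)):
print(is_density_matrix(rho_bloch([0.3, 0.4, 0.5])))  # True  (inside the Bloch ball)
print(is_density_matrix(rho_bloch([1.0, 1.0, 0.0])))  # False (||a||^2 = 2 > 1)
```

The boundary ||a|| = 1 gives the pure states; the determinant criterion is exactly the inequality (7).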
For the first part, a series of preparatory steps brings us, 'without loss of generality', to a situation that allows exact computations. For simplicity take d = 2. If ρ(θ) lies in the interior of the unit ball, and θ has dimension one or two, one can augment θ with other parameters, raising its dimension to 3. This can be done in such a way that the cross-information elements in the augmented I_Q(θ) are all zero. It then suffices to prove the inequality for θ of dimension 3, and then we may as well use the natural parametrization ρ(θ) = (1/2)(1 + θ · σ) with ||θ|| < 1, since the quantity trace(I_Q⁻¹ F) is invariant under smooth reparametrization. If, on the other hand, ρ(θ) is a pure-state model, we can in the same way, after augmenting θ, assume that θ has dimension 2, and after reparametrization the model is ρ(θ) = (1/2)(1 + θ · σ) with ||θ|| = 1.
For the next preparatory step we need the concepts of refinement and coarsening of a measurement.

Definition 2.1 (Coarsening and refinement) A measurement M with sample space X is a refinement of M′ with sample space Y, and M′ is a coarsening of M, if a measurable function f : X → Y exists with M′(B) = M(f⁻¹(B)).

The result of measuring M′ then has the same distribution as applying f to the outcome of measuring M. It follows that the Fisher information in the outcome of M′ is less than or equal to that in M, since under coarsening of data Fisher information can only decrease. Now we show that any measurement M′ has a refinement M for which M(A) = ∫_A M(x) μ(dx) for some nonnegative operator-valued function M(·) and bounded measure μ, and for which M(x) has rank one for all x, thus M(x) = |x⟩⟨x| for some (not necessarily normalised) vector function |x⟩. Consequently it will suffice to prove the result for such maximally refined measurements M. Start with the measurement M′ with sample space Y. Define a probability measure ν on Y by ν(B) = trace(M′(B))/d; by taking Radon-Nikodym derivatives one can define M′(y) such that M′(B) = ∫_B M′(y) ν(dy). Since the rank of M′(y) is finite, M′(y) = Σᵢ Mᵢ(y), where each Mᵢ(y) has rank one. Now refine the original sample space Y to X = Y × {1, ..., d}, defining M(A × {i}) = ∫_A Mᵢ(y) ν(dy). Equivalently, M(A) = ∫_A Mᵢ(x) μ(dx, di), where μ is the product of ν with counting measure.
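The refinement construction above, splitting each element M′(y) into rank-one pieces and enlarging the sample space accordingly, can be sketched for a discrete POVM; the function and variable names are ours, not the paper's:

```python
import numpy as np

def refine_to_rank_one(povm, tol=1e-12):
    """Split each POVM element M'(y) = sum_i M_i(y) into rank-one pieces
    via its eigendecomposition, refining the sample space Y to pairs (y, i).
    Summing over i for fixed y coarsens the refinement back to M'(y)."""
    refined = []
    for y, M in enumerate(povm):
        vals, vecs = np.linalg.eigh(M)
        for i, (lam, v) in enumerate(zip(vals, vecs.T)):
            if lam > tol:  # keep only genuinely rank-one contributions
                refined.append(((y, i), lam * np.outer(v, v.conj())))
    return refined

# Example: a two-outcome POVM on d = 2 whose elements are not rank one.
M0 = np.array([[0.75, 0.0], [0.0, 0.25]], dtype=complex)
M1 = np.eye(2) - M0
refined = refine_to_rank_one([M0, M1])
# The refinement is again a measurement: its elements still sum to the identity,
total = sum(E for _, E in refined)
print(np.allclose(total, np.eye(2)))  # True
# and coarsening (summing over i for fixed y) recovers the original element:
print(np.allclose(sum(E for (y, i), E in refined if y == 0), M0))  # True
```

Each refined element is of the form λ|v⟩⟨v|, i.e. |x⟩⟨x| with the unnormalised vector |x⟩ = √λ |v⟩, matching the maximally refined form used in the argument.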
274
Richard D. Gill
This brings us to the situation where the model is either full pure-state or full mixed-state, and where the measurement is maximally refined. We take the natural parametrization of either of these models, and without loss of generality work at a point θ where θ = (0,0) or (0,0,ξ). This is possible by the result of Exercise 1.5. Now we have a formula for I_Q and for the derivatives of p with respect to the components of θ, both in the pure and the mixed case, and we have a representation for M in terms of a collection of vectors |x⟩ which must satisfy the normalization constraint ∫_X |x⟩⟨x| μ(dx) = 1 but which are otherwise arbitrary. Both p and I_Q are diagonal. We simply compute trace(I_Q^{-1} I_M) and show that it equals 1 in the case d = 2. We leave the details as an exercise for the diligent reader; the computation is not difficult but does not seem all that illuminating either. We would dearly like to know if there is a more insightful way to get this result! The same arguments work for arbitrary d, though the details are more complicated; a full mixed-state model has (1/2)d(d+1) parameters, a full pure-state model (1/2)d(d+1) − (d−1) parameters, and a careful parametrization is needed to make I_Q diagonal. In the second part (for d = 2 only) it is shown that for any F satisfying trace(I_Q^{-1} F) ≤ 1, one can construct a measurement M for which I_M = F. This measurement will be described in the next section. It typically depends on the point θ, so a multi-stage procedure is going to be necessary to achieve this information bound asymptotically. That will be the main content of the next section, where we do some quantum asymptotics, proving asymptotic optimality results for n → ∞ of the resulting two-stage procedure. We only have partial results for n > 1. In two special cases the available scaled information matrices do not increase as n increases. One of these cases is the case of pure-state models.
This case has been much studied in the literature and is of great practical importance. The other case is when we make a restriction on the class of measurements to measurements of product form (in the literature sometimes also called unentangled measurements). We first define this notion and then explain its significance.
Definition 2.2 (Product-form measurements) We say that a measurement on n copies of a given quantum system is of product form if M^(n)(A) = ∫_A M^(n)(x) μ(dx) for a real measure μ and matrix-valued function M^(n)(x), where M^(n)(x) is of the form M_1(x) ⊗ ··· ⊗ M_n(x), with nonnegative components.

We described in the previous section a measurement procedure whereby we first carried out measurements on some of our n particles, and then, depending on the outcome, carried out other measurements on the remaining particles. Altogether this procedure constitutes one measurement on the
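Definition 2.2 can be illustrated with a minimal numerical example (my own sketch, with arbitrarily chosen spin directions): for a non-adaptive measurement on two separate particles the joint POVM elements are literally Kronecker products of nonnegative single-particle elements, the simplest instance of product form.

```python
import numpy as np

sig = [np.array([[0, 1], [1, 0]], dtype=complex),      # sigma_x
       np.array([[0, -1j], [1j, 0]]),                  # sigma_y
       np.array([[1, 0], [0, -1]], dtype=complex)]     # sigma_z

def spin_povm(u):
    """Two-outcome projective measurement of spin along the unit vector u."""
    u_dot_sigma = sum(ui*si for ui, si in zip(u, sig))
    return [0.5*(np.eye(2) + s*u_dot_sigma) for s in (+1, -1)]

# particle 1 measured along z, particle 2 along x (directions chosen arbitrarily)
M1, M2 = spin_povm((0, 0, 1)), spin_povm((1, 0, 0))
joint = [np.kron(A, B) for A in M1 for B in M2]        # product-form elements
assert np.allclose(sum(joint), np.eye(4))              # completeness on the joint system
```

Adaptive (separable) schemes also have product form once all intermediate outcomes are recorded in the overall outcome, but, as noted below, product-form measurements exist which are not separable.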
joint system of n particles taking values in some n-fold product space. One can conceive of more elaborate schemes where, depending on the results at any stage, one decides, possibly with the help of some classical randomisation, which particle to measure next and how. It would be allowed to measure again a particle which had previously been subject to measurement. There exists a general description of the state of a quantum system after measurement, allowing one to piece all the ingredients together into one measurement of the combined system. A measurement which can be decomposed into separate steps consisting of measurements on separate particles only is called a separable measurement. It turns out that all separable measurements, provided all outcomes of the component steps are encoded in the overall outcome x, have product form. On the other hand, product-form measurements exist which are not separable; see Bennett et al. (1998). The product-form measurements form a large and interesting class, including all measurements which can be carried out sequentially on separate particles, as well as more besides. In the notion of separable measurement it is insisted that all intermediate outcomes are included in the final outcome. If one throws away some of the data, one gets an outcome whose distribution is the same as the distribution of a coarsening of the original measurement. Coarsening of a measurement can obviously destroy the product form, replacing products of nonnegative components by integrals of such products.

Theorem 2.2 (Achievable information matrices, n > 1) The scaled information matrices of measurements on a smooth model p(θ)^⊗n remain within the set {F : trace(I_Q^{-1} F) ≤ 1}:
1. when θ is one-dimensional;
2. in a pure-state spin-half model;
3. in a mixed-state spin-half model with the class of measurements restricted to measurements which can be refined to product form.

The theorem is proved exactly as before, again finishing in an unilluminating calculation.
Since coarsening of a measurement can only decrease Fisher information, one need only consider product-form measurements in the mixed-state case. We have a counterexample to the conjecture that, for mixed states, the bound holds for all measurements. In the case n = 2, at the point p = (1/2)1, there is a measurement for which trace(I_Q^{-1} I_M^(2)/2) = 3/2, thus 50% more information in an appropriate joint measurement of two identical particles than in any combination of separate measurements of the two. What the set of
achievable scaled information matrices looks like, and whether it continues to grow (and to what limit) as n grows, is completely unknown. The measurement has seven elements, the first six of the form (1/2)Π_[ψ], and the seventh Π_[ψ'], where Π_[ψ] denotes the projector onto the one-dimensional subspace spanned by the vector |ψ⟩. The various |ψ⟩ are |+z +z⟩, |−z −z⟩, |+x +x⟩, |−x −x⟩, |+y +y⟩, |−y −y⟩, and |s⟩. By |+z +z⟩ we mean |+z⟩ ⊗ |+z⟩ = |e_z⟩ ⊗ |e_z⟩, and similarly for the next five. The last, |s⟩, is the so-called singlet state (1/√2)(|+z⟩ ⊗ |−z⟩ − |−z⟩ ⊗ |+z⟩). As a pure state of two interacting spin-half particles, this is the famous entangled state resulting in the violation of the Bell inequalities, and hence of locality (according to some interpretations). Here it arises as part of a measurement of two completely non-interacting particles; however, this measurement can never be implemented by doing separate operations on the separate particles. Similar examples occur in the paper of Vidal et al. (1998), extending the pure-state results of Massar and Popescu (1995) to mixed states.
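The completeness of this seven-element measurement is easy to verify numerically. The following check (my own, using the standard spin-half basis states) confirms that the six halved projectors onto the product states plus the singlet projector sum to the identity on the two-particle space.

```python
import numpy as np

kets = {'+z': np.array([1, 0], complex), '-z': np.array([0, 1], complex)}
kets['+x'] = (kets['+z'] + kets['-z'])/np.sqrt(2)
kets['-x'] = (kets['+z'] - kets['-z'])/np.sqrt(2)
kets['+y'] = (kets['+z'] + 1j*kets['-z'])/np.sqrt(2)
kets['-y'] = (kets['+z'] - 1j*kets['-z'])/np.sqrt(2)

def proj(v):
    return np.outer(v, v.conj())

product_states = [np.kron(kets[s], kets[s])
                  for s in ('+z', '-z', '+x', '-x', '+y', '-y')]
singlet = (np.kron(kets['+z'], kets['-z'])
           - np.kron(kets['-z'], kets['+z']))/np.sqrt(2)
povm = [0.5*proj(v) for v in product_states] + [proj(singlet)]
assert np.allclose(sum(povm), np.eye(4))   # a genuine measurement on two particles
```

The six product-state projectors sum to twice the projector onto the symmetric subspace, so halving them and adding the singlet projector gives exactly the identity.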
3  Quantum asymptotics
The results of the previous section are in the form of a bound on the information matrix based on the outcome of any measurement (perhaps restricted to the class of product-form measurements) on n identical copies of a given spin-half quantum system whose state depends on an unknown parameter θ. We will now explain how such a bound can be used to give asymptotic bounds on the quality of estimators based on those measurements. Furthermore, we show how the bounds can be achieved by a two-stage procedure based on simple measurements on separate particles only. As far as achieving the bounds is concerned, only for the full mixed-state model under the natural parametrization is the problem completely solved. For the other models, the results are conjectural. We will discuss two kinds of bounds: firstly, a bound on the limiting scaled mean quadratic error matrix of a well-behaved sequence of estimators, and secondly, a bound on the mean quadratic error matrix of the limiting distribution of a well-behaved sequence of estimators. Each has its advantages and disadvantages. In particular, since the delta method works for (the variance of) limiting distributions but not for limiting mean square errors, stronger conditions are needed to prove optimality of some procedure in the first sense than in the second.
3.1  Two asymptotic bounds
Obviously a bound on the information matrix, by the ordinary Cramér-Rao inequality, immediately implies a bound on the covariance matrix of an unbiased estimator. However, unbiasedness is not a restriction we want to make. It turns out to be much more convenient to work via a Bayesian version of the
Cramér-Rao inequality due to van Trees (1968), as generalised to the multiparameter case by Gill and Levit (1995). For a one-dimensional parameter the van Trees inequality is easy to state: the Bayes quadratic risk is bounded below by one over (expected information plus information in the prior). In the multiparameter case one has a whole collection of inequalities, corresponding to different choices of quadratic loss function and some other parameters, which are more difficult to interpret. Let π(θ) be a prior density for the p-dimensional parameter θ, which we suppose to be sufficiently smooth and supported by a compact and smoothly bounded region of the parameter space; see Gill and Levit (1995) for the precise requirements. Let C(θ) be a p × p symmetric positive-definite matrix (C stands for cost function) and let V^(n)(θ) be the mean quadratic error matrix of a chosen estimator θ^(n) of θ based on a measurement of n copies of the quantum system. Letting Θ denote a random drawing from the prior distribution π, it follows that E trace C(Θ)V^(n)(Θ) is the Bayes risk of the estimator with respect to the loss function (θ^(n) − θ)^T C(θ)(θ^(n) − θ). Let D(θ) be another p × p matrix function of θ. Let I_M^(n)(θ) denote the Fisher information matrix in the measurement. Then the multivariate van Trees inequality reads

(17)  E trace C(Θ)(n V^(n)(Θ)) ≥ (E trace D(Θ))² / (E trace C(Θ)^{-1} D(Θ)(I_M^(n)(Θ)/n) D(Θ)^T + Ĩ(π)/n),

where

(18)  Ĩ(π) = ∫ (1/π(θ)) Σ_i ( Σ_j ∂/∂θ_j [D_ij(θ) π(θ)] )² dθ.
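The scalar version of the inequality just quoted can be checked by simulation. In the sketch below (my own illustration; the Gaussian model and the cosine-squared prior are choices of convenience, not from the paper) the prior π(θ) = cos²(πθ/2) on [−1, 1] has prior information I(π) = π², each N(θ, 1) observation has Fisher information 1, and the Bayes risk of the posterior mean must therefore exceed 1/(n + π²).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 20000
grid = np.linspace(-1, 1, 801)
prior = np.cos(np.pi*grid/2)**2                      # integrates to 1 on [-1, 1]

def sample_prior(size):
    """Acceptance-rejection sampling from the cos^2 prior."""
    out = np.empty(0)
    while out.size < size:
        t = rng.uniform(-1, 1, size)
        out = np.concatenate([out, t[rng.uniform(0, 1, size) < np.cos(np.pi*t/2)**2]])
    return out[:size]

theta = sample_prior(reps)
xbar = theta + rng.standard_normal(reps)/np.sqrt(n)  # sufficient statistic of n obs.
sq_err = np.empty(reps)
for k in range(reps):                                # posterior mean on a grid
    w = prior*np.exp(-0.5*n*(grid - xbar[k])**2)
    sq_err[k] = (np.sum(grid*w)/np.sum(w) - theta[k])**2

risk, bound = sq_err.mean(), 1.0/(n + np.pi**2)
assert risk > bound                                  # van Trees lower bound holds
```

The bound holds for every estimator, not just the posterior mean; the posterior mean merely makes the comparison as tight as possible.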
From Theorem 2.2 we have the bound trace(I_Q^{-1}(θ)(I_M^(n)(θ)/n)) ≤ 1, where, in the mixed case, we restrict attention to measurements refinable to product form. We are going to assume that our sequence of measurements and estimators is such that the normalized mean quadratic error matrix n V^(n)(θ) converges sufficiently regularly to a limit V(θ). Our aim is to transfer the inequality of Theorem 2.2 to V, obtaining the bound trace(I_Q^{-1}(θ)V(θ)^{-1}) ≤ 1. We will do this by making appropriate choices of C and D. We will need regularity conditions both on the sequence of estimators and on the model p(θ) in order to carry over (17) to the limit.

Theorem 3.1 (Asymptotic Cramér-Rao 1) Suppose that on some open set of parameter values θ:
1. n V^(n) converges uniformly to a continuous limit V;
2. I_Q(θ) is continuous with bounded partial derivatives;
3. V and I_Q are non-singular.
Then the limiting normalised mean quadratic error matrix satisfies

(19)  trace(I_Q^{-1}(θ) V(θ)^{-1}) ≤ 1.
We outline the proof of the theorem as follows. First of all, we pick a point θ_0 and define V_0 = V(θ_0). Next we define

(20)  C(θ) = V_0^{-1} I_Q^{-1}(θ) V_0^{-1},

(21)  D(θ) = V_0^{-1} I_Q^{-1}(θ).

With these choices (17) becomes

(22)  E trace V_0^{-1} I_Q^{-1}(Θ) V_0^{-1} (n V^(n)(Θ)) ≥ (E trace V_0^{-1} I_Q^{-1}(Θ))² / (E trace I_Q^{-1}(Θ)(I_M^(n)(Θ)/n) + Ĩ(π)/n).

We can bound the first term in the denominator of the right hand side by 1, by the results of the last section. The second term in the denominator of the right hand side is finite, by our third assumption, and for n → ∞ it converges to zero. By our first assumption (22) converges to

(23)  E trace V_0^{-1} I_Q^{-1}(Θ) V_0^{-1} V(Θ) ≥ (E trace V_0^{-1} I_Q^{-1}(Θ))².

Now replace the prior density π by one in a sequence of priors, concentrating on smaller and smaller neighbourhoods of θ_0. Using the continuity assumptions on V and I_Q, we obtain from (23) the inequality

trace V_0^{-1} I_Q^{-1}(θ_0) V_0^{-1} V_0 ≥ (trace V_0^{-1} I_Q^{-1}(θ_0))².

Since the left hand side equals trace V_0^{-1} I_Q^{-1}(θ_0), this says t ≥ t² for t = trace V_0^{-1} I_Q^{-1}(θ_0), hence t ≤ 1; in other words, with θ = θ_0, the required

(24)  trace(I_Q^{-1}(θ) V^{-1}(θ)) ≤ 1.
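The algebra behind the choices (20) and (21) can be checked mechanically. The following sketch (my own verification with random positive-definite matrices) confirms that with C = V_0^{-1} I_Q^{-1} V_0^{-1} and D = V_0^{-1} I_Q^{-1} the first denominator term of (17) collapses to trace I_Q^{-1}(I_M^(n)/n), the quantity bounded by 1, and that the left hand side weight at θ_0 reduces to trace D.

```python
import numpy as np

rng = np.random.default_rng(1)

def spd(p):
    """Random symmetric positive-definite p x p matrix."""
    A = rng.standard_normal((p, p))
    return A @ A.T + p*np.eye(p)

p = 3
IQ, V0, M = spd(p), spd(p), spd(p)       # M stands in for I_M^(n)(theta)/n
IQi, V0i = np.linalg.inv(IQ), np.linalg.inv(V0)
C = V0i @ IQi @ V0i                      # choice (20)
D = V0i @ IQi                            # choice (21)

# denominator term of (17): trace C^{-1} D M D^T equals trace I_Q^{-1} M
assert np.isclose(np.trace(np.linalg.inv(C) @ D @ M @ D.T), np.trace(IQi @ M))
# at theta_0 the left hand side weight trace C V_0 equals trace D,
# so (23) becomes t >= t^2 with t = trace D, forcing t <= 1
assert np.isclose(np.trace(C @ V0), np.trace(D))
```

Both identities are exact matrix-algebra facts, which is why the proof reduces to the limit interchange handled by the regularity assumptions.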
In some situations it might be more convenient to have a bound on the mean quadratic error of a limiting distribution, assuming one to exist. At the moment of writing we believe the following:

Theorem 3.2 (Asymptotic Cramér-Rao 2) Let Z be distributed with the limiting distribution of √n(θ̂_n − θ). Suppose that:
1. θ̂_n is Hájek regular at θ at root-n rate;
2. the asymptotic mean quadratic error matrix V = E(Z Z^T) is non-singular;
3. I_Q is non-singular.
Then V satisfies

(25)  trace(I_Q^{-1}(θ) V(θ)^{-1}) ≤ 1.
The proof should follow the lines of the similar result in Gill and Levit (1995), with a prior distribution concentrating on a root-n neighbourhood of the truth. We will need similar choices of C and D as in the proof of Theorem 3.1, though the dependence of D on θ can now be suppressed.
3.2  Achieving the asymptotic bounds
At present we have essentially complete results in the full mixed-state spin-half model with the natural parametrization. We believe they can be extended to smooth (C¹) pure- and mixed-state models. Give yourself a target mean quadratic error matrix W(θ) satisfying

(26)  trace(I_Q(θ)^{-1} W(θ)^{-1}) ≤ 1.
Is there a sequence of measurements M^(n) satisfying the conditions of Theorems 3.1 or 3.2 with limiting mean quadratic error matrix V(θ) equal to the target? Possibly we do not start with a target W but a step earlier, with a quadratic cost function. For given C(θ) it is straightforward to compute the matrix W(θ) which minimizes trace C(θ)W(θ) subject to the constraint (26); the solution is

W = trace((I_Q^{-1/2} C I_Q^{-1/2})^{1/2}) · I_Q^{-1/2} (I_Q^{-1/2} C I_Q^{-1/2})^{-1/2} I_Q^{-1/2}.

Now we pose the same question again, with the W we have just calculated as target. Let us call F = W^{-1} the target information matrix. First we pretend θ is known and exhibit a measurement M on a single particle with the target information matrix at the given parameter value. In the previous section we omitted explaining how the bound of Theorem 2.1 can be attained. That theorem stated that, at a given parameter value, for any positive-semidefinite symmetric F satisfying trace(I_Q^{-1} F) ≤ 1 there is a measurement M on a single spin-half particle with I_M = F. What is that measurement? We describe it in the case of a full mixed-state spin-half model with the natural parametrization, thus p(θ) = (1/2)(1 + θ·σ). The matrices I_Q and F are 3 × 3. To start with, we compute the eigenvector-eigenvalue decomposition of I_Q^{-1/2} F I_Q^{-1/2}, obtaining eigenvectors h_i and nonnegative eigenvalues γ_i, say. The condition on F translates to Σ_i γ_i ≤ 1. Now define g_i = I_Q^{1/2} h_i and
three unit vectors u_i = g_i/||g_i||, and finally consider the measurement M taking seven different values, whose elements are γ_i Π(±u_i), i = 1, 2, 3, and (1 − Σ_i γ_i)1. It turns out by a straightforward computation (carried out, without loss of generality, at θ = (0,0,ξ)) that the measurement with the two elements Π(±u_i) has information matrix g_i ⊗ g_i, and hence the measurement M has information matrix Σ_i γ_i g_i ⊗ g_i = F. This seven-outcome measurement can be implemented as a randomized choice between three simple measurements: with probability γ_i measure spin in the direction u_i; with probability 1 − Σγ_i do nothing. However, in practice this measurement is not available, since the directions u_i and probabilities γ_i depend on the unknown θ. We therefore take recourse to the following two-stage measurement procedure. First measure spin in the x, y and z directions on (1/3)n^α each of the particles, where 0 < α < 1 is fixed and the numbers are rounded to whole numbers. The expected relative frequency of 'up' particles in each direction is (1/2)(1 + θ_i), i = 1, 2, 3, so solving 'observed equals expected' yields a consistent preliminary estimator θ̃ of θ. If the estimate lies outside the unit ball, project onto the ball, and stop. With large probability no projection is necessary. We can compute the eigenvalue-eigenvector decomposition of I_Q^{-1/2}(θ̃) F(θ̃) I_Q^{-1/2}(θ̃), leading to fractions γ̃_i and directions ũ_i as above. Measure the spin of a fraction γ̃_i of the remaining particles in the direction ũ_i. Solve again the three (linear) equations 'observed relative frequency equals expected', treating the ũ_i as fixed. Project onto the unit ball if necessary. Call the resulting estimator θ̂. Our claim is that this procedure exhibits a measurement M^(n) on the n particles, and an estimator θ̂^(n) based on its outcome, which satisfies the conditions of Theorem 3.1, with V(θ) equal to the target W(θ).
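The first-stage construction can be written out and checked numerically. In this sketch (my own illustration, not code from the paper; the Bloch-ball expression I_Q = 1 + θθ^T/(1 − |θ|²) for the quantum information at θ and the particular target F are assumptions made for the example) we build the γ_i and u_i as described, verify Σγ_i ≤ 1 and Σ_i γ_i g_i g_i^T = F, check that the seven elements form a POVM, and confirm that measuring spin along u_i contributes classical information g_i ⊗ g_i.

```python
import numpy as np

def mat_power(S, p):
    """Symmetric matrix power via the eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals**p) @ vecs.T

rng = np.random.default_rng(2)
xi = 0.4
theta = np.array([0.0, 0.0, xi])
IQ = np.eye(3) + np.outer(theta, theta)/(1 - xi**2)   # assumed Bloch-ball form of I_Q

B = rng.standard_normal((3, 3))
A = B @ B.T + np.eye(3)
A *= 0.95/np.trace(A)                 # arbitrary target: trace(IQ^{-1} F) = 0.95
F = mat_power(IQ, 0.5) @ A @ mat_power(IQ, 0.5)

gam, H = np.linalg.eigh(mat_power(IQ, -0.5) @ F @ mat_power(IQ, -0.5))
G = mat_power(IQ, 0.5) @ H            # columns are g_i = IQ^{1/2} h_i
U = G/np.linalg.norm(G, axis=0)       # unit directions u_i
assert gam.sum() <= 1 + 1e-9          # feasibility: sum of gamma_i at most 1
assert np.allclose(sum(g*np.outer(v, v) for g, v in zip(gam, G.T)), F)

sig = [np.array([[0, 1], [1, 0]], dtype=complex),
       np.array([[0, -1j], [1j, 0]]),
       np.array([[1, 0], [0, -1]], dtype=complex)]

def spin_proj(u, s):                  # projector onto spin s = +-1 along u
    return 0.5*(np.eye(2) + s*sum(ui*si for ui, si in zip(u, sig)))

povm = [g*spin_proj(u, s) for g, u in zip(gam, U.T) for s in (+1, -1)]
povm.append((1 - gam.sum())*np.eye(2))
assert np.allclose(sum(povm), np.eye(2))   # the seven elements form a POVM

# measuring along u_i yields outcomes +-1 with probabilities (1 +- theta.u_i)/2;
# the classical information matrix of that coin is u u^T/(1 - (theta.u)^2) = g g^T
for g_vec, u in zip(G.T, U.T):
    assert np.allclose(np.outer(u, u)/(1 - (theta @ u)**2), np.outer(g_vec, g_vec))
```

The last check is the point of the construction: the normalisation ||g_i||² = 1/(1 − (θ·u_i)²) comes out automatically from the assumed form of I_Q.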
Thus the bound of Theorem 3.1 is also achievable, and a measurement which does this has been explicitly described above. Apart from projecting onto the unit ball, the estimator involves only linear operations on binomial variables, so it is not difficult to analyse explicitly. We need a preliminary sample size ñ of order n^α and not, for example, of order log n, in order to control the scaled mean quadratic error of the estimator. There is an exponentially small probability (in ñ, not in n) that the preliminary estimate is outside a given neighbourhood of the truth, and hence that the scaled quadratic error is of order n. One can further check that the estimator we have described also satisfies the conditions of Theorem 3.2. Possibly one is interested in a different parametrization of the model. Under a smooth (C¹) reparametrization, the delta method allows us to maintain optimality in the sense of Theorem 3.2. However, optimality in the sense
of Theorem 3.1 could be destroyed; in order for it to be maintained the reparametrization should also be bounded. Alternatively one must modify the estimator by a truncation at a level increasing slowly enough to infinity with n; cf. Schipper (1997, Section 4.4) or Levit and Oudshoorn (1993) for examples of the technique. This approach can be extended to other spin-half models. The difficulties are exemplified by the case of the two-parameter full pure-state spin-half model. Locally, consider the natural parametrization θ = (θ_1, θ_2), θ_3 = (1 − θ_1² − θ_2²)^{1/2}, p = p(θ) at the point θ = (0, 0). The quantum information matrix for the three parameters θ_1, θ_2, θ_3 contains an infinite element. However, the recipe outlined above continues to work if we add to a given 2 × 2 target information matrix a third zero row and column: infinities always get multiplied by zero. The third fraction γ_3 equals zero, so simple measurements in just two directions suffice. The resulting procedure involves linear operations on binomial counts, projecting onto S, and reparametrization. Under some smoothness we should finish with an estimator optimal in the sense of Theorem 3.2; under further smoothness, boundedness, and a sufficiently large preliminary sample, optimality in the sense of Theorem 3.1 should also hold. If the target information matrix includes some zeros, i.e., one is not interested at all in certain parameters, the results should still go through; the preliminary sample should then be of size of order n^α, 1/2 < α < 1, in order that the uncertainty in the initial estimate of the 'nuisance parameters' does not contaminate the final result.
4  Non-locality without entanglement
It would take us too far afield here to explain the notions of entanglement and of non-locality. For some kind of introduction see Kümmerer and Maassen (1998), Gill (1998), and Gill (2000); see also the books of Peres (1995), Isham (1995), Penrose (1994), and Maudlin (1994). However, we would like to discuss whether or not our finding, that non-separable joint measurements on several independent (non-entangled) quantum particles can yield more information than any separate measurements on the separate particles, should be considered surprising. Recall that separable measurements, cf. Bennett et al. (1998), are measurements which can be decomposed into a sequence of measurements on separate particles, each measurement possibly depending on the outcome of the preceding ones, and whereby it is allowed to measure further a particle which has already been measured (and hence has had its state altered in a particular way) at an earlier step. From a mathematical point of view there should not be much surprise. The class of separable measurements is contained in the class of product-form measurements, which is clearly a very small part of the space of all
measurements whatsoever. The optimisation problem of maximising Fisher information (more precisely, some scalar functional thereof) can only have a larger maximum when we optimise over a larger space. The surprise for the mathematician is rather that for pure states, and for one-dimensional parameters, there is no gain in joint measurements. And it is strange that mixed states should exhibit this phenomenon whereas pure states do not: the difference is classical probabilistic mixing, which should not lead to nonclassical behaviour. However, physicists are and should be surprised. The reason is connected to the feeling of many physicists that the randomness in measurement of a quantum system should have a deterministic explanation (Einstein: "God does not throw dice"). We appreciate very well that tossing a coin is essentially a completely deterministic process. It is only uncontrolled variability in initial conditions which leads to the outcome appearing to be completely random. Might it be the case also that the randomness in the outcome of a measurement of a quantum system is 'merely' the reflection of statistical variability in some initial conditions? Such hidden variables are so called because at present no physicist knows what these lower-level variables might be, and there is no known way to measure them directly. In fact there already exist arguments aplenty that if there is a deterministic hidden layer beneath quantum theory, it violates other cherished physical intuitions, in particular the principle of locality; see again Kümmerer and Maassen (1998) and Gill (1998) for some introduction to the phenomenon of entanglement, and further references. But let us ignore that evidence and consider the new evidence from the present results. Consider two identical copies of a given quantum state. Suppose there were a hidden deterministic explanation for the randomness in the outcome of any measurement on either or both of these particles.
Such an explanation would involve hidden variables ω_1, ω_2 specifying the hidden state of the two particles. Since applying separate measurements to the two systems produces independent outcomes, and since the outcomes of the same measurements are identically distributed, one would naturally suppose that these two variables are independent and identically distributed. Their distributions would of course depend on the unknown parameter θ. Now when we measure the joint system, there could be other sources of randomness in our experiment, possibly even quantum randomness, but still these would not have a distribution depending on θ. So let us assume there is a third random element ω_M such that the outcome of the measurement M on the system p(θ) ⊗ p(θ) is a deterministic function of ω_1, ω_2 and ω_M; the first two are independent and identically distributed, with marginal distributions depending on θ, while the distribution of ω_M given the other two does not depend on θ. Thus the random outcome X of the measurement of M is just X(ω_1, ω_2, ω_M), a random variable on the probability space (Ω × Ω × Ω_M, (P_θ × P_θ) ⋆ P_M), where P_M is some Markov kernel from Ω × Ω to Ω_M. Now it is well known from ordinary statistics that the Fisher information about θ in the distribution of any random variable defined on this space is at most twice the information in one observation of ω_1 itself, seen as a random variable defined on (Ω, P_θ). Thus if one could realise any Ω_M, P_M and any X whatsoever by suitable choice of measurement M, achievable Fisher information would be additive! What can we conclude from the fact that achievable Fisher information is not additive? If they exist, the hidden variables are so well hidden that we cannot uncover them from any measurements on single particles; i.e., it is not possible to realise any (Ω_M, P_M) and any X whatever by appropriate choice of experimental set-up. However, we can apparently uncover the hidden variables better from appropriate measurements on several particles brought together, even though these particles have nothing whatever to do with one another: their hidden variables are independent and identically distributed. Alternatively the explanation must be found in some pathological non-measurability or non-regularity of the statistical model we have just introduced. Whichever escape route one chooses, it is clear that if there is a deterministic explanation for quantum randomness, it has to be a very weird explanation. God throws rather peculiar dice.
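The classical fact invoked here, that no outcome manufactured from ω_1, ω_2 and θ-independent auxiliary randomness can carry more than twice the per-copy Fisher information, is an instance of the data-processing inequality. A toy check (my own example, with Bernoulli hidden variables standing in for ω_1, ω_2):

```python
import numpy as np

for theta in np.linspace(0.05, 0.95, 19):
    i_one = 1/(theta*(1 - theta))          # Fisher information of one Bernoulli(theta)
    # X = max(omega_1, omega_2): a deterministic function of the two copies
    p = 1 - (1 - theta)**2                 # P(X = 1);  dp/dtheta = 2(1 - theta)
    i_x = (2*(1 - theta))**2/(p*(1 - p))   # Fisher information of X
    assert i_x <= 2*i_one + 1e-12          # never exceeds twice the per-copy info
```

The quantum result above says precisely that this additivity bound can fail for the achievable information of joint measurements on mixed states, which is what makes a hidden-variable account so awkward.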
Acknowledgements. This paper is based on work in progress together with O.E. Barndorff-Nielsen and with S. Massar. I am grateful for the hospitality of the Department of Mathematics and Statistics, University of Western Australia. I would like to thank Boris Levit for his patient advice.
REFERENCES

Barndorff-Nielsen, O.E. and Gill, R.D. (2000). Fisher information in quantum statistics. To appear in J. Phys. A. Preprint quant-ph/9802063, http://xxx.lanl.gov

Belavkin, V.P. (1976). Generalized uncertainty relations and efficient measurements in quantum systems. Teoreticheskaya i Matematicheskaya Fizika 26, 316-329; English translation, 213-222. Based on the author's candidate's dissertation, Moscow State University, 1972.

Bennett, C.H., DiVincenzo, D.P., Fuchs, C.A., Mor, T., Rains, E., Shor, P.W., Smolin, J.A., and Wootters, W.K. (1998). Quantum nonlocality without entanglement. To appear in Phys. Rev. A. Preprint quant-ph/9804053, http://xxx.lanl.gov
Biane, P. (1995). Calcul stochastique non-commutatif. pp. 4-96 in: Lectures on Probability Theory: Ecole d'Eté de Saint-Flour XXIII-1993, P. Biane and R. Durrett, Springer Lecture Notes in Mathematics 1608.

Braunstein, S.L. and Caves, C.M. (1994). Statistical distance and the geometry of quantum states. Physical Review Letters 72, 3439-3443.

Brody, D.C. and Hughston, L.P. (1998). Statistical geometry in quantum mechanics. Proceedings of the Royal Society of London Series A 454, 2445-2475.

Gill, R.D. (2000). Lecture Notes on Quantum Statistics. http://www.math.uu.nl/people/gill/Preprints/book.ps.gz

Gill, R.D. (1998). Critique of 'Elements of quantum probability'. Quantum Probability Communications 10, 351-361.

Gill, R.D. and Levit, B.Y. (1995). Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli 1, 59-79.

Gill, R.D. and Massar, S. (2000). State estimation for large ensembles. To appear in Phys. Rev. A. Preprint quant-ph/9902063, http://xxx.lanl.gov

Helstrom, C.W. (1967). Minimum mean-square error of estimates in quantum statistics. Phys. Letters 25A, 101-102.

Helstrom, C.W. (1976). Quantum Detection and Estimation Theory. Academic, New York.

Holevo, A.S. (1982). Probabilistic and Statistical Aspects of Quantum Theory. North Holland, Amsterdam.

Holevo, A.S. (1983). Bounds for generalized uncertainty of the shift parameter. Springer Lecture Notes in Mathematics 1021, 243-251.

Holevo, A.S. (1999). Lectures on Statistical Structure of Quantum Theory. http://www.imaph.tu-bs.de/skripte_.html

Isham, C. (1995). Quantum Theory. World Scientific, Singapore.

Kümmerer, B. and Maassen, H. (1998). Elements of quantum probability. Quantum Probability Communications 10, 73-100.

Levit, B.Y. and Oudshoorn, C.G.M. (1993). Second order admissible estimation of variance. Statistics and Decisions, supplement issue 3, 17-29.

Malley, J.D. and Hornstein, J. (1993). Quantum statistical inference. Statistical Science 8, 433-457.

Massar, S. and Popescu, S. (1995). Optimal extraction of information from finite quantum ensembles. Physical Review Letters 74, 1259-1263.

Maudlin, T. (1994). Quantum Non-locality and Relativity. Blackwell, Oxford.

Meyer, P.A. (1986). Eléments de probabilités quantiques. pp. 186-312 in: Séminaire de Probabilités XX, ed. J. Azema and M. Yor, Springer Lecture Notes in Mathematics 1204.

Penrose, R. (1994). Shadows of the Mind: a Search for the Missing Science of Consciousness. Oxford University Press.
Percival, I. (1998). Quantum State Diffusion. Cambridge University Press.

Peres, A. (1995). Quantum Theory: Concepts and Methods. Kluwer, Dordrecht.

Peres, A. and Wootters, W.K. (1991). Optimal detection of quantum information. Physical Review Letters 66, 1119-1122.

Preskill, J. (1997). Quantum Information and Computation. http://www.theory.caltech.edu/people/preskill/ph299

Schipper, C.M.A. (1997). Sharp Asymptotics in Nonparametric Estimation. PhD thesis, University of Utrecht, ISBN 90-393-1208-7.

Stratonovich, R.L. (1973). The quantum generalization of optimal statistical estimation and testing hypotheses. Stochastics 1, 87-126.

van Trees, H.L. (1968). Detection, Estimation and Modulation Theory (Part 1). Wiley, New York.

Vidal, G., Latorre, J.I., Pascual, P., and Tarrach, R. (1998). Optimal minimal measurements of mixed states. Preprint quant-ph/9812068, http://xxx.lanl.gov

Werner, R.F. (1997). Quantum Information and Quantum Computing. http://www.imaph.tu-bs.de/skripte_.html

MATHEMATICAL INSTITUTE
UNIVERSITY OF UTRECHT
Box 80010
3508 TA UTRECHT
NETHERLANDS
[email protected]
ADAPTIVE CHOICE OF BOOTSTRAP SAMPLE SIZES
FRIEDRICH GÖTZE AND ALFREDAS RAČKAUSKAS
University of Bielefeld and University of Vilnius

Consider sequences of statistics T_n(P_n, P) of a sample of size n and the underlying distribution. We analyze a simple data-based procedure proposed by Bickel, Götze and van Zwet (personal communication) to select the sample size m = m_n < n for the bootstrap sample of type "m out of n" such that the bootstrap sequence T_m* for these statistics is consistent and the error is comparable to the minimal error in that selection knowing the distribution P. The procedure is based on minimizing the distance between L_m(P_n) and L_{m/2}(P_n), where L_m(P_n) denotes the distribution of T_m*.

AMS subject classifications: 62D05.
Keywords and phrases: Bootstrap, m out of n bootstrap, Edgeworth expansions, model selection.
1  Introduction
In this paper, we investigate an adaptive choice of the bootstrap sample size m when sampling m times independently, with (resp., without) replacement, from an i.i.d. sample of size n. To simplify the writing we shall abbreviate the notion of m out of n sampling as the moon bootstrap. Assume that the random elements X_1, ..., X_n, ... are independent and identically distributed from a distribution P on a measurable space (S, A). Let P_n denote the empirical measure of the first n observations X_1, ..., X_n. Throughout we assume that P ∈ P_0 ⊂ P, where P_0 is a set of probability measures on (S, A) containing all empirical measures P_n. Let T_n = T_n(X_1, ..., X_n; P) denote a sequence of statistics, possibly dependent on the unknown distribution P in order to ensure that T_n is weakly convergent to some limiting distribution as n tends to infinity. A typical example is given by T_n = n^α(F(P_n) − F(P)), where F : P_0 → R denotes a functional on P_0 and α > 0 is an appropriate normalization rate. We are interested in the estimation of the distribution function (d.f.) L_n(P, a) of T_n by means of resampling methods.
1. This research was supported by the Deutsche Forschungsgemeinschaft, SFB 343.
2. This research was supported by the Deutsche Forschungsgemeinschaft, SFB 343, and the Institute of Mathematics and Informatics, Lithuania.
The nonparametric bootstrap estimates the d.f. L_n(P, a) by the plug-in method, that is, by the conditional d.f.

L_n(P_n, a) = P{T_n(X_1*, ..., X_n*; P_n) ≤ a | X_1, ..., X_n},
where X_1*, ..., X_n* is a bootstrap sample from the empirical distribution P_n. One of the major problems for the nonparametric bootstrap estimate L̂_n = L_n(P_n) is its consistency. Various types of consistency can be considered. Usually, if d denotes a certain distance on the set of all distribution functions, then L̂_n is said to be d-consistent (or, simply, consistent, when d is fixed) in probability (resp., a.s.) provided that d(L̂_n, L_n(P)) → 0 as n → ∞ in probability (resp., a.s.). Conditions ensuring consistency were considered, e.g., by Bickel and Freedman (1981), Bretagnolle (1983), Athreya (1987) and Beran (1982, 1997). Extensive references and details on various bootstrap methods can be found in the recent monograph by Davison and Hinkley (1998). A number of examples where the bootstrap fails to be consistent, together with positive results, suggest that the consistency of the bootstrap estimate L̂_n requires the following conditions: 1) for any Q from a neighborhood V(P) of P, L_n(Q) has to converge weakly to a limit L(Q), say, and the convergence has to be uniform on V(P); 2) the function Q → L(Q) has to be continuous. The moon bootstrap with replacement (shortly, the m/n-bootstrap) estimates the d.f. L_n(P, a) by L_m(P_n, a), whereas the moon bootstrap without replacement (shortly, the (n choose m)-bootstrap) estimates L_n(P, a) by
L_m^*(P_n; a) = (n choose m)^{−1} Σ_{1 ≤ i_1 < ... < i_m ≤ n} 1{T_m(X_{i_1}, ..., X_{i_m}; P_n) ≤ a}.
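For small n and m, the (n choose m)-average defining L_m^*(P_n; a) can be evaluated exactly by enumerating all subsamples rather than by Monte Carlo. A toy sketch (helper names are ours):

```python
from itertools import combinations

def subsample_cdf_exact(sample, stat, m, a):
    """Exact L*_m(P_n; a): the average of 1{T_m(X_{i1},...,X_{im}; P_n) <= a}
    over all (n choose m) subsamples drawn without replacement."""
    subsets = list(combinations(sample, m))
    hits = sum(1 for sub in subsets if stat(sub, sample) <= a)
    return hits / len(subsets)
```

For example, with sample (1, 2, 3, 4), m = 2 and the subsample mean as statistic, the six subsample means are 1.5, 2, 2.5, 2.5, 3, 3.5, so L_2^*(P_n; 2) = 2/6.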
Under very weak conditions, the moon bootstrap resolves problems of consistency of the classical bootstrap by choosing bootstrap sample sizes m = o(n). It was first suggested by Bickel and Freedman (1981) and investigated in Bretagnolle (1983), Götze (1993), resp. Bickel, Götze and van Zwet (1997), Shao (1994), Politis and Romano (1994) (examples of nonregular statistics), Swanepoel (1986), Deheuvels, Mason and Shorack (1993) (extreme value statistics), Shao (1996) (model selection), Datta and McCormick (1995), Datta (1996), Heimann and Kreiss (1996) (first order autoregression models), Athreya (1987), Arcones and Giné (1989) (heavy tailed population distributions). If d denotes a distance between distribution functions (e.g., the Kolmogorov, Lévy-Prokhorov, or bounded Lipschitz distance), then a measure of risk in estimating L_n(P) by some estimator L̂_n is given by
E_P d(L̂_n, L_n(P)). For L̂_n being the moon bootstrap estimator L_m(P_n), the 'generic' nonparametric case is described by the nonconsistency of this estimator for m ~ n
Friedrich Götze and Alfredas Rackauskas
due to the essential randomness of its limit distribution under P_n for such m. Introducing h ~ n/m as a 'bandwidth' type parameter in this nonparametric estimation problem, the case h ~ 1 is characterized by the fact that the variance of the bootstrap estimate may not tend to zero as n tends to infinity. On the other hand, for large values of h the variance decreases, in many cases of order O(h^{−1}). Since the moon bootstrap actually estimates L_m(P), the difference d(L_m(P), L_n(P)) will be significant for m/n small (or h large) and contributes a bias term which dominates the estimation error in this case. Thus, as in most nonparametric problems, one has to look for a tradeoff choice of m minimizing the estimation error. On the other hand, for 'parametric' problems where the bootstrap works, like the estimation of the distribution of Student's test statistic under the hypothesis, one can show by higher order approximations that the bias as well as the variance of L_m(P_n) essentially decrease as m grows up to m ~ n; see Hall (1992). One would like to find a common recipe for choosing m effectively in both nonparametric and parametric situations, in order to obtain a uniformly consistent and effective estimate of the distribution L_n(P). One way would be to look for a sample size m minimizing some cross-validation measure given by a jackknife estimate related to the risk under the unknown distribution. This has been suggested by Datta and McCormick (1995) in a first order autoregression problem. Unfortunately, this method is computationally rather involved, and the performance of this scheme is difficult to analyze. Bickel, Götze and van Zwet (personal communication) suggested to base the choice of m on the discrepancy between L_m(P_n) and L_{m/2}(P_n). We motivate this idea by showing that the (random) distance between L_m(P_n) and L_n(P), as a function of m, is stochastically equivalent to the (random) distance between L_m(P_n) and L_{m/2}(P_n) as n → ∞ and m = m(n) → ∞.
More precisely, consider, for some distance d between distributions (like, e.g., Kolmogorov's distance),

(1.1)  Δ_m := d(L_m(P_n), L_n(P))  and  Δ̂_m := d(L_m(P_n), L_{m/2}(P_n)).
In Theorem 2.3 we prove that, under certain conditions, for some model dependent rate 0 < α < 1 (like α = 1/2, 1/3, 1/4, etc.) and sequences m(n) such that the limit γ := lim_n m(n)/n^α ∈ [0, ∞] exists, we have

(1.2)  Δ_m / Δ̂_m →_D ξ_γ,

where ξ_0 and ξ_∞ are constants depending on P and, for 0 < γ < ∞, ξ_γ denotes a random variable depending on P. Here →_D denotes, as usual, convergence in distribution. Typically, we find that E_P Δ_m / E_P Δ̂_m → c_γ(P), where c_γ(P) is a constant depending on γ and P.
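The selection rule suggested by this result — compare the bootstrap distribution at sample size m with the one at sample size m/2 and minimize the discrepancy over m — can be sketched directly. This is a simplified illustration with our own helper names: it uses the statistic T_m^* = m(X̄_m^*)² of Example 1.1 below, and approximates the Kolmogorov distance on a fixed grid rather than exactly.

```python
import bisect
import random

def ecdf_on_grid(xs, grid):
    xs = sorted(xs)
    return [bisect.bisect_right(xs, g) / len(xs) for g in grid]

def kolmogorov_on_grid(xs, ys, grid):
    """sup_a |F(a) - G(a)|, approximated on a fixed grid of points a."""
    return max(abs(u - v) for u, v in
               zip(ecdf_on_grid(xs, grid), ecdf_on_grid(ys, grid)))

def boot_stats(sample, m, B, rng):
    """B Monte Carlo draws of T*_m = m (bar X*_m)^2 under m-out-of-n
    resampling with replacement."""
    n = len(sample)
    out = []
    for _ in range(B):
        xbar = sum(sample[rng.randrange(n)] for _ in range(m)) / m
        out.append(m * xbar * xbar)
    return out

def adaptive_m(sample, candidates, B=300, seed=1):
    """m* = argmin_m d(L_m(P_n), L_{m/2}(P_n)) over the candidate list,
    with d a gridded Kolmogorov distance."""
    rng = random.Random(seed)
    grid = [i / 4 for i in range(41)]  # evaluation points in [0, 10]
    return min(candidates,
               key=lambda m: kolmogorov_on_grid(
                   boot_stats(sample, m, B, rng),
                   boot_stats(sample, m // 2, B, rng), grid))
```

The grid and the Monte Carlo size B are tuning choices of this sketch; the paper's Δ̂_m is defined with an exact distance d.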
Based on these observations, we suggest m* = argmin_{2≤m≤n} Δ̂_m as a random choice of m for the m/n-bootstrap. We will show that this choice is as good as choosing the optimal m when knowing the unknown distribution P, as long as m/n → 0 holds. Simulations show that the method works in the region m ~ n as well, but its behavior in this region is difficult to analyze for general models of distributions. The reasons for such a choice are illustrated by the following simple example.

Example 1.1 Consider the statistic T_n = T_n(X_1, ..., X_n; P) = n(X̄_n)², where X_1, ..., X_n is an i.i.d. sample from a distribution P on the real line with zero mean, and let X̄_n denote the sample mean. Let L_n(P; r) denote the d.f. of T_n. The corresponding m/n-bootstrap approximation is the d.f. L_m(P_n; r) of the statistic T_m^* = m(X̄_m^*)², where X_1^*, ..., X_m^* is a sample from the empirical distribution P_n, and X̄_m^* denotes the corresponding sample mean. Assume that E X_1^4 < ∞ and that P satisfies Cramér's condition of smoothness. Consider the uniform errors introduced in (1.1), based on the Kolmogorov distance d. Let Y denote a standard normal random variable and assume that the sequence m = m(n) is chosen such that m → ∞ and m/n → 0. In Section 4 we prove the following. If m/n^{1/2} → ∞, then

(1.3)
(n/m)(Δ_m, Δ̂_m) →_D (c_1 Y², c_1 Y²/2),

where c_1 denotes an absolute constant. If m/n^{1/2} → 0, then

(1.4)  m Δ_m → c_0(P) and m Δ̂_m → c_0(P) in probability,

where c_0(P) is a constant depending on P. If m/n^{1/2} → c = const ≠ 0, then

(1.5)  n^{1/2}(Δ_m, Δ̂_m) →_D (f_0(Y), f_1(Y)),

where f_0, f_1 are certain measurable functions of Y (see Section 4 for details). Thus, if lim m/n^{1/2} = γ ∈ [0, ∞] exists, then (1.3)-(1.5) imply (1.2). Moreover, ξ_0 = 1 and ξ_∞ = 1/2. Under the same conditions we obtain as well E_P Δ_m / E_P Δ̂_m → c_γ(P).
Note that the value of m obtained in this way by minimizing Δ̂_m strongly depends on the particular sample. For instance, if by chance in this example X̄_n approximates the true value E X_1 = 0 very accurately, that is, n^{1/2} X̄_n = o(1) (which happens rarely), the approximation of L_n(P) by the random bootstrap distribution L_n(P_n) might be accidentally precise as well (compare (3.1) and (3.2)). In this case the bias as well as the variance of the estimate L_m(P_n) will decrease with m, which leads to a choice of a large sample size
m ~ n. Such a case cannot be detected by an average criterion for the choice of m like, say, E_P d(L_m(P_n), L_n(P)), which would lead to a much less adaptive and accurate choice of m for such an exceptional sample. Similar arguments apply to the moon bootstrap without replacement. We will show under certain conditions (see Theorems 2.1, 2.3 and Remark 2.1) that, for L_2-distances d between distribution functions, the random distance Δ_m^* = d(L_m^*(P_n), L_n(P)) is stochastically equivalent to the distance Δ̂_m^* = d(L_m^*(P_n), L_{m/2}^*(P_n)). More precisely, for some model dependent rate α, 0 < α < 1, and sequences m(n) such that γ = lim_n m(n)/n^α ∈ [0, ∞] exists, we have

Δ_m^* / Δ̂_m^* →_D η_γ,

where η_0 is a constant depending on P and, for 0 < γ < ∞, η_γ is a random variable depending on P. Typically we get E_P Δ_m^* / E_P Δ̂_m^* → c_γ (see Remark 2.1). This motivates m* = argmin_m Δ̂_m^* as a random choice of m for the (n choose m)-bootstrap.
Figure 1. Smoothed graphs of the functions m → Δ_m (dotted line) and m → Δ̂_m (solid line) from Example 1.1, where P is a centered χ²₁ distribution. Sample size n = 400.

In Figure 1 we consider Example 1.1. It shows (smoothed) graphs of the functions m → Δ_m (dotted line) and m → Δ̂_m (solid line), where P is a centered χ²₁ distribution. The simulations were done for a sample size of n = 400. In Figure 2 the true and estimated Kolmogorov distances m → Δ_m^* and m → Δ̂_m^* are smoothed and plotted based on an individual sample of size n = 400 and m ≤ n/2 when sampling without replacement. The first plot shows the behavior of sampling without replacement in the setup of Figure 1 with P a centered χ²₁ distribution. The second and third plots represent a parametric case: Student's t-test with P = N(0,1) and sampling with/without replacement. The fourth one represents a nonparametric case: distances for the normalized distribution of the largest order statistic for P = Uniform(0,1) and sampling
[Figure 2 comprises four panels — 'Sum χ²₁, without repl.', 'Student's T, with repl.', 'Student's T, without repl.', and 'Minimum uniform, without repl.' — each plotting the exact and estimated distances against m.]

Figure 2. True and estimated Kolmogorov distances m → Δ_m^* and m → Δ̂_m^*, smoothed and plotted based on an individual sample of size n = 400 and m ≤ n/2 when sampling without replacement.
without replacement; see also Example 2.2 below. The paper is organized as follows. In Section 2 we investigate the moon bootstrap without replacement. To this aim, Hoeffding expansions for U-statistics are used for m/n = o(1) in order to evaluate the error of the random approximations. In Section 3 we investigate as well the moon bootstrap with replacement, representing our statistics in terms of empirical processes. Here, following Beran (1997), we require that the sequence of statistics should be locally asymptotically weakly convergent. Furthermore, we shall use Edgeworth expansions to prove the stochastic equivalence of the random distances in the examples studied in this paper. Finally, Section 4 contains the proofs of our results. Throughout the paper we write m ∈ n(α, γ) to indicate that m = m(n) is a sequence such that m → ∞, m/n → 0 and lim_n m/n^α = γ exists, allowing γ ∈ [0, ∞].
2  Moon bootstrap without replacements
In this section we let X_1, ..., X_n denote a sample of i.i.d. random elements
from an unknown distribution P on a measurable space (S, A), and let T_n = T_n(X_1, ..., X_n; P) denote a sequence of statistics with distribution L_n(P). We assume that T_n converges in distribution to a random variable T_∞. Let θ_n(P; a) = E h(T_n; a), where h : R × R → R is a real measurable bounded function, denote a family of parameters indexed by a ∈ R. The moon bootstrap without replacement estimates θ_n(P; a) by
θ̂_{mn}(P_n; a) = L_m^*(P_n) h(·; a) = (n choose m)^{−1} Σ_{1 ≤ i_1 < ... < i_m ≤ n} h(T_m(X_{i_1}, ..., X_{i_m}; P_n); a).

As distance d between L_n(P) and L_m^*(P_n) we choose the L_2-distance between θ_n(P; a) and θ̂_{mn}(P_n; a), writing

Δ_m^* = (∫_R (θ̂_{mn}(P_n; a) − θ_n(P; a))² μ(da))^{1/2},
Δ̂_m^* = (∫_R (θ̂_{mn}(P_n; a) − θ̂_{Mn}(P_n; a))² μ(da))^{1/2},  M = m/2,

where μ is a probability measure. For indicator functions h(x; a) = 1{x ≤ a}, Δ_m^* reduces to the integrated square error between the distribution function L_n(P; a) and its (n choose m)-bootstrap estimator L_m^*(P_n; a). For special discrete measures μ, Δ_m^* may then be written as a finite sum.
We shall give conditions which ensure the stochastic equivalence of Δ_m^* and Δ̂_m^*. First, we impose some restrictions on the sequence of statistics T_n.

Assumption (I). There exist measurable functions κ, ξ_m : S × R → R such that

E(h(T_m; a) | X_1) = θ_m(P; a) + m^{−1/2} κ(X_1; a) + ξ_m(X_1; a),

where ∫_R E κ²(X_1; a) μ(da) < ∞ and ∫_R E ξ_m²(X_1; a) μ(da) = o(m^{−1}).

Assumption (J). The quantity ∫_R (θ̂_{mn}(P_n; a) − θ_{mn}(P; a))² μ(da) is of the order O_P(m/n).

... converges to L(P, G_P ∘ g_Q) weakly.
Example 3.3 The fact that a characteristic function is real if and only if the corresponding distribution function is symmetric at 0 suggests, for testing symmetry, the statistic

T_n = T_n(X_1, ..., X_n; Q) = n ∫_{−∞}^{+∞} (2 Im(ĉ_{Q,n}(t)) + a(t))² g(t) dt,

where ĉ_{Q,n}(t) denotes the empirical characteristic function corresponding to the distribution Q, g is an integrable weight function, and a(t) satisfies ∫_{−∞}^{+∞} a²(t) g(t) dt < ∞. This statistic T_n is locally F-weakly convergent at any symmetric distribution P, when the class F is chosen as F = {x → cos(tx), t ∈ R}. A parametric version of the following proposition is given in Beran (1997).

Proposition 2.2 Assume that F is P-Donsker and that the statistic T_n is locally F-weakly convergent at P. Let {m(n), n ≥ 1} denote any sequence of positive integers such that m(n) → ∞ and m(n)/n → 0 as n → ∞. Then L_{m(n)}(P_n) is d-consistent in probability, where d is any metric metrizing weak convergence.

Proof We have

‖m^{1/2}(P_n ∘ g_n − P)‖_F → 0 a.s. as n → ∞.

By Definition 3.1, L_m(P_n ∘ g_n(ω)) converges weakly to L(P, 0) for almost all ω ∈ Ω. This completes the proof. □

Next we investigate the stochastic equivalence of the random distances d(L_m(P_n), L_n(P)) and d(L_m(P_n), L_{m/2}(P_n)), where d is either the Kolmogorov or the bounded Lipschitz distance. A unified way to consider both distances is to consider the more general class of uniform distances over a class H, say, of measurable bounded functions. Define, for distributions F, Q, the uniform distance

d_H(F, Q) = sup_{h∈H} |Fh − Qh|.

Indeed, if H is chosen as the class of indicator functions 1{(−∞, a]}, a ∈ R, then d_H(F, Q) will coincide with the Kolmogorov distance. If H is the class of measurable functions h : R → R such that sup_a |h(a)| + sup_{a≠b} |h(a) −
h(b)|/|a − b| ≤ 1, then d_H(F, Q) corresponds to the bounded Lipschitz metric. Note that distances d_H, where H consists of higher order smooth functions, have been used as well for investigating the accuracy of the bootstrap. Write

Δ_{H,m} = d_H(L_m(P_n), L_n(P)),  Δ̂_{H,m} = d_H(L_m(P_n), L_{m/2}(P_n)).
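Both instances of d_H singled out here can be computed for empirical distributions on the real line. A sketch (helper names are ours; the second function computes only the Lipschitz part of the bounded-Lipschitz supremum, which on R has the well-known representation as the area between the two empirical d.f.s):

```python
import bisect

def d_kolmogorov(xs, ys):
    """Kolmogorov distance sup_a |F(a) - G(a)| between two empirical
    d.f.s, computed exactly by checking the pooled data points."""
    sx, sy = sorted(xs), sorted(ys)
    return max(abs(bisect.bisect_right(sx, a) / len(sx)
                   - bisect.bisect_right(sy, a) / len(sy))
               for a in sx + sy)

def d_lipschitz(xs, ys):
    """sup over 1-Lipschitz h of |Fh - Gh| (Wasserstein-1 on R):
    the integral of |F - G| between consecutive pooled points."""
    sx, sy = sorted(xs), sorted(ys)
    pts = sorted(set(sx) | set(sy))
    area = 0.0
    for left, right in zip(pts, pts[1:]):
        fx = bisect.bisect_right(sx, left) / len(sx)
        fy = bisect.bisect_right(sy, left) / len(sy)
        area += abs(fx - fy) * (right - left)
    return area
```

The true bounded Lipschitz distance also imposes the sup-norm bound on h; the Lipschitz-only value above is an upper bound for it.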
Theorem 2.3 Suppose that the sequence of statistics satisfies assumptions (A), (B), (C), and (D), which are stated and discussed below. Assume furthermore that F is a P-Donsker class, and that ‖ν_{P,n}‖_F is uniformly integrable. For sequences m ∈ n(1/2, γ) we may choose norming sequences τ_{mn} = (n/m)^{1/2} and m^{1/2}, corresponding to γ = ∞ and 0 ≤ γ < ∞, such that for random variables ξ, ξ_1, ξ_2 and a constant c_1 > 0, which are defined in the proof in Section 4,
τ_{mn}(Δ_{H,m}, Δ̂_{H,m}) →_D (ξ_γ, η_γ),

where (writing c := 1 − 2^{−1/2}, c′ := 2^{1/2} − 1) (ξ_∞, η_∞) = (1, c) ξ, (ξ_0, η_0) = (1, c′) c_1, and (ξ_γ, η_γ) = (ξ_1, ξ_2) for 0 < γ < ∞. Thus Theorem 2.3 yields the stochastic equivalence of the random distances Δ_{H,m} and Δ̂_{H,m} as n → ∞. In order to formulate the assumptions (A)-(D), fix a distance d on the set P_0 and, for given constants c_0 > 0 and c_1 > 0, consider the neighborhood V(P) ⊂ P_0 of P defined by
V(P) = {Q ∈ P_0 : d(Q, P) ≤ c_1, ‖n^{1/2}(Q − P)‖_F ≤ c_0}.
The first assumption concerns the local F-weak convergence property of the sequence of statistics. Roughly speaking, we assume that parameterized expansions for L_n(Q)h hold uniformly in the neighborhood V(P). A parameterization will be given by the quantity n^{1/2}(Q − P), considered as an element in ℓ_∞(F). In many cases, F will consist of a finite number of functions only.

Assumption (A). For each Q ∈ V(P), there exist a set {L(Q, z), z ∈ ℓ_∞(F)} of probability distributions on R and a set {ℓ(Q, z), z ∈ ℓ_∞(F)} of real valued functions on H such that for every h ∈ H
Ln{Q)h = L{Q,nιl2(Q - P))h + n-l'2i{Q,nll2{Q +
- P))h
ι 2
Rn{Q,n l {Q-P),h),
where
sup sup RniQ,^2^ Qev(P) hen
- P),h) = oin-1/2).
Furthermore, we assume first order smoothness for L(Q, z) and a Lipschitz condition for ℓ(Q, z) as a function of z ∈ ℓ_∞(F).
Assumption (B). For each h ∈ H and Q ∈ V(P) we have

L(Q, z)h = L(Q, 0)h + L_1(Q, h)z + R(Q, h, z),

where L_1(Q, h) is a bounded linear functional on ℓ_∞(F) and

sup_{Q∈V(P)} sup_{h∈H} |R(Q, h, z)| ≤ c ‖z‖².

... → c_γ(P), where c_γ(P) is a constant depending on γ and P.

Example 2.6 Let H be a separable Hilbert space with norm ‖·‖ and inner product (·, ·). Consider random elements X_1, ..., X_n, ... in H that are independent and identically distributed with distribution P. Assume that E X_1 = 0 and P is taken from a class P_0 with the following properties: i) Q is non-symmetric around zero, ii) ∫_H ‖x‖⁴ Q(dx) < ∞, and iii) the covariance operator V_Q of Q has at least 13 eigenvalues (counted with multiplicities) exceeding a given β > 0. The eigenvalues of a positive operator V : H → H will be denoted by λ_1(V) ≥ λ_2(V) ≥ ···. It is well known (see, e.g., Gohberg and Krein (1969)) that |λ_j(V_1) − λ_j(V_2)| ≤ ‖V_1 − V_2‖ for linear completely continuous positive operators V_1, V_2 on H. Let ‖·‖_2 denote the Hilbert-Schmidt operator norm. One easily checks that
E ‖V_{P_n} − V_P‖₂² ≤ n^{−1} ∫_H ‖x‖⁴ P(dx).

It follows that, with probability tending to one, at least d ≥ 13 eigenvalues of the covariance operator V_{P_n} will exceed a number β_0 > 0. Hence, without loss of generality, we can assume that P_0 contains the empirical distribution P_n of P ∈ P_0.
For Q ∈ P_0 and a ∈ H, a ≠ 0, define

T_{n,a}(Z_1, ..., Z_n; Q) = n ‖ n^{−1} Σ_{k=1}^n Z_k + a ‖².

Let L_{n,a}(Q; r) denote the distribution function of this statistic. Furthermore, let F denote the class of functions on H given by x_1(z) = (x, z) together with x_2(z) = (x, z)², indexed by x ∈ H with ‖x‖ = 1. Note that the evaluation at a point y ∈ H defines an embedding H ⊂ ℓ_∞(F), via y(x_1) := x_1(y) = (x, y) and y(x_2) := x_2(y) = (x, y)². It is easy to verify that the statistic T_{n,a} is locally F-weakly convergent at any P ∈ P_0 which has zero mean. We aim to prove the stochastic equivalence, as n → ∞, of the uniform errors Δ_m and Δ̂_m defined by (3.1).
Denote by μ_Q the mean zero Gaussian measure on H with covariance operator V_Q, and let D^j(x)μ_Q be the j-th directional derivative of μ_Q in the direction of x. Set V_r(z) = {x ∈ H : ‖x + z‖² ≤ r}. For P, Q ∈ P_0, consider the distance

d(Q, P) = |∫_H ‖x‖⁴ [Q(dx) − P(dx)]| + ‖V_Q − V_P‖.
Define V(P) = {Q ∈ P_0 : d(P, Q) ≤ c_1, n^{1/2}‖Q − P‖_F ≤ c_2}. An inspection of the proof of Theorem 2.1 in Bentkus (1984) gives the following uniform expansions (compare with assumption (A)). For any Q ∈ V(P) we have

(2.5)  L_n(Q, r) = μ_Q(V_r(a + z_n)) + (1/6) n^{−1/2} E D³(Z_1) μ_Q(V_r(a + z_n)) + R_n(Q, z_n; r),

where z_n = n^{1/2} E Z_1 and, for any ε > 0 and c_0 > 0, there exists a constant c > 0 such that

sup_{Q∈V(P)} sup_{‖a‖≤c_0} sup_{r>0} |R_n(Q, z_n; r)| ≤ c n^{−1+ε}.

Furthermore,

(2.6)  μ_Q(V_r(a + z)) = μ_Q(V_r(a)) + D(z) μ_Q(V_r(a)) + R(Q, z; r),

where

sup_{Q∈V(P)} sup_{‖a‖≤c_0} sup_{r>0} |R(Q, z; r)| ≤ c_2(P) ‖z‖².
Indeed, by standard arguments (inversion formula, Lebesgue lemma) it easily follows that

(2.7)  D^j(z) μ_Q(V_r(a)) = (2π)^{−1} ∫_R (d/dτ)^j [(it)^{−1} e^{−itr} q_t(a + τz)] |_{τ=0} dt,

where q_t(a) is the characteristic function corresponding to μ_Q(V_r(a)). By elementary calculations one proves that

(2.8)  |D²(z) μ_Q(V_r(a))| ≤ C (λ_1(Q) ··· λ_9(Q))^{−1/2} max{1, ‖a‖²} ‖z‖².

Hence, (2.6) follows by Taylor's formula and (2.8). Note that (2.6) implies assumption (B). By (2.7) (which is assumption (C)) we get

|E D³(Z_1) μ_Q(V_r(a + z)) − E D³(Z_1) μ_Q(V_r(a))| ≤ C (λ_1(Q) ··· λ_9(Q))^{−1/2} max{1, ‖a‖⁴} ‖z‖ E‖Z_1‖³.
Finally, collecting the bounds (3.2), (3.3), and (3.6) we have proved the following result.
Theorem 2.4 Suppose that Q ∈ V(P) and ∫ x Q(dx) ≠ 0. Then, for any constants c_0 > 0 and ε > 0, there exists a constant c > 0 such that

L_n(Q, r) = μ_Q(V_r(a)) + D(z_n) μ_Q(V_r(a)) + (1/6) n^{−1/2} E D³(Z_1) μ_Q(V_r(a)) + R_{m,n}(r),

where

sup_{‖a‖≤c_0} sup_{r>0} |R_{m,n}(r)| ≤ c (n^{−1+ε} + ‖z_n‖³ + n^{−1/2} ‖z_n‖).
Arguing as in Götze and Zitikis (1995) one can prove that

sup_{‖a‖≤c_0} sup_{r>0} |μ_{P_n}(V_r(a)) − μ_P(V_r(a))| = O_P(n^{−1/2})

and

sup_{‖a‖≤c_0} sup_{r>0} |n^{−1} Σ_{k=1}^n D³(X_k) μ_P(V_r(a)) − E D³(X_1) μ_P(V_r(a))| = O_P(n^{−1/2}),

where Y_n = (n^{1/2} X̄_n)/σ. Furthermore,

(3.2)  P(n (X̄_n)² ≤ t²) = 2Φ(t/σ) − 1 + O(n^{−1/2}).
Set ζ_m(t) = L_m(P_n; t²) − L_n(P; t²) and ζ̄_m(t) = ζ_m(t) − ζ_{m/2}(t). If n/m² → 0, then (4.1) and (4.2) together yield

(n/m) sup_{t>0} |ζ_m(t)| = Y_n² sup_{t>0} |Φ″(t)| + o_P(1)

and

(n/m) sup_{t>0} |ζ̄_m(t)| = 2^{−1} Y_n² sup_{t>0} |Φ″(t)| + o_P(1).
Combining this with the central limit theorem we get (1.2). The proof of (1.3) immediately follows by (4.1), (4.2), and the law of large numbers. For the proof of (1.4), we have, by (4.1) and (4.2) combined with the law of large numbers,

n^{1/2} sup_{t>0} |ζ_m(t)| = sup_{t>0} |(m/n^{1/2}) Φ″(t) Y_n² − 2 Q_1′(t) Y_n + 2 (n^{1/2}/m) Q_2(t)| + o_P(1)

and

n^{1/2} sup_{t>0} |ζ̄_m(t)| = sup_{t>0} |(m/2n^{1/2}) Φ″(t) Y_n² − 2 (n^{1/2}/m) Q_2(t)| + o_P(1).
It is not difficult to verify that (1.5) is valid with

f_0(Y) = sup_{t>0} |γ Φ″(t) Y² − 2 Q_1′(t) Y + 2 γ^{−1} Q_2(t)|

and

f_1(Y) = sup_{t>0} |2^{−1} γ Φ″(t) Y² − 2 γ^{−1} Q_2(t)|.
The proof of Theorem 2.1 uses two results about U-statistics due to Vitale (1992), which we state below. For a sequence of i.i.d. random elements X_1, ..., X_n taking values in a measurable space (S, A) and a sequence of functions (h_m), where h_m : S^m → R is a real-valued symmetric kernel of degree m ≤ n, define the U-statistic U_{n,m} h_m by

U_{n,m} h_m = (n choose m)^{−1} Σ_{1 ≤ i_1 < ... < i_m ≤ n} h_m(X_{i_1}, ..., X_{i_m}).

For k ≤ m, the projection kernel h_{m(k)} : S^k → R is defined by

h_{m(k)}(x_1, ..., x_k) = ∫_S ··· ∫_S h_m(x_1, ..., x_m) P(dx_{k+1}) ··· P(dx_m).

The Hoeffding decomposition of U_{n,m} h_m is then obtained by expanding h_m into the projections h_{m(k)}. The degree of degeneracy of the U-statistic U_{n,m} h_m is the largest integer r such that h_{m(k)} = 0 for k = 1, ..., r.

Lemma 3.1 Suppose that U_{n,m} h_m is the U-statistic based on the symmetric kernel h_m satisfying E h_m² < ∞. If the degree of degeneracy of U_n is equal to r − 1, then

var(U_n) ≤ (m choose r)² (n choose r)^{−1} var(h_{m(r)}) + (m choose r+1)² (n choose r+1)^{−1} (var(h_m) − var(h_{m(r)})).

Lemma 3.2 If U_{n,m} h_m is as in Lemma 3.1, the sequence var(h_{m(k)}), k = r, ..., m, is nondecreasing.
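The objects in Vitale's lemmas are straightforward to compute exactly for small samples. A sketch (helper names are ours), using the kernel h_2(x, y) = (x − y)²/2, for which U_{n,2} h_2 is the unbiased sample variance:

```python
from itertools import combinations

def u_statistic(sample, kernel, m):
    """U_{n,m} h_m = (n choose m)^{-1} * sum of kernel values over all
    subsets {i_1 < ... < i_m}."""
    subs = list(combinations(sample, m))
    return sum(kernel(*s) for s in subs) / len(subs)

def projection(sample, kernel, m, k, x_fixed):
    """Empirical analogue of the projection kernel h_{m(k)}: integrate out
    the last m - k arguments against P_n, the empirical measure of
    `sample` (for illustration; the lemma's h_{m(k)} integrates against P)."""
    subs = list(combinations(sample, m - k))
    return sum(kernel(*x_fixed, *s) for s in subs) / len(subs)
```

With sample (1, 2, 3, 4), the six pair-kernel values average to 5/3, the unbiased sample variance; and the empirical projection of the kernel at x = 1 averages (0 + 1 + 4 + 9)/2 over four points, i.e. 1.75.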
Proof of Theorem 2.1 Write τ_{mn} = (m/n)^{1/2} + m^{−1/2} and, fixing constants c_0 > 0, c_1 > 0, define the set Ω_0 = {‖ν_{P,n}‖_F ≤ c_0 (n/m)^{1/2}} ∩ {d(P_n, P) ≤ c_1}. On this set Ω_0, we have by assumption (A)

H(h) := L_m(P_n)h − L_n(P)h = L(P_n, m^{1/2}(P_n − P))h − L(P, n^{1/2}(P_n − P))h + R_{m,n}(h),

where sup_{h∈H} |R_{m,n}(h)| = o_P(τ_{mn}). Write

a_n(h) = L_1(P_n, h)(ν_{P,n}),  b_n(h) = ℓ(P_n, 0)h.

Using assumptions (B) and (C) we get

(3.4)  H(h) = (m/n)^{1/2} a_n(h) + m^{−1/2} b_n(h) + T_{mn}(h),

where sup_{h∈H} |T_{mn}(h)| = o_P(τ_{mn}).
In view of assumption (D), we conclude

H(h) = (m/n)^{1/2} a_n(h) + m^{−1/2} b(h) + T_{mn}(h),

where a_n(h) = L_1(P, h)(ν_{P,n}), b(h) = ℓ(P, 0)h. Now (3.4) yields

(3.5)  H*(h) := L_m(P_n)h − L_{m/2}(P_n)h = (1 − 2^{−1/2})(m/n)^{1/2} a_n(h) + (2^{1/2} − 1) m^{−1/2} b(h) + T*_{mn}(h),
where sup_{h∈H} |T*_{mn}(h)| = o_P(τ_{mn}). Since F is P-Donsker and ‖ν_{P,n}‖_F is uniformly integrable, we obtain

(3.6)  sup_{h∈H} |a_n(h)| →_D sup_{h∈H} |L_1(P, h) G_P| as n → ∞.
Using (3.4), (3.5), and (3.6) it is easy to complete the proof of the theorem. Moreover, we observe that ξ = sup_{h∈H} |L_1(P, h) G_P|, c_1 = sup_{h∈H} |b(h)|, whereas

ξ_1 = sup_{h∈H} |γ L_1(P, h) G_P + b(h)|

and

ξ_2 = sup_{h∈H} |γ (1 − 2^{−1/2}) L_1(P, h) G_P + (1 − √2) b(h)|.

This concludes the proof. □

Acknowledgements The first author would like to thank Peter Bickel and Willem van Zwet for a number of stimulating and very helpful discussions on the construction of subsampling procedures.
REFERENCES

Arcones, M.A. and Giné, E. (1989). The bootstrap of the mean with arbitrary bootstrap sample size. Ann. Inst. Henri Poincaré 25, 457-481.
Athreya, K.B. (1987). Bootstrap of the mean in the infinite variance case. Ann. Statist. 15, 724-731.
Bentkus, V. (1984). Asymptotic expansions for distributions of sums of independent random elements of a Hilbert space. Lithuanian Math. J. 24, 305-319.
Bentkus, V. and Götze, F. (1999). Optimal bounds in non-Gaussian limit theorems for U-statistics. Ann. Probab. 27, 454-521.
Beran, R. (1982). Estimated sampling distributions: the bootstrap and competitors. Ann. Statist. 10, 212-225.
Beran, R. (1997). Diagnosing bootstrap success. Ann. Inst. Statist. Math. 49, 1-24.
Bickel, P.J., Götze, F. and van Zwet, W.R. (1997). Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica 7, 1-32.
Bickel, P.J. and Freedman, D.A. (1981). Some asymptotic theory for the bootstrap. Ann. Statist. 9, 1196-1217.
Bretagnolle, J. (1983). Lois limites du bootstrap de certaines fonctionnelles. Ann. Inst. Henri Poincaré 19, 281-296.
Cremers, H. and Kadelka, D. (1986). On weak convergence of integral functionals of stochastic processes with applications to processes taking paths in L_p^E. Stoch. Proc. Appl. 21, 305-317.
Davison, A.C. and Hinkley, D.V. (1998). Bootstrap Methods and their Application. Cambridge University Press, Cambridge.
Datta, S. (1996). On asymptotic properties of bootstrap for AR(1) processes. J. Statist. Plann. Inference 53, 361-374.
Datta, S. and McCormick, W.P. (1995). Bootstrap inference for a first order autoregression with positive innovations. J. Amer. Statist. Assoc. 90, 1289-1300.
Deheuvels, P., Mason, D. and Shorack, G. (1993). Some results on the influence of extremes on the bootstrap. Ann. Inst. Henri Poincaré 29, 83-103.
Dudley, R.M. (1984). A course on empirical processes. Lecture Notes in Math. 1097, 1-142. Springer, Berlin.
Dudley, R.M. (1985). An extended Wichura theorem, definitions of Donsker class, and weighted empirical distributions. Lecture Notes in Math. 1153, 141-178. Springer, Berlin.
Gohberg, I.C. and Krein, M.G. (1969). Introduction to the Theory of Linear Nonselfadjoint Operators. American Mathematical Society, Providence.
Götze, F. (1985). Asymptotic expansions in functional limit theorems. J. Multivariate Anal. 16, 1-20.
Götze, F. (1993). IMS Bulletin.
Götze, F. and Zitikis, R. (1995). Edgeworth expansions and bootstrap for degenerate von Mises statistics. Probab. Math. Statist. 15, 327-351.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
Heimann, G. and Kreiss, J.-P. (1996). Bootstrapping general first order autoregression. Statist. Probab. Letters 30, 87-98.
Politis, D.N. and Romano, J.P. (1994). A general theory for large sample confidence regions based on subsamples under minimal assumptions. Ann. Statist. 22, 2031-2050.
Shao, J. (1994). Bootstrap sample size in nonregular cases. Proc. Amer. Math. Soc. 122, 1251-1262.
Shao, J. (1996). Bootstrap model selection. J. Amer. Statist. Assoc. 91, 655-665.
Swanepoel, J.W.H. (1986). A note on proving that the (modified) bootstrap works. Comm. Statist. Theory Methods 15, 3193-3203.
Vitale, R.A. (1992). Covariances of symmetric statistics. J. Multivariate Anal. 41, 14-26.

DEPARTMENT OF MATHEMATICS, BIELEFELD UNIVERSITY, POSTFACH 100131, 33501 BIELEFELD, GERMANY
[email protected]

DEPARTMENT OF MATHEMATICS, VILNIUS UNIVERSITY, NAUGARDUKO 24, 2600 VILNIUS, LITHUANIA
[email protected]
CONFORMAL INVARIANCE, DROPLETS, AND ENTANGLEMENT
GEOFFREY GRIMMETT 1
University of Cambridge

Very brief surveys are presented of three topics of importance for interacting random systems, namely conformal invariance, droplets, and entanglement. For ease of description, the emphasis throughout is upon progress and open problems for the percolation model, rather than for the more general random-cluster model. Substantial recent progress has been made on each of these topics, as summarised here. Detailed bibliographies of recent work are included. AMS subject classifications: 60K35, 82B43.
Keywords and phrases: Conformal invariance, droplet, large deviations, entanglement, percolation, Ising model, Potts model, random-cluster model.
1  Introduction
Rather than attempt to summarise the 'state of the art' in percolation and disordered systems, a task for many volumes, we concentrate in this short article on three areas of recent progress, namely conformal invariance, droplets, and entanglement. In each case, the target is to stimulate via a brief survey, rather than to present the details. Much of the contents of this article may be expressed in terms of the random-cluster model, but for simplicity we consider here only the special case of percolation, defined as follows. Let L be a lattice in R^d; that is, L is an infinite, connected, locally finite graph embedded in R^d which is invariant under translation by any basic unit vector. We write L = (V, E), and we choose a vertex of L which we call the origin, denoted 0. The cubic lattice, denoted Z^d, is the lattice in R^d with integer vertices and with edges joining pairs of vertices which are Euclidean distance 1 apart. Let 0 < p < 1. In bond percolation on L, each edge is designated open with probability p, and closed otherwise, different edges receiving independent designations. In site percolation, it is the vertices of L rather than its edges which are designated open or closed. In either case, for A, B ⊂ V, we write A ↔ B if there exists an open path joining some a ∈ A to some b ∈ B,
This work was aided by partial financial support from the Engineering and Physical Sciences Research Council under grant GR/L15425.
and we write A ↔ ∞ if there exists an infinite open self-avoiding path from some vertex in A. Let P_p denote the appropriate product measure, and let θ(p) = P_p(0 ↔ ∞). For r > 1, define π_r(γ; α, β) = P(rα ↔ rβ in rγ).
Figure 2.1. An illustration of the event that rα is joined to rβ within rγ, in the case of bond percolation on Z².
The first conjecture is the existence of the limit

(2.1)  π(γ; α, β) = lim_{r→∞} π_r(γ; α, β)

for all triples (γ; α, β). Some convention is needed in order to make sense of (2.1), arising from the fact that rγ lives in the plane R² rather than on the lattice L; this poses no major problem. Only in very special cases is (2.1) known to hold. For example, in the case of bond percolation on Z², self-duality enables a proof of the existence of the limit when γ is a square and α, β are opposite sides thereof; for this special case, the limit equals 1/2. Let φ : R² → R² be conformal on the inside of γ and bijective on the curve γ itself. The hypothesis of conformal invariance states that

(2.2)  π(γ; α, β) = π(φγ; φα, φβ)

for all triples (γ; α, β).

Possibly there is a unique weak limit, and Aizenman has termed an object sampled according to this limit the 'web'. The fundamental conjectures are therefore that there is a unique weak limit, and that this limit is conformally invariant. Further work in this direction may be found in Aizenman and Burchard (1999). The quantities π(γ; α, β) should then arise as crossing probabilities in 'web-measure'. This geometrical vision may be useful to physicists and mathematicians in understanding conformal invariance. Mathematicians have long been interested in the existence of long open connections in critical percolation models in R^d (see, for example, Kesten (1982), Kesten and Zhang (1993)). An overall description of such connections will depend greatly on whether d is small or large. When d = 2, a complex picture is expected, involving long but finite paths on all scales whose geometry may be described as 'fractal'. See Aizenman (1997, 1998) for accounts of the current state of knowledge. A particular question of interest is to ascertain the fractal dimension of the exterior boundary of a large droplet (see Section 3 of the current paper). Such questions are linked to similar problems for Brownian Motion in two dimensions. The (rigorous) conformal invariance of Brownian Motion has been used to derive certain exact calculations, some of which are rigorous, of various associated critical exponents (see Lawler and Werner (1998) and Duplantier (1999), for example). Such results support the belief that similar calculations are valid for percolation. The picture for large d is expected to be quite different. Indeed, Hara and Slade (1999a, 1999b) have recently proved that, for large d, the two- and three-point connectivity functions of critical percolation converge to appropriate correlation functions of the process known as Integrated Super-Brownian Excursion.
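The square-crossing probability mentioned above (with limit 1/2 by self-duality when γ is a square and α, β are opposite sides) is easy to estimate by simulation. A minimal Monte Carlo sketch (helper names are ours; the finite-box crossing probability only approximates the r → ∞ limit):

```python
import random
from collections import deque

def crosses(n, p, rng):
    """One sample of bond percolation on the n-by-n vertex grid: is there
    an open path from the left column (x = 0) to the right (x = n - 1)?"""
    open_edge = {}
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                open_edge[((x, y), (x + 1, y))] = rng.random() < p
            if y + 1 < n:
                open_edge[((x, y), (x, y + 1))] = rng.random() < p

    def is_open(a, b):
        return bool(open_edge.get((a, b)) or open_edge.get((b, a)))

    seen = {(0, y) for y in range(n)}     # start from the whole left side
    queue = deque(seen)
    while queue:
        x, y = queue.popleft()
        if x == n - 1:
            return True
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nb[0] < n and 0 <= nb[1] < n and nb not in seen
                    and is_open((x, y), nb)):
                seen.add(nb)
                queue.append(nb)
    return False

def crossing_probability(n, p, trials=200, seed=0):
    rng = random.Random(seed)
    return sum(crosses(n, p, rng) for _ in range(trials)) / trials
```

At p = 1/2 the estimate hovers around 1/2 even for modest n, in line with the self-duality argument.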
In one interesting 'continuum' percolation model, conformal invariance may actually be proved rigorously. We drop points {X_1, X_2, ...} in the
plane R² in the manner of a Poisson process with intensity λ. Now divide R² into tiles {T(X_1), T(X_2), ...}, where T(X) is defined as the set of points in R² which are no further from X than they are from any other point of the Poisson process (this is the 'Dirichlet' or 'Voronoi' tessellation). We designate each tile to be open with probability 1/2 and closed otherwise. This continuum percolation model has a property of self-duality, and, using the conformal invariance and other properties of the Poisson point process, one may show in addition that it satisfies conformal invariance. See Aizenman (1998) and Benjamini and Schramm (1998). We note that Langlands et al. (1999) have reported a largely numerical study of conformal invariance for the two-dimensional Ising model.
3  Droplets and large deviations
Consider the Ising model on a finite box B of the square lattice Z² with + boundary conditions, and suppose that the temperature T is low. (We omit a formal definition of the Ising model, which is known to many, and which is not central to this short review.) The origin may lie within some region whose interior spins behave as in the − phase, but it is unlikely that such a region, or 'droplet', is large. What is the probability that this droplet is indeed large? Conditional on its being large, what is its approximate shape? For low T, such questions were answered by Dobrushin, Kotecký, and Shlosman (1992), who proved amongst other things that droplets have approximately the shape of what is termed a Wulff crystal (after Wulff (1901)). In later work, such results were placed in the context of the associated random-cluster model, and were proved for all subcritical T; see Ioffe (1994, 1995), Ioffe and Schonmann (1998), and the references therein.
of η(p) € (0, oo) such that (3.1)
—\= logP p (|C| = n) -> η(p) as n -> oo
where C denotes the set of vertices which are connected to the origin by open paths. The geometrical framework for such results begins with a definition of 'surface tension'. The details of this are beyond this article, but the very rough idea is as follows. Let k be a unit vector, and let σ(k,p) denote 'surface tension in direction k'. For the Ising model, σ(k,p) is defined in terms of the probability of the existence of a certain type of interface orthogonal to k; for percolation, one considers the probability of a certain type of dual path of closed edges which is, in a sense to be defined, orthogonal to k. When suitably defined, these probabilities decay exponentially, and the relevant exponents allow a definition of 'surface tension' in each case. Given a closed curve γ, one may define its energy W(γ) as the integral along γ of σ(k,p), where k denotes the normal vector to γ. We say that γ encloses a 'Wulff crystal' if W(γ) ≤ W(γ′) for all closed curves γ′ enclosing the same area as γ. We make this discussion of surface tension more concrete in the case of two dimensions, following Alexander, Chayes, and Chayes (1990). For a unit vector k and an integer n, let [nk] be a vertex of Z² lying closest to nk. The existence of the limit
σ(k,p) = lim_{n→∞} { −(1/n) log P_{1−p}(0 ↔ [nk]) }

follows by subadditivity, and this may be used as a definition of surface tension. Consider now the percolation model on Z² with p > p_c. If |C| < ∞, the origin lies within some closed dual circuit γ. For a wide variety of possible γ, the circuit γ contains with large probability a large open cluster of size approximately θ(p)|ins(γ)|, where ins(γ) denotes the inside of γ. It turns out that, amongst all γ with θ(p)|ins(γ)| = n, say, the γ having largest probability may be approximated by the Wulff crystal enclosing area n/θ(p). The length of such γ has order √n, and one is led towards (3.1). A substantial amount of work is required to make this argument rigorous. It is a great advantage to work in two dimensions, and until recently there has been only little progress towards understanding how to prove such results in three dimensions. Topological and probabilistic problems intervened. However, a recent paper of Cerf (1998) has answered such problems, and has shown the way to a Wulff construction in three dimensions. Cerf has proved a large deviation principle from which the Wulff construction emerges. A key probabilistic tool is the 'coarse graining' of Pisztora (1996), which is itself based on the results of Grimmett and Marstrand (1990); see also Grimmett (1999, Section 7.4). Cerf's paper has provoked a further look at the Ising model, this time in three dimensions. Bodineau (1999) has achieved a Wulff construction for low temperatures, and Cerf and Pisztora (1999) have proved such a result for all T smaller than a certain value T_slab believed equal to the critical temperature T_c. The latter paper used methods of Pisztora (1996) concerning 'coarse graining' for random-cluster models.
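The discussion above can be condensed into a schematic variational statement (our paraphrase, in the text's notation, not a verbatim formula from the paper):

```latex
% Boundary energy of a closed curve gamma with outward unit normal n(s):
W(\gamma) \;=\; \oint_{\gamma} \sigma\bigl(\mathbf{n}(s),\,p\bigr)\,\mathrm{d}s .
% The Wulff crystal is the energy minimiser at fixed enclosed area:
\min\Bigl\{\, W(\gamma) \;:\; \gamma \text{ closed},\ |\mathrm{ins}(\gamma)| = n/\theta(p) \,\Bigr\} .
```

Since the minimising curve scales like the square root of the enclosed area, its length, and hence its energy, has order √n; this is the source of the √n normalisation in (3.1).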
4 Entanglement
The theory of long-chain polymers has led to the study of entanglements in systems of random arcs of R³. Suppose that a set of arcs is chosen within R³ according to some given probability measure μ. Under what conditions
on μ does there exist with strictly positive probability one or more infinite entanglements? Such a question was posed implicitly by Kantor and Hassold (1988), and has been studied further for bond percolation on the cubic lattice Z³ by Aizenman and Grimmett (1991), Holroyd (1998), and Grimmett and Holroyd (1999). It is first necessary to decide on a definition of an 'entanglement'. Let 𝔼 be the edge set of Z³. We think of an edge e as being the closed line segment of R³ joining its endpoints. For E ⊆ 𝔼, we let [E] be the union of the edges in E. The term 'sphere' is used to mean a subset of R³ which is homeomorphic to the unit sphere. The complement of a sphere S has two connected components, an unbounded outside denoted out(S), and a bounded inside denoted ins(S). For E ⊆ 𝔼 and a sphere S, we say that S separates E if S ∩ [E] = ∅ but [E] has non-empty intersection with both the inside and the outside of S. Let E be a non-empty finite subset of 𝔼. We call E entangled if it is separated by no sphere. See Figure 4.1.
Figure 4.1. The left graph is not entangled; the right graph is entangled.
There appears to be no unique way of defining an infinite entanglement, and the 'correct' way is likely to depend on the application in question. Two specific ways propose themselves, and it turns out that the corresponding definitions are 'extreme' in a manner to be explained soon. Let E be a (non-empty) finite or infinite subset of 𝔼. (a) We call E strongly entangled if, for every finite subset F of E, there exists a finite entangled subset F′ of E satisfying F ⊆ F′. (b) We call E weakly entangled if it is separated by no sphere. Note that all connected graphs are entangled in both manners, and that a finite subset of 𝔼 is strongly entangled if and only if it is weakly entangled. Let ℰ_s (respectively ℰ_w) be the collection of all strongly entangled sets of edges (respectively weakly entangled sets). It is proved in Grimmett and Holroyd (1999) that ℰ_s ⊆ ℰ_w, and that these sets are extreme in the sense that ℰ_s ⊆ ℰ ⊆ ℰ_w for any collection ℰ of non-empty subsets of 𝔼 having the following three properties:
(i) the intersection of ℰ with the set of finite graphs is exactly the set of finite entangled graphs; (ii) if E ∈ ℰ, then E is separated by no sphere; (iii) let E₁, E₂, ⋯ . ⋯ > 0 has been proved by Holroyd (1998). The list of open problems concerning entanglement in percolation includes: proving the almost sure equivalence of the notions of strong and weak entanglement, establishing an exponential tail for the size of the maximal finite entanglement containing the origin when p < p_c^ent, and deciding whether or not there exists an infinite entanglement of a given type when p equals the appropriate critical value. Uniqueness of the infinite entanglement when p exceeds the corresponding critical value has been proved recently by Häggström (1999), but the critical process (when p equals the critical value) is still not understood. The following combinatorial question may prove interesting. Let η_n be the number of finite entangled subsets of 𝔼 which contain the origin and have exactly n edges. Does there exist a constant A such that η_n ≤ Aⁿ for all n?
REFERENCES

Aizenman, M., (1995). The geometry of critical percolation and conformal invariance. Proceedings STATPHYS 1995 (Xiamen) (ed: Hao Bai-lin), 104-120. World Scientific.

Aizenman, M., (1997). On the number of incipient spanning clusters. Nuclear Physics B 485, 551-582.

Aizenman, M., (1998). Scaling limit for the incipient spanning clusters. Mathematics of Multiscale Materials (eds: K. Golden, G. Grimmett, R. James, G. Milton, P. Sen). IMA Volumes in Mathematics and its Applications 99, 1-24. Springer, New York.

Aizenman, M. and Burchard, A., (1999). Tortuosity bounds for random curves. Preprint.

Aizenman, M. and Grimmett, G. R., (1991). Strict monotonicity for critical points in percolation and ferromagnetic models. Journal of Statistical Physics 63, 817-835.
Alexander, K. S., Chayes, J. T., and Chayes, L., (1990). The Wulff construction and asymptotics of the finite cluster distribution for two-dimensional Bernoulli percolation. Communications in Mathematical Physics 131, 1-50.

Bain, A. F. R., (1999). Unpublished.

Benjamini, I. and Schramm, O., (1998). Conformal invariance of Voronoi percolation. Communications in Mathematical Physics 197, 75-107.

Bodineau, T., (1999). The Wulff construction in three and more dimensions. Preprint.

Cardy, J., (1992). Critical percolation in finite geometries. Journal of Physics A: Mathematical and General 25, L201-L206.

Cerf, R., (1998). Large deviations for three dimensional supercritical percolation. Astérisque, to appear.

Cerf, R. and Pisztora, A., (1999). On the Wulff crystal in the Ising model. Preprint.

Dobrushin, R. L., Kotecky, R., and Shlosman, S. B., (1992). Wulff Construction: A Global Shape from Local Interaction. AMS, Providence, Rhode Island.

Duplantier, B., (1999). In preparation.

Grimmett, G. R., (1995). The stochastic random-cluster process and the uniqueness of random-cluster measures. Annals of Probability 23, 1461-1510.

Grimmett, G. R., (1997). Percolation and disordered systems. École d'Été de Probabilités de Saint-Flour XXVI-1996 (ed: P. Bernard). Lecture Notes in Mathematics 1665, 153-300. Springer, Berlin.

Grimmett, G. R., (1999). Percolation (2nd edition). Springer, Berlin.

Grimmett, G. R. and Holroyd, A. E., (1999). Entanglement in percolation. Proceedings of the London Mathematical Society, to appear.

Grimmett, G. R. and Marstrand, J. M., (1990). The supercritical phase of percolation is well behaved. Proceedings of the Royal Society (London), Series A 430, 439-457.

Häggström, O., (1999). Uniqueness of the infinite entangled component in three-dimensional bond percolation. Preprint.

Hara, T. and Slade, G., (1999a). The scaling limit of the incipient infinite cluster in high-dimensional percolation. I. Critical exponents. Preprint.
Hara, T. and Slade, G., (1999b). The scaling limit of the incipient infinite cluster in high-dimensional percolation. II. Integrated super-Brownian excursion. Preprint.

Holroyd, A. E., (1998). Existence of a phase transition for entanglement percolation. Mathematical Proceedings of the Cambridge Philosophical Society, to appear.

Ioffe, D., (1994). Large deviations for the 2D Ising model: a lower bound without cluster expansions. Journal of Statistical Physics 74, 411-432.

Ioffe, D., (1995). Exact large deviations bounds up to T_c for the Ising model in two dimensions. Probability Theory and Related Fields 102, 313-330.

Ioffe, D. and Schonmann, R. H., (1998). Dobrushin-Kotecky-Shlosman theorem up to the critical temperature. Communications in Mathematical Physics 199, 117-167.

Kantor, T. and Hassold, G. N., (1988). Topological entanglements in the percolation problem. Physical Review Letters 60, 1457-1460.

Kesten, H., (1982). Percolation Theory for Mathematicians. Birkhäuser, Boston.

Kesten, H. and Zhang, Y., (1993). The tortuosity of occupied crossings of a box in critical percolation. Journal of Statistical Physics 70, 599-611.

Langlands, R., Lewis, M.-A., and Saint-Aubin, Y., (1999). Universality and conformal invariance for the Ising model in domains with boundary. Preprint.

Langlands, R., Pichet, C., Pouliot, P., and Saint-Aubin, Y., (1992). On the universality of crossing probabilities in two-dimensional percolation. Journal of Statistical Physics 67, 553-574.

Langlands, R., Pouliot, P., and Saint-Aubin, Y., (1994). Conformal invariance in two-dimensional percolation. Bulletin of the American Mathematical Society 30, 1-61.

Lawler, G. and Werner, W., (1998). Intersection exponents for planar Brownian motion. Preprint.

Pisztora, A., (1996). Surface order large deviations for Ising, Potts and percolation models. Probability Theory and Related Fields 104, 427-466.

Watts, G. M. T., (1996). A crossing probability for critical percolation in two dimensions. Journal of Physics A: Mathematical and General 29, L363.
Conformal Invariance, Droplets, and Entanglement
Wulff, G., (1901). Zur Frage der Geschwindigkeit des Wachsthums und der Auflösung der Krystallflächen. Zeitschrift für Krystallographie und Mineralogie 34, 449-530.

STATISTICAL LABORATORY, DPMMS
UNIVERSITY OF CAMBRIDGE
16 MILL LANE
CAMBRIDGE CB2 1SB
UNITED KINGDOM
g.r.grimmett@statslab.cam.ac.uk
NONPARAMETRIC ANALYSIS OF EARTHQUAKE POINT-PROCESS DATA
EDWIN CHOI AND PETER HALL
Centre for Mathematics and its Applications, ANU

Motivated by multivariate data on epicentres of earthquakes, we suggest nonparametric methods for analysis of point-process data. Our methods are based partly on nonparametric intensity estimation, and involve techniques for dimension reduction and for mapping the trajectory of temporal evolution of high-intensity clusters. They include ways of improving statistical performance by data sharpening, i.e. data pre-processing before substitution into a conventional nonparametric estimator. We argue that the 'true' intensity function is often best modelled as a surface with infinite poles or pole lines, and so conventional methods for bandwidth choice can be inappropriate. The relative severity of a cluster of events may be characterised in terms of the rate of asymptotic approach to a pole. The rate is directly connected to the correlation dimension of the point process, and may be estimated nonparametrically or semiparametrically.

AMS subject classifications: Primary 62G05, 62G07; Secondary 62M30.
Keywords and phrases: Bandwidth, bias reduction, correlation dimension, data sharpening, density estimation, epicentre, geophysics, intensity, Japan, Kanto, kernel methods, magnitude, pole, regular variation, smoothing.
1 Introduction
An earthquake-process dataset may often be interpreted as a realisation of a 5-dimensional point process, where the first three, spatial components denote latitude, longitude and depth below the earth's surface, the fourth represents time, and the fifth is a measure of 'magnitude', for example on the Richter scale. Goals of analysis can be very wide-ranging. At one level they may be purely descriptive, perhaps summarising features of the dataset. In this regard, some form of dimension reduction is often critical, putting the information on five dimensions into a form that is more readily accessible and interpretable. At another level the goals may be exploratory, suggesting directions for future analysis, or they may be more explicit and detailed, perhaps with the aim of elucidating properties of subterranean features that played a role in generating the data.

Supported by the Australian Research Council.
In this paper we discuss nonparametric methods for summarising earthquake data, for exploring the main features of the data, and for addressing more structural problems such as the location of poles and pole lines, the way in which those poles migrate with time, and the value of the correlation dimension of clusters of epicentres. (Poles and pole lines are points and line segments, respectively, at which the intensity of the point process asymptotes to infinity.) Many of our arguments are based on kernel-type estimators of intensity, while others employ methods that are parametric in simple cases but are nevertheless valid in contexts which are quite distant from the parametric model. The aim is to develop analytical tools that offer greater diversity, and robustness against departures from structural models, than more traditional parametric approaches. The latter include the popular Epidemic Type Aftershock Sequence (ETAS) model (Ogata, 1988), which is used to describe temporal behaviour of an earthquake series; and refinements of Hawkes' (1971) self exciting point process model, which describe spatial-temporal patterns in a catalogue. The paper by Ogata (1998) gives detailed discussion of recent extensions of these models. Disadvantages of parametric models in this setting include their instability when even small amounts of new data are added, and their relative insensitivity to anomalous events, arising from the fact that models tend to be formulated through experience of relatively conventional earthquake activity. Indeed, anomalies are typically the root cause of the aforementioned parameter instability. Since anomalous events are often of at least as much interest as conventional ones (see Ogata, 1989), procedures that tend to conceal anomalies are not necessarily to be preferred. Figure 1 depicts spatial components of the type of data that motivate this paper. 
They are part of the 'Kanto earthquake catalogue', and were compiled by the Centre for Disaster Prevention at Tsukuba, Japan. The points are longitude-latitude pairs representing the locations of earthquakes that occurred in the region of Kanto, Japan, between 1980 and 1993. We have restricted attention here to events whose location was between 138.6° and 139.7° longitude and 34.6° and 35.7° latitude, whose depth was less than 36 km, and whose magnitude was at least 2.0 on the Richter scale. There are 8187 points in the dataset. The diagonal line on the figure is a linear approximation to the location of the volcanic front of the Izu-Bonin Arc (Koyama, 1993), which is a known source of earthquake activity. The region with a dotted boundary defines a smaller subset, near the island of O-shima, which will also feature in our analysis. Section 2 describes methods for intensity estimation based on point process data, and outlines applications to which such estimates may be put. Techniques for enhancing multivariate intensity estimates, and for deducing structure from them, are outlined in Section 3. Section 4 introduces methods
Figure 1. Spatial coordinates of Kanto earthquake data. Data in the smaller region, indicated by the dotted-line boundary, will be used for analysis described in section 2.3. The dashed line diagonally across the figure represents a linear approximation, A, to the volcanic front of the Izu-Bonin Arc.
for estimating the locations and strengths of poles in intensity functions.

Some discussion of use of the term 'magnitude', and of the 'Richter scale', is in order. The many different measures of magnitude include those based respectively on energy and on different measures of the amplitudes of shock waves produced by an earthquake. Local Magnitude, more popularly referred to as Richter Magnitude, is of the latter type and is representable in terms of the logarithm of the maximum trace amplitude, measured in micrometers on a standardised seismometer. The magnitude to which we refer in this paper is Local Magnitude, although we shall henceforth call it, and the scale on which it is measured, by its popular name.

2 Data summarisation and exploration

2.1 Dimension reduction
Information about depth in a seismic data vector is often not particularly
accurate, and for example is typically represented in bins up to 10 kilometres wide. Reflecting this difficulty, we suggest pooling bins. The longitude and latitude components too are recorded with varying degrees of error, which depend on, among other matters, the spatial distribution of recording stations around the location of the event, and event depth. We shall not attempt to employ such information in our analysis — it is sometimes explicitly available (see e.g. Jones and Stewart, 1997), or deducible from other measurements — but it can be incorporated. Even after removing the depth dimension, data vectors can have as many as four components. We suggest looking at the two remaining spatial components separately, by projecting longitude-latitude pairs onto first one axis and then another. An appropriate axis is often clear from physical considerations; see Figure 1. Neglecting the magnitude component for the time being, we now have two bivariate datasets where in each case one component represents time and the other is a spatial coordinate. Each may be used to produce nonparametric estimates of point-process intensity, enabling perspective plots (where the third dimension represents intensity) to be produced. Of course, contemporary dimension-reduction methods, such as projection pursuit, might also be used to determine projections in the continuum that maximise the 'interestingness' of the associated bivariate scatterplots. In their full generality, such approaches can be hard to justify in the present setting, since rotations of axes that are as distinct as time and space are difficult to interpret. Even if dimension reduction is contemplated only for the spatial coordinates, physical interpretation can sometimes be facilitated by using information from outside the dataset (for example, in the case of the Kanto data, the physically-meaningful Izu-Bonin Arc) to determine an appropriate axis. 
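A minimal sketch of this projection step (our own illustration; treating longitude-latitude pairs as planar coordinates, which ignores the metric distortion a fuller analysis would correct for):

```python
import math

def project_onto_axis(points, angle_deg):
    """Project (x, y) pairs onto the line through the origin at the given
    angle (degrees), returning the scalar coordinate of each point along
    that axis. For earthquake data the points would be (longitude,
    latitude) pairs and the angle could be chosen parallel to a physical
    feature such as a volcanic front."""
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector of the axis
    return [x * ux + y * uy for x, y in points]
```

Pairing each projected coordinate with the corresponding event time then gives the bivariate space-time datasets described in the text.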
Magnitude may be depicted by adding colour or a grey shade to graphs of estimated intensity. Therefore, magnitude can be included on the plots described above, without increasing the complexity of the set of projections. It can be shown separately, however, in plots broadly similar to those for intensity. Since magnitude is recorded with error, and only at scattered points in space and time, it is generally necessary to smooth magnitude measurements.
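Smoothing scattered, error-contaminated magnitude measurements can be done with a kernel regression estimator. The following sketch (our illustration, with a biweight kernel; not the authors' code) returns the locally weighted mean magnitude at a time point:

```python
def nw_smooth(ts, ms, t0, h):
    """Nadaraya-Watson estimate of mean magnitude at time t0, from event
    times ts and observed magnitudes ms, with a biweight kernel and
    bandwidth h. Returns None when no event lies within one bandwidth."""
    def biweight(u):
        return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
    weights = [biweight((t - t0) / h) for t in ts]
    total = sum(weights)
    if total == 0.0:
        return None  # no data near t0
    return sum(w * m for w, m in zip(weights, ms)) / total
```

The same weighting idea extends directly to smoothing over space, or over space and time jointly, with a product kernel.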
2.2 Kernel estimation of intensity
If space-time data pairs (X_i, T_i) are available after projection of spatial coordinates onto an axis, then the space-time intensity per unit area at (x, t) is estimated by a kernel estimator of the form

(2.1)   λ̂(x, t) = Σ_i (h₁h₂)^{-1} K{(x − X_i)/h₁} K{(t − T_i)/h₂}.

If we were estimating in p ≥ 1 dimensions, and employing an estimator based on nonnegative, symmetric kernels, then we would use the data-sharpening method suggested two paragraphs above, employing bandwidth 2^{−1/2}h in the sharpening step based on the Nadaraya-Watson estimator, regardless of the value of p. Bias would be reduced from O(h²) to O(h⁴). More generally, however, if we were using kernels of order r then we would replace 2^{−1/2} by r^{−1/r}. The method has simple analogues for other linear estimators, for example those based on orthogonal series or singular integrals.
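Our reading of the sharpening recipe, in one dimension with a Gaussian kernel (the kernel choice and helper names are our own assumptions): pre-shift each observation to a kernel-weighted average of the sample computed at the reduced bandwidth h/√2, then substitute the shifted points into the usual estimator.

```python
import math

def gauss(u):
    """Standard normal kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def sharpen(xs, h):
    """One data-sharpening pass: move each observation to the kernel-
    weighted average of the whole sample, computed at bandwidth h/sqrt(2)
    (the 2^(-1/2) factor for a second-order kernel, as in the text)."""
    hs = h / math.sqrt(2.0)
    sharpened = []
    for xi in xs:
        w = [gauss((xi - xj) / hs) for xj in xs]
        sharpened.append(sum(wj * xj for wj, xj in zip(w, xs)) / sum(w))
    return sharpened

def kde(xs, x, h):
    """Ordinary kernel density estimate at x; calling
    kde(sharpen(xs, h), x, h) gives the bias-reduced, sharpened estimate."""
    return sum(gauss((x - xi) / h) for xi in xs) / (len(xs) * h)
```

Sharpening pulls points toward local centres of mass, which is what reduces the smoothing bias of the subsequent estimator.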
3.2 Implementation
Rather than reanalyse the data addressed in Section 1 we shall introduce a new data set, this time wholly spatial rather than involving both space and time. It has the advantage of requiring relatively little variation of bandwidth. Panel (a) of Figure 4 depicts epicentres of earthquakes with magnitude at least 2 on the Richter scale, occurring between 100° and 160° longitude and −30° and 30° latitude during the years 1984 to 1995. The number of events which satisfy these criteria is 24471. Data were compiled by NOAA and USGS (1996); see also the printed account in the PDE catalogue (1997). A kernel estimator of intensity at x = (x^{(1)}, x^{(2)}), based on data such as these, is

(3.1)   ν̂(x) = Σ_i h^{-2} K{(x − X_i)/h}

(compare (2.1)), where now X_i = (X_i^{(1)}, X_i^{(2)}) represents a spatial data pair, K is a bivariate kernel and h a bandwidth. The sharpened data are

(3.2)   X̂_i = Σ_j X_j K{(X_i − X_j)/h} / Σ_j K{(X_i − X_j)/h},

where K and h are as at (3.1). The 'diagonal' terms may be dropped from the numerator and denominator at (3.2), without affecting first-order asymptotic properties. We suggest substituting X̂_i for X_i at (3.1), to compute the sharpened form ν̂_s of ν̂.
Panels (a) and (b) of Figure 5 show ν̂ and ν̂_s, respectively, computed from the data in panel (a) of Figure 4, using bandwidth h = 2.0 (chosen subjectively) and taking K to be the product of two univariate biweight kernels.
Figure 4. (a) Raw data; (b) one-fold sharpened data; (c) two-fold sharpened data; (d) three-fold sharpened data. Axes: longitude 100° to 160°, latitude −30° to 30°.
as n → ∞,

δ_n √n (α̂_n − α_0) →_d max(Y, 0),

where Y has distribution N(0, σ² a_{11}^{-1}(α_0, γ)). The estimators α̂_n and m̂_n are asymptotically independent.

Theorem 1.2 (α_0 = 1/2) Let (1.1)-(1.4) be satisfied. If α_0 = 1/2, then, as n → ∞, (1.13) holds true and

(1.16)
The estimators α̂_n and m̂_n are asymptotically independent.
Theorem 1.3 (α_0 ∈ (1/2,1]) Let (1.1)-(1.4) be satisfied. If α_0 ∈ (1/2,1), then, as n → ∞,

(1.17)

where A^{-1}(α_0, γ) denotes the inverse matrix to A(α_0, γ). If α_0 = 1, then, as n → ∞,

δ_n √n (α̂_n − 1) →_d M₁   and   δ_n √n (m̂_n − m_n)/n →_d M₂,

where M₁ has distribution N(0, σ² a^{22}(1, γ)) and M₂ has distribution ½N(0, σ² a^{22}(1, γ)) + ½N(0, σ² a_{22}(1, γ)).
The proofs are postponed to the next section. Here we discuss various consequences of the stated results. Note that the rate of consistency of α̂_n depends neither on α_0 nor on m_n, while the rate of consistency of m̂_n depends on α_0. The best rate of consistency of m̂_n is reached for α_0 = 0, and the worst one for α_0 ∈ (1/2,1]. This is in accordance with the results of Ibragimov and Hasminski (1981) concerning consistency of estimators in regular, almost regular and singular cases.
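As a concrete illustration, assume the gradual-change model x_i = μ + δ((i − m)/n)_+^α + e_i that these results concern (our reading of the setup; the grid-search least-squares estimator below is an illustrative implementation, not the paper's procedure):

```python
import random

def fit_gradual_change(xs, alpha_grid):
    """Least-squares fit of the gradual-change model
        x_i = mu + delta * ((i - m)/n)_+^alpha + e_i,  i = 1, ..., n,
    profiling (mu, delta) out by simple linear regression for each
    candidate (m, alpha); returns the RSS-minimising (m, alpha)."""
    n = len(xs)
    xbar = sum(xs) / n
    best_m, best_alpha, best_rss = None, None, float("inf")
    for m in range(1, n - 1):
        for alpha in alpha_grid:
            z = [max((i - m) / n, 0.0) ** alpha for i in range(1, n + 1)]
            zbar = sum(z) / n
            szz = sum((zi - zbar) ** 2 for zi in z)
            if szz == 0.0:
                continue  # degenerate design, skip this candidate
            delta = sum((xi - xbar) * (zi - zbar)
                        for xi, zi in zip(xs, z)) / szz
            rss = sum((xi - xbar - delta * (zi - zbar)) ** 2
                      for xi, zi in zip(xs, z))
            if rss < best_rss:
                best_m, best_alpha, best_rss = m, alpha, rss
    return best_m, best_alpha

# one simulated path: gradual change at m = 60 with alpha = 1, strong signal
rng = random.Random(7)
n, m_true, delta = 100, 60, 5.0
xs = [1.0 + delta * max((i - m_true) / n, 0.0)
      + 0.2 * rng.gauss(0.0, 1.0) for i in range(1, n + 1)]
m_hat, a_hat = fit_gradual_change(xs, [0.5, 1.0, 1.5])
```

The flatness of the regression function near the change point for larger α is what slows the rate of m̂_n, in line with the discussion above.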
Estimators of Changes
If α_0 is known then, comparing the results of the present paper with Theorems 1.1-1.2 in Huskova (1999), we realize that in case α_0 ∈ [0,1/2] the limit behavior of m̂_n is the same as if α_0 is unknown. In case α_0 ∈ (1/2,1], by Theorem 1.3 in Huskova (1999), as n → ∞, δ_n√n(m̂_n − m_n)/n is asymptotically normal with mean zero, and γ ∈ (0,1),

t ∈ (0, γ − ε) ∪ (γ + ε, 1),   α ∈ (0, α_0 − ε) ∪ (α_0 + ε, 1),   α ∈ [0,1],

Q^{1/2}(α, α, t, t)
for ε > 0 small enough. This in combination with the property of the stochastic term on the right hand side of (2.3) implies the assertion of our lemma. •

To obtain stronger properties than the above consistency we have to investigate both stochastic and nonstochastic terms in more detail. The estimators m̂_n and α̂_n can be equivalently defined as the maximizers over k ∈ {0, 1, ..., n − 1} and α ∈ [0,1] of
It is useful to decompose the single terms as follows:

( Σ_{i=1}^n ⋯ )² ( ⋯ )^{-1} = A_k + 2δ_n(B_{k1} + B_{k2} + B_{k3} + B_{k4}) − δ_n²(C_{k1} + 2C_{k2} + C_{k3} + C_{k4}),

for k = 1, ..., n, α ∈ [0,1], where
C_{k2} = Σ_{i=1}^n (x_{ik}(α) − x_{i m_n}(α)) (x_{i m_n}(α) − x_{i m_n}(α_0)),
M. Huskova
Notice that

E{ B_{k1} + ⋯ + B_{k4} }² = σ² (C_{k1} + ⋯ + C_{k4}).
In the next few lemmas we investigate the single terms C_{ks} and B_{ks} for α close to α_0 and k close to m_n. We start with the C_{ks}.

Lemma 2.3 Let assumptions (1.1)-(1.4) with α_0 ∈ [0,1] be satisfied. Then, as α → α_0 and |m_n − k|/n → 0, n → ∞:

(2.6)   (1/n) C_{k1} = (α − α_0)² a_{11}(α_0, γ) (1 + o(|α − α_0|^κ + 1)),

(2.7)   (1/n) C_{k2} = (α − α_0) ( ∫_0^1 ⋯ ) (1 + o(|α − α_0|^κ + 1)),

uniformly for |k − m_n| = o(n), for some κ > 0.

Proof We derive the assertion for one term only, since the others are treated in the same way. We have
= ∫ ( ∫ ⋯ log(x − γ)_+ dx ) dβ
= (α − α_0) ∫_0^1 (x − γ)_+^{2α_0} log²(x − γ)_+ dx (1 + o(|α − α_0|^κ + 1)),
where we used elements of classical calculus. •

Lemma 2.4 Let assumptions (1.1)-(1.4) be satisfied. Then, as α → α_0 and n → ∞:
(1) for α_0 ∈ (1/2,1]: (2.9), uniformly for |k − m_n| = o(n), for some κ > 0;
(2) for α_0 = 1/2: (2.10), uniformly for |k − m_n| = o(n), for some κ > 0;
(3) for α_0 ∈ [0,1/2): (2.11), uniformly for |k − m_n| = o(n), for some κ > 0.

Proof The lemma is a slight generalization of Lemmas 2.2-2.4 in Huskova (1999) and therefore the proof is omitted. •

Lemma 2.5 Let assumptions (1.1)-(1.4) be satisfied. Then for α_0 ∈ [0,1], as α → α_0 and n → ∞,

B_{k1} + B_{k3} = (α − α_0) Y_{n1} (1 + o_P(|α − α_0|^κ + 1))

uniformly for |k − m_n|/n = o(1), for some κ > 0, where
Ynl =
o 1 ^ - 7 ) + α o log(g ~ 7)+ 0. Hence by Theorem 12.3 in Billingsley (1968) we have that the sequence Sn = {S n (α),α £ [0,1]} is tight and also
)> = Σ( ( ^ x(l + op(\α-α0\κ
+ l))
uniformly for α → α_0 and n → ∞. The lemma is proved. •

The following lemmas are slight generalizations of Lemmas 2.6-2.9 in Huskova (1999); their proofs are omitted.

Lemma 2.6 Let assumptions (1.1)-(1.4) be satisfied. Then for α_0 ∈ [0,1], as α → α_0 and n → ∞,
= - I 7 ^ - ^ y 2 n (i + O p ( | α _
Sjb4
αo
| - + i))
uniformly for \k — mn\/n = o(l) for some K > 0, where
(2 13)
ΐ = 1
∫_0^1 (x − γ)_+^{2α_0 − 1} dx − ⋯

Lemma 2.7 (α_0 ∈ [1/2,1]) Let assumptions (1.1)-(1.4) be satisfied. Then, as α → α_0 and n → ∞,

(1) for α_0 ∈ (1/2,1]:
uniformly in |k − m_n|/n = o(1), for some κ > 0.

(2.14)
γM-α
(2) for α_0 = 1/2:
A
/ΐ~
uniformly for |k − m_n|/n = o(1), for some κ > 0.

Next, we introduce the process V_{n,α} = {V_{n,α}(t); |t| ≤ T}, with α ∈ [0,1/2) and T a positive number, given by V_{n,α}(t) = δ_n B_{k2},
k = l,....,n,
and piecewise linear otherwise.

Lemma 2.8 (α_0 ∈ [0,1/2)) Let assumptions (1.1)-(1.4) be satisfied. Then for any ε > 0 and η > 0 there exist H_{jη} > 0, j = 1, 2, and n_η such that for n > n_η, P(
max
|α − α_0| ⋯ ) ⋯ For D > 0 large enough the nonstochastic terms δ_n²(C_{k1} + C_{k2} + 2C_{k3} + C_{k4}) dominate the stochastic terms δ_n(B_{k1} + B_{k2} + B_{k3} + B_{k4}) for (α, k) ∈ H_D with probability close to 1. Since

max |δ_n| |B_{k1}
+ ⋯ + B_{k4}| = O_P(|δ_n|√n)
and since D can be chosen arbitrarily large we find that (2.15)-(2.16) hold true. In order to obtain the limit behavior of our estimators we investigate the maximum of (2.17)
2δn(Bkι
+ ... + Bk4) - δ2n(Ckl + 2Ck2 + C f c 3 + CkA)
over the set G/> a n Writing α = c*o + hiδnVn)"1 d k = m + t2n(δny/n)~ι we get by Lemmas 2.3-2.7 that our problem reduces to investigating the maximum of
with respect to t\ and t2. Since A(αo5 7)? defined in (1.8) is a positive definite matrix and since by the CLT (Yί n /\/n, (Y2n + Y3n)/V™) has asymptotically ΛΓ((O,O)τ,σ2A(αo,7)) distribution, we find after some standard steps that the assertion of Theorem 1.1 holds true. • Proof of Theorem 1.2 We proceed similarly as in the proof of Theorem 1.3. Checking the behavior of B^'s and CVs for αo = 1/2 (Lemmas 2.32.8) we realize that, as n —> 00 and α —>• αo, fc3
( ( ^ ) g (
C_{k3} = O_P( ⋯ ),   C_{k4} = o(C_{k3}),
B_{k2} = O_P( |k − m_n| (log(n − m_n))^{1/2} ),   B_{k4} = O_P(B_{k2})

uniformly for k − m_n = o(n). The terms C_{k1}, C_{k2} and B_{k1}, B_{k3} are not affected in this way, which leads to the conclusion that the rate of consistency of α̂_n is the same as in case α_0 ∈ (1/2,1], while for m̂_n we have, as n → ∞ and α → α_0, m̂_n − m_n = O_P( n(δ_n√n)^{-1} log^{-1/2}(n − m_n) ).
Moreover, it is enough to study the maximum of

2δ_n(B_{k1} + B_{k2} + B_{k3}) − δ_n²(C_{k1} + C_{k3})
over a properly modified set H_D. The proof is now finished in the same way as that of Theorem 1.3. •

Proof of Theorem 1.1 This is omitted since it is in principle the same as that of Theorem 1.2. •

Acknowledgements The author thanks two anonymous referees for their helpful comments.
REFERENCES

Antoch, J., Huskova, M. and Veraverbeke, N., (1995). Change-point estimators and bootstrap. J. Nonparam. Statist. 5, 123-144.

Billingsley, P., (1968). Convergence of Probability Measures. Wiley, New York.

Csörgő, M. and Horváth, L., (1997). Limit Theorems in Change-Point Analysis. Wiley, New York.

Hinkley, D., (1971). Inference in two-phase regression. J. Amer. Statist. Assoc. 66, 736-743.

Huskova, M., (1999). Gradual changes versus abrupt changes. To appear in Journal of Statistical Planning and Inference.

Ibragimov, I. A. and Hasminski, R. Z., (1981). Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York.

Jaruskova, D., (1998a). Testing appearance of linear trend. Journal of Statistical Planning and Inference 70, 263-276.

Jaruskova, D., (1998b). Change-point estimator in gradually changing sequences. CMUC 39, 551-561.
Lombard, F., (1987). Rank tests for change point problem. Biometrika 74, 615-624.

Siegmund, D. and Zhang, H., (1994). Confidence regions in broken line regression. In Change-Point Problems, IMS Lecture Notes - Monograph Series 23, 292-316.

DEPT. OF STATISTICS
CHARLES UNIVERSITY
SOKOLOVSKA 83
CZ-186 00 PRAHA
CZECH REPUBLIC
huskova@karlin.mff.cuni.cz
ESTIMATION OF ANALYTIC FUNCTIONS
I. IBRAGIMOV

St. Petersburg Branch of Steklov Mathematical Institute, Russian Academy of Sciences

In this paper we present a review of some results on nonparametric estimation of analytic functions and in particular derive minimax bounds under different conditions on these functions.

AMS subject classifications: 62G05, 62G20.
Keywords and phrases: Nonparametric estimation, minimax bounds.
1 Introduction
The aim of this paper is to present a review of some results about nonparametric estimation of analytic functions. A part of it is written in expository style and summarizes some recent work of the author on the subject (detailed versions have already been published). The rest of the paper contains new results in the area. Sometimes the proofs are only outlined and will be published elsewhere.

Generally the problem looks as follows. We are given a class F of functions defined on a region D ⊂ R^d and analytic in a vicinity of D. This means that all f ∈ F admit analytic continuation into a domain G ⊃ D of the complex space C^d. To estimate an unknown function f ∈ F one makes observations X_ε. Consider as risk functions of estimators f̂ of f the averaged L_p(D)-norms

$\Bigl(E\int_D |\hat f(t) - f(t)|^p\,dt\Bigr)^{1/p}.$
(here $a_\varepsilon \sim b_\varepsilon$ means that $\lim a_\varepsilon/b_\varepsilon = 1$). His method heavily used the fact that in the case of L₂ spaces the sets F are ellipsoids in the Hilbert space L₂. A function g and its Fourier transform ĝ satisfy the following inequalities (see [8], ch. IV):

(4.4)  $\|g\|_p \le C\,\|\hat g\|_q, \qquad q = p/(p-1),\ p \ge 2.$

Hence for 2 ≤ p < ∞ the L_p-norm of the bias satisfies

(4.5)  $\|f - E\hat f_T\|_p \le C\Bigl(\int_{|t| > T} |\varphi(t)|^q\,dt\Bigr)^{1/q}.$
Since f is analytic in a strip of half-width greater than y, its characteristic function φ satisfies $|\varphi(t)| \le C e^{-y|t|}$, so that

$\int_{|t| > T} |\varphi(t)|\,dt \le \frac{2C}{y}\,e^{-Ty}.$

Take here y = … . We find that the bias term decreases exponentially in T. For the random term, write

$\hat f_T(x) - E\hat f_T(x) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(\frac{\sin T(X_i - x)}{\pi(X_i - x)} - E\,\frac{\sin T(X_i - x)}{\pi(X_i - x)}\Bigr)$

and apply Rosenthal's inequality [9] to this sum of centered i.i.d. summands; the second moment of each summand is controlled by

$E\,\frac{\sin^2 T(X_1 - x)}{\pi^2 (X_1 - x)^2} \le \|f\|_\infty\,\frac{T}{\pi^2}\int_{-\infty}^{\infty}\Bigl(\frac{\sin x}{x}\Bigr)^2 dx = \|f\|_\infty\,\frac{T}{\pi}.$

It follows from these two inequalities that

(4.13)  $E\|\hat f_T - E\hat f_T\|_p^p \le C\Bigl(\frac{T}{n}\Bigr)^{p/2}.$
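As a quick numerical sanity check, a band-limited (sinc) kernel density estimate of this type is easy to simulate. The kernel, the target density, the sample size and the bandwidth T below are all illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 50_000, 8.0                              # sample size and band-limit parameter

# Sample from a standard normal density (an analytic density).
X = rng.standard_normal(n)

def f_hat(x):
    """Sinc-kernel estimate: (1/n) * sum_i sin(T (X_i - x)) / (pi (X_i - x))."""
    u = X - x
    u = np.where(np.abs(u) < 1e-12, 1e-12, u)   # avoid 0/0; the limit there is T/pi
    return np.mean(np.sin(T * u) / (np.pi * u))

grid = np.linspace(-2.0, 2.0, 9)
est = np.array([f_hat(x) for x in grid])
true = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)
print(np.max(np.abs(est - true)))
```

For an analytic density the bias is tiny already at moderate T, and the pointwise stochastic error is of order (T/n)^{1/2}, so the maximal error on the grid is small.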
We always suppose that T = o(n). The last inequality implies then that for 2 ≤ p < ∞,

(4.15)  $\int \bigl|\,\cdots\,\bigr|^{2k}\,dx < \infty,$

which will be the case. For the problem of estimating an analytic spectral density one has

$\limsup_{T\to\infty} T\,\Delta_T(F) \le 4a\sigma^2.$

Proof By the Paley-Wiener theorem the correlation function R(t) of the process is zero outside the interval [−a, a]. Thus the spectral density is

$f(\lambda) = \frac{1}{2\pi}\int_{-a}^{a} R(t)\,e^{-i\lambda t}\,dt.$

We estimate R(t) by

$\hat R(t) = \frac{1}{T}\int_0^{T - |t|} X(u + t)X(u)\,du \quad\text{if } |t| \le a, \qquad \hat R(t) = 0 \quad\text{if } |t| > a,$

and f(λ) by

$\hat f(\lambda) = \frac{1}{2\pi}\int_{-a}^{a} \hat R(t)\,e^{-i\lambda t}\,dt.$

Not difficult computations show then that

$E\bigl|\hat R(t) - R(t)\bigr|^2 \le \frac{2}{T^2}\int_0^{T-|t|}\!\!\int_0^{T-|t|}\bigl(R^2(u - v) + R(t + u - v)\,R(t - u + v)\bigr)\,du\,dv.$

Further, $\frac{1}{2\pi}\|R\|_2^2 = \|f\|_2^2 \le \sigma^2$ and

$\Bigl(\int_{-a}^{a} R(u)\,du\Bigr)^2 \le 2a\,\|R\|_2^2 \le 4\pi a\sigma^2.$
The theorem is proved. □

Consider now problems when f ∈ F₂(M, c, ρ). (Recall that the last condition means that $\sup_{|z| = r} |f(z)| \le M\exp\{c\,r^{\rho}\}$; see p. 6.)

Theorem 4.8 Let under the conditions of Problem I the observations be X_ε(t), −∞ < a ≤ t ≤ b < ∞. If the unknown signal f ∈ F₂(M, c, ρ), then as ε → 0 the minimax risk Δ_p(F, ε), defined through the L_p(a, b)-norm, satisfies the following asymptotic relations
(4.31)  …

Theorem 4.9 Let the observations be X₁, …, X_n. If the unknown density f ∈ F₂(M, c, ρ), then as n → ∞ the minimax risk Δ_p(F, n) satisfies the following asymptotic relations

(4.32)  …

The constants depend on M, c, ρ, and p only. We give the proof of Theorem 4.8 only. The proof of the upper bounds in (4.32) is based on inequality (4.33) below and arguments from [1]; the proof of the lower bounds is based on arguments from [14], [15]. A detailed version of the proof will be published elsewhere.

Proof of Theorem 4.8 The proof repeats the main arguments of the proof of Theorem 1 from [1] and we omit the details. Evidently we may and will suppose that [a, b] = [−1, 1].

Upper bounds. Consider the Fourier expansion of f with respect to the (orthonormal) Legendre polynomials,

$f(x) = \sum_{k=0}^{\infty} a_k P_k(x),$
and estimate the coefficients $a_k = \int_{-1}^{1} P_k(x)f(x)\,dx$ by the statistics

$\hat a_k = \int_{-1}^{1} P_k(x)\,dX_\varepsilon(x).$

Introduce now the statistics

$\hat f_N(x) = \sum_{k=0}^{N} \hat a_k P_k(x)$

and study separately their bias

$b_N(x) = f(x) - E\hat f_N(x) = \sum_{k=N+1}^{\infty} a_k P_k(x)$

and the random term

$\hat f_N(x) - E\hat f_N(x) = \sum_{k=0}^{N} (\hat a_k - a_k)P_k(x).$
1. The bias. Introduce into the consideration the Chebyshev polynomials $T_k(x) = \sqrt{2/\pi}\,\cos(k\arccos x)$. They are orthonormal on [−1, 1] with respect to the weight (1 − x²)^{−1/2}. Let $f = \sum_k b_k T_k$ be the corresponding expansion. The value of the best approximation of the function f in the L₂-norm by polynomials Q of degree N is equal to

$\alpha_N = \inf_Q\Bigl(\int_{-1}^{1} |f(x) - Q(x)|^2\,dx\Bigr)^{1/2} \le \Bigl(\inf_Q\int_{-1}^{1} |f(x) - Q(x)|^2 (1 - x^2)^{-1/2}\,dx\Bigr)^{1/2} = \Bigl(\sum_{k > N} b_k^2\Bigr)^{1/2}.$

The coefficients b_k have the following representation (see [11], [16]):

$b_k = \int_{-1}^{1} \frac{f(x)\,T_k(x)}{\sqrt{1 - x^2}}\,dx = \sqrt{\frac{2}{\pi}}\int_0^{\pi} f(\cos\theta)\cos k\theta\,d\theta = \frac{1}{2}\sqrt{\frac{2}{\pi}}\int_{-\pi}^{\pi} f(\cos\theta)\cos k\theta\,d\theta.$
The function f is analytic in the whole complex plane, and we can apply the Cauchy theorem and integrate on the right along the circles of radii R⁻¹ and R respectively. We find that for R > 1, because of the definition of the class F₂,

$|b_k| \le \frac{R^{-k}}{\pi}\int_{-\pi}^{\pi}\Bigl|f\Bigl(\tfrac12\bigl(Re^{i\theta} + R^{-1}e^{-i\theta}\bigr)\Bigr)\Bigr|\,d\theta \le 2R^{-k}\max_{|z| \le R}|f(z)| \le 2M R^{-k}\exp\{cR^{\rho}\}.$

If we take here $R = (k/(c\rho))^{1/\rho}$, we find that

(4.33)  $|b_k| \le 2M\,e^{k/\rho}\Bigl(\frac{c\rho}{k}\Bigr)^{k/\rho}.$
But then (4.33) shows that the bias term decreases faster than any power of N.

Lower bounds. It is shown in [1] that the number M of the points of the set S is bigger than … . The polynomials P_k(z) satisfy the inequality (see [16]): for all complex z,

$|P_k(z)| \le \sqrt{\tfrac{2k+1}{2}}\,\bigl|z + \sqrt{z^2 - 1}\bigr|^{\,k}.$

Hence for |z| = R, … If we take here N as the minimal integer for which $c\,\varepsilon^{-2} N^{-2N/\rho} < 1/2$, we find that

$\sup \|\hat f - f\|_p \ge c\,\varepsilon\,(\ldots), \qquad c > 0.$
The theorem is proved.
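The projection scheme used in the upper-bound argument (estimate the Legendre coefficients, truncate at degree N) can be imitated in a simple simulation. Everything concrete below — the regression-type model with uniform design, the target function, the noise level and the truncation degree — is an assumption for illustration only, not the paper's setting:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(5)
n, sigma, N = 200_000, 0.5, 8            # samples, noise level, truncation degree

f = np.exp                                # analytic target on [-1, 1]
U = rng.uniform(-1, 1, n)                 # uniform design on [-1, 1]
Y = f(U) + sigma * rng.standard_normal(n)

def P(k, x):
    """Orthonormal Legendre polynomial: classical P_k times sqrt((2k+1)/2)."""
    c = np.zeros(k + 1); c[k] = 1.0
    return legendre.legval(x, c) * np.sqrt((2 * k + 1) / 2)

# a_k = int_{-1}^1 P_k f dx = 2 E[P_k(U) f(U)], estimated by (2/n) sum P_k(U_i) Y_i.
a_hat = np.array([2.0 * np.mean(P(k, U) * Y) for k in range(N + 1)])

x = np.linspace(-0.9, 0.9, 7)
f_N = sum(a_hat[k] * P(k, x) for k in range(N + 1))
print(np.max(np.abs(f_N - f(x))))
```

Because the target is analytic, its Legendre coefficients decay very fast (cf. (4.33)), so a small truncation degree already gives a close fit.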
5 A problem of extrapolation
An analytic function f(z) possesses a remarkable property: being observed on an interval, it becomes immediately known throughout its domain of analyticity. Of course the problem of recovering f(z) from such observations is an ill-posed problem, and it would be interesting to know what happens if the observations are noisy. We consider below an example of this problem. Denote by F = F(M, σ) the class of entire functions f(z) of exponential type ≤ σ which are real valued on the real line and such that ||f||₂ ≤ M, where ‖·‖₂ denotes the L₂(R¹) norm. By the Paley-Wiener theorem (see, for example, [20]) functions f ∈ F admit the following representation

(5.1)  $f(z) = \frac{1}{2\pi}\int_{-\sigma}^{\sigma} e^{izt}\,\tilde f(t)\,dt,$

and hence

(5.2)  $|f(z)| \le \Bigl(\frac{\sigma}{\pi}\Bigr)^{1/2}\,\|f\|_2\,e^{\sigma|\operatorname{Im} z|}.$

Uniformly over f ∈ F(2, 1), …
if $T \ge c(\alpha)\bigl(\ln\tfrac{1}{\varepsilon}\bigr)^{\alpha}$. With slight changes the same arguments will work for any z with $|z| \ge c(\alpha)\bigl(\ln\tfrac{1}{\varepsilon}\bigr)^{\alpha}$. The theorem is proved. □

Remark 5.1 Theorems 5.1 and 5.2 mean that, roughly speaking, when ε → 0 consistent estimation of f(z) is possible on the disks $\{|z| \ll \ln\tfrac{1}{\varepsilon}\}$ and impossible outside larger disks, namely on $\{|z| \gg \ln\tfrac{1}{\varepsilon}\}$.

Acknowledgements I wish to thank Professors Chris Klaassen and Aad van der Vaart, who read the manuscript and made many helpful comments.
REFERENCES
[1] I. Ibragimov (1998). On estimation of analytic functions. Studia Sci. Math. Hungarica 34, 191-210.
[2] I. Ibragimov (1998). Estimation of analytic spectral density of Gaussian stationary processes. ANU preprint No. SRR 006-98, Australian Nat. Univ.
[3] I. Ibragimov, R. Khasminskii (1983). On the estimation of distribution density. J. of Soviet Math. 24, 40-57.
[4] I. Ibragimov, R. Khasminskii (1977). A problem of statistical estimation in Gaussian white noise. Soviet Math. Dokl. 236, 1053-1056.
[5] M. Pinsker (1980). Optimal filtering of square integrable signals in Gaussian white noise. Problems of Information Transmission 16, 120-133.
[6] E. Guerre, A. Tsybakov (1998). Exact asymptotic minimax constants for the estimation of analytic functions in Lp. Probab. Theory Relat. Fields 112, 33-51.
[7] G. Golubev, B. Levit, A. Tsybakov (1996). Asymptotically efficient estimation of analytic functions in Gaussian white noise. Bernoulli 2, 167-181.
[8] E. Titchmarsh (1937). Introduction to the Theory of Fourier Integrals. Oxford Univ. Press.
[9] H. Rosenthal (1970). On the subspaces of Lp (p > 2) spanned by sequences of independent random variables. Israel J. Math. 8, 273-303.
[10] V. Petrov (1995). Limit Theorems of Probability Theory. Clarendon Press, Oxford.
[11] A. Timan (1963). Theory of Approximation of Functions of a Real Variable. Macmillan, N.Y.
[12] I. Ibragimov, R. Khasminskii (1989). On density estimation in the view of Kolmogorov's ideas in approximation theory. Ann. Stat. 18, 999-1010.
[13] I. Ibragimov, R. Khasminskii (1982). On density estimation within a class of entire functions (in Russian). Theory Prob. Appl. 27, 514-524.
[14] I. Ibragimov, R. Khasminskii (1982). Bounds for the quality of nonparametric estimation of regression. Theory Prob. Appl. 27, 84-99.
[15] I. Ibragimov, R. Khasminskii (1984). On nonparametric estimation of regression in Lp norms. J. of Soviet Math. 25, 540-550.
[16] S. Bernstein (1937). Extremal Properties of Polynomials (in Russian). ONTI, Moscow.
[17] G. Szegő (1959). Orthogonal Polynomials. AMS Colloq. Publ., vol. XXIII.
[18] I. Ibragimov, R. Khasminskii (1981). Statistical Estimation: Asymptotic Theory. Springer, N.Y.
[19] I. Ibragimov, R. Khasminskii (1984). On nonparametric estimation of the value of a linear functional in Gaussian white noise. Theory Probab. Appl. 29, 18-32.
[20] R. Boas (1954). Entire Functions. Academic Press, New York.
ST. PETERSBURG BRANCH OF STEKLOV MATHEMATICAL INSTITUTE, RUSSIAN AC. SCI., FONTANKA 27, ST. PETERSBURG
191011, RUSSIA
[email protected]
THE DETERMINISTIC EVOLUTION OF GENERAL BRANCHING POPULATIONS

PETER JAGERS

Chalmers University of Technology and Göteborg University

Probability theory has a strength that extends beyond probabilistic results. The precise formulation of probabilistic models may lead to intuitive arguments that reach further than even sophisticated mathematical analysis of deterministic models. This is well known from the use of Brownian motion in exhibiting solutions of partial differential equations. Another illustration is provided by population dynamics. Branching processes focus on probabilistic problems, and rely on probabilistic methods. But the expected evolution of general branching populations is an interesting topic in its own right, that has much in common with structured deterministic population dynamics. Arguments based on Markov renewal theory demonstrate a remarkable strength as compared to traditional, differential-equations-based approaches in establishing exponential growth and the ensuing stabilization of population composition of expected populations. This is described in this article, aimed at a broad mathematical readership.

AMS subject classifications: 60J80, 60F25, 92A15.

Keywords and phrases: branching processes, population dynamics.
1 From Galton and Watson to Markov Population Processes
Recall Galton's famous formulation, more than a century ago, of the population extinction problem: "A large nation, of whom we will only concern ourselves with the adult males, N in number, and who each bear separate surnames, colonize a district. Their law of population is such that, in each generation, a₀ per cent of the adult males have no male children who reach adult life; a₁ have one such male child, a₂ have two; and so on up to a₅ who have five. Find (1) what proportion of the surnames will have become extinct after r generations; ..." Already this historical and pre-exact wording has much of the flavour typical of modern mathematical population dynamics: its starting point is a description of individual behaviour, in this case a probabilistic description of reproduction, and the properties asked for concern the population as a whole - in this case an extinction probability. The latter is typical. In the biologically - not mathematically! - simple Galton-Watson processes that were born
This work has been supported by the Swedish Natural Sciences Research Council.
out of the surname extinction problem, the investigated population properties always have this probabilistic character. Indeed, the deterministic part of evolution is so simplistic that it does not warrant any attention.

Not so if the processes evolve in real time, and have a minimal amount of generality: consider the one-type general branching process, i.e. a population whose reproduction (in deterministic terms) is determined by a reproduction function μ(u), giving the expected number of children up to (mother's) age u. In demographic and related theory μ(u) is often given as an integrated product of a survival probability and an age-dependent birth rate, $\int_0^u p(a)b(a)\,da$. Then, if the population is started at time 0 from a new-born individual, at time t the total (expected) number of individuals born into the population can be written

$\nu(t) = \sum_{n=0}^{\infty} \mu^{*n}(t).$

Here

$\mu^{*0}(t) = \begin{cases} 0 & \text{if } t < 0,\\ 1 & \text{otherwise,}\end{cases}$

and

$\mu^{*(n+1)}(t) = \int_0^t \mu^{*n}(t - u)\,\mu(du).$

In this, μ*ⁿ(t) clearly stands for the size of the n:th generation of the total population, i.e. born by time t. From an analytic viewpoint, the analysis of the (expected) dynamics of this type of populations is therefore little but the study of sums of convolution powers, a topic well investigated within the framework of renewal theory and integrated into the theory of branching processes since long (cf. Harris, 1963). However, even though these processes are "general" as compared to Galton-Watson processes and much demographic theory, they remain simplistic in assuming all individuals to be of the same type - even though allowing for them to meet with very different fates in life, by chance.
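The sum of convolution powers is easy to evaluate numerically via the equivalent renewal equation ν(t) = 1 + ∫₀ᵗ ν(t − u) μ(du). The sketch below discretizes this on a grid and compares the empirical growth rate with the Malthusian parameter α solving ∫ e^{−αu} μ(du) = 1; the reproduction density used is a made-up example, not from the paper:

```python
import numpy as np

h, T = 0.01, 30.0                      # time step and horizon for the discretization
t = np.arange(0.0, T, h)
b, a1, a2 = 1.2, 1.0, 3.0              # assumed birth rate and fertile age window
mu_density = np.where((t >= a1) & (t <= a2), b, 0.0)

# nu(t) = 1 + int_0^t nu(t - u) mu(du), solved forward on the grid.
nu = np.empty_like(t)
for i in range(len(t)):
    conv = np.dot(nu[:i], mu_density[i:0:-1]) * h if i > 0 else 0.0
    nu[i] = 1.0 + conv

# Malthusian parameter: int e^{-alpha u} mu(du) = 1, found by bisection.
def laplace(alpha):
    return np.sum(np.exp(-alpha * t) * mu_density) * h

lo, hi = 0.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if laplace(mid) > 1.0 else (lo, mid)
alpha = (lo + hi) / 2

# Empirical exponential growth rate over the last 10 time units.
emp = (np.log(nu[-1]) - np.log(nu[-1001])) / (1000 * h)
print(alpha, emp)
```

The log-slope of ν(t) settles down to α, illustrating the exponential growth that renewal theory guarantees.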
In this case the reproduction function is replaced by a reproduction kernel, describing the (expected) child-bearing of an individual, given her type. Thus, let (S, 𝒮) be a measurable space, the type space, about which we only assume that the σ-algebra 𝒮 is generated by some countable class of sets. Let ℬ denote the Borel algebra on R₊. The reproduction kernel is denoted μ(s, A × B), the (expected) number of B-type children of an s-type mother while she is in age-interval A, s ∈ S, B ∈ 𝒮. Note that it is the type at birth that determines the kernel; we shall return to the question of individuals possibly moving in some state space during life. At this junction, let us only note that even though such movement may influence reproduction, the movement itself can be included in the reproduction kernel, which thus remains the entity determining the population dynamics. It does so much in the same manner as in the case of one-type populations, only convolution has to be replaced by a combination of convolution in age and Markov transition in type:

Start the population from one, new-born s-individual at time 0. Her generation, the 0:th, will have a trivial size and type distribution at time t that can be written

$\mu^0(s, [0, t] \times B) := 1_B(s)\,1_{R_+}(t).$

(As before, we consider the total population, disregarding death for the time being. A one with a suffix stands for an indicator function.) The next generation, consisting of the ancestor's daughters, has the size and type distribution

$\mu^1(s, [0, t] \times B) := \mu(s, [0, t] \times B),$

and so it continues:

$\mu^2(s, [0, t] \times B) := \int_{R_+\times S} \mu^1(r, [0, t - u] \times B)\,\mu(s, du \times dr),\ \ldots$

$\mu^{n+1}(s, [0, t] \times B) := \int_{R_+\times S} \mu^n(r, [0, t - u] \times B)\,\mu(s, du \times dr),$

the total population size and type distribution at time t thus being

$U_s(t, B) := U_s([0, t] \times B) := \sum_{n=0}^{\infty} \mu^n(s, [0, t] \times B).$

We shall allow ourselves to identify non-decreasing functions with measures on R₊, even if the measure is actually defined on ℬ × 𝒮, as above or in μ_s(t, B) := μ(s, [0, t] × B). Consider now some property D of an individual, that has a well defined probability, given the individual's type r (at birth) and age a (now). Denote it by p_D(r, a). The property D could be 'being in some specified subset of
a geographical or general state space'; it could simply mean 'being alive', so that if L_r denotes the distribution function of the life span of an r-type individual, then p_D(r, a) = 1 − L_r(a). Actually D might also refer to the progeny of an individual, like the number of granddaughters. Here we shall assume that it cannot refer backwards to progenitors. However, if a property refers only finitely many generations backward, that can be remedied by the simple trick of moving the property backwards to the last common ancestor of those concerned, cf. Jagers and Nerman (1984), where also other cases are discussed. In the present paper we assume throughout that p_D(r, a) = 0 for a < 0. An example of such a property: the individual is not above age u, and is still alive at age a. Other examples would be the probability of being in mitosis, or having a specified DNA-content or size. A particularly popular model is to let the individual be born at some starting position (note that this need not at all be the type of the individual, even though information about birthplace may be included in the latter) and then let it diffuse to other positions. The important matter, in our context, is that once D has been fixed, and p_D is measurable, then the (expected) number of individuals having the property D at time t will be given by

(1)  $M_s(t, D) = \int_{R_+\times S} p_D(r, t - u)\,U_s(du \times dr).$
The study of (the deterministic part of) population evolution can thus be described as the analysis of the functions M_s(·, D) of time, for s ∈ S and various D. The results that follow have the form

(2)  $M_s(t, D) \sim h(s)\,e^{\alpha t}\,\hat p_D(\alpha)/(\alpha\beta),$

as t → ∞ - with a slight modification in the so called lattice case, where the population dynamics display some inherent periodicities. Here, h : S → R₊ is a reproductive value function: it describes the fitness of the types. Mathematically it arises as an eigenfunction. The α is the famous Malthusian growth parameter and β a time scaler, meaning the average age at childbearing, in a certain sense. As indicated by the notation, $\hat p_D(\alpha)$ is a Laplace transform evaluated at the Malthusian parameter. …

Consider now a cell population: there is a splitting intensity b(x) > 0, x standing for the individual cell size. Similarly there is a death intensity δ(x) > 0, death meaning the cell disappearing without giving birth to daughter cells. When a cell splits, its mass is assumed to be equally divided between the daughters. Individual cell growth is deterministic, i.e. the same for all cells with given birth size: x' = g(x), x(0) = size at birth, g > 0. Assume that there is a minimal and a maximal cell size a and 4a, so that 0 < a ≤ x(0) ≤ 2a, and no cell smaller than 2a can divide. (The following argument is not correct without some such condition, absent from a first version of this paper, as noted by Alexandersson (1998).) The growth equation yields dt = dx/g(x), and the distribution function for the size y at death or division of a cell with birth size x is

$1 - \exp\Bigl(-\int_x^y \frac{b(\xi) + \delta(\xi)}{g(\xi)}\,d\xi\Bigr).$
To obtain y-sized daughter cells the mother must herself attain size 2y, and the expected number of y-sized daughters becomes

$2b(2y)\exp\Bigl(-\int_x^{2y}\frac{b(\xi) + \delta(\xi)}{g(\xi)}\,d\xi\Bigr)\frac{d(2y)}{g(2y)}.$

Once y has been fixed, the age u at division is determined by

$\int_x^{2y}\frac{d\xi}{g(\xi)} = u.$

In the notation

$c(x) = \int_a^x \frac{d\xi}{g(\xi)}, \qquad E(x) = \int_a^x \frac{b(\xi) + \delta(\xi)}{g(\xi)}\,d\xi,$

we can thus write the reproduction kernel

$\mu(x, du \times dy) = \frac{4b(2y)}{g(2y)}\,e^{-(E(2y) - E(x))}\,1_{\{c(2y) - c(x)\}}(du)\,dy, \qquad a < x, y < 2a.$
… Returning to the definition of the Perron root we see that the series

$\sum_{n=0}^{\infty} t^n\,\ldots$

converges or diverges according as t < 1 or t > 1. Thus α is the Malthusian parameter, and the kernel is recurrent. It is easy to formulate conditions for the kernel to be positive with respect to Lebesgue measure (on some interval) and for the strong recurrence condition 0 < β < ∞. The condition that $\sup_s \mu(s, [0, \epsilon] \times S) < 1$ for some ε > 0 is clearly satisfied. Moreover, for x, y given, μ(x, dt × dy) gives mass only to the age point c(2y) − c(x). From the definition of d-lattices, we know that we are in the lattice case if and only if all splits occur at ages c(y) − c(x) + nd, n ∈ Z₊, i.e. iff c(2y) = c(y) + d - by the continuity of c, d can only appear once. Since

$c(2y) - c(y) = \int_y^{2y} \frac{d\xi}{g(\xi)},$

it is direct to check that this is the case precisely when g(2y) = 2g(y), as shown in Metz and Diekmann (1986). We see that we have asynchronous exponential growth if this is not the case, and a periodic behaviour otherwise, by applying Theorems 1 and 2, respectively. It is also easy to guess how slight changes in the model, e.g. the introduction of a quiescent period, will destroy the lattice properties, and thus salvage the pure asynchronous exponential growth.

Acknowledgements I thank Ziad Taib for timely and helpful reading of this manuscript.
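The lattice criterion g(2y) = 2g(y) can be checked numerically by computing the age gap c(2y) − c(y) a cell needs to grow from a possible birth size y to the corresponding split size 2y. The growth functions and constants below are illustrative choices, not taken from the paper:

```python
import numpy as np

def split_age_gap(g, y, n=10_000):
    """c(2y) - c(y) = int_y^{2y} dxi / g(xi), by the trapezoidal rule."""
    xi = np.linspace(y, 2 * y, n)
    v = 1.0 / g(xi)
    return (xi[1] - xi[0]) * (v.sum() - 0.5 * (v[0] + v[-1]))

ys = np.linspace(1.0, 2.0, 5)              # birth sizes in (a, 2a) with a = 1

# Exponential individual growth g(x) = kappa*x satisfies g(2y) = 2 g(y):
# the gap is the constant ln(2)/kappa, so splits sit on a lattice (periodic case).
gaps_exp = [split_age_gap(lambda x: 0.7 * x, y) for y in ys]

# Constant growth g(x) = kappa violates g(2y) = 2 g(y): the gap y/kappa varies
# with y, so the kernel is non-lattice (asynchronous exponential growth).
gaps_const = [split_age_gap(lambda x: np.full_like(x, 0.7), y) for y in ys]

print(np.ptp(gaps_exp), np.ptp(gaps_const))
```

A zero spread of the gaps signals the lattice (periodic) case; a clearly positive spread signals asynchronous exponential growth.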
REFERENCES
Alexandersson, M. (1998). Branching Processes and Cell Populations. Licentiate Thesis, Dep. Mathematics, Chalmers U. Tech. and Göteborg U.
Arino, A. and Kimmel, M. (1993). Comparison of approaches to modelling of cell population dynamics. SIAM Journal of Applied Mathematics 53, 1480-1504.
Diekmann, O. (1993). An invitation to structured (meta)population models. In: Levin, S. A., Powell, T., and Steele, J. (eds.) Patch Dynamics. Springer-Verlag, New York.
Diekmann, O., Gyllenberg, M., Metz, J. A. J., and Thieme, H. R. (1993). The "cumulative" formulation of (physiologically) structured population models. In: Ph. Clement and G. Lumer (eds.), Evolution Equations, Control Theory and Biomathematics, Marcel Dekker, New York, 145-154.
Diekmann, O., Heijmans, H. J. A. M., and Thieme, H. R. (1984). On the stability of the cell size distribution. J. Math. Biology 19, 227-248.
Gyllenberg, M. and Webb, G. F. (1992). Asynchronous exponential growth of semigroups of non-linear operators. Journal of Mathematical Analysis and Applications 167, 443-467.
Harris, T. E. (1963). The Theory of Branching Processes. Springer-Verlag, Berlin. New ed. by Dover, New York, 1989.
Jagers, P. (1975). Branching Processes with Biological Applications. J. Wiley & Sons, Chichester.
Jagers, P. (1983). On the Malthusianness of general branching processes in abstract type spaces. In: Gut, A. and Holst, L. (eds.) Probability and Mathematical Statistics. Essays in Honour of Carl-Gustav Esseen. Dep. Mathematics, Uppsala University.
Jagers, P. (1989). General branching processes as Markov fields. Stochastic Processes and their Applications 32, 183-242.
Jagers, P. (1991). The growth and stabilization of populations. Statistical Science 6, 269-283.
Jagers, P. (1997a). Towards dependence in general branching processes. In: Athreya, K. B. and Jagers, P. (eds.), Classical and Modern Branching Processes. Springer, New York.
Jagers, P. (1997b). Coupling and population dependence in branching processes. Annals of Applied Probability 7, 281-298.
Jagers, P. and Klebaner, F. (1998). Population-size-dependent and age-dependent branching processes. Preprint, Dep. Mathematics, Chalmers U. Tech. and Göteborg U. 1998:56.
Jagers, P. and Nerman, O. (1984). The growth and composition of branching populations. Advances in Applied Probability 16, 221-259.
Jagers, P. and Nerman, O. (1996). The asymptotic composition of supercritical multi-type branching populations. In: J. Azema, M. Emery and M. Yor (eds.) Seminaire de Probabilites XXX. Lecture Notes in Mathematics 1626, Springer-Verlag.
Klebaner, F. C. (1989). Geometric growth in near-supercritical population size dependent Galton-Watson processes. Annals of Probability 17, 1466-1477.
Klebaner, F. C. (1991). Asymptotic behaviour of near-critical multitype branching processes. Journal of Applied Probability 28, 512-519.
Klebaner, F. C. (1994). Asymptotic behaviour of Markov population processes with asymptotically linear rate of change. Journal of Applied Probability 31, 614-625.
Metz, J. A. J. and Diekmann, O. (eds.) (1986). The Dynamics of Physiologically Structured Populations. Lecture Notes in Biomathematics 68. Springer-Verlag, Berlin.
Ney, P. and Nummelin, E. (1984). Markov additive processes I. Eigenvalue properties and limit theorems. Annals of Probability 15, 561-592.
Niemi, S. and Nummelin, E. (1986). On non-singular renewal kernels with an application to a semigroup of transition kernels. Stochastic Processes and their Applications 22, 177-202.
Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative Operators. Cambridge University Press, Cambridge.
Shurenkov, V. M. (1984). On the theory of Markov renewal. Theory of Probability and its Applications 29, 247-265.
Shurenkov, V. M. (1989). Ergodicheskie protsessy Markova. Nauka, Moscow.
Shurenkov, V. M. (1992). On the relationship between spectral radii and Perron roots. Preprint, Dep. Mathematics, Chalmers U. Tech. and Göteborg U.
Shurenkov, V. M. (1992). On the existence of a Malthusian parameter. Preprint, Dep. Mathematics, Chalmers U. Tech. and Göteborg U.
Thieme, H. R. (1992). Analysis of structured population models by non-linear perturbations of semigroups and evolutionary systems. 3rd International Conference on Mathematical Population Dynamics.
Webb, G. F. (1985). Theory of Nonlinear Age-Dependent Population Dynamics. Marcel Dekker, New York.
SCHOOL OF MATHEMATICAL AND COMPUTING SCIENCES, CHALMERS UNIVERSITY OF TECHNOLOGY, SE-412 96 GOTHENBURG, SWEDEN
[email protected]
CHI-SQUARE ORACLE INEQUALITIES

IAIN M. JOHNSTONE

Stanford University

We study soft threshold estimates of the non-centrality parameter ξ of a non-central χ²_d(ξ) distribution, of interest, for example, in estimation of the squared length of the mean of a Gaussian vector. Mean squared error and oracle bounds, both upper and lower, are derived for all degrees of freedom d. These bounds are remarkably similar to those in the limiting Gaussian shift case. In nonparametric estimation of ∫f², a dyadic block implementation of these ideas leads to an alternate proof of the optimal adaptivity result of Efromovich and Low.

AMS subject classifications: 62E17, 62F11, 62G07, 62G05.

Keywords and phrases: Quadratic functionals, Adaptive estimation, Gaussian sequence model, Efficient estimation, non-central chi-square.
1 Introduction
The aim of this paper is to develop thresholding tools for estimation of certain quadratic functionals. We begin in a finite dimensional setting, with the estimation of the squared length of the mean of a Gaussian vector with spherical covariance. The transition from linear to quadratic functionals of the data entails a shift from Gaussian to (non-central) chi-squared distributions χ²_d(ξ), and it is the non-centrality parameter ξ that we now seek to estimate. It turns out that (soft) threshold estimators of the non-centrality parameter have mean squared error properties which, after appropriate scaling, very closely match those of the Gaussian shift model. This might be expected for large d, but this is not solely an asymptotic phenomenon: the detailed structure of the chi-squared distribution family allows relatively sharp bounds to be established for the full range of degrees of freedom d. We develop oracle inequalities which show that thresholding of the natural unbiased estimator of ξ at √(2 log d) standard deviations (according to central χ²_d) leads to an estimator of the non-centrality parameter that is within a multiplicative factor 2 log d + β_d of an 'ideal' estimator that can use knowledge of ξ to choose between an unbiased rule or simply estimating zero. These results are outlined in Section 2. Section 3 shows that the multiplicative 2 log d penalty is sharp for large degrees of freedom d, essentially by reduction to a limiting Gaussian shift problem.

This research was supported by NSF DMS 9505151 and ANU.
Section 4 illustrates thresholding in a well-studied nonparametric setting, namely estimation of ∫f², which figures in the asymptotic properties (variance, efficiency) of rank based tests and estimates. We apply the oracle inequalities in the now classical model in which a signal f is observed in Gaussian white noise of scale ε. When this model is expressed in a Haar wavelet basis, the sum of squares of the empirical coefficients at a resolution level j has an $\varepsilon^2\chi^2_{2^j}(\rho_j/\varepsilon^2)$ distribution with parameter ρ_j equal to the sum of squares of the corresponding theoretical coefficients. Thus ∫f² = Σ_j ρ_j, and this leads to use of the oracle inequalities on each separate level j. Section 5 contains some remarks on the extension of our thresholding results to weighted combinations of chi-squared variates. In addition to proof details, the final section collects some useful identities for central and non-central χ², as well as a moderate deviations bound for central χ², Lemma 6.1, in the style of the Mills ratio bound for Gaussian variates.

2 Estimating the norm of a Gaussian Vector
Suppose we observe y = (y_i) ∈ R^d, where y ~ N_d(θ, ε²I). We wish to estimate ρ = ||θ||². A natural unbiased estimate is U = |y|² − dε² = Σ_i(y_i² − ε²). We propose to study the shrunken estimate

(1)  $\hat\rho_t = \hat\rho(U; t) = (U - t\varepsilon^2)_+.$

This estimate is always non-negative, and like similar shrunken estimators we have studied elsewhere, enjoys risk benefits over U when ρ is zero or near zero. We will be particularly interested in t = t_d = σ_d√(2 log d), where σ_d = √(2d) is the standard deviation of χ²_d, the distribution of |y|²/ε² when θ = 0. [The positive part estimator, corresponding to t = 0, has already been studied, for example by Saxena and Alam (1982); Chow (1987).]

The estimator ρ̂_t may be motivated as follows. Let

(2)  $\sigma^2(\rho) = \operatorname{Var}(U) = 2\varepsilon^4 d + 4\varepsilon^2\rho.$

An "ideal" but non-measurable estimate of ρ would estimate by 0 if ρ < σ(ρ) and by U if ρ > σ(ρ). This rule improves on U when the parameter ρ is so small that the bias incurred by estimating 0 is less than the variance incurred by using the estimator U. Hence, this ideal strategy would have risk min{ρ², σ²(ρ)}. Of course, no statistic can be found which achieves this ideal, because the data cannot tell us whether ρ < σ(ρ) for certain. However, we show that ρ̂_t comes as close to this ideal as can be hoped for.

To formulate the main results, it is convenient to rescale to noise level ε = 1, and to change notation to avoid confusion.
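A small Monte Carlo experiment illustrates this motivation: at ρ = 0 the unbiased estimator U pays its full variance 2ε⁴d, while the thresholded estimate is essentially zero. The dimension and replication count below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 64, 1.0
t_d = np.sqrt(2 * d) * np.sqrt(2 * np.log(d))   # t_d = sigma_d * sqrt(2 log d)

def mse(rho, n_rep=50_000):
    theta = np.zeros(d)
    theta[0] = np.sqrt(rho)                      # any mean with ||theta||^2 = rho
    y = theta + eps * rng.standard_normal((n_rep, d))
    U = (y ** 2).sum(axis=1) - d * eps ** 2      # unbiased estimate of rho
    rho_t = np.maximum(U - t_d * eps ** 2, 0.0)  # thresholded estimate
    return np.mean((U - rho) ** 2), np.mean((rho_t - rho) ** 2)

mse_U0, mse_t0 = mse(0.0)        # at rho = 0: risk 2d for U, near 0 after thresholding
print(mse_U0, mse_t0)
```

The simulated risk of U at ρ = 0 matches its variance 2ε⁴d = 2d, while thresholding at t_d removes almost all of it, in line with the ideal risk min{ρ², σ²(ρ)}.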
Thus, let W_d ~ χ²_d(ξ); we seek to estimate ξ using the threshold estimator

$\hat\xi_t(w) = (w - d - t)_+.$

Write σ²(ξ) = 2d + 4ξ for the variance of W_d, and $\bar F_d(w) = P(\chi^2_d > w)$ for the survivor function of the corresponding central χ² distribution. Introduce two auxiliary constants (which are small for d large and t = o(d) large):

(3)  $\eta_1 = 2\bar F_{d+2}(d + t), \qquad \eta_2 = \eta_1 + t/d, \qquad t > 2.$
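For orientation, the Gaussian-shift comparison exploited later can be previewed by simulation: standardizing W_d by its mean d + ξ and variance 2d + 4ξ gives an approximately standard normal variable for large d. The parameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, xi, n = 400, 50.0, 200_000

# Draw from the non-central chi-square with d degrees of freedom and nonc xi.
W = rng.noncentral_chisquare(df=d, nonc=xi, size=n)

# Standardize: E W = d + xi, Var W = 2d + 4xi.
Z = (W - d - xi) / np.sqrt(2 * d + 4 * xi)
print(Z.mean(), Z.std(), np.mean(Z <= 1.0))   # roughly 0, 1, Phi(1)
```

The standardization is exact in mean and variance; only the shape of the distribution differs from Gaussian, and that difference shrinks as d grows.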
Let D_ξ and D²_ξ denote partial derivatives with respect to ξ.

Theorem 2.1 With these definitions, the mean squared error r(ξ, t) = E(ξ̂_t(W_d) − ξ)² satisfies, for all d ≥ 1, ξ ≥ 0 and t > 2,

(4)  $r(\xi, t) \le \ldots$

…

(16)  $R \ge L\,\bar\pi(\{\theta_1\})\,E_{\theta_1}\bigl[\theta_1\eta(X) - \theta_1\bigr]^2.$
Proof The minimax risk R is bounded below by the Bayes risk B(π), using prior distribution π and loss function l(θ)[θ̂(X) − θ]²:

$R \ge B(\pi) = \inf_{\hat\theta}\int \pi(d\theta)\,E_\theta\,l(\theta)\bigl[\hat\theta(X) - \theta\bigr]^2.$

At least to aid intuition, it helps to convert this into a Bayes risk for squared error loss with the modified prior π̄ given above, so that

$B(\pi) = L\,\inf_{\hat\theta}\int \bar\pi(d\theta)\,E_\theta\bigl[\hat\theta(X) - \theta\bigr]^2 =: L\,\bar B(\bar\pi).$

For squared error loss, now, the Bayes estimator θ̂_π̄ that attains the minimum B̄(π̄) is given by the posterior mean, which in the two point case with θ₀ = 0 takes the simple form

$\hat\theta_{\bar\pi}(x) = E_{\bar\pi}[\theta \mid x] = \theta_1\,\bar P(\{\theta_1\} \mid x) = \theta_1\eta(x),$

which implies the desired formula (16):

$B(\pi) = L\int \bar\pi(d\theta)\,E_\theta\bigl[\hat\theta_{\bar\pi}(X) - \theta\bigr]^2 \ge L\,\bar\pi(\{\theta_1\})\,E_{\theta_1}\bigl[\theta_1\eta(X) - \theta_1\bigr]^2.$
Proposition 3.1 Suppose X ~ N(θ, 1). Then as d → ∞, …

Proof In Lemma 3.1, let P_θ correspond to X ~ N(θ, 1) and take l(θ) = [d⁻¹ + (θ² ∧ 1)]⁻¹. Choose θ₀ = 0 and θ₁ = θ_d ≫ 1 (to be specified below), so that l₀ = d and l₁ = (1 + d⁻¹)⁻¹. Set π₀ = 1/log d and π₁ = 1 − π₀, so that L = π₀l₀ + π₁l₁ ~ d/log d, and the loss-weighted prior is π̄ = (1 − ε)ν₀ + εν_{θ_d} with ε = π₁l₁/L ~ log d/d small. The idea is that with ε small, we choose θ_d so that even for x near θ_d, η(x) = P̄({0}|x) ≈ 1. Thus, with probability essentially ε we estimate θ̂ ≈ 0 even though θ = θ_d, and so incur an error of about θ_d.

Now the details. Write g(x; θ) for the N(θ, 1) density and, since we will recenter at θ_d, put x = θ_d + z. Then the likelihood ratio can be computed explicitly, and the posterior probability η(x) can be written in terms of the likelihood ratio as

(19)  $\eta(\theta_d + z) = \bigl[1 + \ldots\bigr]^{-1}.$

Put a_d = log log d and specify θ_d as the solution to η(θ_d + a_d) = … .

Proposition 3.2 … as d → ∞, …

Proof The rescaled variable X = (W − d)/√(2d) has mean θ = ξ/√(2d), variance σ²_d(θ) = 1 + θ√(8/d), and is asymptotically Gaussian as d → ∞. …

For α > 1/4, efficient estimation at mean squared error rate n⁻¹ is possible, while for α < 1/4, the best MSE rate is n^{−r}, r = 8α/(1 + 4α). The problem of "adaptive estimation" concerns whether one can, without knowledge of α, build estimators that achieve these optimal
rates for every α in a suitable range. Alas, Efromovich and Low (1996a,b) showed that this is not possible as soon as 0 < α < 1/4. They go on to adapt a version of Lepskii's general purpose step-down adaptivity construction (Lepskii, 1991) to build an estimator that is efficient for α > 1/4 and attains the best rate (logarithmically worse than n^{−r}) that is possible simultaneously for 0 < α < 1/4. The treatment here is simply an illustration of chi-square thresholding to obtain the Efromovich-Low result. Two recent works (received after the first draft was completed) go much further with the ∫f² problem. Gayraud and Tribouley (1999) use hard thresholding to derive Efromovich-Low, and go on to provide limiting Gaussian distribution results and even confidence intervals. Laurent and Massart (1998) derive non-asymptotic risk bounds via penalized least squares model selection and consider a wide family of functional classes including ℓ_p and Besov bodies.

Consider the white noise model, where one observes $Y_t = \int_0^t f(s)\,ds + \varepsilon W_t$, 0 ≤ t ≤ 1, where W_t is standard Brownian motion and f ∈ L₂([0, 1]) is unknown. It is desired to estimate Qf = ∫f². We use the Haar orthonormal basis, defined by $h(t) = 1_{[0,1/2]} - 1_{[1/2,1]}$ and $\psi_I(t) = 2^{j/2}h(2^j t - k)$ for indices I = (j, k) with j ∈ N and k ∈ I_j = {1, …, 2^j}. We add the scaling function $\psi_{(-1,0)} = 1_{[0,1]}$. In terms of the orthonormal coefficients, the observations take the dyadic sequence form (25)
y_I = θ_I + ε z_I,
where θ_I = ∫ψ_I f and the noise variables z_I are i.i.d. standard Gaussian. By Parseval's identity, Qf = Σ_I θ_I², and we group the coefficients by level j: ρ_j = Σ_{k∈I_j} θ_{jk}²,
where |I_j| = d_j = 2^j (and equals 1 for j = −1). The corresponding sums of squared data coefficients, S_j = Σ_{k∈I_j} y_{jk}², have non-central χ² distributions: S_j ~ ε² χ²_{d_j}(ρ_j/ε²).
We estimate Qf by estimating ρ_j at each level separately and then adding. To quantify smoothness, we use, for simplicity, the Hölder classes, which can be expressed for α < 1 in terms of the Haar wavelet coefficients as (26)
Θ_α(C) = {θ : |θ_I| ≤ C 2^{−(α+1/2)j} for all I}.
See Meyer (1990, Sec. 6.4) or Härdle et al. (1998, Theorem 9.6) for a specific result. Thus, in terms of the levelwise squared ℓ₂ norms,
(27) ρ_j ≤ C² 2^{−2αj} for all j ≥ 0.
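As a quick numerical sanity check (an illustration added here, not from the paper), the extremal sequence with |θ_{jk}| = C 2^{−(α+1/2)j} attains the levelwise bound (27) with equality:

```python
def rho_level(j, alpha, C):
    # level-j energy rho_j = sum_{k in I_j} theta_{jk}^2 for the extremal
    # sequence |theta_{jk}| = C 2^{-(alpha + 1/2) j} in Theta_alpha(C)
    d_j = 2 ** j
    theta = C * 2.0 ** (-(alpha + 0.5) * j)
    return d_j * theta ** 2

alpha, C = 0.5, 2.0
for j in range(8):
    # the bound (27) is attained with equality: rho_j = C^2 2^{-2 alpha j}
    assert abs(rho_level(j, alpha, C) - C ** 2 * 2.0 ** (-2 * alpha * j)) < 1e-12
```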
Chi-square Oracle Inequalities
In smoother cases, the low frequencies are most important, whereas in rough settings, higher frequencies are critical. For the lower frequencies, define I_e = {I : j < j₀}. The estimate combines unbiased estimation at these lower frequencies (where efficiency is the goal)
(28) Q_e = Σ_{I∈I_e} (y_I² − ε²),
with thresholding at higher frequencies
(29) Q_t = Σ_{j₀≤j≤j₁} ρ̂_j, the sum of the thresholded level estimates of Section 2, where, in notation matching that section, we put σ_j = √(2d_j) and t_j = σ_j √(2 log d_j).
Of course j₀ and j₁ as just defined need not be integer valued. We adopt throughout the convention that a sum Σ_{j=a}^{b} is taken to run over j = ⌊a⌋ to j = ⌈b⌉. Below, c denotes a constant depending only on α, not necessarily the same at each appearance.

Theorem 4.1 Let observations be taken from the Gaussian dyadic sequence model (25) and let the estimator Q̂ = Q_e + Q_t of Qf = ∫f² be defined via (28) and (29). Let r = 8α/(1 + 4α). (i) For 0 < α < 1/4, (30)
sup_{f∈Θ_α(C)} E(Q̂ − Qf)² ≤ c C^{2(2−r)} (ε² √(log(Cε^{−1})))^r (1 + o(1)).
(ii) For α > 1/4,
(31) sup_{f∈Θ_α(C)} |E(Q̂ − Qf)² − 4ε²Qf| = o(ε²).
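The two-zone estimator can be sketched in code. This is my illustration, not the paper's implementation: the level rule used here (keep the unbiased estimate S_j − d_jε² only when it clears the threshold t_jε², else set it to zero) is an assumed stand-in for the estimator ρ̂ of Section 2, which is not reproduced in this excerpt.

```python
import math

def Q_hat(y_levels, eps, j0, j1):
    # Sketch of Q = Q_e + Q_t from (28)-(29).
    # y_levels[j] holds the observations (y_{jk}) at level j, so d_j = 2^j.
    # (28): unbiased estimation at the low frequencies j < j0
    Qe = sum(y * y - eps ** 2 for j in range(j0) for y in y_levels[j])
    # (29): keep a level estimate only when it clears the threshold t_j eps^2
    Qt = 0.0
    for j in range(j0, min(j1, len(y_levels) - 1) + 1):
        d_j = 2 ** j
        S_j = sum(y * y for y in y_levels[j])
        rho_hat = S_j - d_j * eps ** 2            # unbiased for rho_j
        t_j = math.sqrt(2 * d_j) * math.sqrt(2 * math.log(d_j)) if d_j > 1 else 0.0
        if rho_hat > t_j * eps ** 2:
            Qt += rho_hat
    return Qe + Qt

# noiseless sanity check: with eps = 0 the estimator recovers Qf exactly
levels = [[2.0 ** (-0.75 * j)] * (2 ** j) for j in range(6)]
Qf = sum(t * t for row in levels for t in row)
assert abs(Q_hat(levels, eps=0.0, j0=3, j1=5) - Qf) < 1e-9
```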
Proof Decompose Qf = Σ_j ρ_j = Q_e f + Q_t f + Q_r f, where the ranges of summation match those of Q_e and Q_t in (28) and (29). From the triangle inequality for ‖δ‖ = √(Eδ²),
(32) √(E(Q̂ − Qf)²) ≤ √(E(Q_e − Q_e f)²) + √(E(Q_t − Q_t f)²) + Q_r f.
The tail bound is negligible in all cases: from (27) and (29), Q_r f = Σ_{j>j₁} ρ_j ≤ c C² 2^{−2αj₁} = o(ε²).
Efficient term. Since Q_e is unbiased, we have, using (28) and (2), E(Q_e − Q_e f)²
(33) = Var Q_e = 4ε² Σ_{I_e} θ_I² + 2^{j₀+1} ε⁴.
The second term is always negligible: from (28), 2^{j₀} ε⁴ = ε² (log₂ ε^{−2})^{−1/2} = o(ε²). Further, using (27), for f ∈ Θ_α(C), 4ε² |Σ_{I_e} θ_I² − Qf| ≤ 4ε² Σ_{j≥j₀} ρ_j = o(ε²). Combining the two previous displays,
(34) sup_{f∈Θ_α(C)} |E(Q_e − Q_e f)² − 4ε²Qf| = o(ε²).
Thresholding term. The rest of the proof is concerned with bounding
(35) √(E(Q_t − Q_t f)²) ≤ Σ_{j₀}^{j₁} √(E(ρ̂_j − ρ_j)²) =: T_ε(f).
The oracle inequality (15) yields
(36) E(ρ̂_j − ρ_j)² ≤ 2ε⁴ + min{2ρ_j², σ²(ρ_j) + t_j² ε⁴},
where σ²(ρ_j) = 2^{j+1} ε⁴ + 4ε² ρ_j. First, we evaluate
(37) T̄_ε = Σ_{j₀}^{j₁} min{ρ_j, t_j ε²}.
Since ρ_j = C² 2^{−2αj} is geometrically decreasing in j and t_j ε² = c j^{1/2} 2^{j/2} ε² is geometrically increasing in j, we must have T̄_ε ≤ c(α) ρ_{j₂}, where j₂ = j₂(ε; C, α) is the crossing point, namely the (usually non-integer) solution to j 2^{(1+4α)j} = c C⁴ ε^{−4}. As spelled out in (56) in the Appendix, as ε → 0,
(38) T̄_ε ≤ c(α) ρ_{j₂} = c C² 2^{−2αj₂} ≍ c C^{2−r} (ε² √(log(Cε^{−1})))^{r/2}.
We conclude by checking that on Θ_α(C), T_ε(f) ≤ c T̄_ε for small ε. Looking at the terms in (36), we observe first that j₁ ε² = o(T̄_ε). Now let j₃ be the solution to ρ_j = t_j² ε², or equivalently j 2^{(1+2α)j} = c C² ε^{−2}. Again using (56) and (28), one checks that for small ε, j₃(ε) ≤ j₀(ε). From this it follows that for θ ∈ Θ_α(C) and j ≥ j₀ we have σ²(ρ_j) ≤ c t_j² ε⁴, so that (37) is indeed the dominant term in (35).
In the efficient case, (38) is negligible, so that (31) follows from (32) and (35). In the nonparametric zone 0 < α < 1/4, (38) shows that (34) is negligible relative to (35), from which we obtain (30). □

Remark. Haar wavelets have been ingeniously used by Kerkyacharian and Picard (1996) in the context of estimating ∫f² and especially ∫f³. However, thresholding is not used there, nor is adaptivity considered.
5 Remarks on weighted chi-square
Suppose, as before, that y_k ~ N(θ_k, ε²), k = 1,…,d, are independent, but that now we desire to estimate ρ_a = Σ_k a_k θ_k² with a_k ≥ 0. Such a scenario emerges in estimation of ∫(D^l f)², for example. Then the natural unbiased estimator ρ_{U,a} = Σ_k a_k (y_k² − ε²) is no longer a shift of a chi-square variate. If the weights are comparable, say 1 ≤ a_k ≤ ā for all k, then an extension of the risk bounds of Theorem 2.1 is possible. We cite here only the extension of Corollary 2.2, referring to Johnstone (2000) for further results and details.
Proposition 5.1 With the above notation, set t_d = σ_d √(2 log d). There exists an absolute constant γ such that
E[ρ̂(ρ_{U,a}; t_d) − ρ_a]² ≤ γ ā² [2ε⁴ + min{2ρ_a², σ²(ρ_a) + t_d² ε⁴}].
6 Appendix

6.1 Central χ² distributions
Write f_d(w) = e^{−w/2} w^{d/2−1} / (2^{d/2} Γ(d/2)) for the density function of χ²_d and F̄_d(w) = ∫_w^∞ f_d(u) du for the survivor form of the distribution function. We note the relations
(39) w f_d(w) = d f_{d+2}(w),
(40) w² f_d(w) = d(d+2) f_{d+4}(w),
(41) D_w f_{d+2}(w) = ½[f_d(w) − f_{d+2}(w)],
where D_w denotes the partial derivative w.r.t. w. Recall that the Poisson p.d.f. is denoted by p_λ(x) = e^{−λ} λ^x / Γ(x+1). From (41) or via probabilistic arguments,
(42) F̄_{d+2}(w) − F̄_d(w) = p_{w/2}(d/2).
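The identities (39), (40) and (42) are easy to verify numerically; the sketch below (my illustration, not from the paper) computes the density from its formula and checks (42) by crude quadrature:

```python
import math

def f(d, w):
    # chi^2_d density f_d(w) = e^{-w/2} w^{d/2-1} / (2^{d/2} Gamma(d/2))
    return math.exp(-w / 2 + (d / 2 - 1) * math.log(w)
                    - (d / 2) * math.log(2) - math.lgamma(d / 2))

def poisson_pdf(lam, x):
    # p_lambda(x) = e^{-lambda} lambda^x / Gamma(x + 1); x need not be an integer
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def survivor(d, w, upper=200.0, steps=200000):
    # F-bar_d(w) = int_w^infty f_d(u) du, approximated by the trapezoid rule
    h = (upper - w) / steps
    s = 0.5 * (f(d, w) + f(d, upper))
    for i in range(1, steps):
        s += f(d, w + i * h)
    return s * h

d, w = 6, 9.5
assert abs(w * f(d, w) - d * f(d + 2, w)) < 1e-12                    # (39)
assert abs(w ** 2 * f(d, w) - d * (d + 2) * f(d + 4, w)) < 1e-12     # (40)
assert abs(survivor(d + 2, w) - survivor(d, w)
           - poisson_pdf(w / 2, d / 2)) < 1e-6                       # (42)
```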
6.2 A moderate deviations bound for central χ²
Lemma 6.1 Let W_d ~ χ²_d and σ_d² = Var W_d = 2d. If 0 < s ≤ √d/3, then
(43) P(W_d − d > sσ_d) ≤ (1/s + √2/√d) e^{√2 s³/(3√d)} φ(s).
In consequence, if d ≥ 16 and 0 < s ≤ d^{1/6}, then
(44) P(W_d − d > sσ_d) ≤ c φ(s)/s.
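The bound can be checked numerically. In the sketch below (mine; the constants in `md_bound` follow the form of (43) as reconstructed above, so the check is of that reconstruction), the exact χ² tail is computed by quadrature and compared with the bound:

```python
import math

def chi2_density(d, w):
    return math.exp(-w / 2 + (d / 2 - 1) * math.log(w)
                    - (d / 2) * math.log(2) - math.lgamma(d / 2))

def chi2_tail(d, w, steps=100000):
    # P(W_d > w) by trapezoid integration far into the tail
    upper = w + 60.0 * math.sqrt(2 * d)
    h = (upper - w) / steps
    s = 0.5 * (chi2_density(d, w) + chi2_density(d, upper))
    for i in range(1, steps):
        s += chi2_density(d, w + i * h)
    return s * h

def md_bound(d, s):
    # right-hand side of (43), with the constants as written above
    phi = math.exp(-s * s / 2) / math.sqrt(2 * math.pi)
    return ((1 / s + math.sqrt(2 / d))
            * math.exp(math.sqrt(2) * s ** 3 / (3 * math.sqrt(d))) * phi)

d = 100
sigma = math.sqrt(2 * d)
for s in (0.5, 1.0, 1.5, 2.0):
    # the bound dominates the exact moderate-deviation tail probability
    assert chi2_tail(d, d + s * sigma) <= md_bound(d, s)
```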
This is an analogue of the Gaussian tail bound Φ̄(s) ≤ φ(s)/s, to which it reduces as d → ∞ whenever s = o(d^{1/2}). It may be compared with two existing bounds, each derived for more general purposes. First, the Tsirelson-Ibragimov-Sudakov inequality for Lipschitz functions of a standard Gaussian vector yields, for s > 0,
P(W_d − d > √2 sσ_d + s²) ≤ e^{−s²/2},
while the more refined inequality of Laurent and Massart (1998, Lemma 1) has as corollary, for positive s:
Substituting s = √(2 log d) in this latter inequality shows that it does not suffice for conclusion (12).

Proof For w > d, f_d(w) ≤ f_{d+2}(w), so it suffices to bound F̄_{d+2}(s₁), where we have set s₁ = d + sσ_d. Equalities (39) and (41) combine to give
D_w f_{d+2}(w) = −½ (1 − d/w) f_{d+2}(w).
Now use the idea behind the bound Φ̄(s) ≤ φ(s)/s: for w ≥ s₁, 1 − d/w ≥ 1 − d/s₁, and so
(45) F̄_{d+2}(s₁) ≤ 2(1 − d/s₁)^{−1} f_{d+2}(s₁).
Stirling's formula, Γ(d/2 + 1) ≥ √(2π) e^{−d/2} (d/2)^{(d+1)/2}, applied to f_{d+2}(s₁) then yields (43). □

The sup-norm of the boundary data tends to zero as n → ∞ for φ_n(x) = n^{−1} sin(2πnx), whereas the sup-norm of the solution u_n(x,y) = sinh(2πny) sin(2πnx)/(2πn²) tends to infinity for any y > 0. So the problem is called ill-posed in the Hadamard sense. Nevertheless this problem is the important geophysical problem of interpreting the gravitational or magnetic anomalies (see Lavrentiev (1967) and Tarchanov (1995)). The usual approach to ill-posed problems deals with the recovery of a solution based on "noisy" data. In order to guarantee consistent recovery some additional information about the function φ(x) is required. It is assumed as a rule that φ belongs to a compact set K in a suitable space. The performance of the optimal solution depends on K and on the definition of the noisy data. Usually (see Tichonov & Arsenin (1977), Engl & Groetsch (1987)) it is assumed that the observed data are
φ^ε(x) = φ(x) + n^ε(x),
The research was supported by NSF grants DMS-9600245 and DMS-9971608.
G. Golubev and R. Khasminskii
where n^ε(x) is an unknown function from a Hilbert space with the norm ‖n^ε‖ ≤ ε. On the other hand there exists another approach to ill-posed problems, which is based on the assumption that n^ε(x) is a random process (see Sudakov & Khalfin (1964), Mair & Ruymgaart (1996), Sullivan (1996)). In the present paper we assume that n^ε(x) is a white Gaussian noise with spectral density ε². Of course, n^ε(x) does not belong to a Hilbert space, and usually solutions of ill-posed problems obtained by deterministic and stochastic approaches are different. To illustrate, consider a simple ill-posed problem. Suppose we need to estimate the first derivative of a function f(x), x ∈ [0,1], based on the noisy data
z^ε(x) = f(x) + n^ε(x).
Assume also that f belongs to the Sobolev ball
W₂^β(P) = {f : ∫₀¹ f²(u) du + ∫₀¹ (f^{(β)}(u))² du ≤ P},
with β ≥ 1. It is not very difficult to show that as ε → 0 the minimax rates of convergence are given by the corresponding inf over estimators of f′ and sup over f ∈ W₂^β(P) and ‖n^ε‖ ≤ ε.
Minimizing the right-hand side in the above equation with respect to N we get the optimal bandwidth
N(ε) = (1/(4πL)) log(D(L − y)/(ε² y)) + c,
where the minimum is taken over all integers and |c| ≤ 0.5. Substituting N(ε) in (7) one arrives at the following formula for the risk of the projection estimator as ε → 0: …
We prefer to write D/ε² because it is a dimensionless expression, and it is easy to see that the above equation is uniform in D such that D/ε² → ∞. It is not difficult also to show that the above rate of convergence cannot be improved in the class of all estimators. But the goal of the present paper is more delicate. We try to find the asymptotic minimax risk up to a constant. We will see that in the general case the asymptotically minimax estimator is nonlinear and only in some special cases one can use a linear or a projection estimator to achieve the optimal constant.
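The Hadamard ill-posedness described above is easy to see numerically. The sketch below (mine, using the specific example φ_n(x) = n^{−1} sin(2πnx) from this paper) checks that the data shrink while the solutions blow up:

```python
import math

# boundary data phi_n(x) = sin(2 pi n x) / n have sup-norm 1/n -> 0, while
# the solutions u_n(x, y) = sinh(2 pi n y) sin(2 pi n x) / (2 pi n^2)
# have sup-norm sinh(2 pi n y) / (2 pi n^2) -> infinity for every fixed y > 0
def data_sup_norm(n):
    return 1.0 / n

def solution_sup_norm(n, y):
    return math.sinh(2 * math.pi * n * y) / (2 * math.pi * n ** 2)

y = 0.5
dat = [data_sup_norm(n) for n in range(1, 8)]
sol = [solution_sup_norm(n, y) for n in range(1, 8)]
assert all(a > b for a, b in zip(dat, dat[1:]))      # data -> 0
assert all(a < b for a, b in zip(sol, sol[1:]))      # solution -> infinity
```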
2 Main results
In this section we compute up to a constant the minimax risk
r_ε(y, L) = inf sup_φ R_φ(û, u) = inf sup_φ E_φ Σ_k b_k²(y) |φ̂_k − φ_k|²,
Cauchy Problem for the Laplace Equation
where the inf is taken over all estimators. In order to describe the asymptotic behavior (ε → 0) of this risk we consider an auxiliary statistical problem. For convenience denote vectors from R² by bold letters. Assume that one observes
(8) z_k = θ_k + σξ_k,
where the ξ_k are i.i.d. N(0, I) and I is the identity matrix. We also suppose that the unknown vectors θ_k belong to the set
(9) Θ = {θ : Σ_k ‖θ_k‖² e^{2Lk} ≤ 1},
and define the associated minimax risk as
(10) ρ(σ, y, L) = inf sup_{θ∈Θ} E_θ Σ_k ‖θ̂_k − θ_k‖².
Let θ* = {…, θ*_{k−1}, θ*_k, θ*_{k+1}, …} be the minimax estimator in (8)-(10) with the components …
We construct now a counterpart φ* of θ* in the model (4). Let
(11) W(ε) = argmin …,
where the minimum is taken over all integers. Denote for brevity … For |k| ≤ √(W(ε)) we set …
We continue this estimator over the set of all integers as follows: φ̂*_k as above for |k| < W(ε), and φ̂*_k = 0 otherwise. The following theorem determines relations between the estimators φ*, θ* and the minimax risk r_ε(y, L).
Theorem 2.1 …

Since Φ_ε ⊂ Φ, it follows that
(16) r_ε(y, L) ≥ inf sup E_φ Σ_{|k−W|≤√W} b_k²(y) |φ̂_k − φ_k|².
This concludes the proof. □

In proving Theorem 2.1 we have assumed that y is fixed. On the other hand we see from Theorem 3.2 that ρ(σ, y, L) = O(y^{−1}) for small y, thus indicating that the rate of convergence of the minimax risk (see Theorem 2.1) is not uniform with respect to y. To cover this gap we describe in the next theorem the asymptotics of the minimax risk and an asymptotically minimax estimator, when y is in a vicinity of 0.

Theorem 3.3 Let y_ε be such that …
Then, uniformly in y ∈ [0, y_ε] as ε → 0, …
and the projection estimator
u_W(x, y, z^ε) = Σ_{|k|≤W(ε)} b_k(y) z_k^ε e^{ikx}, W(ε) = (1/(2L)) log(1/ε²),
is asymptotically minimax.
Proof The upper bound for the minimax risk follows from the Taylor formula for b_k(y) when y ∈ [0, y_ε]:
b_k(y) = y + (2π…, where ζ_y ∈ [0, y_ε]. Therefore, we have (see (7))
r_ε(y, L) ≤ sup R_φ(u_W, u) ≤ …
For m ≥ 2, the Smirnov test loses its first basic property. In correspondence with the above, we define a solution of the two-sample problem to be a 'natural' stochastic process, based on the two samples, which is (α) asymptotically distribution free under the null hypothesis, and which is, intuitively speaking, (β) as sensitive as possible to all alternatives. Despite the fact that the two-sample problem has a long and very diverse history, starting with some famous papers in the thirties, the problem is essentially still open for samples in ℝ^m, m ≥ 2. In this paper we present an approach based on measure-valued martingales and we will show that the stochastic process obtained with this approach is a solution to the two-sample problem, i.e. it has both the properties (α) and (β), for any m ∈ ℕ.

AMS subject classifications: 62G10, 62G20, 62G30; secondary 60F05, 60G15, 60G48.
Keywords and phrases: Dirichlet (Voronoi) tessellation, distribution free process, empirical process, measure-valued martingale, non-parametric test, permutation test, two-sample problem, VC class, weak convergence, Wiener process.
1 Introduction
Suppose we are given two samples, that is, two independent sequences {X'_i}_{i=1}^{n₁} and {X''_i}_{i=1}^{n₂} of i.i.d. random variables taking values in m-dimensional Euclidean space ℝ^m, m ≥ 1. Denote with P₁ and P₂ the probability distributions of each of the X'_i and X''_i, and write P_{n₁} and P_n for the empirical distributions of the first sample and of the pooled sample {X'_j}_{j=1}^{n₁} ∪ {X''_j}_{j=1}^{n₂} respectively, i.e.
(1.1) P_{n₁}(B) = (1/n₁) Σ_{i=1}^{n₁} 1_B(X'_i), P_n(B) = (1/n) (Σ_{i=1}^{n₁} 1_B(X'_i) + Σ_{i=1}^{n₂} 1_B(X''_i)), n = n₁ + n₂,
Research partially supported by European Union HCM grant ERB CHRX-CT 940693. Research partially supported by the Netherlands Organization for Scientific Research (NWO) while the author was visiting the Eindhoven University of Technology, and partially by the International Science Foundation (ISF), Grant MXI200.
The Two-Sample Problem in ℝ^m
where B is a measurable set in ℝ^m and 1_B is its indicator function. Consider the difference
(1.2) v_n(B) = (n₁n₂/n)^{1/2} (P_{n₁}(B) − P_n(B)), B ∈ B,
and call the random measure v_n(·) the (classical) two-sample empirical process with the indexing class B. Throughout we avoid the double index (n₁, n₂); this can be done without any ambiguity by letting n₁ = n₁(n) and n₂ = n₂(n). We will always assume n₁, n₂ → ∞ as n → ∞. The indexing class B is important for functional weak convergence of v_n and will be specified in Sections 3-5. The problem of testing the null hypothesis H₀ : P₁ = P₂, called 'the two-sample problem', is one of the classical problems of statistics. The literature on the two-sample problem is enormous. Here we are able to mention only very few of the papers on the subject, namely those in direct relation to the aims of the present work. The specific feature of the two-sample problem is that the common distribution P (= P₁ = P₂) presumed under H₀ remains unspecified and can be any within some typically very large class V. Hence, it is important to have some supply of test statistics such that their null distributions, at least asymptotically as n → ∞, are independent of this common distribution P ∈ V. Such statistics are called asymptotically distribution free. The classical solution of the two-sample problem when the dimension m = 1 is associated with Smirnov (1939), where first the two-sample empirical process
(1.3) v_n(x) = (n₁n₂/n)^{1/2} (F_{n₁}(x) − F_n(x)), x ∈ ℝ¹,
was introduced, where F_{n₁} and F_n stand for the empirical distribution functions of the first and the pooled sample respectively, and the limiting distribution of its supremum was derived. This limiting distribution was shown to be free from P provided P ∈ V_c, the class of all distributions on ℝ¹ with a continuous distribution function. This classical statement was an early reflection of the now well-known fact that the process
(1.4) v_n ∘ F^{−1}(t), t ∈ [0,1],
converges in distribution, for all P ∈ V_c, to a standard Brownian bridge v (see, e.g., Shorack and Wellner (1986)). Then, for a large collection of functionals of v_n ∘ F^{−1}, asymptotically distribution free limiting laws follow. The case m ≥ 2 was studied in Bickel (1969). Though asymptotically distribution free processes or statistics were not obtained in that paper, the general approach was well-motivated. Namely, to obtain an asymptotically correct approximation for the distribution of statistics based on v_n, like, for example, the Smirnov statistic sup_{x∈ℝ^m} |v_n(x)|, he studied the conditional distribution of v_n given F_n. This conditioning, also adopted in Urinov (1992), and being also a part of the approach of the present paper (see Sections 3 and 4), is motivated by the fact that, under H₀, one can construct the two-sample situation as follows. Let {X_i}₁ⁿ be a sample of size n from a distribution P ∈ V. Let also {δ_i}₁ⁿ be n Bernoulli random variables independent of {X_i}₁ⁿ and sampled without replacement from an urn containing n₁ 'ones' and n₂ 'zeros'. Now the set of X_i's with δ_i = 1 is called the first sample and those with δ_i = 0 is called the second sample. Any permutation of {(X_i, δ_i)}₁ⁿ independent of {δ_i}₁ⁿ will not alter the distribution of {δ_i}₁ⁿ. Hence, for statistics φ({X_i}₁ⁿ, {δ_i}₁ⁿ) their conditional distribution given F_n is induced by a distribution free from P. Actually, this is the basic approach of all permutation tests and dates back at least as far as Fisher (1936) and Wald and Wolfowitz (1944). Well-known permutation tests for the multivariate two-sample (and multi-sample) problem were developed in the mid-60's (see, e.g., Chatterjee and Sen (1964) and Puri and Sen (1966, 1969)). It should be noted, however, that most of the permutation tests are based on statistics that are asymptotically linear in {δ_i}₁ⁿ, and hence asymptotically normal. To essentially nonlinear statistics, like the Smirnov statistic, this approach was first applied in Bickel (1969), to the best of our knowledge.
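The urn construction above is exactly what a permutation test implements. As a small sketch (mine, not from the paper), the exact conditional null distribution of any two-sample statistic given the pooled sample can be obtained by enumerating the equally likely label assignments:

```python
from itertools import combinations

def permutation_null(pooled, n1, statistic):
    # exact conditional null distribution of a two-sample statistic given the
    # pooled sample: under H0 every choice of which n1 points form the first
    # sample is equally likely (the urn scheme described above)
    n = len(pooled)
    values = []
    for ones in combinations(range(n), n1):
        first = [pooled[i] for i in ones]
        second = [pooled[i] for i in range(n) if i not in ones]
        values.append(statistic(first, second))
    return values                    # each value has probability 1 / C(n, n1)

def smirnov(first, second):
    # Smirnov statistic on the line: sup_x |F_{n1}(x) - F_{n2}(x)|
    pts = sorted(first + second)
    return max(abs(sum(u <= x for u in first) / len(first)
                   - sum(u <= x for u in second) / len(second)) for x in pts)

vals = permutation_null([0.1, 0.4, 0.5, 0.9], 2, smirnov)
assert len(vals) == 6                # C(4, 2) equally likely assignments
assert all(0.0 <= v <= 1.0 for v in vals)
```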
There are several other methods for obtaining statistically important versions or substitutes of the two-sample empirical process; see, e.g., Friedman and Rafsky (1979), Bickel and Breiman (1983), Kim and Foutz (1987), and Henze (1988) for interesting approaches. Though we just discussed the two-sample problem and its solution, the precise mathematical formulation of the problem has not been given yet. The requirement of asymptotic distribution freeness cannot be sufficient to formulate the problem, for it can be trivially satisfied. Another condition, on 'sensitivity' towards alternatives, must also be imposed.
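The definitions in (1.1)-(1.2) are easy to compute directly; the following is a toy illustration of mine, not code from the paper:

```python
import math

def empirical(sample, B):
    # empirical measure of a set B, given as a membership predicate
    return sum(1 for x in sample if B(x)) / len(sample)

def v_n(first, second, B):
    # classical two-sample empirical process (1.2):
    # v_n(B) = sqrt(n1 n2 / n) * (P_{n1}(B) - P_n(B))
    n1, n2 = len(first), len(second)
    n = n1 + n2
    return math.sqrt(n1 * n2 / n) * (empirical(first, B)
                                     - empirical(first + second, B))

# toy check in m = 2 with B the left half of the unit square
first = [(0.1, 0.2), (0.3, 0.9), (0.7, 0.4)]
second = [(0.6, 0.6), (0.2, 0.1)]
B = lambda x: x[0] <= 0.5
# P_{n1}(B) = 2/3, P_n(B) = 3/5, n1 n2 / n = 6/5
assert abs(v_n(first, second, B) - math.sqrt(6 / 5) * (2 / 3 - 3 / 5)) < 1e-12
```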
In this paper we propose two related formulations of the problem (Section 2); one of them imposes quite strong requirements. Then in Section 3 we construct a (signed-)measure-valued martingale M_n, which is a generalization of the process (1.5) of Urinov (1992), and its renormalized versions u_n and w_n. We prove limit theorems for the asymptotically distribution free modifications u_n and w_n as well as for M_n, both under the null hypothesis (Section 4) and under alternatives (Section 5), and show that under natural conditions u_n and w_n are solutions of the two-sample problem.
2 General notations; some preliminaries; formulation of the two-sample problem
As we remarked in the Introduction, in the classical two-sample problem in ℝ¹ it is required that under H₀ the common distribution P belongs to the class V_c of distributions having continuous distribution functions, and for this class of P's, the Smirnov process v_n ∘ F^{−1} and the Urinov process M_n ∘ F^{−1} are asymptotically distribution free. In ℝ^m, we also need some requirements under H₀. Let μ denote Lebesgue measure and let from now on V denote the class of all distributions P with the properties
(C1) P([0,1]^m) = 1;
(C2) f = dP/dμ > 0 a.e. on [0,1]^m.
Condition (C1) is not an essential restriction since it can be satisfied in several ways. For example, if Y₁,…,Y_m denote the coordinates of some absolutely continuous m-dimensional random vector Y and if G₁,…,G_m are some absolutely continuous distribution functions on ℝ such that the range of Y_k is contained in the support of G_k, k = 1,…,m, then the random vector X with coordinates X_k = G_k(Y_k), k = 1,…,m, has an absolutely continuous distribution on [0,1]^m. Another, perhaps better, possibility is to reduce the pooled sample to the sequence {R_i}₁ⁿ, where the coordinates of each R_i are the normalized coordinatewise ranks of the corresponding coordinates of the i-th observation. (Note that the thus obtained two-sample empirical process is equal to v_n ∘ (F_{1n}^{−1},…,F_{mn}^{−1}), where F_{jn}, j = 1,…,m, are the pooled marginal empirical distribution functions.) Though there is definitely no absolute continuity of the distribution of R_i, i = 1,…,n, we will indicate below how the subsequent program can go through for these ranks (see e.g. Lemma 3.5). Condition (C2) represents a certain restriction. Observe, however, that the processes u_n and w_n, defined below, have limiting distributions which depend on P only through its support.
Besides the classical two-sample empirical process v_n there can be many other random measures which are also functionals of P_{n₁} and P_n and could also be called two-sample empirical processes. We will obtain versions of
J.H.J. Einmahl and E.V. Khmaladze
such processes which will be asymptotically distribution free for all P ∈ V. It is also needed that such a process is sufficiently sensitive to possible alternatives. To formulate both requirements precisely we need to describe the class of alternatives. In fact, it will be the class of all compact contiguous alternatives to the two-sample null hypothesis. Here is the precise condition:
(C3) The distributions P₁ and P₂ of each of the X'_i and X''_i, respectively, depend on n and are, for each n, absolutely continuous w.r.t. some P ∈ V, and the densities dP_j/dP, j = 1,2, admit the asymptotic representation
(2.1) dP_j/dP = 1 + n^{−1/2} h_{jn}, j = 1,2,
with ∫ (h_{jn} − h)² dP → 0, j = 1,2, for some h with 0 < ‖h‖² := ∫ h² dP < ∞, while n₁/n → p₀ ∈ (0,1).
The distribution of the pooled sample {X'_i}^{n₁} ∪ {X''_i}^{n₂} under P is certainly the n-fold direct product Pⁿ = P × ⋯ × P. It is well-known (Oosterhoff and van Zwet (1979)) that its distribution under the alternative (2.1), which is the direct product P₁^{n₁} × P₂^{n₂}, is contiguous with respect to Pⁿ, and that under Pⁿ
(2.2) L_n = ln (d(P₁^{n₁} × P₂^{n₂})/dPⁿ) →_d N(−½‖h‖², ‖h‖²),
with N(μ, σ²) denoting a normal random variable with mean μ and variance σ². Hence, under P₁^{n₁} × P₂^{n₂},
(2.3) L_n →_d N(½‖h‖², ‖h‖²).
Note that, in (2.1), it could seem more natural to start with some functions h_{1n} and h_{2n} converging to h₁ and h₂, instead of both converging to h. However it can be shown that this general situation reduces to (2.1) as it stands, when we replace P by a strategically chosen new P, namely the one such that (P, P) is 'closest' to (P₁, P₂), where this closeness is measured in terms of the distance in variation between Pⁿ and P₁^{n₁} × P₂^{n₂},
(2.4) d(P₁^{n₁} × P₂^{n₂}, Pⁿ) = P₁^{n₁} × P₂^{n₂}(L_n > 0) − Pⁿ(L_n > 0),
a very proper distance in a statistical context. Indeed, it is clear that it equals max_α (β(α) − α), the maximal difference between power and level over all tests of Pⁿ against P₁^{n₁} × P₂^{n₂}. The sequence {u_n}_{n≥1} (see (3.15)) is a strong P-solution, and we consider also a sequence of differently normalized random measures {w_n}_{n≥1} (see (3.16)) and show that under natural assumptions it is a weak P-solution. In conclusion of this section we give a few remarks which may illuminate the possible nature of strong and weak solutions. The first remark is that for an appropriate indexing class B (see Theorem 5.3) the classical two-sample empirical process v_n possesses property (β), though not property (α). When m = 1, however, the processes v_n ∘ F^{−1} in (1.4) and M_n ∘ F^{−1}, with M_n as in (1.5), do satisfy (α) and (β), and hence are strong solutions to the two-sample problem. For any m ≥ 1, the process w_n below remains in one-to-one correspondence with v_n for each n (Lemma 3.1 and definition (3.16)) and therefore contains the same amount of 'information' as v_n for each finite n. However, as n → ∞, some 'information' (though not much) is asymptotically 'slipping away' (Theorem 5.3 and following comments). As the second remark we note one 'obvious' weak solution, which nevertheless is quite interesting for practical purposes: let ζ ~ N(0,1) be independent from v_n (say, generated by a computer programme) and consider ξ_n(B) = v_n(B) + P_n(B)ζ. Since v_n converges to a P-Brownian bridge it is immediate that ξ_n converges to a P-Brownian motion under H₀. Then it can be renormalized exactly in the same way as u_n below (put t = 1 in (3.15)) and will become asymptotically distribution free; however, because of the randomization involved, ξ_n will lose property (β), though it will retain property (γ). Curiously enough, in many practical situations the loss is not big (Dzaparidze and Nikulin (1979)).
Finally we remark, as shown in Schilling (1983), that the asymptotically distribution free process of Bickel and Breiman (1983), though very interesting from some other points of view, cannot detect (in a goodness-of-fit context) any of the 1/√n-alternatives. Whether the process of Kim and Foutz (1987), connected with the same initial idea of uniform spacings, can detect such alternatives remains formally unclear. However we believe that the phenomena discovered in Chibisov (1961) explain why it may not be likely. For the omnibus statistic of Friedman and Rafsky (1979) the recent result of Henze and Penrose (1999), Theorem 2, leaves little hope that it can detect any of the 1/√n-alternatives. So, to the best of our knowledge, the two-sample problem, as described in this section, is essentially still open when m ≥ 2.
3 Two-sample scanning martingales
The main object of this section, if not of the whole paper, is the set-indexed process M_n - see (3.2) below. Though its proper asymptotic analysis requires certain mathematical tools, nothing really is required for the basic idea behind it. Suppose we agreed on some order in which we 'visit' or 'inspect' the elements of the pooled sample {X_i}₁ⁿ, so that we first visit X_{(1)}, then X_{(2)} and so on. Suppose this order is independent from the indicators {δ_i}₁ⁿ. (This order is formalized by the scanning family A below.) Then the classical empirical process (1.2) can be written as
(3.1) v_n(B) = (n₂/(n₁n))^{1/2} Σ_{i=1}^n 1_B(X_{(i)}) (δ_i − n₁/n),
where n₁/n is, obviously, the unconditional probability of drawing 'one' on the i-th draw (see (3.4)), while the process M_n is defined as
(3.2) M_n(B) = (n/(n₁n₂))^{1/2} Σ_{i=1}^n 1_B(X_{(i)}) (δ_i − p_i),
where p_i is the conditional probability of drawing 'one' given how many 'ones' were found before the i-th draw: p_i = (number of remaining 'ones')/(n − i + 1) - see (3.5). This is the only difference between M_n and v_n. Observe in particular that the processes M_n, and u_n and w_n in the sequel, are indexed by the same multivariate B's as v_n, and hence that the, in general, univariate scanning family A does not lead to 'univariate' processes. In several aspects the behaviour of M_n seems simpler and more convenient than that of v_n. At least, we know now how to standardize M_n. At the same time, like v_n, M_n preserves 'all information' that is contained in the samples themselves (Lemma 3.1 and Theorem 5.2). Our final processes u_n and w_n are simply weighted versions of M_n. Now, let A = {A_t, t ∈ [0,1]} be a family of closed subsets of [0,1]^m with the following properties:
1) A₀ = ∅, A₁ = [0,1]^m;
2) A_t ⊂ A_{t'} if t < t';
3) μ(A_t) is continuous and strictly increasing in t.
t(Xi) = min{t : X{ G At},
442
J.H.J. Einmahl and E.V. Khmaladze
denote with {ti}™ the order statistics based on {t(-Xt)}Γ and let {X(ί)}ι be the correspondingly reordered X^s. Put also to = 0 and t n +i = 1 when needed. Later it will be useful to have in mind that absolute continuity of P (condition (C2)) implies that all the U are different a.s. Under HQ the sequence of Bernoulli random variables δ{ — TL{X(i) £ first sample} , is independent of the {X^}™ and the distribution of the {δi}™ is that of sampling without replacement from an urn containing n\ 'ones' and n2 'zeros' (see Section 1). Now we define the filtration based on the scanning family A. Let
te(0,l], -3 < If P is continuous, then the conditional distribution of Pnι given T§ is free from P, but conditioning on T^ also produces a simple distribution for Pni free from P : (3.4)
P{X(i)
e first sample |JΓ0} = lP{Pni(X{i))
= ±\F0} = ^
and, for j < i — 1, (3.5)
P{X{i)
e first sample
where Ac = [0, l] m \A; note that nPn(A^.) = n — j a.s. We will write
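The conditional drawing probabilities p_i behind (3.2) and (3.5) can be sketched in code. This is my illustration (not the paper's), and it verifies the martingale centering exactly by enumerating the urn scheme:

```python
from itertools import combinations

def centered_sum(deltas, n1):
    # sum_i (delta_i - p_i), with p_i = remaining 'ones' / (n - i + 1), the
    # conditional drawing probabilities behind (3.2) and (3.5)
    n, remaining, total = len(deltas), n1, 0.0
    for i, d in enumerate(deltas):
        p = remaining / (n - i)        # n - i items are left before this draw
        total += d - p
        remaining -= d
    return total

# exact check: averaging over all C(n, n1) equally likely label sequences
# from the urn, the centered sum has mean zero
n, n1 = 5, 2
sums = [centered_sum([1 if i in ones else 0 for i in range(n)], n1)
        for ones in combinations(range(n), n1)]
assert abs(sum(sums) / len(sums)) < 1e-12
```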
Consider now v_n along with the filtration {F_t, t ∈ [0,1]} in a way similar to the construction introduced in Khmaladze (1993), i.e. for each B consider v_n(B ∩ A_t), or, equivalently, consider P_{n₁}(B ∩ A_t), as P_n is F₀-measurable. By doing this we obtain a new object in the two-sample theory, which is for each B a semimartingale with respect to {F_t, t ∈ [0,1]} and for each t a random measure on B₀. Hence we gain the possibility to apply to v_n and P_{n₁} the well-developed theory of martingales and of marked point processes; see, e.g., Bremaud (1981) and Jacod and Shiryayev (1987). More specifically, for given B consider the (normalized) martingale part of the submartingale {P_{n₁}(B ∩ A_t), F_t, t ∈ [0,1]}.

Lemma 3.4 Suppose n₁/n → p₀ ∈ (0,1). (i) For a.a. sequences {P_n}_{n≥1}, we have conditionally on F₀, for all B ∈ B₀,
⟨u_n(B,·)⟩(t) → μ(B ∩ A_t) a.s. (n → ∞).
(ii) Suppose the class B is a VC class. Then
sup_{C∈α'(B)} |μ_n(C) − μ(C)| → 0 a.s. (n → ∞).
It could be noted that the initial observation behind the proof of the lemma is that, according to the Kolmogorov strong law of large numbers (SLLN), for each B ∈ B₀,
(1/n) Σ_{i=1}^n 1_B(X_i)/f(X_i) → μ(B) a.s., as n → ∞.
Before we prove this lemma let us introduce another normalization of M_n. Consider the Dirichlet (or Voronoi) tessellation of [0,1]^m associated with the sequence {X_i}₁ⁿ: for each X_i let
Δ(X_i) = {x ∈ [0,1]^m : ‖x − X_i‖ = min_{1≤j≤n} ‖x − X_j‖},
and let for each C
C_n = ∪_{X_i∈C} Δ(X_i), μ_n(C) = μ(C_n) = Σ_{X_i∈C} μ(Δ(X_i)).
Now introduce
(3.16) w_n(C) = (n/(n₁n₂))^{1/2} Σ_{i: X_{(i)}∈C} (nμ(Δ(X_{(i)})))^{1/2} (δ_i − p_i).
Then again, since the sequence {μ(Δ(X_{(i)}))}₁ⁿ is F₀-measurable, w_n(B ∩ A_t) is for each t a random measure in B and for each B a martingale in t, and, in the obvious notation,
⟨w_n(B,·)⟩(t) is the corresponding sum of conditional variances. The expression in (3.16) also can be viewed as another empirical analogue of (3.14):
w_n(B) = ∫_B f_n^{−1/2}(x) M_n(dx),
since the step-function f_n(x) = (nμ(Δ(X_i)))^{−1} for all inner points x ∈ Δ(X_i) (and let it be 1 on the boundaries Δ(X_i) ∩ Δ(X_j)) can be considered as a density estimator, though an inconsistent one. Its analogue on ℝ is essentially the 1-nearest neighbour estimator. Denote
ρ(X_i) = max_{x∈Δ(X_i)} ‖x − X_i‖.
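In one dimension the Voronoi cells are simply midpoint intervals, so the cell measures μ(Δ(X_i)) and the estimator f_n are easy to compute. The following sketch is my illustration of that special case, not code from the paper:

```python
def voronoi_lengths(points):
    # Lebesgue measures mu(Delta(X_i)) of the Dirichlet (Voronoi) cells of
    # distinct points in [0, 1]; in one dimension the cells are the intervals
    # between consecutive midpoints
    order = sorted(range(len(points)), key=lambda i: points[i])
    xs = [points[i] for i in order]
    cuts = [0.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [1.0]
    out = [0.0] * len(points)
    for k, i in enumerate(order):
        out[i] = cuts[k + 1] - cuts[k]
    return out

pts = [0.8, 0.1, 0.4]
mu = voronoi_lengths(pts)
assert abs(sum(mu) - 1.0) < 1e-12            # the cells tile [0, 1]
# the (inconsistent) density estimator f_n(x) = (n mu(Delta(X_i)))^{-1}
f_n = [1.0 / (len(pts) * m) for m in mu]
assert min(f_n) > 0
```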
We shall consider {X_i}₁ⁿ that do not necessarily form a random sample, in order to justify to some extent the possibility of using the normalized ranks {R_i}₁ⁿ as mentioned in Section 2. For these more general X_i the δ_i which determine first and second sample are as in Section 1.

Lemma 3.5 Suppose that the X_i, 1 ≤ i ≤ n, are random vectors in [0,1]^m with X_i ≠ X_j a.s. for i ≠ j, such that for their empirical distribution P_n we have P_n →_w P a.s. (n → ∞), for some P ∈ V.
(i) Then ρ*_n = max_{1≤i≤n} ρ(X_i) → 0 a.s. (n → ∞).
(ii) If n₁/n → p₀ ∈ (0,1), then for a.a. sequences {P_n}_{n≥1}, conditionally on F₀,
sup_{C∈C} |⟨w_n(C,·)⟩(1) − μ(C)| → 0 a.s. (n → ∞).
(iii) Also, under (3.17),
sup_{C∈C} |μ_n(C) − μ(C)| → 0 a.s. (n → ∞).
Proof of Lemma 3.4 Consider |⟨u_n(B,·)⟩(t) − μ(B ∩ A_t)|. As n → ∞, by the SLLN,
(1/n) Σ_{i: X_{(i)}∈B∩A_t} 1/f(X_{(i)}) → μ(B ∩ A_t) a.s.
So it suffices to consider the difference between ⟨u_n(B,·)⟩(t) and this average.
First we show that the last term above converges to 0 a.s. We will split this sum in the sum involving the X_{(i)}'s for which X_{(i)} ∈ A_{1−ε} and the sum involving the X_{(i)}'s for which X_{(i)} ∉ A_{1−ε}. Since P ∈ V, we have P(A_{1−ε}) < 1, and hence it follows from a kind of conditional Glivenko-Cantelli theorem that the maximal deviation converges to 0 a.s. (n → ∞).
This is, however, a routine matter: since B is a VC class and ∫ f^{−1} dP = 1 < ∞, the class of functions {f^{−1}1_C : C ∈ α'(B)} is a Glivenko-Cantelli class. □
Proof of Lemma 3.5 (i) For k ∈ ℕ, let H_k be the finite set of hypercubes of the form Π_{j=1}^m [r_j/k, (r_j+1)/k], r_j ∈ {0,1,…,k−1}. Since P_n →_w P a.s. and P ∈ V, we see that for all k ∈ ℕ, sup_{H∈H_k} |P_n(H) − P(H)| → 0 a.s. But since inf_{H∈H_k} P(H) > 0, this easily implies that ρ*_n → 0 a.s.

We now prove part (iii). Let ε > 0. Since ρ*_n → 0 a.s. and, for all C ∈ C, C_n ⊂ C^ε and (C_n)^c ⊂ (C^c)^ε, that is, C_ε ⊂ C_n ⊂ C^ε as soon as ρ*_n < ε, we have
limsup_{n→∞} sup_{C∈C} |μ_n(C) − μ(C)| ≤ sup_{C∈C} μ(C^ε \ C_ε) a.s.
Now (3.17) proves part (iii). Finally we consider part (ii). Because of part (iii), it is sufficient to show that
sup_{C∈C} |⟨w_n(C,·)⟩(1) − μ_n(C)| → 0 a.s.
This last expression, however, can be treated in much the same way as the corresponding one in the proof of Lemma 3.4. □
The Two-Sample Problem in ℝ^m
Weak convergence under H₀: property (α). Let ℬ ⊂ ℬ₀ be the indexing class for the random measures M_n, u_n and w_n defined by (3.2), (3.15) and (3.16) respectively, and consider the space ℓ_∞(ℬ) as the space of trajectories of these measures. To prove convergence in distribution in ℓ_∞(ℬ) one needs convergence of the finite-dimensional distributions and the asymptotic equicontinuity property, studied in the empirical process context, e.g., in Pollard (1990, Theorem 10.2) and Sheehy and Wellner (1992). This property follows from Lemma 4.2 below (in combination with Lemmas 3.3-3.5), which in turn follows from appropriate exponential inequalities. The first lemma of this section provides these inequalities. Consider the process (4.1) with ℱ₀-measurable coefficients γ_i, i = 1, …, n. The process ξ is a martingale, and we have:

Lemma 4.1 For z > 0,

(4.2) P{ |ξ(n)| > z | ℱ₀ } ≤ 2 exp(−2z² / Σ_{i=1}^n γ_i²).
Corollary 4.1 For z > 0,

P{ |M_n(B) − M_n(C)| > z | ℱ₀ } ≤ 2 exp(−2z² / P_n(BΔC)),
P{ |u_n(B) − u_n(C)| > z | ℱ₀ } ≤ 2 exp(−2z² / μ_n(BΔC)),
P{ |w_n(B) − w_n(C)| > z | ℱ₀ } ≤ 2 exp(−2z² / μ̄_n(BΔC)).

Proof Take in inequality (4.2) γ_i equal to 1_B(X_{(i)}) − 1_C(X_{(i)}) multiplied by n^{−1/2}, (nμ(Δ(X_{(i)})))^{−1/2} and (μ̄(Δ(X_{(i)})))^{1/2}, respectively. ∎
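The Hoeffding-type bound (4.2) behind Corollary 4.1 can be sanity-checked numerically in its simplest special case: unit weights γ_i and centered Bernoulli increments, for which the bound reads P{|ξ(n)| > z} ≤ 2 exp(−2z²/n). A minimal Monte Carlo sketch (Python; the values of n, p, z and the trial count are illustrative choices, not from the text):

```python
import math
import random

def hoeffding_tail_check(n=100, p=0.5, z=10.0, trials=20000, seed=1):
    """Empirical tail P{|sum_i (delta_i - p)| > z} versus the Hoeffding
    bound 2*exp(-2*z**2/n), the case of (4.2) with all weights equal to one."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        s = sum((1 if rng.random() < p else 0) - p for _ in range(n))
        if abs(s) > z:
            exceed += 1
    empirical = exceed / trials
    bound = 2.0 * math.exp(-2.0 * z * z / n)
    return empirical, bound

emp, bnd = hoeffding_tail_check()
# The bound dominates the empirical tail (here roughly 0.05 versus 0.27).
assert 0.0 < emp < bnd
```

With z at two standard deviations of the sum, the empirical tail sits well below the exponential bound, as the corollary requires.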
J.H.J. Einmahl and E.V. Khmaladze
Proof of Lemma 4.1 The proof follows a well-known pattern. We give it here, though briefly, because the references we know of present the exponential inequality for a martingale in Bennett's form (see, e.g., Freedman (1975) or Shorack and Wellner (1986, pp. 899-900)) rather than in Hoeffding's form (4.2). Observe that

E( e^{λγ_i(δ_i − p_i)} | ℱ_{i−1} ) ≤ e^{λ²γ_i²/8},

which can be found by expanding the logarithm of the left-hand side, as a function of λγ_i, up to the second term and observing that the second derivative is bounded by 1/4. Therefore E( e^{λξ(n)} | ℱ₀ ) ≤ e^{λ²Σγ_i²/8}, which proves (i). Now minimization of the resulting bound in λ leads to (4.2). ∎

The next lemma is the main step towards the asymptotic equicontinuity property of our random measures. For the rest of this section we assume our indexing class to be a Vapnik-Chervonenkis (VC) class (see, e.g., Dudley (1978)). Before we proceed formally, let us recall under which distributions this asymptotic equicontinuity will be obtained. All three random measures considered are functions of {X_{(i)}}_1^n (or of P_n) and of {δ_i}_1^n, which is independent of P_n and has the distribution described in Section 3 (see (3.4) and (3.5)). We consider the distributions of M_n, u_n and w_n induced in ℓ_∞(ℬ) by the distribution of {δ_i}_1^n with P_n fixed, and call them conditional distributions given P_n, or given ℱ₀. Observe that with this construction there is no need to worry about possible non-measurability of P_n as a random element of ℓ_∞(ℬ), nor to require 'enough measurability' of ℬ. So, in most cases ℬ will be required to be just a VC class. There are two further reasons, specific to the two-sample problem, to use indexing classes no wider than a VC class. The first is that, though we have to study weak convergence under a fairly simple sequence of distributions, there are several different distances, induced by different distributions on [0,1]^m, which occur in the inequalities of the above corollary. We would need assumptions on the covering numbers N(·, Q) of ℬ in each of these distances, that is, for Q being P_n, μ_n or μ̄_n, n ∈ ℕ, which would be inconvenient from the point of view of applications. However, for VC classes we have a uniform-in-Q bound for ln N(·, Q) - see Dudley (1978), Lemma 7.13, or van der Vaart and Wellner (1996), p. 86, or the proof of Lemma 4.2
below, and this makes any VC class an appropriate indexing class for each of M_n, u_n and w_n. The second reason is this: though M_n is not the process of eventual interest for the two-sample problem, since it is not asymptotically distribution free, we want it to have a limit in distribution for each P ∈ 𝒫. Therefore the indexing class ℬ should be P-pregaussian for each P ∈ 𝒫 (see, e.g., Sheehy and Wellner (1992)). However, if the class is pregaussian for all P, then it must be a VC class (Dudley, 1984, Theorem 11.4.1). Though our 𝒫 is narrower than the class of all distributions on [0,1]^m, it still seems wide enough to motivate the choice of ℬ as a VC class.

Let us now formulate the next lemma. For a finite (non-negative) measure Q on ℬ₀ and a subclass ℬ ⊂ ℬ₀, let ℬ'(ε, Q) = {(A, B) ∈ ℬ × ℬ : Q(AΔB) ≤ ε}. Call {M_n}_{n≥1} conditionally asymptotically equicontinuous, uniformly over the discrete distributions (CAECud), if for any δ > 0

(4.3) lim_{ε↓0} limsup_{n→∞} sup_{P_n} P{ sup_{(A,B)∈ℬ'(ε,P_n)} |M_n(A) − M_n(B)| > δ | ℱ₀ } = 0,

where P_n runs over all discrete distributions on [0,1]^m concentrated on at most n points. Call {u_n}_{n≥1} and {w_n}_{n≥1} CAECud if for these sequences a property similar to (4.3) holds with ℬ'(ε, P_n) replaced by ℬ'(ε, μ_n) and ℬ'(ε, μ̄_n), respectively. See Sheehy and Wellner (1992).

Lemma 4.2 Let ℬ ⊂ ℬ₀ be a VC class. Then under the null hypothesis P⁽¹⁾ = P⁽²⁾, {M_n}_{n≥1}, {u_n}_{n≥1} and {w_n}_{n≥1} are CAECud.
Using Corollary 4.1, the first term on the right in (4.4) can be bounded from above by an expression of the form 2 exp(− …), where for the last inequality n is taken large enough. Now, taking ε₀ small enough, this last expression is not larger than 2 exp(−2a ln(1/ε₀) … /δ²). Now consider the probability in the second term on the right in (4.4). We have, for small enough ε₀ and large enough n,

P{ sup_{B∈ℬ} |M_n(B₀) − M_n(B)| > … | ℱ₀ }
≤ Σ_{j=0}^{∞} P{ sup_{B∈ℬ} |M_n(B_j) − M_n(B_{j+1})| > 2(… ln(1/ε_j) …)^{1/2} | ℱ₀ }
≤ 2 Σ_{j=0}^{∞} exp(−a ln(1/ε_{j+1})) = 2 Σ_{j=0}^{∞} ε_{j+1}^a .

Hence, though it is possible to show the convergence in this sense to the appropriate limits if V_n is replaced by its leading first term (see the proof of Theorem 5.1 below), the eventual statement of convergence is true under the unconditional distributions P₀^n and P₁^{n₁} × P₂^{n₂} only. Write H(t) = ∫_{A_t} h dP and let
where t(x) is defined as in (3.3). Remark that the linear operator mapping h into g is norm preserving (though not one-to-one, since it annihilates constant functions):

(5.2) ∫ g² dP = ∫ (h − ∫ h dP)² dP = ∫ h² dP (= ‖h‖²).

Now denote by Z a N(0, ‖h‖²) random variable (Z will be the limit of L_n − Z_n; cf. also (2.2)) such that (W_P, Z) is jointly Gaussian, that is, for any finite collection B₁, …, B_k ∈ ℬ the vector (W_P(B₁), …, W_P(B_k), Z) is Gaussian, and let Cov(W_P(B), Z) = ∫_B g dP. Similarly, let (W, Z) be jointly Gaussian with Cov(W(B), Z) = ∫_B g f^{1/2} dμ. Let '→^v' and '→^{v̄}' denote convergence in distribution under P₀^n and P₁^{n₁} × P₂^{n₂}, respectively.

Theorem 5.1 If the class ℬ ⊂ ℬ₀ is such that M_n →^v W_P and/or u_n →^v W (n → ∞) in ℓ_∞(ℬ), then
(5.3) M_n →^{v̄} W_P + ∫_· g dP (n → ∞)

and/or

(5.4) u_n →^{v̄} W + ∫_· g f^{1/2} dμ (n → ∞).

The proof of this theorem is deferred to the second half of this section, but to explain the nature of the function g already here, let us remark that the leading term of L_n and V_n has the explicit representation (5.5) (see (4.1)), where g_n(x) = h(x) − (∫_{A_{t(x)}^c} h dP_n) / P_n(A_{t(x)}^c) has the same form as the function g, only with P replaced by the empirical distribution P_n. The equality (5.5) can be derived from (3.8) or verified directly.

Now let us consider whether it follows from this theorem that u_n has property (β). Let Q_{u_n} and Q̄_{u_n} denote the distributions of u_n under P₀^n and P₁^{n₁} × P₂^{n₂} respectively, and let Q and Q̄ denote the distributions of W and W + ∫_· g f^{1/2} dμ respectively.

Theorem 5.2 If the indexing class ℬ generates ℬ₀, then for each sequence of alternatives satisfying (C3),

d(Q_{u_n}, Q̄_{u_n}) → d(Q, Q̄) = 2Φ(‖h‖/2) − 1 (n → ∞).

Hence, Theorems 4.1 and 5.2 show that if ℬ is a VC class generating ℬ₀ and (C4) holds, then u_n is a strong P-solution of the two-sample problem. Remark that the process M_n also possesses property (β); it only lacks property (α). Let us now consider w_n. To find out the limiting covariance between w_n(B) and L_n, we need to study the limit of the expression
where the multipliers p_i(1 − p_i) are not essential from the point of view of convergence. On the unit interval, i.e. m = 1, it can be proved that

(5.6) n^{−1/2} Σ_{i=1}^n 1_B(X_i) g(X_i) √(nμ(Δ(X_i))) →^P k ∫_B g f^{−1/2} dP = k ∫_B g f^{1/2} dμ,

with k = Γ(5/2)/(Γ(2)√2), using, e.g., the general method presented in Borovikov (1987). It follows, heuristically speaking, from the fact that the nμ(Δ(X_i)) behave 'almost' as independent random variables, each with a Gamma(2) distribution with scale parameter 1/(2f(X_i)), and so k stands for the moment of order 1/2 of a Gamma(2) distribution with scale parameter 1/2. However, in the unit cube [0,1]^m we will need to keep (5.6), for some k < 1, as an assumption. Let (W′, Z′) be again jointly Gaussian with the same marginal distributions as those of W and Z, but with covariance Cov(W′(B), Z′) = k ∫_B g f^{1/2} dμ.
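The constant k can be checked numerically on the reading that the cells nμ(Δ(X_i)) behave like Gamma(2) variables normalized to mean one (scale 1/2), so that the half-moment k = Γ(5/2)/(Γ(2)√2) ≈ 0.94 is indeed below 1. A small sketch (Python; the mean-one Gamma(2) parametrization is an assumption made for illustration):

```python
import math
import random

# Half-moment of a Gamma(shape=2, scale=1/2) variable; the closed form
# Gamma(2.5)/Gamma(2) * sqrt(1/2) is about 0.94 and, importantly, < 1.
analytic = math.gamma(2.5) / math.gamma(2.0) * math.sqrt(0.5)

rng = random.Random(7)
n = 200_000
# Monte Carlo estimate of E[G**0.5] for G ~ Gamma(2, scale 1/2).
mc = sum(math.sqrt(rng.gammavariate(2.0, 0.5)) for _ in range(n)) / n

assert analytic < 1.0
assert abs(mc - analytic) < 0.01
```

Note that the ratio E[G^{1/2}]/(E G)^{1/2} does not depend on the scale, so the conclusion k < 1 is insensitive to that parametrization choice.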
Theorem 5.3 If the class ℬ ⊂ ℬ₀ is such that w_n →^v W in ℓ_∞(ℬ) and if (5.6) is true, then

(5.7) w_n →^{v̄} W + k ∫_· g f^{1/2} dμ.

Let Q^{(k)} be the distribution of the right-hand side of (5.7). If ℬ generates ℬ₀, then

(5.8) d(Q_{w_n}, Q̄_{w_n}) → d(Q, Q^{(k)}) = 2Φ(k‖h‖/2) − 1.

From (5.8) it follows that under the conditions of Theorem 5.3 the process w_n certainly possesses property (γ), although not property (β), because Φ(k‖h‖/2) < Φ(‖h‖/2). So w_n is a weak P-solution of the two-sample problem. Finally, we present the postponed proofs of Theorems 5.1 and 5.2. The proof of Theorem 5.3 is much the same and will therefore be omitted.

Proof of Theorem 5.1 Since the sequence of alternative distributions {P₁^{n₁} × P₂^{n₂}}_{n≥1} is contiguous to the sequence {P₀^n}_{n≥1}, the CAECud property of M_n and/or u_n is true under the alternative distributions as well. Hence (5.3) and (5.4) will follow if we show convergence of the finite-dimensional distributions of M_n and/or u_n to the proper limits. Let us focus on u_n; the proof for M_n is similar and simpler. The convergence
will follow from the Cramér-Wold device, the convergence

(5.9) Σ_{j=1}^k a_j u_n(B_j) + β Z_n →^v Σ_{j=1}^k a_j W(B_j) + β Z (n → ∞),

where {a_j}_{j=1}^k and β are any constants and Z_n = …, and from Le Cam's Third Lemma. To see that (5.9) is true, observe that, given P_n, the left-hand side is the value of a martingale in t (cf. (4.1)) at the last point t = n. Hence, if we verify that

(5.10) ⟨ξ⟩(n) → ∫_{[0,1]^m} ( Σ_{j=1}^k a_j 1_{B_j} f^{1/2} + β g )² dP a.s. (n → ∞)

for a.a. {P_n}_{n≥1}, then actually '→ a.s.' will be proved, and hence '→^v' as well. However, (5.10) will follow from the SLLN if we show that the functions f_n and g_n can be replaced by f and g respectively, and use the truncation applied in the proof of Lemma 3.4. We have

sup_t | ∫_{A_t^c} h dP_n − ∫_{A_t^c} h dP | → 0 and sup_t | P_n(A_t^c) − P(A_t^c) | → 0,
We consider the limiting distributional behaviour of families {ζ_T(I) : T > 0} of such functions under (array forms of) standard strong mixing conditions. One objective of the present paper is to introduce a potentially much weaker and more readily verifiable form of strong mixing under which the limiting distributional results are shown to apply. These lead to a characterization of the possible limits for such ζ_T(I) as those for independent array sums, i.e. the classical infinitely divisible types. The conditions and results obtained for one interval are then extended to apply to the joint distributions of {ζ_T(I_j) : 1 ≤ j ≤ p} for (disjoint) intervals I₁, I₂, …, I_p, asymptotic independence of the components being shown under the extended conditions. Similar results are shown under even slightly weaker conditions for positive, additive families. Under countable additivity this leads in particular to distributional convergence of random measures under these mixing conditions, to infinitely divisible random measure limits having independent increments.

AMS subject classifications: 60F05, 60E07, 60B15.
Keywords and phrases: central limits, dependence, array sums, exceedance measures.
1 Introduction
By a random additive function (r.a.f.) we mean a random function ζ(I), defined for subintervals I = (a, b] of the unit interval and additive in the sense that ζ(a, b] + ζ(b, c] = ζ(a, c] when 0 ≤ a < b < c ≤ 1. Let {ζ_T : T > 0} be a family of r.a.f.'s, defined on such intervals and additive in the above sense, i.e. satisfying
ζ_T(I ∪ J) = ζ_T(I) + ζ_T(J) for each T > 0, whenever I, J are disjoint intervals whose union I ∪ J is an interval (i.e. I and J abut). The domain of definition of r.a.f.'s may of course be extended by additivity to include finite unions of intervals, and the notation will reflect this where convenient. The family of r.a.f.'s {ζ_T : T > 0} is assumed to satisfy a mixing condition Δ, defined as follows. Write, for 0 < r, l < 1,

Δ(r, l) = sup | E exp{it(ζ_T(I₁) + ζ_T(I₂))} − E exp{itζ_T(I₁)} E exp{itζ_T(I₂)} |,

where the supremum is taken over pairs of disjoint intervals I₁ = (a₁, b₁], I₂ = (a₂, b₂] satisfying 0 ≤ a₁ < b₁ ≤ a₂ < b₂ ≤ 1, a₂ − b₁ ≥ l and b₂ − a₂ ≤ r. Then {ζ_T} is said to be Δ-mixing if for each real t, Δ(r_T, l_T) → 0 for some r_T = o(1) and l_T = o(r_T), as T → ∞. Note that Δ(r_T, l_T) depends on T and also on t. This mixing condition has the same type of array form under which basic limiting theory for random additive functions is developed in Leadbetter and Rootzen (1993). However, it substantially weakens that in Leadbetter and Rootzen (1993) by considering only very special types of random variables, exp{itζ_T(I)} for intervals I, instead of all random variables measurable with respect to the σ-field B_I = σ{ζ_T(u, v] : u, v ∈ I}, or some substantial subclass thereof. The condition Δ allows consideration of the limiting distribution of ζ_T(I) for a single interval I. It will be extended in Section 5 to conditions Δ_p (Δ₁ = Δ) used in determining limiting joint distributions of ζ_T(I₁), …, ζ_T(I_p) for p (disjoint) intervals I₁, I₂, …, I_p. The following negligibility condition will be further assumed as needed (using m for Lebesgue measure):

sup{ P{|ζ_T(I)| > ε} : m(I) ≤ l_T } → 0 as T → ∞, for each ε > 0,
CLT under Weak Dependence

which is readily shown to be equivalent to the condition

(1) γ_T = sup{ 1 − E exp(−|ζ_T(I)|) : m(I) ≤ l_T } → 0, as T → ∞.

3 Asymptotic independence
In this section it will be shown, under Δ-mixing, that if an interval I is written as the union of appropriate disjoint abutting subintervals I_j, the characteristic function of ζ_T(I) is approximated by the product of those for the subintervals I_j. This substantially generalizes Lemma 2.1 in Leadbetter and Rootzen (1993). A simplified version of this lemma will also be shown for the positive and stationary case (e.g. where ζ_T is a random measure). These are key basic results leading to a classical central limit problem for ζ_T(I) for any fixed interval in (0,1]. Extended conditions and results for joint distributions of ζ_T(I) for more than one interval I are considered in Section 5. For integers k_T → ∞ as T → ∞, a k_T-partition of the interval I will mean a partition of I into k_T disjoint subintervals I_j (= I_{T,j}) with (for convenience) m(I_j) ≤ r_T, j = 1, …, k_T.

Lemma 3.1 Let {ζ_T, T > 0} be a Δ-mixing family of r.a.f.'s for some constants {r_T}, {l_T}, and let an interval I = (a, b] (which may depend on T) have a k_T-partition {I_j}, where

(2) k_T(Δ(r_T, l_T) + γ_T) → 0, as T → ∞.

Then, uniformly in |t| ≤ M for given M < ∞,

(3) E exp{itζ_T(I)} − Π_{j=1}^{k_T} E exp{itζ_T(I_j)} → 0, as T → ∞.
Proof Take I_j = (a_{j−1}, a_j], 1 ≤ j ≤ k_T, for a = a₀ < a₁ < … < a_{k_T} = b, without loss of generality. Write I_j* = (a_{j−1} + l_T, a_j] and I_j′ = I_j \ I_j*, j = 1, …, k_T, and for simplicity suppress the subscript T in l_T, k_T, r_T. Now, clearly, since ζ_T(I) = ζ_T(∪_{j=1}^{k−1} I_j) + ζ_T(I_k′) + ζ_T(I_k*), it follows from Δ(r, l), applied to the two intervals ∪_{j=1}^{k−1} I_j and I_k*, that

| E exp{it(ζ_T(∪_{j=1}^{k−1} I_j) + ζ_T(I_k*))} − E exp{itζ_T(∪_{j=1}^{k−1} I_j)} E exp{itζ_T(I_k*)} | ≤ Δ(r, l).
M.R. Leadbetter, H. Rootzen and H. Choi
Since | E exp{itζ_T(I_k)} − E exp{itζ_T(I_k*)} | ≤ E|1 − exp{itζ_T(I_k′)}|, we obtain from this and the above inequalities that

| E exp{itζ_T(I)} − E exp{itζ_T(∪_{j=1}^{k−1} I_j)} E exp{itζ_T(I_k)} | ≤ Δ(r, l) + 2E|1 − exp{itζ_T(I_k′)}|.

Applying this repeatedly gives

| E exp{itζ_T(I)} − Π_{j=1}^k E exp{itζ_T(I_j)} | ≤ kΔ(r, l) + 2 Σ_{j=1}^k E|1 − exp{itζ_T(I_j′)}|.
Now the first term on the right tends to zero by (2), and since |(1 − e^{iθ})/(1 − e^{−|θ|})| ≤ K for some K > 0 and all real θ, the second term does not exceed 2K Σ_{j=1}^k E|1 − exp{−|tζ_T(I_j′)|}|. This is clearly dominated by K|t| k_T γ_T (with appropriately changed K), which tends to zero, so that (3) follows. ∎

It will further be seen that more definitive results are obtainable under simple conditions if an r.a.f. ζ_T is assumed to be both positive and stationary, in the sense that ζ_T(I + h) =_d ζ_T(I) for each h and interval I with I, I + h ⊂ (0,1]. Note that for positive variables it is convenient to work with Laplace instead of Fourier transforms, and hence it is natural to define a Δ-mixing coefficient with Laplace transforms. Specifically, the same definition is used but E e^{−tζ_T(I)} replaces E e^{itζ_T(I)} for an interval I. It is then possible to obtain a result similar to Lemma 3.1 without assuming the negligibility condition (1), using a k_T-partition which consists of intervals of possibly different lengths. However, it is more desirable to consider a "uniform" partition if stationarity of the r.a.f. ζ_T is assumed. This yields a simple proof, and it is sufficient to evaluate only one Laplace transform, E e^{−tζ_T((0, r_T])}, when approximating E e^{−tζ_T(I)}.

Lemma 3.2 Let the positive and stationary r.a.f. family {ζ_T} be Δ-mixing (defined with Laplace transforms) for some constants {l_T}, {r_T}, where k_T Δ(r_T, l_T) → 0 as T → ∞. Let I be a subinterval of (0,1], which may depend on T, but with k_T = [m(I)/r_T] → ∞. Then, without assuming the negligibility condition (1), as T → ∞, for any t > 0,

(4)
E exp{−tζ_T(I)} − ( E exp{−tζ_T((0, r_T])} )^{k_T} → 0.

Hence also

(5) E exp{−tζ_T(I)} − ( E exp{−tζ_T((0, 1])} )^{m(I)} → 0.
CLT under Weak Dependence
Proof Again for simplicity we suppress the subscript T in Iψ, kr, TT and take I = (0, α] without loss of generality, since ζψ is stationary. Write Ij for the interval ((j - l)r, jr], j = 1,2,.... First of all it is shown that (4) holds for the interval I = (0, o] with kr = α. It is sufficient to show (4) as T -> oo through any sequence such that (£exp{—tζτ(I*)})k converges to some p, 0 < p < 1. Consider separately the following two possibilities: (i) p = 1. Following the same steps as in Lemma 3.1, we obtain k
- ]Jεexp{-ίCr(/j)}|
3=1
l) + 2k(l-εexp{-tζτ(I*ι)}), since the Ij all have the same length / and ζr is stationary. Since p = 1, it follows that fclog£exp{—tζτ(I*)} -ϊ 0, so that εexp{—tζτ{I*)}) -> 0. Thus the right hand side of the above inequality tends to 0 as T —> oo and hence (4) holds. (ii) p < 1. It is possible to choose θ = θψ -» oo such that kθA(r,l) -> 0 and θl = o(r) since fcΔ(r,/) —>• 0 and Z = o(r). Hence for sufficiently large T, θ + 1 intervals Ji, J 2 , . . . , J0+1 congruent to 7* may be chosen in 7^, all mutually separated by at least /. Let J ^ be the interval separating J m and J m + i , 1 <m Σm=i{Cτ(Jm) + C τ ( ^ ) } + Cτ(Λ+i) and CT is positive,
< εexp{-tζτ(l[)} θ-l
< εexp{-t[Σ(ζτ(Jm) + ζτ(Jn)) + ζτ(Jθ) + m=l
) +Cτ(Jθ)]} εexp{-tζτ(Jθ+ι)}+Δ m=l
by the mixing condition. This latter expression is equal to θ-i
εexp{-t[Σ
(Cτ(Jm) + CrGE,)) + Cτ(Jθ)}} £exp{-tCτ(/i*)} +
by the assumed stationarity. Applying this repeatedly gives (6)
εexp{-tζτ(h)}
0 and hence that 8 exp{-tζτ(h)}) -> 1, i.e. ζτ(h) A 0. Now to prove (4) for the interval / = (0, α] with kr ψ α it is sufficient to show that εexp{-tζτ{(O,kr])} - S exp{-£ζτC0} -> 0. But this difference does not exceed 1 — £exp{—tζτ{(kr, α])} which tends to zero since as noted ζτ{h) -^ 0 and hence ζτ{{kr,ά\) A 0. (ra((fcr, α]) < m(I\) = r and ζy is positive and stationary.) Hence (4) holds. Since k ~ m(I)/r, it is readily seen that (£exp{-ίCr(/i)})fc
- (εexp{-tζτ(h)}Γ^Γ
-> 0.
Hence it follows from this and (4) that (8)
(£exp{-ίCτ(J)}) -
(5exp{-tCr(/i)})m(/)/r^0.
Then (5) is readily obtained by applying (8) to the unit interval (0,1] and /. .
4 Limiting distributions
The results of Section 3 show (partial) asymptotic independence of ζ_T(I_j) for a k_T-partition of an interval I. These will now be used to show that classical central limit theory is obtained under Δ-mixing, by considering an independent array with the same marginal distributions as the ζ_T(I_j). As for independent r.v.'s, the array {ζ_T(I_j)} corresponding to a k_T-partition ∪_{j=1}^{k_T} I_j = I of an interval I will be termed uniformly asymptotically negligible (uan) if ζ_T(I_j) →^P 0 uniformly in j, i.e. for every ε > 0,

max_{1≤j≤k_T} P{ |ζ_T(I_j)| > ε } → 0 as T → ∞.
For each T let {ζ_{T,j} : 1 ≤ j ≤ k_T} be independent random variables with ζ_{T,j} =_d ζ_T(I_j), 1 ≤ j ≤ k_T. Such a family will be called an independent array (of size k_T) associated with ζ_T(I). Note that such a partition and independent array are of course not unique. However, as can be seen, the following result does not depend on the choice of array, and is immediately obtained from Lemma 3.1.

Theorem 4.1 Let {ζ_T} be Δ-mixing with k_T(Δ(r_T, l_T) + γ_T) → 0, and let {ζ_{T,j}} be an independent array for {ζ_T(I)} based on a k_T-partition {I_j} of an interval I. Then ζ_T(I) has the same limiting distribution (if any) as Σ_j ζ_{T,j}. In particular, if the array {ζ_T(I_j)} is uan, any limit is infinitely divisible.

Proof Since ζ_T(I) = Σ_{j=1}^{k_T} ζ_T(I_j) and, by Lemma 3.1, for each t, as T → ∞,

E exp{itζ_T(I)} − E exp{it Σ_j ζ_{T,j}} = E exp{itζ_T(I)} − Π_{j=1}^{k_T} E exp{itζ_T(I_j)} → 0.

Thus ζ_T(I) →^d η for some r.v. η if and only if Σ_j ζ_{T,j} →^d η. ∎

General features of classical central limit theory apply under Δ-mixing to important cases such as normal and Compound Poisson convergence, as follows. In these, F_{T,j} will denote the distribution function of the contribution ζ_T(I_j) of the interval I_j in a k_T-partition of I, and
a_{T,j}(τ) = ∫_{|x|≤τ} x dF_{T,j}(x).

(1) Normal convergence. ζ_T(I) converges to a normal N(a, σ²) limit if

(i) Σ_j P{ |ζ_T(I_j)| > ε } → 0 as T → ∞, for each ε > 0,
(ii) Σ_j a_{T,j}(τ) → a, Σ_j σ²_{T,j}(τ) → σ² as T → ∞, for some τ > 0.
It is interesting and potentially useful in applications to note that if it is known that ζτ{I) converges in distribution to some r.v. η (not assumed normal) then normality of η actually follows from (i) alone and (ii) is automatically satisfied. This may be seen from the discussion of normal convergence in Loeve (1977), Section 23.5 and the "Central Convergence Criterion" of Section 23.4 of that reference.
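A minimal simulation sketch of normal convergence for a uan array (Python; the uniform-increment array is an invented illustration, not a construction from the text): each row consists of k independent increments of size O(k^{−1/2}), so no single term contributes, while the row sums stabilize at mean 0 and variance 1/12.

```python
import random

def array_row_sum(k, rng):
    """Sum of one uan row: k independent increments uniform on (-1/2, 1/2),
    scaled by k**-0.5 so each increment is negligible but the row variance
    stays fixed at 1/12."""
    scale = k ** -0.5
    return sum((rng.random() - 0.5) * scale for _ in range(k))

rng = random.Random(42)
k, trials = 200, 20000
sums = [array_row_sum(k, rng) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials

assert abs(mean) < 0.01              # limiting a = 0 here
assert abs(var - 1.0 / 12.0) < 0.01  # limiting sigma^2 = sum of row variances = 1/12
```

The moment conditions (i)-(ii) above are exactly what pin down the parameters a and σ² of such a limit.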
(2) Compound Poisson convergence. Let X = CP(λ, F) denote a Compound Poisson random variable X, i.e. X = Σ_{i=1}^N X_i, where the X_i are independent with distribution function F and N is Poisson with mean λ. Then, with the above notation, ζ_T(I) → CP(λ, F) if the following conditions hold:

(i) Σ_j (1 − F_{T,j}(x)) → λ(1 − F(x)), x > 0, at continuity points x of F,
(ii) Σ_j a_{T,j}(τ) → λ ∫_{|x|≤τ} x dF(x) for some fixed τ > 0,
(iii) limsup_{T→∞} Σ_j … .

If ζ_T(I) → η_I, a random variable, for some (nondegenerate) interval I, then such convergence occurs for all intervals I, and η_I is infinitely divisible with Laplace transform E exp(−tη_I) = φ(t)^{m(I)}, where φ is the Laplace transform of η_{(0,1]}.

5 Multivariate limits
It follows simply from the above results that if I₁, I₂, …, I_p are disjoint abutting intervals, then

E exp{ it Σ_{j=1}^p ζ_T(I_j) } − Π_{j=1}^p E exp{ itζ_T(I_j) } → 0 as T → ∞,

under Δ-mixing and negligibility assumptions. Hence, if ζ_T(I_j) → η_j, a r.v., for each j, then

Σ_{j=1}^p ζ_T(I_j) → Σ_{j=1}^p η_j as T → ∞,

and the {η_j} may be taken to be independent. However, convergence of this sum does not necessarily imply joint convergence in distribution of the components (ζ_T(I₁), …, ζ_T(I_p)) to (η₁, …, η_p), which requires the more general relation

(9) E exp{ Σ_{j=1}^p it_j ζ_T(I_j) } − Π_{j=1}^p E exp{ it_j ζ_T(I_j) } → 0.
This latter convergence requires a more detailed version of the Δ-mixing condition, which may be tailored to the number of intervals I_j involved. Specifically, for each p ≥ 1, {ζ_T} is said to be Δ_p-mixing if there exist constants r_T = o(1) and l_T = o(r_T) such that, for each real t₁, …, t_p,

Δ_p(r_T, l_T) = sup | E exp{ Σ_{j=1}^p it_j ζ_T(I_j) + it_p ζ_T(I_{p+1}) } − E exp{ Σ_{j=1}^p it_j ζ_T(I_j) } E exp{ it_p ζ_T(I_{p+1}) } | → 0 as T → ∞,

where the supremum is taken over any (p+1) disjoint intervals I_j = (a_j, b_j], j = 1, …, p+1, with 0 ≤ a₁ < b₁ ≤ a₂ < … ≤ 1, a_{p+1} − b_p ≥ l and b_{p+1} − a_{p+1} ≤ r.

Remark 5.1 Note that Δ₁ is the previous Δ-condition, and putting selected t_j = 0 shows that Δ_p-mixing implies Δ_m-mixing for 1 ≤ m ≤ p. The following result is then readily shown along now familiar lines.

Theorem 5.1 Let the r.a.f. family {ζ_T} satisfy the condition Δ_p for some {r_T}, {l_T}. Then (9) holds for given disjoint intervals I₁, I₂, …, I_p and any t_j which are uniformly bounded in j and T, if each interval I_j has a k_{T,j}-partition, j = 1, …, p, such that Σ_{j=1}^p k_{T,j} = k_T and

(10)
k_T(Δ_p(r_T, l_T) + γ_T) → 0 as T → ∞.

Moreover, if ζ_T(I_j) → η_j, a r.v., for each j, then (ζ_T(I₁), …, ζ_T(I_p)) → (η₁, …, η_p), where the η_j are independent.
Proof It will be convenient to write a partition of I_j as U_{j,1}, U_{j,2}, …, U_{j,k_j}, and to define U*_{j,m} and U′_{j,m}, for j = 1, …, p and m = 1, …, k_j, as I_j* and I_j′ were defined in Lemma 3.1, where Σ_{j=1}^p k_j = k_T. Again for simplicity suppress the subscript T in k_T, l_T, r_T. We readily obtain from Lemma 3.1 and Remark 5.1 that, for j = 1, …, p,

| E exp{ it_j ζ_T(I_j) } − Π_{m=1}^{k_j} E exp{ it_j ζ_T(U_{j,m}) } | ≤ k_j Δ_p(r, l) + 2 Σ_{m=1}^{k_j} E|1 − exp{ it_j ζ_T(U′_{j,m}) }|.

Hence, using the inequality

| Π_{j=1}^n a_j − Π_{j=1}^n b_j | ≤ Σ_{j=1}^n |a_j − b_j| for |a_j|, |b_j| ≤ 1,

it follows that

(11)
≤ b exp(−cz) … .

For some p ≥ 2,

(44) ∫₀¹ (s(1 − s))^{1/p} dK(s) =: K < ∞.
For each n ≥ 1, let ℋ_n be a class of functions on (0,1) such that each h ∈ ℋ_n can be decomposed into the difference

(45) h = h₁ − h₂,

where h₁ and h₂ are nondecreasing left-continuous functions on (0,1) satisfying, for all 0 < a < 1/2,

(46) sup_{h∈ℋ_n} ∫₀^a (s(1 − s))^{1/p} d[h₁(s) + h₂(s)] ≤ …
488
David M. Mason
and the constant α is as in (3). We get then that sup
δi,n
[0 1, 0 < p < l
(67)
and x > p
P{B{n,p) >nx} -n(x-p)2/2
ί -
e x p
\
U i - P ) + (p v (i -P)){X -P)β) '
We will also need the Dvorestzky, Kiefer and Wolfowitz (1956) inequality. See also Massart (1990) for the best possible constant. F a c t 3 . 3 . For any integer (68)
n>2 and x > 0
P{||αn||>x} 0 and τ > 0. For any n > 1 and 1 < i < n - 1, define i^~2^ i 2 + 1 An(i,r)=ωn(-^-,-,-Ίr). T
(69)
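The Dvoretzky-Kiefer-Wolfowitz bound with Massart's constant can be illustrated by simulation (Python; the sample size, threshold and trial count are illustrative choices). Because the constant is best possible, the empirical tail should hug the bound from below rather than sit far beneath it:

```python
import math
import random

def dkw_check(n=50, x=1.0, trials=5000, seed=3):
    """Empirical P{ sqrt(n) * sup_s |F_n(s) - s| > x } for uniform samples,
    compared with the DKW-Massart bound 2*exp(-2*x*x)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        u = sorted(rng.random() for _ in range(n))
        # sup |F_n - F| against the uniform d.f., via the order statistics
        d = max(max((i + 1) / n - u[i], u[i] - i / n) for i in range(n))
        if math.sqrt(n) * d > x:
            exceed += 1
    empirical = exceed / trials
    bound = 2.0 * math.exp(-2.0 * x * x)
    return empirical, bound

emp, bnd = dkw_check()
# The empirical tail tracks the bound closely (sharpness of Massart's
# constant), staying comparable to it up to simulation noise.
assert 0.5 * bnd < emp < 1.15 * bnd
```

The closeness of the two numbers is exactly why Massart's refinement of the constant matters in applications like the oscillation bounds above.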
Lemma 3.1 For universal positive constants A₁ and c₁, for all τ > 0 and n ≥ 1,

P{ max_{1≤i≤n−1} Δ_n(i, τ)/(i/n)^{1/2−δ} > τ } ≤ … .

We shall consider two cases.

Case 1. First assume τ i^{1−2δ}/n ≤ 1/2. In this case, by Fact 3.1 we have

(71) P{ Δ_n(i, τ) > τ n^{−1/2} i^{1/2−δ} } ≤ A exp(−β i^{2δ} …).

Case 2. Now assume τ i^{1−2δ}/n > 1/2. In this situation, by noting that Δ_n(i, τ) ≤ 2‖α_n‖, we get, for 1 ≤ i ≤ n − 1 and all τ ≥ 1,

(72) P{ Δ_n(i, τ) > τ n^{−1/2} i^{1/2−δ} } ≤ … .

Therefore for all τ ≥ 1, … , and for all δ > 0,

(74) P{ M_n(δ) > τ } ≤ … .

Combining (76), (77), (78) and (79) we get a bound.
This bound, in conjunction with (75), yields for all z ≥ 1 and i ≥ 2

P{ max … sup … > 3z } ≤ … ,

which implies that as n → ∞,

sup_{[0,1]^n × C[0,1]} … → 0.

Thus by (88) and the arbitrary choice of ε we infer that, as n → ∞,

inf_{f∈𝒯₀,n} ∫_{[0,1]^n × C[0,1]} … λ₀(f, f₀) … H_n(f, f₀) dP_n → 1,

which by Lemma 4.1 implies (54). ∎
4.2 Proof of Proposition 2.5
From now on, η_n = o(n^{−1/3}) is as in (56), and D_n(f) is as in (87). First notice that, by (55) and (50), for any choice of ε > 0 and all large n,

sup_f |D_n(f)| ≤ κ √η_n sup_{0<s<1} √n |α_n(s) − B_n(s)| + ε/2 … .
The assertions remain true if liminf is replaced by limsup in (2.4) and (2.5). To obtain a bound which is as sharp as possible, one has to choose the sequence P_n, n ∈ ℕ, such that λ(P₀ⁿ, P_nⁿ) is small and c_n|κ(P_n) − κ(P₀)| is large. This is how the theorem can be applied to particular problems. Moreover, a version with sequences is necessary for the application to differentiable paths. From the aesthetic point of view, the following version, based on sequences of neighbourhoods, may be more satisfying.

Corollary 2.3 Let 𝔓_n ⊂ 𝔓, n ∈ ℕ, be a nonincreasing sequence of sets containing P₀. Assume that the estimator sequence κ⁽ⁿ⁾, n ∈ ℕ, is, with rate c_n, n ∈ ℕ, asymptotically unbiased uniformly on 𝔓_n, n ∈ ℕ, i.e.

(2.6) lim_{u→∞} limsup_{n→∞} sup_{P∈𝔓_n} | ∫ L_u[c_n(κ⁽ⁿ⁾ − κ(P))] dPⁿ | = 0.

Let

(2.7) r̄ := liminf_{n→∞} sup_{P∈𝔓_n} c_n|κ(P) − κ(P₀)| > 0.
Then, if ā := limsup_{n→∞} sup_{P∈𝔓_n} λ(P₀ⁿ, Pⁿ) < ∞,

(2.8′) lim_{u→∞} liminf_{n→∞} ∫ L_u[c_n(κ⁽ⁿ⁾ − κ(P₀))]² dP₀ⁿ ≥ … .

Furthermore, if b̄ := limsup_{n→∞} sup_{P∈𝔓_n} 2H(P₀ⁿ, Pⁿ) < ∞, then

(2.8″) lim_{u→∞} liminf_{n→∞} ∫ L_u[c_n(κ⁽ⁿ⁾ − κ(P₀))]² dP₀ⁿ + lim_{u→∞} limsup_{n→∞} sup_{P∈𝔓_n} ∫ L_u[c_n(κ⁽ⁿ⁾ − κ(P))]² dPⁿ ≥ … .
Proof Apply Theorem 2.2 for a sequence Pn G φ n , n G N, such that liminf cn\(Pn) — K(PQ)\ = liminf sup Cn\κ(P) — K(PQ)|.
Cramer-Rao Bound

This concludes the proof. ∎

Proof of Theorem 2.2
(i). To simplify our notation, let

(2.9′)  K_n := c_n(κ⁽ⁿ⁾ − κ(P₀)),
(2.9″)  K̄_n := c_n(κ⁽ⁿ⁾ − κ(P_n)),

and

(2.10′)  G_{n,u} := | ∫ L_u[K_n] dP₀ⁿ |,
(2.10″)  Ḡ_{n,u} := | ∫ L_u[K̄_n] dP_nⁿ |.

We have r = liminf_{n→∞} |r_n|. Let

σ² := lim_{u→∞} liminf_{n→∞} ∫ L_u[K_n]² dP₀ⁿ.
Since assertion (2.5′) is trivial for σ² = ∞ and for a² ≥ π₀, we assume σ² < ∞ and a² < π₀ in the following.

(ii). For every ε > 0 there exist u_ε > 0 and n_ε ∈ ℕ such that the following relations hold true for u ≥ u_ε:
(2.11′)  limsup_{n→∞} G_{n,u} ≤ ε,
(2.11″)  limsup_{n→∞} Ḡ_{n,u} ≤ ε,
(2.12)   P₀ⁿ{ |K_n| ≤ u } ≥ (1 − ε)π₀, for n ≥ n_ε.
Relations (2.11) follow from unbiasedness as defined in (2.3). Relation (2.12) follows from (2.1) and π₀ > 0. For reasons which will become clear later on, we assume that ε ∈ (0, 3/4).

(iii). The following relation for an arbitrary probability measure Q|𝔅 will be used repeatedly: (2.13) … . Let v ≥ sup_{n∈ℕ} |r_n| and u ≥ max{u_ε, v} be fixed. If σ² < ∞, there exists an infinite subset ℕ₀ ⊂ ℕ (depending on u, v, and ε) such that
(2.14) ∫ L_{u+v}[K̄_n]² dP_nⁿ ≤ (1 + ε)σ², for n ∈ ℕ₀.
J. Pfanzagl
Since |1_{[−u,u]}(K_n) − 1_{[−u,u]}(K̄_n)| ≤ 1_{[u−v, u+v]}(|K_n|) for n ∈ ℕ, we obtain

(2.15) |K̄_n 1_{[−u,u]}(K_n) − K̄_n 1_{[−u,u]}(K̄_n)| ≤ … ,

for n ≥ n_ε (recall a := limsup_{n→∞} λ(P₀ⁿ, P_nⁿ)), and

(2.22) B_{n,u} ≤ a( (1 + v²u^{−2})(1 + ε)σ² + 2vG_{n,u+v} + r_n² P₀ⁿ{ |K_n| ≤ u } )^{1/2}.

For u ≥ max{u_ε, v} and all ε ∈ (0, 3/4) we obtain … ,
or

(2.25) σ² ≥ π₀ r̄² (1 − …).

Since r̄ ≥ liminf_{n→∞} |r_n| =: r, relation (2.5′) follows.
(vi). The proof of (2.5″) is the same, using, instead of the bound (2.22) for B_{n,u} resulting from (2.20′), the following bound resulting from (2.20″):

… (1 + v²u^{−2})(1 + ε)σ² + 2vG_{n,u+v} + r_n² P₀ⁿ{ |K_n| ≤ u } … .

(3.4′)   … { |r_n| > εn^{1/2} } = o(n^{−1}),
(3.4″)   ∫ r_n 1{ |r_n| ≤ … } dP₀ = o(n^{−1/2}),
(3.4‴)   … .

Observe that (3.3) implies ∫ r_n dP₀ = 0. As pointed out by LeCam (see Pfanzagl, 1985, pp. 25-27), condition (3.4) is equivalent to Hellinger differentiability of the path P_{n,g} with derivative g/2.
In most examples the path P_{n,g}, n ∈ ℕ, can be chosen such that the sequence r_n, n ∈ ℕ, of remainder terms fulfills a condition slightly stronger than (3.4), namely

(3.5) … .

Theorem 3.1 Let κ⁽ⁿ⁾, n ∈ ℕ, be an estimator sequence such that

(3.6) lim_{u→∞} limsup_{n→∞} P₀ⁿ{ c_n|κ⁽ⁿ⁾ − κ(P₀)| ≤ u } = 1.

Assume, moreover, that κ⁽ⁿ⁾, n ∈ ℕ, is asymptotically unbiased with the rate n^{1/2}, uniformly on all paths fulfilling conditions (3.2) and (3.3) with r_n, n ∈ ℕ, fulfilling (3.5). Then

(3.7) lim_{u→∞} liminf_{n→∞} ∫ L_u[n^{1/2}(κ⁽ⁿ⁾ − κ(P₀))]² dP₀ⁿ ≥ σ²(κ*).
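In the parametric specialization discussed in Remark 3.1 below, bounds of the type (3.7) reduce to the classical Cramér-Rao inequality, which efficient estimators attain. A minimal numerical sketch (Python; the N(θ, 1) location model is an invented illustration, not an example from the text): the Fisher information is 1, so n·Var of any unbiased estimator is bounded below by 1, and the sample mean attains the bound.

```python
import random

def mean_estimator_nvar(theta=0.3, n=25, trials=20000, seed=5):
    """Monte Carlo estimate of n * Var(sample mean) in the N(theta, 1)
    location model, for comparison with the Cramer-Rao bound 1/I(theta) = 1."""
    rng = random.Random(seed)
    est = [sum(rng.gauss(theta, 1.0) for _ in range(n)) / n for _ in range(trials)]
    m = sum(est) / trials
    return n * sum((e - m) ** 2 for e in est) / trials

nvar = mean_estimator_nvar()
cramer_rao = 1.0  # 1 / Fisher information for the unit-variance normal location model

# The sample mean is efficient: n * Var matches the bound up to MC noise.
assert abs(nvar - cramer_rao) < 0.05
```

The point of the theorem above is that this lower bound already follows from asymptotic unbiasedness along paths, without assuming exact unbiasedness for every n.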
In this theorem, the uniformity along paths refers to asymptotic unbiasedness only. If the uniformity along paths holds for convergence to a limit distribution ("regular estimator sequences") and if this limit distribution has expectation zero, then this implies asymptotic unbiasedness and (3.6), hence also (3.7). This is, however, of no interest, since for regular estimator sequences the Convolution Theorem yields a much stronger assertion. The attractive feature of Theorem 3.1 lies in the fact that it requires asymptotic unbiasedness only (without reference to any limit distribution). Actually, asymptotic unbiasedness along the paths P_{n,aκ*}, n ∈ ℕ, with a > 0 is all we need in the proof of Theorem 3.1. Though this condition is weaker, it is not more plausible than asymptotic unbiasedness along all paths P_{n,g}, n ∈ ℕ, with g ∈ T₀. For readers who find it difficult to grasp the meaning of "uniformity along paths" (as is the case with the author) we add the remark that this condition may, of course, be replaced by the stronger condition of "uniformity on {P ∈ 𝔓 : λ(P₀ⁿ, Pⁿ) ≤ a} for some a > 0".

Remark 3.1 Now we specialize Theorem 3.1 to the case of a 1-parametric family {P_ϑ : ϑ ∈ Θ}, Θ ⊂ ℝ, such that (see relation (3.3))
with rn fulfilling relation (3.5). In this case,
J. Pfanzagl
Let ϑ^{(n)}, n ∈ ℕ, be an estimator sequence which is asymptotically unbiased on the paths P_{ϑ_0 + n^{−1/2} t}, n ∈ ℕ. Since y² ≥ L_u(y)², we obtain from (3.7)

(3.8)  limsup_{n→∞} n ∫ (ϑ^{(n)} − ϑ_0)² dP_{ϑ_0}^n ≥ 1 / ∫ ℓ̇(x, ϑ_0)² P_{ϑ_0}(dx).
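As a quick numerical illustration of (3.8) — the family, sample size, and estimator below are hypothetical choices, not taken from the paper — for the N(ϑ, 1) family the Fisher information ∫ ℓ̇(x, ϑ_0)² P_{ϑ_0}(dx) equals 1, so n times the MSE of an asymptotically unbiased estimator is asymptotically at least 1, and the sample mean attains this exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# N(theta, 1) family: the Fisher information is 1, so the bound (3.8) says
# that n * E(theta_hat - theta_0)^2 is asymptotically at least 1.
# The sample mean is unbiased and attains the bound with equality.
theta0, n, reps = 0.7, 50, 20_000
samples = rng.normal(theta0, 1.0, size=(reps, n))
theta_hat = samples.mean(axis=1)
scaled_mse = n * np.mean((theta_hat - theta0) ** 2)
assert abs(scaled_mse - 1.0) < 0.05   # matches the lower bound 1/I(theta_0)
```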
This is the classical version of the Cramér-Rao bound, established here under a condition weaker than "unbiasedness for every n ∈ ℕ". Condition (3.5), required in Theorem 3.1, is slightly stronger than needed for LAN. Therefore, we supply another theorem (with a slightly more complex result).

Theorem 3.2 Let κ^{(n)}, n ∈ ℕ, be an estimator sequence fulfilling (3.6). Assume, moreover, that κ^{(n)}, n ∈ ℕ, is asymptotically unbiased with the rate n^{1/2}, uniformly on all paths fulfilling conditions (3.1) and (3.2). Then

lim_{u→∞} liminf_{n→∞} ∫ L_u[n^{1/2}(κ^{(n)} − κ(P_0))]² dP_0^n < ∞

implies for t sufficiently small a corresponding lower bound along the paths P_{n,tκ*}. Since

lim_{u→∞} liminf_{n→∞} ∫ L_u[n^{1/2}(κ^{(n)} − κ(P_0))]² dP_0^n ≥ σ²(κ*)
under the conditions of Theorem 3.1, the use of Theorem 3.2 is, roughly speaking, restricted to the case that there exists a path in direction tκ* fulfilling the conditions (3.4), but not the stronger condition (3.5).

Remark 3.2 Comparable bounds exist for the concentration of estimator sequences which are asymptotically median unbiased, uniformly on paths fulfilling (3.1) and (3.2). This is proved in Pfanzagl (1994, p. 271, Corollary 8.2.5) for parametric families, but the proof extends immediately to the more general case of a family with paths fulfilling (3.1) and (3.2).

Proof of Theorem 3.1 For paths P_{n,g} fulfilling (3.3) and (3.5) we obtain from Lemma 5.2 that

lim_{n→∞} χ²(P_0^n; P_{n,g}^n) = exp[σ²(g)] − 1.
Moreover, from (3.2),

lim_{n→∞} n^{1/2}(κ(P_{n,g}) − κ(P_0)) = ∫ g(x) κ*(x) P_0(dx).
Applied with g = tκ* we obtain from (2.5′) a lower bound involving

Φ(z) = z(2 − exp[z])/(exp[z] − 1).

Since Φ attains its maximal value 1 for z → 0, the assertion follows. □

Proof of Theorem 3.2 For paths P_{n,g} fulfilling (3.1) we obtain from Lemma 5.1, applied with M = N(−σ²(g)/2, σ²(g)), that

lim_{n→∞} H(P_0^n, P_{n,g}^n)² = 1 − exp[−σ²(g)/8].
Moreover, from (3.2),

lim_{n→∞} n^{1/2}(κ(P_{n,g}) − κ(P_0)) = ∫ g(x) κ*(x) P_0(dx).

Applied with g = tκ* we obtain from relation (2.5″) a lower bound for

lim_{u→∞} liminf_{n→∞} ∫ L_u[n^{1/2}(κ^{(n)} − κ(P_0))]² dP_0^n + lim_{u→∞} limsup_{n→∞} ∫ L_u[n^{1/2}(κ^{(n)} − κ(P_{n,tκ*}))]² dP_{n,tκ*}^n,

again with a factor Φ. Since Φ attains its maximal value 1 for z → 0, the assertion follows. □
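The behaviour of Φ claimed in both proofs can be confirmed numerically (the grid below is an arbitrary choice):

```python
import numpy as np

# Phi(z) = z*(2 - e^z)/(e^z - 1), from the proof of Theorem 3.1.
# On z > 0 it is decreasing with limit 1 as z -> 0+, the constant in (3.7).
z = np.linspace(1e-6, 5.0, 100_000)
phi = z * (2.0 - np.exp(z)) / (np.exp(z) - 1.0)
assert phi.max() < 1.0                 # sup over z > 0 is not attained ...
assert abs(phi[0] - 1.0) < 1e-4        # ... but equals 1 in the limit z -> 0+
```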
4 Applications to nonparametric problems
In most nonparametric problems the optimal convergence rate is slower than n^{1/2}. Usually it is of the type n^α L(n) with α ∈ (0, 1/2) and L a slowly varying function, such as (log n)^α. In these cases, estimator sequences converging locally uniformly to a limit distribution usually do not exist (see Pfanzagl, 2000, for details). In the present paper we shall apply Theorem 3.1 to show that estimator sequences which have at P_0 a finite asymptotic variance with this rate cannot be asymptotically unbiased.
Theorem 4.1 Let P_n ∈ 𝔓, n ∈ ℕ, fulfill

(4.1)  lim_{n→∞} χ²(P_0^n, P_n^n) = 0.
Let κ^{(n)}, n ∈ ℕ, be an estimator sequence fulfilling (4.4) and, for some α > 0,

(4.13)  σ* := lim_{u→∞} limsup_{n→∞} sup_{P ∈ Q_{n,α}} ∫ L_u[c_n(κ^{(n)} − κ(P))]² dP^n < ∞.
Then the estimator sequence κ^{(n)}, n ∈ ℕ, cannot be asymptotically unbiased with the rate c_n, n ∈ ℕ, uniformly on {P ∈ 𝔓 : H(P_0^n, P^n) < α} for some α > 0.

Proof If κ^{(n)}, n ∈ ℕ, is asymptotically unbiased, uniformly on {P ∈ 𝔓 : H(P_0^n, P^n) < α}, relation (2.8″) implies that

(4.14)  σ* ≥ 8^{−1/2} π_0 (1 − 4α²/π_0²)^{1/2} liminf_{n→∞} s_n(α),
with

s_n(α) = sup{ c_n |κ(P_0) − κ(P)| : H(P_0^n, P^n) < α }.

Since H²(P_0^n, P^n) ≤ n H²(P_0, P), we have s_n(α) ≥ s̃_n(α), where s̃_n(α) denotes the corresponding supremum over Q_{n,α}. Hence (4.14) holds with s_n(α) replaced by s̃_n(α). Since this relationship holds for every α > 0, relation (4.12) is in contradiction to (4.13). □

Theorem 4.2 is closely related to Theorem 2 in Liu and Brown (1993, p. 4). In the expressive terminology of these authors their result reads as follows: at a singular point of irregular infinitesimality an estimator sequence cannot be both locally asymptotically unbiased and locally asymptotically informative. Theorem 4.2 differs from this result of Liu and Brown only in two aspects of minor importance, namely the use of L_u instead of ℓ_u[y] = L_u[y] + u(1_{(u,∞)}(y) − 1_{(−∞,−u)}(y)), and the use of condition (4.12) in place of "irregular infinitesimality" (see Definition 2.6, p. 4, in the paper by Liu and Brown). We remark that Theorem 4.2 is stronger than the corresponding theorem with ℓ_u in place of L_u (in relation (4.13) and in the definition of asymptotic unbiasedness). Relation (4.13) with ℓ_u in place of L_u is stronger, since L_u[y]² ≤ ℓ_u[y]². Moreover,
(4.15)  u P^n{ c_n |κ^{(n)} − κ(P)| > u } ≤ u^{−1} ∫ ℓ_u[c_n(κ^{(n)} − κ(P))]² dP^n.

Hence the definitions of asymptotic unbiasedness, uniformly on Q_{n,α}, based on L_u or ℓ_u, are equivalent if (4.13) holds with ℓ_u in place of L_u.
5 Auxiliary results
Lemma 5.1 Let P_n, Q_n be sequences of probability measures on (X, 𝒜) with q_n ∈ dQ_n/dP_n. If

(5.1)  P_n ∘ log q_n ⇒ M,

with M | 𝔅 a probability measure fulfilling ∫ exp[v] M(dv) = 1, then

(5.2)  lim_{n→∞} H(P_n, Q_n)² = 1 − ∫ exp[v/2] M(dv).
Proof If q ∈ dQ/dP, we have

(5.3)  H(P, Q)² := 1 − ∫ q^{1/2} dP = −∫ (q^{1/2} − 1) dP.

With

(5.4)  Π_n := P_n ∘ (q_n^{1/2} − 1),
relation (5.3) may be rewritten as

(5.5)  H(P_n, Q_n)² = −∫ y Π_n(dy).

Since u^{1/2} − 1 = exp[(1/2) log u] − 1, we have

Π_n = (P_n ∘ log q_n) ∘ (v ↦ exp[v/2] − 1).

Since v ↦ exp[v/2] − 1 is continuous, relation (5.1) implies

(5.6)  Π_n ⇒ Π_0 := M ∘ (v ↦ exp[v/2] − 1).
We have

liminf_{n→∞} ∫ y² Π_n(dy) ≥ ∫ y² Π_0(dy)

and, since y ≥ −1,

liminf_{n→∞} ∫ y Π_n(dy) ≥ ∫ y Π_0(dy).

This implies

liminf_{n→∞} H(P_n, Q_n)² ≥ −∫ y Π_0(dy)

and

limsup_{n→∞} H(P_n, Q_n)² ≤ −∫ y Π_0(dy).

Since

−∫ y Π_0(dy) = 1 − ∫ exp[v/2] M(dv),
the assertion follows. □

Lemma 5.2 Let P_0, P_n, n ∈ ℕ, be probability measures with μ-densities p_0 and p_n, respectively. Assume there exists a function g ∈ ℒ₂(P_0) with ∫ g(x) P_0(dx) = 0 such that

(5.7)  p_n/p_0 = 1 + n^{−1/2} g + n^{−1/2} r_n,

with

(5.8)  ∫ r_n²(x) P_0(dx) → 0.

Then

(5.9)  lim_{n→∞} χ²(P_0^n; P_n^n) = exp[∫ g² dP_0] − 1.
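Before the proof, (5.9) can be checked in a Gaussian special case (all numerical choices below are hypothetical): for P_0 = N(0,1) and P_n = N(t n^{−1/2}, 1) one has g(x) = tx and ∫ g² dP_0 = t², so the limit should be exp[t²] − 1.

```python
import numpy as np

# Gaussian path P_n = N(t/sqrt(n), 1) against P_0 = N(0, 1): here g(x) = t*x,
# integral g^2 dP_0 = t^2, so (5.9) predicts chi^2(P_0^n; P_n^n) -> exp[t^2] - 1.
t, n = 0.8, 400
a = t / np.sqrt(n)

# chi^2(P_0; P_n) = integral (p_n/p_0 - 1)^2 dP_0, by quadrature on a wide grid
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
p0 = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
pn = np.exp(-(x - a)**2 / 2.0) / np.sqrt(2.0 * np.pi)
chi2_one = np.sum((pn / p0 - 1.0)**2 * p0) * dx

# chi^2 tensorizes: 1 + chi^2(P_0^n; P_n^n) = (1 + chi^2(P_0; P_n))^n
chi2_n = (1.0 + chi2_one)**n - 1.0
assert abs(chi2_n - (np.exp(t**2) - 1.0)) < 1e-2
```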
Proof Since

χ²(P_0; P_n) = ∫ (p_n/p_0 − 1)² dP_0 = n^{−1} ∫ (g + r_n)² dP_0,

we have

: n ≥ 1} for unknown constants θ_n, x_n, y_n, z_n, reconstruct Z.
Ronald Pyke
An application of Theorem 3.1 to each of the moved fragments shows that with probability one, the partial translations (x_n, y_n) and the bases' rotations θ_n can be uniquely determined. Thus each such transformed fragment G_n of the graph of Brownian sheet may be relocated correctly in all respects except for its vertical placement. If the fragments of the Brownian sheet's graphs are allowed to rotate in ℝ³ (and not just around the z-axis as for Problem 5) as they fall, one is led to the following analog of Problem 4 in which the two-dimensional rotation is replaced by a three-dimensional one:

Problem 4*. Given {τ_n(G_n) − (x_n, y_n, z_n)} for unknown rotations τ_n and constants x_n, y_n, z_n, reconstruct Z.

For each ε > 0, there is an r_ε > 0 for which P(B_r) > 1 − ε for r < r_ε, where B_r is the complement of the event in (14). Since r < 1, A* ∩ B_r ⊂ A_r, so that then P(A_r) > 1 − ε. But by definition of the events A_r it follows that A_r ⊂ A_{r′} for r < r′. Together, this implies that P(A_r) = 1 for every 0 < r < 1. To this point, we have shown that for each fixed direction u and each r ∈ (0, 1), the probability is one that any non-vertical line in ℝ³ through (1, 1, Z(1)) whose projection on ℝ² is parallel to u has a parallel line that intersects the graph of Z over B_r(1) more than once. The same is therefore true for a countably dense set of directions u. It remains to show that the statement holds with probability one for all directions simultaneously. Fix ε > 0. Define
(15)  (ΔZ)_{r,n,k,u} = Z(1 + (k/n) r u) − Z(1 + ((k−1)/n) r u)

and

(16)  M_{ε,r,n,u*} = sup_{|u−u*|<ε} sup_{1≤k≤n} n^{1/2} |(ΔZ)_{r,n,k,u} − (ΔZ)_{r,n,k,u*}|.

(20)  For each n ≥ 1, D_n is the closure of its interior D_n°, the closure of ⋃_n D_n is I², and |⋃_n D_n°| = 1.
Thus, the bases of the fragments have relatively nice boundaries, and the excess 'dust' generated by the shattering, namely the part of the graph over the complement of ⋃_n D_n, is 'negligible'. This assumption (20) also ensures that for each n and each s ∈ I² the set-indexed value Z(D_n° ∩ [0, s]) is determined uniquely by the incremental information contained in the patch G_n. Thus for every s ∈ I², define
Z*(s) := Σ_{n=1}^{∞} Z(D_n° ∩ [0, s]),

a mean zero normal r.v. with

var(Z*(s)) = Σ_{n=1}^{∞} |D_n° ∩ [0, s]| = |⋃_n D_n° ∩ [0, s]| = |[0, s]|
by (20). Moreover, Z*(s) = Z(s) a.s. In particular, this determines with probability 1 the heights Z(t_n) for every n, as required to solve Problem 2*. The values of Z on the boundaries ∂D_n = D_n \ D_n° and on the complement of ⋃_n D_n are determined by the continuity of Z; note that (20) implies that the closure of ⋃_n D_n° is also I².

Remarks. 1. Although it is possible to have bases D_n that by their particular shapes can be fit together into I² in only one way, the discussions here do not consider such possible information. Our emphasis is upon utilizing only the information in the Brownian surface, so that the procedures apply to such uninformative shapes as rectangles and balls.

2. Let us comment further on the extendability of the definition of the Brownian sheet Z from points t ∈ I² (or, equivalently, rectangles [0, t] ⊂ I²) to the family of sets

𝒜 = {D ∩ (s, t] : D ∈ 𝒫, s, t ∈ I²}
used in the above derivation. Under assumption (20), each D in the countable family 𝒫 is a closed subset of I² that satisfies |∂D| = |D \ D°| = 0, and the associated family of sets 𝒜_D = {D ∩ (s, t] : s, t ∈ I²} indexed by I² is, for example, of zero metric entropy under inclusion with respect to the Hausdorff or symmetric-difference metrics. Thus a continuous extension of Z
Shattered Brownian Sheet
on 𝒜_D, and hence on 𝒜, exists that agrees with the given Z((s, t]) whenever (s, t] ⊂ D.

3. It should be pointed out that the existence of chords of arbitrarily large and small slopes in the trajectories of Z along fixed rays through 1 could also be shown by means of LILs or Hölder results for the particular Gaussian processes involved. For directions u with u₁u₂ > 0, such results are easily deduced, since along such rays Z is Brownian motion with a quadratically, rather than linearly, growing variance. In the other directions the dependent increments require results for more general Gaussian processes. It would be of interest to obtain general uniform LIL and uniform Hölder results for Gaussian processes analogous to the uniform quadratic variation result of Adler and Pyke (1993). In this regard, the reader should note the powerful results in Dalang and Mountford (1996). In particular, Theorem 2 of their paper, in the equivalent form given prior to their equation (2), implies that with probability one, for every θ ∈ [0, 2π) and t in a small rectangle about 1, |Z(s) − Z(t)| > c|s − t| for infinitely many s for which s − t has direction θ, regardless of the value of c > 0. (Much more is proved in Dalang and Mountford (1996), since this statement about the increments is shown to hold along all Jordan curves constrained within wedges about the lines of direction θ, and not just along the straight-line segments needed here.) A modification of this result that would resolve Problem 4* is that in which the absolute values of the increments are replaced by the positive and negative parts, (Z(s) − Z(t))⁺ and (Z(s) − Z(t))⁻. It would then be possible to utilize these results in the same way that the LIL was used to resolve Problem 4 in Section 2. However, this extension of the Dalang and Mountford result is not immediate; in particular, the representation in their Lemma 1(b) does not hold.
(The author is grateful to Robert Dalang for his assistance on this point.)

4. In Arató (1997), the problem of estimating μ when Z + μ is observed over a domain D ⊂ I² is considered, and its maximum likelihood estimator derived. The results of this paper permit the context to be modified so that the base D may be treated as being unknown.

5. In this paper, only distance-preserving transformations have been considered. It would be of interest to allow for other perturbations of the bases, or even of the graphs, that are recoverable from the coordinate information inherent in Brownian sheet.
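The set-indexed reconstruction above rests on the sheet's independent-increment structure. As a numerical sketch (grid resolution, evaluation points, and replication count are arbitrary choices), one can simulate Z by summing iid cell increments and confirm var Z(s) = |[0, s]|:

```python
import numpy as np

rng = np.random.default_rng(2)

# Brownian sheet on a grid: Z(s) is the sum of iid N(0, h^2) increments over
# the cells of [0, s], so var Z(s) = |[0, s]| = s1*s2 and
# cov(Z(s), Z(t)) = min(s1, t1) * min(s2, t2).
m, reps = 30, 6_000
h = 1.0 / m
incr = rng.normal(0.0, h, size=(reps, m, m))

def Z(s1, s2):
    # value of the sheet at (s1, s2) for each of the `reps` replications
    return incr[:, :int(s1 * m), :int(s2 * m)].sum(axis=(1, 2))

zs, zt = Z(0.6, 0.8), Z(0.9, 0.5)
assert abs(zs.var() - 0.6 * 0.8) < 0.05
assert abs(np.mean(zs * zt) - 0.6 * 0.5) < 0.05
```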
REFERENCES

Adler, R. J. (1990). An Introduction to Continuity, Extrema and Related Topics for General Gaussian Processes. IMS Lecture Notes–Monograph Series 12, Hayward, CA.
Adler, R. J. and Pyke, R. (1993). Uniform quadratic variation for Gaussian processes. Stochastic Processes and their Applications 48, 191–210.

Arató, N. M. (1997). Mean estimation of Brownian sheet. Computers and Mathematics with Applications 33, 13–25.

Baxter, G. (1956). A strong limit theorem for Gaussian processes. Proceedings of the American Mathematical Society 7, 522–527.

Dalang, R. C. and Mountford, T. (1996). Nondifferentiability of curves on the Brownian sheet. Annals of Probability 24, 182–195.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications. Third Edition, J. Wiley and Sons, New York.

Lévy, P. (1940). Le mouvement brownien plan. American Journal of Mathematics 62, 487–550.

Pyke, R. (1980). The asymptotic behavior of spacings under Kakutani's model for interval subdivision. Annals of Probability 8, 157–163.

Pyke, R. and van Zwet, W. R. (2000). Weak convergence results for the Kakutani interval splitting procedure. In progress.

van Zwet, W. R. (1978). A proof of Kakutani's conjecture on random subdivision of longest intervals. Annals of Probability 6, 133–137.
DEPARTMENT OF MATHEMATICS
UNIVERSITY OF WASHINGTON
SEATTLE, WASHINGTON
pyke@math.washington.edu
INVERTING NOISY INTEGRAL EQUATIONS USING WAVELET EXPANSIONS: A CLASS OF IRREGULAR CONVOLUTIONS
PETER HALL, FRITS RUYMGAART¹, ONNO VAN GAANS, AND ARNOUD VAN ROOIJ

Australian National University, Texas Tech University, and Katholieke Universiteit Nijmegen

Suppose a random sample is observed from a density which is a known transformation of an unknown underlying density to be recovered. Expansion of this unknown density in a wavelet basis yields Fourier coefficients that can be reexpressed in terms of the sampled density and an extension of the adjoint of the inverse of the operator involved. This seems to yield a new approach to inverse estimation. Focusing on deconvolution, optimal error rates are obtained in the case of certain irregular kernels, like the boxcar, that cannot easily be dealt with by classical techniques or by Donoho's (1995) wavelet-vaguelette method.

AMS subject classifications: 42C15, 45E10, 46N30, 62G07.
Keywords and phrases: inverse estimation, wavelet expansion, deconvolution, irregular kernel.
1 Introduction
When a smooth input signal is to be recovered from indirect, noisy measurements, Hilbert space methods based on a regularized inverse of the integral operator involved usually yield optimal rates of convergence of the mean integrated squared error (MISE). Statistical theory can be conveniently developed exploiting Halmos' (1963) version of the spectral theorem (van Rooij & Ruymgaart (1996), van Rooij, Ruymgaart & van Zwet (1998)). In practice, however, the input signal often is not regular: in dynamical systems, for instance, it might be a pulse function. In such cases the traditional recovery technique may fail to capture the local irregularities of the input. Difficulties also arise in instances where the kernel of the integral operator itself displays a certain lack of smoothness. Whenever one has to deal with irregularities, wavelet methods seem pertinent. In classical, direct estimation of discontinuous densities Hall & Patil (1995) successfully apply a wavelet expansion. For certain inverse estimation models Donoho (1995), in a seminal paper, proposes a wavelet-vaguelette decomposition for optimal recovery of spatially inhomogeneous inputs. In both papers nonlinear

¹ Supported by NSF grant DMS 95-94485 and NSA grant MDA 904-99-1-0029.
techniques are used. Donoho (1995) points out that his method remains essentially restricted to so-called renormalizable problems (Donoho & Low (1992)). Convolution with the "boxcar" (the indicator of the unit interval) is an example of an operator that is excluded. In this paper we propose an alternative approach to statistical inverse problems with a special view towards irregularities, by using a wavelet expansion where an extension of the inverse operator appears in the Fourier coefficients. More precisely, let K : L₂(ℝ) → L₂(ℝ) be a bounded, injective integral operator and consider the equation

(1.1)  g = Kf,  f ∈ L₂(ℝ).
Suppose we can find an operator V with dense domain so that its adjoint V* is well-defined. Let ℱ ⊂ L₂(ℝ), and suppose that the domain of V contains Kℱ and that V satisfies VKf = f, f ∈ ℱ; in other words, let V be an extension of the restriction of K^{−1} to Kℱ. Given any orthonormal basis ψ_λ, λ ∈ Λ, we have for f ∈ ℱ the expansion

(1.2)  f = Σ_{λ∈Λ} ⟨f, ψ_λ⟩ ψ_λ = Σ_{λ∈Λ} ⟨VKf, ψ_λ⟩ ψ_λ = Σ_{λ∈Λ} ⟨g, V*ψ_λ⟩ ψ_λ,
provided that the ψ_λ are in the domain of V*. Since in practice g is imperfectly known, it may be much better to deal with V*ψ_λ than with Vg. Also, the V*ψ_λ are independent of the specific function f and hence can be used for the entire class ℱ. The use of generalized Fourier coefficients to obtain convergence rates as such is, of course, not new and appears for instance in Wahba (1977) and Wahba & Wang (1990). In simpler situations it will be possible to choose ψ_λ in the domain of K^{−1}. Calculation of the Fourier coefficients will then usually be performed in the spectral domain by application of Halmos' (1963) spectral theorem, mentioned before, coupled with the polar decomposition (Riesz & Nagy (1990)). It is far too ambitious to deal with (1.1) for arbitrary K, and we will focus on examples of operators whose inverses are suitably related to certain differential operators. In Section 2 we will see that the boxcar convolution and the Abel-type integral operator in Wicksell's problem are in this class, and that recovery of the forcing term in certain dynamical systems is a prototype. We will, moreover, restrict ourselves to indirect density estimation. Hence we will assume ℱ to be a class of square integrable densities, and g = Kf is also supposed to be a density. The data consist of an i.i.d. sample X₁, ..., X_n from g, with generic sample element X. An estimator of f is obtained by replacing the Fourier coefficients in (1.2) with their estimators and by truncation and data-driven thresholding as in the direct case (Donoho, Johnstone, Kerkyacharian & Picard (1996), Hall & Patil (1995)).
The boxcar convolution g = 1_{[0,1]} * f, mentioned before, provides a good example of the difficulties, both statistical and analytical, that one may encounter when irregularities are involved. In Section 3 it will be shown that for smooth input functions the spectral cut-off type regularized inverse estimator does not yield optimal convergence rates of the MISE over most of the smoothness range. Furthermore, by directly expressing the convolution in terms of an indefinite integral of f, it can easily be seen that inversion boils down to a sum of shifted derivatives of the image g, provided that f has finite support. The generic way of solving a convolution, however, is by transformation to the frequency domain via the Fourier transform, where in this particular case the inverse reduces to division by the characteristic function of 1_{[0,1]}, i.e. to division by e^{it/2} sinc(t/2), where

(1.3)  sinc x := (sin x)/x, x ≠ 0;  sinc 0 := 1.
Hence we divide by a function that has zeros at 2kπ, k ∈ ℤ, k ≠ 0, and "zeros" at ±∞. It is a fair conjecture, corroborated by the recovery via the direct method in the time domain, that the latter zeros represent differentiation in the actual inversion procedure, but the interpretation of the other zeros is not at all immediate. As a referee pointed out, however, a much more sophisticated kind of harmonic analysis has been developed by S. Mallat and his students to deal with such transfer functions. This analysis involves wavelet bases with dyadic decomposition tailored to the problem at hand. Here we propose a different approach. Convolutions are a very important class of operators and it would be interesting to classify their inverses exploiting the properties in the frequency domain. This is still too ambitious, but in Section 4 inverses are obtained for a subclass that contains the boxcar. Finally, in Section 5 we compute the MISE for indirect density estimators constructed by means of the wavelet method (1.2). As has already been observed above, we will restrict ourselves to operators K for which K^{−1} is a kind of differential operator. Calculations for the MISE for both smooth and discontinuous input functions can be patterned on those in Hall & Patil (1995). Due to space limitations we will restrict ourselves to smooth input functions, and to how to obtain optimal rates in the boxcar example.

2 Examples
It will often be convenient to precondition and replace the original equation with an equivalent equation

(2.1)  p := Tg = TKf =: Rf,  f ∈ L₂(ℝ),

where T : L₂(ℝ) → L₂(ℝ) is a bounded injective operator that we can choose at our convenience. Setting T = K* would yield a strictly positive Hermitian
operator R. In Example 2.2 preconditioning with K itself will be convenient. Preconditioning will not in general introduce extra, undue ill-posedness.
Example 2.1 Let us first consider the boxcar convolution and assume that f has support in a compact interval [A, B]. If F is an indefinite integral of f, it is immediate that g(x) = (1_{[0,1]} * f)(x) = F(x) − F(x−1), which entails g′(x) = f(x) − f(x−1). Choosing an integer I with I > B − A − 1, it follows that

(2.2)  f(x) = Σ_{i=0}^{I} g′(x − i),  x ∈ [A, B].
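The telescoping behind (2.2) can be checked numerically; the triangle density, grid, and choice of I below are hypothetical, not from the paper:

```python
import numpy as np

# Boxcar deconvolution via shifted derivatives, cf. (2.2):
# with f supported on [A, B] = [0, 2] and g = 1_[0,1] * f, the sum
# sum_{i=0}^{I} g'(x - i) telescopes to f(x) - f(x - I - 1) = f(x) on [A, B].
A, B, I = 0.0, 2.0, 2                    # I > B - A - 1
f = lambda x: np.maximum(1.0 - np.abs(x - 1.0), 0.0)   # triangle density

xs = np.linspace(-2.0, 6.0, 8001)
dx = xs[1] - xs[0]
F = np.cumsum(f(xs)) * dx                # numerical antiderivative of f
g = F - np.interp(xs - 1.0, xs, F)       # g(x) = F(x) - F(x - 1)
gprime = np.gradient(g, dx)

f_rec = sum(np.interp(xs - i, xs, gprime) for i in range(I + 1))
mask = (xs >= A) & (xs <= B)
err = np.max(np.abs(f_rec[mask] - f(xs[mask])))
assert err < 0.02
```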
The condition on f may be weakened at the cost of more involved technicalities.

Example 2.2 Next let us consider Wicksell's unfolding problem, where the density of the radii of not directly observable spherical particles is to be recovered from a sample of planar cuts. The relation between the density f of the squares of the radii of the spheres and the density g of the squares of the observed radii of the cuts is given by

(2.3)  g(x) = μ ∫_x^1 f(y)/√(y − x) dy =: μ(Kf)(x),  0 < x < 1,  f ∈ L₂([0, 1]),
assuming that the radii of the spheres are all smaller than 1, and where μ is a constant that for simplicity we will assume to be known. This model plays a role in stereology and medicine. For some recent results we refer to Nychka, Wahba, Goldfarb & Pugh (1984), Silverman, Jones, Wilson & Nychka (1990), and Groeneboom & Jongbloed (1995). We have already mentioned that inversion of (2.3) is included in Donoho (1995) as a special case. Yet we want to include it here to show that it also fits in our framework. This is due to the circumstance that the operator K represents a fractional integration, meaning that preconditioning with T = K yields the equivalent equation p := Kg = μK²f =: μRf, where

(2.4)  (Rf)(x) = π ∫_0^1 1_{[x,1]}(y) f(y) dy = π{F(1) − F(x)},  0 < x < 1,

where F is an indefinite integral of f. This means that

(2.5)  f(x) = −(μπ)^{−1} p′(x),  0 < x < 1.
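The fractional-integration identity behind (2.4), K²f = π(F(1) − F(·)), can be verified numerically; the substitution y = x + u² removes the square-root singularity. The test function f(y) = y and the quadrature parameters below are hypothetical choices:

```python
import numpy as np

# Abel-type operator of (2.3): (Kf)(x) = integral_x^1 f(y) / sqrt(y - x) dy.
# Substituting y = x + u^2 gives 2 * integral_0^sqrt(1-x) f(x + u^2) du.
def K(f, x, m=1500):
    L = np.sqrt(1.0 - x)
    u = (np.arange(m) + 0.5) / m * L      # midpoint rule on [0, L]
    return 2.0 * L / m * np.sum(f(x + u * u))

f = lambda y: y                            # test input, with F(y) = y^2 / 2
Kf = np.vectorize(lambda y: K(f, y))       # so K can be applied a second time
x0 = 0.3
lhs = K(Kf, x0)                            # (K^2 f)(x0)
rhs = np.pi * (0.5 - 0.5 * x0**2)          # pi * (F(1) - F(x0)), cf. (2.4)
assert abs(lhs - rhs) < 1e-2
```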
Example 2.3 Let D^j be the j-th derivative and consider a dynamical system driven by the differential equation Σ_{j=0}^{J} c_j (D^j g)(x) = b f(x), x > 0, under the usual initial conditions and conditions on the given numbers b, c_0, ..., c_J. The forcing term f is unknown and is to be recovered from g. Again f is supposed to have support in a compact interval [A, B] ⊂ [0, ∞). Although in this case no real inversion is involved, the relationship

(2.6)  f(x) = b^{−1} Σ_{j=0}^{J} c_j g^{(j)}(x),  x > 0,
between f and g still involves an unbounded operator. In practice the noisy data on g usually lead to a regression model. We intend to show that in all three examples the exact relation between f and g is of the form f = Vg, where V is a differential operator of the following type:

(2.7)  (Vg)(x) = Σ_{j=0}^{J} Σ_{i=0}^{I} c_{ji}(x) (D^j g)(x − a_{ji}),  x ∈ ℝ.
Here I and J are nonnegative integers, each a_{ji} is a real number, and each c_{ji} is a real-valued compactly supported function with a continuous j-th derivative. (I, J, a_{ji} and c_{ji} may depend on the interval [A, B] but not on the specific
f.)

Example 2.1 is easy. Letting I be larger than (B+1) − (A−1) − 1, we have f(x) = Σ_{i=0}^{I} g′(x − i) for all x ∈ [A−1, B+1]. Hence, if we choose any continuously differentiable function c : ℝ → ℝ with c(x) = 1 for x ∈ [A, B] and c(x) = 0 for x ∉ [A−1, B+1], we find

(2.8)  f(x) = Σ_{i=0}^{I} c(x) g′(x − i),  x ∈ ℝ.
(3.6)  sup_{f∈F_ν} E ‖f̂ − f‖² ≍ n^{−(2ν−1)/(2ν+1)} for 1/2 < ν < 1, and ≍ n^{−1/3} for ν > 1.
Next let us adopt the lower bound to the minimax MISE in van Rooij & Ruymgaart (1998), and let 𝒯 denote the class of all L₂(ℝ)-valued estimators with finite expected squared norm. The symbol C will be used as a generic constant. In the present case we then arrive at

(3.7)  inf_{T∈𝒯} sup_{f∈F_ν} E ‖T − f‖² ≥ C ∫_{−∞}^{∞} (1 + t²)^{−ν} (1 + n |sinc(t/2)|²)^{−1} dt,  ν > 1.
These results imply that for ν > 1, i.e. for most of the smoothness range, the convergence rate of the MISE for the spectral cut-off type estimators is suboptimal. It should be noted that the smoothness class F_ν is somewhat different from the smoothness classes in terms of derivatives that one usually finds in the literature. See also Section 5. For regular kernels spectral cut-off estimators in general obtain the optimal rate (van Rooij & Ruymgaart (1996)). A regular kernel has a Fourier transform that decays monotonically to zero in the tails. In such a regular case the summation in the third line of (3.5), which is due to the oscillations of Δ between its zeros, would not have been present, and the optimal rate would indeed have emerged.
4 Exact inverses of certain irregular convolution operators
In this section we present some convolution operators K that have inverses of the type described in (2.7). Our function f has its support contained in a compact interval [A, B].

Example 4.1 First, let K be convolution with a kernel of the type

(4.1)  k(x) = Σ_{n=1}^{N} c_n 1_{[s_{n−1}, s_n]}(x),  x ∈ ℝ,

where c_1, ..., c_N and s_0, s_1, ..., s_N are given numbers and

(4.2)  c_1 = 1,  0 = s_0 < s_1 < ... < s_N.
Set γ_n = c_n − c_{n+1} for n = 1, ..., N−1 and γ_N = c_N. If F is an indefinite integral of f, then for x ∈ ℝ we have

(4.3)  g(x) := (k * f)(x) = Σ_{n=1}^{N} c_n ∫_{s_{n−1}}^{s_n} f(x − t) dt = Σ_{n=1}^{N} c_n (F(x − s_{n−1}) − F(x − s_n)) = F(x) − Σ_{n=1}^{N} γ_n F(x − s_n).
Thus,

(4.4)  g′ = f − Σ_{n=1}^{N} γ_n f(· − s_n) = f − μ * f,
if we define μ to be the real-valued measure concentrated on the finite set {s_1, ..., s_N} with μ({s_n}) = γ_n for each n. Let μ*¹, μ*², ... be the convolution powers of μ, i.e., μ*¹ := μ, μ*^{(m+1)} := μ * μ*^{m} for m ∈ ℕ. It follows from (4.4) that

(4.5)  f = g′ + Σ_{m=1}^{M−1} μ*^{m} * g′ + μ*^{M} * f

for all M. Each μ*^{m} is concentrated on a finite subset of [m s_1, ∞), and, of course, m s_1 → ∞ as m → ∞. It follows that it makes sense to speak of the infinite sum Σ_{m=1}^{∞} μ*^{m}, and that there exist numbers λ_1, λ_2, ... and 0 < t_1 < t_2 < ... with

(4.6)  Σ_{m=1}^{∞} μ*^{m}(S) = Σ_{i=1}^{∞} λ_i 1_S(t_i),  S ⊂ ℝ bounded.
As f has compact support, it also follows that μ*^{m} * f → 0 a.e., as m → ∞, so that, by (4.5),

(4.7)  f = g′ + Σ_{m=1}^{∞} μ*^{m} * g′ = Σ_{i=0}^{∞} λ_i g′(· − t_i),

where we introduce λ_0 := 1, t_0 := 0. Actually, f is supported by [A, B], and g, which is k * f, by [A, B + s_N]. Therefore, if I is a nonnegative integer with t_I > B − A + 1, then g′(x − t_i) = 0 if i > I and x ∈ [A−1, B+1], so that f = Σ_{i=0}^{I} λ_i g′(· − t_i) on [A−1, B+1]. Choosing an infinitely differentiable c : ℝ → ℝ with c(x) = 1 for x ∈ [A, B] and c(x) = 0 for x ∉ [A−1, B+1], we see that

(4.8)  f(x) = Σ_{i=0}^{I} λ_i c(x) g′(x − t_i),  x ∈ ℝ,
a formula of the type described in (2.7).

Example 4.2 It is not difficult to generalize the above. Clearly, the condition c_1 = 1 may be replaced by c_1 ≠ 0 without noticeable harm. The condition s_0 = 0 is not serious, either. Indeed, let k be as in (4.1) (without having s_0 = 0). Let K_0 be convolution with k_0 := k(· + s_0). Then k_0 is of the type considered above, and there exist I ∈ ℕ, numbers λ_0, ..., λ_I and t_0 < ... < t_I, and a compactly supported, infinitely differentiable function c for which

(4.9)  f(x) = Σ_{i=0}^{I} λ_i c(x) (K_0 f)′(x − t_i) = Σ_{i=0}^{I} λ_i c(x) g′(x + s_0 − t_i),
which brings us back to (2.7).

Example 4.3 The preceding generalizes easily to a suitable class of spline functions. Suppose s_0 < s_1 < ... < s_N, J ∈ ℕ, and let k : ℝ → ℝ be such that k vanishes identically outside the interval [s_0, s_N] and has J − 1 continuous derivatives, whereas for each n ∈ {1, ..., N} the restriction of k to [s_{n−1}, s_n] is a polynomial of degree at most J. Assume k ≢ 0. The function k has a J-th derivative at all points except possibly s_0, s_1, ..., s_N, and k^{(J)} is constant on each (s_{n−1}, s_n). For each j ∈ {1, ..., J}, k^{(j−1)} is an indefinite integral of k^{(j)}. Hence, k^{(J)} ≢ 0. By the previous examples, there exist I ∈ ℕ, numbers λ_0, ..., λ_I and t_0 < ... < t_I, and a compactly supported, infinitely differentiable function c such that

(4.10)  f(x) = Σ_{i=0}^{I} λ_i c(x) g^{(J+1)}(x − t_i),  x ∈ ℝ,
and again we have an instance of (2.7).
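Examples 4.1–4.3 rest on accumulating the convolution powers of μ into the atoms (λ_i, t_i) of (4.6) and applying (4.7). The sketch below does this for a hypothetical two-step kernel k = 1_{[0,1]} + 0.5·1_{[1,2]} (all numerical choices are illustrative):

```python
import numpy as np
from collections import defaultdict

# Two-step kernel k = 1_[0,1] + 0.5*1_[1,2]: gamma_1 = gamma_2 = 0.5, so
# mu = 0.5*delta_1 + 0.5*delta_2. Accumulate sum_m mu^{*m} for t <= T, cf. (4.6).
mu, T = {1.0: 0.5, 2.0: 0.5}, 8.0
lam = defaultdict(float, {0.0: 1.0})       # lambda_0 = 1 at t_0 = 0, cf. (4.7)
power = dict(mu)                           # current convolution power mu^{*m}
while power:
    nxt = defaultdict(float)
    for t, w in power.items():
        lam[t] += w
        for s, v in mu.items():            # build mu^{*(m+1)}
            if t + s <= T:
                nxt[t + s] += w * v
    power = dict(nxt)

# Verify (4.7): f = sum_i lambda_i g'(. - t_i), for f supported on [0, 2].
xs = np.linspace(-1.0, 12.0, 13001)
dx = xs[1] - xs[0]
f = np.maximum(1.0 - np.abs(xs - 1.0), 0.0)            # triangle on [0, 2]
F = np.cumsum(f) * dx
g = F - 0.5 * np.interp(xs - 1.0, xs, F) - 0.5 * np.interp(xs - 2.0, xs, F)
gp = np.gradient(g, dx)                                 # g' on the grid
f_rec = sum(w * np.interp(xs - t, xs, gp) for t, w in lam.items())
mask = (xs >= 0.0) & (xs <= 2.0)
max_err = np.max(np.abs(f_rec[mask] - f[mask]))
assert max_err < 0.05
```

Atoms with t_i beyond the support of g contribute nothing on [A, B], which is why the truncation at T is harmless here.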
5 Estimation and MISE when the wavelet expansion is used
Let W_J denote the class of all compactly supported functions on ℝ with J continuous derivatives. Given any χ ∈ W_J we write

(5.1)  χ_{m,k}(x) := 2^{m/2} χ(2^m x − k),  x ∈ ℝ, m ∈ ℤ, k ∈ ℤ.
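The normalization in (5.1) makes χ ↦ χ_{m,k} an L₂ isometry; a quick check with an arbitrary compactly supported bump (all choices hypothetical):

```python
import numpy as np

# chi_{m,k}(x) = 2^{m/2} * chi(2^m x - k), cf. (5.1); the factor 2^{m/2}
# keeps the L2 norm of chi_{m,k} equal to that of chi.
chi = lambda x: np.where(np.abs(x) < 1.0,
                         np.exp(-1.0 / np.maximum(1.0 - x**2, 1e-12)), 0.0)
xs = np.linspace(-8.0, 8.0, 1_600_001)
dx = xs[1] - xs[0]
norm0 = np.sum(chi(xs)**2) * dx

errs = []
for m, k in [(0, 3), (2, -5), (5, 17)]:
    chimk = 2.0**(m / 2.0) * chi(2.0**m * xs - k)
    errs.append(abs(np.sum(chimk**2) * dx - norm0))
assert max(errs) < 1e-4
```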
Let φ be a scaling function and ψ the corresponding wavelet. For our purposes the wavelet ψ ∈ W_J must satisfy the additional property ∫_{−∞}^{∞} x^j ψ(x) dx = 0, j = 0, ..., r−1, for some r ∈ ℕ (see below). The resulting orthonormal wavelet basis is {ψ_{m,k} : (m, k) ∈ ℤ × ℤ}. The existence of a wavelet with all these properties is shown in Daubechies (1992). At a given resolution level M ∈ ℤ the low frequency elements can be combined in the usual way to yield the orthonormal system {φ_{M,k} : k ∈ ℤ}, which can be complemented to an orthonormal basis of L₂(ℝ) by adding the system of high frequency wavelets {ψ_{m,k} : m > M, k ∈ ℤ}. Restricting ourselves in this section to operators K with K^{−1} = V as in (2.7), we see from (1.2) that we will need the adjoint V*. The domain of V* contains W_J, and for χ ∈ W_J we have

(5.2)  V*χ = Σ_{j=0}^{J} Σ_{i=0}^{I} (−1)^j ((c_{ji} χ)^{(j)})(· + a_{ji}) = Σ_{j=0}^{J} Σ_{i=0}^{I} d_{ji} χ^{(j)}(· + a_{ji}),
for certain, easily obtained continuous functions d_{ji} that have compact supports. Let us write, for brevity, f_{M,k} := ⟨f, φ_{M,k}⟩ and f_{m,k} := ⟨f, ψ_{m,k}⟩, so that we have the expansion

f = Σ_{k=−∞}^{∞} f_{M,k} φ_{M,k} + Σ_{m=M+1}^{∞} Σ_{k=−∞}^{∞} f_{m,k} ψ_{m,k}.
Since X has density g = Kf, we have E(V*χ)(X) = ⟨g, V*χ⟩ = ⟨Vg, χ⟩ = ⟨f, χ⟩, so that

(5.3)  f̂_{M,k} := n^{−1} Σ_{i=1}^{n} (V*φ_{M,k})(X_i),  f̂_{m,k} := n^{−1} Σ_{i=1}^{n} (V*ψ_{m,k})(X_i),

are unbiased estimators of f_{M,k} and f_{m,k}. We are now in a position to present the general form of the wavelet-type inverse estimator
(5.4)  f̂_{M,ν,δ} := Σ_{k=−∞}^{∞} f̂_{M,k} φ_{M,k} + Σ_{m=M+1}^{M+ν} Σ_{k=−∞}^{∞} f̂_{m,k} 1{|f̂_{m,k}| > δ} ψ_{m,k},
for M ∈ ℤ, ν ∈ ℕ, and threshold δ > 0. This kind of estimator has been introduced by Donoho (1995) in an inverse model and by Donoho, Johnstone, Kerkyacharian & Picard (1996) and Hall & Patil (1995) in a direct model. The difference is that here the estimated Fourier coefficients contain the exact inverse of the operator. In asymptotics the parameters M, ν, and δ will depend on n. Let us write

(5.5)  f_{M,ν,δ} := E f̂_{M,ν,δ}.
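The unbiasedness identity E(V*χ)(X) = ⟨f, χ⟩ underlying (5.3) can be checked by Monte Carlo in the boxcar case of Example 2.1; the density, cutoff c, test function χ, and sample size below are all hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Boxcar case: f = triangle density on [0, 2], g = 1_[0,1] * f, so a draw
# from g is V1 + V2 + U with V1, V2, U iid Uniform[0, 1].
n = 200_000
X = rng.random(n) + rng.random(n) + rng.random(n)

def c(x):                                   # C^1 cutoff: 1 on [0,2], 0 outside [-1,3]
    s = lambda t: np.clip(t, 0.0, 1.0)**2 * (3.0 - 2.0 * np.clip(t, 0.0, 1.0))
    return s(x + 1.0) * s(3.0 - x)

chi = lambda x: np.exp(-(x - 1.0)**2)       # smooth test function chi

# Adjoint for the boxcar, cf. (2.8)/(5.2): (V*chi)(y) = -sum_i (c*chi)'(y + i).
xs = np.linspace(-2.0, 4.0, 6001)
d_cchi = np.gradient(c(xs) * chi(xs), xs[1] - xs[0])
Vstar_chi = lambda y: -sum(np.interp(y + i, xs, d_cchi, left=0.0, right=0.0)
                           for i in range(5))           # I = 4 > (B+1)-(A-1)-1

lhs = Vstar_chi(X).mean()                   # estimates <g, V*chi> = <Vg, chi>
f = lambda x: np.maximum(1.0 - np.abs(x - 1.0), 0.0)
rhs = np.sum(f(xs) * chi(xs)) * (xs[1] - xs[0])         # <f, chi>
assert abs(lhs - rhs) < 0.02
```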
We will consider the details for the asymptotics of the MISE under the assumption that f is in the class F_r′ of all functions on ℝ that have r ∈ ℕ square integrable derivatives. Results for nonsmooth f could likewise be patterned on Hall & Patil (1995) but require more technicalities that cannot be presented here due to space limitations. For the present choice of wavelet we have
(5.6)  Σ_{m=M+1}^{∞} Σ_{k=−∞}^{∞} f_{m,k}² = O(2^{−2rM}), as M → ∞,  f ∈ F_r′.
For such smooth functions there is no need to include the high frequency terms in the estimator, which then reduces to

(5.7)  f̂_M := Σ_{k=−∞}^{∞} f̂_{M,k} φ_{M,k},  with f_M := E f̂_M.
By (5.6) and because f̂_{M,k} is an unbiased estimator of f_{M,k}, it follows that the MISE equals

E ‖f̂_M − f‖² = E ‖f̂_M − f_M‖² + ‖f_M − f‖² = Σ_{k=−∞}^{∞} Var f̂_{M,k} + O(2^{−2rM}).
It will be convenient to set \Phi_M(x) := \sum_{j=0}^{J} \sum_{i} \delta_{ji}\,|\psi^{(j)}(x + 2^M a_{ji})|, x \in R, M \in Z, where for each j and i, \delta_{ji} is the maximal absolute value of the function d_{ji} introduced in (5.2). As \psi is continuous and compactly supported, there is a constant C with \sum_{k=-\infty}^{\infty} \Phi_M^2(x - k) \le C for all x and M. It follows that the MISE is of order O(n^{-2r/(2r+2J+1)}), as n \to \infty, provided that we choose M = M(n) so that 2^M \asymp n^{1/(2r+2J+1)}.
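The rate just stated follows from a routine bias-variance balance; taking the variance term to be of order 2^{(2J+1)M}/n (an assumption consistent with Example 5.1 below, where J = 1) and the bias from (5.6), one may sketch:

```latex
\text{MISE} \;\asymp\; \underbrace{\frac{2^{(2J+1)M}}{n}}_{\text{variance}}
\;+\; \underbrace{2^{-2rM}}_{\text{bias}},
\qquad
\frac{2^{(2J+1)M}}{n} \asymp 2^{-2rM}
\;\Longrightarrow\; 2^{M} \asymp n^{1/(2r+2J+1)}
\;\Longrightarrow\; \text{MISE} = O\!\left(n^{-2r/(2r+2J+1)}\right),
```

which reduces to O(n^{-2r/(2r+3)}) when J = 1.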
Example 5.1 Let us return once more to the boxcar convolution. If we do not want to make the assumption that f (and hence g) have bounded support, we obtain an infinite sum on the right in (2.2). Because \psi has compact support, however, V^*\psi_{m,k} will involve only a fixed finite number of terms for all k, m > M, and any given number M. Apparently J = 1, and for f \in F_r the asymptotic order of the MISE equals O(n^{-2r/(2r+3)}). In order to compare with (3.7) for f \in F_\nu, we should take \nu = r + 1, and it follows that the wavelet-type estimators attain the optimal rate for any number of derivatives and hence are superior to the regularized-inverse type estimators in (3.6).

Remark 5.1 A more general situation arises when preconditioning is applied. In Example 2.2, for instance, it is only after preconditioning that we arrive at an operator R with inverse of type (2.7). Expansion (1.2) generalizes to

f = \sum_{\lambda\in\Lambda} (f, \psi_\lambda)\,\psi_\lambda = \sum_{\lambda\in\Lambda} (R^{-1}g, \psi_\lambda)\,\psi_\lambda = \sum_{\lambda\in\Lambda} (g, (R^{-1})^*\psi_\lambda)\,\psi_\lambda = \sum_{\lambda\in\Lambda} (g, T^*V^*\psi_\lambda)\,\psi_\lambda.

For f \in F_r the estimator will again be given by
\hat f_M = \sum_{k=-\infty}^{\infty} \hat f_{M,k}\,\varphi_{M,k}, \quad \hat f_{M,k} := \frac{1}{n}\sum_{i=1}^{n} (T^*V^*\varphi_{M,k})(X_i).
The extra ill-posedness contained in V^* due to the preconditioning would be compensated by T^*, which is a smoothing operator. In the calculation of the MISE this should be reflected in the order of the variance of \hat f_{M,k}. We will not consider this point here.

Remark 5.2 The main difficulty with the boxcar convolution is the zeros of the characteristic function of its kernel, which prevent us from conveniently dealing with the deconvolution in the frequency domain. Many kernels have characteristic functions that don't have any zeros and that decay monotonically in the tails. The Fourier coefficients in expansion (1.2) can then be computed in the frequency domain. In fact we have
(5.12)  f = \sum_{\lambda\in\Lambda} \big(Fg, F(K^{-1})^*\psi_\lambda\big)\,\psi_\lambda = \sum_{\lambda\in\Lambda} \Big(Fg, \frac{1}{\overline{Fk}}\,F\psi_\lambda\Big)\,\psi_\lambda.
For suitable wavelets the Fourier coefficients can be unbiasedly estimated by substituting the empirical characteristic function, suitably normalized, for Fg. All that matters for calculation of the MISE along these lines is that the order of the variance of these Fourier coefficients can be obtained with sufficient accuracy.
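For instance, when the kernel is Gaussian (a characteristic function without zeros, decaying monotonically in the tails; the kernel, sample size, and evaluation point here are illustrative assumptions), the empirical characteristic function divided by Fk gives an unbiased estimate of Ff at each t:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200_000, 0.5
Z = rng.standard_normal(n)               # unobserved, f = N(0,1), so Ff(t) = exp(-t^2/2)
X = Z + sigma * rng.standard_normal(n)   # observed, g = f convolved with N(0, sigma^2)

def Ff_hat(t):
    # empirical characteristic function of X divided by the zero-free Fk
    return np.mean(np.exp(1j * t * X)) / np.exp(-sigma ** 2 * t ** 2 / 2)

t = 0.5
err = abs(Ff_hat(t) - np.exp(-t ** 2 / 2))
print(err < 0.05)   # -> True: the estimate is close to the true Ff(t)
```

The variance of this estimator at a fixed t is of order n^{-1} exp(sigma^2 t^2), which is the kind of variance control the remark refers to.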
Acknowledgements. The authors are indebted to Roger Barnard for some very helpful discussions and to the referees for useful comments. The second author gratefully acknowledges the hospitality of the Australian National University.
REFERENCES

Brockett, R. & Mesarovic, M. (1965). The reproducibility of multivariate systems. J. Math. Anal. Appl. 11, 548-563.

Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.

Donoho, D.L. (1995). Nonlinear solution of linear inverse problems by wavelet-vaguelette decomposition. Appl. Comput. Harmon. Anal. 2, 101-126.

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. & Picard, D. (1996). Density estimation by wavelet thresholding. Ann. Statist. 24, 508-539.

Donoho, D.L. & Low, M.G. (1992). Renormalization exponents and optimal pointwise rates of convergence. Ann. Statist. 20, 944-970.

Groeneboom, P. & Jongbloed, G. (1995). Isotonic estimation and rates of convergence in Wicksell's problem. Ann. Statist. 23, 1518-1542.

Hall, P. & Patil, P. (1995). Formulae for mean integrated squared error of nonlinear wavelet-based density estimators. Ann. Statist. 23, 905-928.

Halmos, P.R. (1963). What does the spectral theorem say? Amer. Math. Monthly 70, 241-247.

Nychka, D., Wahba, G., Goldfarb, S. & Pugh, T. (1984). Cross-validated spline methods for the estimation of three-dimensional tumor size distributions from observations on two-dimensional cross sections. J. Amer. Statist. Assoc. 79, 832-846.

Riesz, F. & Sz.-Nagy, B. (1990). Functional Analysis. Dover, New York.
Peter Hall, Frits Ruymgaart, Onno van Gaans, and Arnoud van Rooij
van Rooij, A.C.M. & Ruymgaart, F.H. (1996). Asymptotic minimax rates for abstract linear estimators. J. Statist. Plann. Inference 53, 389-402.

van Rooij, A.C.M. & Ruymgaart, F.H. (1998). On inverse estimation. To appear in Asymptotics, Nonparametrics, and Time Series (S. Ghosh, Ed.), Dekker, New York.
van Rooij, A.C.M., Ruymgaart, F.H. & van Zwet, W.R. (1999). Asymptotic efficiency of inverse estimators. Th. Probability Appl., to appear.

Silverman, B.W., Jones, M.C., Wilson, J.D. & Nychka, D.W. (1990). A smoothed EM approach to indirect estimation problems, with particular reference to stereology and emission tomography (with discussion). J. R. Statist. Soc. B 52, 271-324.

Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM J. Numer. Anal. 14, 651-667.

Wahba, G. & Wang, J. (1990). When is the optimal regularization parameter insensitive to the choice of the loss function? Comm. Statist. - Th. Meth. 19, 1685-1700.
DEPARTMENT OF MATHEMATICS
TEXAS TECH UNIVERSITY
LUBBOCK, TX 79409
USA
ruymg@math.ttu.edu
NOTE ON A STOCHASTIC RECURSION
DAVID SIEGMUND
Stanford University

The method of Yakir and Pollak (1998) is applied heuristically to a stochastic recursion studied by Goldie (1991). An alternative derivation of Goldie's tail approximation, with a new representation for the constant, is given, and some related results are derived.

AMS subject classifications: 60H25, 62L10.
Keywords and phrases: ARCH process, change-point, tail approximation.
1  Introduction
The stochastic recursion

(1.1)  R_n = Q_n + M_n R_{n-1}

has been studied by a number of authors. See, for example, Kesten (1973) and Goldie (1991), who obtained an expression for the tail behavior of the stationary distribution of (1.1), and de Haan, Resnick, Rootzen and de Vries (1989), who as an application of Kesten's result obtained inter alia the asymptotic distribution of max(R_1, ..., R_m). In these studies it is assumed that (M_1, Q_1), (M_2, Q_2), ... are independent, identically distributed and satisfy

(1.2)  P\{M_n > 0\} = 1, \quad E(\log M_n) < 0, \quad P\{M_n > 1\} > 0,

along with other technical conditions. One motive for studying (1.1) is to obtain information about the ARCH(1) process, which has been proposed as a model for financial time series. It is defined by the recursion X_n = \{\mu + \lambda X_{n-1}^2\}^{1/2} e_n, where e_1, e_2, ... are independent standard normal random variables. The process X_n^2 is a special case of (1.1) with Q_n = \mu e_n^2, M_n = \lambda e_n^2. See Embrechts, Klüppelberg and Mikosch (1997) for an excellent introduction to these and related ideas, and their applications. The special case of (1.1) having Q_n = 1 and E(M_n) = 1 has also been studied in numerous papers involving change-point detection, e.g., Shiryayev (1963), Pollak (1985, 1987). This research was supported by The National Science Foundation.
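The claim that X_n^2 satisfies (1.1) can be verified directly; the parameter values and run length in this sketch are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, lam = 1000, 1.0, 0.5          # illustrative parameter values
e = rng.standard_normal(n + 1)

# ARCH(1): X_n = (mu + lam * X_{n-1}^2)^{1/2} * e_n
X = np.zeros(n + 1)
for k in range(1, n + 1):
    X[k] = np.sqrt(mu + lam * X[k - 1] ** 2) * e[k]

# R_n = X_n^2 then satisfies (1.1) with Q_n = mu*e_n^2 and M_n = lam*e_n^2
R = np.zeros(n + 1)
for k in range(1, n + 1):
    Q_k, M_k = mu * e[k] ** 2, lam * e[k] ** 2
    R[k] = Q_k + M_k * R[k - 1]

print(np.allclose(R[1:], X[1:] ** 2))   # -> True
```

Note that E(log M_n) = log(lam) + E(log e_n^2) < 0 here, so condition (1.2) holds and a stationary solution exists.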
The process R_n defined by (1.1) behaves similarly to the solution of the recursion

(1.3)  \log(R_n) = [\log(R_{n-1}) + \log(M_n)]^+,

which plays an important role in queueing theory and in change-point detection. The purpose of this note is to indicate the potential of a method motivated by change-point analysis (Yakir and Pollak, 1998; Siegmund and Yakir, 1999a, 1999b) and applied to processes similar to (1.3) to give insight into the results of Goldie and of de Haan, et al. The calculations are heuristic; rigorous justification appears to be a substantial undertaking. A by-product of this approach is a different and possibly more satisfactory expression for the constant C in equation (2.2) below (cf. (3.6)).

2  The Kesten/Goldie approximation
Let R denote the stationary solution of (1.1). For precise conditions under which this solution exists, see Vervaat (1979). Under the conditions (1.2), if M_n has (positive) moments of all orders there will by convexity exist a unique θ > 0 such that

(2.1)  E(M_n^θ) = 1.
We assume that such a θ exists and that I = θ E[M_n^θ \log(M_n)] is well defined and finite. The parameter I is the Kullback-Leibler information for testing the original distribution of Y_n = \log(M_n) against the alternative having relative density \exp(θ Y_n) = M_n^θ (cf. (2.1)). Kesten (1973) and Goldie (1991) showed that

(2.2)  P\{R > x\} \sim C x^{-θ}.
Although Kesten considered the more general case of a vector recursion, he did not characterize C. In the case of integer θ Goldie gave the constant C explicitly in terms of mixed integer moments of (M_n, Q_n). In general he characterized C in terms of the distribution of R itself. This characterization does not appear to be useful for evaluating C in the case of non-integral θ. Building on earlier research of Cramér, Wald and others, Feller (1972) showed how a number of results in queueing and insurance risk theory could be elegantly derived by an application of the renewal theorem to an "associated" distribution. Kesten's and Goldie's methods of proof involve clever extensions of this idea along with substantial analysis. Goldie's associated distribution will appear in the calculation given below; but the motivation behind it is entirely different, and the renewal theorem has been replaced by a simple local limit theorem.
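The root θ of (2.1) is easy to compute numerically when the moment function of M_n is available in closed form; the lognormal choice below is an illustrative assumption, not a case treated by Kesten or Goldie:

```python
# Solving E(M^theta) = 1 by bisection on the convex function psi(theta) = log E(M^theta).
# Illustrative assumption: log M ~ N(mu, s2) with mu < 0, so
# psi(theta) = theta*mu + theta^2*s2/2, with positive root -2*mu/s2.
mu, s2 = -0.5, 1.0

def psi(theta):
    return theta * mu + theta ** 2 * s2 / 2   # psi(0) = 0, psi'(0) = E(log M) < 0

lo, hi = 1e-9, 100.0                          # psi(lo) < 0 < psi(hi)
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if psi(mid) < 0:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2
print(round(theta, 6))                        # -> 1.0, matching -2*mu/s2
```

The convexity of psi, with psi(0) = 0 and psi'(0) < 0 by (1.2), is exactly what guarantees the uniqueness of the positive root.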
3  Alternative, heuristic derivation of (2.2)
Let P denote the measure of Sections 1 and 2. Let S_j = \sum_{i=1}^{j} Y_i and for j \le n, put ℓ_{j,n} = θ(S_n - S_j). Finally let the probability measure P_{j,n} be defined by

(3.1)  dP_{j,n}/dP = \exp(ℓ_{j,n}).
The change-point interpretation mentioned above is that under P_{j,n} the random variables Y_1, ..., Y_n are independent and have distribution in an exponential family with natural parameter ξ = 0 for i = 1, ..., j and ξ = θ for i = j+1, ..., n. To simplify the notation when there is no risk of confusion, I drop the subscript n and write more concisely ℓ_j and P_j. It will also be convenient to let P_j denote the extended measure under which the Y_i are independent for all -\infty < i < \infty and have distribution with parameter ξ = 0 for i \le j and ξ = θ for i > j. Putting R_{-1} = 0, one obtains from the recursion (1.1) that

(3.2)  R_n = \sum_{j=0}^{n} Q_j \exp(S_n - S_j).
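The representation (3.2) can be checked numerically against the recursion (1.1); the distributions chosen for (M_j, Q_j) below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
M = np.exp(rng.normal(-0.5, 1.0, n + 1))   # illustrative i.i.d. (M_j, Q_j)
Q = rng.exponential(1.0, n + 1)

# recursion (1.1) started from R_{-1} = 0, so R_0 = Q_0
R = np.zeros(n + 1)
R[0] = Q[0]
for k in range(1, n + 1):
    R[k] = Q[k] + M[k] * R[k - 1]

# representation (3.2): R_n = sum_j Q_j exp(S_n - S_j), S_j = log M_1 + ... + log M_j
S = np.concatenate(([0.0], np.cumsum(np.log(M[1:]))))   # S_0 = 0
rep = np.sum(Q * np.exp(S[n] - S))
print(np.isclose(R[n], rep))   # -> True
```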
If x = e^{a/θ}, then P\{R_n > x\} = P\{θ \log R_n > a\}, and the change of measure (3.1) leads, heuristically, to

(3.3)  P\{θ \log R_n > a\} \approx \sum_j E_j\big[\exp\{-θ \log R_n\};\ θ \log R_n > a\big].

Here the summation nominally extends over all i and j less than or equal to n. A rigorous proof would require showing that it suffices to sum over smaller subsets of these indices. For the moment it suffices to consider i and j such that n - j and n - i belong to the interval [a/I - εa, a/I + εa] for a suitable small positive ε; additional restrictions will be introduced below. By straightforward algebra the term indexed by j on the right hand side of (3.3) can be rewritten as
(3.4)  e^{-a}\, E_j\big[\exp\big\{-\big[ℓ_j - a + θ \log\big(\sum_i Q_i \exp(S_j - S_i)\big)\big]\big\};\ [\cdots] > 0\big],
where [\cdots] > 0 indicates that the expectation is taken over the event where the immediately preceding bracketed quantity is positive. Under P_j the random walks S_j - S_i have negative drift both for i > j and for i < j. (This is clear without calculation from the change-point interpretation, since this sum is proportional to the log likelihood ratio for testing that the change-point is at j against the alternative that it is at i; and j is the true change-point under P_j.) Hence with overwhelming probability
the exponential of these sums is close to 0 unless i is close to j, say |i - j| < c \log(a). Also, ℓ_j = θ(S_n - S_j) is the sum of approximately a/I terms; and a is assumed to be large. This means that the expressions involving S_j - S_i and that involving ℓ_j are asymptotically independent, so (3.4) behaves asymptotically like the limit as m \to \infty of the same expression with the range
of i restricted to 1, ..., m and j any integer which satisfies j \to \infty and m - j \to \infty. Hence this expectation also equals the limit of the corresponding expectation (4.1) taken under P_{j,m}. Since the limit of (4.1) is the same uniformly in j (provided j is far from 1 and from m), it also equals in the limit the average (4.2) of these expectations over j. Recalling that P_{j,m} is defined by the likelihood ratio dP_{j,m}/dP = \exp[θ(S_m - S_j)], one sees that (4.2) equals

(4.3)  m^{-1} E\big\{\big[\sum_i Q_i \exp(S_m - S_i)\big]^θ\big\}.
For the special case θ = 1, (4.3) equals E(Q_1), so the constant multiplying x^{-θ} in (3.6) is of the form given by Goldie (1991). For the special case θ = 2, (4.3) equals

m^{-1} \sum_{i=1}^{m} E[Q_i^2 \exp\{2(S_m - S_i)\}] + m^{-1}\, 2 \sum_{i<k} E[Q_i Q_k \exp\{2(S_m - S_k) + S_k - S_i\}]
= E(Q_1^2) + m^{-1}\, 2 \sum_{i<k} E(Q_1)\, E(Q_1 M_1)\, [E(M_1)]^{k-i-1}
\to E(Q_1^2) + 2 E(Q_1) E(Q_1 M_1)/[1 - E(M_1)],

which again is of the form given by Goldie (1991). Moreover, one sees that with some effort similar expansions can be obtained for arbitrary integral values of θ.
5  Distribution of max(R_1, ..., R_m)
From the tail approximation (2.2) one can use any of several methods to obtain the approximate distribution of max(R_1, ..., R_m) (e.g., de Haan et al. (1989) or Woodroofe (1976)). It may be of interest to see briefly how the present method would deal with this problem, without requiring prior knowledge of (2.2). Assume that m E[M_1^θ \log(M_1)]/\log(x) \to \infty and m x^{-θ} \to 0. This puts P\{max(R_1, ..., R_m) > x\} into the domain of large deviations. A Poisson approximation can be derived by an auxiliary argument. In terms of the probabilities P_{j,n} defined in (3.1) the argument leading to the equations (3.3)-(3.4) now yields
(5.1)  P\{\max(R_1, ..., R_m) > x\} \approx \sum_{n,j} e^{-a}\, E_{j,n}\Big\{\big(\sum_{n'} \exp[θ(S_{n'} - S_n)]\big)^{-1} \exp\big\{-\big[ℓ_{j,n} - a + θ \log\big(\max_{n'} \sum_{j'} Q_{j'} \exp(S_j - S_{j'} + S_{n'} - S_n)\big)\big]\big\};\ [\cdots] > 0\Big\}.
The observation that the important values of n' are close to n (say to within c \log a) and the important values of j' are close to j, combined with asymptotic analysis similar to that given above, yields the asymptotic approximation

(5.2)  m x^{-θ}\, E_{-\infty,n}\Big\{\frac{\max_{n'} \exp[θ(S_{n'} - S_n)]}{\sum_{n'} \exp[θ(S_{n'} - S_n)]}\Big\}\, E_{j,+\infty}\Big\{\Big[\sum_{j'} Q_{j'} \exp(S_j - S_{j'})\Big]^θ\Big\}.

Under the probability P_{-\infty,n} the independent Y_i have parameter ξ = θ for i \le n and ξ = 0 for i > n; under P_{j,+\infty} they have parameter ξ = 0 for i \le j and ξ = θ for i > j. The first expectation on the right hand side of (5.2) is in the form obtained by Yakir and Pollak (1998) and Siegmund and Yakir (1999a,b). The second is the same as that obtained above. Since the first expectation does not involve the Q_i, one can use the argument of Siegmund and Yakir (1999a) to infer its equivalence to the corresponding expression obtained by de Haan et al. (1989), or to convert it into one of the equivalent expressions given by Siegmund (1985), which are more suitable for numerical computation.

6  Discussion
The preceding calculations indicate how one might study the stochastic recursion (1.1) via the changes of measure indicated in (3.3) and (5.1). Note that this change of measure does not make use of the linear ordering of the indexing set, and hence is particularly useful for problems involving multidimensional time (e.g., Siegmund and Yakir (1999a,b)). Although the ARCH(1) process does not itself satisfy (1.1), the marginal tail probability of its stationary distribution is easily inferred from (2.2): one simply replaces x by x^2 and C by C/2. However, (5.2) requires an auxiliary argument to produce an approximation for the maximum of an ARCH(1) process. This argument is straightforward, but it seems intrinsically one dimensional; and the methods described above do not seem helpful. Let T = \min\{n : X_n > x\}. Let T_0 = \min\{n : R_n > x^2\}, and for k = 1, 2, ..., let T_k = \min\{n : n > T_{k-1}, R_n > x^2\}. Also let ν = \min\{k : e_{T_k} > 0\}. From the representation T = T_ν and the approximation (5.2) one can derive, for example by the method of Woodroofe (1976), a tail probability approximation for \max(X_1, ..., X_m). Except for some details of the calculation, this is closely related to the argument of de Haan et al. (1989). It leads to still a third constant, which is similar to the first expectation appearing in (5.2) in the sense that it is a functional of a random walk with increments Y_i. More
precisely, under the conditions of the preceding section one obtains

P\{\max(X_1, ..., X_m) > x\} \approx m\, C_1 C_2\, x^{-2θ},

where C_1 is the product of the two expectations on the right hand side of (5.2) and

C_2 = 2 \int_0^{\infty} e^{-θ x}\,[1 - E(2^{-N_x})]\,dx,

with N_x = \sum_{n=0}^{\infty} 1\{S_n > -x\}. The constant C_2 can be calculated by simulation or possibly by repeated numerical integration as follows. Let u(x) = E(2^{-N_x}) and h(x) = (1/2)^{1\{x>0\}}. Also let Q denote the operator defined by Qf(x) = E f(x + Y_1). Then u satisfies u = h \cdot Qu and can be obtained recursively as \lim_{n\to\infty} u_n, where u_0 = h and u_n = h \cdot Q u_{n-1}.

Acknowledgements I would like to thank Benny Yakir for several helpful discussions and The University of Cambridge for their hospitality.
REFERENCES

Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997). Modelling Extremal Events for Insurance and Finance, Springer-Verlag, Berlin.

Feller, W. (1972). An Introduction to Probability Theory and Its Applications, Vol. II, John Wiley and Sons, New York.

Goldie, C.M. (1991). Implicit renewal theory and tails of solutions of random equations, Ann. Appl. Probab. 1, 126-166.

Haan, L. de, Resnick, S.I., Rootzen, H. and Vries, C.G. de (1989). Extremal behaviour of solutions to a stochastic difference equation with applications to ARCH processes, Stoch. Proc. Appl. 32, 213-224.

Kesten, H. (1973). Random difference equations and renewal theory for products of random matrices, Acta Math. 131, 207-248.

Pollak, M. (1985). Optimal detection of a change in distribution, Ann. Statist. 13, 206-227.

Pollak, M. (1987). Average run lengths of an optimal method of detecting a change in distribution, Ann. Statist. 15, 749-779.

Shiryayev, A. (1963). On optimal methods in quickest detection problems, Theory Probab. Appl. 8, 22-46.

Siegmund, D. (1985). Sequential Analysis: Tests and Confidence Intervals, Springer-Verlag, New York.
Siegmund, D. and Yakir, B. (1999a). Tail probabilities for the null distribution of scanning statistics, Bernoulli, to appear.

Siegmund, D. and Yakir, B. (1999b). Approximate p-values for sequence alignments. Unpublished manuscript.

Vervaat, W. (1979). On a stochastic difference equation and a representation of non-negative infinitely divisible random variables, Adv. Appl. Probab. 11, 750-783.

Yakir, B. and Pollak, M. (1998). A new representation for a renewal-theoretic constant appearing in asymptotic approximations of large deviations, Ann. Appl. Probab. 8, 749-774.

Woodroofe, M. (1976). Frequentist properties of Bayesian sequential tests, Biometrika 63, 101-110.

DEPARTMENT OF STATISTICS
SEQUOIA HALL
STANFORD UNIVERSITY
790 SERRA ST.
STANFORD, CA 94305
USA
dos@stat.stanford.edu
ANCILLARY HISTORY
STEPHEN M. STIGLER
University of Chicago

Ancillarity has long been a shadowy topic in statistical theory. — D.R. Cox (1982).

AMS subject classifications: 62-03.
Keywords and phrases: Conditional inference, Fisher, Galton, Edgeworth, Laplace.
1  Introduction
The origin of the term "ancillary statistics" is clear and well known. It was introduced in 1925 by Ronald A. Fisher in his paper "Theory of Statistical Estimation" (Fisher, 1925); it then lay dormant for nearly a decade until Fisher returned to the topic in his "Two new properties of mathematical likelihood," which was sent to the Royal Society of London in December 1933 and published as Fisher (1934). The term arose in these two papers in Fisher's characterization of statistical information and its relationship to the likelihood function. When a single sufficient statistic existed it would contain all of the information in the sample and serve as the basis for a fully efficient estimate, that estimate to be found from differentiating the likelihood function to find the maximum. When this was not the case, auxiliary or "ancillary" information was needed and could frequently be obtained from statistics arising from looking more closely at the likelihood in the neighborhood of the maximum, in particular at the second or higher order derivatives there. Fisher expanded upon his earlier usage a year later, treating "ancillary" as a broader term of art not specifically wedded to local behavior of the likelihood function in "The Logic of Inductive Inference," read to the Royal Statistical Society on December 18, 1934 and published with somewhat acrimonious discussion as Fisher (1935). Partly as a result of this broadened view, the precise nature of the concept, and hence of its history both before and after the introduction of the term, has been elusive. In these early publications (and indeed also in later ones), Fisher explained the term most clearly by describing what "ancillary statistics" accomplished rather than what they were: They supplied auxiliary information to supplement the maximum likelihood estimate. 
In Fisher (1935) he wrote that when the best estimate fails to use all the information in the sample, when it "leaves a measurable amount of the information unutilized," he would seek to supplement the estimate to utilize that information as well. He asserted that "It
is shown that some, or sometimes all of the lost information may be recovered by calculating what I call ancillary statistics, which themselves tell us nothing about the value of the parameter, but, instead, tell us how good an estimate we have made of it. Their function is, in fact, analogous to the part which the size of our sample is always expected to play, in telling us what reliance to place on the result. Ancillary statistics are only useful when different samples of the same size can supply different amounts of information, and serve to distinguish those which supply more from those which supply less." No specific general guide was provided, although examples of their use were given, use that invariably involved conditional inference given the ancillary statistics. In 1934 Fisher had included as a prime example the estimation of the location parameter of a double exponential distribution. There the maximum likelihood estimate, the sample median, is neither sufficient nor fully efficient. "The median is an efficient estimate in the sense of the theory of large samples, for the ratio of the amount of information supplied to the total available tends to unity as the sample is increased. Nevertheless, the absolute amount lost increases without limit." (Fisher, 1934, p. 300). By conditioning upon the sample spacings — what Fisher called the sample configuration — he was able to show in great detail that the median was conditionally efficient on average, and he noted that this conclusion extended to more general location-scale families (Hinkley, 1980). A year later, Fisher (1935) illustrated the ancillarity idea through a new example, testing for homogeneity in a 2 x 2 table conditionally upon the marginal totals, an example that as we shall see introduced other complications to the discussion. In concluding that paper he indicated that ancillary statistics would be useful in the case, "of common occurrence, where there is no sufficient estimate." 
Then "the whole of the ancillary information may be recognized in a set of simple relations among the sample values, which I called the configuration of the sample." These statements were not clear to the audience at the time. The discussants who commented on this portion of his paper were distracted by other features of the example; only J. O. Irwin mentioned the term ancillary and then simply to say "it was not absolutely clear how one should define an ancillary statistic." In a few scattered comments on the term in later writings, Fisher added little by way of elaboration. Some later writers, such as Cox (1958), Cox and Hinkley (1974, pp. 31-35), Lehmann and Scholz (1992), and Welsh (1996, p. 383), have added clarity and specificity to the definition in cases such as where a minimal sufficient statistic exists; others, such as Basu (1964), Buehler (1982, with discussion), and Brown (1990, with discussion), have pointed to difficulties with the concept due to the non-uniqueness of ancillary statistics in some even well-structured parametric problems, or to
paradoxes that can arise in a decision theoretic framework. Despite these misgivings and the vagueness of the definition, the notion has come to be key to powerful ideas of conditional inference: When an ancillary statistic can be found (usually taken to be a part of a sufficient statistic whose marginal distribution does not depend upon the parameter of interest), it is best (or at least prudent) to make inferences conditional upon the value of the ancillary statistic. The goal here is not to explore the history of ancillarity subsequent to Fisher (1934, 1935), still less to attempt a rigorous and clear explication of the concept and its realm of appropriate application (for which see Fraser, 1979, Lehmann and Scholz, 1992, Lloyd, 1992, and the recent book by Barndorff-Nielsen and Cox, 1994). Rather it is to present three earlier examples that bear on the understanding of the concept, examples which may help us better understand Fisher's idea as a not-fully crystallized recognition of a common thread in a variety of problems in statistical inference.

2  Laplace and the Location Parameter Problem, 1772-1777
It is common today, even where there is disagreement about the extent and usefulness of the idea of ancillarity, to adopt as sound statistical logic some of its consequences when considering location parameter problems. For example, in making inferences about μ in a random sample from a Uniform [μ - h, μ + h] distribution with known h, where by inference we mean estimation and the assessment of the accuracy of the estimate of μ, we should condition on D = X_max - X_min, since the usual estimator (X_max + X_min)/2 must invariably lie within h - D/2 of the unknown μ. Any assessment of the accuracy of this estimator that did not condition on the observed value of D could lead to absurd results (e.g. Welsh, 1996, p. 157). More generally (for other population distributions) we should assess accuracy conditional upon the residuals or the spacings between the observations. This practice has a long and distinguished provenance. In subjecting the location parameter problem to formal treatment, notation is necessary, and the choices of notation will reflect, however imperfectly, conceptual understanding. One common choice today is to introduce a symbol for the target value, say μ, and then describe the n observations X_i in terms of μ and the errors of observation, say e_i, by X_i = μ + e_i. The distribution of errors, a probability density, is represented by φ(e), and so the likelihood function is \prod_{i=1}^{n} φ(X_i - μ). This notation reflects in principle the approach taken by some early mathematical statisticians. For example, in 1755 Thomas Simpson worked with the errors and the error distribution in showing that an arithmetic mean would improve upon a single observation. Simpson's approach in terms of errors made the inverse step to theoretical statistical inference an easier one,
as I have argued before (Stigler, 1986a, pp. 88ff.). Indeed, this approach underlies Fisher's fiducial probability and Neyman's confidence intervals. But it is not the only possible approach, nor, since the errors are not directly observable, is it even in practical matters necessarily the most natural. Others, and Laplace was a significant example, chose to frame the problem in a way where conditioning on ancillaries was much more tempting, in terms of the correction to be made to the first observation and the distances between the observations. This tendency was already present in Laplace's first serious memoir on mathematical statistics (Laplace, 1774; translated with commentary in Stigler, 1986b; see also Stigler, 1986a, pp. 105ff., and Hald, 1998, p. 176), but for present purposes it is clearer in a memoir Laplace read to the Academie des Sciences on March 8, 1777. The memoir remained unpublished until 1979 (Gillispie, 1979). Laplace's memoir is unusual in presenting two approaches to the estimation problem, from two different, clearly delineated statistical vantage points. He explained that one might address the problem of choosing a mean from either an a priori perspective (before the observations have been made), or a posteriori (after they have been made). In the latter case, the one that concerns us here, he described the problem of choosing a mean as one of "determining a function of the observations a posteriori, that is to say taking account of the respective distances between the observations" (Laplace, 1777, p. 229). He provided interesting analyses from both perspectives leading to quite different results; we focus here upon the second. Laplace began as we might now with the observations (he wrote a, a', a'', ..., where we write X_1, X_2, X_3, ...), but in one section of the memoir he reexpressed these data in a different notation.
He let x denote the correction that would be applied to the first observation to arrive at the true value; in our notation x = μ - X_1, so X_1 + x = μ. And he let q^(1), q^(2), q^(3), ... represent the distances of the second and subsequent observations from the first. We could write these as q^(i) = X_{i+1} - X_1. The likelihood function would then become φ(-x)φ(q^(1) - x)φ(q^(2) - x)···. Laplace quoted his 1774 "principe general" for reasoning to inverse probabilities (what we would now describe as Bayes Theorem with a uniform prior distribution; see Stigler, 1986a, pp. 100ff.; 1986b). He concluded that the probabilities of the different values of the correction x given the respective distances between the observations q^(1), q^(2), q^(3), ... would be proportional to this same function, φ(-x)φ(q^(1) - x)φ(q^(2) - x)···. This agrees with the result that Fisher obtained in 1934 for the case of the double exponential or Laplace density (1/2)e^{-|x|}, as the conditional distribution of the difference between the median and the location parameter given the spacings. Fisher had noted that in general this "frequency distribution ... is the mirror image of the likelihood function." (Fisher, 1934, p. 303).
As an example Laplace considered a sample from a Uniform [μ - h, μ + h] distribution, h known. He wrote, "Suppose for example that the law of facility of errors is constant and equal to K, that it is the same for all observations, and that the errors are each taken between t = -h and t = h; a^(n-1) being the time fixed by the last [largest] observation, we set a^(n-1) - M = h and N - a = h [that is, M = a^(n-1) - h and N = a + h, where a is the minimum observation]. It is clear that the true time of the phenomenon falls necessarily between the points M and N; further that the probability that each of the intermediate points will be this instant is proportional to K^n; ... and that the mean we need to choose, X, is evidently the midpoint of the line segment (a, a^(n-1)), and so in this case, to take the mean among n observations it is necessary to add to the smallest result half the difference between the smallest and the largest observations." (Laplace, 1777, p. 241) He thus concluded that the posterior distribution for the true value was Uniform [X_max - h, X_min + h], leading him to suggest the midrange (that is, the posterior mean) as a posterior estimate. Some of Laplace's language was suggestive of Fisher, particularly his conditioning upon the spacings between the observations ("en ayant egard aux distances respectives des observations entre elles"), which was echoed by Fisher's "configuration of a sample." Laplace's perspective was closer to a Bayesian analysis than a Fisherian fiducial one, but then perhaps so was Fisher's in his initial foray into likelihood-based inference in 1912, before he took great pains (not always successfully) to distinguish his approach from others from 1922 on; see Zabell (1989, 1992), Edwards (1997a,b), Aldrich (1997).
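In modern notation Laplace's a posteriori analysis is easy to check by simulation; the location, half-width, and sample size below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, h, n = 3.0, 1.0, 20           # illustrative values
X = rng.uniform(mu_true - h, mu_true + h, n)

# Given the spacings, the true value is uniform on [X_max - h, X_min + h]
lo, hi = X.max() - h, X.min() + h
midrange = (X.min() + X.max()) / 2     # the posterior mean Laplace recommends

print(lo <= mu_true <= hi)             # -> True (holds for every sample)
print(np.isclose(midrange, (lo + hi) / 2))   # midrange is the interval's midpoint
```

The width of the posterior interval is 2h - D with D = X_max - X_min, so the achievable accuracy of the midrange varies from sample to sample exactly as the conditional argument says it should.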
3  Edgeworth, Pearson, and the Correlation Coefficient
Another area in which the idea of ancillarity has been appealed to is in inference about the parameters of a bivariate normal distribution, where the values of (say) X may be treated as ancillary with respect to inference about E(Y | X) = aX + b, justifying conditioning upon the X's (or sufficient statistics for the distribution of the X's) whether the X's are random or assigned by experimental design (see, for example, Cox and Hinkley, 1974, pp. 32-33). There is interesting historical precedent for this. In 1893 Francis Edgeworth considered the estimation of the correlation ρ of n bivariate normal pairs (X_i, Y_i), assumed centered at expectations and measured in standard units, effectively marginally N(0,1) (Edgeworth, 1893; Stigler, 1986a, pp. 321-322). Of course in this case E(Y | X) = ρX. Edgeworth considered the pairs with the X's "assigned", that is he conditioned upon the X_i, so that for X not equal to zero the conditional expected value of Y/X would be ρ,
560
Stephen M. Stigler
and the conditional variance of Y/X would be proportional to 1/X². He then found the optimal weighted average of the Y/X's to be weighted by the X²'s, and he gave that as the "best" value for ρ:

Σ(XY) / Σ(X²).
Three years later, Karl Pearson attacked the problem of estimating the parameters of a bivariate normal distribution directly as a bivariate estimation problem. Approaching the problem from the standpoint of inverse probability (but in a manner mathematically equivalent to maximum likelihood estimation), he was led to the estimate of the correlation Σ(XY)/nσ₁σ₂, where he had σ₁² = Σ(X²)/n and σ₂² = Σ(Y²)/n, in the process blurring the distinction between these as statistics and as parameters (Pearson, 1896; Stigler, 1986a, pp. 342-343). Had Edgeworth similarly blurred this distinction (and to a degree he did, see Stigler, 1986a, p. 322), these estimates would seem to agree. But while Edgeworth noted this identity on several occasions, he stopped short of claiming priority. I have a reprint of Edgeworth's 1893 paper to which Edgeworth added a manuscript note after he had seen Pearson's work. He wrote, "The value of ρ which I give at p. 101 is the most accurate on the assumption that the best value is a weighted mean of y₁/x₁, y₂/x₂, ...; Prof. Karl Pearson obtains the same result without that arbitrary assumption. I have proceeded like one who having to determine the most probable value of the modulus [i.e. standard deviation], for given observations, ranging under an ordinary Probability-curve [i.e. a normal density], assumes that the quaesitum [what is desired] is a function of some mean power of errors and then proves that the most accurate result is afforded by the second power; Prof. Karl Pearson has proceeded without any such assumption. F. Y. E. 1896." Edgeworth made a similar, briefer and less specific, comment in print that same year (Edgeworth, 1896, p. 534).
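The near-agreement of the two estimates is easy to check numerically. The sketch below simulates standardized bivariate normal pairs and compares Edgeworth's conditionally weighted estimate Σ(XY)/Σ(X²) with Pearson's product-moment estimate Σ(XY)/(nσ₁σ₂), with the σ's taken as sample second moments; the correlation value, sample size, and variable names are illustrative choices of ours, not from either paper.

```python
# Compare Edgeworth's conditional estimate sum(XY)/sum(X^2) with Pearson's
# product-moment estimate sum(XY)/(n*s1*s2) on simulated standardized
# bivariate normal data (rho, n, and the seed are illustrative choices).
import math
import random

rng = random.Random(7)
rho, n = 0.6, 200000
xs, ys = [], []
for _ in range(n):
    x = rng.gauss(0, 1)
    # Y | X = x  ~  N(rho * x, 1 - rho^2)
    y = rho * x + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
    xs.append(x)
    ys.append(y)

sxy = sum(a * b for a, b in zip(xs, ys))
sxx = sum(a * a for a in xs)
syy = sum(b * b for b in ys)

edgeworth = sxy / sxx                 # average of Y/X weighted by X^2
pearson = sxy / (n * math.sqrt(sxx / n) * math.sqrt(syy / n))
print(round(edgeworth, 2), round(pearson, 2))
```

Both estimates come out near the true ρ; with unit marginal variances, Σ(X²) ≈ n ≈ nσ₁σ₂, which is exactly the near-identity Edgeworth noted.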
Edgeworth had approached the estimation of p conditionally, conditioning upon the ancillary X's, but his method of inference was not Fisherian inference: he estimated p by a weighted average (effectively using least squares conditionally given the X's) rather than conditionally employing maximum likelihood. And there is a good reason why he would not have used maximum likelihood: For his specification of the problem, with marginal means equal to zero and marginal variances equal to one, the maximum likelihood approach leads to algebraic problems; neither the Pearsonian product moment estimator nor Edgeworth's version is maximum likelihood. For that restricted setting, the maximum likelihood estimator of p is the solution of a cubic equation that resists closed form expression (Johnson and Kotz, 1972,
Ancillary History
561
p. 105). The same is true whether one proceeds conditionally given the X's (as may be sanctioned by appeal to Cox and Hinkley, 1974, pp. 34-35) or unconditionally. The difficulty stems from the fact that conditionally given the X's, not only is E(Y | X) = ρX, but the conditional variance 1 − ρ² depends upon ρ as well. Edgeworth had avoided this problem (as he noted) by restricting the form of his estimator to a weighted average; Pearson (perhaps inadvertently) had avoided it by allowing the marginal variances to vary freely in his calculation. In any case, Edgeworth seemingly took conditional inference here for granted.
4
Galton and Contingency Tables
As I noted earlier, Fisher had in his 1935 paper enlarged upon his broadened descriptive definition of ancillary statistics with a quite different example, one that involved testing, not estimation: the application of the concept of ancillary statistics to 2 x 2 tables. He presented a cross-classification based upon 30 sets of twins (Table 1), where in each pair one twin was a known criminal and the remaining twin was then classified as convicted or not. He supposed for the purposes of the example that the data were "unselected" and asked if there was evidence here that the "causes leading to conviction" had been the same for the monozygotic as for the dizygotic twins.
                 Monozygotic   Dizygotic   Total
Convicted             10            2        12
Not Convicted          3           15        18
Total                 13           17        30

Table 1. Convictions of Like-sex Twins of Criminals. Lange's data, from Fisher (1935).
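Fisher's conditional analysis of this table, described in the passage that follows, treats the margins as fixed, so that under the null hypothesis the count of convicted monozygotic twins is hypergeometric. A minimal sketch of the resulting exact computation (the variable names are ours):

```python
# Conditional (hypergeometric) test for Table 1: given the margins,
# the count a of convicted monozygotic twins has
#   P(a) = C(13, a) * C(17, 12 - a) / C(30, 12)
# under the null. One-sided p-value: tables at least as extreme as a = 10.
from math import comb

n, r1, c1 = 30, 13, 12           # total; monozygotic row total; convicted column total

def p_table(a):
    """P(convicted monozygotic = a | margins) under the null."""
    return comb(r1, a) * comb(n - r1, c1 - a) / comb(n, c1)

observed = 10
p_one_sided = sum(p_table(a) for a in range(observed, min(r1, c1) + 1))
print(round(p_one_sided, 6))     # about 0.000465
```

The tiny p-value reflects the strong concentration of convictions among the monozygotic twins.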
Fisher wrote, "To the many methods of treatment hitherto suggested for the 2 × 2 table the concept of ancillary information suggests this new one. Let us blot out the contents of the table, leaving only the marginal frequencies. If it be admitted that these marginal frequencies by themselves supply no information on the point at issue, namely as to the proportionality of the frequencies in the body of the table, we may recognize the information they supply as wholly ancillary; and therefore recognize that we are concerned only with the relative probabilities of occurrence of the different ways in which the table can be filled in, subject to these marginal frequencies." (Fisher, 1935) He went on to develop his conditional test, showing that the distribution of the table entries given the marginal totals was a hypergeometric distribution, independent of the probability of conviction under the hypothesis that this is the
same for both types of twin. Over four decades earlier, Francis Galton had faced a similar table, and his analysis sheds interesting light upon Fisher's. In his study of fingerprints, Galton had inquired as to the propensity for related individuals to have similar patterns. As part of this study he presented the data in Table 2 on relationships between the patterns on the right fore-fingers of 105 sibling pairs (Galton, 1892, pp. 172-176; Stigler, 1995; 1999, Chapter 6). To investigate the degree to which sibling pairs shared the same general pattern of fingerprint, Galton needed to test these data for evidence of association, to measure the degree to which the diagonal entries of this table exceed what they would be, absent any heritable link.
                         B children
A children          Arches   Loops   Whorls   Totals in A children
Arches                 5       12       2              19
Loops                  4       42      15              61
Whorls                 1       14      10              25
Totals in B children  10       68      27             105

Table 2. Observed fraternal couplets (Galton, 1892, p. 175). The A sibling was distinguished from the B sibling in being the first "that happened to come to hand" (Galton, 1892, p. 172; presumably no pun was intended).
Recall that this was eight years before Karl Pearson introduced the Chi-square test, and 12 years before he applied it to testing independence in cross-classifications. Focussing entirely upon the diagonal, Galton constructed his own measure by first determining what the counts would be if the prints were paired at random. Thus for the first diagonal entry he found the number 19 × 10/105 = 1.7, for the second, 61 × 68/105 = 37.6, and for the third, 27 × 25/105 = 6.2. He labeled these "Random", and considered them as the baseline for comparison with the "Observed"; see Table 3. All of the "Observed" exceeded the "Random", but was the difference to be judged large enough to reject the "Random" hypothesis? Galton constructed a scale using "Random" as the baseline and measuring how large the "Observed" were in degrees on a centesimal scale, essentially as a percent of the distance to the "Utmost feasible" as determined from the marginal totals (this being the minimum of the two corresponding marginal totals). For these data the degrees are 40°, 19°, and 20°. He made no attempt to assign a probability to such discrepancies. Galton's procedure had one element in common with Fisher's, and it was an important one. His measure was, like Fisher's, conditional upon the marginal totals. The baseline values were, in common with all analyses since Karl Pearson, computed as the expected counts under the hypothesis of random assignment — independence between each of the pair of sibling's
A and B both being   Arches   Loops   Whorls
Random                 1.7     37.6     6.2
Observed               5.0     42.0    10.0
Utmost feasible       10.0     61.0    25.0

Table 3. Galton's test of independence for the fingerprint patterns of fraternal couplets. On Galton's centesimal scale, these observed counts are 40°, 19°, and 20° above the random, higher than in other examples that were based upon a finer classification (Galton, 1892, p. 176).
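Galton's centesimal degrees in Table 3 can be reproduced directly from the tabulated values. The sketch below takes Galton's published "Random" baselines as given rather than recomputing them, and expresses each observed diagonal count as a percentage of the distance from that baseline to the "Utmost feasible" count:

```python
# Galton's centesimal scale for Table 3: each observed diagonal count is
# scored as a percent of the distance from the "Random" baseline to the
# "Utmost feasible" count (the smaller of the two marginal totals).
# Baseline values are Galton's own published figures, taken as given.
random_baseline = [1.7, 37.6, 6.2]     # Galton's "Random" counts
observed = [5.0, 42.0, 10.0]           # observed diagonal counts
utmost = [10.0, 61.0, 25.0]            # "Utmost feasible" counts

degrees = [round(100 * (o - r) / (u - r))
           for o, r, u in zip(observed, random_baseline, utmost)]
print(degrees)   # [40, 19, 20], matching Galton's 40 deg, 19 deg, 20 deg
```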
patterns. Indeed, I do not know of an earlier example of this calculation of expected values, at least for tables larger than 2 × 2, although I have not made an extensive search. But there was one point where Galton departed from Fisher's program: he expressed a principled reservation about the appropriateness of one aspect of this conditioning on the margins. When Galton introduced this approach earlier in his book he had qualified it as follows: "Now consider the opposite extreme of the closest possible relationship, subject however, and this is the weak point, to the paramount condition that the average frequencies of the A. L. W. classes may be taken as pre-established." (Galton's italics, Galton, 1892, p. 126). To Galton there was a "self-contradiction" in the assumption that the analysis proceed conditionally on the observed marginal frequencies, a contradiction that constituted a "grave objection" to his procedure. The problem was that if the relationship were perfect and all the counts fell on the diagonal, the marginal totals should agree, but they did not. The problem was particularly apparent in Galton's example, where the row and column categories were the same; indeed, they were based upon the same population and — absent sampling variation — should have agreed. But the problem holds more generally. Even in Fisher's 2 × 2 table the fact that the row totals do not equal the column totals is prima facie evidence that the relationship is not a perfect one: The margins do contain information about the degree of association! Plackett (1977) has noted this with specific reference to Fisher's data, but there is a suggestion in Fisher's wording that he realized it as well. His statement was conditional in a way that is technically correct even though misleading: "If it be admitted that these marginal frequencies by themselves supply no information ..., we may recognize the information they supply as wholly ancillary" (emphasis added).
An unsuspecting reader would read this as suggesting the supposition clearly held, and would be lured into granting the premise and so accepting the conclusion of ancillarity. For was that not the point of the example? As Plackett has shown, however, the amount of information in the margins is slight, so this conclusion is not seriously misleading in practice. On this point see Plackett (1977), and particularly
Barnard (1984) and Cox (1984). It is an extremely subtle point, and the fact that Galton picked up on it in 1892 is remarkable.

5

Conclusion
Fisher seems to have initially, in 1925, conceived of the ancillary statistics of a parametric inference problem as being that part of the likelihood function that varied from sample to sample but was not captured by the location of the maximum, more specifically as the second and higher derivatives of the likelihood at the maximum. By 1934 and 1935, with two quite different and vivid examples in hand that did not fit so easily (if at all) with his earlier conception, he broadened the definition and made it less specific — almost qualitative. Fisher had a powerful statistical intuition that worked best from exceedingly well chosen examples, and in this case his intuition led him to postulate a concept that indubitably worked well in his examples but resisted rigorous codification, just as fiducial probability has, and even as some aspects of maximum likelihood have. Laplace preceded Fisher down one of his lines, but in a different time and with a different statistical intuition he did not attempt to abstract from the location problem to more general considerations. Edgeworth may have had the best appreciation of the subtleties of statistical theory of anyone between Laplace and Fisher, but while he found it natural to use conditional inference given the ancillary X's, the problem he faced did not in his formulation yield a manageable answer without the expedient step of restricting the form of the estimator. If he had treated the more general problem it is tempting to think he might have reasoned to the Pearsonian estimator without restriction and been inspired to investigate how far the idea might be generalized. But he did not. Galton too had a statistical mind of the very first order, and he clearly noted a problem that Fisher barely hinted at, if that. Ancillary statistics were an unusual product of an extraordinary statistical mind. The breadth of the conception exceeded (or has so far exceeded) what is mathematically possible.
No single, crisply rigorous mathematical definition delivers all that Fisher promised. But if his reach exceeded his (or anyone's) grasp in this case, it was still very far from a failure. Savage has called the idea "of more lasting importance than fiducial probability" (Savage, 1976, p. 467), and while that smacks of faint praise, it need not have been. Ancillarity has led to a broad collection of procedures that travel together under the banner of conditional inference; it is an idea that has been invoked with profit to simplify, to sharpen, to improve inferences in an even broader list of applications than Fisher envisioned, and can, despite misgivings about how and when to apply it, be expected to continue to serve these roles for an indefinite future.
REFERENCES

[1] Aldrich, J., (1997). R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science 12, 162-176.
[2] Barnard, G., (1984). Contribution to discussion of Yates (1984).
[3] Barndorff-Nielsen, O.E. and Cox, D.R., (1994). Inference and Asymptotics. Chapman and Hall, London.
[4] Basu, D., (1964). Recovery of ancillary information. Sankhya (A) 26, 3-16.
[5] Brown, L.D., (1990). An ancillarity paradox which appears in multiple regression (with discussion). Annals of Statistics 18, 471-538.
[6] Buehler, R.J., (1982). Some ancillary statistics and their properties (with discussion). Journal of the American Statistical Association 77, 581-594.
[7] Cox, D.R., (1958). Some problems connected with statistical inference. Annals of Mathematical Statistics 29, 357-372.
[8] Cox, D.R., (1982). Contribution to discussion of Buehler (1982).
[9] Cox, D.R., (1984). Contribution to discussion of Yates (1984).
[10] Cox, D.R. and Hinkley, D., (1974). Theoretical Statistics. Chapman and Hall, London.
[11] Edgeworth, F.Y., (1893). Exercises in the calculation of errors. Philosophical Magazine (Fifth Series) 36, 98-111.
[12] Edgeworth, F.Y., (1896). Supplementary notes on statistics. Journal of the Royal Statistical Society 59, 529-539.
[13] Edwards, A.W.F., (1997a). Three early papers on efficient parametric estimation. Statistical Science 12, 35-47.
[14] Edwards, A.W.F., (1997b). What did Fisher mean by "inverse probability" in 1912-1922? Statistical Science 12, 177-184.
[15] Fienberg, S.E. and Hinkley, D.V., (1980). R. A. Fisher: An Appreciation. Lecture Notes in Statistics 1. Springer-Verlag, New York.
[16] Fisher, R.A., (1912). On an absolute criterion for fitting frequency curves. Messenger of Mathematics 41, 155-160. (Reprinted as Paper 1 in Fisher (1974); reprinted in Edwards (1997a).)
[17] Fisher, R.A., (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society 22, 700-725. (Reprinted as Paper 11 in Fisher (1950); reprinted as Paper 42 in Fisher (1974).)
[18] Fisher, R.A., (1934). Two new properties of mathematical likelihood. Proceedings of the Royal Society of London (A) 144, 285-307. (Reprinted as Paper 24 in Fisher (1950); reprinted as Paper 108 in Fisher (1974).)
[19] Fisher, R.A., (1935). The logic of inductive inference. Journal of the Royal Statistical Society 98, 39-54. (Reprinted as Paper 26 in Fisher (1950); reprinted as Paper 124 in Fisher (1974).)
[20] Fisher, R.A., (1950). Contributions to Mathematical Statistics. Wiley, New York.
[21] Fisher, R.A., (1974). The Collected Papers of R. A. Fisher (ed: J. H. Bennett). University of Adelaide Press, Adelaide.
[22] Fraser, D.A.S., (1979). Inference and Linear Models. McGraw-Hill, New York.
[23] Galton, F., (1892). Finger Prints. Macmillan, London.
[24] Gillispie, C.C., (1979). Mémoires inédits ou anonymes de Laplace sur la théorie des erreurs, les polynômes de Legendre, et la philosophie des probabilités. Revue d'histoire des sciences 32, 223-279.
[25] Hald, A., (1998). A History of Mathematical Statistics from 1750 to 1930. Wiley, New York.
[26] Hinkley, D.V., (1980). Fisher's development of conditional inference. In Fienberg and Hinkley (1980), 101-108.
[27] Johnson, N.L. and Kotz, S., (1972). Distributions in Statistics: Continuous Multivariate Distributions. Wiley, New York.
[28] Laplace, P.S., (1774). Mémoire sur la probabilité des causes par les événements. Mémoires de mathématique et de physique, présentés à l'Académie Royale des Sciences, par divers savans, & lus dans ses assemblées 6, 621-656. (Translation: Stigler (1986b).)
[29] Laplace, P.S., (1777). Recherches sur le milieu qu'il faut choisir entre les résultats de plusieurs observations. In Gillispie (1979), 228-256.
[30] Lehmann, E.L. and Scholz, F.W., (1992). Ancillarity. In Current Issues in Statistical Inference: Essays in Honor of D. Basu (eds: Malay Ghosh and P. K. Pathak). IMS Lecture Notes Monograph Series 17, 32-51. Institute of Mathematical Statistics, California.
[31] Lloyd, C., (1992). Effective conditioning. Australian Journal of Statistics 34, 241-260.
[32] Pearson, K., (1896). Mathematical contributions to the theory of evolution, III: regression, heredity and panmixia. Philosophical Transactions of the Royal Society of London (A) 187, 253-318. (Reprinted in Karl Pearson's Early Statistical Papers, Cambridge: Cambridge University Press, 1956, pp. 113-178.)
[33] Plackett, R.L., (1977). The marginal totals of a 2 x 2 table. Biometrika 64, 37-42.
[34] Savage, L.J., (1976). On rereading R. A. Fisher. Annals of Statistics 4, 441-500.
[35] Stigler, S.M., (1986a). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, Mass.
[36] Stigler, S.M., (1986b). Laplace's 1774 memoir on inverse probability. Statistical Science 1, 359-378.
[37] Stigler, S.M., (1995). Galton and identification by fingerprints. Genetics 140, 857-860.
[38] Stigler, S.M., (1999). Statistics on the Table. Harvard University Press, Cambridge, Mass.
[39] Welsh, A.H., (1996). Aspects of Statistical Inference. Wiley, New York.
[40] Yates, F., (1984). Tests of significance for 2 x 2 tables (with discussion). Journal of the Royal Statistical Society (Series A) 147, 426-463.
[41] Zabell, S.L., (1989). R. A. Fisher on the history of inverse probability. Statistical Science 4, 247-256.
[42] Zabell, S.L., (1992). R. A. Fisher and the fiducial argument. Statistical Science 7, 369-387.

DEPARTMENT OF STATISTICS
UNIVERSITY OF CHICAGO
5734 UNIVERSITY AVENUE
CHICAGO, IL 60637
USA
stigler@galton.uchicago.edu
ON ORDER STATISTICS CLOSE TO THE MAXIMUM
JEF L. TEUGELS
Katholieke Universiteit Leuven

We investigate the asymptotic properties of order statistics in the immediate vicinity of the maximum of a sample. The usual domain of attraction condition for the maximum needs to be replaced by a continuity condition. We illustrate the potential of the approach by a number of examples.

AMS subject classifications: 62G30, 60F99.
Keywords and phrases: Extremal law, domain of extremal attraction, extreme value statistics, geometric range of a sample, extremal quotient.
1
Introduction
Let X₁, X₂, ..., Xₙ be a sample from X with distribution F. For convenience, we assume that F(x) = P(X < x) is ultimately continuous for large values of x. We denote the order statistics of the sample by X₁* ≤ X₂* ≤ ... ≤ Xₙ*, and we write U for the tail quantile function of F, defined by U(x) := inf{y : 1 − F(y) ≤ 1/x}. Recall that F belongs to the extremal domain of attraction C_γ(g), with auxiliary function g, if for all y > 0

(1)    (U(xy) − U(x)) / g(x) → (y^γ − 1) / γ
when x ↑ ∞. If γ > 0, then g(x)/U(x) → γ; if γ < 0, U(x) ↑ x₊, the right-end point of F, and g(x)/(x₊ − U(x)) → −γ. In what follows we are interested in order statistics X*_{n−ℓ+1} that are very close to the maximum. To be more precise, we assume that n and ℓ tend to ∞ but that ℓ/n → 0. We search for a centering sequence {b_n} of real numbers and a normalizing sequence {a_n} of positive reals for which a_n^{-1}(X*_{n−ℓ+1} − b_n) converges in distribution. The solution to this kind of problem depends on the type of condition one would like to impose. One could for example quantify the dependence of ℓ on n explicitly, imposing some strong regularity conditions on ℓ. Non-normal laws are then possible as shown by Chibisov [5] and more recently by Cheng, de Haan and Huang [4]. Alternatively, conditions can be imposed on the underlying distribution F or on its tail-quantile function U. This approach has been followed by Mason and van Zwet [14] and more particularly by Falk in [6,7]. For a comprehensive treatment, see the books by Reiss [15] and Leadbetter, Lindgren and Rootzen [12]. In this paper we offer a sufficient but unifying condition to arrive at asymptotic normality of the intermediate order statistics close to the maximum. We also illustrate the potential of the condition with a variety of different examples. After developing a rationale for its introduction in the next section, the condition is described and illustrated in section 3. The remaining sections contain applications to one and two order statistics.

2

Rationale for the condition
We outline two approaches. The first is based on the Helly-Bray theorem [11] while the second proceeds along a transformation.
2.1
Helly-Bray approach
Take m to be any real-valued bounded and continuous function on ℝ. We investigate the limiting behaviour of E_n := E{m(a_n^{-1}(X*_{n−ℓ+1} − b_n))}. By a classical combinatorial argument one writes

E_n = (n! / ((n − ℓ)! (ℓ − 1)!)) ∫ m(a_n^{-1}(x − b_n)) F^{n−ℓ}(x) (1 − F(x))^{ℓ−1} dF(x).

Note that the two exponents in the integrand are tending to ∞. So we need to rewrite the integrand in such a way that both factors can be handled
simultaneously. To achieve that, we follow a procedure suggested by Smirnov [16]. Substitute 1 − F(x) = q + pv, where the sequences q = q(ℓ, n) and p = p(ℓ, n) will be determined soon. Here and in the sequel, we write q̄ = 1 − q. Then an easy calculation yields

E_n = (n! / ((n − ℓ)! (ℓ − 1)!)) p ∫ m(τ_n(v)) (q̄ − pv)^{n−ℓ} (q + pv)^{ℓ−1} dv,

where we used the abbreviation τ_n(v) := a_n^{-1}{U((q + pv)^{-1}) − b_n} for convenience. The form of the integrand suggests to take q = q(ℓ, n) = ℓ/n and p² = p²(ℓ, n) = q q̄ n^{-1}. We now follow the same approach as in [16,18]. Subdivide the integration in E_n over the three intervals (−q/p, −T), [−T, T] and (T, q̄/p), where T is a fixed quantity. It is then easy to show that, with the notations above, the central part can be controlled by the condition (A):

(A)    τ_n(v) → τ(v) uniformly on bounded v-intervals.
By taking T large enough, the two remaining pieces ultimately vanish since min(q/p, q̄/p) → ∞. This then leads to the following result.

Lemma 2.1 Under condition (A),

(2)    E{m(a_n^{-1}(X*_{n−ℓ+1} − b_n))} → (2π)^{-1/2} ∫_{−∞}^{∞} m(τ(v)) e^{−v²/2} dv.

In the subsequent applications of the lemma we have the freedom of choosing the constants a_n and b_n in such a way that condition (A) is satisfied. The choice of b_n is usually automatic. If we put v = 0 in (A), then it is almost obvious that we should take b_n = U(q^{-1}) = U(n/ℓ). Then, the choice of a_n has to be made by requiring the convergence of

(3)    τ_n(v) = a_n^{-1}{U((q + pv)^{-1}) − U(n/ℓ)}
for |v| ≤ T. Since p/q → 0, for such v we may write (q + pv)^{-1} = (n/ℓ)(1 + (p/q)v)^{-1} ∼ (n/ℓ)(1 − (p/q)v). Introduce this approximation in (3) to obtain

(4)    a_n^{-1}{U((n/ℓ)(1 − (p/q)v)) − U(n/ℓ)} → τ(v).

Let us compare this condition with the extremal condition (1). We then clearly can take x = n/ℓ which tends to ∞ by our assumptions. However, the fixed quantity y in (1) has to be replaced in (4) by a quantity that tends to 1 together with x. The resulting condition is discussed in the next section. Apart from the case where ℓ → ∞, ℓ/n → 0, there are at least two other situations. (i) First, ℓ could be taken fixed. Then results for fixed ℓ can be obtained if and only if the same result can be found for ℓ = 1. The extremal domain of attraction will play a predominant role. See for example [3,9,12,15]. (ii) If ℓ → ∞ but ℓ/n → λ ∈ (0,1), condition (4) can then be replaced by

(5)    a_n^{-1}{U(λ^{-1}(1 − (p/q)v)) − U(λ^{-1})} → τ(v).

The latter condition is a differentiability condition of U in a neighbourhood of λ^{-1} and is classical in the theory of order statistics [7,15,17]. The condition that we need should be intermediate between conditions (1) and (5).
2.2
Transformation approach
Assume that Z has a standard exponential distribution and let Z₁* ≤ Z₂* ≤ ... ≤ Zₙ* be the order statistics of a sample of size n from Z. It is well known that for this specific distribution

ℓ^{1/2}(Z*_{n−ℓ+1} − log(n/ℓ)) →_d N(0, 1)

when ℓ → ∞ and n − ℓ → ∞. See for example [15, p. 108] where the result is given for the equivalent case of uniform random variables. In order to transfer this asymptotic normality to a more general situation, we can identify X with U(e^Z) := φ(Z). This transfer function φ should then be approximately linear on intervals of size of order ℓ^{-1/2} around log(n/ℓ). More explicitly, we need a condition of the form

(6)    (φ(x + tδ(x)) − φ(x)) / (δ(x) φ′(x)) → t

when δ(x) → 0 for x → ∞. We will transform (6) into a condition on U.
3

The class C*
Motivated by the arguments above, we now introduce our working condition.

Definition 3.1 Assume that the tail quantile function U is uniformly differentiable at ∞ with ultimately positive derivative u. The distribution F belongs to C* iff for all y ∈ ℝ

(7)    (U(x + yxe(x)) − U(x)) / (e(x) x u(x)) → y

whenever e(x) → 0 when x ↑ ∞.
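Condition (7) is easy to probe numerically. The sketch below takes U(x) = x^{1/2}, an illustrative Pareto-type tail quantile function of our choosing (not an example from the paper), and checks that the ratio in (7) approaches y as e(x) → 0:

```python
# Numerical probe of condition (7):
#   (U(x + y*x*e) - U(x)) / (e * x * u(x)) -> y   as e -> 0,
# illustrated for U(x) = x**0.5 with derivative u(x) = 0.5 * x**(-0.5).
def ratio(U, u, x, y, e):
    return (U(x * (1.0 + y * e)) - U(x)) / (e * x * u(x))

U = lambda x: x ** 0.5
u = lambda x: 0.5 * x ** (-0.5)

# Shrink e toward 0 at a fixed large x and a fixed y.
vals = [ratio(U, u, 1e3, 2.0, e) for e in (1e-2, 1e-4, 1e-6)]
print([round(v, 4) for v in vals])   # tends to y = 2.0
```

Here the limit does not involve the index γ of the extremal law, which is the point of the condition.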
We first show how this condition emerges from the two approaches above. (i) From the Helly-Bray approach. Take the definition of C_γ(g) and replace y by 1 + ye(x) in (1). Then approximately

(U(x + yxe(x)) − U(x)) / (e(x) g(x)) ≈ ((1 + ye(x))^γ − 1) / (γ e(x)).

Expanding and taking limits on the right hand side yields y. Note that the quantity γ disappeared from the expression. (ii) From the transformation approach. Replace φ(x) by U(e^x) in (6); then (7) emerges naturally. Before embarking on the applicability of condition (7), we make a number of remarks.

Remark 3.1 a. Note first that (7) is satisfied if log(xu(x)) is uniformly continuous on a neighbourhood of ∞. Alternatively, xu′(x)/u(x) is bounded on such a neighbourhood. The latter condition is satisfied if the distribution F has a density that satisfies a Von Mises condition xu′(x)/u(x) → c ∈ ℝ. As such, (7) slightly generalizes the conditions given by Falk [6]. Alternatively, look in [15, p. 164].

Remark 3.2 b. The condition for C* is equivalently transformed into a condition in terms of F itself. For such comparisons in general, see [3,9].

Proposition 3.1 Assume that F has an ultimately positive density f. Then
F ∈ C* iff for all y ∈ ℝ

(1/δ(x)) {1 − (1 − F(x + y δ(x) (1 − F(x))/f(x))) / (1 − F(x))} → y

whenever δ(x) → 0 for x ↑ ∞.

Proof Choose g(v) = vu(v) and write y_x for the expression on the left. Then

1 − F(x + y δ(x) (1 − F(x))/f(x)) = (1 − F(x)) (1 − y_x δ(x)).

Put x = U(v) and y_x = y′_v. Then

1 − F(U(v) + y δ(U(v)) v u(v)) = v^{-1} (1 − y′_v δ(U(v))).

Now, define e(v) by the equation 1 + ye(v) := {1 − y′_v δ(U(v))}^{-1} and note that u and f are linked by the equality vu(v) = ((1 − F)/f)(U(v)). Then

(U(v + yve(v)) − U(v)) / (e(v) v u(v)) = y δ(U(v)) / e(v) → y,

since δ(U(v))/e(v) → 1. As all steps can be reversed, also the converse holds. ∎

Remark 3.3 c. A useful implication of the condition F ∈ C* is given in the next result.

Lemma 3.1 Let x_n → ∞, y_n → y and r_n ≠ 0 a sequence tending to 0. If F ∈ C* then for n → ∞,

(8)    (U(x_n(1 + y_n r_n)) − U(x_n)) / (r_n x_n u(x_n)) → y.
Proof First assume that the sequence {y_n} is constant and equal to y. Suppose now on the contrary that (8) does not hold. Then there exists a subsequence {x_n} and a positive δ for which x_{n+1} > x_n + 1 and

|(U(x_n(1 + y r_n)) − U(x_n)) / (r_n x_n u(x_n)) − y| > δ

for all n. Define e(x) = r_n when x_n ≤ x < x_{n+1} and e(x) = 0 when x < x₁. As r_n → 0 and x_{n+1} > x_n + 1, e(·) is well defined and e(x) → 0 as x → ∞. Nevertheless

|(U(x_n(1 + y e(x_n))) − U(x_n)) / (e(x_n) x_n u(x_n)) − y| > δ

for all n, leading to a contradiction with the definition of C*. For a general sequence y_n → y, the sequence of increasing functions

l_n(y) := (U(x_n(1 + y r_n)) − U(x_n)) / (r_n x_n u(x_n))

converges pointwise to the function f(y) := y. But then the convergence is uniform, as follows from Pólya's extension of Dini's theorem [13]. Hence the result follows. ∎
Remark 3.4 d. All of the results in the next sections can also be derived by following a transformation approach. To avoid duplication we only deal with the Helly-Bray procedure.

Remark 3.5 e. A link between the conditions (1) and (7) is given by the relation g(x) = xu(x), which has already been used in b. above. In what follows both functions g and u will be used repeatedly.

4

One large order statistic
We illustrate the above concept first in the easiest possible situation, i.e. that of one order statistic close to the maximum. Recall a well-known weak law under the condition F ∈ C_γ(g). For then

τ_n(v) → 0 = τ(v).

But then by Lemma 2.1, g^{-1}(n/ℓ)(X*_{n−ℓ+1} − U(n/ℓ)) →_P 0. Note that in the case γ > 0 we can go one step further. Since g(x)/U(x) → γ, we also have g^{-1}(n/ℓ) X*_{n−ℓ+1} →_P γ^{-1}. When γ < 0, we find similarly that g^{-1}(n/ℓ)(x₊ − X*_{n−ℓ+1}) →_P −γ^{-1}.
A key point for introducing the class C* is illustrated in the following result, which specifies the speed of convergence in the above weak law. Because of its basic importance, we formulate the result in the form of a theorem.

Theorem 4.1 Let F ∈ C*. If ℓ, n → ∞ such that ℓ/n → 0, then

(ℓ^{3/2} / (n u(n/ℓ))) (X*_{n−ℓ+1} − U(n/ℓ)) →_d N(0, 1).

Proof Look anew at τ_n(v) above. Then with x = n/ℓ and e(x) = ℓ^{-1/2} the condition F ∈ C* shows that τ_n(v) → −v. By the symmetry of a normal random variable the result follows. ∎

The above result is precisely of the form deduced by Falk [6] under the traditional Von Mises conditions. See also Reiss [15]. We can expect that the speed of convergence in the above result might be very slow. For fixed ℓ the limit law for n → ∞ is linked to the classical extreme value distribution, while for ℓ → ∞ and ℓ/n → 0 we get a very different distribution in the normal law.
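For the standard exponential distribution, U(x) = log x and u(x) = 1/x, so the normalization in Theorem 4.1 reduces to ℓ^{1/2}(X*_{n−ℓ+1} − log(n/ℓ)). A Monte Carlo sketch of the theorem under this choice (the sample size, intermediate index, and replication count are arbitrary illustrative values):

```python
# Monte Carlo check of Theorem 4.1 for the standard exponential law:
# sqrt(ell) * (X*_{n-ell+1} - log(n/ell)) should be approximately N(0, 1)
# when ell and n are large with ell/n small.
import math
import random

def intermediate_stat(n, ell, rng):
    """sqrt(ell) * (X*_{n-ell+1} - log(n/ell)) for one exponential sample."""
    sample = sorted(-math.log(1.0 - rng.random()) for _ in range(n))
    return math.sqrt(ell) * (sample[n - ell] - math.log(n / ell))

rng = random.Random(12345)
n, ell, reps = 10000, 400, 300
stats = [intermediate_stat(n, ell, rng) for _ in range(reps)]
mean = sum(stats) / reps
sd = math.sqrt(sum((s - mean) ** 2 for s in stats) / (reps - 1))
print(round(mean, 2), round(sd, 2))   # should be near 0 and 1
```

The empirical mean and standard deviation of the normalized statistic come out close to 0 and 1, consistent with the asymptotic normal law.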
5
Two large order statistics
We turn to the case of X*_{n−s−t+1} and X*_{n−s+1} where s and t are two integers both converging to infinity. From Lemma 2.1 we know that we have to use a specific normalisation inspired by a reduction of the kernel of the integrand to a bivariate normal density.
5.1
General approach
As in section 2.1 we start from the expression for the joint expectation. Then with En := E < m ί —^=^αn