0 there exists $C_{X,2}$ such that $\mathrm{Var}(X_n) \ge C_{X,2}^2 (\lambda - \epsilon)^{2n}$.
Taking $\epsilon = \lambda\delta_1$, we see that $\mathrm{Var}(X_n)$ satisfies the lower bound in Condition 4.1. Applying Lemma 4.3 with $p = 4$ and $\epsilon = \lambda\delta_3$ we see the fourth moment bound on $X_n$ in Condition 4.2 is satisfied. With these choices for $\delta_i$, $i = 1, \ldots, 4$, as $\eta < \phi < 1$, we have $\varphi_2 < \eta < 1$ and $\varphi_1 = \eta < 1$, hence Conditions 4.1 and 4.2 are satisfied. Noting that $\beta = \max\{\varphi_1, \varphi_2\} = \eta < \phi$ now completes the proof.
4.2 Hierarchical Structures
4.2.3 Convergence Rates for the Diamond Lattice

We now apply Theorem 4.4 to hierarchical sequences generated by the diamond lattice conductivity function $F(\mathbf{x})$ in (4.30). We have already argued that Theorem 4.5 implies that $F(\mathbf{x})$ is strictly averaging on, say, $[a, b]^4$, for any $0 < a < b$ and choice of positive weights satisfying $F(\mathbf{1}_4) = 1$, and on this domain such an $F(\mathbf{x})$ is easily seen to be twice continuously differentiable. For all such $F(\mathbf{x})$ the result of Shneiberg (1986) quoted in Sect. 4.2 shows that $X_n$ satisfies a weak law. We now study the quantity $\phi$ which determines the exponential decay rate of the upper bound of Theorem 4.4 to zero. The first partial derivative $\partial F(\mathbf{x})/\partial x_1$ has the form
$$\frac{\partial F(\mathbf{x})}{\partial x_1} = \frac{(w_1 x_1^2)^{-1}}{\big((w_1 x_1)^{-1} + (w_2 x_2)^{-1}\big)^2},$$
and similarly for the other partials. Hence the gradient satisfies $F'(t\mathbf{x}) = F'(\mathbf{x})$ for all $t \ne 0$. As $X_n$ is a random variable on $[a, b]$ we have $c_n = EX_n \ne 0$, and therefore $\boldsymbol{\alpha}_n = F'(c_n \mathbf{1}_4) = F'(\mathbf{1}_4)$ for all $n \ge 0$. In particular, $\boldsymbol{\alpha} = \lim_{n\to\infty} \boldsymbol{\alpha}_n$ is given by
$$\boldsymbol{\alpha} = \left(\frac{w_1^{-1}}{(w_1^{-1} + w_2^{-1})^2},\ \frac{w_2^{-1}}{(w_1^{-1} + w_2^{-1})^2},\ \frac{w_3^{-1}}{(w_3^{-1} + w_4^{-1})^2},\ \frac{w_4^{-1}}{(w_3^{-1} + w_4^{-1})^2}\right).$$
Since we are considering the case where all the weights are positive, the vector $\boldsymbol{\alpha}$ is not a scalar multiple of a standard basis vector. Now from (4.16) we compute
$$\phi = \lambda^{-3}\left(\frac{w_1^{-3} + w_2^{-3}}{(w_1^{-1} + w_2^{-1})^6} + \frac{w_3^{-3} + w_4^{-3}}{(w_3^{-1} + w_4^{-1})^6}\right), \tag{4.67}$$
where
$$\lambda = \left(\frac{w_1^{-2} + w_2^{-2}}{(w_1^{-1} + w_2^{-1})^4} + \frac{w_3^{-2} + w_4^{-2}}{(w_3^{-1} + w_4^{-1})^4}\right)^{1/2}.$$
As an illustration of the bounds provided by Theorem 4.4, first consider the 'side equally weighted network', the one with $\mathbf{w} = (w, w, 2 - w, 2 - w)$ for $w \in [1, 2)$; we recall the weights $\mathbf{w}$ refer to the bonds in the lattice traversed counterclockwise from the top in Fig. 4.1(c). The weights for $w$ in this range are positive and satisfy $F(\mathbf{1}_4) = 1$. For $w = 1$ all weights are equal and $\boldsymbol{\alpha} = 4^{-1}\mathbf{1}_4$, so $\phi$ achieves its minimum value $1/2 = 1/\sqrt{k}$ with $k = 4$. By Theorem 4.4, for all $\gamma \in (1/2, 1)$ there exists a constant $C$ such that $\|W_n - Z\|_1 \le C\gamma^n$. The values of $\gamma$ just above $1/2$ correspond, in view of (4.36), to the rate $N^{-\theta}$ for $\theta$ just below $-\log_4 (1/2) = 1/2$, that is, $N^{-1/2+\epsilon}$ for small $\epsilon > 0$, where $N = 4^n$ is the number of variables at stage $n$. As $w$ increases from 1 to 2, $\phi$ increases continuously from $1/2$ to $1/\sqrt{2}$, with $w$ approaching 2 from below corresponding to the least favorable rate for the side equally weighted network of $\theta$ just under $-\log_4 (1/\sqrt{2}) = 1/4$, that is, of $N^{-1/4+\epsilon}$ for any $\epsilon > 0$.
4
L1 Bounds
With only the restriction that the weights are positive and satisfy $F(\mathbf{1}_4) = 1$, consider
$$\mathbf{w} = (1 + 1/t,\ s,\ t,\ 1/t) \quad\text{where}\quad s = \Big[\big(1 - (1/t + t)^{-1}\big)^{-1} - (1 + 1/t)^{-1}\Big]^{-1},\quad t > 0.$$
When $t = 1$ we have $s = 2/3$ and $\phi = 11\sqrt{2}/27$. As $t \to \infty$, $s/t \to 1/2$ and $\boldsymbol{\alpha}$ tends to the standard basis vector $(1, 0, 0, 0)$, so $\phi \to 1$. Since $11\sqrt{2}/27 \in (1/2, 1/\sqrt{2})$, the above two examples show that the value of $\gamma$ given by Theorem 4.4 for the diamond lattice can take any value in the range $(1/2, 1)$, corresponding to $N^{-\theta}$ for any $\theta \in (0, 1/2)$.
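The numerical claims for the two weight families can be checked directly from (4.67). The following is a minimal sketch, not part of the original text: the function names are hypothetical, and the form of $F$ at the all-ones vector is an assumption (a sum of two reciprocal-of-reciprocals terms) chosen to be consistent with the partial derivative displayed above and with the normalization $F(\mathbf{1}_4) = 1$.

```python
import math

def diamond_alpha(w):
    """Vector alpha from the text: the gradient of F at the all-ones vector."""
    w1, w2, w3, w4 = w
    top = (1 / w1 + 1 / w2) ** 2   # squared denominator shared by alpha_1, alpha_2
    bot = (1 / w3 + 1 / w4) ** 2   # squared denominator shared by alpha_3, alpha_4
    return [(1 / w1) / top, (1 / w2) / top, (1 / w3) / bot, (1 / w4) / bot]

def diamond_F_at_ones(w):
    """Assumed form of F(1_4); used only to check the normalization F(1_4) = 1."""
    w1, w2, w3, w4 = w
    return 1 / (1 / w1 + 1 / w2) + 1 / (1 / w3 + 1 / w4)

def phi(w):
    """phi = sum |alpha_i|^3 / lambda^3 with lambda the Euclidean norm, as in (4.67)."""
    alpha = diamond_alpha(w)
    lam = math.sqrt(sum(a * a for a in alpha))
    return sum(abs(a) ** 3 for a in alpha) / lam ** 3

def side_equal(w):
    """Side equally weighted network w = (w, w, 2 - w, 2 - w), for 1 <= w < 2."""
    return (w, w, 2 - w, 2 - w)

def second_family(t):
    """Weights (1 + 1/t, s, t, 1/t) with s chosen so that F(1_4) = 1."""
    s = 1 / (1 / (1 - 1 / (1 / t + t)) - 1 / (1 + 1 / t))
    return (1 + 1 / t, s, t, 1 / t)

if __name__ == "__main__":
    for label, w in [("side, w=1", side_equal(1.0)),
                     ("side, w near 2", side_equal(2 - 1e-9)),
                     ("second, t=1", second_family(1.0)),
                     ("second, t=1e6", second_family(1e6))]:
        print(label, phi(w))
```

Evaluating `phi` along the two families gives a quick way to see the values $1/2$, $1/\sqrt{2}$, $11\sqrt{2}/27$ and the limit 1 quoted in the text.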
4.3 Cone Measure Projections

In this section we use Stein's method to obtain $L^1$ bounds for the normal approximation of one dimensional projections of the form
$$Y = \theta \cdot \mathbf{X}, \tag{4.68}$$
where for some $p > 0$, the vector $\mathbf{X} \in \mathbb{R}^n$ has the cone measure distribution $\mathcal{C}_p^n$ given in (4.71) below, and $\theta \in \mathbb{R}^n$ is of unit length. The normal approximation of projections of random vectors in lesser and greater generality has been studied by many authors, and under a variety of metrics. In the case $p = 2$, when cone measure is uniform on the surface of the unit Euclidean sphere in $\mathbb{R}^n$, Diaconis and Freedman (1987) show that the low dimensional projections of $\mathbf{X}$ are close to normal in total variation. It is particularly easy to see in this case, and true in general, that cone measure $\mathcal{C}_p^n$ is coordinate symmetric, that is,
$$(X_1, \ldots, X_n) =_d (e_1 X_1, \ldots, e_n X_n) \quad\text{for all } (e_1, \ldots, e_n) \in \{-1, 1\}^n. \tag{4.69}$$
Meckes and Meckes (2007) derive bounds using Stein's method for the normal approximation of random vectors with symmetries in general, including coordinate-symmetry, considering the supremum and total variation norm. Goldstein and Shao (2009) give $L^\infty$ bounds on the projections of coordinate symmetric random vectors of order $1/\sqrt{n}$ without applying Stein's method. Klartag (2009) proves bounds of order $1/n$ on the $L^\infty$ distance under additional conditions on the distribution of $\mathbf{X}$, including that its density be log concave. One special case of note where $\mathbf{X}$ is coordinate symmetric is when its distribution is uniform over a convex set which has symmetry with respect to all coordinate planes. For general results on the projections of vectors sampled uniformly from convex sets, see Klartag (2007) and references therein. Studying here the specific instance of the projections of cone measure allows, naturally, for the sharpening of general results about projections of coordinate symmetric vectors to this particular case.

To define cone measure let
$$S(\ell_p^n) = \Big\{\mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^n |x_i|^p = 1\Big\} \quad\text{and}\quad B(\ell_p^n) = \Big\{\mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^n |x_i|^p \le 1\Big\}. \tag{4.70}$$
Then with $\mu^n$ Lebesgue measure in $\mathbb{R}^n$, the cone measure of $A \subset S(\ell_p^n)$ is given by
$$\mathcal{C}_p^n(A) = \frac{\mu^n([0, 1]A)}{\mu^n(B(\ell_p^n))} \quad\text{where } [0, 1]A = \{ta : a \in A,\ 0 \le t \le 1\}. \tag{4.71}$$
The main result in this section on the projections of $\mathcal{C}_p^n$ is the following.

Theorem 4.7 Let $\mathbf{X}$ have cone measure $\mathcal{C}_p^n$ on the sphere $S(\ell_p^n)$ for some $p > 0$ and let
$$Y = \sum_{i=1}^n \theta_i X_i$$
be the one-dimensional projection of $\mathbf{X}$ along the direction $\theta \in \mathbb{R}^n$ with $\|\theta\| = 1$. Then with $\sigma_{n,p}^2 = \mathrm{Var}(X_1)$ and $m_{n,p} = E|X_1|^3/\sigma_{n,p}^2$, given in (4.84) and (4.87), respectively, and $F$ the distribution function of the normalized sum $W = Y/\sigma_{n,p}$, we have
$$\|F - \Phi\|_1 \le \frac{m_{n,p}}{\sigma_{n,p}} \sum_{i=1}^n |\theta_i|^3 + \Big(\frac{1}{p} \vee 1\Big)\frac{4}{n + 2}, \tag{4.72}$$
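where $\Phi$ is the cumulative distribution function of the standard normal.

We note that by the limits in (4.84) and (4.88), the constant $m_{n,p}/\sigma_{n,p}$ that multiplies the sum in the bound (4.72) is of the order of a constant with asymptotic value
$$\lim_{n\to\infty} \frac{m_{n,p}}{\sigma_{n,p}} = \frac{\Gamma(4/p)\sqrt{\Gamma(1/p)}}{\Gamma(3/p)^{3/2}}.$$
Since, for $\theta \in \mathbb{R}^n$ with $\|\theta\| = 1$, we have
$$\sum_{i=1}^n |\theta_i|^3 \ge \frac{1}{\sqrt{n}},$$
the second term in (4.72) is always of smaller order than the first, so the decay rate of the bound to zero is determined by $\sum_i |\theta_i|^3$. The minimal rate $1/\sqrt{n}$ is achieved when $\theta_i = 1/\sqrt{n}$.

In the special cases $p = 1$ and $p = 2$, $\mathcal{C}_p^n$ is uniform on the simplex $\sum_{i=1}^n |x_i| = 1$ and the unit Euclidean sphere $\sum_{i=1}^n x_i^2 = 1$, respectively. By (4.84) and (4.87) for $p = 1$,
$$\sigma_{n,1}^2 = \frac{2}{n(n + 1)} \quad\text{and}\quad m_{n,1} = \frac{3}{n + 2},$$
and, using also (4.88) for $p = 2$,
$$\sigma_{n,2}^2 = \frac{1}{n} \quad\text{and}\quad m_{n,2} \le \sqrt{\frac{3}{n + 2}};$$
these relations yield
$$\frac{m_{n,1}}{\sigma_{n,1}} = 3\sqrt{\frac{n(n + 1)}{2(n + 2)^2}} \le \frac{3}{\sqrt{2}} \quad\text{and}\quad \frac{m_{n,2}}{\sigma_{n,2}} \le \sqrt{\frac{3n}{n + 2}} \le \sqrt{3}.$$
Substituting into (4.72) now gives
$$\|F - \Phi\|_1 \le \frac{3}{\sqrt{p + 1}} \sum_{i=1}^n |\theta_i|^3 + \frac{4}{n + 2} \quad\text{for } p \in \{1, 2\}. \tag{4.73}$$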
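The closed forms for $p = 1, 2$ and the limiting constant can be checked numerically against the general Gamma-function expressions (4.84) and (4.87) stated later in this section. The sketch below is illustrative only; the function names are hypothetical, and `math.lgamma` is used simply to avoid overflow for large $n$.

```python
import math

def sigma2(n, p):
    """Coordinate variance under cone measure, from formula (4.84)."""
    return math.exp(math.lgamma(3 / p) + math.lgamma(n / p)
                    - math.lgamma(1 / p) - math.lgamma((n + 2) / p))

def m(n, p):
    """m_{n,p} = E|X_1|^3 / sigma_{n,p}^2, from formula (4.87)."""
    return math.exp(math.lgamma(4 / p) + math.lgamma((n + 2) / p)
                    - math.lgamma(3 / p) - math.lgamma((n + 3) / p))

def rhs_4_72(theta, p):
    """Right-hand side of (4.72) for a unit vector theta (list of floats)."""
    n = len(theta)
    ratio = m(n, p) / math.sqrt(sigma2(n, p))
    cubes = sum(abs(t) ** 3 for t in theta)
    return ratio * cubes + max(1 / p, 1) * 4 / (n + 2)

def limit_ratio(p):
    """Asymptotic value of m_{n,p} / sigma_{n,p} as n grows."""
    return math.exp(math.lgamma(4 / p) + 0.5 * math.lgamma(1 / p)
                    - 1.5 * math.lgamma(3 / p))

if __name__ == "__main__":
    n = 100
    theta = [1 / math.sqrt(n)] * n          # direction achieving the 1/sqrt(n) rate
    for p in (1, 2):
        print(p, sigma2(n, p), m(n, p), rhs_4_72(theta, p), limit_ratio(p))
```

For $p \in \{1, 2\}$ the value returned by `rhs_4_72` should not exceed the simplified bound (4.73), since (4.73) is obtained from (4.72) by bounding $m_{n,p}/\sigma_{n,p}$ from above.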
4.3.1 Coupling Constructions for Coordinate Symmetric Variables and Their Projections

We generalize the construction in Proposition 2.3 to coordinate symmetric vectors, beginning by generalizing the notion of square biasing, given there, to square biasing in coordinates. To begin, note that if $\mathbf{Y}$ is a coordinate symmetric random vector in $\mathbb{R}^n$ and $EY_i^2 < \infty$ for $i = 1, \ldots, n$, then the symmetry condition (4.69) implies
$$EY_i = -EY_i \quad\text{and}\quad EY_i Y_j = -EY_i Y_j \quad\text{for all } i \ne j,$$
and hence
$$EY_i = 0 \quad\text{and}\quad EY_i Y_j = \sigma_i^2 \delta_{ij} \quad\text{for all } i, j, \tag{4.74}$$
where $\sigma_i^2 = \mathrm{Var}(Y_i) = EY_i^2$. By removing any component which has zero variance, and lowering the dimension accordingly, we may assume without loss of generality that $\sigma_i^2 > 0$ for all $i = 1, \ldots, n$. For such $\mathbf{Y}$, for all $i = 1, \ldots, n$, we claim there exists a distribution $\mathbf{Y}^i$ such that for all functions $f : \mathbb{R}^n \to \mathbb{R}$ for which the expectation of the left hand side below exists,
$$EY_i^2 f(\mathbf{Y}) = \sigma_i^2 Ef(\mathbf{Y}^i), \tag{4.75}$$
and say that $\mathbf{Y}^i$ has the $\mathbf{Y}$-square bias distribution in direction $i$. In particular, the distribution of $\mathbf{Y}^i$ is absolutely continuous with respect to that of $\mathbf{Y}$ with
$$dF^i(\mathbf{y}) = \frac{y_i^2}{\sigma_i^2}\, dF(\mathbf{y}). \tag{4.76}$$
By specializing (4.75) to the case where $f$ depends only on $Y_i$, we see, in the language of Proposition 2.3, that $Y_i^i =_d Y_i^{\square}$, that is, that $Y_i^i$ has the $Y_i$-square bias distribution. Proposition 4.3 shows how to construct the zero bias distribution $Y^*$ for the sum $Y$ of the components of a coordinate-symmetric vector in terms of $\mathbf{Y}^i$ and a random index in a way that parallels the construction for size biasing given in Proposition 2.2. Again we let $\mathcal{U}[a, b]$ denote the uniform distribution on $[a, b]$.
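Proposition 4.3 Let $\mathbf{Y} \in \mathbb{R}^n$ be a coordinate-symmetric random vector with $\mathrm{Var}(Y_i) = \sigma_i^2 \in (0, \infty)$ for all $i = 1, 2, \ldots, n$, and
$$Y = \sum_{i=1}^n Y_i.$$
Let $\mathbf{Y}^i$, $i = 1, \ldots, n$, have the square bias distribution given in (4.75), $I$ a random index with distribution
$$P(I = i) = \frac{\sigma_i^2}{\sum_{j=1}^n \sigma_j^2} \tag{4.77}$$
and $U_i \sim \mathcal{U}[-1, 1]$, with $\mathbf{Y}^i$, $I$ and $U_i$ mutually independent for all $i = 1, \ldots, n$. Then
$$Y^* = U_I Y_I^I + \sum_{j \ne I} Y_j^I \tag{4.78}$$
has the $Y$-zero bias distribution.

Proof Let $f$ be an absolutely continuous function with $E|Yf(Y)| < \infty$, and write $\sigma^2 = \sum_{i=1}^n \sigma_i^2$, which equals $\mathrm{Var}(Y)$ by (4.74). Starting with the given form of $Y^*$, then averaging over the index $I$, integrating out the uniform variable $U_i$ and applying (4.75) and (4.69), we obtain
$$\begin{aligned}
\sigma^2 Ef'(Y^*) &= \sigma^2 Ef'\Big(U_I Y_I^I + \sum_{j \ne I} Y_j^I\Big) \\
&= \sigma^2 \sum_{i=1}^n \frac{\sigma_i^2}{\sigma^2}\, Ef'\Big(U_i Y_i^i + \sum_{j \ne i} Y_j^i\Big) \\
&= \sum_{i=1}^n \sigma_i^2\, E\left[\frac{f\big(Y_i^i + \sum_{j \ne i} Y_j^i\big) - f\big(-Y_i^i + \sum_{j \ne i} Y_j^i\big)}{2Y_i^i}\right] \\
&= \sum_{i=1}^n E\left[Y_i\, \frac{f\big(Y_i + \sum_{j \ne i} Y_j\big) - f\big(-Y_i + \sum_{j \ne i} Y_j\big)}{2}\right] \\
&= \sum_{i=1}^n EY_i f\Big(Y_i + \sum_{j \ne i} Y_j\Big) \\
&= EYf(Y).
\end{aligned}$$
Thus, $Y^*$ has the $Y$-zero bias distribution.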
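As a sanity check of the zero bias identity $EYf(Y) = \sigma^2 Ef'(Y^*)$, consider a special case that can be evaluated exactly. The sketch below assumes $Y_i = a_i \epsilon_i$ with independent random signs $\epsilon_i$ and nonzero constants $a_i$ (an illustrative choice, not taken from the text). Since $Y_i^2 = a_i^2$ is constant, (4.76) gives $\mathbf{Y}^i =_d \mathbf{Y}$, so $Y^* = U_I a_I \epsilon_I + \sum_{j \ne I} a_j \epsilon_j$. The test function $f(y) = y^3$ and the helper name are hypothetical.

```python
from fractions import Fraction
from itertools import product

def check_zero_bias_identity(a):
    """Return (E[Y f(Y)], sigma^2 E[f'(Y*)]) for f(y) = y**3, Y = sum a_i eps_i."""
    n = len(a)
    a = [Fraction(x) for x in a]
    sigma2 = sum(x * x for x in a)                 # Var(Y) = sum of a_i^2
    signs = list(product((-1, 1), repeat=n))       # all sign patterns, equally likely
    prob = Fraction(1, len(signs))

    # Left side: E[Y * Y^3] = E[Y^4], by exact enumeration.
    lhs = sum(prob * sum(e * x for e, x in zip(eps, a)) ** 4 for eps in signs)

    # Right side: sigma^2 * E[3 (Y*)^2], averaging over I, the signs and U.
    rhs = Fraction(0)
    for i in range(n):
        p_i = a[i] ** 2 / sigma2                   # P(I = i) as in (4.77)
        for eps in signs:
            rest = sum(e * x for j, (e, x) in enumerate(zip(eps, a)) if j != i)
            c = eps[i] * a[i]
            # For U uniform on [-1, 1]: E[U] = 0 and E[U^2] = 1/3.
            second_moment = c * c * Fraction(1, 3) + rest * rest
            rhs += p_i * prob * 3 * second_moment
    return lhs, rhs * sigma2

if __name__ == "__main__":
    print(check_zero_bias_identity([1, 2, 3]))
```

Exact rational arithmetic is used so that the two sides can be compared with `==` rather than up to floating point error.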
Factoring (4.76) as
$$dF^i(\mathbf{y}) = dF_i^i(y_i)\, dF(y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n \mid y_i) \quad\text{where}\quad dF_i^i(y_i) = \frac{y_i^2}{\sigma_i^2}\, dF_i(y_i) \tag{4.79}$$
provides an alternate way of seeing that $Y_i^i =_d Y_i^{\square}$. Moreover, it suggests a coupling between $Y$ and $Y^*$ where, given $\mathbf{Y}$, an index $I = i$ is chosen with weight proportional to the variance $\sigma_i^2$, the summand $Y_i$ is replaced by $Y_i^i$ having that summand's 'square bias' distribution and then multiplied by $U$, and, finally, the remaining variables of $\mathbf{Y}$ are perturbed, so that they achieve their original distribution conditional on the $i$th variable now taking on the value $Y_i^i$. Typically the remaining variables are changed as little as possible in order to make the coupling between $Y$ and $Y^*$ close.

Now let $\mathbf{X} \in \mathbb{R}^n$ be an exchangeable coordinate-symmetric random vector with components having finite second moments and let $\theta \in \mathbb{R}^n$ have unit length. Then, by (4.74), the projection $Y$ of $\mathbf{X}$ along the direction $\theta$,
$$Y = \sum_{i=1}^n \theta_i X_i,$$
has mean zero and variance $\sigma^2$ equal to the common variance of the components of $\mathbf{X}$. To form $Y^*$ using the construction just outlined, in view of (4.79) in particular, requires a vector of random variables to be 'adjusted' according to their original distribution, conditional on one coordinate taking on a newly chosen, biased, value. Random vectors which have the 'scaling-conditional' property in Definition 4.2 can easily be so adjusted. Let $\mathcal{L}(V)$ and $\mathcal{L}(V \mid X = x)$ denote the distribution of $V$, and the conditional distribution of $V$ given $X = x$, respectively.

Definition 4.2 Let $\mathbf{X} = (X_1, \ldots, X_n)$ be an exchangeable random vector and $D \subset \mathbb{R}$ the support of the distribution of $X_1$. If there exists a function $g : D \to \mathbb{R}$ such that $P(g(X_1) = 0) = 0$ and
$$\mathcal{L}(X_2, \ldots, X_n \mid X_1 = a) = \mathcal{L}\left(\frac{g(a)}{g(X_1)}(X_2, \ldots, X_n)\right) \quad\text{for all } a \in D, \tag{4.80}$$
we say that $\mathbf{X}$ is scaling $g$-conditional, or simply scaling-conditional.

Proposition 4.4 is an application of Theorem 4.1 and Proposition 4.3 to projections of coordinate symmetric, scaling-conditional vectors.

Proposition 4.4 Let $\mathbf{X} \in \mathbb{R}^n$ be an exchangeable, coordinate symmetric and scaling $g$-conditional random vector with finite second moment. For $\theta \in \mathbb{R}^n$ of unit length set
$$Y = \sum_{i=1}^n \theta_i X_i, \quad \sigma^2 = \mathrm{Var}(Y), \quad\text{and}\quad F(x) = P(Y/\sigma \le x).$$
Then any construction of $(\mathbf{X}, X_i^i)$ on a joint space for each $i = 1, \ldots, n$ with $X_i^i$ having the $X_i$-square biased distribution provides the upper bound
$$\|F - \Phi\|_1 \le \frac{2}{\sigma}\, E\left|\theta_I\big(U_I X_I^I - X_I\big) + \left(\frac{g(X_I^I)}{g(X_I)} - 1\right)\sum_{j \ne I} \theta_j X_j\right|, \tag{4.81}$$
where $P(I = i) = \theta_i^2$ and $U_i \sim \mathcal{U}[-1, 1]$ with $\{X_i^i, X_j, j \ne i\}$, $I$ and $U_i$ mutually independent for $i = 1, 2, \ldots, n$.
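Proof For all $i = 1, \ldots, n$, since $\mathbf{X}$ is scaling $g$-conditional, given $\mathbf{X}$ and $X_i^i$ with the $X_i$-square bias distribution, by (4.79) and (4.80) the vector
$$\mathbf{X}^i = \left(\frac{g(X_i^i)}{g(X_i)} X_1, \ldots, \frac{g(X_i^i)}{g(X_i)} X_{i-1},\ X_i^i,\ \frac{g(X_i^i)}{g(X_i)} X_{i+1}, \ldots, \frac{g(X_i^i)}{g(X_i)} X_n\right)$$
has the $\mathbf{X}$-square bias distribution in direction $i$ as given in (4.75), that is, for every $h$ for which the expectation on the left-hand side below exists,
$$EX_i^2 h(\mathbf{X}) = EX_i^2\, Eh(\mathbf{X}^i). \tag{4.82}$$
We now apply Proposition 4.3 to $\mathbf{Y} = (\theta_1 X_1, \ldots, \theta_n X_n)$. First, the coordinate symmetry of $\mathbf{Y}$ follows from that of $\mathbf{X}$. Next, we claim
$$\mathbf{Y}^i = (\theta_1 X_1^i, \ldots, \theta_n X_n^i)$$
has the $\mathbf{Y}$-square bias distribution in direction $i$. Given $f$, let $h(\mathbf{X}) = f(\theta_1 X_1, \ldots, \theta_n X_n)$. Applying (4.82) we obtain
$$\begin{aligned}
EY_i^2 f(\mathbf{Y}) &= E\theta_i^2 X_i^2 f(\mathbf{Y}) = \theta_i^2 EX_i^2 h(\mathbf{X}) \\
&= \theta_i^2 EX_i^2\, Eh(\mathbf{X}^i) \\
&= E\theta_i^2 X_i^2\, Ef(\mathbf{Y}^i) \\
&= EY_i^2\, Ef(\mathbf{Y}^i).
\end{aligned}$$
Since $\mathbf{X}$ is exchangeable, the variance of $Y_i$ is proportional to $\theta_i^2$ and the distribution of $I$ in (4.77) specializes to the one claimed. Lastly, as $\mathbf{Y}^i$, $I$ and $U_i$ are mutually independent for $i = 1, \ldots, n$, Proposition 4.3 yields that
$$Y^* = U_I Y_I^I + \sum_{j \ne I} Y_j^I$$
has the $Y$-zero bias distribution. The difference $Y^* - Y$ is given by
$$\begin{aligned}
Y^* - Y &= U_I Y_I^I + \sum_{j \ne I} Y_j^I - \sum_{i=1}^n Y_i \\
&= U_I \theta_I X_I^I + \sum_{j \ne I} \theta_j X_j^I - \sum_{j=1}^n \theta_j X_j \\
&= \theta_I\big(U_I X_I^I - X_I\big) + \sum_{j \ne I} \theta_j\big(X_j^I - X_j\big) \\
&= \theta_I\big(U_I X_I^I - X_I\big) + \sum_{j \ne I} \theta_j\left(\frac{g(X_I^I)}{g(X_I)} - 1\right) X_j \\
&= \theta_I\big(U_I X_I^I - X_I\big) + \left(\frac{g(X_I^I)}{g(X_I)} - 1\right)\sum_{j \ne I} \theta_j X_j.
\end{aligned}$$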
The proof is completed by dividing both sides by σ , applying (2.59) to yield Y ∗ /σ = (Y/σ )∗ , and invoking Theorem 4.1.
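4.3.2 Construction and Bounds for Cone Measure

Proposition 4.5 below shows that Proposition 4.4 can be applied to cone measure. We denote the Gamma and Beta distributions with parameters $\alpha, \beta$ as $\Gamma(\alpha, \beta)$ and $B(\alpha, \beta)$, respectively. That is, with the Gamma function at $\alpha > 0$ given by
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx,$$
with $\beta > 0$, the density of the $\Gamma(\alpha, \beta)$ distribution is
$$\frac{x^{\alpha - 1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}\, \mathbf{1}_{\{x > 0\}};$$
the density of the Beta distribution $B(\alpha, \beta)$ is given in (4.90).

Proposition 4.5 Let $\mathcal{C}_p^n$ denote cone measure as given in (4.71) for some $n \in \mathbb{N}$ and $p > 0$.

1. Cone measure $\mathcal{C}_p^n$ is exchangeable and coordinate-symmetric. For $\{G_j, \epsilon_j, j = 1, \ldots, n\}$ independent variables with $G_j \sim \Gamma(1/p, 1)$ and $\epsilon_j$ taking values $-1$ and $+1$ with equal probability, setting $G_{a,b} = \sum_{i=a}^b G_i$ we have
$$\mathbf{X} = \left(\epsilon_1 \Big(\frac{G_1}{G_{1,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{1,n}}\Big)^{1/p}\right) \sim \mathcal{C}_p^n. \tag{4.83}$$

2. The common marginal distribution $X_i$ of cone measure is characterized by
$$X_i =_d -X_i \quad\text{and}\quad |X_i|^p \sim B\big(1/p, (n - 1)/p\big),$$
and the variance $\sigma_{n,p}^2 = \mathrm{Var}(X_i)$ is given by
$$\sigma_{n,p}^2 = \frac{\Gamma(3/p)\Gamma(n/p)}{\Gamma(1/p)\Gamma((n + 2)/p)} \tag{4.84}$$
and satisfies
$$\lim_{n\to\infty} n^{2/p} \sigma_{n,p}^2 = \frac{p^{2/p}\, \Gamma(3/p)}{\Gamma(1/p)}.$$

3. The square bias distribution $X_i^i$ of $X_i$ is characterized by
$$X_i^i =_d -X_i^i \quad\text{and}\quad |X_i^i|^p \sim B\big(3/p, (n - 1)/p\big). \tag{4.85}$$
Letting $\{G_j, G_j', \epsilon_j, j = 1, \ldots, n\}$ be independent variables with $G_j \sim \Gamma(1/p, 1)$, $G_j' \sim \Gamma(2/p, 1)$ and $\epsilon_j$ taking values $-1$ and $+1$ with equal probability, for each $i = 1, \ldots, n$, a construction of $(\mathbf{X}, X_i^i)$ on a joint space is given by the representation of $\mathbf{X}$ in (4.83) along with
$$X_i^i = \epsilon_i \left(\frac{G_i + G_i'}{G_{1,n} + G_i'}\right)^{1/p}. \tag{4.86}$$
The mean $m_{n,p} = E|X_i^i| = E|X_i^3|/\sigma_{n,p}^2$ for all $i = 1, \ldots, n$ is given by
$$m_{n,p} = \frac{\Gamma(4/p)\Gamma((n + 2)/p)}{\Gamma(3/p)\Gamma((n + 3)/p)} \tag{4.87}$$
and satisfies
$$\lim_{n\to\infty} n^{1/p} m_{n,p} = \frac{p^{1/p}\, \Gamma(4/p)}{\Gamma(3/p)} \quad\text{and}\quad m_{n,p} \le \left(\frac{3}{n + 2}\right)^{1/(p \vee 1)}. \tag{4.88}$$

4. Cone measure $\mathcal{C}_p^n$ is scaling $(1 - |x|^p)^{1/p}$ conditional.

The proof of Proposition 4.5 is deferred to the end of this section. Before proceeding to Theorem 4.7, we remind the reader of the following known facts about the Gamma and Beta distributions; see Bickel and Doksum (1977), Theorem 1.2.3 for the case $n = 2$ of the first claim, the extension to general $n$ and the following claim being straightforward. For $\gamma_i \sim \Gamma(\alpha_i, \beta)$, $i = 1, \ldots, n$, independent with $\alpha_i > 0$ and $\beta > 0$,
$$\gamma_1 + \gamma_2 \sim \Gamma(\alpha_1 + \alpha_2, \beta), \qquad \frac{\gamma_1}{\gamma_1 + \gamma_2} \sim B(\alpha_1, \alpha_2),$$
$$\text{and}\quad \left(\frac{\gamma_1}{\sum_{i=1}^n \gamma_i}, \ldots, \frac{\gamma_n}{\sum_{i=1}^n \gamma_i}\right) \quad\text{and}\quad \sum_{i=1}^n \gamma_i \quad\text{are independent;} \tag{4.89}$$
the Beta distribution $B(\alpha, \beta)$ has density
$$p_{\alpha,\beta}(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, u^{\alpha - 1}(1 - u)^{\beta - 1}\, \mathbf{1}_{u \in [0,1]} \quad\text{and } \kappa > 0 \text{ moment}\quad \frac{\Gamma(\alpha + \kappa)\Gamma(\alpha + \beta)}{\Gamma(\alpha + \beta + \kappa)\Gamma(\alpha)}. \tag{4.90}$$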
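The representations (4.83) and (4.86) translate directly into a sampler that produces $\mathbf{X}$ and a square biased coordinate $X_i^i$ on a common probability space. The following is a rough sketch using only the Python standard library; the function name and the zero-based index `i` are illustrative choices, not notation from the text.

```python
import random

def sample_cone_pair(n, p, i, rng):
    """Return (X, X_i^i) built from the representations (4.83) and (4.86).

    n   -- dimension
    p   -- cone measure parameter, p > 0
    i   -- zero-based coordinate to square bias
    rng -- a random.Random instance
    """
    # random.gammavariate(shape, scale) has density x^(a-1) e^(-x/b) / (Gamma(a) b^a).
    g = [rng.gammavariate(1 / p, 1.0) for _ in range(n)]   # G_j ~ Gamma(1/p, 1)
    g_prime = rng.gammavariate(2 / p, 1.0)                  # G'_i ~ Gamma(2/p, 1)
    eps = [rng.choice((-1, 1)) for _ in range(n)]           # independent random signs
    total = sum(g)                                          # G_{1,n}
    x = [e * (gj / total) ** (1 / p) for e, gj in zip(eps, g)]
    x_biased = eps[i] * ((g[i] + g_prime) / (total + g_prime)) ** (1 / p)
    return x, x_biased

if __name__ == "__main__":
    rng = random.Random(12345)
    x, x_biased = sample_cone_pair(n=5, p=1.5, i=2, rng=rng)
    print(sum(abs(v) ** 1.5 for v in x))    # lies on the sphere S(l_p^n), up to rounding
    print(abs(x[2]), abs(x_biased))         # the biased coordinate is at least as large
```

Two features of the coupling are visible in every sample: $\mathbf{X}$ lies on $S(\ell_p^n)$, and $|X_i^i| \ge |X_i|$ with the same sign, which is the monotonicity used in the proof of Theorem 4.7.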
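Proof of Theorem 4.7 Using Proposition 4.5, we apply Proposition 4.4 for $\mathbf{X}$ with $g(x) = (1 - |x|^p)^{1/p}$ and the joint construction of $(\mathbf{X}, X_i^i)$ given in item 3. Note that Proposition 4.2 applies, using the notation there, with $V \sim \mathcal{U}[0, 1]$, independent of all other variables, $U_i = \epsilon_i V$, and
$$X_i = \left(\frac{G_i}{G_{1,n}}\right)^{1/p} \quad\text{and}\quad Y_i = \left(\frac{G_i + G_i'}{G_{1,n} + G_i'}\right)^{1/p}.$$
Applying the triangle inequality on (4.81) yields the bound on the $L^1$ norm $\|F - \Phi\|_1$ of
$$\frac{2}{\sigma_{n,p}}\left(E\left|\theta_I\big(U_I X_I^I - X_I\big)\right| + E\left|\left(\frac{g(X_I^I)}{g(X_I)} - 1\right)\sum_{j \ne I} \theta_j X_j\right|\right). \tag{4.91}$$
We begin by averaging the first term over $I$. Note that
$$|X_1| = \left(\frac{G_1}{G_{1,n}}\right)^{1/p} \le \left(\frac{G_1 + G_1'}{G_{1,n} + G_1'}\right)^{1/p} = |X_1^1|,$$
and therefore, recalling $P(I = i) = \theta_i^2$, we may invoke Proposition 4.2 to conclude
$$\begin{aligned}
E\left|\theta_I\big(U_I X_I^I - X_I\big)\right| &= \sum_{i=1}^n |\theta_i|^3\, E\left|U_i X_i^i - X_i\right| \\
&= E\left|U_1 X_1^1 - X_1\right| \sum_{i=1}^n |\theta_i|^3 \\
&\le \frac{E|X_1|^3}{2\sigma_{n,p}^2} \sum_{i=1}^n |\theta_i|^3 = \frac{m_{n,p}}{2} \sum_{i=1}^n |\theta_i|^3.
\end{aligned} \tag{4.92}$$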
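Now, averaging the second term in (4.91) over the distribution of $I$ yields
$$E\left|\left(\frac{g(X_I^I)}{g(X_I)} - 1\right)\sum_{j \ne I} \theta_j X_j\right| = \sum_{i=1}^n E\left|\left(\frac{g(X_i^i)}{g(X_i)} - 1\right)\sum_{j \ne i} \theta_j X_j\right| \theta_i^2. \tag{4.93}$$
Using (4.83), (4.86) and $g(x) = (1 - |x|^p)^{1/p}$, we have
$$\frac{g(X_i^i)}{g(X_i)} - 1 = \left(\frac{G_{1,n}}{G_{1,n} + G_i'}\right)^{1/p} - 1. \tag{4.94}$$
Applying (4.89) we have that $\{G_{1,n}, G_i'\}$ are independent of $X_1, \ldots, X_n$; hence, the term (4.94) is independent of the sum it multiplies in (4.93) and therefore (4.93) equals
$$\sum_{i=1}^n E\left|\frac{g(X_i^i)}{g(X_i)} - 1\right|\, E\left|\sum_{j \ne i} \theta_j X_j\right| \theta_i^2. \tag{4.95}$$
To bound the first expectation in (4.95), since $G_{1,n}/(G_{1,n} + G_i') \sim B(n/p, 2/p)$, we have
$$E\left|\frac{g(X_i^i)}{g(X_i)} - 1\right| = E\left(1 - \left(\frac{G_{1,n}}{G_{1,n} + G_i'}\right)^{1/p}\right) \le \Big(\frac{1}{p} \vee 1\Big)\frac{2}{n + 2} \tag{4.96}$$
since for $p \ge 1$, using (4.90) with $\kappa = 1$,
$$E\left(1 - \left(\frac{G_{1,n}}{G_{1,n} + G_i'}\right)^{1/p}\right) \le E\left(1 - \frac{G_{1,n}}{G_{1,n} + G_i'}\right) = 1 - \frac{n/p}{(n + 2)/p} = \frac{2}{n + 2},$$
while for $0 < p < 1$, using Jensen's inequality and the fact that $(1 - x)^{1/p} \ge 1 - x/p$ for $x \le 1$, we have
$$E\left(1 - \left(\frac{G_{1,n}}{G_{1,n} + G_i'}\right)^{1/p}\right) \le 1 - \left(E\,\frac{G_{1,n}}{G_{1,n} + G_i'}\right)^{1/p} = 1 - \left(\frac{n}{n + 2}\right)^{1/p} \le \frac{2}{p(n + 2)}.$$
We may bound the second expectation in (4.95) by $\sigma_{n,p}$ since
$$\left(E\left|\sum_{j \ne i} \theta_j X_j\right|\right)^2 \le E\left(\sum_{j \ne i} \theta_j X_j\right)^2 = \mathrm{Var}\left(\sum_{j \ne i} \theta_j X_j\right) = \sigma_{n,p}^2 \sum_{j \ne i} \theta_j^2 \le \sigma_{n,p}^2.$$
Neither this bound nor the bound (4.96) depends on $i$, so substituting them into (4.95) and summing over $i$, again using $\sum_i \theta_i^2 = 1$, yields
$$\sum_{i=1}^n E\left|\frac{g(X_i^i)}{g(X_i)} - 1\right|\, E\left|\sum_{j \ne i} \theta_j X_j\right| \theta_i^2 \le \sigma_{n,p} \Big(\frac{1}{p} \vee 1\Big)\frac{2}{n + 2}. \tag{4.97}$$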
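Adding (4.92) and (4.97) and multiplying by $2/\sigma_{n,p}$ in accordance with (4.81) yields (4.72).

Proof of Proposition 4.5 1. For $A \subset S(\ell_p^n)$, $\mathbf{e} = (e_1, \ldots, e_n) \in \{-1, 1\}^n$ and a permutation $\pi \in \mathcal{S}_n$, let
$$A_{\mathbf{e}} = \{\mathbf{x} : (e_1 x_1, \ldots, e_n x_n) \in A\} \quad\text{and}\quad A_\pi = \{\mathbf{x} : (x_{\pi(1)}, \ldots, x_{\pi(n)}) \in A\}.$$
By the properties of Lebesgue measure, $\mu^n([0, 1]A_{\mathbf{e}}) = \mu^n([0, 1]A_\pi) = \mu^n([0, 1]A)$, so by (4.71), cone measure is coordinate symmetric and exchangeable. The coordinate symmetry of $\mathbf{X}$ implies that
$$P(\mathbf{X} \in A) = P(\mathbf{X} \in A_{\mathbf{e}}) \quad\text{for all } \mathbf{e} \in \{-1, 1\}^n,$$
so with $\epsilon_i$, $i = 1, \ldots, n$, i.i.d. variables taking the values $1$ and $-1$ with probability $1/2$ and independent of $\mathbf{X}$,
$$P\big((\epsilon_1 X_1, \ldots, \epsilon_n X_n) \in A\big) = P(\mathbf{X} \in A_{\boldsymbol{\epsilon}}) = \frac{1}{2^n} \sum_{\mathbf{e} \in \{-1,1\}^n} P(\mathbf{X} \in A_{\mathbf{e}}) = P(\mathbf{X} \in A),$$
and hence $(\epsilon_1 X_1, \ldots, \epsilon_n X_n) =_d (X_1, \ldots, X_n)$. Note that for any $(s_1, \ldots, s_n) \in \{-1, 1\}^n$,
$$(\epsilon_1 s_1, \ldots, \epsilon_n s_n) =_d (\epsilon_1, \ldots, \epsilon_n), \quad\text{and is independent of } \mathbf{X}.$$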
Hence, since $P(X_i = 0) = 0$, with $s_i = X_i/|X_i|$, the sign of $X_i$, we have
$$\begin{aligned}
P\big((\epsilon_1 |X_1|, \ldots, \epsilon_n |X_n|) \in A\big) &= P\big((\epsilon_1 s_1 X_1, \ldots, \epsilon_n s_n X_n) \in A\big) \\
&= P\big((\epsilon_1 X_1, \ldots, \epsilon_n X_n) \in A\big) \\
&= P\big((X_1, \ldots, X_n) \in A\big).
\end{aligned}$$
We thus obtain (4.83) applying that $\mathbf{X} \sim \mathcal{C}_p^n$ satisfies
$$\big(|X_1|, \ldots, |X_n|\big) =_d \left(\Big(\frac{G_1}{G_{1,n}}\Big)^{1/p}, \ldots, \Big(\frac{G_n}{G_{1,n}}\Big)^{1/p}\right) \tag{4.98}$$
shown, for instance, by Schechtman and Zinn (1990).

2. Applying the coordinate symmetry of $\mathbf{X}$ coordinatewise gives $X_i =_d -X_i$, and (4.98) yields $|X_i|^p =_d G_i/G_{1,n}$, which has the claimed Beta distribution, by (4.89). As $EX_i = 0$, we have
$$\mathrm{Var}(X_i) = EX_i^2 = E\big(|X_i|^p\big)^{2/p} \tag{4.99}$$
and the variance claim in (4.84) follows from (4.90) for $\alpha = 1/p$, $\beta = (n - 1)/p$ and $\kappa = 2/p$. From Stirling's formula, for all $x > 0$,
$$\lim_{m\to\infty} \frac{m^x\, \Gamma(m)}{\Gamma(m + x)} = 1,$$
so letting $m = n/p$ and $x = k/p$,
$$\lim_{n\to\infty} \frac{n^{k/p}\, \Gamma(n/p)}{\Gamma((n + k)/p)} = p^{k/p}. \tag{4.100}$$
The limit in (4.84) now follows.

3. If $X$ is symmetric with variance $\sigma^2 > 0$ and $X^{\square}$ has the $X$-square bias distribution, then for all bounded continuous functions $f$
$$\sigma^2 Ef(X^{\square}) = EX^2 f(X) = E(-X)^2 f(-X) = EX^2 f(-X) = \sigma^2 Ef(-X^{\square}),$$
showing $X^{\square}$ is symmetric. From (4.90) and a change of variable, a random variable $X$ satisfies $|X|^p \sim B(\alpha/p, \beta/p)$ if and only if the density $p_{|X|}(u)$ of $|X|$ is
$$p_{|X|}(u) = \frac{p\, \Gamma((\alpha + \beta)/p)}{\Gamma(\alpha/p)\Gamma(\beta/p)}\, u^{\alpha - 1}\big(1 - u^p\big)^{\beta/p - 1}\, \mathbf{1}_{u \in [0,1]}. \tag{4.101}$$
Hence, since $|X_i|^p \sim B(1/p, (n - 1)/p)$ by item 2, the density $p_{|X_i|}(u)$ of $|X_i|$ is
$$p_{|X_i|}(u) = \frac{p\, \Gamma(n/p)}{\Gamma(1/p)\Gamma((n - 1)/p)}\, \big(1 - u^p\big)^{(n-1)/p - 1}\, \mathbf{1}_{u \in [0,1]}.$$
Multiplying by $u^2$ and renormalizing produces the $|X_i^i|$ density
$$p_{|X_i^i|}(u) = \frac{u^2\, p_{|X_i|}(u)}{EX_i^2} = \frac{p\, \Gamma((n + 2)/p)}{\Gamma(3/p)\Gamma((n - 1)/p)}\, u^2 \big(1 - u^p\big)^{(n-1)/p - 1}\, \mathbf{1}_{u \in [0,1]}, \tag{4.102}$$
and comparing (4.102) to (4.101) shows the second claim in (4.85). The representation (4.86) now follows from (4.89) and the symmetry of $X_i^i$. The moment formula (4.87) for $m_{n,p}$ follows from (4.90) for $\alpha = 3/p$, $\beta = (n - 1)/p$ and $\kappa = 1/p$, and the limit in (4.88) follows from (4.100). Regarding the last claim in (4.88), for $p \ge 1$ Hölder's inequality gives
$$m_{n,p} = E|X_i^i| \le \big(E|X_i^i|^p\big)^{1/p} = \left(\frac{3}{n + 2}\right)^{1/p},$$
while for $0 < p < 1$, we have
$$m_{n,p} = E|X_i^i| = E\left(\frac{G_i + G_i'}{G_{1,n} + G_i'}\right)^{1/p} \le E\left(\frac{G_i + G_i'}{G_{1,n} + G_i'}\right) = \frac{3}{n + 2}.$$
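4. We consider the conditional distribution on the left-hand side of (4.80), and use the representation, and notation $G_{a,b}$, given in (4.83). The second equality below follows from the coordinate-symmetry of $\mathbf{X}$, and the fourth follows since we may replace $G_{1,n}$ by $G_{2,n}/(1 - |a|^p)$ on the conditioning event. Using the notation $a\mathcal{L}(V)$ for the distribution of $aV$, we have
$$\begin{aligned}
\mathcal{L}(X_2, \ldots, X_n \mid X_1 = a)
&= \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{1,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{1,n}}\Big)^{1/p} \,\Big|\, \epsilon_1 \Big(\frac{G_1}{G_{1,n}}\Big)^{1/p} = a\right) \\
&= \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{1,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{1,n}}\Big)^{1/p} \,\Big|\, \Big(\frac{G_1}{G_{1,n}}\Big)^{1/p} = |a|\right) \\
&= \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{1,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{1,n}}\Big)^{1/p} \,\Big|\, \frac{G_{2,n}}{G_{1,n}} = 1 - |a|^p\right) \\
&= \big(1 - |a|^p\big)^{1/p} \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{2,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{2,n}}\Big)^{1/p} \,\Big|\, \frac{G_{2,n}}{G_{1,n}} = 1 - |a|^p\right) \\
&= \big(1 - |a|^p\big)^{1/p} \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{2,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{2,n}}\Big)^{1/p} \,\Big|\, \frac{G_1}{G_{1,n}} = |a|^p\right) \\
&= \big(1 - |a|^p\big)^{1/p} \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{2,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{2,n}}\Big)^{1/p}\right) \\
&= g(a)\, \mathcal{L}\left(\epsilon_2 \Big(\frac{G_2}{G_{2,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{2,n}}\Big)^{1/p}\right).
\end{aligned} \tag{4.103}$$
In the penultimate step we may remove the conditioning on $G_1/G_{1,n}$ since (4.89) and the independence of $G_1$ from all other variables gives that
$$\left(\frac{G_2}{G_{2,n}}, \ldots, \frac{G_n}{G_{2,n}}\right) \quad\text{is independent of}\quad (G_1, G_{2,n})$$
and therefore independent of $G_1/(G_1 + G_{2,n}) = G_1/G_{1,n}$. Regarding the right-hand side of (4.80), using $1 - |X_1|^p = \sum_{i=2}^n |X_i|^p$ and the representation (4.83), we obtain
$$\begin{aligned}
g(a)\, \frac{(X_2, \ldots, X_n)}{g(X_1)} &= g(a)\, \frac{(X_2, \ldots, X_n)}{\big(|X_2|^p + \cdots + |X_n|^p\big)^{1/p}} \\
&= g(a)\, \frac{\Big(\epsilon_2 \big(\frac{G_2}{G_{1,n}}\big)^{1/p}, \ldots, \epsilon_n \big(\frac{G_n}{G_{1,n}}\big)^{1/p}\Big)}{\Big(\frac{G_2}{G_{1,n}} + \cdots + \frac{G_n}{G_{1,n}}\Big)^{1/p}} \\
&= g(a)\, \frac{\big(\epsilon_2 G_2^{1/p}, \ldots, \epsilon_n G_n^{1/p}\big)}{(G_2 + \cdots + G_n)^{1/p}} \\
&= g(a) \left(\epsilon_2 \Big(\frac{G_2}{G_{2,n}}\Big)^{1/p}, \ldots, \epsilon_n \Big(\frac{G_n}{G_{2,n}}\Big)^{1/p}\right),
\end{aligned}$$
matching the distribution (4.103).

In principle, Proposition 4.3 and Theorem 4.1 may be applied to compute bounds to the normal for projections of other coordinate-symmetric vectors when the required couplings, and conditioning, are as tractable as here.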
4.4 Combinatorial Central Limit Theorems

In this section we apply Theorem 4.1 to derive $L^1$ bounds in the combinatorial central limit theorem, that is, for random variables $Y$ of the form
$$Y = \sum_{i=1}^n a_{i,\pi(i)}, \tag{4.104}$$
where $\pi$ is a permutation distributed uniformly over the symmetric group $\mathcal{S}_n$, and $\{a_{ij}\}_{1 \le i,j \le n}$ are the components of a matrix $A \in \mathbb{R}^{n \times n}$. Random variables of this form are of interest in permutation tests. In particular, given a function $d(x, y)$ which in some sense measures the closeness of two observations $x$ and $y$, given values $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ and a putative 'matching' permutation $\tau$ that associates $x_i$ to $y_{\tau(i)}$, one can test whether the level of matching given by $\tau$, as measured by
$$y_\tau = \sum_{i=1}^n a_{i\tau(i)} \quad\text{where } a_{ij} = d(x_i, y_j),$$
is unusually high by seeing how large the matching level $y_\tau$ is relative to that provided by a random matching, that is, by seeing whether $P(Y \ge y_\tau)$ is significantly small. Motivated by these considerations, Wald and Wolfowitz (1944) proved the central limit theorem as $n \to \infty$ when the factorization $a_{ij} = b_i c_j$ holds; Hoeffding (1951) later generalized this result to arrays $\{a_{ij}\}_{1 \le i,j \le n}$. Motoo (1957) gave
Lindeberg-type sufficient conditions for the normal limit to hold. In Sect. 6.1 the $L^\infty$ distance to the normal is considered for the case where $\pi$ is uniformly distributed, and also when its distribution is constant on conjugacy classes of $\mathcal{S}_n$. Letting
$$a_{\bullet\bullet} = \frac{1}{n^2} \sum_{i,j=1}^n a_{ij}, \qquad a_{i\bullet} = \frac{1}{n} \sum_{j=1}^n a_{ij}, \qquad\text{and}\qquad a_{\bullet j} = \frac{1}{n} \sum_{i=1}^n a_{ij},$$
straightforward calculations show that when $\pi$ is uniform over $\mathcal{S}_n$ the mean $\mu_A$ and variance $\sigma_A^2$ of $Y$ are given by
$$\mu_A = n a_{\bullet\bullet} \quad\text{and}\quad \sigma_A^2 = \frac{1}{n - 1} \sum_{i,j} \big(a_{ij}^2 - a_{i\bullet}^2 - a_{\bullet j}^2 + a_{\bullet\bullet}^2\big) = \frac{1}{n - 1} \sum_{i,j} \big(a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}\big)^2. \tag{4.105}$$
For simplicity, writing $\mu$ and $\sigma^2$ for $\mu_A$ and $\sigma_A^2$, respectively, we prove in (4.124) the following equivalent representation for $\sigma^2$,
$$\sigma^2 = \frac{1}{4n^2(n - 1)} \sum_{i,j,k,l} \big[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\big]^2, \tag{4.106}$$
and assume in what follows that $\sigma^2 > 0$ to rule out trivial cases. By (4.106), $\sigma^2 = 0$ if and only if $a_{il} - a_{i\bullet}$ does not depend on $i$, that is, if and only if the difference between any two rows $a_{i\cdot}$ and $a_{j\cdot}$ of $A$ satisfies $a_{i\cdot} - a_{j\cdot} = (a_{i\bullet} - a_{j\bullet})(1, \ldots, 1)$.

For each $n \ge 3$, Theorem 4.8 provides an $L^1$ bound between the standardized version of the variable $Y$ given in (4.104) and the normal, with an explicit constant depending on the third-moment-type quantity
$$\gamma = \gamma_A, \quad\text{where}\quad \gamma_A = \sum_{i,j=1}^n \big|a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}\big|^3. \tag{4.107}$$
When the elements of $A$ are all of comparable order, $\sigma^2$ is of order $n$ and $\gamma$ of order $n^2$, resulting in a bound of order $n^{-1/2}$.

Theorem 4.8 For $n \ge 3$, let $\{a_{ij}\}_{i,j=1}^n$ be the components of a matrix $A \in \mathbb{R}^{n \times n}$, let $\pi$ be a random permutation uniformly distributed over $\mathcal{S}_n$, and let $Y$ be given by (4.104). Then, with $\mu$, $\sigma^2$ given in (4.105), and $\gamma$ given in (4.107), $F$ the distribution function of $W = (Y - \mu)/\sigma$ and $\Phi$ that of the standard normal,
$$\|F - \Phi\|_1 \le \frac{\gamma}{(n - 1)\sigma^3}\left(16 + \frac{56}{n - 1} + \frac{8}{(n - 1)^2}\right).$$
The proof of this theorem depends on a construction of the zero bias variable using an exchangeable pair, which we now describe.
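For small matrices the formulas (4.105), (4.106) and (4.107) can be verified by brute force over all $n!$ permutations. The sketch below is illustrative only; the function names are hypothetical, the bound function simply evaluates the right-hand side of Theorem 4.8 as displayed above, and exact rational arithmetic is used for the moments so that the comparisons are exact.

```python
from fractions import Fraction
from itertools import permutations, product

def clt_moments(a):
    """Return (mu, sigma^2, gamma) from (4.105) and (4.107) for a square matrix a."""
    n = len(a)
    a = [[Fraction(x) for x in row] for row in a]
    a_dd = sum(sum(row) for row in a) / n ** 2                        # overall mean
    a_id = [sum(row) / n for row in a]                                # row means
    a_dj = [sum(a[i][j] for i in range(n)) / n for j in range(n)]     # column means
    centered = [[a[i][j] - a_id[i] - a_dj[j] + a_dd for j in range(n)]
                for i in range(n)]
    mu = n * a_dd
    var = sum(c * c for row in centered for c in row) / (n - 1)
    gamma = sum(abs(c) ** 3 for row in centered for c in row)
    return mu, var, gamma

def var_via_differences(a):
    """Variance computed from the equivalent representation (4.106)."""
    n = len(a)
    total = sum(Fraction((a[i][k] + a[j][l]) - (a[i][l] + a[j][k])) ** 2
                for i, j, k, l in product(range(n), repeat=4))
    return total / (4 * n ** 2 * (n - 1))

def brute_force_moments(a):
    """Mean and variance of Y = sum_i a[i][pi(i)] over all permutations pi."""
    n = len(a)
    vals = [sum(Fraction(a[i][pi[i]]) for i in range(n))
            for pi in permutations(range(n))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

def theorem_4_8_bound(a):
    """Right-hand side of the L1 bound in Theorem 4.8 (requires n >= 3, sigma^2 > 0)."""
    n = len(a)
    _, var, gamma = clt_moments(a)
    sigma3 = float(var) ** 1.5
    return float(gamma) / ((n - 1) * sigma3) * (16 + 56 / (n - 1) + 8 / (n - 1) ** 2)

if __name__ == "__main__":
    a = [[1, 2, 0], [3, 1, 4], [2, 2, 5]]
    print(clt_moments(a), brute_force_moments(a), var_via_differences(a))
    print(theorem_4_8_bound(a))
```

For very small $n$ the bound is far from informative; its value lies in the $n^{-1/2}$ rate when the entries of $A$ are of comparable order.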
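4.4.1 Use of the Exchangeable Pair

We recall that the exchangeable variables $Y', Y''$ form a $\lambda$-Stein pair if
$$E(Y'' \mid Y') = (1 - \lambda)Y' \tag{4.108}$$
for some $0 < \lambda < 1$. When $\mathrm{Var}(Y') = \sigma^2 \in (0, \infty)$, Lemma 2.7 yields
$$EY' = 0 \quad\text{and}\quad E(Y' - Y'')^2 = 2\lambda\sigma^2. \tag{4.109}$$
The following proposition is in some sense a two variable version of Proposition 2.3.

Proposition 4.6 Let $Y', Y''$ be a $\lambda$-Stein pair with $\mathrm{Var}(Y') = \sigma^2 \in (0, \infty)$ and distribution $F(y', y'')$. Then when $Y^\dagger, Y^\ddagger$ have distribution
$$dF^\dagger(y', y'') = \frac{(y' - y'')^2}{2\lambda\sigma^2}\, dF(y', y''), \tag{4.110}$$
and $U \sim \mathcal{U}[0, 1]$ is independent of $Y^\dagger, Y^\ddagger$, the variable
$$Y^* = UY^\dagger + (1 - U)Y^\ddagger \quad\text{has the } Y'\text{-zero biased distribution.} \tag{4.111}$$

Proof For all absolutely continuous functions $f$ for which the expectations below exist,
$$\begin{aligned}
\sigma^2 Ef'(Y^*) &= \sigma^2 Ef'\big(UY^\dagger + (1 - U)Y^\ddagger\big) \\
&= \sigma^2 E\left(\frac{f(Y^\dagger) - f(Y^\ddagger)}{Y^\dagger - Y^\ddagger}\right) \\
&= \frac{1}{2\lambda}\, E\left(\frac{f(Y') - f(Y'')}{Y' - Y''}\, (Y' - Y'')^2\right) \\
&= \frac{1}{2\lambda}\, E\Big(\big(f(Y') - f(Y'')\big)(Y' - Y'')\Big) \\
&= \frac{1}{\lambda}\, E\big(Y' f(Y') - Y'' f(Y')\big) \\
&= \frac{1}{\lambda}\, E\big(Y' f(Y') - (1 - \lambda)Y' f(Y')\big) \\
&= EY' f(Y').
\end{aligned}$$

The following lemma, leading toward the construction of zero bias variables, is motivated by generalizing the framework of Example 2.3, where the Stein pair is a function of some underlying random variables $\xi_\alpha$, $\alpha \in \chi$ and a random index $\mathbf{I}$.

Lemma 4.4 Let $F(y', y'')$ be the distribution of a Stein pair and suppose there exists a distribution
$$F(\mathbf{i}, \xi_\alpha, \alpha \in \chi) \tag{4.112}$$
and an $\mathbb{R}^2$ valued function $(y', y'') = \psi(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$ such that when $\mathbf{I}$ and $\{\Xi_\alpha, \alpha \in \chi\}$ have distribution (4.112) then $(Y', Y'') = \psi(\mathbf{I}, \Xi_\alpha, \alpha \in \chi)$ has distribution $F(y', y'')$. If $\mathbf{I}^\dagger, \{\Xi_\alpha^\dagger, \alpha \in \chi\}$ have distribution
$$dF^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = \frac{(y' - y'')^2}{E(Y' - Y'')^2}\, dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi) \tag{4.113}$$
then the pair
$$(Y^\dagger, Y^\ddagger) = \psi\big(\mathbf{I}^\dagger, \Xi_\alpha^\dagger, \alpha \in \chi\big)$$
has distribution $F^\dagger(y^\dagger, y^\ddagger)$ satisfying
$$dF^\dagger(y', y'') = \frac{(y' - y'')^2}{2\lambda\sigma^2}\, dF(y', y'').$$

Proof For any bounded measurable function $f$
$$\begin{aligned}
Ef(Y^\dagger, Y^\ddagger) &= Ef\big(\psi(\mathbf{I}^\dagger, \Xi_\alpha^\dagger, \alpha \in \chi)\big) \\
&= \int f\big(\psi(\mathbf{i}, \xi_\alpha, \alpha \in \chi)\big)\, dF^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi) \\
&= \int f(y', y'')\, \frac{(y' - y'')^2}{2\lambda\sigma^2}\, dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi) \\
&= E\left(\frac{(Y' - Y'')^2}{2\lambda\sigma^2}\, f(Y', Y'')\right),
\end{aligned}$$
where $(Y', Y'')$ has distribution $F(y', y'')$.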
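We continue building a general framework around Example 2.3, where the random index is chosen independently of the permutation, so their joint distribution factors, leading to
$$dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P(\mathbf{I} = \mathbf{i})\, dF(\xi_\alpha, \alpha \in \chi). \tag{4.114}$$
Moreover, in view of (2.47), that is, that
$$Y'' - Y' = b\big(i, j, \pi(i), \pi(j)\big) \quad\text{where}\quad b(i, j, k, l) = a_{il} + a_{jk} - (a_{ik} + a_{jl}),$$
we will pay special attention to situations where
$$Y'' - Y' = b(\mathbf{I}, \Xi_\alpha, \alpha \in \chi_{\mathbf{I}}) \tag{4.115}$$
where $\mathbf{I}$ and $\chi_{\mathbf{I}}$ are vectors of small dimensions with components in $\mathcal{I}$ and $\chi$, respectively. In other words, we consider situations where the difference between $Y'$ and $Y''$ depends on only a few variables. In such cases, it will be convenient to further decompose $dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$ as
$$dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P(\mathbf{I} = \mathbf{i})\, dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}), \tag{4.116}$$
where $dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ is the marginal distribution of $\xi_\alpha$ for $\alpha \in \chi_{\mathbf{i}}$, and $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ the conditional distribution of $\xi_\alpha$ for $\alpha \notin \chi_{\mathbf{i}}$ given $\xi_\alpha$ for $\alpha \in \chi_{\mathbf{i}}$. One notes, however, that the factorization (4.114) guarantees that the marginal distribution of any $\xi_\alpha$ does not depend on $\mathbf{i}$. In terms of generating variables having the specified distributions for the purposes of coupling, the decomposition (4.116) corresponds to first generating $\mathbf{I}$, then $\{\xi_\alpha, \alpha \in \chi_{\mathbf{I}}\}$, and lastly $\{\xi_\alpha, \alpha \notin \chi_{\mathbf{I}}\}$ conditional on $\{\xi_\alpha, \alpha \in \chi_{\mathbf{I}}\}$. In what follows we will continue the slight abuse of notation of letting $\{\alpha : \alpha \in \chi_{\mathbf{i}}\}$ denote the set of components of the vector $\chi_{\mathbf{i}}$.

We now consider the square bias distribution $F^\dagger$ in (4.113) when the factorization (4.116) of $F$ holds. Letting $\mathbf{I}$ and $\{\Xi_\alpha : \alpha \in \chi\}$ have distribution (4.114), by (4.109), (4.115) and independence we obtain
$$2\lambda\sigma^2 = E(Y' - Y'')^2 = Eb^2(\mathbf{I}, \Xi_\alpha, \alpha \in \chi_{\mathbf{I}}) = \sum_{\mathbf{i} \subset \mathcal{I}} P(\mathbf{I} = \mathbf{i})\, Eb^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}}).$$
In particular, we may define a distribution for a vector of indices $\mathbf{I}^\dagger$ with components in $\mathcal{I}$ by
$$P(\mathbf{I}^\dagger = \mathbf{i}) = \frac{r_{\mathbf{i}}}{2\lambda\sigma^2} \quad\text{with}\quad r_{\mathbf{i}} = P(\mathbf{I} = \mathbf{i})\, Eb^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}}). \tag{4.117}$$
Hence, substituting (4.115) and (4.116) into (4.113),
$$\begin{aligned}
dF^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi)
&= \frac{P(\mathbf{I} = \mathbf{i})\, b^2(\mathbf{i}, \xi_\alpha, \alpha \in \chi_{\mathbf{i}})}{2\lambda\sigma^2}\, dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}) \\
&= \frac{r_{\mathbf{i}}}{2\lambda\sigma^2}\, \frac{b^2(\mathbf{i}, \xi_\alpha, \alpha \in \chi_{\mathbf{i}})}{Eb^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}})}\, dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}) \\
&= P(\mathbf{I}^\dagger = \mathbf{i})\, dF_{\mathbf{i}}^\dagger(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}),
\end{aligned} \tag{4.118}$$
where
$$dF_{\mathbf{i}}^\dagger(\xi_\alpha, \alpha \in \chi_{\mathbf{i}}) = \frac{b^2(\mathbf{i}, \xi_\alpha, \alpha \in \chi_{\mathbf{i}})}{Eb^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}})}\, dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}}). \tag{4.119}$$
Definition (4.119) represents $dF^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$ in a manner parallel to (4.116) for $dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$. This representation gives the parallel construction of variables $\mathbf{I}^\dagger, \{\Xi_\alpha^\dagger, \alpha \in \chi\}$ with distribution $dF^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$ as follows. First generate $\mathbf{I}^\dagger$ according to the distribution $P(\mathbf{I}^\dagger = \mathbf{i})$. Then, when $\mathbf{I}^\dagger = \mathbf{i}$, generate $\{\Xi_\alpha^\dagger, \alpha \in \chi_{\mathbf{i}}\}$ according to $dF_{\mathbf{i}}^\dagger(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ and then $\{\Xi_\alpha^\dagger, \alpha \notin \chi_{\mathbf{i}}\}$ according to $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}})$. As this last factor is the same as the last factor in (4.116) an opportunity for coupling is presented. In particular, it may be possible to set $\Xi_\alpha^\dagger$ equal to $\Xi_\alpha$ for many $\alpha \notin \chi_{\mathbf{i}}$, thus making the pair $Y^\dagger, Y^\ddagger$ close to $Y', Y''$.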
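To make the two-stage recipe concrete, here is a tentative sketch for the permutation setting. It rests on assumptions not spelled out at this point in the text: the index is a pair $(i, j)$ uniform over distinct pairs, the associated variables are $(\pi(i), \pi(j)) = (k, l)$, uniform over distinct pairs, and $b$ is the function $b(i, j, k, l) = a_{il} + a_{jk} - (a_{ik} + a_{jl})$. Under those assumptions the first two factors of (4.118) combine into a probability mass function on quadruples proportional to $b^2(i, j, k, l)$, and by (4.106) the normalizing constant is $4n^2(n - 1)\sigma^2$. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def b(a, i, j, k, l):
    """b(i, j, k, l) = a_il + a_jk - (a_ik + a_jl); vanishes when i == j or k == l."""
    return a[i][l] + a[j][k] - (a[i][k] + a[j][l])

def variance(a):
    """sigma^2 from (4.105), via the doubly centered matrix."""
    n = len(a)
    a_dd = Fraction(sum(sum(row) for row in a), n ** 2)
    a_id = [Fraction(sum(row), n) for row in a]
    a_dj = [Fraction(sum(a[i][j] for i in range(n)), n) for j in range(n)]
    return sum((a[i][j] - a_id[i] - a_dj[j] + a_dd) ** 2
               for i in range(n) for j in range(n)) / (n - 1)

def square_bias_pmf(a):
    """Return (pmf, total): pmf[(i, j, k, l)] proportional to b^2, and the normalizer."""
    n = len(a)
    weights = {idx: Fraction(b(a, *idx)) ** 2
               for idx in product(range(n), repeat=4)}
    total = sum(weights.values())          # assumed positive, i.e. sigma^2 > 0
    pmf = {idx: w / total for idx, w in weights.items() if w != 0}
    return pmf, total

if __name__ == "__main__":
    a = [[1, 2, 0], [3, 1, 4], [2, 2, 5]]
    pmf, total = square_bias_pmf(a)
    print(total, 4 * len(a) ** 2 * (len(a) - 1) * variance(a))
```

Sampling a quadruple from `pmf` (for instance with `random.choices` over its keys and float weights) would then play the role of generating the biased index together with the biased values at that index; the remaining permutation values still have to be filled in conditionally, as the text describes.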
4.4.2 Construction and Bounds for the Combinatorial Central Limit Theorem

In this section we prove Theorem 4.8 by specializing the construction given in Sect. 4.4.1 to handle the combinatorial central limit theorem, and then applying Theorem 4.1. Recall that by (2.45) we may, without loss of generality, replace $a_{ij}$ by $a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}$, and assume
$$a_{i\cdot} = a_{\cdot j} = a_{\cdot\cdot} = 0, \tag{4.120}$$
noting that by doing so we may now write
$$W = Y/\sigma, \tag{4.121}$$
and that (4.107) becomes $\gamma = \sum_{ij} |a_{ij}|^3$. Now, denoting $Y'$ and $\pi'$ by $Y$ and $\pi$, respectively, when convenient, the construction given in Example 2.3 applies. That is, given $\pi$, uniform over $S_n$, take $(I, J)$ independent of $\pi$ with a uniform distribution over all distinct pairs in $\{1, \dots, n\}$, in other words, with distribution
$$p_1(i, j) = \frac{1}{(n)_2}\, \mathbf{1}(i \neq j). \tag{4.122}$$
Letting $\tau_{ij}$ be the permutation which transposes $i$ and $j$, set $\pi'' = \pi\tau_{I,J}$ and let $Y''$ be given by (4.104) with $\pi''$ replacing $\pi$. Example 2.3 shows that $(Y', Y'')$ is a $2/(n-1)$-Stein pair, and (2.48) gives
$$Y' - Y'' = (a_{I,\pi(I)} + a_{J,\pi(J)}) - (a_{I,\pi(J)} + a_{J,\pi(I)}). \tag{4.123}$$
In particular, averaging over $I$, $J$, $\pi(I)$ and $\pi(J)$ we now obtain (4.106) as follows, using (4.109) for the second equality,
$$\frac{1}{n^2(n-1)^2} \sum_{i,j,k,l} \big[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\big]^2 = E(Y' - Y'')^2 = 2\lambda\sigma^2 = \frac{4\sigma^2}{n-1}. \tag{4.124}$$
We first demonstrate an intermediate result before presenting a coupling construction of $Y', Y''$ to $Y^\dagger, Y^\ddagger$, leading to a coupling of $Y$ and $Y^*$.

Lemma 4.5 Let $\pi$ be chosen uniformly from $S_n$ and suppose $i \neq j$ and $k \neq l$ are elements of $\{1, \dots, n\}$. Then
$$\pi^\dagger = \begin{cases}
\pi\tau_{\pi^{-1}(k), j} & \text{if } l = \pi(i),\ k \neq \pi(j), \\
\pi\tau_{\pi^{-1}(l), i} & \text{if } l \neq \pi(i),\ k = \pi(j), \\
\pi\tau_{\pi^{-1}(k), i}\,\tau_{\pi^{-1}(l), j} & \text{otherwise},
\end{cases} \tag{4.125}$$
is a permutation that satisfies
$$\pi^\dagger(m) = \pi(m) \quad\text{for all } m \notin \{i, j, \pi^{-1}(k), \pi^{-1}(l)\}, \tag{4.126}$$
$$\{\pi^\dagger(i), \pi^\dagger(j)\} = \{k, l\}, \tag{4.127}$$
and
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\}\big) = \frac{1}{(n-2)!} \tag{4.128}$$
for all distinct $\xi_m^\dagger$, $m \notin \{i, j\}$ with $\xi_m^\dagger \notin \{k, l\}$.

Proof That $\pi^\dagger$ satisfies (4.126) is clear from its definition. To show (4.127) and that $\pi^\dagger$ is a permutation, let $A_1$, $A_2$ and $A_3$ denote the three cases of (4.125) in their respective order. Clearly under $A_1$ we have $\pi^\dagger(t) = \pi(t)$ for all $t \notin \{j, \pi^{-1}(k)\}$. Hence, as $i \neq j$ and $i = \pi^{-1}(l) \neq \pi^{-1}(k)$, we have $\pi^\dagger(i) = \pi(i) = l$. Also,
$$\pi^\dagger(j) = \pi\tau_{\pi^{-1}(k), j}(j) = \pi\big(\pi^{-1}(k)\big) = k,$$
showing (4.127) holds on $A_1$. As $\pi^\dagger(\pi^{-1}(k)) = \pi(j)$, both $\pi$ and $\pi^\dagger$ map the set $\{j, \pi^{-1}(k)\}$ to $\{\pi(j), k\}$, and, as their images agree on $\{j, \pi^{-1}(k)\}^c$, we conclude that $\pi^\dagger$ is a permutation on $A_1$. As $A_2$ becomes $A_1$ upon interchanging $i$ with $j$ and $k$ with $l$, these conclusions hold also on $A_2$.

Under $A_3$, either $l = \pi(i)$, $k = \pi(j)$ or $l \neq \pi(i)$, $k \neq \pi(j)$. In the first instance $\pi^\dagger = \pi$, so $\pi^\dagger$ is a permutation, and (4.127) is immediate. Otherwise, as $i \neq j$ and $i \neq \pi^{-1}(l)$, we have
$$\pi^\dagger(i) = \pi\tau_{\pi^{-1}(k), i}\,\tau_{\pi^{-1}(l), j}(i) = \pi\tau_{\pi^{-1}(k), i}(i) = \pi\big(\pi^{-1}(k)\big) = k$$
and similarly, as $j \neq i$ and $j \neq \pi^{-1}(k)$,
$$\pi^\dagger(j) = \pi\tau_{\pi^{-1}(k), i}\,\tau_{\pi^{-1}(l), j}(j) = \pi\tau_{\pi^{-1}(k), i}\big(\pi^{-1}(l)\big), \tag{4.129}$$
and now, as $l \neq k$ and $l \neq \pi(i)$,
$$\pi\tau_{\pi^{-1}(k), i}\big(\pi^{-1}(l)\big) = \pi\big(\pi^{-1}(l)\big) = l,$$
so (4.127) holds under $A_3$. As both $\pi$ and $\pi^\dagger$ map $\{i, j, \pi^{-1}(k), \pi^{-1}(l)\}$ to $\{\pi(i), \pi(j), k, l\}$, and agree on $\{i, j, \pi^{-1}(k), \pi^{-1}(l)\}^c$, we conclude that $\pi^\dagger$ is a permutation on $A_3$.

We now turn our attention to (4.128). Let $\xi_m^\dagger$, $m \notin \{i, j\}$ be distinct and satisfy $\xi_m^\dagger \notin \{k, l\}$. Under $A_1$ we have $k \neq \pi(j)$, and have shown that $i \neq \pi^{-1}(k)$. Hence $\pi^{-1}(k) \notin \{i, j\}$ and therefore $\xi^\dagger_{\pi^{-1}(k)} \notin \{k, l\}$. Setting $\xi_i^\dagger = l$, we have
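$$\begin{aligned}
P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_1\big)
&= P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ \pi(i) = l,\ \pi(j) \neq k\big) \\
&= P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{j\},\ \pi(j) \neq k\big) \\
&= P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{j, \pi^{-1}(k)\},\ \pi(j) \neq k,\ \pi^\dagger(\pi^{-1}(k)) = \xi^\dagger_{\pi^{-1}(k)}\big) \\
&= P\big(\pi(m) = \xi_m^\dagger,\ m \notin \{j, \pi^{-1}(k)\},\ \pi(j) \neq k,\ \pi(j) = \xi^\dagger_{\pi^{-1}(k)}\big) \\
&= P\big(\pi(m) = \xi_m^\dagger,\ m \notin \{j, \pi^{-1}(k)\},\ \pi(j) = \xi^\dagger_{\pi^{-1}(k)}\big) \\
&= \sum_{q \notin \{i, j\}} P\big(\pi(m) = \xi_m^\dagger,\ m \notin \{j, q\},\ \pi(j) = \xi_q^\dagger,\ \pi(q) = k\big) \\
&= \frac{n-2}{n!}.
\end{aligned}$$
Case $A_2$ being the same upon interchanging $i$ with $j$ and $k$ with $l$, we obtain
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_1 \cup A_2\big) = \frac{2(n-2)}{n!}. \tag{4.130}$$
Under $A_3$ there are subcases depending on
$$R = \big|\{\pi(i), \pi(j)\} \cap \{k, l\}\big|,$$
and we let $A_{3,r} = A_3 \cap \{R = r\}$ for $r = 0, 1, 2$. When $R = 0$ the elements $\pi(i), \pi(j), k, l$ are distinct, and so $A_{3,0} = \{R = 0\}$. Additionally $R = 0$ if and only if the inverse images $i, j, \pi^{-1}(k), \pi^{-1}(l)$ under $\pi$ are also distinct, and so
$$\begin{aligned}
P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_{3,0}\big)
&= P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j, \pi^{-1}(k), \pi^{-1}(l)\}, \\
&\qquad\ \pi^\dagger(\pi^{-1}(k)) = \xi^\dagger_{\pi^{-1}(k)},\ \pi^\dagger(\pi^{-1}(l)) = \xi^\dagger_{\pi^{-1}(l)},\ A_{3,0}\big) \\
&= P\big(\pi(m) = \xi_m^\dagger,\ m \notin \{i, j, \pi^{-1}(k), \pi^{-1}(l)\}, \\
&\qquad\ \pi(i) = \xi^\dagger_{\pi^{-1}(k)},\ \pi(j) = \xi^\dagger_{\pi^{-1}(l)},\ A_{3,0}\big) \\
&= \sum_{\{q, r\}:\ |\{q, r, i, j\}| = 4} P\big(\pi(m) = \xi_m^\dagger,\ m \notin \{i, j, q, r\}, \\
&\qquad\ \pi(i) = \xi_q^\dagger,\ \pi(j) = \xi_r^\dagger,\ \pi(q) = k,\ \pi(r) = l\big) \\
&= \frac{(n-2)(n-3)}{n!}.
\end{aligned} \tag{4.131}$$
Considering the case $R = 1$, in view of (4.125) we find
$$A_{3,1} = A_3 \cap \{R = 1\} = A_{3,1a} \cup A_{3,1b},$$
where
$$A_{3,1a} = \{\pi(i) = k,\ \pi(j) \neq l\} \quad\text{and}\quad A_{3,1b} = \{\pi(i) \neq k,\ \pi(j) = l\}.$$
Since by appropriate relabeling each of these cases becomes $A_1$, we have
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_{3,1}\big) = \frac{2(n-2)}{n!}. \tag{4.132}$$
For $R = 2$ we have $A_{3,2} = A_{3,2a} \cup A_{3,2b}$ where
$$A_{3,2a} = \{\pi(i) = l,\ \pi(j) = k\} \quad\text{and}\quad A_{3,2b} = \{\pi(j) = l,\ \pi(i) = k\}.$$
Under $A_{3,2a}$,
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_{3,2a}\big) = P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ \pi(i) = l,\ \pi(j) = k\big) = \frac{1}{n!},$$
and the same holding for $A_{3,2b}$, by symmetry, yields
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\},\ A_{3,2}\big) = \frac{2}{n!}. \tag{4.133}$$
Summing the contributions from (4.130), (4.131), (4.132) and (4.133) we obtain
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{i, j\}\big) = \frac{4(n-2)}{n!} + \frac{(n-2)(n-3)}{n!} + \frac{2}{n!} = \frac{1}{(n-2)!}$$
as claimed.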
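The three conclusions of Lemma 4.5 can be checked by exhaustive enumeration for small $n$. The following is only an illustrative sketch: the function names (`transposition`, `compose`, `pi_dagger`, `check_lemma_4_5`) are hypothetical, indices run over `0..n-1` rather than `1..n`, and products of permutations are read right to left, so that `compose(p, q)[m] == p[q[m]]`.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def transposition(n, a, b):
    """Permutation of range(n) swapping a and b (the identity when a == b)."""
    t = list(range(n))
    t[a], t[b] = t[b], t[a]
    return t


def compose(*perms):
    """compose(p, q, r)[m] == p[q[r[m]]], matching the product notation p q r."""
    out = list(range(len(perms[0])))
    for p in reversed(perms):
        out = [p[m] for m in out]
    return tuple(out)


def pi_dagger(pi, i, j, k, l):
    """The permutation of (4.125), for i != j and k != l."""
    n = len(pi)
    inv_k, inv_l = pi.index(k), pi.index(l)
    if pi[i] == l and pi[j] != k:      # case A1
        return compose(pi, transposition(n, inv_k, j))
    if pi[i] != l and pi[j] == k:      # case A2
        return compose(pi, transposition(n, inv_l, i))
    # case A3 ("otherwise")
    return compose(pi, transposition(n, inv_k, i), transposition(n, inv_l, j))


def check_lemma_4_5(n):
    """Verify (4.126), (4.127) and (4.128) for every admissible i, j, k, l."""
    for i, j in permutations(range(n), 2):
        for k, l in permutations(range(n), 2):
            counts = Counter()
            for pi in permutations(range(n)):
                pd = pi_dagger(pi, i, j, k, l)
                moved = (i, j, pi.index(k), pi.index(l))
                assert sorted(pd) == list(range(n))            # a permutation
                assert all(pd[m] == pi[m] for m in range(n) if m not in moved)
                assert {pd[i], pd[j]} == {k, l}                # (4.127)
                counts[tuple(pd[m] for m in range(n) if m not in (i, j))] += 1
            # (4.128): each of the (n-2)! admissible value patterns is hit by
            # exactly n!/(n-2)! = n(n-1) of the n! equally likely permutations.
            assert len(counts) == factorial(n - 2)
            assert set(counts.values()) == {n * (n - 1)}
    return True
```

Running `check_lemma_4_5(4)` walks through all 144 index choices and all 24 permutations; it is a finite sanity check, not a substitute for the proof.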
The following lemma shows how to choose the 'special' indices in Lemma 4.5 to form the square bias, and hence, zero bias, distributions. In addition, as values of the $\pi^\dagger$ permutation can be made to coincide with those of a given $\pi$ using (4.125), a coupling of these variables on the same space is achieved. Before stating the lemma we note that (4.134) is a distribution by virtue of (4.106).

Lemma 4.6 Let
$$Y = \sum_{i=1}^n a_{i,\pi(i)}$$
with $\pi$ chosen uniformly from $S_n$, and let $(I^\dagger, J^\dagger, K^\dagger, L^\dagger)$ be independent of $\pi$ with distribution
$$p_2(i, j, k, l) = \frac{\big[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\big]^2}{4n^2(n-1)\sigma^2}. \tag{4.134}$$
Further, let $\pi^\dagger$ be constructed from $\pi$ as in (4.125) with $I^\dagger$, $J^\dagger$, $K^\dagger$ and $L^\dagger$ replacing $i$, $j$, $k$ and $l$, respectively, and $\pi^\ddagger = \pi^\dagger\tau_{I^\dagger, J^\dagger}$. Then
$$\pi(i) = \pi^\dagger(i) = \pi^\ddagger(i) \quad\text{for all } i \notin \mathcal{I}, \tag{4.135}$$
where $\mathcal{I} = \{I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger)\}$, the variables
$$Y^\dagger = \sum_{i=1}^n a_{i,\pi^\dagger(i)} \quad\text{and}\quad Y^\ddagger = \sum_{i=1}^n a_{i,\pi^\ddagger(i)} \tag{4.136}$$
have the square bias distribution (4.113), and with $U$ a uniform variable on $[0, 1]$, independent of all other variables, $Y^* = UY^\dagger + (1-U)Y^\ddagger$ has the $Y$-zero bias distribution.
Proof The claim (4.135) follows from (4.126) and the definition of $\pi^\ddagger$. When $\mathbf{I} = (I, J)$ is independent of $\pi$ with distribution (4.122), $\chi = \{1, \dots, n\}$ and $\Xi_\alpha = \pi(\alpha)$ for $\alpha \in \chi$, let $\psi$ be the $\mathbb{R}^2$ valued function of $\{\mathbf{I}, \Xi_\alpha, \alpha \in \chi\}$ which yields the exchangeable pair $Y', Y''$ in Example 2.3. In view of Lemma 4.6, to prove the remainder of the claims it suffices to verify the hypotheses of Lemma 4.4, that is, with $\mathbf{I}^\dagger = (I^\dagger, J^\dagger)$ that $\{\mathbf{I}^\dagger, \Xi_\alpha^\dagger, \alpha \in \chi\}$, or equivalently $\{\mathbf{I}^\dagger, \pi^\dagger(\alpha), \alpha \in \chi\}$, has distribution (4.113). Relying on the discussion following Lemma 4.4, we prove this latter claim by considering the factorization (4.116) of $dF(\mathbf{i}, \xi_\alpha, \alpha \in \chi)$ and showing that $\{\mathbf{I}^\dagger, \pi^\dagger(\alpha), \alpha \in \chi\}$ follows the corresponding square bias distribution (4.118).

With $\mathbf{i} = (i, j)$ and $P(\mathbf{I} = \mathbf{i})$ already specified by (4.122), we identify the remaining parts of the factorization (4.116) by noting that the distribution $dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}}) = dF_{\mathbf{i}}(\xi_i, \xi_j)$ of the images of $i$ and $j$ under $\pi$ is uniform over all $\xi_i \neq \xi_j$, and, for such $\xi_i, \xi_j$, $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \{i, j\} \mid \xi_i, \xi_j)$ is uniform over all distinct elements $\xi_\alpha$, $\alpha \in \chi$ that do not intersect $\{\xi_i, \xi_j\}$, that is, for such values
$$dF_{\mathbf{i}^c|\mathbf{i}}\big(\xi_\alpha, \alpha \notin \{i, j\} \mid \xi_i, \xi_j\big) = \frac{1}{(n-2)!}. \tag{4.137}$$
Now consider the corresponding factorization (4.118). First, this expression specifies the joint distribution of the values $\mathbf{I}^\dagger$ and their images $\Xi_\alpha^\dagger$, $\alpha \in \mathbf{I}^\dagger$ under $\pi^\dagger$ by
$$P(\mathbf{I}^\dagger = \mathbf{i})\, dF_{\mathbf{i}}^\dagger(\xi_\alpha, \alpha \in \chi_{\mathbf{i}}) = \frac{P(\mathbf{I} = \mathbf{i})}{2\lambda\sigma^2}\, b^2(\mathbf{i}, \xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}}), \tag{4.138}$$
where from (2.47) for the difference $Y' - Y''$ we have
$$b(i, j, \xi_i, \xi_j) = (a_{i,\xi_i} + a_{j,\xi_j}) - (a_{i,\xi_j} + a_{j,\xi_i}). \tag{4.139}$$
Since the distribution (4.122) of $\mathbf{I}$ is uniform over the range where $i \neq j$, and for such distinct $i$ and $j$, the distribution $dF_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ is uniform over all distinct choices of images $\xi_i$ and $\xi_j$, we conclude that the joint distribution (4.138) of $\mathbf{I}^\dagger$ and their 'biased permutation images' $(\Xi^\dagger_{I^\dagger}, \Xi^\dagger_{J^\dagger})$ is proportional to $\mathbf{1}_{i \neq j,\, k \neq l}\, b^2(i, j, k, l)$. This is exactly the distribution $p_2(i, j, k, l)$ from which $I^\dagger, J^\dagger, K^\dagger, L^\dagger$ is chosen.

In addition, the values $\{K^\dagger, L^\dagger\}$ are the images of $\{I^\dagger, J^\dagger\}$ under the permutation $\pi^\dagger$ constructed as specified in the statement of the lemma, as follows. By (4.134), $I^\dagger \neq J^\dagger$ and $K^\dagger \neq L^\dagger$ with probability one. As $\{I^\dagger, J^\dagger, K^\dagger, L^\dagger\}$ and $\pi$ are independent, the construction and conclusions of Lemma 4.5 apply, conditional on these indices. Invoking Lemma 4.5, $\pi^\dagger$ is a permutation that maps $\{I^\dagger, J^\dagger\}$ to $\{K^\dagger, L^\dagger\}$. To show that the remaining values are distributed according to $dF_{\mathbf{i}^c|\mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}})$, again by Lemma 4.5, if $\xi_m^\dagger$, $m \notin \{I^\dagger, J^\dagger\}$ are distinct values not lying in $\{K^\dagger, L^\dagger\}$, then
$$P\big(\pi^\dagger(m) = \xi_m^\dagger,\ m \notin \{I^\dagger, J^\dagger\} \mid I^\dagger, J^\dagger, K^\dagger, L^\dagger\big) = \frac{1}{(n-2)!}. \tag{4.140}$$
As (4.140) agrees with (4.137), the proof of the lemma is complete.
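For a small array the square bias claim of Lemma 4.6 can be confirmed by brute force. The sketch below is an illustration under stated assumptions, not the authors' construction verbatim: the helper names are hypothetical, indices are `0..n-1`, the array need not satisfy (4.120) because both sides are shifted by the same constant, and the normalizing constant is computed by summation instead of using $4n^2(n-1)\sigma^2$. It compares the exact law of $(Y^\dagger, Y^\ddagger)$ with the law of $(Y', Y'')$ reweighted by $(Y' - Y'')^2$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def swap(perm, a, b):
    """perm composed with the transposition of a and b: m -> perm[tau(m)]."""
    out = list(perm)
    out[a], out[b] = perm[b], perm[a]
    return tuple(out)


def pi_dagger(pi, i, j, k, l):
    """The permutation of (4.125), for i != j and k != l."""
    inv_k, inv_l = pi.index(k), pi.index(l)
    if pi[i] == l and pi[j] != k:
        return swap(pi, inv_k, j)
    if pi[i] != l and pi[j] == k:
        return swap(pi, inv_l, i)
    return swap(swap(pi, inv_k, i), inv_l, j)


def total(a, perm):
    """Y = sum_i a[i][perm[i]]."""
    return sum(a[m][perm[m]] for m in range(len(perm)))


def normalize(weights):
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items() if w}


def square_bias_of_stein_pair(a):
    """Law of (Y', Y'') reweighted by (Y' - Y'')^2, as in (4.113)."""
    n = len(a)
    weights = defaultdict(Fraction)
    for pi in permutations(range(n)):
        for i, j in permutations(range(n), 2):
            y1, y2 = total(a, pi), total(a, swap(pi, i, j))
            weights[(y1, y2)] += Fraction((y1 - y2) ** 2)
    return normalize(weights)


def law_of_dagger_pair(a):
    """Law of (Y-dagger, Y-double-dagger) built as in Lemma 4.6."""
    n = len(a)
    weights = defaultdict(Fraction)
    for i, j in permutations(range(n), 2):
        for k, l in permutations(range(n), 2):
            # unnormalized p_2(i, j, k, l) of (4.134)
            p2 = ((a[i][k] + a[j][l]) - (a[i][l] + a[j][k])) ** 2
            for pi in permutations(range(n)):
                pd = pi_dagger(pi, i, j, k, l)
                weights[(total(a, pd), total(a, swap(pd, i, j)))] += Fraction(p2)
    return normalize(weights)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the two laws can be compared with `==` rather than up to rounding.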
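Note that in general even when $\mathbf{I}$ is uniformly distributed, the index $\mathbf{I}^\dagger$ need not be. In fact, from (4.117) it is clear that when $\mathbf{I}$ is uniform the distribution of $\mathbf{I}^\dagger$ is given by $P(\mathbf{I}^\dagger = \mathbf{i}) = 0$ for all $\mathbf{i}$ such that $P(\mathbf{I} = \mathbf{i}) = 0$, and otherwise
$$P(\mathbf{I}^\dagger = \mathbf{i}) = \frac{E b^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}})}{\sum_{\mathbf{i}} E b^2(\mathbf{i}, \Xi_\alpha, \alpha \in \chi_{\mathbf{i}})}. \tag{4.141}$$
In particular, the distribution (4.134) selects the indices $\mathbf{I}^\dagger = (I^\dagger, J^\dagger)$ jointly with their 'biased permutation' images $(K^\dagger, L^\dagger)$ with probability that preferentially makes the squared difference large. One can see this effect directly by calculating the marginal distribution of $I^\dagger, J^\dagger$, which, by (4.141), is proportional to the sum over $k, l$ of $[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})]^2$. Expanding and applying (4.120) yields
$$\begin{aligned}
\sum_{k,l} \big[(a_{ik} + a_{jl}) - (a_{il} + a_{jk})\big]^2
&= 2 \sum_{k,l} \big(a_{ik}^2 + a_{jl}^2 - a_{ik}a_{jk} - a_{jl}a_{il}\big) \\
&= 2n \sum_{k=1}^n (a_{ik} - a_{jk})^2,
\end{aligned}$$
and hence the generally nonuniform distribution
$$P(I^\dagger = i, J^\dagger = j) = \frac{\sum_{k=1}^n (a_{ik} - a_{jk})^2}{2n(n-1)\sigma^2}.$$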
With the construction of the zero bias variable now in hand, Theorem 4.8 follows from Lemma 4.6, Theorem 4.1, (4.10) of Proposition 4.1, and the following lemma.

Lemma 4.7 For $Y$ and $Y^*$ constructed as in Lemma 4.6,
$$\big\|\mathcal{L}(Y^*) - \mathcal{L}(Y)\big\|_1 \le \frac{\gamma}{(n-1)\sigma^2} \left(8 + \frac{28}{n-1} + \frac{4}{(n-1)^2}\right).$$
With $\pi$ and the indices $\{I^\dagger, J^\dagger, K^\dagger, L^\dagger\}$ constructed as in Lemma 4.6 the calculation of the bound proceeds by decomposing $V = Y^* - Y$ as
$$V = V\mathbf{1}_2 + V\mathbf{1}_1 + V\mathbf{1}_0,$$
where $\mathbf{1}_k = \mathbf{1}(R = k)$ with $R = |\{\pi(I^\dagger), \pi(J^\dagger)\} \cap \{K^\dagger, L^\dagger\}|$. The three factors give rise to the three terms of the bound. The proof of the lemma, though not difficult, requires some attention to detail, and can be found in the Appendix to this chapter.
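4.5 Simple Random Sampling

Theorem 4.9 gives an $L^1$ bound for the exchangeable pair coupling. After proving the theorem, we will record a corollary and use it to prove an $L^1$ bound for simple random sampling. Recall that $(Y, Y')$ is a $\lambda$-Stein pair for $\lambda \in (0, 1)$ if $(Y, Y')$ are exchangeable and satisfy the linear regression condition
$$E(Y' \mid Y) = (1 - \lambda)Y. \tag{4.142}$$

Theorem 4.9 Let $W, W'$ be a mean zero, variance 1, $\lambda$-Stein pair. Then if $F$ is the distribution function of $W$,
$$\|F - \Phi\|_1 \le \sqrt{\frac{2}{\pi}}\, E\left| E\left(1 - \frac{(W' - W)^2}{2\lambda} \,\Big|\, W\right) \right| + \frac{1}{2\lambda} E|W' - W|^3.$$

Proof Letting $\Delta = W - W'$, the result follows directly from Proposition 2.4 and Lemma 2.7, the latter of which shows that identity (2.76) is satisfied with $R = 0$, $\hat{K}(t)$ given by (2.38), $\hat{K}_1 = E(\Delta^2 \mid W)/2\lambda$ by (2.39), and
$$\hat{K}_2 = \frac{|\Delta|}{2\lambda} \left( \mathbf{1}_{\{-\Delta \le 0\}} \int_{-\Delta}^{0} (-t)\,dt + \mathbf{1}_{\{-\Delta > 0\}} \int_{0}^{-\Delta} t\,dt \right) = \frac{|\Delta|}{2\lambda} \left( \mathbf{1}_{\{-\Delta \le 0\}} \frac{\Delta^2}{2} + \mathbf{1}_{\{-\Delta > 0\}} \frac{\Delta^2}{2} \right) = \frac{|\Delta|^3}{4\lambda}.$$

In many applications calculation of the expectation of the absolute value of the conditional expectation may be difficult. However, by (2.34) we have
$$E\left(\frac{(W' - W)^2}{2\lambda}\right) = 1 \quad\text{so that}\quad E\left[E\left(1 - \frac{(W' - W)^2}{2\lambda} \,\Big|\, W\right)\right] = 0.$$
Hence, by the Cauchy–Schwarz inequality,
$$E\left| E\left(1 - \frac{(W' - W)^2}{2\lambda} \,\Big|\, W\right) \right| \le \sqrt{\operatorname{Var}\left(E\left(1 - \frac{(W' - W)^2}{2\lambda} \,\Big|\, W\right)\right)} = \frac{1}{2\lambda} \sqrt{\operatorname{Var}\big(E((W' - W)^2 \mid W)\big)}.$$
Though the variance of the conditional expectation $E((W' - W)^2 \mid W)$ may still be troublesome, the inequality
$$\operatorname{Var}\big(E(Y \mid W)\big) \le \operatorname{Var}\big(E(Y \mid \mathcal{F})\big) \quad\text{when } \sigma\{W\} \subset \mathcal{F} \tag{4.143}$$
often leads to the computation of a tractable bound, and provides estimates which result in the optimal rate. To show (4.143), first note that the conditional variance formula, for any $X$, yields
$$\operatorname{Var}\big(E(X \mid W)\big) \le E\operatorname{Var}(X \mid W) + \operatorname{Var}\big(E(X \mid W)\big) = \operatorname{Var}(X).$$
However, for $X = E(Y \mid \mathcal{F})$ we have
$$E(X \mid W) = E\big(E(Y \mid \mathcal{F}) \mid W\big) = E(Y \mid W),$$
and substituting yields (4.143). Hence we arrive at the following corollary to Theorem 4.9.

Corollary 4.3 Under the assumptions of Theorem 4.9, when $\mathcal{F}$ is any $\sigma$-algebra containing $\sigma\{W\}$,
$$\|F - \Phi\|_1 \le \frac{1}{\lambda} \left( \frac{1}{\sqrt{2\pi}}\, \Psi + \frac{1}{2} E|W' - W|^3 \right),$$
where
$$\Psi = \sqrt{\operatorname{Var}\big(E((W' - W)^2 \mid \mathcal{F})\big)}. \tag{4.144}$$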
We use Corollary 4.3 to prove an $L^1$ bound for the sum of numerical characteristics of a simple random sample, that is, for a sample of a population $\{1, \dots, N\}$ drawn so that all subsets of size $n$, with $0 < n < N$, are equally likely. The limiting normal distribution for simple random sampling was obtained by Wald and Wolfowitz (1944) (see also Madow 1948; Erdős and Rényi 1959a; and Hájek 1960). Let $a_i \in \mathbb{R}$, $i = 1, 2, \dots, N$ denote the characteristic of interest associated with individual $i$, and let $Y$ be the sum of the characteristics $\{X_1, \dots, X_n\}$ of the sampled individuals. One can easily verify that the mean $\mu$ and variance $\sigma^2$ of $Y$ are given by
$$\mu = n\bar{a} \quad\text{and}\quad \sigma^2 = \frac{n(N-n)}{N(N-1)} \sum_{i=1}^N (a_i - \bar{a})^2, \quad\text{where}\quad \bar{a} = \frac{1}{N} \sum_{i=1}^N a_i. \tag{4.145}$$
As we are interested in bounds to the normal for the standardized variable $(Y - \mu)/\sigma$, by replacing $a$ by $(a - \bar{a})/\sqrt{\sum_{b \in \mathcal{A}} (b - \bar{a})^2}$ we may assume in what follows without loss of generality that
$$\bar{a} = 0 \quad\text{and}\quad \sum_{i=1}^N a_i^2 = 1. \tag{4.146}$$
For $m = 1, \dots, n$ let $(n)_m = n(n-1)\cdots(n-m+1)$, the falling factorial of $n$, and
$$f_m = \frac{(n)_m}{(N)_m}. \tag{4.147}$$
Theorem 4.10 Let the numerical characteristics $\mathcal{A} = \{a_i, i = 1, 2, \dots, N\}$ of a population of size $N$ satisfy (4.146), and let $Y$ be the sum of characteristics in a simple random sample of size $n$ from $\mathcal{A}$ with $1 < n < N$. Let
$$\lambda = \frac{N}{n(N-n)}, \quad \sigma^2 = \frac{n(N-n)}{N(N-1)}, \quad A_4 = \sum_{a \in \mathcal{A}} a^4, \quad\text{and}\quad \gamma = \sum_{a \in \mathcal{A}} |a|^3. \tag{4.148}$$
Then with $F$ the distribution function of $Y/\sigma$,
$$\|F - \Phi\|_1 \le \frac{1}{\lambda} \left( \frac{R_1}{\sqrt{2\pi}} + \frac{R_2}{2} \right),$$
where
$$R_1 = \frac{1}{n} \sqrt{\frac{2}{\sigma^2} S_1 + \frac{8}{\sigma^4 (N-n)^2} S_2}$$
with
$$S_1 = A_4 - \frac{1}{N}, \qquad S_2 = A_4 (f_1 - 7f_2 + 6f_3 - 6f_4) + 3(f_2 - f_3 + f_4) - \sigma^4,$$
and
$$R_2 = 8 f_1 \gamma / \sigma^3.$$
In the usual asymptotic $n$ and $N$ tend to infinity together with the sampling fraction $f_1 = n/N$ bounded away from zero and one; in such cases $\lambda = O(1/n)$ and $f_m = O(1)$. Additionally, if $a \in \mathcal{A}$ satisfy $\sum_{a \in \mathcal{A}} a^2 = 1$ and are of comparable size then $a = O(1/\sqrt{N})$, which implies $A_4 = O(1/n)$ and $\gamma = O(1/\sqrt{n})$. Overall then the bound provided by the theorem in such an asymptotic, which has main contribution from $R_2$, is $O(1/\sqrt{n})$.

Since distinct labels may be appended to $a_i$, $i = 1, \dots, N$, say as a second coordinate which is neglected when taking sums, we may assume in what follows that elements of $\mathcal{A} = \{a_i, i = 1, \dots, N\}$ are distinct. The first main point of attention is the construction of a Stein pair, which can be achieved as follows. Let $X_1, X_2, \dots, X_{n+1}$ be a simple random sample of size $n + 1$ from the population and let $I$ and $I'$ be two distinct indices drawn uniformly from $\{1, \dots, n+1\}$. Now set
$$Y = X_I + T \quad\text{and}\quad Y' = X_{I'} + T \quad\text{where}\quad T = \sum_{i \in \{1, \dots, n+1\} \setminus \{I, I'\}} X_i.$$
As $(X_I, X_{I'}, T) =_d (X_{I'}, X_I, T)$ the variables $Y$ and $Y'$ are exchangeable. By exchangeability and the first condition in (4.146) we have
$$E(X_I \mid Y) = \frac{1}{n} Y \quad\text{and}\quad E(X_{I'} \mid Y) = -\frac{1}{N-n} Y,$$
and therefore
$$E(Y' \mid Y) = E(Y - X_I + X_{I'} \mid Y) = (1 - \lambda)Y,$$
where $\lambda \in (0, 1)$ is given by (4.148); the linearity condition (4.142) is satisfied.
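The formulas (4.145) and the linearity condition just derived can be checked exactly on a small population. This is a minimal sketch with a hypothetical function name `srs_checks`; it conditions on the sampled set rather than on $Y$, which is a finer conditioning, so linearity given the set implies linearity given $Y$. The characteristics are centered inside the function before the Stein pair check, since that step uses $\bar{a} = 0$.

```python
from fractions import Fraction
from itertools import combinations


def srs_checks(a, n):
    """Return three booleans: mean formula, variance formula of (4.145), and
    E(Y' | sample) == (1 - lambda) Y for the centered characteristics."""
    N = len(a)
    a = [Fraction(x) for x in a]
    abar = sum(a) / N
    samples = list(combinations(range(N), n))   # all equally likely subsets

    sums = [sum(a[i] for i in s) for s in samples]
    mean = sum(sums) / len(samples)
    var = sum((y - mean) ** 2 for y in sums) / len(samples)
    mean_ok = mean == n * abar
    var_ok = var == Fraction(n * (N - n), N * (N - 1)) * sum((x - abar) ** 2 for x in a)

    # Stein pair: Y' replaces a uniformly chosen sampled unit by a uniformly
    # chosen unsampled unit, i.e. Y' = Y - X_I + X_I'.
    c = [x - abar for x in a]
    lam = Fraction(N, n * (N - n))
    linear_ok = True
    for s in samples:
        y = sum(c[i] for i in s)
        outside = [b for b in range(N) if b not in s]
        cond_mean = sum(y - c[i] + c[b] for i in s for b in outside) / (n * (N - n))
        linear_ok = linear_ok and cond_mean == (1 - lam) * y
    return mean_ok, var_ok, linear_ok
```

For example, `srs_checks([3, 1, 4, 1, 5, 9], 2)` enumerates all 15 samples of size 2 with $\lambda = 6/8$.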
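Before starting the proof we pause to simplify the required moment calculations for $\mathcal{X} = \{X_1, \dots, X_n\}$, a simple random sample of $\mathcal{A}$. For $m \in \mathbb{N}$, $\{k_1, \dots, k_m\} \subset \mathbb{N}$ and $\mathbf{k} = (k_1, \dots, k_m)$ let
$$[\mathbf{k}] = E\left( \sum_{\{a, b, \dots, c\} \subset \mathcal{X},\ |\{a, b, \dots, c\}| = m} a^{k_1} b^{k_2} \cdots c^{k_m} \right)$$
and
$$\langle \mathbf{k} \rangle = \sum_{\{y_1, \dots, y_m\} \subset \mathcal{A},\ |\{y_1, \dots, y_m\}| = m} y_1^{k_1} y_2^{k_2} \cdots y_m^{k_m}.$$
Now observe that, with $f_m$ given in (4.147),
$$[\mathbf{k}] = f_m \langle \mathbf{k} \rangle. \tag{4.149}$$
As $[\mathbf{k}]$ and $\langle \mathbf{k} \rangle$ are invariant under any permutation of the components of $\mathbf{k}$, we may always use the canonical representation where $k_1 \ge \cdots \ge k_m$. Let $e_j^m$ be the $j$th unit vector in $\mathbb{R}^m$. When the population characteristics satisfy (4.146) we have
$$\langle k_1, \dots, k_{m-1}, 1 \rangle = -\sum_{j=1}^{m-1} \big\langle (k_1, \dots, k_{m-1}) + e_j^{m-1} \big\rangle$$
and
$$\langle k_1, \dots, k_{m-1}, 2 \rangle = \langle k_1, \dots, k_{m-1} \rangle - \sum_{j=1}^{m-1} \big\langle (k_1, \dots, k_{m-1}) + 2e_j^{m-1} \big\rangle.$$
Note then that
$$\begin{aligned}
\langle 2 \rangle &= 1, \\
\langle 3, 1 \rangle &= -\langle 4 \rangle, \\
\langle 2, 2 \rangle &= \langle 2 \rangle - \langle 4 \rangle, \\
\langle 2, 1, 1 \rangle &= -\langle 3, 1 \rangle - \langle 2, 2 \rangle = \langle 4 \rangle - \langle 2 \rangle + \langle 4 \rangle = 2\langle 4 \rangle - \langle 2 \rangle, \\
\langle 1, 1, 1, 1 \rangle &= -3\langle 2, 1, 1 \rangle = -6\langle 4 \rangle + 3\langle 2 \rangle.
\end{aligned} \tag{4.150}$$

Proof of Theorem 4.10 We may assume $n \le N/2$, as otherwise we may replace $Y$, a sample of size $n$ from $\mathcal{A}$, by $-Y$, a sample of size $N - n$; this assumption is used in (4.151). We apply Corollary 4.3, beginning with the first term in the bound. Letting $\mathcal{X} = \{X_j, j \neq I'\}$ and $\mathcal{F} = \sigma(\mathcal{X})$, applying inequality (4.143) yields
$$\begin{aligned}
\operatorname{Var}\big(E((Y' - Y)^2 \mid Y)\big) &\le \operatorname{Var}\big(E((Y' - Y)^2 \mid \mathcal{F})\big) \\
&= \operatorname{Var}\big(E((X_{I'} - X_I)^2 \mid \mathcal{F})\big) \\
&= \operatorname{Var}\big(E(X_{I'}^2 - 2X_{I'}X_I + X_I^2 \mid \mathcal{F})\big).
\end{aligned}$$
For these three conditional expectations,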
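$$E\big(X_{I'}^2 \mid \mathcal{F}\big) = \frac{1}{N-n} \sum_{b \notin \mathcal{X}} b^2, \qquad E(X_{I'} X_I \mid \mathcal{F}) = \frac{1}{n(N-n)} \sum_{a \in \mathcal{X},\, b \notin \mathcal{X}} ab \qquad\text{and}\qquad E\big(X_I^2 \mid \mathcal{F}\big) = \frac{1}{n} \sum_{a \in \mathcal{X}} a^2.$$
By the standardization (4.146) we have
$$\frac{1}{N-n} \sum_{b \notin \mathcal{X}} b^2 = \frac{1}{N-n} \left(1 - \sum_{a \in \mathcal{X}} a^2\right) \quad\text{and}\quad \frac{1}{n(N-n)} \sum_{a \in \mathcal{X}} \sum_{b \notin \mathcal{X}} ab = -\frac{1}{n(N-n)} \left(\sum_{a \in \mathcal{X}} a\right)^2.$$
Hence, using $\operatorname{Var}(U + V) \le 2(\operatorname{Var}(U) + \operatorname{Var}(V))$,
$$\begin{aligned}
\operatorname{Var}\big(E((Y' - Y)^2 \mid Y)\big)
&\le \operatorname{Var}\left( \frac{N - 2n}{n(N-n)} \sum_{a \in \mathcal{X}} a^2 + \frac{2}{n(N-n)} \left(\sum_{a \in \mathcal{X}} a\right)^2 \right) \\
&\le 2 \left( \frac{1}{n^2} \operatorname{Var}\left(\sum_{a \in \mathcal{X}} a^2\right) + \left(\frac{2}{n(N-n)}\right)^2 \operatorname{Var}\left(\left(\sum_{a \in \mathcal{X}} a\right)^2\right) \right).
\end{aligned} \tag{4.151}$$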
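Calculating the first variance in (4.151), using (4.149), we begin with
$$\left(E \sum_{a \in \mathcal{X}} a^2\right)^2 = [2]^2 = \big(f_1 \langle 2 \rangle\big)^2 = f_1^2.$$
Next, note
$$E\left(\sum_{a \in \mathcal{X}} a^2\right)^2 = [4] + [2, 2] = f_1 \langle 4 \rangle + f_2 \langle 2, 2 \rangle = f_1 \langle 4 \rangle + f_2 \big(\langle 2 \rangle - \langle 4 \rangle\big) = \langle 4 \rangle \frac{n(N-n)}{N(N-1)} + f_2,$$
and therefore
$$\operatorname{Var}\left(\sum_{a \in \mathcal{X}} a^2\right) = \frac{n(N-n)}{N(N-1)} \left(\langle 4 \rangle - \frac{1}{N}\right) = \sigma^2 S_1.$$
For the second variance in (4.151), using (4.149) and (4.150) we first obtain the expectation
$$E\left(\sum_{a \in \mathcal{X}} a\right)^2 = [2] + [1, 1] = f_1 - f_2 = \sigma^2. \tag{4.152}$$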
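The first variance computed above, $\operatorname{Var}(\sum_{a \in \mathcal{X}} a^2) = \sigma^2 S_1$, can be confirmed by enumerating all samples of a small standardized population. In this sketch the function name `first_variance_check` and the example population are assumptions chosen so that (4.146) holds exactly in rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations


def first_variance_check(a, n):
    """Compare Var(sum of a^2 over the sample) with sigma^2 * S_1, where
    sigma^2 = n(N-n)/(N(N-1)) and S_1 = A_4 - 1/N as in Theorem 4.10."""
    N = len(a)
    a = [Fraction(x) for x in a]
    assert sum(a) == 0 and sum(x * x for x in a) == 1   # standardization (4.146)

    values = [sum(a[i] ** 2 for i in s) for s in combinations(range(N), n)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)

    sigma2 = Fraction(n * (N - n), N * (N - 1))
    S1 = sum(x ** 4 for x in a) - Fraction(1, N)
    return var, sigma2 * S1


# A population with mean zero and sum of squares one: 2(9 + 16 + 25)/100 = 1.
population = [Fraction(k, 10) for k in (3, -3, 4, -4, 5, -5)]
```

Calling `first_variance_check(population, 2)` returns the enumerated variance and the closed form, which agree exactly.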
Similarly, for the second moment we compute
4 E a = [4] + 4[3, 1] + 3[2, 2] + 3[2, 1, 1] + [1, 1, 1, 1] a∈X
= f1 4 + f2 43, 1 + 32, 2 + f3 32, 1, 1 + f4 1, 1, 1, 1 = 4(f1 − 7f2 + 6f3 − 6f4 ) + 3(f2 − f3 + f4 ).
The variance of this term is now obtained by subtracting the square of the expectation (4.152), resulting in the quantity S2 . Hence, from (4.151),
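$$\operatorname{Var}\big(E((Y' - Y)^2 \mid Y)\big) \le \frac{1}{n^2} \left( 2\sigma^2 S_1 + \frac{8}{(N-n)^2} S_2 \right),$$
and therefore, with $W = Y/\sigma$ and $W' = Y'/\sigma$, we have
$$\sqrt{\operatorname{Var}\big(E((W' - W)^2 \mid W)\big)} = \sqrt{\operatorname{Var}\big(E((Y' - Y)^2 \mid Y)\big)/\sigma^4} \le R_1.$$
Regarding the second term in Corollary 4.3, as
$$E|Y' - Y|^3 = E|X_{I'} - X_I|^3 \le 8E|X_I|^3 \le 8\, \frac{n}{N} \sum_{a \in \mathcal{A}} |a|^3 = 8 f_1 \gamma,$$
we obtain $E|W' - W|^3 \le 8 f_1 \gamma / \sigma^3 = R_2$.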
4.6 Chatterjee's $L^1$ Theorem

The basis of all normal Stein identities is that $Z \sim N(0, 1)$ if and only if
$$E\big[Zf(Z)\big] = E\big[f'(Z)\big] \tag{4.153}$$
for all absolutely continuous functions $f$ for which these expectations exist. For a mean zero, variance one random variable $W$ which may be close to normal, (4.153) may hold approximately, and there may therefore be a related identity which holds exactly for $W$. One way the identity (4.153) may be altered to hold exactly for some given $W$ is to no longer insist that the same variable, $W$, appear on the right hand side as on the left, thus leading to the zero bias identity (2.51),
$$E\big[Wf(W)\big] = E\big[f'(W^*)\big], \tag{4.154}$$
as discussed in Sect. 2.3.3. Insisting that $W$ appear on both sides, one may be led instead to consider identities of the form
$$E\big[Wf(W)\big] = E\big[f'(W)T\big], \tag{4.155}$$
for some random variable $T$, defined on the same space as $W$. When such a $T$ exists, by conditioning we obtain
$$E\big[f'(W^*)\big] = E\big[Wf(W)\big] = E\big[f'(W)T\big] = E\big[f'(W)E(T \mid W)\big],$$
which reveals that
$$E(T \mid W = w) = \frac{dF^*(w)}{dF(w)}$$
is the Radon–Nikodym derivative of the zero bias distribution of $W$ with respect to the distribution of $W$. In particular, as $W^*$ always has an absolutely continuous distribution, for there to exist a $T$ such that (4.155) holds it is necessary for $W$ to be absolutely continuous; naturally, in other cases, considering approximations allows the equality to become relaxed. Identities of the form (4.155), in some generality, were considered in Cacoullos and Papathanasiou (1992), but $T$ was constrained to be a function of $W$. As we will see, much more flexibility is provided by removing this restriction.

Theorem 4.11, of Chatterjee (2008), gives bounds to the normal, in the $L^1$ norm, for a mean zero function $\psi(\mathbf{X})$ of a vector of independent random variables $\mathbf{X} = (X_1, \dots, X_n)$ taking values in some space $\mathcal{X}$. For the identity (4.155), or an approximate form thereof, to be useful, a viable $T$ must be produced. Towards this goal, with $\mathbf{X}'$ an independent copy of $\mathbf{X}$, and $A \subset \{1, \dots, n\}$, let $\mathbf{X}^A$ be the random vector with components
$$X_j^A = \begin{cases} X_j' & j \in A, \\ X_j & j \notin A. \end{cases} \tag{4.156}$$
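For $i \in \{1, \dots, n\}$, writing $i$ for $\{i\}$ when notationally convenient, let
$$\Delta_i \psi(\mathbf{X}) = \psi(\mathbf{X}) - \psi\big(\mathbf{X}^i\big), \tag{4.157}$$
which measures the sensitivity of the function $\psi$ to the values in its $i$th coordinate. Now, for any $A \subset \{1, \dots, n\}$, let
$$T_A = \sum_{i \notin A} \Delta_i \psi(\mathbf{X})\, \Delta_i \psi\big(\mathbf{X}^A\big) \quad\text{and}\quad T = \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{T_A}{\binom{n}{|A|} (n - |A|)}. \tag{4.158}$$

Theorem 4.11 Let $W = \psi(\mathbf{X})$ be a function of a vector of independent random variables $\mathbf{X} = (X_1, \dots, X_n)$, and have mean zero and variance 1. Then, with $\Delta_i$ as defined in (4.157) and $T$ given in (4.158), we have that $ET = 1$ and
$$\big\|\mathcal{L}(W) - \mathcal{L}(Z)\big\|_1 \le \sqrt{2/\pi}\, \big[\operatorname{Var}\big(E(T \mid W)\big)\big]^{1/2} + \frac{1}{2} \sum_{i=1}^n E\big|\Delta_i \psi(\mathbf{X})\big|^3.$$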
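We present the proof, from Chatterjee (2008), at the end of this section. To explore a simple application, let $\psi(\mathbf{X}) = \sum_{i=1}^n X_i$ where $X_1, \dots, X_n$ are independent with mean zero, variances $\sigma_1^2, \dots, \sigma_n^2$ summing to one, and fourth moments $\tau_1, \dots, \tau_n$. For $A \subset \{1, \dots, n\}$ and $i \notin A$,
$$\Delta_i \psi\big(\mathbf{X}^A\big) = \psi\big(\mathbf{X}^A\big) - \psi\big(\mathbf{X}^{A \cup i}\big) = \left( \sum_{j \notin A} X_j + \sum_{j \in A} X_j' \right) - \left( \sum_{j \notin A \cup i} X_j + \sum_{j \in A \cup i} X_j' \right) = X_i - X_i'. \tag{4.159}$$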
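Hence,
$$T_A = \sum_{i \notin A} \Delta_i \psi(\mathbf{X})\, \Delta_i \psi\big(\mathbf{X}^A\big) = \sum_{i \notin A} \big(X_i - X_i'\big)^2,$$
and
$$\begin{aligned}
T &= \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{T_A}{\binom{n}{|A|}(n - |A|)} \\
&= \frac{1}{2} \sum_{a=0}^{n-1} \sum_{A \subset \{1, \dots, n\},\ |A| = a} \frac{T_A}{\binom{n}{a}(n - a)} \\
&= \frac{1}{2} \sum_{a=0}^{n-1} \frac{1}{\binom{n}{a}(n - a)} \sum_{A \subset \{1, \dots, n\},\ |A| = a}\ \sum_{i \notin A} \big(X_i - X_i'\big)^2 \\
&= \frac{1}{2} \sum_{a=0}^{n-1} \frac{1}{\binom{n}{a}(n - a)} \sum_{i=1}^n\ \sum_{A \subset \{1, \dots, n\},\ |A| = a,\ A \not\ni i} \big(X_i - X_i'\big)^2.
\end{aligned}$$
As for each $i \in \{1, \dots, n\}$ there are $\binom{n-1}{a}$ subsets $A$ of size $a$ that do not contain $i$, we obtain
$$\begin{aligned}
T &= \frac{1}{2} \sum_{a=0}^{n-1} \frac{1}{\binom{n}{a}(n - a)} \sum_{i=1}^n\ \sum_{A \subset \{1, \dots, n\},\ |A| = a,\ A \not\ni i} \big(X_i - X_i'\big)^2 \\
&= \frac{1}{2} \sum_{i=1}^n \big(X_i - X_i'\big)^2 \sum_{a=0}^{n-1} \frac{\binom{n-1}{a}}{\binom{n}{a}(n - a)} \\
&= \frac{1}{2} \sum_{i=1}^n \big(X_i - X_i'\big)^2.
\end{aligned}$$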
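The reduction of $T$ to $\frac{1}{2}\sum_i (X_i - X_i')^2$ for the sum function can be reproduced numerically by evaluating (4.158) directly over all proper subsets. This is a sketch only: `mix` and `chatterjee_T` are hypothetical names, coordinates are indexed from 0, and `psi` may be any function of a tuple.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def mix(x, x_prime, A):
    """The vector X^A of (4.156): coordinates in A come from x_prime."""
    return tuple(x_prime[j] if j in A else x[j] for j in range(len(x)))


def chatterjee_T(psi, x, x_prime):
    """Evaluate T of (4.158) for one realization (x, x_prime)."""
    n = len(x)
    total = Fraction(0)
    for size in range(n):                        # proper subsets: |A| != n
        for subset in combinations(range(n), size):
            A = set(subset)
            t_A = sum(
                (psi(x) - psi(mix(x, x_prime, {i})))                          # Delta_i psi(X)
                * (psi(mix(x, x_prime, A)) - psi(mix(x, x_prime, A | {i})))   # Delta_i psi(X^A)
                for i in range(n) if i not in A
            )
            total += Fraction(t_A) / (comb(n, size) * (n - size))
    return total / 2


x, x_prime = (1, -2, 3, 0), (2, 2, -1, 5)
half_sum_of_squares = Fraction(sum((u - v) ** 2 for u, v in zip(x, x_prime)), 2)
```

With `psi = sum`, `chatterjee_T(sum, x, x_prime)` agrees with `half_sum_of_squares`; for other choices of `psi` the same routine evaluates (4.158) without any closed form.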
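For the first term in the theorem, applying the bound (4.143) with $\mathcal{F}$ the $\sigma$-algebra generated by $\{\mathbf{X}, \mathbf{X}'\}$ we obtain
$$\operatorname{Var}\big(E(T \mid W)\big) \le \operatorname{Var}(T) = \frac{1}{4} \sum_{i=1}^n \operatorname{Var}\big((X_i - X_i')^2\big) \le \frac{1}{2} \sum_{i=1}^n \big(\tau_i + 3\sigma_i^4\big).$$
From (4.159),
$$\frac{1}{2} \sum_{i=1}^n E\big|\Delta_i \psi(\mathbf{X})\big|^3 = \frac{1}{2} \sum_{i=1}^n E\big|X_i - X_i'\big|^3 \le \frac{1}{2} \sum_{i=1}^n \big(E(X_i - X_i')^4\big)^{3/4} = \frac{1}{2^{1/4}} \sum_{i=1}^n \big(\tau_i + 3\sigma_i^4\big)^{3/4}.$$
Invoking Theorem 4.11 yields
$$\big\|\mathcal{L}(W) - \mathcal{L}(Z)\big\|_1 \le \sqrt{\frac{1}{\pi} \sum_{i=1}^n \big(\tau_i + 3\sigma_i^4\big)} + \frac{1}{2^{1/4}} \sum_{i=1}^n \big(\tau_i + 3\sigma_i^4\big)^{3/4}.$$
When $X_1, \dots, X_n$ are independent, mean zero variables having common second and fourth moments, say, $\sigma^2$ and $\tau$, respectively, then applying this result to $W = (X_1 + \cdots + X_n)/\sqrt{n}$ yields
$$\big\|\mathcal{L}(W) - \mathcal{L}(Z)\big\|_1 \le n^{-1/2} \left( \Big(\frac{1}{\pi}\big(\tau + 3\sigma^4\big)\Big)^{1/2} + \frac{1}{2^{1/4}} \big(\tau + 3\sigma^4\big)^{3/4} \right).$$
For a different application of Theorem 4.11 we consider normal approximation of quadratic forms. Let $\operatorname{Tr}(A)$ denote the trace of $A$.

Proposition 4.7 Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector of independent variables taking the values $+1, -1$ with equal probability, $A$ a real symmetric matrix and $Y = \sum_{i \le j} a_{ij} X_i X_j$. Then the mean $\mu$ and variance $\sigma^2$ of $Y$ are given by $\mu = \operatorname{Tr}(A)$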
and σ 2 =
and W = (Y − μ)/σ satisfies L(W ) − L(Z) ≤ 1
1/2
1 Tr A4 4 πσ
1 2
Tr A , 2
(4.160)
n 3/2 n 7 2 + 3 aij . 2σ i=1
j =1
Proof The mean and variance formulas (4.160) can be obtained by specializing Theorems 1.5 and 1.6 of Seber and Lee (2003) to X with the given distribution. By subtracting the mean and then replacing aij by aij /σ it suffices to prove the result when aii = 0 and σ 2 = 1. Letting ψ(x) = aij xi xj i<j
Rn ,
for x ∈ we have
with
xi
the vector x with
xi
replacing xi and using the symmetry of A
i ψ(x) = ψ(x) − ψ xi = aij xi xj + aj i xj xi − aij xi xj − aj i xj xi j : i<j
= xi − xi
j : j 0, showing the bound of Proposition 4.8 to be of order n−1/2 . Similar remarks apply to S. In particular, KS , as a lower bound on the kissing number, is bounded in any dimension as n → ∞, and (4.169) holds for some
$g_S > 0$ when $\sigma_V^2$ is replaced by $\sigma_S^2$. At the cost of considerably more effort, Goldstein and Penrose (2010) apply Theorem 5.6 to obtain bounds of order $n^{-1/2}$ for the Kolmogorov distance for both the standardized $V$ and $S$, with explicit constants.

Though Chatterjee's approach might at first glance seem to bear little connection to the methods already presented, and (4.158) indeed appears a bit mysterious, Chen and Röllin (2010) have an interpretation which fits it into a general framework that contains a number of previous techniques mentioned, the exchangeable pair and size bias methods in particular. Chen and Röllin (2010) consider an identity of the form
$$E\big[Gf(W') - Gf(W)\big] = E\big[Wf(W)\big], \tag{4.170}$$
for some triple $(W, W', G)$ of square integrable random variables. If $W, W'$ is a $\lambda$-Stein pair then by (2.35) identity (4.170) is satisfied with
$$G = \frac{1}{2\lambda}(W' - W).$$
If $Y^s$ is on the same space as $Y$ and has the $Y$-size biased distribution, and if $EY = \mu$ and $\operatorname{Var}(Y) = \sigma^2$, then by (2.64) the variables $W = (Y - \mu)/\sigma$ and $W' = (Y^s - \mu)/\sigma$ satisfy (4.170) with $G = \mu/\sigma$.

Chatterjee's approach is also included in the framework of Chen and Röllin (2010), by the method of 'interpolation to independence', as follows. Suppose $W$ is a mean zero, variance 1 random variable, and for each $i \in \{1, \dots, n\}$ we have a random variable $W_i'$ which is close in some sense to $W$. Suppose there exists a sequence of random variables $V_0, \dots, V_n$ such that $V_0 = W$, that $V_0$ and $V_n$ are independent, and that
$$\big((W, V_{i-1}), (W_i', V_i)\big) =_d \big((W_i', V_i), (W, V_{i-1})\big) \quad\text{for all } i = 1, \dots, n.$$
Note in particular we must therefore have $W =_d W_i'$ and $V_i =_d V_{i-1}$, so all elements of the sequence $V_0, \dots, V_n$ are equal in distribution, and have mean $E[V_0] = EW = 0$. Given such variables, letting $I$ be uniform over $\{1, \dots, n\}$ and independent of the remaining variables and
$$G = \frac{n}{2}(V_I - V_{I-1}),$$
we have, by telescoping the sum, using the independence of $V_n$ and $W$ on the first term and taking conditional expectation with respect to $W$ on the second, that
$$\begin{aligned}
E\big[Gf(W)\big] &= \frac{1}{2} E \sum_{i=1}^n (V_i - V_{i-1}) f(W) \\
&= \frac{1}{2} E\big[(V_n - V_0) f(W)\big] \\
&= -\frac{1}{2} E\big[Wf(W)\big],
\end{aligned}$$
while n 1 E Gf (W ) = (Vi − Vi−1 )f (W ) 2
=−
i=1 n
1 2
(Vi − Vi−1 )f (W )
i=1
1 = − E Gf (W ) 2 1 = E Wf (W ) . 2 Hence (4.170) is satisfied with W = WI . Now when W = ψ(X), a mean zero, variance one function of i.i.d. variables X1 , . . . , Xn , one can construct the required sequence V0 , . . . , Vn by setting Vi to be the function ψ evaluated on X1 , . . . , Xi , Xi+1 , . . . , Xn where Xi is an independent copy of Xi . Let also Wi = ψ(Xi ), where Xi is the vector X with Xi replacing Xi . It is clear that V0 = W , and is independent of Vn . In the notation of (4.156) we have
Wi = ψ Xi and Vi = ψ X{1,...,i} . Now consider the variation where π is a random permutation independent of the remaining variables, and we interpolate to independence in the order determined by π , that is,
Wi = ψ Xπ(i) and Vi = ψ X{π(1),...,π(i)} . Then (4.170) is satisfied with G=
1 Wπ(I ) − Wπ(I −1) , 2n
where I is an independent index chosen uniformly from {1, . . . , n}. Moreover, bounds to the normal in this framework involve conditional expectations such as (4.144), and in particular E(G(W − W )|X, X ) is the expression (4.158), see Chen and Röllin (2010) for details. We now present the proof of Theorems 4.11 and 4.12, starting with some preliminary lemmas. Lemma 4.8 Let X = (X1 , . . . , Xn ) be a random vector with independent χ valued components. Then, for any functions φ, ψ : χ n → R such that Eφ(X)2 and Eψ(X)2 are both finite,
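$$\operatorname{Cov}\big(\phi(\mathbf{X}), \psi(\mathbf{X})\big) = \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\big[\Delta_j \phi(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big)\big].$$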
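Proof First, we claim that
$$\sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \Delta_j \psi\big(\mathbf{X}^A\big) = \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \Big(\psi\big(\mathbf{X}^A\big) - \psi\big(\mathbf{X}^{A \cup j}\big)\Big) = \psi(\mathbf{X}) - \psi(\mathbf{X}'). \tag{4.171}$$
In particular, note that for any set $A \subset \{1, \dots, n\}$, except $A = \{1, \dots, n\}$, as there are $n - |A|$ elements $j \notin A$, these sets appear in (4.171) with a positive sign a total of
$$\frac{1}{\binom{n}{|A|}(n - |A|)} \times (n - |A|) = \frac{1}{\binom{n}{|A|}}$$
times. Similarly, any set $B \subset \{1, \dots, n\}$, except $B = \emptyset$, can be represented as $B = A \cup j$ for $|B|$ different sets $A$, so these sets appear with a negative sign a total of
$$\frac{1}{\binom{n}{|B| - 1}(n - |B| + 1)} \times |B| = \frac{1}{\binom{n}{|B|}}$$
times. Hence only the terms $A = \emptyset$ and $A \cup j = \{1, \dots, n\}$ do not cancel out, the first one appearing with a coefficient of $1/\binom{n}{0} = 1$, and the latter with coefficient $-1/\binom{n}{n} = -1$.

Now, for a fixed $A$ and $j \notin A$ let $U = \phi(\mathbf{X})\, \Delta_j \psi(\mathbf{X}^A)$, a function of the random vectors $\mathbf{X}$ and $\mathbf{X}'$. Note that upon interchanging $X_j$ and $X_j'$ the joint distribution of $(\mathbf{X}, \mathbf{X}')$ is unchanged, while $U$ becomes $U' = -\phi(\mathbf{X}^j)\, \Delta_j \psi(\mathbf{X}^A)$. Thus,
$$EU = EU' = \frac{1}{2} E(U + U') = \frac{1}{2} E\big[\Delta_j \phi(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big)\big].$$
Combining these observations yields
$$\begin{aligned}
\operatorname{Cov}\big(\phi(\mathbf{X}), \psi(\mathbf{X})\big) &= E\big[\phi(\mathbf{X})\psi(\mathbf{X})\big] - E\big[\phi(\mathbf{X})\big] E\big[\psi(\mathbf{X})\big] \\
&= E\big[\phi(\mathbf{X})\big(\psi(\mathbf{X}) - \psi(\mathbf{X}')\big)\big] \\
&= \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\big[\phi(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big)\big] \\
&= \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\big[\Delta_j \phi(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big)\big],
\end{aligned}$$
as desired.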
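The covariance representation of Lemma 4.8 can be tested exactly when each coordinate takes finitely many values. The sketch below assumes hypothetical names (`mix`, `product_law`, `covariance_two_ways`) and represents each marginal as a dictionary from values to rational probabilities; it enumerates all pairs $(\mathbf{x}, \mathbf{x}')$, so it is only practical for very small $n$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def mix(x, x_prime, A):
    """The vector X^A of (4.156): coordinates in A come from x_prime."""
    return tuple(x_prime[j] if j in A else x[j] for j in range(len(x)))


def product_law(marginals):
    """All points of the product space with their probabilities."""
    points = []
    for combo in product(*(m.items() for m in marginals)):
        prob = Fraction(1)
        for _, q in combo:
            prob *= q
        points.append((tuple(v for v, _ in combo), prob))
    return points


def covariance_two_ways(phi, psi, marginals):
    """Return (direct covariance, right-hand side of Lemma 4.8)."""
    n = len(marginals)
    points = product_law(marginals)
    mean_phi = sum(p * phi(x) for x, p in points)
    mean_psi = sum(p * psi(x) for x, p in points)
    direct = sum(p * phi(x) * psi(x) for x, p in points) - mean_phi * mean_psi

    via_lemma = Fraction(0)
    for (x, p), (xp, q) in product(points, repeat=2):   # X and its independent copy
        for size in range(n):                           # proper subsets only
            for subset in combinations(range(n), size):
                A = set(subset)
                inner = sum(
                    (phi(x) - phi(mix(x, xp, {j})))
                    * (psi(mix(x, xp, A)) - psi(mix(x, xp, A | {j})))
                    for j in range(n) if j not in A
                )
                via_lemma += p * q * inner / (2 * comb(n, size) * (n - size))
    return direct, via_lemma


marginals = [
    {0: Fraction(1, 3), 1: Fraction(2, 3)},
    {0: Fraction(1, 2), 2: Fraction(1, 2)},
    {-1: Fraction(1, 4), 1: Fraction(3, 4)},
]
```

The example `marginals` gives three independent, non-identically distributed coordinates, which the lemma permits.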
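Lemma 4.9 Let $W = \psi(\mathbf{X})$ with $EW = 0$ and $\operatorname{Var}(W) = 1$ where $\mathbf{X} = (X_1, \dots, X_n)$ is a vector of $\mathcal{X}$ valued, independent components, and let $T$ be given by (4.158). Then, for any twice continuously differentiable function $f$ with bounded second derivative, we have
$$\big|E\big[f(W)W\big] - E\big[f'(W)T\big]\big| \le \frac{\|f''\|}{4} \sum_{j=1}^n E\big|\Delta_j \psi(\mathbf{X})\big|^3.$$

Proof For each $A \subset \{1, \dots, n\}$ and $j \notin A$, let
$$R_{A,j} = \Delta_j (f \circ \psi)(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big) \quad\text{and}\quad \widetilde{R}_{A,j} = f'\big(\psi(\mathbf{X})\big)\, \Delta_j \psi(\mathbf{X})\, \Delta_j \psi\big(\mathbf{X}^A\big).$$
By Lemma 4.8 with $\phi = f \circ \psi$, we have
$$E\big[f(W)W\big] = \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E R_{A,j}. \tag{4.172}$$
By the mean value theorem, and Hölder's inequality, we have
$$E\big|R_{A,j} - \widetilde{R}_{A,j}\big| \le \frac{\|f''\|}{2} E\Big|\big(\Delta_j \psi(\mathbf{X})\big)^2 \Delta_j \psi\big(\mathbf{X}^A\big)\Big| \le \frac{\|f''\|}{2} E\big|\Delta_j \psi(\mathbf{X})\big|^3. \tag{4.173}$$
From the definition of $T$,
$$f'(W)T = \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} \widetilde{R}_{A,j}. \tag{4.174}$$
Combining (4.172), (4.174) and (4.173), we obtain
$$\begin{aligned}
\big|E\big[f(W)W\big] - E\big[f'(W)T\big]\big| &= \frac{1}{2} \left| \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\big(R_{A,j} - \widetilde{R}_{A,j}\big) \right| \\
&\le \frac{\|f''\|}{4} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{1}{\binom{n}{|A|}(n - |A|)} \sum_{j \notin A} E\big|\Delta_j \psi(\mathbf{X})\big|^3 \\
&= \frac{\|f''\|}{4} \sum_{j=1}^n E\big|\Delta_j \psi(\mathbf{X})\big|^3,
\end{aligned}$$
as claimed.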
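Proof of Theorem 4.11 Let $h$ be any absolutely continuous function with $\|h'\| \le 1$, and let $f$ be the solution to the Stein equation for $h$,
$$Eh(W) - Nh = E\big[f'(W) - Wf(W)\big].$$
By (2.13) of Lemma 2.4, we have that $\|f'\| \le \sqrt{2/\pi}$ and $\|f''\| \le 2$. Setting $\phi = \psi$ in Lemma 4.8, we obtain $ET = EW^2 = 1$. Therefore
$$\begin{aligned}
\big|Eh(W) - Nh\big| &= \big|E\big[f'(W) - Wf(W)\big]\big| \\
&\le \big|E\big[f'(W) - f'(W)T\big]\big| + \big|E\big[f'(W)T - Wf(W)\big]\big| \\
&\le \sqrt{2/\pi}\, E\big|E(T \mid W) - 1\big| + \big|E\big[f'(W)T - Wf(W)\big]\big| \\
&\le \sqrt{2/\pi}\, \big[\operatorname{Var}\big(E(T \mid W)\big)\big]^{1/2} + \frac{1}{2} \sum_{j=1}^n E\big|\Delta_j \psi(\mathbf{X})\big|^3,
\end{aligned}$$
by the Cauchy–Schwarz inequality, and Lemma 4.9. The proof is completed by taking supremum over $h$, noting (4.8).

We now proceed to the proof of Theorem 4.12. By Theorem 4.11, it suffices to bound $\operatorname{Var}(E(T \mid \mathbf{X}))$. For this reason, the proof of Theorem 4.12 follows quickly from the following upper bound.

Lemma 4.10 Let $\mathbf{X}$ be a vector of i.i.d. variates, $A \subset \{1, \dots, n\}$ with $|A| \neq n$, and $T_A$, $M$ and $\delta$ given by (4.158), (4.161) and (4.162), respectively. Then there exists a constant $C$ such that
$$\operatorname{Var}\big(E(T_A \mid \mathbf{X})\big) \le C \big(EM^8\big)^{1/2} \big(E\delta^4\big)^{1/2} \sqrt{n(n - |A|)}.$$
For the remainder of this section, we make the convention that constants $C$ need not be the same at each occurrence. Deferring the proof of Lemma 4.10, we present the proof of Theorem 4.12.

Proof By the definition of $T$ and Minkowski's inequality, we obtain
$$\big[\operatorname{Var}\big(E(T \mid \mathbf{X})\big)\big]^{1/2} \le \frac{1}{2} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{\big[\operatorname{Var}\big(E(T_A \mid \mathbf{X})\big)\big]^{1/2}}{\binom{n}{|A|}(n - |A|)}.$$
Substituting the bound from Lemma 4.10 yields
$$\begin{aligned}
\big[\operatorname{Var}\big(E(T \mid \mathbf{X})\big)\big]^{1/2} &\le C \big(EM^8\big)^{1/4} \big(E\delta^4\big)^{1/4} \sum_{A \subset \{1, \dots, n\},\ |A| \neq n} \frac{n^{1/4}(n - |A|)^{1/4}}{\binom{n}{|A|}(n - |A|)} \\
&= C \big(EM^8\big)^{1/4} \big(E\delta^4\big)^{1/4}\, n^{1/4} \sum_{k=1}^n k^{-3/4} \\
&\le C \big(EM^8\big)^{1/4} \big(E\delta^4\big)^{1/4}\, n^{1/2}.
\end{aligned}$$
Now invoking Theorem 4.11 completes the proof.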
It remains to prove Lemma 4.10. We proceed by way of the following preliminary result.

Lemma 4.11 Suppose that $G$ is a symmetric graphical rule on $\chi^n$ and $X = (X_1,\dots,X_n)$ is a vector of i.i.d. $\chi$-valued random variables. Let $d_1$ be the degree of vertex 1 in $G(X)$, and, for any $k \le n-1$, let $i, i_1,\dots,i_k$ be any collection of $k+1$ distinct elements of $\{1,\dots,n\}$. Then
\[ P\bigl(\{i,i_l\} \in G(X) \text{ for all } 1 \le l \le k\bigr) = \frac{E(d_1)_k}{(n-1)_k}, \tag{4.175} \]
where $(r)_k$ stands for the falling factorial $r(r-1)\cdots(r-k+1)$.

Proof Since $G$ is a symmetric rule and $X_1,\dots,X_n$ are i.i.d., the probability
\[ P\bigl(\{i,i_l\} \in G(X) \text{ for all } 1 \le l \le k\bigr) \]
does not depend on $i, i_1,\dots,i_k$. Hence
\[ P\bigl(\{i,i_l\} \in G(X) \text{ for all } 1 \le l \le k\bigr) = \frac{1}{(n-1)_k}\sum_{\substack{\{j_1,\dots,j_k\}\subset\{1,\dots,n\}\setminus\{i\}\\ |\{j_1,\dots,j_k\}|=k}} P\bigl(\{i,j_l\} \in G(X) \text{ for all } 1 \le l \le k\bigr). \]
Lastly, note that
\[ \sum_{\substack{\{j_1,\dots,j_k\}\subset\{1,\dots,n\}\setminus\{i\}\\ |\{j_1,\dots,j_k\}|=k}} \mathbf{1}\bigl(\{i,j_l\} \in G(X) \text{ for all } 1 \le l \le k\bigr) = (d_i)_k, \]
where $d_i$ is the degree of vertex $i$. As $d_i$ and $d_1$ have the same distribution, the argument is complete.

To prove Lemma 4.10 we require the following result, the Efron–Stein inequality, see Efron and Stein (1981), and Steele (1986).

Lemma 4.12 Let $U = g(Y_1,\dots,Y_m)$ be a function of independent random objects $Y_1,\dots,Y_m$, and let $Y_i'$ be an independent copy of $Y_i$ for $i = 1,\dots,m$. Then
\[ \mathrm{Var}(U) \le \frac{1}{2}\sum_{i=1}^m E\bigl[g(Y_1,\dots,Y_{i-1},Y_i',Y_{i+1},\dots,Y_m) - g(Y_1,\dots,Y_m)\bigr]^2. \]
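The Efron–Stein inequality of Lemma 4.12 can be checked exactly on a small finite example. The sketch below is an illustration only and is not part of the text: it assumes the $Y_i$ are i.i.d. uniform on a small finite set, and the helper name `efron_stein_check` is hypothetical. It enumerates all outcomes with exact rational arithmetic and returns both sides of the inequality.

```python
from fractions import Fraction
from itertools import product


def efron_stein_check(g, values, m):
    """Return (Var(U), Efron-Stein bound) for U = g(Y_1, ..., Y_m).

    Assumption for this illustration: the Y_i are i.i.d. uniform on `values`.
    """
    p = Fraction(1, len(values))          # probability of each single value
    points = list(product(values, repeat=m))
    w = p ** m                            # probability of each outcome y
    mean = sum(w * g(y) for y in points)
    var = sum(w * (g(y) - mean) ** 2 for y in points)

    bound = Fraction(0)
    for i in range(m):
        for y in points:
            for yi_new in values:         # independent copy Y_i' of Y_i
                y_swapped = y[:i] + (yi_new,) + y[i + 1:]
                bound += w * p * (g(y_swapped) - g(y)) ** 2
    return var, bound / 2


# U = max(Y_1, Y_2, Y_3) with Y_i uniform on {0, 1, 2}
var_u, es_bound = efron_stein_check(max, values=(0, 1, 2), m=3)
print(var_u, es_bound)
```

For a sum, `efron_stein_check(sum, (0, 1, 2), 3)`, the two sides coincide, which shows that the inequality cannot be improved in general.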
Proof of Lemma 4.10 Fix $A \subset \{1,\dots,n\}$ with $|A| \ne n$. For each $j \notin A$, let
\[ R_j = \Delta_j\psi(X)\,\Delta_j\psi\bigl(X^A\bigr) = \bigl(\psi(X) - \psi(X^j)\bigr)\bigl(\psi(X^A) - \psi(X^{A\cup j})\bigr). \]
Let $Y = (Y_1,\dots,Y_n)$ be a copy of $X$, which is independent of both $X$ and $X'$. For a fixed $i \in \{1,\dots,n\}$ let
\[ \widetilde{X} = (X_1,\dots,X_{i-1},Y_i,X_{i+1},\dots,X_n). \]
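Similarly, for each $B \subset \{1,\dots,n\}$, let
\[ \widetilde{X}^B = \begin{cases} (X_1^B,\dots,X_{i-1}^B,Y_i,X_{i+1}^B,\dots,X_n^B) & \text{if } i \notin B, \\ X^B & \text{if } i \in B. \end{cases} \]
Now let
\[ R_{ji} = \bigl(\psi(\widetilde{X}) - \psi(\widetilde{X}^j)\bigr)\bigl(\psi(\widetilde{X}^A) - \psi(\widetilde{X}^{A\cup j})\bigr), \]
and put
\[ h_i = E\Bigl(\sum_{j\notin A}(R_j - R_{ji})\Bigr)^2. \]
It follows from inequality (4.143) and Lemma 4.12 that
\[ \mathrm{Var}\bigl(E(T_A|X)\bigr) \le \mathrm{Var}(T_A) \le \frac{1}{2}\sum_{i=1}^n h_i. \tag{4.176} \]
Hence, we turn our attention to bounding $h_i$, and note that we need only consider $j \notin A$. When $j \ne i$ let
\[ d^1_{ji} = \mathbf{1}\bigl(\{i,j\} \in G(X)\bigr), \quad d^2_{ji} = \mathbf{1}\bigl(\{i,j\} \in G(X^j)\bigr), \quad d^3_{ji} = \mathbf{1}\bigl(\{i,j\} \in G(\widetilde{X})\bigr) \quad\text{and}\quad d^4_{ji} = \mathbf{1}\bigl(\{i,j\} \in G(\widetilde{X}^j)\bigr). \]
Suppose in a particular realization we have $d^1_{ji} = d^2_{ji} = d^3_{ji} = d^4_{ji} = 0$. Since $G$ is an interaction rule for $\psi$, on this event we have
\[ \psi(X) - \psi(X^j) = \psi(\widetilde{X}) - \psi(\widetilde{X}^j). \]
If we now take $X^A$ and $\widetilde{X}^A$ in place of $X$ and $\widetilde{X}$, and define $e^1_{ji}$, $e^2_{ji}$, $e^3_{ji}$ and $e^4_{ji}$ analogously, then when $e^1_{ji} = e^2_{ji} = e^3_{ji} = e^4_{ji} = 0$ we have
\[ \psi(X^A) - \psi(X^{A\cup j}) = \psi(\widetilde{X}^A) - \psi(\widetilde{X}^{A\cup j}), \]
whether $i \in A$ or not. Now, let
\[ L_i = \max_{j\notin A}\bigl|\Delta_j\psi(X)\Delta_j\psi(X^A) - \Delta_j\psi(\widetilde{X})\Delta_j\psi(\widetilde{X}^A)\bigr|. \]
From the preceding considerations, when $j \ne i$,
\[ |R_j - R_{ji}| \le L_i\sum_{k=1}^4\bigl(d^k_{ji} + e^k_{ji}\bigr). \]
When $j = i$ then $i \notin A$ and we have $|R_j - R_{ji}| \le L_i$. The Cauchy–Schwarz inequality now yields
\[ h_i \le \bigl(EL_i^4\bigr)^{1/2}\Bigl(E\Bigl[\mathbf{1}(i\notin A) + \sum_{j\notin A\cup i}\sum_{k=1}^4\bigl(d^k_{ji} + e^k_{ji}\bigr)\Bigr]^4\Bigr)^{1/2}. \tag{4.177} \]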
Applying the inequality $(\sum_{i=1}^r a_i)^4 \le r^3\sum_{i=1}^r a_i^4$, we obtain
\[ E\Bigl[\mathbf{1}(i\notin A) + \sum_{j\notin A\cup i}\sum_{k=1}^4\bigl(d^k_{ji} + e^k_{ji}\bigr)\Bigr]^4 \le 9^3\,\mathbf{1}(i\notin A) + 9^3\sum_{k=1}^4 E\Bigl(\sum_{j\notin A\cup i} d^k_{ji}\Bigr)^4 + 9^3\sum_{k=1}^4 E\Bigl(\sum_{j\notin A\cup i} e^k_{ji}\Bigr)^4. \]
To handle the first term in the first sum, from Lemma 4.11, for any $j$, $k$, $l$ and $m$,
\[ E\bigl(d^1_{ji}d^1_{ki}d^1_{li}d^1_{mi}\bigr) \le C\,\frac{E\delta_1^r}{n^r}, \]
where $r$ is the number of distinct indices among $j, k, l, m$, and $\delta_1$ is the degree of vertex 1 in $G(X)$. Recall the definition of $\delta$ from (4.162), and observe that $\delta \ge \delta_1 + 1$. It follows easily that
\[ E\Bigl(\sum_{j\notin A\cup i} d^1_{ji}\Bigr)^4 \le CE\bigl(\delta^4\bigr)\frac{n-|A|}{n}. \]
Now we consider bounding $E(d^2_{ji}d^2_{ki}d^2_{li}d^2_{mi})$. First suppose that $j, k, l, m$ are distinct. Now let $\widetilde{X}$ be the random vector in $\chi^{n+4}$ given by
\[ \widetilde{X} = \bigl(X_1,\dots,X_n,X_j',X_k',X_l',X_m'\bigr). \]
Note that if $d^2_{ji} = d^2_{ki} = d^2_{li} = d^2_{mi} = 1$ then $\{i,n+1\}$, $\{i,n+2\}$, $\{i,n+3\}$ and $\{i,n+4\}$ are all edges in the extended graph $G'(\widetilde{X})$. Since $G$ is a symmetric rule and the components of $\widetilde{X}$ are i.i.d., it follows from Lemma 4.11 that
\[ E\bigl(d^2_{ji}d^2_{ki}d^2_{li}d^2_{mi}\bigr) \le C\,\frac{E\delta^4}{n^4}. \]
Now, suppose $j$, $k$, $l$ are distinct, and that $m = l$. Let $s \in \{1,\dots,n\}$ be distinct from $j$, $k$ and $l$, and define
\[ \widetilde{X} = \bigl(X_1,\dots,X_n,X_j',X_k',X_l',X_s'\bigr) \]
and argue as before to conclude that in this case
\[ E\bigl(d^2_{ji}d^2_{ki}d^2_{li}d^2_{mi}\bigr) = E\bigl(d^2_{ji}d^2_{ki}d^2_{li}\bigr) \le C\,\frac{E\delta^3}{n^3}. \]
In general, if $r$ is the number of distinct elements among $j, k, l, m$, then
\[ E\bigl(d^2_{ji}d^2_{ki}d^2_{li}d^2_{mi}\bigr) \le C\,\frac{E\delta^r}{n^r}. \]
From this inequality we obtain as before that
\[ E\Bigl(\sum_{j\notin A\cup i} d^2_{ji}\Bigr)^4 \le CE\bigl(\delta^4\bigr)\frac{n-|A|}{n}. \]
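The $d^3$, $e^1$ and $e^3$ terms can be bounded as the $d^1$ term, while the $d^4$, $e^2$ and $e^4$ terms like the $d^2$ term. Combining, we conclude
\[ E\Bigl[\mathbf{1}(i\notin A) + \sum_{j\notin A\cup i}\sum_{k=1}^4\bigl(d^k_{ji} + e^k_{ji}\bigr)\Bigr]^4 \le CE\bigl(\delta^4\bigr)\Bigl(\mathbf{1}(i\notin A) + \frac{n-|A|}{n}\Bigr). \]
As $M = \max_j|\Delta_j\psi(X)|$ we have $EL_i^4 \le CEM^8$, and applying these bounds in (4.177), along with the inequality $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$ for nonnegative $x$ and $y$, we obtain
\[ h_i \le C\bigl(EM^8\bigr)^{1/2}\bigl(E\delta^4\bigr)^{1/2}\Bigl(\mathbf{1}(i\notin A) + \sqrt{\frac{n-|A|}{n}}\Bigr). \]
Substituting this bound in (4.176), we obtain
\[
\begin{aligned}
\mathrm{Var}\bigl(E(T_A|X)\bigr) &\le C\bigl(EM^8\bigr)^{1/2}\bigl(E\delta^4\bigr)^{1/2}\bigl(n-|A| + \sqrt{n}\sqrt{n-|A|}\bigr) \\
&\le C\bigl(EM^8\bigr)^{1/2}\bigl(E\delta^4\bigr)^{1/2}\sqrt{n(n-|A|)},
\end{aligned}
\]
completing the proof.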
4.7 Locally Dependent Random Variables

In this section we consider $L^1$ bounds for sums of locally dependent random variables. We begin by recalling that an $m$-dependent sequence of random variables $\xi_i$, $i \in \mathbb{N}$, is one with the property that, for each $i$, the sets of random variables $\{\xi_j, j \le i\}$ and $\{\xi_j, j > i+m\}$ are independent. Independent random variables are the special case of $m$-dependence when $m = 0$. Local dependence generalizes the notion of $m$-dependence to collections of random variables indexed more generally. The concept of local dependence is applicable, for example, to random variables indexed by the vertices of a graph such that the collections $\{\xi_i, i \in I\}$ and $\{\xi_j, j \in J\}$ are independent whenever $I \cap J = \emptyset$ and the graph contains no edges $\{i,j\}$ with $i \in I$ and $j \in J$.

Let $\mathcal{J}$ be a finite index set of cardinality $n$, and let $\{\xi_i, i \in \mathcal{J}\}$ be a random field, that is, an indexed collection of random variables, with zero means and finite variances. Define $W = \sum_{i\in\mathcal{J}}\xi_i$, and assume that $\mathrm{Var}(W) = 1$. For any $A \subset \mathcal{J}$ let
\[ A^c = \{j \in \mathcal{J}: j \notin A\} \quad\text{and}\quad \xi_A = \{\xi_i: i \in A\}. \]
We introduce the following two conditions, corresponding to different degrees of local dependence.

(LD1) For each $i \in \mathcal{J}$ there exists $A_i \subset \mathcal{J}$ such that $\xi_i$ and $\xi_{A_i^c}$ are independent.

(LD2) For each $i \in \mathcal{J}$ there exist $A_i \subset B_i \subset \mathcal{J}$ such that $\xi_i$ is independent of $\xi_{A_i^c}$ and $\xi_{A_i}$ is independent of $\xi_{B_i^c}$.
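Clearly (LD2) implies (LD1). Whenever (LD1) or (LD2) hold we set
\[ \eta_i = \sum_{j\in A_i}\xi_j \quad\text{and}\quad \tau_i = \sum_{j\in B_i}\xi_j \tag{4.178} \]
respectively. Note that when $\{\xi_i, i \in \mathcal{J}\}$ are independent (LD2) holds with $A_i = B_i = \{i\}$, in which case $\eta_i = \tau_i = \xi_i$.

Theorem 4.13 Let $\{\xi_i, i \in \mathcal{J}\}$ be a random field with mean zero and $\mathrm{Var}(W) = 1$ where $W = \sum_{i\in\mathcal{J}}\xi_i$. If (LD1) holds, then with $\eta_i$ as in (4.178),
\[ \|L(W) - L(Z)\|_1 \le \sqrt{\frac{2}{\pi}}\,E\Bigl|\sum_{i\in\mathcal{J}}\bigl(\xi_i\eta_i - E(\xi_i\eta_i)\bigr)\Bigr| + \sum_{i\in\mathcal{J}}E\bigl|\xi_i\eta_i^2\bigr|, \tag{4.179} \]
and if (LD2) holds, then with $\eta_i$ and $\tau_i$ as in (4.178),
\[ \|L(W) - L(Z)\|_1 \le 2\sum_{i\in\mathcal{J}}\bigl(|E(\xi_i\eta_i)|\,E|\tau_i| + E|\xi_i\eta_i\tau_i|\bigr) + \sum_{i\in\mathcal{J}}E\bigl|\xi_i\eta_i^2\bigr|. \tag{4.180} \]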
We remark that for independent random variables, applying Hölder’s inequality to the bound in (4.180) yields $5\sum_{i\in\mathcal{J}}E|\xi_i|^3$, somewhat larger than the constant of 1 given by Corollary 4.2.

Proof Assume (LD1) holds and let $f = f_h$ be the solution of the Stein equation (2.4) for an absolutely continuous function $h$ satisfying $\|h'\| \le 1$. By the independence of $\xi_i$ and $W - \eta_i$, and that $E\xi_i = 0$, we have
\[ E\{Wf(W)\} = \sum_{i\in\mathcal{J}}E\xi_i f(W) = \sum_{i\in\mathcal{J}}E\xi_i\bigl(f(W) - f(W-\eta_i)\bigr). \]
Now adding and subtracting yields
\[ E\{Wf(W)\} = \sum_{i\in\mathcal{J}}E\bigl\{\xi_i\bigl(f(W) - f(W-\eta_i) - \eta_i f'(W)\bigr)\bigr\} + E\Bigl\{\sum_{i\in\mathcal{J}}\xi_i\eta_i f'(W)\Bigr\}. \tag{4.181} \]
Now, using again that $E\xi_i = 0$ for all $i$, from (LD1) it follows that
\[ 1 = EW^2 = \sum_{i\in\mathcal{J}}\sum_{j\in\mathcal{J}}E\{\xi_i\xi_j\} = \sum_{i\in\mathcal{J}}E\{\xi_i\eta_i\}, \]
and so
\[ E\{f'(W) - Wf(W)\} = -E\Bigl\{\sum_{i\in\mathcal{J}}\bigl(\xi_i\eta_i - E(\xi_i\eta_i)\bigr)f'(W)\Bigr\} - \sum_{i\in\mathcal{J}}E\bigl\{\xi_i\bigl(f(W) - f(W-\eta_i) - \eta_i f'(W)\bigr)\bigr\}. \tag{4.182} \]
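By (2.13), $\|f'\| \le \sqrt{2/\pi}$ and $\|f''\| \le 2$. Therefore it follows from (4.182) and a Taylor expansion that
\[ |Eh(W) - Eh(Z)| \le \sqrt{\frac{2}{\pi}}\,E\Bigl|\sum_{i\in\mathcal{J}}\bigl(\xi_i\eta_i - E(\xi_i\eta_i)\bigr)\Bigr| + \sum_{i\in\mathcal{J}}E\bigl|\xi_i\eta_i^2\bigr|. \]
Now (4.179) follows from (4.8). When (LD2) is satisfied, $f'(W-\tau_i)$ and $\xi_i\eta_i$ are independent for each $i \in \mathcal{J}$. Hence, using (4.182), we can write
\[
\begin{aligned}
|Eh(W) - Eh(Z)| &\le \Bigl|E\sum_{i\in\mathcal{J}}\bigl(\xi_i\eta_i - E(\xi_i\eta_i)\bigr)\bigl(f'(W) - f'(W-\tau_i)\bigr)\Bigr| + \sum_{i\in\mathcal{J}}E\bigl|\xi_i\eta_i^2\bigr| \\
&\le 2\sum_{i\in\mathcal{J}}\bigl(E|\xi_i\eta_i\tau_i| + |E(\xi_i\eta_i)|\,E|\tau_i|\bigr) + \sum_{i\in\mathcal{J}}E\bigl|\xi_i\eta_i^2\bigr|,
\end{aligned}
\]
as desired.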
We provide two examples of locally dependent random variables. We refer to Baldi and Rinott (1989), Rinott (1994), Baldi et al. (1989), Dembo and Rinott (1996), and Chen and Shao (2004) for more details.

Example 4.1 (Graphical dependence) Consider a set of random variables $\{\xi_i, i \in \mathcal{V}\}$ indexed by the vertices of a graph $\mathcal{G} = (\mathcal{V},\mathcal{E})$. The graph $\mathcal{G}$ is said to be a dependency graph if, for any pair of disjoint sets $\Gamma_1$ and $\Gamma_2$ in $\mathcal{V}$ such that no edge in $\mathcal{E}$ has one endpoint in $\Gamma_1$ and the other in $\Gamma_2$, the sets of random variables $\{\xi_i, i \in \Gamma_1\}$ and $\{\xi_i, i \in \Gamma_2\}$ are independent. Let $A_i = \{i\} \cup \{j \in \mathcal{V}: \{i,j\} \in \mathcal{E}\}$ and $B_i = \bigcup_{j\in A_i}A_j$. Then $\{\xi_i, i \in \mathcal{V}\}$ satisfies (LD2). Hence (4.180) holds.

Example 4.2 (The number of local maxima on a graph) Consider a graph $\mathcal{G} = (\mathcal{V},\mathcal{E})$ (which is not necessarily a dependency graph) and independent and identically distributed continuous random variables $\{Y_i, i \in \mathcal{V}\}$. For $i \in \mathcal{V}$ define the indicator variable
\[ \xi_i = \begin{cases} 1 & \text{if } Y_i > Y_j \text{ for all } j \in N_i, \\ 0 & \text{otherwise,} \end{cases} \]
where $N_i = \{j \in \mathcal{V}: \{i,j\} \in \mathcal{E}\}$. Hence $\xi_i = 1$ indicates that $Y_i$ is a local maximum and $W = \sum_{i\in\mathcal{V}}\xi_i$ is the total number of local maxima. Letting
\[ A_i = \{i\} \cup N_i \cup \bigcup_{j\in N_i}N_j \quad\text{and}\quad B_i = \bigcup_{j\in A_i}A_j \]
we find that $\{\xi_i, i \in \mathcal{V}\}$ satisfies (LD2), and therefore (4.180) holds. Bounds in $L^\infty$ for this problem are considered in Example 6.4.
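The sets $A_i$ and $B_i$ of Example 4.2 are easy to build explicitly for a concrete graph. The following sketch is an illustration under stated assumptions rather than anything taken from the text: the graph is a cycle on 7 vertices, the helper names are hypothetical, and the continuous i.i.d. values are replaced by a uniformly random ranking, which has the same local-maximum pattern. It lists $A_0$ and $B_0$ and checks by exact enumeration that $\xi_0$ is independent of $\xi_j$ for a vertex $j \notin A_0$. The indicators here are left uncentred, so applying (4.180) would additionally require centring and scaling.

```python
from fractions import Fraction
from itertools import permutations


def cycle_neighbors(n):
    """Neighbourhoods N_i of the cycle graph on vertices 0, ..., n-1."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def dependency_sets(nbrs):
    """A_i = {i} u N_i u (union of N_j, j in N_i); B_i = union of A_j, j in A_i."""
    A = {i: {i} | nbrs[i] | set().union(*(nbrs[j] for j in nbrs[i])) for i in nbrs}
    B = {i: set().union(*(A[j] for j in A[i])) for i in nbrs}
    return A, B


def local_max_indicators(y, nbrs):
    """xi_i = 1 when y_i exceeds y_j for every neighbour j of i."""
    return [int(all(y[i] > y[j] for j in nbrs[i])) for i in range(len(y))]


n = 7
nbrs = cycle_neighbors(n)
A, B = dependency_sets(nbrs)

# Only the relative order of the continuous Y_i matters, so enumerate rankings.
rankings = list(permutations(range(n)))
xi = [local_max_indicators(y, nbrs) for y in rankings]
total = len(rankings)

mean_W = Fraction(sum(sum(x) for x in xi), total)      # expected number of local maxima
far = next(j for j in range(n) if j not in A[0])       # a vertex outside A_0
p0 = Fraction(sum(x[0] for x in xi), total)
p_far = Fraction(sum(x[far] for x in xi), total)
p_joint = Fraction(sum(x[0] * x[far] for x in xi), total)

print(sorted(A[0]), sorted(B[0]), mean_W, p_joint == p0 * p_far)
```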
4.8 Smooth Function Bounds

In defining a distance $\|L(X) - L(Y)\|_{\mathcal{H}}$ through (4.1) one typically chooses $\mathcal{H}$ to be a convergence determining class of functions, that is, a collection of functions such that if $\{X_n\}_{n\ge 0}$ is any sequence of random variables then
\[ Eh(X_n) \to Eh(X_0) \text{ for all } h \in \mathcal{H} \quad\text{implies}\quad X_n \to_d X_0. \]
A convergence determining class can consist of functions all of which are very smooth, such as the collection of all infinitely differentiable functions with compact support. To describe the collection of functions we consider in this section, following E.M. Stein (1970), let L∞ a} can be usefully upper bounded, with similar remarks applying to Theorem 5.7 for the size biased coupling.
5.1 Bounded Zero Bias Couplings

The calculation here is greatly simplified due to the assumption of boundedness. For $W$ a mean zero random variable with variance one, recall definition (2.51) of $W^*$, the $W$-zero biased variable. Theorem 4.1 shows that when $W$ and $W^*$ are close then $W$ is close to normal in the $L^1$ sense. Theorem 5.1 below, the bounded zero bias coupling theorem, provides a corresponding result in $L^\infty$.

Theorem 5.1 Let $W$ be a mean zero and variance 1 random variable, and suppose that there exists $W^*$, having the $W$-zero bias distribution, defined on the same space as $W$, satisfying $|W^* - W| \le \delta$. Then
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le c\delta \]
where $c = 1 + 1/\sqrt{2\pi} + \sqrt{2\pi}/4 \le 2.03$.

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_5, © Springer-Verlag Berlin Heidelberg 2011
5
L∞ by Bounded Couplings
We note that the application of this theorem does not require the sometimes difficult calculation of a variance of a conditional expectation, such as the term in (4.144) of Corollary 4.3, for the exchangeable pair. Theorem 5.1 may be directly applied to the sum of independent, mean zero random variables $X_1,\dots,X_n$ all bounded by some $C$, yielding a bound with an explicit constant that has the order of the inverse of the standard deviation of the sum. In particular, let $\sigma_i^2 = \mathrm{Var}(X_i)$, $B_n^2 = \sum_{i=1}^n\sigma_i^2$ and
\[ W = \sum_{i=1}^n\xi_i \quad\text{where } \xi_i = X_i/B_n. \]
Then applying Lemma 2.8, (2.58) to yield $|X_i^*| \le C$, and the scaling property (2.59), we obtain, for $I$ an index independent of $X_1,\dots,X_n$ with $P(I = i) = \sigma_i^2/B_n^2$,
\[ W^* - W = \xi_I^* - \xi_I, \quad\text{so in particular } |W^* - W| \le 2C/B_n. \]
Hence the conclusion of Theorem 5.1 holds with $\delta = 2C/B_n$. The proof of Theorem 5.1 is similar to that of Theorem 3.5, even though the relation between $W$ and a coupled $W^*$ having the $W$-zero bias distribution cannot be expressed as in (3.23).

Proof Let $z \in \mathbb{R}$ and $f$ be the solution to the Stein equation (2.2) with $z$ replaced by $z - \delta$. Then
\[ f'(W^*) = \mathbf{1}_{\{W^*\le z-\delta\}} - \Phi(z-\delta) + W^*f(W^*) \le \mathbf{1}_{\{W\le z\}} - \Phi(z-\delta) + W^*f(W^*). \]
By taking expectation in this inequality, applying a simple bound on the normal distribution function and the zero bias definition (2.51), we obtain
\[
\begin{aligned}
P(W \le z) - \Phi(z) &= \Phi(z-\delta) - \Phi(z) + P(W \le z) - \Phi(z-\delta) \\
&\ge -\frac{\delta}{\sqrt{2\pi}} + P(W \le z) - \Phi(z-\delta) \\
&\ge -\frac{\delta}{\sqrt{2\pi}} + E\bigl(f'(W^*) - W^*f(W^*)\bigr) \\
&= -\frac{\delta}{\sqrt{2\pi}} + E\bigl(Wf(W) - W^*f(W^*)\bigr).
\end{aligned}
\tag{5.1}
\]
Writing $\Delta = W^* - W$ and applying the bound (2.10) yields
\[
\begin{aligned}
\bigl|E\bigl(Wf(W) - W^*f(W^*)\bigr)\bigr| &= \bigl|E\bigl(Wf(W) - (W+\Delta)f(W+\Delta)\bigr)\bigr| \\
&\le E\Bigl(\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)|\Delta|\Bigr) \\
&\le \delta\Bigl(1 + \frac{\sqrt{2\pi}}{4}\Bigr).
\end{aligned}
\]
Using this inequality in (5.1) yields
\[ P(W \le z) - \Phi(z) \ge -\delta\Bigl(\frac{1}{\sqrt{2\pi}} + 1 + \frac{\sqrt{2\pi}}{4}\Bigr) \ge -2.03\delta. \]
A similar argument yields the reverse inequality.
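As a numerical illustration of the application of Theorem 5.1 to bounded independent summands, the sketch below takes the $X_i$ to be symmetric $\pm 1$ signs, so that $C = 1$, $B_n = \sqrt{n}$ and $\delta = 2/\sqrt{n}$. This choice of distribution and the function names are assumptions made for the example, not part of the text. The Kolmogorov distance is computed exactly from the binomial distribution and compared with $c\delta$.

```python
from math import comb, erf, pi, sqrt


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def kolmogorov_distance_rademacher_sum(n):
    """Exact sup_z |P(W <= z) - Phi(z)| for W = (X_1 + ... + X_n)/sqrt(n), X_i = +-1."""
    dist, cdf = 0.0, 0.0
    for k in range(n + 1):                     # k = number of +1 signs
        w = (2 * k - n) / sqrt(n)              # atom of W
        phi = std_normal_cdf(w)
        dist = max(dist, abs(cdf - phi))       # left limit of the CDF at the atom
        cdf += comb(n, k) / 2 ** n
        dist = max(dist, abs(cdf - phi))       # value of the CDF at the atom
    return dist


# Constant of Theorem 5.1
c = 1 + 1 / sqrt(2 * pi) + sqrt(2 * pi) / 4


def theorem_5_1_bound(n, C=1.0):
    """c * delta with delta = 2C/B_n, where B_n^2 = n for unit-variance signs."""
    B_n = sqrt(n)
    return c * 2 * C / B_n


for n in (25, 100, 400):
    print(n, kolmogorov_distance_rademacher_sum(n), theorem_5_1_bound(n))
```

Both columns decay at rate $n^{-1/2}$, so in this example the theorem gives the correct order with an explicit, if conservative, constant.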
5.2 Exchangeable Pairs, Kolmogorov Distance

In this section we provide results that give a bound on the Kolmogorov distance when we can construct $W, W'$, an exact or approximate Stein pair, whose difference $|W - W'|$ is bounded. Theorem 5.3 can also be applied when $W - W'$ is not bounded if the term $E(W - W')^2\mathbf{1}_{\{|W-W'|>a\}}$ can be handled. For a pair $(W, W')$ and a given $\delta$, some of the results in this section are expressed in terms of $\Delta = W - W'$,
W ˆ K(t)dt Kˆ 1 = E |t|≤δ
ˆ = with K(t) and additionally
(1{−≤t≤0} − 1{0 0, E(W − W )2 1{−a≤W −W ≤0} 1{z−a≤W ≤z} ≤ 3λa. Proof Let
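\[ f(w) = \begin{cases} -3a/2 & w \le z - 2a, \\ w - z + a/2 & z - 2a \le w \le z + a, \\ 3a/2 & w \ge z + a. \end{cases} \]
Then using (2.35),
\[
\begin{aligned}
3a\lambda \ge 2\lambda E\{Wf(W)\} &= E\bigl\{(W - W')\bigl(f(W) - f(W')\bigr)\bigr\} \\
&= E\Bigl\{(W - W')\int_{W'-W}^0 f'(W+t)\,dt\Bigr\} \\
&\ge E\Bigl\{(W - W')\int_{W'-W}^0 \mathbf{1}_{\{|t|\le a\}}\mathbf{1}_{\{z-a\le W\le z\}}f'(W+t)\,dt\Bigr\}.
\end{aligned}
\]
Noting that $f'(w+t) = \mathbf{1}_{\{z-2a\le w+t\le z+a\}}$, we have
\[ \mathbf{1}_{\{|t|\le a\}}\mathbf{1}_{\{z-a\le W\le z\}}f'(W+t) = \mathbf{1}_{\{|t|\le a\}}\mathbf{1}_{\{z-a\le W\le z\}}, \]
and hence
\[
\begin{aligned}
3a\lambda &\ge E\Bigl\{(W - W')\int_{W'-W}^0 \mathbf{1}_{\{|t|\le a\}}\,dt\,\mathbf{1}_{\{z-a\le W\le z\}}\Bigr\} \\
&= E\bigl\{|W - W'|\min\bigl(a, |W - W'|\bigr)\mathbf{1}_{\{z-a\le W\le z\}}\bigr\} \\
&\ge E\bigl\{(W - W')^2\mathbf{1}_{\{0\le W'-W\le a\}}\mathbf{1}_{\{z-a\le W\le z\}}\bigr\}.
\end{aligned}
\]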
Theorem 5.3 If $W$, $W'$ are mean zero, variance 1 exchangeable random variables satisfying
\[ E(W - W'|W) = \lambda(W - R) \]
for some $\lambda \in (0,1)$ and some random variable $R$, then for any $a \ge 0$,
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le B + \frac{0.41a^3}{\lambda} + 1.5a + \frac{E(W - W')^2\mathbf{1}_{\{|W-W'|>a\}}}{2\lambda} + \frac{\sqrt{2\pi}}{4}E|R|, \]
where $B$ is as in (5.3). If $W, W'$ is a variance one $\lambda$-Stein pair satisfying $|W - W'| \le \delta$, then
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le B + \frac{0.41\delta^3}{\lambda} + 1.5\delta. \]

Proof Let $f$ be the solution to the Stein equation (2.2) for some arbitrary $z \in \mathbb{R}$. Following the reasoning in the derivation of (2.35), we find that
\[ E\{Wf(W)\} = \frac{1}{2\lambda}E\bigl\{(W' - W)\bigl(f(W') - f(W)\bigr)\bigr\} + E\{f(W)R\}. \]
Hence,
\[
\begin{aligned}
P(W \le z) - \Phi(z) &= E\bigl(f'(W) - Wf(W)\bigr) \\
&= E\Bigl\{f'(W) - \frac{(W' - W)(f(W') - f(W))}{2\lambda} - f(W)R\Bigr\} \\
&= E\Bigl\{f'(W)\Bigl(1 - \frac{(W' - W)^2}{2\lambda}\Bigr) + \frac{f'(W)(W' - W)^2 - (f(W') - f(W))(W' - W)}{2\lambda} - f(W)R\Bigr\} \\
&:= E(J_1 + J_2 + J_3), \quad\text{say,} \\
&\le |EJ_1| + |EJ_2| + |EJ_3|.
\end{aligned}
\tag{5.6}
\]
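For the first term, by conditioning and then taking expectation, using (2.8) we obtain
\[ |EJ_1| = \Bigl|E\Bigl\{f'(W)E\Bigl(1 - \frac{(W' - W)^2}{2\lambda}\Bigm|W\Bigr)\Bigr\}\Bigr| \le B. \tag{5.7} \]
For the third term, applying (2.9) we have
\[ |EJ_3| \le \frac{\sqrt{2\pi}}{4}E|R|. \]
To bound $|EJ_2|$, with $a \ge 0$ write
\[
\begin{aligned}
f'(W)(W' - W)^2 &- \bigl(f(W') - f(W)\bigr)(W' - W) \\
&= (W' - W)\int_0^{W'-W}\bigl(f'(W) - f'(W+t)\bigr)dt \\
&= (W' - W)\mathbf{1}_{\{|W'-W|>a\}}\int_0^{W'-W}\bigl(f'(W) - f'(W+t)\bigr)dt \\
&\quad + (W' - W)\mathbf{1}_{\{|W'-W|\le a\}}\int_0^{W'-W}\bigl(f'(W) - f'(W+t)\bigr)dt \\
&:= J_{21} + J_{22}, \quad\text{say.}
\end{aligned}
\tag{5.8}
\]
By (2.8), $|EJ_{21}| \le E(W' - W)^2\mathbf{1}_{\{|W'-W|>a\}}$, yielding the second to last term in the bound of the theorem. Now express $J_{22}$, using (2.2), as the sum
\[
\begin{aligned}
&(W' - W)\mathbf{1}_{\{|W'-W|\le a\}}\int_0^{W'-W}\bigl(Wf(W) - (W+t)f(W+t)\bigr)dt \\
&\quad + (W' - W)\mathbf{1}_{\{|W'-W|\le a\}}\int_0^{W'-W}\bigl(\mathbf{1}_{\{W\le z\}} - \mathbf{1}_{\{W+t\le z\}}\bigr)dt.
\end{aligned}
\tag{5.9}
\]
Applying (2.10) to the first term in (5.9) shows that the absolute value of its expectation is bounded by
\[
\begin{aligned}
E\Bigl|(W' - W)\mathbf{1}_{\{|W'-W|\le a\}}\int_0^{W'-W}\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)|t|\,dt\Bigr| &\le \frac{1}{2}E\Bigl(|W' - W|^3\mathbf{1}_{\{|W'-W|\le a\}}\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr) \\
&\le \frac{1}{2}a^3\Bigl(1 + \frac{\sqrt{2\pi}}{4}\Bigr) \\
&\le 0.82a^3.
\end{aligned}
\]
We break up the expectation of the second term in (5.9) according to the sign of $W - W'$. When $W - W' \le 0$, we have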
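\[
\begin{aligned}
\Bigl|E\Bigl\{(W' - W)\mathbf{1}_{\{-a\le W-W'\le 0\}}&\int_0^{W'-W}\bigl(\mathbf{1}_{\{W\le z\}} - \mathbf{1}_{\{W+t\le z\}}\bigr)dt\Bigr\}\Bigr| \\
&= E\Bigl\{(W' - W)\mathbf{1}_{\{-a\le W-W'\le 0\}}\int_0^{W'-W}\mathbf{1}_{\{z-t<W\le z\}}\,dt\Bigr\} \\
&\le E\bigl\{(W - W')^2\mathbf{1}_{\{-a\le W-W'\le 0\}}\mathbf{1}_{\{z-a<W\le z\}}\bigr\} \le 3a\lambda.
\end{aligned}
\]
As a bound may be similarly produced for the case $W - W' \ge 0$, the proof of the first claim is complete. The second claim follows by choosing $a = \delta$ in the first bound, and noting that $R = 0$ for a $\lambda$-Stein pair.

Next we present two results that are obtained by using the linearly smoothed indicator function $h_{z,\alpha}(w)$, as given in (2.14) for $z \in \mathbb{R}$ and $\alpha > 0$, which equals the indicator $h_z(w) = \mathbf{1}_{(-\infty,z]}(w)$ over $(-\infty,z]$, decays to zero linearly over $[z,z+\alpha]$, and equals zero on $(z+\alpha,\infty)$. Let
\[ \kappa = \sup\bigl\{|Eh_z(W) - Nh_z|: z \in \mathbb{R}\bigr\}, \tag{5.10} \]
the Kolmogorov distance, and for $\alpha > 0$ set
\[ \kappa_\alpha = \sup\bigl\{|Eh_{z,\alpha}(W) - Nh_{z,\alpha}|: z \in \mathbb{R}\bigr\}. \tag{5.11} \]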
Theorem 5.4 If $W$, $W'$ is a variance one $\lambda$-Stein pair that satisfies $|W' - W| \le \delta$ for some $\delta$ then
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{3\delta^3}{\lambda} + 2B \tag{5.12} \]
where $B$ is given by (5.3).

If $\delta$ is of order $1/\sigma$, $B$ of order $1/\sigma$, and $\lambda$ of order $1/\sigma^2$, then the bound has order $1/\sigma$. A more careful optimization in the proof leads to the improved bound
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{\bigl(\sqrt{11\delta^3 + 10\lambda B} + 2\delta^{3/2}\bigr)^2}{10\lambda}. \tag{5.13} \]
The bound (5.12) follows from (5.13) and the fact that $(\sqrt{a} + \sqrt{b})^2 \le 2(a+b)$.

Proof For $z \in \mathbb{R}$ arbitrary and $\alpha > 0$ let $f$ be the solution (2.4) to the Stein equation for the function $h_{z,\alpha}$ given in (2.14). Decompose $Eh_{z,\alpha}(W) - Nh_{z,\alpha}$ into $E(J_1 + J_2)$ as in the proof of Theorem 5.3, noting that here the term $R$ is zero. By the second inequality in (2.15) of Lemma 2.5 we may again bound $|EJ_1|$ by $B$ as in (5.7). From (5.8) with $a = \delta$ we obtain
\[
\begin{aligned}
|EJ_2| &\le \frac{1}{2\lambda}\Bigl|E\Bigl\{(W' - W)\int_0^{W'-W}\bigl(f'(W) - f'(W+v)\bigr)dv\Bigr\}\Bigr| \\
&\le \frac{1}{2\lambda}E\Bigl\{|W' - W|\int_{0\wedge(W'-W)}^{0\vee(W'-W)}\bigl|f'(W) - f'(W+v)\bigr|dv\Bigr\}.
\end{aligned}
\]
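By applying $|W' - W| \le \delta$ and a simple change of variable in (2.16) of Lemma 2.5, we may bound $|EJ_2|$ by
\[ \frac{\delta}{2\lambda}E\Bigl\{\bigl(1 + |W|\bigr)\int_{0\wedge(W'-W)}^{0\vee(W'-W)}|v|\,dv + \frac{1}{\alpha}\int_{-\delta}^{\delta}\int_{v\wedge 0}^{v\vee 0}\mathbf{1}_{\{z\le W+u\le z+\alpha\}}\,du\,dv\Bigr\}. \]
As
\[ \int_{0\wedge(W'-W)}^{0\vee(W'-W)}|v|\,dv = \frac{1}{2}(W' - W)^2 \le \frac{\delta^2}{2} \quad\text{and}\quad E|W| \le 1 \]
we obtain
\[ |EJ_2| \le \frac{\delta}{2\lambda}\Bigl(\delta^2 + \frac{1}{\alpha}\int_{-\delta}^{\delta}\int_{v\wedge 0}^{v\vee 0}P(z \le W+u \le z+\alpha)\,du\,dv\Bigr). \tag{5.14} \]
Now, recalling the definitions of $\kappa$ and $\kappa_\alpha$ in (5.10) and (5.11) respectively, as
\[
\begin{aligned}
P(a \le W \le b) &= \bigl(P(W \le b) - \Phi(b)\bigr) - \bigl(P(W < a) - \Phi(a)\bigr) + \bigl(\Phi(b) - \Phi(a)\bigr) \\
&\le 2\kappa + (b-a)/\sqrt{2\pi},
\end{aligned}
\]
we bound (5.14) by
\[ \frac{1}{2\lambda}\Bigl(\delta^3 + \delta\alpha^{-1}\int_{-\delta}^{\delta}\int_{v\wedge 0}^{v\vee 0}(2\kappa + 0.4\alpha)\,du\,dv\Bigr) \le \frac{1}{2\lambda}\bigl(1.4\delta^3 + 2\delta^3\alpha^{-1}\kappa\bigr). \]
Combining the bounds for $|EJ_1|$ and $|EJ_2|$ and taking supremum over $z \in \mathbb{R}$ we obtain
\[ \kappa_\alpha \le B + \frac{1}{2\lambda}\bigl(1.4\delta^3 + 2\delta^3\alpha^{-1}\kappa\bigr). \tag{5.15} \]
As
\[
\begin{aligned}
P(W \le z) - \Phi(z) &\le Eh_{z,\alpha}(W) - \Phi(z) \\
&= \bigl(Eh_{z,\alpha}(W) - Nh_{z,\alpha}\bigr) - \bigl(\Phi(z) - Nh_{z,\alpha}\bigr) \\
&\le \kappa_\alpha + \alpha/\sqrt{2\pi},
\end{aligned}
\]
with similar reasoning providing a corresponding lower bound, taking supremum over $z \in \mathbb{R}$ we obtain
\[ \kappa \le \kappa_\alpha + 0.4\alpha. \]
Now applying the bound (5.15) yields
\[ \kappa \le \frac{a\alpha + b}{1 - c/\alpha}, \quad\text{where } a = 0.4,\ b = B + \frac{0.7\delta^3}{\lambda} \text{ and } c = \frac{\delta^3}{\lambda}. \]
Now setting $\alpha = 2c$ yields $4ac + 2b$, the right hand side of (5.12).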
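The final substitution in the proof of Theorem 5.4 can be written out in full. The following worked step uses only the constants $a$, $b$, $c$ defined in the proof, and confirms that the choice $\alpha = 2c$ reproduces (5.12).

```latex
\[
\kappa \le \frac{a\alpha + b}{1 - c/\alpha}\bigg|_{\alpha = 2c}
      = \frac{2ac + b}{1/2}
      = 4ac + 2b
      = 4(0.4)\frac{\delta^3}{\lambda} + 2B + 2(0.7)\frac{\delta^3}{\lambda}
      = \frac{3\delta^3}{\lambda} + 2B .
\]
```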
Lastly we present a result of Stein (1986), with an improved constant and slightly extended to allow a nonlinear remainder term; this result has the advantage of not requiring the coupling to be bounded. However, the bound supplied by the theorem is typically not of the best order due to its final term. In particular, if $W$ is the sum of i.i.d. variables taking the values $1/\sqrt{n}$ and $-1/\sqrt{n}$ with equal probability and $W'$ is formed from $W$ by replacing a uniformly chosen variable by an independent copy, then $\lambda = 1/n$ and $E|W' - W|^3 = 4/n^{3/2}$, so that the final term in the bound of the theorem below becomes of order $n^{-1/4}$. Nevertheless, in Sect. 14.1 we present a number of important examples where this final term makes no contribution.

Theorem 5.5 If $W$, $W'$ are mean zero, variance 1 exchangeable random variables satisfying
\[ E(W - W'|W) = \lambda(W - R) \tag{5.16} \]
for some $\lambda \in (0,1)$ and some random variable $R$, then
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - \Phi(z)\bigr| \le B + (2\pi)^{-1/4}\sqrt{\frac{E|W' - W|^3}{\lambda}} + E|R|, \]
where $B$ is given by (5.3).

Proof For $z \in \mathbb{R}$ and $\alpha > 0$ let $f$ be the solution to the Stein equation for $h_{z,\alpha}$, the smoothed indicator given by (2.14). Decompose $f'(W) - Wf(W)$ into $J_1 + J_2 + J_3$ as in the proof of Theorem 5.3. Applying the first inequality in (2.15) of Lemma 2.5, we may bound the contribution from $|EJ_3|$ by $E|R|$, and from $|EJ_1|$ by $B$ as in (5.7). Next we claim that for $J_2$, the second term of (5.6), we have
\[
\begin{aligned}
J_2 &= \frac{1}{2\lambda}(W' - W)\int_W^{W'}\bigl(f'(W) - f'(t)\bigr)dt \\
&= \frac{1}{2\lambda}(W' - W)\int_W^{W'}\int_t^W f''(u)\,du\,dt \qquad (5.17) \\
&= \frac{1}{2\lambda}(W - W')\int_W^{W'}(W' - u)f''(u)\,du. \qquad (5.18)
\end{aligned}
\]
We obtain (5.18) by first considering $W \le W'$ and rewriting (5.17) as
\[
\begin{aligned}
-\frac{1}{2\lambda}(W' - W)\int_W^{W'}\int_W^t f''(u)\,du\,dt &= -\frac{1}{2\lambda}(W' - W)\int_W^{W'}\int_u^{W'} f''(u)\,dt\,du \\
&= -\frac{1}{2\lambda}(W' - W)\int_W^{W'}(W' - u)f''(u)\,du,
\end{aligned}
\]
which equals (5.18). When $W' \le W$, similarly we have
\[
\begin{aligned}
\frac{1}{2\lambda}(W' - W)\int_W^{W'}\int_t^W f''(u)\,du\,dt &= -\frac{1}{2\lambda}(W' - W)\int_{W'}^W\int_t^W f''(u)\,du\,dt \\
&= -\frac{1}{2\lambda}(W' - W)\int_{W'}^W\int_{W'}^u f''(u)\,dt\,du \\
&= -\frac{1}{2\lambda}(W' - W)\int_{W'}^W(u - W')f''(u)\,du,
\end{aligned}
\]
which is again (5.18).
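Since $W$ and $W'$ are exchangeable, the expectation of (5.18) is the same as that of
\[ \frac{1}{2\lambda}(W - W')\int_W^{W'}\Bigl(\frac{W + W'}{2} - u\Bigr)f''(u)\,du, \]
which we bound by the expectation of
\[ \frac{1}{2\lambda}\|f''\|\,|W' - W|\int_{W\wedge W'}^{W\vee W'}\Bigl|\frac{W + W'}{2} - u\Bigr|du = \frac{1}{2\lambda}\|f''\|\frac{|W' - W|^3}{4} \le \frac{|W' - W|^3}{4\alpha\lambda}, \]
where for the inequality we used the fact that $|h'_{z,\alpha}(x)| \le 1/\alpha$ for all $x \in \mathbb{R}$, and then applied (2.13). Collecting the bounds, we obtain
\[
\begin{aligned}
P(W \le z) &\le Eh_{z,\alpha}(W) \\
&\le Nh_{z,\alpha} + B + \frac{E|W' - W|^3}{4\alpha\lambda} + E|R| \\
&\le \Phi(z) + \frac{\alpha}{\sqrt{2\pi}} + B + \frac{E|W' - W|^3}{4\alpha\lambda} + E|R|.
\end{aligned}
\]
Evaluating the expression at the minimizer
\[ \alpha = \frac{(2\pi)^{1/4}}{2}\sqrt{\frac{E|W' - W|^3}{\lambda}} \]
yields the inequality
\[ P(W \le z) - \Phi(z) \le B + (2\pi)^{-1/4}\sqrt{\frac{E|W' - W|^3}{\lambda}} + E|R|. \]
Proving the corresponding lower bound in a similar manner completes the proof of the theorem.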
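The remark preceding Theorem 5.5, that its final term is of order $n^{-1/4}$ for a sum of $n$ symmetric signs scaled by $1/\sqrt{n}$, can be checked numerically. The sketch below is illustrative only and the function name is hypothetical. It computes $E|W' - W|^3 = 4/n^{3/2}$ from the four equally likely sign pairs $(X_I, X_I')$, uses $\lambda = 1/n$, and evaluates $(2\pi)^{-1/4}\sqrt{E|W' - W|^3/\lambda}$.

```python
from fractions import Fraction
from itertools import product
from math import pi, sqrt

# E|X_I - X_I'|^3 for independent symmetric signs, before the n^{-3/2} scaling
m3_unscaled = Fraction(
    sum(abs(x - x_copy) ** 3 for x, x_copy in product((-1, 1), repeat=2)), 4
)


def final_term(n):
    """(2*pi)^(-1/4) * sqrt(E|W' - W|^3 / lambda) for the scaled sign sum."""
    lam = 1.0 / n                                  # lambda = 1/n for this coupling
    third_moment = float(m3_unscaled) / n ** 1.5   # E|W' - W|^3 = 4 / n^{3/2}
    return (2 * pi) ** -0.25 * sqrt(third_moment / lam)


for n in (16, 256, 4096):
    print(n, final_term(n), final_term(n) * n ** 0.25)
```

The last printed column is constant in $n$, which is the $n^{-1/4}$ rate; it is slower than the $n^{-1/2}$ rate delivered by the bounded coupling results above.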
5.3 Size Biasing, Kolmogorov Bounds

We now present two results employing size biased couplings, Theorems 5.6 and 5.7, which parallel Theorems 5.4 and 5.3, respectively, for the exchangeable pair. In particular, in Theorem 5.6 we focus on deriving bounds in the Kolmogorov distance in situations where bounded size bias couplings exist, that is, where one can couple the nonnegative variable $Y$ to $Y^s$ having the $Y$-size biased distribution, so that $|Y^s - Y|$ is bounded. In Theorem 5.7 we require the bounded coupling to satisfy an additional monotonicity condition. In principle, Theorem 5.7, like Theorem 5.3, may be applied in situations where $Y^s - Y$ is not bounded.
For $Y$ a nonnegative random variable with positive mean $\mu$, recall that $Y^s$ has the $Y$-size bias distribution if
\[ E\{Yf(Y)\} = \mu Ef(Y^s) \tag{5.19} \]
for all functions $f$ for which the expectations above exist. When $Y$ has finite positive variance $\sigma^2$, we consider the normalized variables
\[ W = (Y - \mu)/\sigma \quad\text{and, with some abuse of notation,}\quad W^s = (Y^s - \mu)/\sigma. \tag{5.20} \]
Given a size bias coupling of $Y$ to $Y^s$, the resulting bounds will be expressed in terms of the quantities $D$ and $\Delta$ given by
\[ D = E\Bigl|E\Bigl(1 - \frac{\mu}{\sigma}(W^s - W)\Bigm|W\Bigr)\Bigr| \quad\text{and}\quad \Delta = \sqrt{\mathrm{Var}\bigl(E(Y^s - Y|Y)\bigr)} \tag{5.21} \]
which obey $D \le \mu\Delta/\sigma^2$. To demonstrate the inequality, note that $EY^s = EY^2/\mu$ by (5.19), hence
\[ \frac{\mu}{\sigma}E(W^s - W) = \frac{\mu}{\sigma^2}\Bigl(\frac{EY^2}{\mu} - \mu\Bigr) = 1, \]
so the Cauchy–Schwarz inequality yields
\[ D \le \frac{\mu}{\sigma}\sqrt{\mathrm{Var}\bigl(E(W^s - W|W)\bigr)} = \frac{\mu}{\sigma^2}\Delta. \tag{5.22} \]
Therefore $D$ may be replaced by $\mu\Delta/\sigma^2$ in all the upper bounds in this section and the one following.

Note that we cannot apply Theorem 3.5 here, as for a size biased coupling in general there is no guarantee that the function $\hat{K}(t)$ will be non-negative.

Theorem 5.6 Let $Y$ be a nonnegative random variable with finite mean $\mu$ and positive, finite variance $\sigma^2$, and suppose $Y^s$, having the $Y$-size biased distribution, may be coupled to $Y$ so that $|Y^s - Y| \le A$ for some $A$. Then with $W = (Y - \mu)/\sigma$ and $D$ as in (5.21),
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6\mu A^2}{\sigma^3} + 2D. \tag{5.23} \]
Following Goldstein and Penrose (2010), a more careful optimization in the proof yields the improved bound
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{\mu}{5\sigma^2}\Bigl(\sqrt{\frac{11A^2}{\sigma} + \frac{5\sigma^2}{\mu}D} + \frac{2A}{\sqrt{\sigma}}\Bigr)^2. \]
Again, as for the bound in Theorem 5.4, inequality (5.23) follows from the one above and the fact that $(\sqrt{a} + \sqrt{b})^2 \le 2(a+b)$.
Usually the mean $\mu$ and variance $\sigma^2$ of $Y$ will grow at the same rate, typically $n$, so the bound will asymptotically have order $O(\sigma^{-1})$ when $D$ is of this same order. In Chap. 6, Theorem 5.6 is applied to counting the occurrences of fixed relatively ordered sub-sequences in a random permutation, such as rising sequences, and to counting the occurrences of color patterns, local maxima, and sub-graphs in finite random graphs.

Here we consider a simple application of Theorem 5.6 when $Y$ is the sum of the i.i.d. variables $X_1,\dots,X_n$ with mean $\theta$ and variance $v^2$, satisfying $0 \le X_i \le A$. In this case $\mu = n\theta$ and $\sigma^2 = nv^2$ so $\mu/\sigma^2 = \theta/v^2$, a constant. Next, applying the construction in Corollary 2.1 we have $Y^s - Y = X_I^s - X_I$, and now using (2.67) and the fact that $X_i$ and $X_i^s$ are nonnegative we obtain
\[ |Y^s - Y| = |X_I^s - X_I| \le A. \]
Lastly, by independence and exchangeability
\[ \mathrm{Var}\bigl(E(Y^s - Y|Y)\bigr) = \mathrm{Var}\bigl(E(X_I^s - X_I|Y)\bigr) = \mathrm{Var}\bigl(EX_I^s - Y/n\bigr) = v^2/n, \]
so $\Delta$ in (5.21), and therefore the resulting bound, is of order $1/\sqrt{n}$, with an explicit constant.

Proof of Theorem 5.6 Fix $z \in \mathbb{R}$ and $\alpha > 0$, and let $f$ solve the Stein equation (2.4) for the linearly smoothed indicator $h_{z,\alpha}(w)$ given in (2.14). Then, letting $W^s = (Y^s - \mu)/\sigma$, applying (5.19) we have
\[
\begin{aligned}
E\bigl(h_{z,\alpha}(W) - Nh_{z,\alpha}\bigr) &= E\bigl(f'(W) - Wf(W)\bigr) \\
&= E\Bigl(f'(W) - \frac{\mu}{\sigma}\bigl(f(W^s) - f(W)\bigr)\Bigr) \\
&= E\Bigl\{f'(W)\Bigl(1 - \frac{\mu}{\sigma}(W^s - W)\Bigr) - \frac{\mu}{\sigma}\int_0^{W^s-W}\bigl(f'(W+t) - f'(W)\bigr)dt\Bigr\}.
\end{aligned}
\tag{5.24}
\]
For the first term, taking expectation by conditioning and then applying the second inequality in (2.15) of Lemma 2.5, we have
\[ \Bigl|E\Bigl\{f'(W)E\Bigl(1 - \frac{\mu}{\sigma}(W^s - W)\Bigm|W\Bigr)\Bigr\}\Bigr| \le D \]
where $D$ is given by (5.21). Hence, letting $\delta = A/\sigma$ so that $|W^s - W| \le \delta$, applying a change of variable on (2.16) of Lemma 2.5 for the second inequality, and then proceeding as in the proof of Theorem 5.4 yields
\[
\begin{aligned}
\bigl|E\bigl(h_{z,\alpha}(W) - Nh_{z,\alpha}\bigr)\bigr| &\le D + \frac{\mu}{\sigma}E\int_{(W^s-W)\wedge 0}^{(W^s-W)\vee 0}\bigl|f'(W+t) - f'(W)\bigr|dt \\
&\le D + \frac{\mu}{\sigma}E\int_{(W^s-W)\wedge 0}^{(W^s-W)\vee 0}\Bigl(\bigl(1 + |W|\bigr)|t| + \alpha^{-1}\int_{t\wedge 0}^{t\vee 0}\mathbf{1}_{\{z\le W+u\le z+\alpha\}}\,du\Bigr)dt \\
&\le D + \frac{\mu}{2\sigma}\bigl(1 + E|W|\bigr)\delta^2 + \frac{\mu}{\sigma}\alpha^{-1}\int_{-\delta}^{\delta}\int_{t\wedge 0}^{t\vee 0}(2\kappa + 0.4\alpha)\,du\,dt \\
&\le D + 1.4\frac{\mu}{\sigma}\delta^2 + 2\frac{\mu}{\sigma}\delta^2\alpha^{-1}\kappa.
\end{aligned}
\tag{5.25}
\]
Now, with $\kappa$ and $\kappa_\alpha$ given in (5.10) and (5.11), respectively, continuing to parallel the proof of Theorem 5.4, taking supremum we see that $\kappa_\alpha$ is bounded by (5.25), and since $\kappa \le 0.4\alpha + \kappa_\alpha$, substitution yields
\[ \kappa \le \frac{a\alpha + b}{1 - c/\alpha}, \quad\text{where } a = 0.4,\ b = D + 1.4\frac{\mu}{\sigma}\delta^2, \text{ and } c = \frac{2\mu\delta^2}{\sigma}. \]
Now setting $\alpha = 2c$ yields $4ac + 2b$, the right hand side of (5.23).
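The i.i.d. application of Theorem 5.6 can be made concrete. The sketch below assumes, purely for illustration, that the $X_i$ are Bernoulli($p$), so that $A = 1$, $\theta = p$ and $v^2 = p(1-p)$; the function names are hypothetical. It evaluates the right hand side of (5.23) with $D$ replaced by its upper bound $\mu\Delta/\sigma^2$ from (5.22), using $\Delta = v/\sqrt{n}$ as computed in the text, and compares it with the exact Kolmogorov distance of the standardized binomial.

```python
from math import erf, exp, lgamma, log, sqrt


def std_normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def theorem_5_6_bound_bernoulli_sum(n, p):
    """6*mu*A^2/sigma^3 + 2*D, with D replaced by mu*Delta/sigma^2 as in (5.22)."""
    theta, v2, A = p, p * (1 - p), 1.0
    mu, sigma2 = n * theta, n * v2
    sigma = sqrt(sigma2)
    Delta = sqrt(v2 / n)              # sqrt of Var(E(Y^s - Y | Y)) = v^2 / n
    D_upper = mu * Delta / sigma2
    return 6 * mu * A ** 2 / sigma ** 3 + 2 * D_upper


def kolmogorov_distance_binomial(n, p):
    """Exact sup_z |P(W <= z) - Phi(z)| for W = (Y - np)/sqrt(np(1-p)), Y ~ Bin(n, p)."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    dist, cdf = 0.0, 0.0
    for k in range(n + 1):
        # binomial pmf computed on the log scale to avoid overflow for large n
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1 - p))
        phi = std_normal_cdf((k - mu) / sigma)
        dist = max(dist, abs(cdf - phi))      # left limit at the atom
        cdf += exp(log_pmf)
        dist = max(dist, abs(cdf - phi))      # value at the atom
    return dist


for n in (400, 1600, 6400):
    print(n, kolmogorov_distance_binomial(n, 0.5), theorem_5_6_bound_bernoulli_sum(n, 0.5))
```

In this example the bound equals $(6\theta/v^3 + 2\theta/v)/\sqrt{n}$, so it has the correct $n^{-1/2}$ order, although its constant is large enough that it only becomes informative for fairly large $n$.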
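We also present Theorem 5.7 which may be applied when the size bias coupling is monotone, that is, when $Y^s \ge Y$ almost surely. The proof depends on the following concentration inequality, which is in some sense the ‘size bias’ version of Lemma 5.1.

Lemma 5.2 Let $Y$ be a nonnegative random variable with mean $\mu$ and finite positive variance $\sigma^2$, and let $Y^s$ be given on the same space as $Y$, with the $Y$-size biased distribution, satisfying $Y^s \ge Y$. Then with
\[ W = (Y - \mu)/\sigma \quad\text{and}\quad W^s = (Y^s - \mu)/\sigma, \]
for any $z \in \mathbb{R}$ and $a \ge 0$,
\[ \frac{\mu}{\sigma}E\bigl\{(W^s - W)\mathbf{1}_{\{W^s-W\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}\bigr\} \le a. \]

Proof Let
\[ f(w) = \begin{cases} -a & w \le z, \\ w - z - a & z < w \le z + 2a, \\ a & w > z + 2a. \end{cases} \]
Then
\[
\begin{aligned}
a \ge E\{Wf(W)\} &= \frac{1}{\sigma}E\Bigl\{(Y - \mu)f\Bigl(\frac{Y - \mu}{\sigma}\Bigr)\Bigr\} \\
&= \frac{\mu}{\sigma}E\bigl(f(W^s) - f(W)\bigr) \\
&= \frac{\mu}{\sigma}E\int_0^{W^s-W}f'(W+t)\,dt \\
&\ge \frac{\mu}{\sigma}E\Bigl\{\int_0^{W^s-W}\mathbf{1}_{\{0\le t\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}f'(W+t)\,dt\Bigr\}.
\end{aligned}
\]
Noting that $f'(w+t) = \mathbf{1}_{\{z\le w+t\le z+2a\}}$, we have
\[ \mathbf{1}_{\{0\le t\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}f'(W+t) = \mathbf{1}_{\{0\le t\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}, \]
and therefore
\[
\begin{aligned}
a &\ge \frac{\mu}{\sigma}E\Bigl\{\int_0^{W^s-W}\mathbf{1}_{\{0\le t\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}\,dt\Bigr\} \\
&= \frac{\mu}{\sigma}E\bigl\{\min(a, W^s - W)\mathbf{1}_{\{z\le W\le z+a\}}\bigr\} \\
&\ge \frac{\mu}{\sigma}E\bigl\{(W^s - W)\mathbf{1}_{\{W^s-W\le a\}}\mathbf{1}_{\{z\le W\le z+a\}}\bigr\}.
\end{aligned}
\]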
With the use of Lemma 5.2 we present the following result for monotone size bias couplings, from Goldstein and Zhang (2010).

Theorem 5.7 Let $Y$ be a nonnegative random variable with mean $\mu$ and finite positive variance $\sigma^2$, and let $Y^s$ be given on the same space as $Y$, with the $Y$-size biased distribution, satisfying $Y^s \ge Y$. Then with
\[ W = (Y - \mu)/\sigma \quad\text{and}\quad W^s = (Y^s - \mu)/\sigma, \]
for any $a \ge 0$,
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le D + 0.82\frac{a^2\mu}{\sigma} + a + \frac{\mu}{\sigma}E\bigl\{(W^s - W)\mathbf{1}_{\{W^s-W>a\}}\bigr\}, \]
where $D$ is as in (5.21). If $W^s - W \le \delta$ with probability 1,
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le D + 0.82\frac{\delta^2\mu}{\sigma} + \delta. \]

Proof Let $z \in \mathbb{R}$ and let $f$ be the solution to the Stein equation (2.4) for $h(w) = \mathbf{1}_{\{w\le z\}}$. Decompose $Eh(W) - Nh$ as in (5.24) in the proof of Theorem 5.6, and bound, as there, the first term by $D$, noting that (2.8) applies in the present case. For the remaining term of (5.24) we write
\[
\begin{aligned}
\frac{\mu}{\sigma}\int_0^{W^s-W}\bigl(f'(W+t) - f'(W)\bigr)dt &= \frac{\mu}{\sigma}\mathbf{1}_{\{W^s-W>a\}}\int_0^{W^s-W}\bigl(f'(W+t) - f'(W)\bigr)dt \\
&\quad + \frac{\mu}{\sigma}\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\bigl(f'(W+t) - f'(W)\bigr)dt \\
&:= J_1 + J_2, \quad\text{say.}
\end{aligned}
\]
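By (2.8),
\[ |EJ_1| \le \frac{\mu}{\sigma}E\bigl\{(W^s - W)\mathbf{1}_{\{W^s-W>a\}}\bigr\}, \]
yielding the last term in the first bound of the theorem. Now express $J_2$ using (2.4) as the sum
\[
\begin{aligned}
&\frac{\mu}{\sigma}\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\bigl((W+t)f(W+t) - Wf(W)\bigr)dt \\
&\quad + \frac{\mu}{\sigma}\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\bigl(\mathbf{1}_{\{W+t\le z\}} - \mathbf{1}_{\{W\le z\}}\bigr)dt.
\end{aligned}
\tag{5.26}
\]
Applying (2.10) to the first term in (5.26) shows that the absolute value of its expectation is bounded by
\[
\begin{aligned}
\frac{\mu}{\sigma}E\Bigl\{\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)t\,dt\Bigr\} &\le \frac{\mu}{2\sigma}E\Bigl\{(W^s - W)^2\mathbf{1}_{\{W^s-W\le a\}}\Bigl(|W| + \frac{\sqrt{2\pi}}{4}\Bigr)\Bigr\} \\
&\le \frac{\mu}{2\sigma}a^2\Bigl(1 + \frac{\sqrt{2\pi}}{4}\Bigr) \\
&\le 0.82\frac{a^2\mu}{\sigma}.
\end{aligned}
\]
Taking the expectation of the absolute value of the second term in (5.26), we have
\[
\begin{aligned}
\frac{\mu}{\sigma}E\Bigl|\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\bigl(\mathbf{1}_{\{W+t\le z\}} - \mathbf{1}_{\{W\le z\}}\bigr)dt\Bigr| &= \frac{\mu}{\sigma}E\Bigl\{\mathbf{1}_{\{W^s-W\le a\}}\int_0^{W^s-W}\mathbf{1}_{\{z-t<W\le z\}}\,dt\Bigr\} \\
&\le \frac{\mu}{\sigma}E\bigl\{(W^s - W)\mathbf{1}_{\{W^s-W\le a\}}\mathbf{1}_{\{z-a<W\le z\}}\bigr\}
\end{aligned}
\]
which is bounded by $a$, by Lemma 5.2, completing the proof of the first claim. The second claim follows immediately by letting $a = \delta$.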
5.4 Size Biasing and Smoothing Inequalities

In this section we present one further result which may be applied in situations where bounded size bias couplings exist. The method here, using smoothing inequalities, yields bounds in terms of supremums over function classes $\mathcal{H}$, and is more general than methods which only produce bounds in the Kolmogorov distance. Naturally, we may pay the price in larger constants. We follow the approach of Rinott and Rotar (1997), itself stemming from Götze (1991) and Bhattacharya and Rao (1986).
In order to state our results we now introduce conditions on the function classes $\mathcal{H}$ we consider. Since in Sect. 12.4 we will consider approximation in $\mathbb{R}^p$ we state Condition 5.1 in this generality, and in particular we take $Z$ in (iii) to be a standard normal variable with mean zero and identity covariance matrix in this space. In the present chapter we consider only the one dimensional case $p = 1$.

Condition 5.1 $\mathcal{H}$ is a class of real valued measurable functions on $\mathbb{R}^p$ such that

(i) The functions $h \in \mathcal{H}$ are uniformly bounded in absolute value by a constant, which we take to be 1 without loss of generality.

(ii) For any real numbers $c$ and $d$, and for any $h(x) \in \mathcal{H}$, the function $h(cx + d) \in \mathcal{H}$.

(iii) For any $\epsilon > 0$ and $h \in \mathcal{H}$, the functions $h_\epsilon^+$, $h_\epsilon^-$ are also in $\mathcal{H}$, and
\[ E\tilde{h}_\epsilon(Z) \le a\epsilon \tag{5.27} \]
for some constant $a$ that depends only on the class $\mathcal{H}$, where
\[ h_\epsilon^+(x) = \sup_{|y|\le\epsilon}h(x+y), \quad h_\epsilon^-(x) = \inf_{|y|\le\epsilon}h(x+y) \quad\text{and}\quad \tilde{h}_\epsilon(x) = h_\epsilon^+(x) - h_\epsilon^-(x). \tag{5.28} \]

Given a function class $\mathcal{H}$ and random variables $X$ and $Y$, let
\[ \|L(X) - L(Y)\|_{\mathcal{H}} = \sup_{h\in\mathcal{H}}\bigl|Eh(X) - Eh(Y)\bigr|. \tag{5.29} \]
In one dimension, the collection of indicators of all half lines, and indicators of all intervals, each form classes $\mathcal{H}$ that satisfy Condition 5.1 with $a = \sqrt{2/\pi}$ and $a = 2\sqrt{2/\pi}$ respectively (see e.g. Rinott and Rotar 1997); clearly, in the first case the distance (5.29) specializes to the Kolmogorov metric.

Theorem 5.8 Let $Y$ be a nonnegative random variable with finite, nonzero mean $\mu$ and variance $\sigma^2 \in (0,\infty)$, and suppose there exists a variable $Y^s$, having the $Y$-size biased distribution, defined on the same space as $Y$, satisfying $|Y^s - Y| \le A$ for some $A \le \sigma^{3/2}/\sqrt{9\mu}$. Then, when $\mathcal{H}$ satisfies Condition 5.1 for some constant $a$,
\[ \|L(W) - L(Z)\|_{\mathcal{H}} \le \frac{0.21aA}{\sigma} + \frac{\mu}{\sigma^2}\Bigl((12.4 + 58.1a)\frac{A^2}{\sigma} + \frac{2.5A^3}{\sigma^2}\Bigr) + 15D, \]
where $W = (Y - \mu)/\sigma$, $Z$ is a standard normal, and $D$ is given by (5.21).

Specializing to the case where $\mathcal{H}$ is the collection of indicators of half lines and $a = \sqrt{2/\pi}$ yields the bound
\[ \sup_{z\in\mathbb{R}}\bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{0.17A}{\sigma} + \frac{\mu}{\sigma^2}\Bigl(\frac{58.8A^2}{\sigma} + \frac{2.5A^3}{\sigma^2}\Bigr) + 15D, \]
demonstrating, by comparison with, say, the bound in Theorem 5.6, that the consideration of general function classes $\mathcal{H}$ comes at some expense. One reason for the
increase in the magnitude of the constants is that general bounds on the solution to the Stein equation, as given by Lemma 2.4, must be applied here, and not, say, the more specialized bounds of Lemma 2.3 which require that the function $h$ be an indicator.

Let $\phi(y)$ denote the standard normal density and for $h \in \mathcal{H}$ and $t \in (0,1)$, define
\[ h_t(w) = \int h(w + ty)\phi(y)\,dy. \tag{5.30} \]
The function $h_t(w)$ is a smoothed version of $h(w)$, with smoothing parameter $t$, and clearly $\|h_t\| \le \|h\|$. Furthermore, in this section, let
\[ \kappa = \sup\bigl\{|Eh(W) - Nh|: h \in \mathcal{H}\bigr\} \tag{5.31} \]
and for $t \in (0,1)$ set
\[ \kappa_t = \sup\bigl\{|Eh_t(W) - Nh_t|: h \in \mathcal{H}\bigr\}. \]

Lemma 5.3 Let $\mathcal{H}$ be a class of functions satisfying Condition 5.1 with constant $a$. Then, for any random variable $W$,
\[ \kappa \le 2.8\kappa_t + 4.7at \quad\text{for all } t \in (0,1). \tag{5.32} \]
Furthermore, for all $\delta > 0$, $t \in (0,1)$ and $\tilde{h}_\epsilon$ as in Condition 5.1,
\[ E\int\tilde{h}_{\delta+t|y|}(W)\,|\phi'(y)|\,dy \le 1.6\kappa + a(\delta + t). \tag{5.33} \]
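Proof Inequality (5.32) is Lemma 4.1 of Rinott and Rotar (1997), following Lemma 2.11 of Götze (1991) from Bhattacharya and Rao (1986). As in Rinott and Rotar (1997), adding and subtracting to the left hand side of (5.33) we have
\begin{align*}
E\int \tilde{h}_{\delta+t|y|}(W)\,|\phi'(y)|\,dy
&= \int \big[ E\tilde{h}_{\delta+t|y|}(W) - E\tilde{h}_{\delta+t|y|}(Z) \big] |\phi'(y)|\,dy + \int E\tilde{h}_{\delta+t|y|}(Z)\,|\phi'(y)|\,dy \\
&\le \int \big| E\tilde{h}_{\delta+t|y|}(W) - E\tilde{h}_{\delta+t|y|}(Z) \big|\, |\phi'(y)|\,dy + \int E\tilde{h}_{\delta+t|y|}(Z)\,|\phi'(y)|\,dy \\
&\le 1.6\kappa + \int a(\delta + t|y|)\,|\phi'(y)|\,dy \le 1.6\kappa + a(\delta + t),
\end{align*}
where we have used the definitions of $\tilde{h}$ and $\kappa$ and that $\int |\phi'(y)|\,dy = \sqrt{2/\pi}$ for the first term, and then additionally (5.27) and $\int |y||\phi'(y)|\,dy = 1$ for the second.

Lemma 5.4 Let $Y \ge 0$ be a random variable with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$, and let $Y^s$ be defined on the same space as Y, with the Y-size biased distribution, satisfying $|Y^s - Y|/\sigma \le \delta$ for some $\delta$. Then for all $t \in (0, 1)$,
\[ \kappa_t \le 4D + \frac{\mu}{\sigma}\left( \Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{1}{2t}\big(1.6\kappa\delta^2 + a\delta^3\big) \right), \tag{5.34} \]
with D as in (5.21).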
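Proof With $h \in \mathcal{H}$ and $t \in (0, 1)$ let f be the solution to the Stein equation (2.4) for $h_t$. Letting $W = (Y - \mu)/\sigma$ and $W^s = (Y^s - \mu)/\sigma$ we have $|W^s - W| \le \delta$. From (5.19) we obtain
\[ EWf(W) = \frac{\mu}{\sigma} E\big[ f(W^s) - f(W) \big], \tag{5.35} \]
and so, letting $V = W^s - W$,
\begin{align*}
Eh_t(W) - Nh_t &= E\big[ f'(W) - Wf(W) \big] \\
&= E\Big[ f'(W) - \frac{\mu}{\sigma}\big( f(W^s) - f(W) \big) \Big] \\
&= E\Big[ f'(W) - \frac{\mu}{\sigma}\int_W^{W^s} f'(w)\,dw \Big] \\
&= E\Big[ f'(W) - \frac{\mu}{\sigma} V \int_0^1 f'(W + uV)\,du \Big] \\
&= E\Big[ f'(W)\Big(1 - \frac{\mu}{\sigma}V\Big) \Big] + E\Big[ \frac{\mu}{\sigma}Vf'(W) - \frac{\mu}{\sigma}V\int_0^1 f'(W + uV)\,du \Big]. \tag{5.36}
\end{align*}
Bounding the first term in (5.36), by (2.12) and that $\|h_t\| \le 1$, and definition (5.21), we have
\[ \Big| E\Big\{ f'(W)\, E\Big[ 1 - \frac{\mu}{\sigma}V \,\Big|\, W \Big] \Big\} \Big| \le 4D. \tag{5.37} \]
By (5.30) and a change of variable we may write
\[ h_t(w + s) - h_t(w) = \int h(w + ty)\big[ \phi(y - s/t) - \phi(y) \big]\,dy, \tag{5.38} \]
so, for the second term in (5.36), applying the dominated convergence theorem in (5.38) and differentiating the Stein equation (2.4),
\[ f''(w) = f(w) + wf'(w) + h_t'(w) \quad \text{with} \quad h_t'(w) = -\frac{1}{t}\int h(w + ty)\phi'(y)\,dy. \tag{5.39} \]
Hence, we may write the second term in (5.36) as the expectation of
\begin{align*}
\frac{\mu}{\sigma}V\Big( f'(W) - \int_0^1 f'(W + uV)\,du \Big)
&= \frac{\mu}{\sigma}V\int_0^1 \big( f'(W) - f'(W + uV) \big)\,du \\
&= -\frac{\mu}{\sigma}V\int_0^1\!\int_W^{W+uV} f''(v)\,dv\,du \\
&= -\frac{\mu}{\sigma}V\int_0^1\!\int_W^{W+uV} \big( f(v) + vf'(v) + h_t'(v) \big)\,dv\,du. \tag{5.40}
\end{align*}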
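We apply the triangle inequality and bound the three resulting terms separately. For the expectation arising from the first term on the right-hand side of (5.40), by (2.12) and that $\|h_t\| \le 1$ we have
\[ \Big| E\Big\{ \frac{\mu}{\sigma}V\int_0^1\!\int_W^{W+uV} f(v)\,dv\,du \Big\} \Big| \le \sqrt{2\pi}\,\frac{\mu}{\sigma}\,E\Big\{ |V|\int_0^1 u|V|\,du \Big\} \le 1.3\,\frac{\mu}{\sigma}\,\delta^2. \tag{5.41} \]
For the second term in (5.40), again applying (2.12),
\begin{align*}
\Big| E\Big\{ \frac{\mu}{\sigma}V\int_0^1\!\int_W^{W+uV} vf'(v)\,dv\,du \Big\} \Big|
&\le \frac{2\mu}{\sigma}\,E\Big\{ |V|\int_0^1 \Big| \int_W^{W+uV} 2|v|\,dv \Big|\,du \Big\} \\
&\le \frac{2\mu}{\sigma}\,E\Big\{ |V|\int_0^1 \big( 2u|WV| + u^2V^2 \big)\,du \Big\} \\
&\le \frac{2\mu}{\sigma}\,\delta\int_0^1 \big( 2\delta uE|W| + u^2\delta^2 \big)\,du \\
&\le \frac{2\mu}{\sigma}\,\delta\big( \delta + \delta^2/3 \big). \tag{5.42}
\end{align*}
For the last term in (5.40), beginning with the inner integral, we have
\[ \int_W^{W+uV} h_t'(v)\,dv = uV\int_0^1 h_t'(W + xuV)\,dx \]
and using (5.39), $\int \phi'(y)\,dy = 0$, and Lemma 5.3 we have
\begin{align*}
\Big| \frac{\mu}{\sigma}\,EV^2\int_0^1\!\int_0^1 u\,h_t'(W + xuV)\,dx\,du \Big|
&= \Big| \frac{\mu}{\sigma t}\,EV^2\int_0^1\!\int_0^1\!\int u\,h(W + xuV + ty)\phi'(y)\,dy\,dx\,du \Big| \\
&= \Big| \frac{\mu}{\sigma t}\,EV^2\int_0^1\!\int_0^1\!\int u\big[ h(W + xuV + ty) - h(W + xuV) \big]\phi'(y)\,dy\,dx\,du \Big| \\
&\le \frac{\mu}{\sigma t}\,E\Big\{ V^2\int\!\int_0^1 u\big[ h^+_{|V|+t|y|}(W) - h^-_{|V|+t|y|}(W) \big]\,du\,|\phi'(y)|\,dy \Big\} \\
&= \frac{\mu}{2\sigma t}\,E\Big\{ V^2\int \big[ h^+_{|V|+t|y|}(W) - h^-_{|V|+t|y|}(W) \big]\,|\phi'(y)|\,dy \Big\} \\
&\le \frac{\mu}{2\sigma t}\,\delta^2\,E\int \tilde{h}_{\delta+t|y|}(W)\,|\phi'(y)|\,dy \\
&\le \frac{\mu}{2\sigma t}\,\delta^2\big( 1.6\kappa + a(\delta + t) \big) \\
&= \frac{\mu}{2\sigma t}\big( 1.6\kappa\delta^2 + a\delta^3 \big) + \frac{\mu}{2\sigma}\,a\delta^2. \tag{5.43}
\end{align*}
Combining (5.37), (5.41), (5.42), and (5.43) completes the proof.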
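Proof of Theorem 5.8 Substituting (5.34) into (5.32) of Lemma 5.3 we obtain
\[ \kappa \le 2.8\left( 4D + \frac{\mu}{\sigma}\Big( \Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{1}{2t}\big( 1.6\kappa\delta^2 + a\delta^3 \big) \Big) \right) + 4.7at, \]
or,
\[ \kappa \le \frac{2.8\big( 4D + (\mu/\sigma)\big( (3.3 + \frac{1}{2}a)\delta^2 + \frac{2}{3}\delta^3 + a\delta^3/(2t) \big) \big) + 4.7at}{1 - 2.24\mu\delta^2/(\sigma t)}. \tag{5.44} \]
Setting $t = 4 \times 2.24\mu\delta^2/\sigma$, which is a number in (0, 1) since $\delta \le (\sigma/(9\mu))^{1/2}$, we obtain
\begin{align*}
\kappa &\le \frac{4}{3} \times 2.8\left( 4D + \frac{\mu}{\sigma}\Big( \Big(3.3 + \frac{1}{2}a\Big)\delta^2 + \frac{2}{3}\delta^3 + \frac{\sigma}{2(8.96\mu)}\,a\delta \Big) \right) + \frac{4}{3} \times 4.7a\,\frac{8.96\mu\delta^2}{\sigma} \\
&\le 0.21a\delta + \frac{\mu}{\sigma}\big( (12.4 + 58.1a)\delta^2 + 2.5\delta^3 \big) + 15D.
\end{align*}
Substituting $\delta = A/\sigma$ now completes the proof.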
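The arithmetic behind the constants in Theorem 5.8 can be checked mechanically. The following is a minimal sketch, assuming only the expression (5.44) and the choice $t = 4 \times 2.24\mu\delta^2/\sigma$; the function name and dictionary keys are illustrative and do not come from the text.

```python
import math


def theorem_5_8_constants():
    """Recompute the coefficients produced by t = 4 * 2.24 * mu * delta**2 / sigma in (5.44)."""
    t_coeff = 4 * 2.24                      # t = t_coeff * mu * delta^2 / sigma
    factor = 1.0 / (1.0 - 2.24 / t_coeff)   # 1 / (1 - 1/4) = 4/3
    lead = factor * 2.8                     # multiplies the bracket in (5.44)
    return {
        "D": lead * 4.0,                                  # coefficient of D
        "delta2": lead * 3.3,                             # a-free coefficient of (mu/sigma) delta^2
        "a_delta2": lead * 0.5 + factor * 4.7 * t_coeff,  # coefficient of (mu/sigma) a delta^2
        "delta3": lead * 2.0 / 3.0,                       # coefficient of (mu/sigma) delta^3
        "a_delta": lead / (2.0 * t_coeff),                # coefficient of a delta
    }


if __name__ == "__main__":
    for name, value in theorem_5_8_constants().items():
        print(f"{name}: {value:.4f}")
    a_half_lines = math.sqrt(2.0 / math.pi)
    print("half-line case:", 0.21 * a_half_lines, 12.4 + 58.1 * a_half_lines)
```

Each computed coefficient should sit just below the rounded constant quoted in the theorem (15, 12.4, 58.1, 2.5 and 0.21), and the half-line specialization $a = \sqrt{2/\pi}$ should reproduce 0.17 and 58.8 after rounding up.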
Chapter 6
L∞ : Applications
In this chapter we consider the application of the results of Chap. 5 to obtain L∞ bounds for the combinatorial central limit theorem, counting the occurrences of patterns, the anti-voter model, and for the binary expansion of a random integer.
6.1 Combinatorial Central Limit Theorem

Recall that in the combinatorial central limit theorem we study the distribution of
\[ Y = \sum_{i=1}^n a_{i,\pi(i)} \tag{6.1} \]
where $A = \{a_{ij}\}_{i,j=1}^n$ is a given array of real numbers and $\pi$ a random permutation. This setting was introduced in Example 2.3, and L1 bounds to the normal were derived in Sect. 4.4 for the case where $\pi$ is chosen uniformly from the symmetric group $S_n$; some further background, motivation, applications, references and history on the combinatorial CLT were also presented in that section. For $\pi$ chosen uniformly, von Bahr (1976) and Ho and Chen (1978) obtained L∞ bounds to the normal when the matrix A is random, which yield the correct rate $O(n^{-1/2})$ only under some boundedness conditions. Here we focus on the case where A is non-random. In Sect. 6.1.1 we present the result of Bolthausen (1984), which gives a bound of the correct order in terms of a third-moment quantity of the type (4.107), but with an unspecified constant. In this same section, based on Goldstein (2005), we give bounds of the correct order and with an explicit constant, but in terms of the maximum absolute array value. In Sect. 6.1.2 we also give L∞ bounds when the distribution of the permutation $\pi$ is constant on cycle type and has no fixed points, expressing the bounds again in terms of the maximum array value. For the last two results mentioned we make use of Lemma 4.6, which, given $\pi$, constructs permutations $\pi^\dagger$ and $\pi^\ddagger$ on the same space as $\pi$ such that
\[ Y^\dagger = \sum_{i=1}^n a_{i\pi^\dagger(i)} \quad \text{and} \quad Y^\ddagger = \sum_{i=1}^n a_{i\pi^\ddagger(i)} \]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_6, © Springer-Verlag Berlin Heidelberg 2011
have the square bias distribution as in Proposition 4.6. As noted in Sect. 4.4 for L1 bounds in the uniform case, the permutations $\pi$, $\pi^\dagger$ and $\pi^\ddagger$ agree on the complement of some small index set $\mathcal{I}$, and hence we may write
\[ Y = S + T, \quad Y^\dagger = S + T^\dagger \quad \text{and} \quad Y^\ddagger = S + T^\ddagger, \tag{6.2} \]
where
\[ S = \sum_{i \notin \mathcal{I}} a_{i,\pi(i)}, \quad T = \sum_{i \in \mathcal{I}} a_{i,\pi(i)}, \quad T^\dagger = \sum_{i \in \mathcal{I}} a_{i,\pi^\dagger(i)} \quad \text{and} \quad T^\ddagger = \sum_{i \in \mathcal{I}} a_{i,\pi^\ddagger(i)}. \tag{6.3} \]
Now, as $Y^* = UY^\dagger + (1 - U)Y^\ddagger$ has the Y-zero bias distribution by Proposition 4.6, we have
\[ |Y^* - Y| = \big| UT^\dagger + (1 - U)T^\ddagger - T \big| \le U|T^\dagger| + (1 - U)|T^\ddagger| + |T|. \tag{6.4} \]
Hence when $\mathcal{I}$ is almost surely bounded (6.4) gives an upper bound on $|Y^* - Y|$ equal to the largest size of $\mathcal{I}$ times twice the largest absolute array value. Now Theorem 5.1 for bounded zero bias couplings yields an L∞ norm bound in any instance where such constructions can be achieved. In the remainder of this section, to avoid trivial cases we assume that $\mathrm{Var}(Y) = \sigma^2 > 0$, and for ease of notation we will write Y and $\pi$ interchangeably for $Y'$ and $\pi'$, respectively.
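As a small numerical illustration of the statistic (6.1), the sketch below enumerates all permutations of a tiny array and compares the exact mean and variance of Y with the classical closed forms for a uniform permutation, namely $EY = n a_{\bullet\bullet}$ and $\mathrm{Var}(Y) = (n-1)^{-1}\sum_{i,j}(a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet})^2$. Formula (4.105) is not reproduced in this section, so treating it as these classical expressions is an assumption, although it agrees with the form of $\sigma_D^2$ used in the proof of Lemma 6.3. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations


def exact_moments(a):
    """Mean and variance of Y = sum_i a[i][pi(i)] by enumerating all of S_n (integer entries)."""
    n = len(a)
    values = [sum(a[i][p[i]] for i in range(n)) for p in permutations(range(n))]
    count = len(values)
    mean = Fraction(sum(values), count)
    var = Fraction(sum(v * v for v in values), count) - mean ** 2
    return mean, var


def formula_moments(a):
    """Closed-form mean and variance for a uniform permutation (assumed to match (4.105))."""
    n = len(a)
    row = [Fraction(sum(a[i]), n) for i in range(n)]
    col = [Fraction(sum(a[i][j] for i in range(n)), n) for j in range(n)]
    grand = sum(row) / n
    mean = n * grand
    var = sum(
        (a[i][j] - row[i] - col[j] + grand) ** 2 for i in range(n) for j in range(n)
    ) / (n - 1)
    return mean, var


if __name__ == "__main__":
    example = [[1, 2, 0, 4], [3, -1, 2, 2], [0, 5, 1, -2], [2, 2, 7, 1]]
    print(exact_moments(example), formula_moments(example))
```

Exact rational arithmetic is used so that the two computations can be compared for equality rather than up to rounding; enumeration is only feasible for very small n.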
6.1.1 Uniform Distribution on the Symmetric Group

We approach the uniform permutation case in two different ways, first using zero biasing, then by an inductive method. Using zero biasing, combining the coupling given in Sect. 4.4 with Theorem 5.1 quickly leads to the following result.

Theorem 6.1 Let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers and let $\pi$ be a random permutation with uniform distribution over $S_n$. Then, with Y as in (6.1) and $W = (Y - \mu)/\sigma$,
\[ \sup_{z \in \mathbb{R}} |P(W \le z) - P(Z \le z)| \le 16.3C/\sigma \quad \text{for } n \ge 3, \]
where $\mu$ and $\sigma^2 = \mathrm{Var}(Y)$ are given by (4.105), and
\[ C = \max_{1 \le i,j \le n} |a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}|, \]
with the row, column and overall array averages $a_{i\bullet}$, $a_{\bullet j}$ and $a_{\bullet\bullet}$ as in (2.44).

Proof By (2.45) we may first replace $a_{ij}$ by $a_{ij} - a_{i\bullet} - a_{\bullet j} + a_{\bullet\bullet}$, and in particular assume $EY = 0$. Following the construction in Lemma 4.6, we obtain the variable $Y^* = UY^\dagger + (1 - U)Y^\ddagger$ with the Y-zero biased distribution, where $Y$, $Y^\dagger$ and $Y^\ddagger$ may be written as in (6.2) and (6.3) with
\[ |\mathcal{I}| = |\{ I^\dagger, J^\dagger, \pi^{-1}(K^\dagger), \pi^{-1}(L^\dagger) \}| \le 4 \]
by (4.135). As $W^* = Y^*/\sigma$ has the W-zero bias distribution by (2.59), applying (6.4) we obtain
\[ E|W^* - W| = E|Y^* - Y|/\sigma \le 8C/\sigma. \]
Our claim now follows from Theorem 5.1 by taking $\delta = 8C/\sigma$.
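To see what the quantities in Theorem 6.1 look like on a concrete array, the following sketch computes C, $\sigma$ and the bound $16.3C/\sigma$, and compares the bound with the exact Kolmogorov distance obtained by enumerating $S_n$. It assumes the classical variance formula for a uniform permutation stands in for (4.105); the helper names are hypothetical.

```python
import math
from itertools import permutations


def theorem_6_1_quantities(a):
    """Return (C, sigma, 16.3 * C / sigma) for the array a, using the doubly centered entries."""
    n = len(a)
    row = [sum(a[i]) / n for i in range(n)]
    col = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    centered = [[a[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]
    c_max = max(abs(x) for r in centered for x in r)
    sigma = math.sqrt(sum(x * x for r in centered for x in r) / (n - 1))
    return c_max, sigma, 16.3 * c_max / sigma


def exact_kolmogorov_distance(a):
    """sup_z |P(W <= z) - Phi(z)| for W = (Y - EY)/sd(Y), by enumeration over S_n."""
    n = len(a)
    values = sorted(sum(a[i][p[i]] for i in range(n)) for p in permutations(range(n)))
    total = len(values)
    mu = sum(values) / total
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / total)
    dist = 0.0
    for k, v in enumerate(values):
        phi = 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
        # The supremum is attained at an atom, either at its left limit or at its value.
        dist = max(dist, abs(k / total - phi), abs((k + 1) / total - phi))
    return dist


if __name__ == "__main__":
    example = [[1, 4, 2, 0, 3], [2, 2, 5, 1, 0], [0, 3, 1, 4, 2], [5, 0, 2, 2, 1], [1, 1, 0, 3, 6]]
    print(theorem_6_1_quantities(example), exact_kolmogorov_distance(example))
```

For arrays this small the bound exceeds 1 and is therefore uninformative; it becomes meaningful only when $C/\sigma$ is small, which for bounded arrays typically requires large n.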
With a bit more work, we can use the zero bias variation of Ghosh (2009) on the inductive method in Bolthausen (1984) to prove an L∞ bound depending on a third moment type quantity of the array, like the L1 bound in Theorem 4.8. On the other hand, the bound in Theorem 6.2 depends on an unspecified constant, whereas the constant in Theorem 6.1 is explicit. Though induction was used in Sect. 3.4.2 for the independent case, the inductive approach taken here has a somewhat different flavor. Bolthausen's inductive method has also been put to use by Fulman (2006) for character ratios, and by Goldstein (2010b) for addressing questions about random graphs.

Theorem 6.2 Let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers and let $\pi$ be a random permutation with the uniform distribution over $S_n$. Let Y be as in (6.1) and $\mu_A$ and $\sigma_A^2 = \mathrm{Var}(Y)$ be given by (4.105). Then, with $W = (Y - \mu_A)/\sigma_A$, there exists a constant c such that
\[ \sup_{z \in \mathbb{R}} |P(W \le z) - P(Z \le z)| \le c\gamma_A/(\sigma_A^3 n) \quad \text{for all } n \ge 2, \]
where $\gamma_A$ is given in (4.107).

To prepare for the proof we need some additional notation. For $n \in \mathbb{N}$ and an array $E \in \mathbb{R}^{n \times n}$ let
\[ W_E = \sum_{i=1}^n e_{i,\pi(i)}, \]
and let $E^0$ be the centered array with components
\[ e^0_{ij} = e_{ij} - e_{i\bullet} - e_{\bullet j} + e_{\bullet\bullet} \tag{6.5} \]
where the array averages are given by (2.44). In addition, when $\sigma_E^2 > 0$, let $\widetilde{E}$ be the array given by
\[ \widetilde{e}_{ij} = e^0_{ij}/\sigma_E, \tag{6.6} \]
and set $\beta_E = \gamma_E/\sigma_E^3$. Clearly, if E is an array with $\sigma_E^2 > 0$ then
\[ \beta_E = \beta_{E^0} = \beta_{\widetilde{E}}. \tag{6.7} \]
For any $E \in \mathbb{R}^{n \times n}$ let $E'$ be the truncated array whose components are given by
\[ e'_{ij} = e_{ij}\,\mathbf{1}\big( |e_{ij}| \le 1/2 \big). \tag{6.8} \]
For $\beta > 0$ let
\begin{align*}
M_n(\beta) &= \big\{ E \in \mathbb{R}^{n \times n} : e_{\bullet j} = e_{i\bullet} = 0 \text{ for all } i, j = 1, \ldots, n,\ \sigma_E^2 = 1,\ \beta_E \le \beta \big\}, \\
M_n^1(\beta) &= \big\{ E \in M_n(\beta) : |e_{ij}| \le 1 \text{ for all } i, j = 1, \ldots, n \big\},
\end{align*}
and
\[ M_n = \bigcup_{\beta > 0} M_n(\beta) \quad \text{and} \quad M_n^1 = \bigcup_{\beta > 0} M_n^1(\beta). \]
We note that if E is any $n \times n$ array with $\sigma_E^2 > 0$ then $\widetilde{E} \in M_n(\beta)$ for all $\beta \ge \beta_E$, and if $E \in M_n$ then $\widetilde{E} = E$. Let
\[ \delta^1(\beta, n) = \sup\big\{ |P(W_E \le z) - \Phi(z)| : z \in \mathbb{R},\ E \in M_n^1(\beta) \big\}. \tag{6.9} \]
The proof of the theorem depends on the following four lemmas, whose proofs are deferred to the end of this section. The first two lemmas are used to control the effects of truncation and scaling.

Lemma 6.1 For $n \ge 2$ and $E \in M_n$ let $E'$ be the truncated array given by (6.8). Then there exists $c_1 \ge 1$ such that
\[ P(W_E \ne W_{E'}) \le c_1\beta_E/n \quad \text{and} \quad |\mu_{E'}| \le c_1\beta_E/n. \tag{6.10} \]
In addition, there exist constants $\epsilon_1$ and $c_2$ such that when $\beta_E/n \le \epsilon_1$
\[ |\sigma_{E'}^2 - 1| \le c_2\beta_E/n, \quad \widetilde{E'} \in M_n^1 \quad \text{and} \quad \beta_{E'} \le c_1\beta_E. \tag{6.11} \]

Lemma 6.2 There exist constants $\epsilon_2$ and $c_3$ such that if $E \in M_n$ for some $n \ge 2$ and $\widetilde{E'}$ is as in (6.8), (6.6) and (6.5), then whenever $\beta_E/n \le \epsilon_2$
\[ \sup_{z \in \mathbb{R}} |P(W_E \le z) - \Phi(z)| \le \sup_{z \in \mathbb{R}} |P(W_{\widetilde{E'}} \le z) - \Phi(z)| + c_3\beta_E/n. \]

The following lemma handles the effects of deleting rows and columns from an array in $M_n^1$.

Lemma 6.3 There exist $n_0 \ge 16$, $\epsilon_3 > 0$ and $c_4 \ge 1$ such that if $n \ge n_0$, $l \le 4$ and $C \in M_n^1$, when D is the $(n - l) \times (n - l)$ array formed by removing the l rows $\mathcal{R} \subset \{1, 2, \ldots, n\}$ and l columns $\mathcal{C} \subset \{1, 2, \ldots, n\}$ from C, we have $|\mu_D| \le 8$, and if $\beta_C/n \le \epsilon_3$ then
\[ |\sigma_D^2 - 1| \le 3/4 \quad \text{and} \quad \beta_D \le c_4\beta_C. \]

The proof, being inductive in nature, expresses the distance to normality for a problem of a given size in terms of the distances to normality for the same problem, but of smaller sizes. This last lemma is used to handle the resulting recursion for the relation between these distances.
Lemma 6.4 Let $\{s_n\}_{n \ge 1}$ be a sequence of nonnegative numbers and $m \ge 5$ a positive integer such that
\[ s_n \le d + \alpha \max_{l \in \{2,3,4\}} s_{n-l} \quad \text{for all } n \ge m, \tag{6.12} \]
with $d \ge 0$ and $\alpha \in (0, 1)$. Then
\[ \sup_{n \ge 1} s_n < \infty. \]
Proof of Theorem 6.2 In view of (2.45) and (6.7) it suffices to prove the theorem for $W_B$ with $B \in M_n$. Let $\epsilon_1$, $c_1$ and $c_2$ be as in Lemma 6.1, $\epsilon_2$ and $c_3$ as in Lemma 6.2, and $n_0$, $\epsilon_3$ and $c_4$ as in Lemma 6.3. Noting that from the lemmas we have
\[ n_0 \ge 16 \quad \text{and} \quad c_1 \ge 1 \quad \text{and} \quad c_4 \ge 1, \tag{6.13} \]
set
\[ \epsilon_0 = \min\big\{ 1/(2n_0),\ \epsilon_1/c_1,\ \epsilon_3/c_1,\ 3\epsilon_1/(4c_4c_1),\ 3\epsilon_2/(4c_4c_1) \big\}. \]
We first demonstrate that it suffices to prove the theorem for $\beta_B/n < \epsilon_0$ and $n > n_0$. By Hölder's inequality and (4.105), for all $n \in \mathbb{N}$,
\[ \frac{(n - 1)^{1/2}}{n^{1/3}} = \frac{1}{n^{1/3}}\Big( \sum_{i,j=1}^n b_{ij}^2 \Big)^{1/2} \le \Big( \sum_{i,j=1}^n |b_{ij}|^3 \Big)^{1/3} = \beta_B^{1/3}. \tag{6.14} \]
As inequality (6.14) implies that $\beta_B \ge 1/2$ for all $n \ge 2$, we have $\beta_B/n \ge \epsilon_0$ for all $2 \le n \le n_0$. Hence, taking $c \ge 1/\epsilon_0$ the theorem holds if either $2 \le n \le n_0$ or B satisfies $\beta_B/n \ge \epsilon_0$. We may therefore assume $n \ge n_0$ and $\beta_B/n \le \epsilon_0$.

As $\beta_B/n \le \epsilon_0$, setting $C = \widetilde{B'}$ as in (6.8), (6.6) and (6.5), Lemma 6.2 yields
\[ \sup_{z \in \mathbb{R}} |P(W_B \le z) - \Phi(z)| \le \sup_{z \in \mathbb{R}} |P(W_C \le z) - \Phi(z)| + c_3\beta_B/n. \tag{6.15} \]
By (6.7) and (6.11) of Lemma 6.1 we have that
\[ \beta_C/n = \beta_{B'}/n \le c_1\beta_B/n \le c_1\epsilon_0, \tag{6.16} \]
and also that $C \in M_n^1$. Hence, by (6.15) and (6.16) it suffices to prove that there exists a constant $c_5$ such that
\[ \delta^1(\beta, n) \le c_5\beta/n \quad \text{for all } n \ge n_0 \text{ and } \beta/n \le c_1\epsilon_0. \tag{6.17} \]
For $z \in \mathbb{R}$ and $\alpha > 0$ let $h_{z,\alpha}(w)$ be the smoothed indicator function of $(-\infty, z]$, which decays linearly to zero over the interval $[z, z + \alpha]$, as given by (2.14), and set
\[ \delta^1(\alpha, \beta, n) = \sup\big\{ |Eh_{z,\alpha}(W_C) - Nh_{z,\alpha}| : z \in \mathbb{R},\ C \in M_n^1(\beta) \big\}. \tag{6.18} \]
Also, define $h_{z,0}(x) = \mathbf{1}(x \le z)$. As the collection of arrays $M_n^1(\beta)$ increases in $\beta$, so therefore does $\delta^1(\alpha, \beta, n)$.
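Now, since for any $z, w \in \mathbb{R}$ and $\alpha > 0$,
\[ h_{z,0}(w) \le h_{z,\alpha}(w) \le h_{z+\alpha,0}(w), \]
for all $C \in M_n^1(\beta)$ and all $\alpha > 0$ we have
\[ \sup_{z \in \mathbb{R}} |P(W_C \le z) - \Phi(z)| \le \sup_{z \in \mathbb{R}} |Eh_{z,\alpha}(W_C) - Eh_{z,\alpha}(Z)| + \frac{\alpha}{\sqrt{2\pi}}, \]
and taking supremum yields
\[ \delta^1(\beta, n) \le \delta^1(\alpha, \beta, n) + \frac{\alpha}{\sqrt{2\pi}}. \tag{6.19} \]
To prove (6.17), for $n \ge n_0$ let $C \in M_n^1$ satisfy $\beta_C/n \le c_1\epsilon_0$, and let f be the solution to the Stein equation (2.4) with $h = h_{z,\alpha}$ as in (2.14), for some fixed $z \in \mathbb{R}$. Following the construction in Lemma 4.6, we obtain the variable $W_C^* = UW_C^\dagger + (1 - U)W_C^\ddagger$ with the $W_C$-zero biased distribution. Now, using the bound (2.16) from Lemma 2.5 on the differences of the derivative of f, write
\begin{align*}
|Eh(W_C) - Nh| &= \big| E\big[ f'(W_C) - W_Cf(W_C) \big] \big| = \big| E\big[ f'(W_C) - f'(W_C^*) \big] \big| \\
&\le E\big| f'(W_C^*) - f'(W_C) \big| \le A_1 + A_2 + A_3, \tag{6.20}
\end{align*}
where
\[ A_1 = E|W_C^* - W_C|, \quad A_2 = E\big| W_C(W_C^* - W_C) \big| \]
and
\[ A_3 = \frac{1}{\alpha}\,E\Big\{ |W_C^* - W_C| \int_0^1 \mathbf{1}_{[z,z+\alpha]}\big( W_C + r(W_C^* - W_C) \big)\,dr \Big\}. \]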
First, from the L1 bound in Lemma 4.7, noting that γ in the lemma equals βC as σC2 = 1, we obtain (6.21) A1 = E WC∗ − WC ≤ c6 βC /n. Next, to estimate A2 , note that by (4.135) of Lemma 4.6 we may write WC† and WC‡ as in (6.2) and (6.3) with I = {I † , J † , π −1 (K † ), π −1 (L† )} and WC∗ − WC = U WC† + (1 − U )WC‡ − WC = U S + T † + (1 − U ) S + T ‡ − (S + T ) = U T † + (1 − U )T ‡ − T .
(6.22)
Now let I = (I † , J † , π −1 (K † ), π −1 (L† ), π(I † ), π(J † ), K † , L† ). By the construction in Lemma 4.5 the right hand side of (6.22), and hence WC∗ − WC , is measurable with respect to I = {I, U }. Furthermore, since C ∈ Mn1 and |I| ≤ 4, we have |ciπ(i) | ≤ |S| + 4. |WC | = |S + T | ≤ |S| + |T | ≤ |S| + i∈I
Now, using the definition of A2 , and that U is independent of {S, I}, we obtain
A2 = E WC WC∗ − WC = E WC∗ − WC E |WC ||I ≤ E WC∗ − WC E |S| + 4|I ≤ E WC∗ − WC E S 2 |I + 4E WC∗ − WC .
(6.23)
In the following, for ıa realization of I, let l denote the number of distinct elements of ı. Since S = i ∈/ I ciπ(i) and π is chosen uniformly from Sn , we have that
L(S|I = i) = L(WD ),
(6.24)
where WD = 1≤i≤n−l diθ(i) with D the (n − l) × (n − l) array formed by removing from C the rows {I † , J † , π −1 (K † ), π −1 (L† )} and columns {π(I † ), π(J † ), K † , L† }, and θ chosen uniformly from Sn−l . Using l ∈ {2, 3, 4}, that n ≥ n0 and βC /n ≤ c1 0 ≤ 3 , Lemma 6.3 yields |μD | ≤ 8 and that 2 σ − 1 ≤ 3/4, so that EW 2 ≤ c7 . (6.25) D D In particular E S 2 |I = i = EWD2 ≤ c7
for all i, and hence E S 2 |I ≤ c7 .
Now using (6.23) and (6.21), we obtain A2 ≤ c8 βC /n.
(6.26)
Finally, we are left with bounding A3 . First we note that for any r ∈ R, WC + r WC∗ − WC = rWC∗ + (1 − r)WC = r S + U T † + (1 − U )T ‡ + (1 − r)(S + T ) = S + rU T † + r(1 − U )T ‡ + (1 − r)T = S + gr
where gr = rU T † + r(1 − U )T ‡ + (1 − r)T .
Now, from the definition of A3 , again using that WC − WC∗ is I measurable,
1 1 1[z,z+α] WC + r WC∗ − WC dr A3 = E WC − WC∗ α 0
1
∗ 1 ∗ 1[z,z+α] WC + r WC − WC dr|I = E W C − WC E α 0
1 1 P WC + r WC∗ − WC ∈ [z, z + α]|I dr = E WC − WC∗ α 0
1 1 P S + gr ∈ [z, z + α]|I dr = E WC − WC∗ α 0
= ≤ = = =
1 1 P S ∈ [z − gr , z + α − gr ]|I dr E WC − WC∗ α 0
1 1 ∗ sup P S ∈ [z − gr , z + α − gr ]|I dr E W C − WC α 0 z∈R
1 1 ∗ sup P S ∈ [z, z + α]|I dr E W C − WC α 0 z∈R 1 E WC − WC∗ sup P S ∈ [z, z + α]|I α z∈R 1 E WC − WC∗ sup P S ∈ [z, z + α]|I , α z∈R
(6.27)
(6.28)
where to obtain equality in (6.27) we have used the fact that gr is measurable with respect to I for all r, and the equality in (6.28) follows from the independence of U from {S, I}. Regarding P (S ∈ [z, z + α]|I), we claim that sup P S ∈ [z, z + α]|I = i = sup P WD ∈ [z, z + α] z∈R z∈R = sup P WD 0 ∈ [z, z + α] z∈R z z+α = sup P WD ∈ , σD σD z∈R ≤ sup P WD (6.29) ∈ [z, z + 2α] . z∈R
n−l The first equality is (6.24), the second follows from (6.5) and that i=1 di , n−l n−l d and d do not depend on θ , and the next is by definition (6.6) i=1 θ(i) i=1 of D. The inequality follows from (6.25), which implies σD ≥ 1/2. Using that βC /n ≤ c1 0 ≤ 3 and Lemma 6.3, we have Let E = D. βD ≤ c4 βC so that by (6.7) βE βD c4 βC nc4 c1 0 3 n = ≤ ≤ ≤ min{1 , 2 } ≤ min{1 , 2 }, n−l n−l n−l n−l 4n−4 since n0 ≥ 16. Now Lemma 6.1 and (6.7) yield ∈ M 1 E n−l
and
βE = βE ≤ c1 βE = c1 βD .
Furthermore, Lemma 6.2 and (6.7) may be invoked to yield P WD ∈ [z, z + 2α] = P WE ∈ [z, z + 2α] ≤ P (WE ≤ z + 2α) − (z + 2α) + (z + 2α) − (z) + (z) − P (WE < z)
(6.30)
2c3 βD 2α +√ n−l 2π β 2c 2α 3 D ≤ 2 max δ 1 (c1 βD , n − l) + +√ l∈{2,3,4} n−l 2π c10 βC 2α 1 +√ , ≤ 2 max δ (c9 βC , n − l) + l∈{2,3,4} n 2π
≤ 2δ 1 (c1 βD , n − l) +
(6.31)
where in the final inequality we have again invoked (6.30) and set c9 = c1 c4 ≥ 1, by (6.13). As (6.31) does not depend on z or i, by (6.29), it bounds supz∈R P (S ∈ [z, z + α]|I). Now using (6.28), (6.31) and (6.21), we obtain
1 c10 βC 2α A3 ≤ E WC − WC∗ 2 max δ 1 (c9 βC , n − l) + +√ α l∈{2,3,4} n 2π
c6 βC β c 2α 10 C ≤ . (6.32) 2 max δ 1 (c9 βC , n − l) + +√ l∈{2,3,4} nα n 2π Recalling h = hz,α , as the bound on A3 does not depend on z ∈ R, combining (6.21), (6.26) and (6.32), then taking supremum over z ∈ R on the left hand side of (6.20), we obtain, supEhz,α (WC ) − N hz,α z∈R
≤
c10 βC c6 βC 2α c11 βC , + 2 max δ 1 (c9 βC , n − l) + +√ l∈{2,3,4} n nα n 2π
and now taking supremum over C ∈ Mn1 (β) with β/n ≤ c1 0 we have
c11 β c6 β c10 β 2α 1 1 δ (α, β, n) ≤ . + 2 max δ (c9 β, n − l) + +√ l∈{2,3,4} n nα n 2π Recalling (6.19), we obtain
α c11 β c6 β c10 β 2α δ 1 (β, n) ≤ +√ . + 2 max δ 1 (c9 β, n − l) + +√ l∈{2,3,4} n nα n 2π 2π Setting α = 4c6 c9 β/n yields δ 1 (β, n) ≤
δ 1 (c9 β, n − l) c12 β 1 . + max n 2 l∈{2,3,4} c9
Multiplying by n/β we obtain nδ 1 (β, n) nδ 1 (c9 β, n − l) 1 ≤ c12 + max . β 2 l∈{2,3,4} c9 β Taking supremum over positive β satisfying β/n ≤ c1 0 , and using n0 ≥ 16 we obtain sup 0< βn ≤c1 0
nδ 1 (β, n) 2c12 sup ≤ max β 3 l∈{2,3,4} 0< β ≤c c n
1 9 0
(n − l)δ 1 (β, n − l) . β
(6.33)
Clearly sup β/n>c1 0
(n − l)δ 1 (β, n − l) ≤ 1/(c1 0 ), β
so letting sn =
sup 0 1/2} and i = {j : (i, j ) ∈ } for i = 1, . . . , n. By a Chebyshev type argument we may bound the size of by 1 |eij | > 1/2 ≤ 8 |eij |3 = 8βE . (6.34) || = i,j
i,j
Now the inclusion {WE = WE } ⊂
n
i, π(i) ∈ i=1
implies P (WE = WE ) ≤ E
n n 1 i, π(i) ∈ = |i |/n = ||/n ≤ 8βE /n, i=1
i=1
proving the first claim of (6.10) taking c1 = 8. Hölder’s inequality and (6.34) yield that for all r ∈ (0, 3]
r/3 r 1−r/3 3 |eij | ≤ || |eij | ≤ c1 βE . i,j
(i,j )∈
Similarly, as |i | =
1 |eij | > 1/2 ≤ 8 |eij |3 , j
we have
j
(6.35)
1/3 2/3 3 eij ≤ |i | |eij | ≤4 |eij |3 ≤ c1 βE , j ∈i
j ∈i
(6.36)
j
with the same bound holding when interchanging the roles of i and j . Regarding the mean μE , since ij eij = 0, we have 1 1 1 1 = |μE | = = = eij e e e ij ij ij n n n n c c i,j
(i,j )∈
(i,j )∈
1 |eij | ≤ c1 βE /n, ≤ n
(i,j )∈
(i,j )∈
by (6.35) with r = 1, proving the second claim in (6.10). To prove the bound on σE2 , recalling the form of the variance in (4.105) we have 2 2 2 2 2 2 σ − 1 = 1 e − e − e + e − e ij i j ij E n − 1 i,j i,j i,j i,j i,j
1 2 2 2 eij + ei + ej + e 2 . ≤ n−1 (i,j )∈
i,j
i,j
i,j
Since n ≥ 2 the first term is bounded by 2c1 βE /n using (6.35) with r = 2. By (6.36), we have that 1 1 1 4 4βE e = = = e e e |eij |3 = . (6.37) ij ij ≤ i ij n n n n n j ∈ / i
j
Hence, for n ≥ 2, 1 n−1
i,j
2 ei ≤
j ∈i
j
16βE 4βE ei ≤ |eij |3 ≤ 32βE2 /n2 , n−1 n(n − 1) i
i,j
with the same bound holding when i and j are interchanged. In addition, by the second claim in (6.10), e = |μE |/n ≤ c1 βE /n2 , (6.38) and so 1 2 n2 c12 βE2 e ≤ ≤ 2c12 βE2 /n3 . n−1 n − 1 n4 i,j
Hence 2 σ − 1 ≤ βE 2c1 + 64βE /n + 2c2 βE /n2 . 1 E n Now the first claim of (6.11) holds with c2 = 2c1 + 64 + 2c12 taking any 1 ∈ (0, 1). Requiring further that 1 ∈ (0, 1/(3c2 )), when βE /n ≤ 1 then 2 σ − 1 ≤ 1/3, E
so that σE2 > 2/3, implying σE > 2/3. Therefore, when βE /n ≤ 1 the elements of satisfy E e − e − e + e /σE ≤ 3 + 3 e + e + e , ij i j j 4 2 i and by (6.37) and (6.38) there exists 1 sufficiently small such that the elements of are all bounded by 1, thus showing the second claim of (6.11). Lastly, by the E lower bound on σE we have 3 3 ij |eij | ij |eij | ≤ ≤ c1 βE , βE = σE3 σE3
completing the proof of the lemma. Proof of Lemma 6.2 With 1 , c1 and c2 as in Lemma 6.1, set 2 = min 1 , 1/(9c2 ) and assume βE /n ≤ 2 . The first inequality in (6.10) of Lemma 6.1 yields supP (WE ≤ z) − (z) z∈R
≤ supP (WE ≤ z) − (z) + c1 βE /n z∈R
z − μE z − μE ≤ supP (WE ≤ z) − − (z) + c1 βE /n + sup σ σ E E z∈R z∈R z − μ E ≤ sup P (WE ≤ z) − (z) + sup − (z) + c1 βE /n. σE z∈R z∈R Hence we need only show that there exists some c14 such that z − μE sup − (z) ≤ c14 βE /n. σE z∈R
(6.39)
From the first inequality in (6.11) of Lemma 6.1, since βE /n ≤ 1/(9c2 ) we have |σE2 − 1| ≤ 1/9 and so σE ∈ [2/3, 4/3]. First consider the case where |z| ≥ c1 βE /n. It is easy to show that √ z exp −az2 /2 ≤ 1/ a for all a > 0, z ∈ R. (6.40) Hence
z exp − 9 (z − μE )2 ≤ (z − μE ) exp − 9 (z − μE )2 + |μE | 32 32 4 + |μE | 3 4 ≤ 1 + |μE | . 3
≤
(6.41)
Since σE ≥ 2/3 and (6.11) gives that |σE2 − 1| ≤ c2 βE /n, we find that |σE2 − 1| (6.42) ≤ c2 βE /n. σE + 1 Letting z = (z − μE )/σE , since |μE | ≤ c1 βE /n by (6.10) of Lemma 6.1, z and z will be on the same side of the origin. Now, using the mean value theorem, that σE ∈ [2/3, 4/3], and Lemma 6.1, we obtain ( z) − (z)
z − μE z − μE − z where φ = ≤ max φ , φ(z) σE σE
2 z(1 − σE ) 9 1 z 2 ≤ √ max exp − (z − μE ) , exp − 32 2 σE 2π 1 μE + √ 2π σE
2 3 9 z 2 ≤ √ |σE − 1| max z exp − (z − μE ) , z exp − 32 2 2 2π 1 μE + √ 2π σE 3 2 ≤ √ |σE − 1|(1 + |μE |) + |μE |. 4 2π |σE − 1| =
This last inequality using (6.41), and (6.40) with a = 1. But now, using (6.10) and (6.42), we have 3 2 √ |σE − 1| 1 + |μE | + |μE | 4 2π
c1 βE 3c1 βE 2c2 βE 1+ + ≤ √ n 4n n 2π
3c1 βE 2c2 (6.43) , since βE /n ≤ 2 . ≤ √ (1 + c1 2 ) + 4 n 2π When |z| < c1 βE /n, the bound is easier. Since z lies in the interval with boundary points 3(z − μE )/2 and 3(z − μE )/4, we have 3(|z| + |μE |) . (6.44) 2 Now using that |z| < c1 βE /n, and |μE | ≤ c1 βE /n by (6.10), from (6.44) we obtain 1 ( z − z| z) − (z) ≤ √ | 2π 1 ≤√ 3|z| + 2|μE | 2π 5c1 βE ≤√ . (6.45) 2π n | z| ≤
The proof of (6.39), and therefore of the lemma, is now completed by letting c14 be the maximum of the constants that multiply βE /n in (6.43) and (6.45). Proof of Lemma 6.3 Let m = n − l. Since ci = 0, |cij | ≤ 1 and l ≤ 4 we have m 1 1 1 |C| 4 dij = cij = cij ≤ |di | = = , (6.46) m m m m m j∈ /C
j =1
j ∈C
with the same bound holding when the roles of i and j are interchanged. Similarly, as c = 0, m 1 1 |d | = 2 dij = 2 cij m m i,j =1 {i ∈ / R}∩{j ∈ /C} 1 cij = 2 m {i∈R}∪{j ∈C }
≤
8 |R| + |C| = , m m
(6.47)
and the first claim now follows, since μD = md . To handle σD2 , recalling σC2 = 1, by (4.105) there exists some n2 ≥ 16 such that for all l ≤ 4 n 1 2 n−1 3 cij = (6.48) ∈ 1, 1 when n ≥ n2 . m−1 m−1 8 i,j =1
Again from (4.105), σD2
1 = m−1
m
i,j =1
dij2
−m
m
di2
−m
i=1
m j =1
2 dj
+ m2 d2
.
Applying (6.46) and (6.47), when βC /n ≤ 3 , a value yet to be specified, n 1 2 2 cij σD − m−1 i,j =1 n m m 1 2 2 2 2 2 2 cij − cij + m di + m dj + m d ≤ m−1 i,j =1 i=1 j =1 {i ∈ / R}∩{j ∈ /C}
1 2 ≤ cij + 96 m−1 {i∈R}∪{j ∈C }
≤
1 2/3 8 n + 96 , m−1 3
where for (6.49), we have, by Hölder’s inequality
(6.49)
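\[ \sum_{j=1}^n c_{ij}^2 \le n^{1/3}\Big( \sum_{j=1}^n |c_{ij}|^3 \Big)^{2/3} \le n^{1/3}\beta_C^{2/3}, \]
with the same inequality holding when the roles of i and j are reversed, and so, when $\beta_C/n \le \epsilon_3$,
\[ \sum_{\{i \in \mathcal{R}\} \cup \{j \in \mathcal{C}\}} c_{ij}^2 \le \sum_{i \in \mathcal{R}}\sum_{j=1}^n c_{ij}^2 + \sum_{j \in \mathcal{C}}\sum_{i=1}^n c_{ij}^2 \le 2l\,n^{1/3}\beta_C^{2/3} \le 8\epsilon_3^{2/3}n. \]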
Now choosing n3 ≥ n2 such that 96/(n3 − 5) ≤ 3/16, and then choosing 3 such 2/3 that 83 n3 /(n3 − 5) ≤ 3/16, by (6.48) and (6.49) we obtain |σD2 − 1| ≤ 3/4 for all n ≥ n3 , proving the second claim in the lemma for any n0 ≥ n3 . To prove the final claim, first note m 3 m m m 1 1 3 3 |di | = 3 dij = 3 cij m m i=1 i=1 j =1 i=1 j ∈ /C m m 1 3 l2 l 2 βC 3 = 3 c ≤ |c | ≤ , (6.50) ij ij m m3 m3 i=1 j ∈C
i=1 j ∈C
with the same bound holding when i and j are interchanged. Now, since
cij =
{i∈R}∪{j ∈C }
n i=1 j ∈C
cij +
n j =1 i∈R
cij −
cij ,
{i∈R}∩{j ∈C }
we obtain
m 3 3 3 1 1 1 dij = 6 cij = 6 cij |d | = 6 m m m i,j =1 {i ∈ / R}∩{j ∈ /C} {i∈R}∪{j ∈C } 3
3 3 9 ≤ 6 cij + cij + cij . m 3
i∈ / R j ∈C
i∈R j ∈ /C
{i∈R}∩{j ∈C }
Hence, using that | i ∈/ R j ∈C cij |3 ≤ (nl)2 ni=1 j ∈C |cij |3 , with the same bound holding for the second term and a similar one for the last, we find that for some n0 ≥ n3 , for all n ≥ n0 we have 9 28l 2 2 4 2(nl) β + l ≤ βC . (6.51) C m6 m4 Now, when βD /n ≤ 3 and n ≥ n0 , since σD2 ≥ 1/4, by (6.50) and (6.51), |d |3 ≤
βD = σD−3 =8
m m 0 3 0 3 d ≤ 8 d ij ij
i,j =1 m
i,j =1
|dij − di − dj + d |3
i,j =1
≤8×4
2
n
|cij | + 3
i,j =1
30l 2 ≤ 128 1 + 2 βC m ≤ c4 βC ,
m
|di | + |dj | + |d | 3
3
3
i,j =1
thus proving the final claim of the lemma.

Proof of Lemma 6.4 Let the sequence $\{t_n\}_{n \ge m}$ be given by
\[ t_m = \max_{0 \le k \le 3} s_{m-k} \quad \text{and} \quad t_{n+1} = d + \alpha t_n \quad \text{for } n \ge m. \tag{6.52} \]
Explicitly solving the recursion yields
\[ t_n = d_1\alpha^n + d_2 \quad \text{where} \quad d_1 = \frac{(1 - \alpha)t_m - d}{\alpha^m(1 - \alpha)} \quad \text{and} \quad d_2 = \frac{d}{1 - \alpha}. \]
We note that since $\lim_{n \to \infty} t_n = d_2$ the sequence $\{t_n\}_{n \ge m}$ is bounded, and it suffices to prove $s_n \le t_n$ for all $n \ge m$. We consider the two cases, (a) $d_1 < 0$ and (b) $d_1 \ge 0$.

(a) When $d_1 < 0$ the sequence $\{t_n\}_{n \ge m}$ is increasing. By (6.52) we have $s_m \le t_m$. In addition,
\begin{align*}
s_{m+1} &\le d + \alpha\max\{s_{m-1}, s_{m-2}, s_{m-3}\} \le d + \alpha t_m = t_{m+1}, \tag{6.53} \\
s_{m+2} &\le d + \alpha\max\{s_m, s_{m-1}, s_{m-2}\} \le d + \alpha t_m = t_{m+1} \le t_{m+2},
\end{align*}
and hence
\[ s_{m+3} \le d + \alpha\max\{s_{m+1}, s_m, s_{m-1}\} \le d + \alpha\max\{t_{m+1}, t_m\} = d + \alpha t_{m+1} = t_{m+2} \le t_{m+3}. \]
Hence, for $k = 3$,
\[ s_n \le t_n \quad \text{for } m \le n \le m + k. \tag{6.54} \]
Assuming now that (6.54) holds for some $k \ge 3$, for $n = m + k$ we have
\[ s_{n+1} \le d + \alpha\max\{s_{n-1}, s_{n-2}, s_{n-3}\} \le d + \alpha\max\{t_{n-1}, t_{n-2}, t_{n-3}\} = d + \alpha t_{n-1} = t_n \le t_{n+1}, \]
thus completing the inductive step showing that (6.54) holds for all $k \ge 0$ in case (a).

(b) When $d_1 \ge 0$ the sequence $\{t_n\}_{n \ge m}$ is non-increasing. In a similar way we can show that for $k = 5$,
\[ s_n \le t_m \quad \text{for } m \le n \le m + k. \tag{6.55} \]
Assuming now that (6.55) holds for some $k \ge 5$, for $n = m + k$ we have
\[ s_{n+1} \le d + \alpha\max\{s_{n-1}, s_{n-2}, s_{n-3}\} \le d + \alpha t_m = t_{m+1} \le t_m, \]
thus completing the inductive step showing that (6.55) holds for all $k \ge 0$ in case (b).
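The boundedness asserted by Lemma 6.4 can be watched numerically. The sketch below builds the extremal sequence that satisfies (6.12) with equality, the comparison sequence (6.52), and its closed form; the function names and the sample starting values are illustrative assumptions, not part of the text.

```python
def extremal_sequence(initial, d, alpha, m, n_max):
    """s_1..s_{m-1} are given; for n >= m take equality in (6.12)."""
    s = {i + 1: v for i, v in enumerate(initial)}
    for n in range(m, n_max + 1):
        s[n] = d + alpha * max(s[n - 2], s[n - 3], s[n - 4])
    return s


def comparison_sequence(t_m, d, alpha, m, n_max):
    """The sequence of (6.52): t_{n+1} = d + alpha * t_n, started at t_m."""
    t = {m: t_m}
    for n in range(m, n_max):
        t[n + 1] = d + alpha * t[n]
    return t


def closed_form(t_m, d, alpha, m, n):
    """t_n = d1 * alpha**n + d2 with d1, d2 as in the proof of Lemma 6.4."""
    d1 = ((1 - alpha) * t_m - d) / (alpha ** m * (1 - alpha))
    d2 = d / (1 - alpha)
    return d1 * alpha ** n + d2


if __name__ == "__main__":
    m, d, alpha = 5, 1.0, 0.5
    s = extremal_sequence([10.0, 0.0, 3.0, 7.0], d, alpha, m, 40)
    t_m = max(s[m - k] for k in range(4))
    print(max(s[n] for n in range(m, 41)), max(t_m, d / (1 - alpha)))
```

In both cases of the proof the tail of the sequence is dominated by $\max\{t_m, d/(1 - \alpha)\}$, which is what the printed comparison displays for this choice of starting values.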
6.1.2 Distribution Constant on Conjugacy Classes In this section we focus on the normal approximation of Y in (6.1) when the distribution of π is a function only of its cycle type. This framework includes two special cases of note, one where π is a uniformly chosen fixed point free involution, considered by Goldstein and Rinott (2003) and Ghosh (2009), and the other where π has the uniform distribution over permutations with a single cycle, considered by Kolchin and Chistyakov (1973) with the additional restriction that aij = bi cj . Both Goldstein and Rinott (2003) and Ghosh (2009) obtained an explicit constant, the latter in terms of third moment quantities on the array aij , rather than on its maximum, as in the former. Kolchin and Chistyakov (1973) considered normal convergence for the long cycle case, but did not provide bounds on the error. As discussed in Sect. 4.4, being able to approximate the distribution of Y is important for performing permutation tests in statistics. In particular, the case where π is a fixed point free involution arises when testing if a given pairing of n = 2m observations shows an unusually high level of similarity, as in Schiffman et al. (1978). In this case, the test statistic yτ is of the form (6.1) with π replaced by a given pairing τ , and where aij = d(xi , xj ) measures the similarity between observations xi and xj . Under the null hypotheses that no pairing is distinguished, the value of yτ will tend to lie near the center of the distribution of Y when π is an involution having no fixed points, chosen uniformly. This instance is the particular case where π is constant on conjugacy classes, as defined below in (6.57), where the probability of any π with m 2-cycles is constant, and has probability zero otherwise. In the involution case Goldstein and Rinott (2003) used an exchangeable pair construction in which π is obtained from π by a transformation which preserves the m 2-cycle structure. 
The construction in Theorem 6.3 preserves the cycle structure in general, and when there are m 2-cycles, specializes to a construction similar, but not equivalent, to that of Goldstein and Rinott (2003). We note that in the case where π is a fixed point free involution the sum Y contains both aiπ(i) and aπ(i)i , making the symmetry assumption of Theorem 6.3 without loss of generality. This assumption is also satisfied in many statistical applications where one wishes to test the equality of the two distributions generating the samples X1 , . . . , Xn and Y1 , . . . , Yn , and aij = d(Xi , Yj ), a symmetric ‘distance’ function evaluated at the observed data points Xi and Yj . Consider a permutation π ∈ Sn represented in cycle form; in S7 for example, π = ((1, 3, 7, 5), (2, 6, 4)) is the permutation consisting of one 4 cycle in which 1 → 3 → 7 → 5 → 1, and one 3 cycle where 2 → 6 → 4 → 2. For q = 1, . . . , n, let cq (π) be the number of q cycles of π , and let c(π) = c1 (π), . . . , cn (π) . We say the permutations π and σ are of the same cycle type if c(π) = c(σ ), and that a distribution P on Sn is constant on cycle type if P (π) depends only on c(π), that is, P (π) = P (σ )
whenever c(π) = c(σ ).
(6.56)
Equivalently, see Sagan (1991) for instance, π and σ are of the same cycle type if and only if π and σ are conjugate, that is, if and only if there exists a permutation ρ such that π = ρ −1 σρ. Hence, a probability measure P on Sn is constant over cycle type if and only if (6.57) P (π) = P ρ −1 πρ for all π, ρ ∈ Sn . A special case of a distribution constant on cycle type is one uniformly distributed over all permutations of some fixed type. Letting n Nn = (c1 , . . . , cn ) ∈ Nn0 : ci = n , i=1
the set of possible cycle types for a permutation π ∈ Sn , the number N (c) of permutations in Sn having cycle type c is given by Cauchy’s formula n c j 1 1 N (c) = n! (6.58) for c ∈ Nn . j cj ! j =1
For $c \in \mathcal{N}_n$ let $U(c)$ denote the distribution over $S_n$ which is uniform on cycle type $c$, that is, the distribution $P$ given by
$$P(\pi) = \begin{cases} 1/N(c) & \text{if } c(\pi) = c, \\ 0 & \text{otherwise.} \end{cases} \tag{6.59}$$
The situations where $\pi$ is chosen uniformly from the set of all fixed point free involutions, and where $\pi$ is chosen uniformly from all permutations having a single cycle, are both distributions of type $U(c)$, the first with $c = (0, n/2, 0, \ldots, 0)$, the second with $c = (0, \ldots, 0, 1)$. The following lemma shows that every distribution $P$ that is constant on cycle type is a mixture of $U(c)$ distributions.

Lemma 6.5 If the distribution $P$ on $S_n$ is constant on cycle type then
$$P = \sum_{c \in \mathcal{N}_n} \rho_c\, U(c) \quad \text{where } \rho_c = P\big(c(\pi) = c\big). \tag{6.60}$$
Proof If $c \in \mathcal{N}_n$ is such that $\rho_c \neq 0$ then by (6.56),
$$1 = \sum_{\gamma : c(\gamma) = c} P\big(\gamma \mid c(\gamma) = c\big) = N(c)\, P\big(\pi \mid c(\pi) = c\big),$$
and therefore $P(\pi \mid c(\pi) = c) = 1/N(c)$. Hence, for any $\pi \in S_n$, with $c = c(\pi)$,
$$P(\pi) = P\big(\pi \mid c(\pi) = c\big)\, P\big(c(\pi) = c\big) = \rho_c / N(c),$$
that is, $P$ is the mixture (6.60).

For $\{i, j, k\} \subset \{1, \ldots, n\}$ distinct, let
$$A = \{\pi : \pi(k) = j\} \quad \text{and} \quad B = \{\pi : \pi(i) = j\},$$
and let $\tau_{ik}$ be the transposition of $i$ and $k$. Then
$$\pi \in A \quad \text{if and only if} \quad \tau_{ik}^{-1} \pi \tau_{ik} \in B.$$
Hence, if the distribution of $\pi$ is constant on conjugacy classes,
$$P(A) = P\big(\tau_{ik}^{-1} A \tau_{ik}\big) = P(B),$$
so if in addition $\pi$ has no fixed points,
$$1 = \sum_{k : k \neq j} P\big(\pi(k) = j\big) = \sum_{k : k \neq j} P\big(\pi(i) = j\big)$$
and hence
$$P\big(\pi(i) = j\big) = \frac{1}{n-1} \quad \text{for } i \neq j. \tag{6.61}$$
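The identity (6.61) can be checked by direct enumeration. The sketch below, under the assumption that $\pi$ has distribution $U(c)$ with $n = 5$ and $c$ consisting of one 2-cycle and one 3-cycle, tabulates $P(\pi(i) = j)$ exactly. The function names are illustrative, and indices are 0-based.

```python
from fractions import Fraction
from itertools import permutations


def cycle_type(perm):
    """Return (c_1, ..., c_n), the number of q-cycles of perm for q = 1..n."""
    n = len(perm)
    counts = [0] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        length, k = 0, start
        while not seen[k]:
            seen[k] = True
            k = perm[k]
            length += 1
        counts[length - 1] += 1
    return tuple(counts)


def image_probabilities(n, c):
    """Exact P(pi(i) = j) when pi is uniform over permutations of cycle type c."""
    members = [p for p in permutations(range(n)) if cycle_type(p) == c]
    return {
        (i, j): Fraction(sum(p[i] == j for p in members), len(members))
        for i in range(n)
        for j in range(n)
    }


n = 5
c = (0, 1, 1, 0, 0)  # one 2-cycle and one 3-cycle, so no fixed points
probs = image_probabilities(n, c)
```

Every off-diagonal entry of `probs` equals $1/(n-1) = 1/4$ here, and every diagonal entry is zero since this cycle type has no fixed points.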
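If $\pi$ has no fixed points with probability one then no $a_{ii}$ appears in the sum (6.1), and we may take $a_{ii} = 0$ for all $i$ for convenience. In this case, letting
$$a_{io} = \frac{1}{n-2} \sum_{j=1}^n a_{ij}, \quad a_{oj} = \frac{1}{n-2} \sum_{i=1}^n a_{ij} \quad \text{and} \quad a_{oo} = \frac{1}{(n-1)(n-2)} \sum_{i,j} a_{ij},$$
by (6.61) we have
$$EY = \sum_{i=1}^n E a_{i,\pi(i)} = \frac{1}{n-1} \sum_{i=1}^n \sum_{j : j \neq i} a_{ij} = \frac{1}{n-1} \sum_{i=1}^n \sum_{j=1}^n a_{ij} = (n-2)\, a_{oo}.$$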
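Now note that
$$\sum_{i=1}^n a_{io} = (n-1)\, a_{oo} \quad \text{and} \quad \sum_{i=1}^n a_{o\pi(i)} = \sum_{j=1}^n a_{oj} = (n-1)\, a_{oo},$$
the latter equality holding since $\pi$ is a permutation. Letting
$$\widehat{a}_{ij} = \begin{cases} a_{ij} - a_{io} - a_{oj} + a_{oo} & \text{for } i \neq j, \\ 0 & \text{for } i = j, \end{cases} \tag{6.62}$$
where the choice of $\widehat{a}_{ii}$ is arbitrary when $\pi$ has no fixed points, using that $\{\pi(j) : j = 1, \ldots, n\} = \{1, \ldots, n\}$, we have
$$\sum_{i=1}^n \widehat{a}_{i\pi(i)} = \sum_{i=1}^n a_{i\pi(i)} - 2(n-1)\, a_{oo} + n\, a_{oo} = \sum_{i=1}^n a_{i\pi(i)} - (n-2)\, a_{oo} = \sum_{i=1}^n a_{i\pi(i)} - EY. \tag{6.63}$$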
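Additionally, noting
$$\sum_{i : i \neq j} a_{ij} = \sum_{i=1}^n a_{ij} = (n-2)\, a_{oj}$$
and
$$\sum_{i : i \neq j} a_{io} = \sum_{i=1}^n a_{io} - a_{oj} = (n-1)\, a_{oo} - a_{oj},$$
we have
$$\sum_{i=1}^n \widehat{a}_{ij} = \sum_{i : i \neq j} \widehat{a}_{ij} = (n-2)\, a_{oj} - \big((n-1)\, a_{oo} - a_{oj}\big) - (n-1)\, a_{oj} + (n-1)\, a_{oo} = 0. \tag{6.64}$$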
In summary, in view of (6.63), (6.64), and the corresponding identity when the roles of $i$ and $j$ are reversed, when $\pi$ has no fixed points, by replacing $a_{ij}$ by $\widehat{a}_{ij}$, we may without loss of generality assume that $EY = 0$, and in particular, that
$$a_{io} = 0, \quad a_{oj} = 0 \quad \text{and} \quad \sum_{i,j} a_{ij} = 0. \tag{6.65}$$
Lastly note that if $a_{ij}$ is symmetric then so is $\widehat{a}_{ij}$, and in this case
$$\widehat{a}_{ij} = \begin{cases} a_{ij} - 2 a_{io} + a_{oo} & \text{for } i \neq j, \\ 0 & \text{for } i = j. \end{cases} \tag{6.66}$$
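The centering step can be illustrated numerically. The sketch below implements (6.62) in exact rational arithmetic for a small symmetric array and checks (6.63) and (6.64) for one fixed point free permutation. The function name `center`, the test array, and the 0-based indexing are assumptions made for the illustration only.

```python
from fractions import Fraction


def center(a):
    """Centered array of (6.62); the diagonal of a is treated as zero."""
    n = len(a)
    off = [[Fraction(a[i][j]) if i != j else Fraction(0) for j in range(n)]
           for i in range(n)]
    total = sum(sum(row) for row in off)
    a_io = [sum(off[i]) / (n - 2) for i in range(n)]  # scaled row sums
    a_oj = [sum(off[i][j] for i in range(n)) / (n - 2) for j in range(n)]
    a_oo = total / ((n - 1) * (n - 2))
    return [[off[i][j] - a_io[i] - a_oj[j] + a_oo if i != j else Fraction(0)
             for j in range(n)] for i in range(n)]


n = 5
# An arbitrary symmetric integer array; only off-diagonal entries matter.
a = [[(i + 1) * (j + 1) + (i - j) ** 2 for j in range(n)] for i in range(n)]
a_hat = center(a)

pi = (1, 2, 3, 4, 0)  # a single 5-cycle, hence no fixed points
y = sum(a[i][pi[i]] for i in range(n))
y_hat = sum(a_hat[i][pi[i]] for i in range(n))
# EY = (1/(n-1)) * sum of off-diagonal entries, by (6.61).
mean_y = Fraction(sum(a[i][j] for i in range(n) for j in range(n) if i != j), n - 1)

row_sums = [sum(a_hat[i]) for i in range(n)]
col_sums = [sum(a_hat[i][j] for i in range(n)) for j in range(n)]
```

For this symmetric array the row and column sums of `a_hat` vanish, and `y_hat` equals `y - mean_y`, in line with (6.63) and (6.64).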
Regarding the variance of $Y$, Lemma 6.7 below shows that when $\pi$ is chosen uniformly over a fixed cycle type $c$ without fixed points, $n \geq 4$ and $a_{ij} = a_{ji}$, the variance $\sigma_c^2 = \mathrm{Var}(Y)$ is given by
$$\sigma_c^2 = \Big(\frac{1}{n-1} + \frac{2c_2}{n(n-3)}\Big) \sum_{i \neq j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.67}$$
Remarkably, for a given $n$ the variance $\sigma_c^2$ depends on the vector $c$ of cycle types only through $c_2$, the number of 2-cycles. When $\pi$ is uniform over the set of fixed point free involutions, $n$ is even and $c_2 = n/2$, so (6.67) yields
$$\sigma_c^2 = \frac{2(n-2)}{(n-1)(n-3)} \sum_{i \neq j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.68}$$
On the other hand, if $\pi$ has no 2-cycles, $c_2 = 0$ and
$$\sigma_c^2 = \frac{1}{n-1} \sum_{i \neq j} (a_{ij} - 2a_{io} + a_{oo})^2. \tag{6.69}$$
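The dependence of the variance on $c$ only through $c_2$ can be seen in a small exact computation. The sketch below enumerates all permutations of each fixed point free cycle type in $S_6$, computes $\mathrm{Var}(Y)$ exactly, and compares it with the factor $1/(n-1) + 2c_2/(n(n-3))$ times the sum of squares of the centered array. It is an illustration under stated assumptions: the array is an arbitrary symmetric example, and the centering used is the symmetric form (6.62), for which (6.65) holds as required in the proof of Lemma 6.7. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def cycle_type(perm):
    """Return (c_1, ..., c_n), the number of q-cycles of perm for q = 1..n."""
    n = len(perm)
    counts, seen = [0] * n, [False] * n
    for start in range(n):
        length, k = 0, start
        while not seen[k]:
            seen[k] = True
            k = perm[k]
            length += 1
        if length:
            counts[length - 1] += 1
    return tuple(counts)


def center(a):
    """Centered array of (6.62), with the diagonal of a treated as zero."""
    n = len(a)
    off = [[Fraction(a[i][j]) if i != j else Fraction(0) for j in range(n)]
           for i in range(n)]
    a_io = [sum(off[i]) / (n - 2) for i in range(n)]
    a_oj = [sum(off[i][j] for i in range(n)) / (n - 2) for j in range(n)]
    a_oo = sum(sum(row) for row in off) / ((n - 1) * (n - 2))
    return [[off[i][j] - a_io[i] - a_oj[j] + a_oo if i != j else Fraction(0)
             for j in range(n)] for i in range(n)]


def exact_variance(a, c):
    """Var(Y) for Y = sum_i a[i][pi(i)], pi uniform over cycle type c."""
    n = len(a)
    values = [sum(a[i][p[i]] for i in range(n))
              for p in permutations(range(n)) if cycle_type(p) == c]
    mean = Fraction(sum(values), len(values))
    return Fraction(sum(v * v for v in values), len(values)) - mean ** 2


def formula_variance(a, c):
    """Right hand side of the variance formula, using the centered array."""
    n = len(a)
    a_hat = center(a)
    ss = sum(a_hat[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    return (Fraction(1, n - 1) + Fraction(2 * c[1], n * (n - 3))) * ss


n = 6
a = [[(i + 1) * (j + 1) + (i - j) ** 2 for j in range(n)] for i in range(n)]
# All cycle types in S_6 without fixed points: 6, 4+2, 3+3, 2+2+2.
types = [(0, 0, 0, 0, 0, 1), (0, 1, 0, 1, 0, 0), (0, 0, 2, 0, 0, 0), (0, 3, 0, 0, 0, 0)]
agreement = {c: exact_variance(a, c) == formula_variance(a, c) for c in types}
```

The two types with $c_2 = 0$ (a single 6-cycle and two 3-cycles) give the same variance, as the formula predicts.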
When normal approximations hold for $Y$ when $\pi$ has distribution $U(c)$ for some $c \in \mathcal{N}_n$ it is clear upon comparing (6.68) with (6.69) that the variance of the approximating normal variable depends on $c$. More generally, when the distribution of $\pi$ is constant on cycle type, the mixture property (6.60) allows for an approximation of $Y$ in terms of mixtures of mean zero normal variables as in the following theorem.

Theorem 6.3 Let $n \geq 5$ and let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying
$$a_{ij} = a_{ji}. \tag{6.70}$$
Let $\pi \in S_n$ be a random permutation with distribution constant on cycle type, having no fixed points. Then, with $Y$ given by (6.1) and $W = (Y - EY)/\sigma_\rho$, we have
$$\sup_{z \in \mathbb{R}} \big| P(W \leq z) - P(Z_\rho \leq z) \big| \leq 40C \Big(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\Big) \sum_{c \in \mathcal{N}_n} \frac{\rho_c}{\sigma_c}$$
where
$$\sigma_\rho^2 = \sum_{c \in \mathcal{N}_n} \rho_c \sigma_c^2 \quad \text{and} \quad \mathcal{L}(Z_\rho) = \sum_{c \in \mathcal{N}_n} \rho_c\, \mathcal{L}(Z_c / \sigma_\rho) \tag{6.71}$$
with $Z_c \sim N(0, \sigma_c^2)$, $\sigma_c^2$ given by (6.67), $\rho_c = P(c(\pi) = c)$, and $C = \max_{i \neq j} |a_{ij} - 2a_{io} + a_{oo}|$. In the special case where $\pi$ is uniformly distributed on fixed point free involutions, with $W = (Y - EY)/\sigma_c$ and $\sigma_c^2$ given by (6.68),
$$\sup_{z \in \mathbb{R}} \big| P(W \leq z) - P(Z \leq z) \big| \leq 24C \Big(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\Big) \Big/ \sigma_c$$
where $Z \sim N(0, 1)$.

We note the numerical values of the coefficient of $C$ in the general case and in the involution case are approximately equal to 125.07 and 75.04, respectively. The proof of the theorem follows fairly quickly from Lemma 6.10, which considers the special case where $\pi$ has distribution $U(c)$, and the mixture property in Lemma 6.5. The proof of Lemma 6.10 is preceded by a sequence of lemmas. Lemma 6.6 provides a helpful decomposition. Lemma 6.7 gives the variance of $Y$ in (6.1) when $\pi$ has distribution $U(c)$ for some $c \in \mathcal{N}_n$. Lemma 6.8 records some properties of the difference of the pair $y', y''$, given functions of two fixed permutations $\pi'$ and $\pi''$, related by transpositions. Lemma 6.9 constructs a Stein pair $(Y', Y'')$. Then, Lemma 6.10 is shown by following the outline in Sect. 4.4.1 to construct the appropriate square bias variables, followed by applying Theorem 5.1 to the resulting zero bias coupling.

To better highlight the reason for the imposition of the symmetry condition (6.70) and the exclusion of fixed points, in Lemmas 6.6, 6.8 and the proof of Lemma 6.9 we consider an array satisfying only (6.65) and allow fixed points. For a given permutation $\pi$ and $i, j \in \{1, \ldots, n\}$, write $i \sim j$ if $i$ and $j$ are in the same cycle of $\pi$, and let $|i|$ denote the length of the cycle of $\pi$ containing $i$.
Lemma 6.6 Let $\pi$ be a fixed permutation. For any $i \neq j$, distinct elements of $\{1, \ldots, n\}$, the sets $A_0, \ldots, A_5$ form a partition of the space where
$$A_0 = \big\{|\{i, j, \pi(i), \pi(j)\}| = 2\big\},$$
$$A_1 = \{|i| = 1, |j| \geq 2\}, \quad A_2 = \{|i| \geq 2, |j| = 1\},$$
$$A_3 = \{|i| \geq 3, \pi(i) = j\}, \quad A_4 = \{|j| \geq 3, \pi(j) = i\}$$
and
$$A_5 = \big\{|\{i, j, \pi(i), \pi(j)\}| = 4\big\}.$$
Additionally, the sets $A_{0,1}$ and $A_{0,2}$ partition $A_0$, where
$$A_{0,1} = \{\pi(i) = i, \pi(j) = j\}, \quad A_{0,2} = \{\pi(i) = j, \pi(j) = i\},$$
and we may also write
$$A_1 = \{\pi(i) = i, \pi(j) \neq j\}, \quad A_2 = \{\pi(i) \neq i, \pi(j) = j\},$$
$$A_3 = \{\pi(i) = j, \pi(j) \neq i\}, \quad A_4 = \{\pi(j) = i, \pi(i) \neq j\},$$
and membership in $A_m$, $m = 0, \ldots, 5$ depends only on $i, j, \pi(i), \pi(j)$. Lastly, the sets $A_{5,m}$, $m = 1, \ldots, 4$ partition $A_5$, where
$$A_{5,1} = \{|i| = 2, |j| = 2, i \not\sim j\}, \quad A_{5,2} = \{|i| = 2, |j| \geq 3\},$$
$$A_{5,3} = \{|i| \geq 3, |j| = 2\}, \quad A_{5,4} = \{|i| \geq 3, |j| \geq 3\} \cap A_5,$$
and membership in $A_{5,m}$, $m = 1, \ldots, 4$ depends only on $i, j, \pi^{-1}(j), \pi^{-1}(i), \pi(i), \pi(j)$.

Proof The sets $A_m$, $m = 0, \ldots, 5$ are clearly disjoint, so we need only demonstrate that they are exhaustive. Let $s = |\{i, j, \pi(i), \pi(j)\}|$. Since $i \neq j$, we have $2 \leq s \leq 4$. The case $A_0$ is exactly the case $s = 2$. There are four cases when $s = 3$. Either exactly one of $i$ or $j$ is a fixed point, that is, either we are in case $A_1$ or $A_2$, or neither $i$ nor $j$ is a fixed point, and so $i \neq \pi(i)$ and $j \neq \pi(j)$. As $i \neq j$, and therefore $\pi(i) \neq \pi(j)$, the only equalities among $i, \pi(i), j, \pi(j)$ which are yet possible are $\pi(i) = j$ or $\pi(j) = i$. Both equalities cannot hold at once, as then $s = 2$. The case where only the first equality is satisfied is $A_3$, and only the second is $A_4$. Clearly what remains now is exactly $A_5$.

The sets $A_{0,1}$ and $A_{0,2}$ are clearly disjoint and union to $A_0$, and the alternative ways to express $A_1$, $A_2$, $A_3$ and $A_4$ are clear, as, therefore, are the claims about what values are sufficient to determine membership in these sets.

The sets $A_{5,m}$, $m = 1, \ldots, 4$ are also clearly disjoint. If either $i$ or $j$ is a fixed point then $s \leq 3$, so on $A_5$ we must have $|i| \geq 2$ and $|j| \geq 2$. The set $A_{5,1}$ is where both $i$ and $j$ are in 2-cycles, in which case these cycles must be distinct in order that $s = 4$. The sets $A_{5,2}$ and $A_{5,3}$ are the cases where exactly one of $i$ or $j$ is in a 2-cycle, and these are already subsets of $A_5$. The remaining case in $A_5$ is when $i$ and $j$ are both in cycles of length at least 3, yielding $A_{5,4}$.

We now calculate the variance of $Y$ when $\pi$ in (6.1) is chosen uniformly over all permutations of some fixed cycle type with no fixed points.
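Before turning to the variance, the partition claimed in Lemma 6.6 can be confirmed exhaustively for a small case. The following sketch evaluates the six events for every permutation in $S_5$ and every ordered pair of distinct indices, once through the cycle-length definitions and once through the alternative descriptions in terms of $\pi(i)$ and $\pi(j)$ only. Function names and 0-based indexing are illustrative assumptions.

```python
from itertools import permutations


def cycle_length(perm, i):
    """Length |i| of the cycle of perm containing i."""
    length, k = 1, perm[i]
    while k != i:
        k = perm[k]
        length += 1
    return length


def events(perm, i, j):
    """Indicators of A_0, ..., A_5, using the cycle-length definitions."""
    li, lj = cycle_length(perm, i), cycle_length(perm, j)
    s = len({i, j, perm[i], perm[j]})
    return [
        s == 2,
        li == 1 and lj >= 2,
        li >= 2 and lj == 1,
        li >= 3 and perm[i] == j,
        lj >= 3 and perm[j] == i,
        s == 4,
    ]


def events_by_images(perm, i, j):
    """The same indicators, written only in terms of i, j, pi(i), pi(j)."""
    pi_i, pi_j = perm[i], perm[j]
    return [
        (pi_i == i and pi_j == j) or (pi_i == j and pi_j == i),
        pi_i == i and pi_j != j,
        pi_i != i and pi_j == j,
        pi_i == j and pi_j != i,
        pi_j == i and pi_i != j,
        len({i, j, pi_i, pi_j}) == 4,
    ]


n = 5
cases = [(p, i, j) for p in permutations(range(n))
         for i in range(n) for j in range(n) if i != j]
# Exactly one event holds in every case, so the A_m form a partition.
is_partition = all(sum(events(p, i, j)) == 1 for p, i, j in cases)
descriptions_agree = all(events(p, i, j) == events_by_images(p, i, j)
                         for p, i, j in cases)
```

Both flags come out true, which matches the disjointness and exhaustiveness argument in the proof above.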
Lemma 6.7 For $n \geq 4$ let $c \in \mathcal{N}_n$ with $c_1 = 0$, and let $\pi$ be uniformly chosen from all permutations with cycle type $c$. Assume that $a_{ij} = a_{ji}$. Then the variance of $Y$ in (6.1) is given by
$$\sigma_c^2 = \Big(\frac{1}{n-1} + \frac{2c_2}{n(n-3)}\Big) \sum_{i \neq j} (a_{ij} - 2a_{io} + a_{oo})^2.$$

Proof Without loss of generality we may take $a_{ii} = 0$ and then replace $a_{ij}$ by $a_{ij} - 2a_{io} + a_{oo}$ for $i \neq j$ as in (6.66), and so, in particular, we may assume (6.65) holds. In particular $EY = 0$ and $\mathrm{Var}(Y) = EY^2$. Expanding,
$$EY^2 = E \sum_{i,j} a_{i,\pi(i)} a_{j,\pi(j)} = E \sum_{i} a_{i,\pi(i)}^2 + E \sum_{i \neq j} a_{i,\pi(i)} a_{j,\pi(j)}.$$
For the first term, by (6.61), we have
$$\sum_{i} E a_{i,\pi(i)}^2 = \frac{1}{n-1} \sum_{i,j} a_{ij}^2. \tag{6.72}$$
It is helpful to write the second term as
$$E \sum_{i \neq j} a_{i,\pi(i)} a_{j,\pi(j)} = n(n-1)\, E a_{I,\pi(I)} a_{J,\pi(J)} \tag{6.73}$$
where $I$ and $J$ are chosen uniformly from all distinct pairs, independently of $\pi$. We evaluate this expectation with the help of the decomposition in Lemma 6.6, starting with $A_0$. Noting $A_{0,1}$ is null as $c_1 = 0$, from $A_{0,2}$ we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{A_{0,2}} = E a_{I,J}^2 \mathbf{1}_{\{\pi(I) = J, \pi(J) = I\}} = \frac{2c_2}{n^2(n-1)^2} \sum_{i \neq j} a_{ij}^2, \tag{6.74}$$
noting that there are $n(n-1)$ possibilities for $I$ and $J$, another factor of $n(n-1)$ for the possible values of $\pi(i)$ and $\pi(j)$, and $c_2$ ways that $(i, j)$ can be placed as a 2-cycle, with the same holding for $(j, i)$. As $A_1$ and $A_2$ are null, moving on to $A_3$ we have, by similar reasoning,
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{A_3} = E a_{I,J} a_{J,\pi(J)} \mathbf{1}_{\{|I| \geq 3, \pi(I) = J\}} = \frac{\sum_{b \geq 3} b c_b}{n^2(n-1)^2(n-2)} \sum_{|\{i,j,k\}| = 3} a_{ij} a_{jk}. \tag{6.75}$$
By symmetry the event $A_4$ contributes the same. Lastly, consider the contributions from $A_5$. Starting with $A_{5,1}$, we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{A_{5,1}} = E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{|I| = 2, |J| = 2, I \not\sim J\}} = \frac{4c_2(c_2 - 1)}{n^2(n-1)^2(n-2)(n-3)} \sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.76}$$
and
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{A_{5,2}} = E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{|I| = 2, |J| \geq 3\}} = \frac{2c_2 \sum_{b \geq 3} b c_b}{n^2(n-1)^2(n-2)(n-3)} \sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}. \tag{6.77}$$
The contribution from $A_{5,3}$ is the same as that from $A_{5,2}$. We break $A_{5,4}$ into two subcases, depending on whether or not $I$ and $J$ are in a common cycle. When they are, we obtain
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{A_{5,4}, I \sim J\}} = E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{|I| \geq 3, I \sim J, A_5\}} = \frac{\sum_{b \geq 3} b c_b (b - 3)}{n^2(n-1)^2(n-2)(n-3)} \sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.78}$$
where the term $b - 3$ accounts for the fact that on $A_5$ the value of $\pi(j)$ in the cycle of length $b$ cannot lie in $\{i, j, \pi(i)\}$. When $I$ and $J$ are in disjoint cycles we have
$$E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{A_{5,4}, I \not\sim J\}} = E a_{I,\pi(I)} a_{J,\pi(J)} \mathbf{1}_{\{|I| \geq 3, |J| \geq 3, I \not\sim J\}} = \frac{1}{n^2(n-1)^2(n-2)(n-3)} \sum_{b \geq 3} b c_b \Big(\sum_{d \geq 3} d c_d - b\Big) \sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl}, \tag{6.79}$$
where the term $-b$ accounts for the fact that $j$ must lie in a cycle of length at least three, different from the one of length $b \geq 3$ that contains $i$. To simplify the sums, using that $a_{io} = 0$ we obtain
$$\sum_{|\{i,j,k\}| = 3} a_{ij} a_{jk} = - \sum_{i \neq j} a_{ij}^2$$
and therefore
$$\sum_{|\{i,j,k,l\}| = 4} a_{ik} a_{jl} = - \sum_{|\{i,j,k\}| = 3} a_{ik} a_{ji} - \sum_{|\{i,j,k\}| = 3} a_{ik} a_{jk} = \sum_{i \neq j} a_{ij}^2 + \sum_{i \neq k} a_{ik}^2 = 2 \sum_{i \neq j} a_{ij}^2.$$
Summing the contributions to (6.73) from the events $A_0, \ldots, A_4$, that is, (6.74) and twice (6.75), using $\sum_{b \geq 2} b c_b = n$ and letting $(n)_k = n(n-1) \cdots (n-k+1)$ denote the falling factorial, yields $\sum_{i \neq j} a_{ij}^2 / (n)_4$ times
$$(n)_4 \Big(\frac{2c_2}{n(n-1)} - \frac{2 \sum_{b \geq 3} b c_b}{n(n-1)(n-2)}\Big) = (n-3)\big((n-2)\, 2c_2 - 2(n - 2c_2)\big) = (n-3)\, 2n (c_2 - 1). \tag{6.80}$$
Adding up the contributions to (6.73) from $A_5$, that is, (6.76), twice (6.77), (6.78) and (6.79), yields $\sum_{i \neq j} a_{ij}^2 / (n)_4$ times
$$8c_2(c_2 - 1) + 8c_2 \sum_{b \geq 3} b c_b + 2 \sum_{b \geq 3} b c_b (b - 3) + 2 \sum_{b \geq 3} b c_b \Big(\sum_{d \geq 3} d c_d - b\Big) \tag{6.81}$$
$$= 8c_2(c_2 - 1) + 8c_2 \sum_{b \geq 3} b c_b - 6 \sum_{b \geq 3} b c_b + 2 \sum_{b \geq 3} b c_b \sum_{d \geq 3} d c_d$$
$$= 8c_2(c_2 - 1) + 8c_2(n - 2c_2) - 6(n - 2c_2) + 2(n - 2c_2)^2 = 2n^2 - 6n + 4c_2. \tag{6.82}$$
Now totalling all contributions, adding (6.80) to (6.82) we obtain
$$(n-3)\, 2n (c_2 - 1) + 2n^2 - 6n + 4c_2 = 2c_2 (n-1)(n-2).$$
Dividing by $(n)_4$ gives the second term in the expression for $\sigma_c^2$. The first term is (6.72).

We will use $Y'$, $y'$ and $\pi'$ interchangeably for $Y$, $y$ and $\pi$, respectively. Again, for $i \in \{1, \ldots, n\}$ we let $|i|$ denote the number of elements in the cycle of $\pi$ that contains $i$. Due to the way that $\pi''$ is formed from $\pi'$ in Lemma 6.8 using two distinct indices $i$ and $j$, the various cases for expressing the difference $Y' - Y''$ depend only on $i$ and $j$ and their pre and post images under $\pi$.

Lemma 6.8 Let $\pi$ be a fixed permutation and $i$ and $j$ distinct elements of $\{1, \ldots, n\}$. Letting $\pi(-\alpha) = \pi^{-1}(\alpha)$ for $\alpha \in \{1, \ldots, n\}$ set $\chi_{i,j} = \{-j, -i, i, j\}$, so that
$$\{\pi(\alpha), \alpha \in \chi_{i,j}\} = \{\pi^{-1}(j), \pi^{-1}(i), \pi(i), \pi(j)\}.$$
Then, for $\pi'' = \tau_{ij} \pi' \tau_{ij}$ with $\tau_{ij}$ the transposition of $i$ and $j$, and $y'$ and $y''$ given by (6.1) with $\pi'$ and $\pi''$ replacing $\pi$, respectively,
$$y' - y'' = b\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big)$$
where
$$b\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = \sum_{m=0}^{5} b_m\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) \mathbf{1}_{A_m} \tag{6.83}$$
with $A_m$, $m = 0, \ldots, 5$ as in Lemma 6.6, $b_0(i, j, \pi(\alpha), \alpha \in \chi_{i,j}) = 0$,
$$b_1\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = a_{ii} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)} - \big(a_{jj} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}\big),$$
$$b_2\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = a_{jj} + a_{\pi^{-1}(i),i} + a_{i,\pi(i)} - \big(a_{ii} + a_{\pi^{-1}(i),j} + a_{j,\pi(i)}\big),$$
$$b_3\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = a_{\pi^{-1}(i),i} + a_{ij} + a_{j,\pi(j)} - \big(a_{\pi^{-1}(i),j} + a_{ji} + a_{i,\pi(j)}\big),$$
$$b_4\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = a_{\pi^{-1}(j),j} + a_{ji} + a_{i,\pi(i)} - \big(a_{\pi^{-1}(j),i} + a_{ij} + a_{j,\pi(i)}\big),$$
and
$$b_5\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) = a_{\pi^{-1}(i),i} + a_{i,\pi(i)} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)} - \big(a_{\pi^{-1}(i),j} + a_{j,\pi(i)} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}\big).$$
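The case formulas of Lemma 6.8 lend themselves to a mechanical check. The sketch below computes $y' - y''$ directly for every permutation in $S_5$ and every pair of distinct indices, and compares it with the value of $b$ selected by the event $A_m$ that holds. The array is an arbitrary non-symmetric example with nonzero diagonal, since the lemma imposes no condition on $a_{ij}$. The function names and 0-based indexing are assumptions of this illustration.

```python
from itertools import permutations


def conjugate(perm, i, j):
    """Return tau_ij o perm o tau_ij, with tau_ij the transposition of i and j."""
    def tau(k):
        return j if k == i else i if k == j else k
    return tuple(tau(perm[tau(k)]) for k in range(len(perm)))


def y(a, perm):
    """The sum in (6.1): sum over k of a[k][perm(k)]."""
    return sum(a[k][perm[k]] for k in range(len(perm)))


def b(a, perm, i, j):
    """Difference y' - y'' as given by the cases b_0, ..., b_5 of Lemma 6.8."""
    inv = [0] * len(perm)
    for k, v in enumerate(perm):
        inv[v] = k
    pi_i, pi_j, qi, qj = perm[i], perm[j], inv[i], inv[j]
    if len({i, j, pi_i, pi_j}) == 2:                       # A_0
        return 0
    if pi_i == i:                                          # A_1
        return a[i][i] + a[qj][j] + a[j][pi_j] - (a[j][j] + a[qj][i] + a[i][pi_j])
    if pi_j == j:                                          # A_2
        return a[j][j] + a[qi][i] + a[i][pi_i] - (a[i][i] + a[qi][j] + a[j][pi_i])
    if pi_i == j:                                          # A_3
        return a[qi][i] + a[i][j] + a[j][pi_j] - (a[qi][j] + a[j][i] + a[i][pi_j])
    if pi_j == i:                                          # A_4
        return a[qj][j] + a[j][i] + a[i][pi_i] - (a[qj][i] + a[i][j] + a[j][pi_i])
    return (a[qi][i] + a[i][pi_i] + a[qj][j] + a[j][pi_j]  # A_5
            - (a[qi][j] + a[j][pi_i] + a[qj][i] + a[i][pi_j]))


def lemma_holds(n):
    """Check y' - y'' == b over all permutations and distinct pairs (i, j)."""
    a = [[(3 * i + 1) ** 2 + 5 * j ** 3 + i * j for j in range(n)] for i in range(n)]
    return all(
        y(a, p) - y(a, conjugate(p, i, j)) == b(a, p, i, j)
        for p in permutations(range(n))
        for i in range(n) for j in range(n) if i != j
    )


check_passed = lemma_holds(5)
```

The order of the tests inside `b` relies on Lemma 6.6: once $A_0$ is excluded, $\pi(i) = i$ forces $A_1$, and so on down to $A_5$.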
Proof First we note that equality (6.83) defines a function, as Lemma 6.6 shows that $\mathbf{1}_{A_m}$, $m = 0, \ldots, 5$ depend only on the given variables. Now considering the difference, under $A_0$ either $\pi(i) = i$ and $\pi(j) = j$, or $\pi(i) = j$ and $\pi(j) = i$; in both cases $\pi'' = \pi'$ and therefore $y'' = y'$, and their difference is zero, corresponding to the claimed form for $b_0$.

When $A_1$ is true, since $|i| = 1$ and $\pi(j) \neq j$ we have
$$\pi''(j) = \tau_{ij} \pi \tau_{ij}(j) = \tau_{ij} \pi(i) = \tau_{ij}(i) = j$$
and
$$\pi''\big(\pi^{-1}(j)\big) = \tau_{ij} \pi \tau_{ij}\big(\pi^{-1}(j)\big) = \tau_{ij} \pi\big(\pi^{-1}(j)\big) = \tau_{ij}(j) = i$$
and
$$\pi''(i) = \tau_{ij} \pi \tau_{ij}(i) = \tau_{ij} \pi(j) = \pi(j).$$
If $k \notin \{i, \pi^{-1}(j), j\}$ then $\tau_{ij}(k) = k$ and $\pi(k) \notin \{i, j\}$, and therefore
$$\pi''(k) = \tau_{ij} \pi \tau_{ij}(k) = \tau_{ij} \pi(k) = \pi(k) \quad \text{for all } k \notin \{i, \pi^{-1}(j), j\}.$$
That is, on $A_1$ the permutations $\pi$ and $\pi''$ only differ in that where $\pi$ has the action
$$i \to i, \quad \pi^{-1}(j) \to j \to \pi(j), \quad \text{leading to the terms } a_{ii} + a_{\pi^{-1}(j),j} + a_{j,\pi(j)},$$
the permutation $\pi''$ has the action
$$j \to j, \quad \pi^{-1}(j) \to i \to \pi(j), \quad \text{leading to the terms } a_{jj} + a_{\pi^{-1}(j),i} + a_{i,\pi(j)}.$$
Taking the difference now leads to the form claimed for $b_1$ when $A_1$ is true. By symmetry, on $A_2$ we have the same result as for $A_1$ upon interchanging $i$ and $j$.

Similarly, when $A_3$ is true the only difference between $\pi$ and $\pi''$ is that the former has the action $\pi^{-1}(i) \to i \to j \to \pi(j)$, leading to the terms $a_{\pi^{-1}(i),i} + a_{ij} + a_{j,\pi(j)}$, while the latter has $\pi^{-1}(i) \to j \to i \to \pi(j)$, leading to $a_{\pi^{-1}(i),j} + a_{ji} + a_{i,\pi(j)}$. Again, $A_4$ is the same as $A_3$ with the roles of $i$ and $j$ interchanged.

Lastly, when $|\{i, j, \pi(i), \pi(j)\}| = 4$ the permutation $\pi$ has the action $\pi^{-1}(i) \to i \to \pi(i)$ and $\pi^{-1}(j) \to j \to \pi(j)$ while $\pi''$ has $\pi^{-1}(i) \to j \to \pi(i)$ and $\pi^{-1}(j) \to i \to \pi(j)$, making the form of $b_5$ clear.

Our next task is the construction of a Stein pair $Y', Y''$, which we accomplish in the following lemma in a manner similar to that in Sect. 4.4.2. We remind the reader that we consider the symbols $\pi'$ and $Y'$ interchangeable with $\pi$ and $Y$, respectively.

Lemma 6.9 For $n \geq 5$ let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying
$$a_{ij} = a_{ji} \quad \text{and} \quad a_{ii} = 0.$$
Let $\pi \in S_n$ be a random permutation with distribution constant on cycle type, having no fixed points, and let $Y$ be given by (6.1). Further, let $I, J$ be chosen independently of $\pi$, uniformly from all pairs of distinct elements of $\{1, \ldots, n\}$. Then, letting $\pi'' = \tau_{IJ} \pi \tau_{IJ}$ and $Y''$ be given by (6.1) with $\pi''$ replacing $\pi$, $(Y, Y'')$ is a $4/n$-Stein pair.
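The linearity part of the Stein pair property can be explored numerically before reading the proof. The sketch below fixes a permutation $\pi$ without fixed points, averages $y(\pi) - y(\tau_{ij}\pi\tau_{ij})$ over all ordered pairs of distinct indices, and compares the result with $(4/n)\,y(\pi)$. It assumes, as the proof does, that the array is symmetric with zero diagonal and satisfies (6.65); this is arranged here by centering an arbitrary symmetric array as in (6.62). Names and 0-based indexing are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def center(a):
    """Centered array of (6.62), with the diagonal of a treated as zero."""
    n = len(a)
    off = [[Fraction(a[i][j]) if i != j else Fraction(0) for j in range(n)]
           for i in range(n)]
    a_io = [sum(off[i]) / (n - 2) for i in range(n)]
    a_oj = [sum(off[i][j] for i in range(n)) / (n - 2) for j in range(n)]
    a_oo = sum(sum(row) for row in off) / ((n - 1) * (n - 2))
    return [[off[i][j] - a_io[i] - a_oj[j] + a_oo if i != j else Fraction(0)
             for j in range(n)] for i in range(n)]


def conjugate(perm, i, j):
    """Return tau_ij o perm o tau_ij, with tau_ij the transposition of i and j."""
    def tau(k):
        return j if k == i else i if k == j else k
    return tuple(tau(perm[tau(k)]) for k in range(len(perm)))


def y(a, perm):
    """The sum in (6.1): sum over k of a[k][perm(k)]."""
    return sum(a[k][perm[k]] for k in range(len(perm)))


def mean_difference(a, perm):
    """E(Y' - Y'' | pi = perm), with (I, J) uniform over ordered distinct pairs."""
    n = len(perm)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = sum(y(a, perm) - y(a, conjugate(perm, i, j)) for i, j in pairs)
    return total / Fraction(len(pairs))


n = 5
a_hat = center([[(i + 1) * (j + 1) + (i - j) ** 2 for j in range(n)] for i in range(n)])
derangements = [p for p in permutations(range(n)) if all(p[k] != k for k in range(n))]
linearity_holds = all(
    mean_difference(a_hat, p) == Fraction(4, n) * y(a_hat, p) for p in derangements
)
```

Because the identity is checked conditionally on each fixed $\pi$, it holds for any distribution on permutations without fixed points; the requirement that the distribution be constant on cycle type enters through exchangeability instead.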
Proof First we show that the pair of permutations $\pi', \pi''$ is exchangeable. For fixed permutations $\sigma', \sigma''$, if $\sigma'' \neq \tau_{IJ} \sigma' \tau_{IJ}$ then
$$P(\pi' = \sigma', \pi'' = \sigma'') = 0 = P(\pi' = \sigma'', \pi'' = \sigma').$$
Otherwise $\sigma'' = \tau_{IJ} \sigma' \tau_{IJ}$, and using (6.57) for the second equality followed by $\tau_{ij}^{-1} = \tau_{ij}$, we have
$$P(\pi' = \sigma', \pi'' = \sigma'') = P(\pi' = \sigma') = P(\pi' = \tau_{IJ} \sigma' \tau_{IJ}) = P(\pi' = \sigma'') = P(\pi' = \sigma'', \pi'' = \sigma').$$
Consequently, $\pi'$ and $\pi''$, and therefore $Y'$ and $Y''$, given by (6.1) with permutations $\pi'$ and $\pi''$, respectively, are exchangeable.

It remains to demonstrate that $Y', Y''$ satisfies the linearity condition (4.108) with $\lambda = 4/n$, for which it suffices to show
$$E(Y' - Y'' \mid \pi') = \frac{4}{n} Y'. \tag{6.84}$$
We prove (6.84) by computing the conditional expectation given $\pi'$ of the sum $\sum_{m=0}^{5} b_m(i, j, \pi(\alpha), \alpha \in \chi_{i,j}) \mathbf{1}_{A_m}$ in (6.83) of Lemma 6.8, with $A_0, \ldots, A_5$ given in Lemma 6.6, with $i, j$ replaced by $I, J$. First we have that $b_0 = 0$. Next, we claim that the contribution to $n(n-1) E(Y' - Y'' \mid \pi')$ from $b_1$ and $b_2$ totals to
$$2\big(n - c_1(\pi)\big) \sum_{|i| = 1} a_{ii} + 4c_1(\pi) \sum_{|i| \geq 2} a_{i,\pi(i)} - 2c_1(\pi) \sum_{|i| \geq 2} a_{ii} - 2 \sum_{|i| = 1, |j| \geq 2} a_{ij} - 2 \sum_{|i| \geq 2, |j| = 1} a_{ij}. \tag{6.85}$$
In particular, for the first term $a_{II}$ in the function $b_1$, by summing below over $j$ we obtain
$$E(a_{II} \mathbf{1}_{A_1} \mid \pi) = \frac{1}{n(n-1)} \sum_{i,j} a_{ii} \mathbf{1}_{\{|i| = 1, |j| \geq 2\}} = \frac{n - c_1(\pi)}{n(n-1)} \sum_{|i| = 1} a_{ii}. \tag{6.86}$$
For the next two terms of $b_1$, noting that the sum of $a_{j,\pi(j)}$ over a given cycle of $\pi$ equals the sum of $a_{\pi^{-1}(j),j}$ over that same cycle, we obtain
$$E(a_{\pi^{-1}(J),J} \mathbf{1}_{A_1} \mid \pi) + E(a_{J,\pi(J)} \mathbf{1}_{A_1} \mid \pi) = 2 E(a_{J,\pi(J)} \mathbf{1}_{A_1} \mid \pi) = \frac{2}{n(n-1)} \sum_{i,j=1}^{n} a_{j,\pi(j)} \mathbf{1}_{\{|i| = 1, |j| \geq 2\}} = \frac{2c_1(\pi)}{n(n-1)} \sum_{|j| \geq 2} a_{j,\pi(j)}. \tag{6.87}$$
Moving to the final three terms of $b_1$, we have similarly that
$$E(a_{J,J} \mathbf{1}_{A_1} \mid \pi) = \frac{c_1(\pi)}{n(n-1)} \sum_{|j| \geq 2} a_{jj},$$
$$E(a_{\pi^{-1}(J),I} \mathbf{1}_{A_1} \mid \pi) = \frac{1}{n(n-1)} \sum_{|i| = 1, |j| \geq 2} a_{\pi^{-1}(j),i} = \frac{1}{n(n-1)} \sum_{|i| = 1, |j| \geq 2} a_{ji}$$
and
$$E(a_{I,\pi(J)} \mathbf{1}_{A_1} \mid \pi) = \frac{1}{n(n-1)} \sum_{|i| = 1, |j| \geq 2} a_{i,\pi(j)} = \frac{1}{n(n-1)} \sum_{|i| = 1, |j| \geq 2} a_{ij}.$$
Summing (6.86) and (6.87) and subtracting these last three contributions, and then using the fact that the contribution from $b_2$ is the same as that from $b_1$ by symmetry, we obtain (6.85).

Next, it is easy to see that the first three contributions to $n(n-1) E(Y' - Y'' \mid \pi')$ from $b_3$, on the event $A_3 = \mathbf{1}(\pi(I) = J, |I| \geq 3)$, all equal $\sum_{|i| \geq 3} a_{i,\pi(i)}$, that the fourth and sixth both equal $-\sum_{|i| \geq 3} a_{\pi^{-1}(i),\pi(i)}$, and that the fifth equals $-\sum_{|i| \geq 3} a_{\pi(i),i}$. Combining this quantity with the equal amount from $b_4$ yields
$$6 \sum_{|i| \geq 3} a_{i,\pi(i)} - 4 \sum_{|i| \geq 3} a_{\pi^{-1}(i),\pi(i)} - 2 \sum_{|i| \geq 3} a_{\pi(i),i}. \tag{6.88}$$
Next, write $A_5 = \mathbf{1}\{|I| \geq 2, |J| \geq 2, I \neq J, \pi(I) \neq J, \pi(J) \neq I\}$. The first term in $b_5$, $a_{\pi^{-1}(I),I}$, has conditional expectation given $\pi$ of $(n(n-1))^{-1}$ times
$$\sum_{i,j} a_{\pi^{-1}(i),i} \mathbf{1}\big(|i| \geq 2, |j| \geq 2, i \neq j, \pi(i) \neq j, \pi(j) \neq i\big). \tag{6.89}$$
Write $i \sim j$ when $i$ and $j$ are elements of the same cycle. When $i \sim j$ and $\{i, j, \pi(i), \pi(j)\}$ are distinct, then $|i| \geq 4$ and there are $|i| - 3$ possible choices for $j \sim i$ that satisfy the conditions in the indicator in (6.89). Hence, the case $i \sim j$ contributes
$$\sum_{|i| \geq 4} a_{\pi^{-1}(i),i} \sum_{j \sim i} \mathbf{1}\big(i \neq j, \pi(i) \neq j, \pi(j) \neq i\big) = \sum_{|i| \geq 4} a_{\pi^{-1}(i),i} \big(|i| - 3\big) = \sum_{|i| \geq 3} \big(|i| - 3\big) a_{i,\pi(i)}.$$
When $i \not\sim j$ the conditions in the indicator function in (6.89) are satisfied if and only if $|i| \geq 2$, $|j| \geq 2$. For $|i| \geq 2$ there are $n - |i| - c_1(\pi)$ choices for $j$, so the case $i \not\sim j$ contributes
$$\sum_{|i| \geq 2} a_{\pi^{-1}(i),i} \sum_{j \not\sim i, |j| \geq 2} 1 = \sum_{|i| \geq 2} \big(n - |i| - c_1(\pi)\big) a_{i,\pi(i)} = \big(n - 2 - c_1(\pi)\big) \sum_{|i| = 2} a_{i,\pi(i)} + \sum_{|i| \geq 3} \big(n - |i| - c_1(\pi)\big) a_{i,\pi(i)}.$$
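As the first four terms of $b_5$ all yield the same contribution, they account for a total of
$$4\big(n - 2 - c_1(\pi)\big) \sum_{|i| = 2} a_{i,\pi(i)} + 4\big(n - 3 - c_1(\pi)\big) \sum_{|i| \geq 3} a_{i,\pi(i)}. \tag{6.90}$$
Decomposing the contribution from the fifth term $-a_{\pi^{-1}(I),J}$ of $b_5$, according to whether $i \sim j$ or $i \not\sim j$, gives
$$- \sum_{|i| \geq 2, |j| \geq 2} a_{\pi^{-1}(i),j} \mathbf{1}\big(i \neq j, \pi(i) \neq j, \pi(j) \neq i\big)$$
$$= - \sum_{|i| \geq 4} \sum_{j \sim i} a_{\pi^{-1}(i),j} \mathbf{1}\big(i \neq j, \pi(i) \neq j, \pi(j) \neq i\big) - \sum_{\substack{|i| \geq 2, |j| \geq 2 \\ j \not\sim i}} a_{\pi^{-1}(i),j}$$
$$= - \sum_{|i| \geq 4} \sum_{j \sim i} a_{\pi^{-1}(i),j} + \sum_{|i| \geq 4} \big(a_{\pi^{-1}(i),i} + a_{\pi^{-1}(i),\pi(i)} + a_{\pi^{-1}(i),\pi^{-1}(i)}\big) - \sum_{\substack{|i| \geq 2, |j| \geq 2 \\ j \not\sim i}} a_{\pi^{-1}(i),j}$$
$$= - \sum_{|i| \geq 4} \sum_{j \sim i} a_{ij} + \sum_{|i| \geq 4} \big(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\big) - \sum_{\substack{|i| \geq 2, |j| \geq 2 \\ j \not\sim i}} a_{ij}. \tag{6.91}$$
To simplify (6.91), let $a \wedge b = \min(a, b)$ and consider a decomposition of the sum $\sum_{i,j} a_{ij}$ first by whether $i \sim j$ or not, and then according to cycle sizes, and in the first case further as to whether the length of the common cycle of $i$ and $j$ is at least 4, and in the second case as to whether the distinct cycles of $i$ and $j$ both have size at least 2. That is, write
$$\sum_{i,j=1}^{n} a_{ij} = \sum_{|i| \geq 4} \sum_{j \sim i} a_{ij} + \sum_{|i| \leq 3} \sum_{j \sim i} a_{ij} + \sum_{\substack{|i| \geq 2, |j| \geq 2 \\ j \not\sim i}} a_{ij} + \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij}. \tag{6.92}$$
Since $\sum_{i,j} a_{ij} = 0$ by (6.65), we may replace the sum of the first and third terms in (6.91) by the sum of the second and fourth terms on the right hand side of (6.92). Hence, the contribution from $-a_{\pi^{-1}(I),J}$ on $A_5$ equals
$$\sum_{|i| \leq 3} \sum_{j \sim i} a_{ij} + \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij} + \sum_{|i| \geq 4} \big(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\big)$$
$$= \sum_{|i| \leq 2} \sum_{j \sim i} a_{ij} + \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij} + \sum_{|i| \geq 3} \big(a_{i,\pi(i)} + a_{\pi^{-1}(i),\pi(i)} + a_{ii}\big),$$
where to obtain the equality we used the fact that $\pi^2(i) = \pi^{-1}(i)$ when $|i| = 3$. Dealing similarly with the $|i| = 2$, $j \sim i$ term we obtain
$$\sum_{|i| = 1} a_{ii} + \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij} + \sum_{|i| \geq 2} \big(a_{i,\pi(i)} + a_{ii}\big) + \sum_{|i| \geq 3} a_{\pi^{-1}(i),\pi(i)}$$
$$= \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij} + \sum_{|i| \geq 2} a_{i,\pi(i)} + \sum_{|i| \geq 1} a_{ii} + \sum_{|i| \geq 3} a_{\pi^{-1}(i),\pi(i)}.$$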
Combining this contribution with the next three terms of $A_5$, each of which yields the same amount, gives the total
$$4 \sum_{|i| \geq 2} a_{i,\pi(i)} + 4 \sum_{|i| \geq 3} a_{\pi^{-1}(i),\pi(i)} + 4 \sum_{|i| \geq 1} a_{ii} + 4 \sum_{\substack{|i| \wedge |j| = 1 \\ j \not\sim i}} a_{ij}. \tag{6.93}$$
Combining (6.93) with the contribution (6.90) from the first four terms in $b_5$, the $b_1$ and $b_2$ terms in (6.85) and the $b_3$ and $b_4$ terms (6.88), yields $n(n-1) E(Y' - Y'' \mid \pi')$, which, canceling the terms involving $a_{\pi^{-1}(i),\pi(i)}$ and rearranging to group like terms, can be written
$$4(n-1) \sum_{|i| = 2} a_{i,\pi(i)} + (4n - 2) \sum_{|i| \geq 3} a_{i,\pi(i)} - 2 \sum_{|i| \geq 3} a_{\pi(i),i} \tag{6.94}$$
$$+\, 2\big(n - c_1(\pi) + 2\big) \sum_{|i| = 1} a_{ii} - 2\big(c_1(\pi) - 2\big) \sum_{|i| \geq 2} a_{ii} \tag{6.95}$$
$$+\, 4 \sum_{|i| \wedge |j| = 1, j \not\sim i} a_{ij} - 2 \sum_{|i| = 1, |j| \geq 2} a_{ij} - 2 \sum_{|i| \geq 2, |j| = 1} a_{ij}. \tag{6.96}$$
The assumption that $a_{ii} = 0$ causes the contribution from (6.95) to vanish, the assumption that there are no 1-cycles causes the contribution from (6.96) to vanish, and the assumption that $a_{ij} = a_{ji}$ allows the combination of the second and third terms in (6.94) to yield
$$E(Y' - Y'' \mid \pi') = \frac{1}{n(n-1)} \Big(4(n-1) \sum_{|i| = 2} a_{i,\pi'(i)} + (4n - 4) \sum_{|i| \geq 3} a_{i,\pi'(i)}\Big) = \frac{4}{n} \sum_{i=1}^{n} a_{i,\pi'(i)} = \frac{4}{n} Y'.$$
Hence, the linearity condition (4.108) is satisfied with $\lambda = 4/n$, completing the argument that $Y', Y''$ is a $4/n$-Stein pair.

We now prove the special case of Theorem 6.3 when $\pi$ is uniform over cycle type.

Lemma 6.10 Let $n \geq 5$ and let $\{a_{ij}\}_{i,j=1}^n$ be an array of real numbers satisfying $a_{ij} = a_{ji}$. Let $\pi \in S_n$ be a random permutation with distribution $U(c)$, uniform on cycle type $c \in \mathcal{N}_n$, having no fixed points. Then, letting $Y$ be the sum in (6.1), $\sigma_c^2$ given by (6.67) and $W = (Y - EY)/\sigma_c$,
$$\sup_{z \in \mathbb{R}} \big| P(W \leq z) - P(Z \leq z) \big| \leq 40C \Big(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\Big) \Big/ \sigma_c, \tag{6.97}$$
where $C = \max_{i \neq j} |a_{ij} - 2a_{io} + a_{oo}|$ and $Z$ is a standard normal variable. When $\pi$ is uniform over involutions without fixed points, then 40 in (6.97) may be replaced by 24, and $\sigma_c^2$ specializes to the form given in (6.68).
Proof We may set $a_{ii} = 0$, and then by replacing $a_{ij}$ by $a_{ij} - 2a_{io} + a_{oo}$ when $i \neq j$, assume without loss of generality that $a_{io} = a_{oj} = EY = 0$. We write $Y'$ and $\pi'$ interchangeably for $Y$ and $\pi$, respectively. We follow the outline in Sect. 4.4.1 to produce a coupling of $Y$ to a pair $Y^\dagger, Y^\ddagger$ with the square bias distribution as in Proposition 4.6, satisfying (6.2) and (6.3). We then produce a coupling of $Y$ to $Y^*$ having the $Y$-zero bias distribution using the uniform interpolation as in that proposition, and lastly invoke Theorem 5.1 to obtain the bound.

First construct the Stein pair $Y', Y''$ as in Lemma 6.9. Let $\pi'$ be a permutation with distribution $U(c)$. Then, with $I$ and $J$ having distribution
$$P(I = i, J = j) = \frac{1}{n(n-1)} \quad \text{for } i \neq j,$$
set $\pi'' = \tau_{IJ} \pi' \tau_{IJ}$ where $\tau_{ij}$ is the transposition of $i$ and $j$. Now $Y'$ and $Y''$ are given by (6.1) with $\pi$ replaced by $\pi'$ and $\pi''$, respectively.

To specialize the outline in Sect. 4.4.1 to this case, we let $\mathbf{I} = (I, J)$ and $\Xi_\alpha = \pi(\alpha)$. In keeping with the notation of Lemma 6.8, with $\chi = \{1, \ldots, n\}$ we let $\pi(-j) = \pi^{-1}(j)$ for $j \in \chi$, and with $i$ and $j$ distinct elements of $\chi$ we set $\chi_{i,j} = \{-j, -i, i, j\}$ and
$$p_{i,j}(\xi_\alpha, \alpha \in \chi_{i,j}) = P\big(\pi(\alpha) = \xi_\alpha, \alpha \in \chi_{i,j}\big),$$
the distribution of the pre and post images of $i$ and $j$ under $\pi$. Equality (4.116) gives the factorization of the variables from which $\pi'$ and $\pi''$ are constructed as
$$P(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P(\mathbf{I} = \mathbf{i})\, P_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, P_{\mathbf{i}^c | \mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}).$$
The factorization can be interpreted as saying that first we choose $I, J$, then construct the pre and post images of $I$ and $J$ under $\pi$, then, conditional on what has already been chosen, the values of $\pi$ on the remaining variables. For the distribution of the pair with the square bias distribution, equality (4.118) gives the parallel factorization,
$$P^\dagger(\mathbf{i}, \xi_\alpha, \alpha \in \chi) = P^\dagger(\mathbf{I} = \mathbf{i})\, P^\dagger_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})\, P_{\mathbf{i}^c | \mathbf{i}}(\xi_\alpha, \alpha \notin \chi_{\mathbf{i}} \mid \xi_\alpha, \alpha \in \chi_{\mathbf{i}}) \tag{6.98}$$
where $P^\dagger(\mathbf{I} = \mathbf{i})$, the distribution of indices we will label $\mathbf{I}^\dagger$, is given by (4.117) and $P^\dagger_{\mathbf{i}}(\xi_\alpha, \alpha \in \chi_{\mathbf{i}})$ by (4.119). Let $\sigma^\dagger, \sigma^\ddagger$ have distribution given by (6.98), that is, with $I^\dagger, J^\dagger$ and $\Xi^\dagger_\alpha$, $\alpha \in \chi$ having distribution (6.98),
$$\sigma^\dagger(\alpha) = \Xi^\dagger_\alpha \quad \text{and} \quad \sigma^\ddagger = \tau_{I^\dagger, J^\dagger}\, \sigma^\dagger\, \tau_{I^\dagger, J^\dagger}.$$
These permutations do not need to be constructed; we only introduce them so that we can conveniently refer to their distribution, which is the one targeted for $\pi^\dagger, \pi^\ddagger$.

We construct $\pi^\dagger, \pi^\ddagger$, of which $Y^\dagger, Y^\ddagger$ will be a function, in stages, beginning with the indices $I^\dagger, J^\dagger$, and their pre and post images under $\pi^\dagger$. Following (4.117), with $\lambda = 4/n$, let $I^\dagger, J^\dagger$ have distribution
$$P(I^\dagger = i, J^\dagger = j) = \frac{r_{i,j}}{2\lambda\sigma_c^2} \quad \text{where } r_{i,j} = P(I = i, J = j)\, E b^2\big(i, j, \pi(\alpha), \alpha \in \chi_{i,j}\big) \tag{6.99}$$
with $b(i, j, \xi_\alpha, \alpha \in \chi_{i,j})$ as in Lemma 6.8. Next, given $I^\dagger = i$ and $J^\dagger = j$, from (4.119), let the pre and post images $\pi^{-\dagger}(J^\dagger), \pi^{-\dagger}(I^\dagger), \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)$ have distribution
$$p^\dagger_{i,j}(\xi_\alpha, \alpha \in \chi_{i,j}) = \frac{b^2(i, j, \xi_\alpha, \alpha \in \chi_{i,j})}{E b^2(i, j, \pi(\alpha), \alpha \in \chi_{i,j})}\, p_{i,j}(\xi_\alpha, \alpha \in \chi_{i,j}). \tag{6.100}$$
We will place $I^\dagger$ and $J^\dagger$, along with these generated pre and post images, into cycles of appropriate length. The conditional distribution of the remaining values of $\pi^\dagger$, given $I^\dagger, J^\dagger$ and their pre and post images, by (4.118), has the same conditional distribution as that of $\pi$, which is the uniform distribution over all permutations of cycle type $c$ where $I^\dagger$ and $J^\dagger$ have the specified pre and post images. Hence, to complete the specification of $\pi^\dagger$ we fill in the remaining values of $\pi^\dagger$ uniformly. For this last step we will use the values of $\pi$ to construct $\pi^\dagger$ in a way that makes $\pi^\dagger$ and $\pi$ close.

Lemma 6.6 gives that, for $\pi^\dagger$, membership in $A_0, \ldots, A_4$ and $A_{5,1}, \ldots, A_{5,4}$ is determined by
$$I^\dagger, J^\dagger, \pi^{-\dagger}(J^\dagger), \pi^{-\dagger}(I^\dagger), \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger). \tag{6.101}$$
As $b_0 = 0$ from Lemma 6.8 the case $A_0$ has probability zero. Note that the distribution of $\sigma^\dagger, \sigma^\ddagger$ is absolutely continuous with respect to that of $\pi', \pi''$, and therefore the permutations $\sigma^\dagger, \sigma^\ddagger$ have the same cycle structure, namely $c$, as $\pi', \pi''$. In particular, since $\pi$ has no fixed points, $A_1$ and $A_2$ are eliminated and we need only consider the events $A_3$, $A_4$ and $A_5$. For the purpose of conditioning on the values in (6.101), for ease of notation we will write
$$(\alpha, \beta) = (I^\dagger, J^\dagger) \quad \text{and} \quad (\gamma, \delta, \varepsilon, \zeta) = \big(\pi^{-\dagger}(J^\dagger), \pi^{-\dagger}(I^\dagger), \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)\big).$$
The specification of $\pi^\dagger$ depends on which case, or subcase, of the events $A_3, A_4, A_5$ is determined by the variables (6.101). In every subcase, however, $\pi^\dagger$ will be specified in terms of $\pi$ by conjugating with transpositions as
$$\pi^\dagger = \tau_{\iota,\iota^\dagger}\, \pi\, \tau_{\iota,\iota^\dagger} \quad \text{where} \quad \tau_{\iota,\iota^\dagger} = \prod_{k=1}^{\kappa} \tau_{i_k, i_k^\dagger}, \tag{6.102}$$
for $\iota = (i_1, \ldots, i_\kappa)$ and $\iota^\dagger = (i_1^\dagger, \ldots, i_\kappa^\dagger)$, vectors of disjoint indices of some length $\kappa$. Note that when $\pi^\dagger$ is given by $\pi$ through (6.102) then
$$\pi^\dagger(k) = \pi(k) \quad \text{for all } k \notin \mathcal{I}_{\iota,\iota^\dagger}, \quad \text{where} \quad \mathcal{I}_{\iota,\iota^\dagger} = \big\{\pi^{-1}(i_k), i_k, \pi^{-1}(i_k^\dagger), i_k^\dagger : k = 1, \ldots, \kappa\big\}. \tag{6.103}$$
Consider first the case where the generated values determine an outcome in $A_3$, that is, when $J^\dagger = \pi^\dagger(I^\dagger)$ and $\{\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger)\}$ are distinct. If $\pi^\dagger(J^\dagger) \in \{\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger)\}$ then $\pi^\dagger(J^\dagger) = \pi^{-\dagger}(I^\dagger)$ and the generated values form a 3-cycle. By the symmetry of $a_{ij}$ we have that $b_3 = 0$ if $I^\dagger, J^\dagger$ are consecutive elements of a 3-cycle, so $A_3$ has probability zero unless $\sum_{b \geq 4} c_b \geq 1$, that is, unless
the cycle type $c$ has cycles of length at least 4. Hence, if so, under $A_3$ the elements $\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)$ must be distinct and form part of a cycle of $\pi^\dagger$ of length at least 4. Conditioning on the values in (6.101), and letting $c(\sigma^\dagger, \alpha)$ be the length of the cycle in $\sigma^\dagger$ containing $\alpha$, select a cycle length $b$ according to the distribution
$$P\big(c(\sigma^\dagger, \alpha) = b \mid \sigma^{-\dagger}(\alpha) = \delta, \sigma^\dagger(\alpha) = \varepsilon, \sigma^{\dagger 2}(\alpha) = \zeta\big)$$
and let $\mathsf{I}$ be chosen uniformly, and independently, from the $b$-cycles of $\pi$. Now let $\pi^\dagger$ be given by (6.102) with
$$\iota = \big(\pi^{-1}(\mathsf{I}), \mathsf{I}, \pi(\mathsf{I}), \pi^2(\mathsf{I})\big) \quad \text{and} \quad \iota^\dagger = \big(\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^\dagger(J^\dagger)\big).$$
As the inverse images under $\pi$ of the components in $\iota$ are all again components of this vector, with the possible exception of $\pi^{-1}(\mathsf{I})$, the set (6.103) can have size at most $(4 + 1) + 2 \times 4 = 13$ in this case. The construction on $A_4$ is analogous, with the roles of $I^\dagger$ and $J^\dagger$ reversed.

Moving on to $A_5$, consider $A_{5,1}$, where if $c_2 \geq 2$, the elements $I^\dagger$ and $J^\dagger$ are to be placed in distinct 2-cycles. Choosing $\mathsf{I}$ and $\mathsf{J}$ from pairs of indices in distinct 2-cycles, let $\pi^\dagger$ be given by (6.102) with
$$\iota = \big(\mathsf{I}, \pi(\mathsf{I}), \mathsf{J}, \pi(\mathsf{J})\big) \quad \text{and} \quad \iota^\dagger = \big(I^\dagger, \pi^\dagger(I^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\big).$$
As $\mathsf{I}$ and $\mathsf{J}$ are members of 2-cycles of $\pi$, the vector $\iota$ already contains all of its inverse images under $\pi$, and therefore the set (6.103) can have size at most $4 + 2 \times 4 = 12$. When $\pi$ is an involution without fixed points, this is the only case.

Similarly, if $c_2$ and $\sum_{b \geq 3} c_b$ are both nonzero, then the probability of $A_{5,2}$ is positive, and we let $\mathsf{I}$ and $\mathsf{J}$ be chosen independently, the first uniformly from the 2-cycles of $\pi$, the second uniformly from elements of the $b$-cycles of $\pi$ where $b$ has distribution
$$P\big(c(\sigma^\dagger, \beta) = b \mid \sigma^{-\dagger}(\beta) = \gamma, \sigma^\dagger(\beta) = \zeta\big). \tag{6.104}$$
Now let $\pi^\dagger$ be given by (6.102) with
$$\iota = \big(\mathsf{I}, \pi(\mathsf{I}), \pi^{-1}(\mathsf{J}), \mathsf{J}, \pi(\mathsf{J})\big) \quad \text{and} \quad \iota^\dagger = \big(I^\dagger, \pi^\dagger(I^\dagger), \pi^{-\dagger}(J^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\big).$$
Arguing as above, as $\mathsf{I}$ is in a 2-cycle, the set (6.103) can have size at most $(5 + 1) + 2 \times 5 = 16$. The argument is analogous on $A_{5,3}$.

Before beginning our consideration of the final case, $A_{5,4}$, we note that though the generated values (6.101) are placed in $\pi^\dagger$ according to the correct conditional distributions, such as (6.104), as we are considering a worst case analysis, the actual values of these probabilities never enter our considerations. Hence, on $A_{5,4}$, no matter how $\mathsf{I}$ and $\mathsf{J}$ are selected to be consistent with $A_{5,4}$, the result will be that $\pi^\dagger$ will be given by (6.102) with
$$\iota = \big(\pi^{-1}(\mathsf{I}), \mathsf{I}, \pi(\mathsf{I}), \pi^{-1}(\mathsf{J}), \mathsf{J}, \pi(\mathsf{J})\big) \quad \text{and} \quad \iota^\dagger = \big(\pi^{-\dagger}(I^\dagger), I^\dagger, \pi^\dagger(I^\dagger), \pi^{-\dagger}(J^\dagger), J^\dagger, \pi^\dagger(J^\dagger)\big).$$
In this case the set (6.103) can have size at most $(6 + 2) + 2 \times 6 = 20$.
As $A_0, \ldots, A_5$ is a partition, the construction of $\pi^\dagger$ has been specified in every case. By arguments similar to those in Lemma 4.5, the conditional distribution
$$P_{\{i,j\}^c | \{i,j\}}(\xi_\alpha, \alpha \notin \chi_{i,j} \mid \xi_\alpha, \alpha \in \chi_{i,j})$$
of the remaining values, given the ones now determined, is uniform, so specifying $\pi^\dagger$ by (6.102) and setting $\pi^\ddagger = \tau_{I^\dagger, J^\dagger}\, \pi^\dagger\, \tau_{I^\dagger, J^\dagger}$ results in a collection of variables $I^\dagger, J^\dagger$ and a pair of permutations with the square bias distribution (4.113). Hence, letting $Y$, $Y^\dagger$ and $Y^\ddagger$ be given by (6.1) with $\pi$, $\pi^\dagger$ and $\pi^\ddagger$, respectively, results in a coupling of $Y$ to the variables $Y^\dagger, Y^\ddagger$ with the square bias distribution. Now with $T$, $T^\dagger$ and $T^\ddagger$ given by (6.3), we have
$$\big| U T^\dagger + (1 - U) T^\ddagger \big| + |T| \leq 2 \max_{\iota, \iota^\dagger} \big| \mathcal{I}_{\iota,\iota^\dagger} \big|\, C$$
where the maximum is over the values of $\iota, \iota^\dagger$ appearing in the possible cases. For fixed point free involutions, $A_{5,1}$ is the only case, giving the coefficient $2 \times 12 = 24$ on $C$. In general, the coefficient is bounded by $2 \times 20 = 40$, determined by the worst case on $A_{5,4}$. Now (6.4) gives $|Y^* - Y| \leq 40C$ in general, and the bound $24C$ for involutions. As $|W^* - W| = |Y^* - Y| / \sigma_c$ by (2.59), invoking Theorem 5.1 with $\delta = 40C/\sigma_c$ and $\delta = 24C/\sigma_c$, respectively, now completes the proof.

Lemma 6.10 and the mixture property of Lemma 6.5 are the key ingredients of the following argument.

Proof of Theorem 6.3 First, note that the claim in Theorem 6.3 regarding involutions is part of Lemma 6.10. Otherwise, by replacing $a_{ij}$ by $\widehat{a}_{ij}$ given in (6.66) we may without loss of generality assume that $EY = 0$ whenever $Y$ is given by (6.1) with $\pi$ having distribution constant on cycle type. In this case, writing $Y_c$ for the variable given by (6.1) when $\pi \sim U(c)$, the mixture property of Lemma 6.5 yields
$$P(Y \leq z) = \sum_{c \in \mathcal{N}_n} \rho_c\, P(Y_c \leq z),$$
with $\rho_c = P(c(\pi) = c)$, and in addition, from (6.71),
$$P(Z_\rho \leq z) = \sum_{c \in \mathcal{N}_n} \rho_c\, P(Z_c / \sigma_\rho \leq z).$$
Hence, with $W = Y/\sigma_\rho$, by changes of variable,
$$\sup_{z \in \mathbb{R}} \big| P(W \leq z) - P(Z_\rho \leq z) \big| = \sup_{z \in \mathbb{R}} \big| P(Y \leq z) - P(\sigma_\rho Z_\rho \leq z) \big|$$
$$\leq \sum_{c \in \mathcal{N}_n} \rho_c \sup_{z \in \mathbb{R}} \big| P(Y_c \leq z) - P(Z_c \leq z) \big| = \sum_{c \in \mathcal{N}_n} \rho_c \sup_{z \in \mathbb{R}} \big| P(W_c \leq z) - P(Z \leq z) \big|$$
where Wc = Yc /σc . Now applying the uniform bound in Lemma 6.10 completes the proof.
6.1.3 Doubly Indexed Permutation Statistics

In Sect. 4.4 we observed how the distribution of the permutation statistic $Y$ in (4.104), that is,
\[ Y = \sum_{i=1}^n a_{i\pi(i)}, \]
can be used to test whether there is an unusually high degree of similarity in a particular matching between the observations $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$. In particular, if, say, $d(x, y)$ is a function which reflects the similarity between $x$ and $y$ and $a_{ij} = d(x_i, y_j)$, one compares the 'overall similarity' score
\[ y_\tau = \sum_{i=1}^n a_{i\tau(i)} \]
of the distinguished matching $\tau$ to the distribution of $Y$, that is, to this same similarity score for random matchings. In spatial or spatio-temporal association, two dimensional generalizations of the permutation test statistic $Y$ become of interest. In particular, if $a_{ij}$ and $b_{ij}$ are two different measures of closeness of $x_i$ and $y_j$, which may or may not be related, then the relevant null distribution is that of
\[ W = \sum_{(i,j):\, i \ne j} a_{ij}\, b_{\pi(i),\pi(j)} \tag{6.105} \]
where the permutation $\pi$ is chosen uniformly from $S_n$; see, for instance, Moran (1948) and Geary (1954) for applications in geography, Knox (1964) and Mantel (1967) in epidemiology, as well as the book of Hubert (1987). Following some initial results which yield the asymptotic normality of $W$, see Barbour and Chen (2005a) for history and references, much less restrictive conditions were given in Barbour and Eagleson (1986). Theorem 6.4 of Barbour and Chen (2005a) provides a Berry–Esseen bound for this convergence; to state it we first need to introduce some notation. As the diagonal elements play no role, we may set $a_{ii} = b_{ii} = 0$. For such an array $\{a_{ij}\}_{i,j=1}^n$, let
\[
A_0 = \frac{1}{n(n-1)} \sum_{(i,j):\, i \ne j} a_{ij}, \qquad
A_{12} = n^{-1} \sum_{i=1}^n (a_i^*)^2,
\]
\[
A_{22} = \frac{1}{n(n-1)} \sum_{(i,j):\, i \ne j} \tilde{a}_{ij}^2
\qquad \text{and} \qquad
A_{13} = n^{-1} \sum_{i=1}^n |a_i^*|^3,
\]
where
\[
a_i^* = \frac{1}{n-2} \sum_{j:\, j \ne i} (a_{ij} - A_0)
\qquad \text{and} \qquad
\tilde{a}_{ij} = a_{ij} - a_i^* - a_j^* - A_0,
\]
and let the analogous definitions hold for $\{b_{ij}\}$. In addition, let
\[
\mu = \frac{1}{n(n-1)} \sum_{(i,j),(l,m):\, i \ne j,\, l \ne m} a_{ij} b_{lm}
\qquad \text{and} \qquad
\sigma^2 = \frac{4n^2(n-2)^2}{n-1} A_{12} B_{12}.
\]

Theorem 6.4 For $W$ as given in (6.105) with $A$ and $B$ symmetric arrays, we have
\[
\sup_{z \in \mathbb{R}} \bigl| P(W - \mu \le \sigma z) - \Phi(z) \bigr| \le (2 + c)\delta + 12\delta^2 + (1 + \sqrt{2})\tilde{\delta}_2,
\]
where
\[
\delta = 128 n^4 \sigma^{-3} A_{13} B_{13}, \qquad
\tilde{\delta}_2^2 = \frac{(n-1)^3 A_{22} B_{22}}{2n(n-2)^2(n-3) A_{12} B_{12}}
\]
and $c$ is the constant in Theorem 6.2.

It turns out that statistics such as $W$ can be expressed as a singly indexed permutation statistic upon which known bounds may be applied, plus a remainder term which may be handled using concentration inequalities and exploiting exchangeability, somewhat similar to the way that some non-linear statistics are handled in Chap. 10. The bounds of Theorem 6.4 compare favorably with those of Zhao et al. (1997).
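As a small illustration of (6.105) and of the centering constant $\mu$, the following sketch evaluates $W$ for a given permutation and checks, by enumerating all of $S_n$ for a tiny $n$, that the average of $W$ under a uniform permutation equals $\mu$. The function names and the two example arrays are hypothetical, chosen only for this illustration; the sketch does not compute $\sigma^2$ or the bound of Theorem 6.4.

```python
from fractions import Fraction
from itertools import permutations

def doubly_indexed_stat(a, b, perm):
    """W = sum over i != j of a[i][j] * b[perm[i]][perm[j]], as in (6.105)."""
    n = len(a)
    return sum(a[i][j] * b[perm[i]][perm[j]]
               for i in range(n) for j in range(n) if i != j)

def null_mean(a, b):
    """mu = (n(n-1))^{-1} * (sum_{i != j} a_ij) * (sum_{l != m} b_lm)."""
    n = len(a)
    sa = sum(a[i][j] for i in range(n) for j in range(n) if i != j)
    sb = sum(b[i][j] for i in range(n) for j in range(n) if i != j)
    return Fraction(sa * sb, n * (n - 1))

def exact_null_mean(a, b):
    """Average of W over all n! permutations (feasible only for small n)."""
    n = len(a)
    vals = [doubly_indexed_stat(a, b, p) for p in permutations(range(n))]
    return Fraction(sum(vals), len(vals))

# Hypothetical symmetric integer arrays with zero diagonal.
A_EX = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
B_EX = [[0, 2, 0, 1], [2, 0, 3, 0], [0, 3, 0, 5], [1, 0, 5, 0]]
```

Exact rational arithmetic is used so that the comparison between `null_mean` and `exact_null_mean` involves no rounding.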
6.2 Patterns in Graphs and Permutations In this section we will prove and apply corollaries of Theorem 5.6 to evaluate the quality of the normal approximation for various counts that arise in graphs and permutations, in particular, coloring patterns, local maxima, and the occurrence of subgraphs of finite random graphs, and for the number of occurrences of fixed, relatively ordered sub-sequences, such as rising sequences, of random permutations. We explore the consequences of Theorem 5.6 under a local dependence condition on a collection of random variables X = {Xα , α ∈ A}, over some arbitrary, finite, index set A. In particular, we consider situations where for every α ∈ A there exists a dependency neighborhood Bα ⊂ A of Xα , containing α, such that Xα
and $\{X_\beta : \beta \notin B_\alpha\}$ are independent. (6.106)
First recalling the definition of size biasing in a coordinate direction given in (2.68) in Sect. 2.3.4, we begin with the following corollary of Theorem 5.6.
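Corollary 6.1 Let $\mathbf{X} = \{X_\alpha, \alpha \in A\}$ be a finite collection of random variables with values in $[0, M]$ and let
\[ Y = \sum_{\alpha \in A} X_\alpha. \]
Let $\mu = \sum_{\alpha \in A} EX_\alpha$ denote the mean of $Y$ and assume that the variance $\sigma^2 = \operatorname{Var}(Y)$ is positive and finite. Let
\[ p_\alpha = EX_\alpha \Big/ \sum_{\beta \in A} EX_\beta \qquad \text{and} \qquad p = \max_{\alpha \in A} p_\alpha. \tag{6.107} \]
Next, for each $\alpha \in A$ let $B_\alpha \subset A$ be a dependency neighborhood of $X_\alpha$ such that (6.106) holds, and let
\[ b = \max_{\alpha \in A} |B_\alpha|. \tag{6.108} \]
For each $\alpha \in A$, let $(\mathbf{X}, \mathbf{X}^\alpha)$ be a coupling of $\mathbf{X}$ to a collection of random variables $\mathbf{X}^\alpha$ having the $\mathbf{X}$-size biased distribution in direction $\alpha$ such that for some $\mathcal{F} \supset \sigma\{Y\}$ and $D \subset A \times A$, if $(\alpha_1, \alpha_2) \notin D$, then for all $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$
\[ \operatorname{Cov}\bigl( E(X_{\beta_1}^{\alpha_1} - X_{\beta_1} \mid \mathcal{F}),\; E(X_{\beta_2}^{\alpha_2} - X_{\beta_2} \mid \mathcal{F}) \bigr) = 0. \tag{6.109} \]
Then with $W = (Y - \mu)/\sigma$,
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{6\mu b^2 M^2}{\sigma^3} + \frac{2\mu p b M \sqrt{|D|}}{\sigma^2}. \]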
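Proof In view of Theorem 5.6 and (5.21), it suffices to couple $Y^s$, with the $Y$-size biased distribution, to $Y$ such that
\[ |Y^s - Y| \le bM \qquad \text{and} \qquad \sqrt{\operatorname{Var}\bigl(E(Y^s - Y \mid Y)\bigr)} \le pbM\sqrt{|D|}, \tag{6.110} \]
the left hand side of the second inequality being the quantity defined in (5.21). Assume without loss of generality that $EX_\alpha > 0$ for each $\alpha \in A$. Note that for every $\alpha \in A$ the distribution $dF(\mathbf{x})$ of $\mathbf{X}$ factors as
\[
dF_\alpha(x_\alpha)\, dF_{B_\alpha^c \mid \alpha}(x_\beta, \beta \notin B_\alpha \mid x_\alpha)
\times dF_{B_\alpha \setminus \{\alpha\} \mid \{\alpha\} \cup B_\alpha^c}(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha),
\]
which, by the independence condition (6.106), we may write as
\[
dF_\alpha(x_\alpha)\, dF_{B_\alpha^c}(x_\beta, \beta \notin B_\alpha)
\times dF_{B_\alpha \setminus \{\alpha\} \mid \{\alpha\} \cup B_\alpha^c}(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha).
\]
Hence, as in (2.73), the coordinate size biased distribution $dF^\alpha(\mathbf{x})$ may be factored as
\[
dF^\alpha(\mathbf{x}) = dF_\alpha^\alpha(x_\alpha)\, dF_{B_\alpha^c}(x_\beta, \beta \notin B_\alpha)
\times dF_{B_\alpha \setminus \{\alpha\} \mid \{\alpha\} \cup B_\alpha^c}(x_\gamma, \gamma \in B_\alpha \setminus \{\alpha\} \mid x_\alpha, x_\beta, \beta \notin B_\alpha),
\]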
where
\[ dF_\alpha^\alpha(x_\alpha) = \frac{x_\alpha\, dF_\alpha(x_\alpha)}{EX_\alpha}. \tag{6.111} \]
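Given a realization of $\mathbf{X}$, this factorization shows that we can construct $\mathbf{X}^\alpha$ by first choosing $X_\alpha^\alpha$ from the $X_\alpha$-size bias distribution (6.111), then the variables $X_\beta$ for $\beta \in B_\alpha^c$ according to their original distribution, and so in particular set
\[ X_\beta^\alpha = X_\beta \quad \text{for all } \beta \in B_\alpha^c, \]
and finally the variables $X_\beta^\alpha$, $\beta \in B_\alpha \setminus \{\alpha\}$, using their original conditional distribution given the variables $\{X_\alpha^\alpha, X_\beta, \beta \in B_\alpha^c\}$. As the distribution of $\mathbf{X}^\alpha$ is absolutely continuous with respect to that of $\mathbf{X}$, we have $X_\beta^\alpha \in [0, M]$ for all $\alpha, \beta$, and therefore
\[ |X_\beta^\alpha - X_\beta| \le M \quad \text{for all } \alpha, \beta \in A. \tag{6.112} \]
By Proposition 2.2, $Y^s = \sum_{\beta \in A} X_\beta^I$ has the $Y$-size biased distribution, where the random index $I$ has distribution $P(I = \alpha) = p_\alpha$ and is chosen independently of $\{(\mathbf{X}, \mathbf{X}^\alpha), \alpha \in A\}$ and $\mathcal{F}$. In particular
\[ Y^s - Y = \sum_{\beta \in B_I} \bigl( X_\beta^I - X_\beta \bigr), \tag{6.113} \]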
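yielding the first inequality in (6.110). Recalling the definition of the quantity in (5.21), since $\sigma\{Y\} \subset \mathcal{F}$, by (4.143) its square satisfies
\[ \operatorname{Var}\bigl(E(Y^s - Y \mid Y)\bigr) \le \operatorname{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr). \]
Taking conditional expectation with respect to $\mathcal{F}$ in (6.113) yields
\[ E(Y^s - Y \mid \mathcal{F}) = \sum_{\alpha \in A} p_\alpha \sum_{\beta \in B_\alpha} E(X_\beta^\alpha - X_\beta \mid \mathcal{F}), \]
and therefore
\[
\begin{aligned}
\operatorname{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr)
&= \sum_{(\alpha_1,\alpha_2) \in A \times A}\; \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} \operatorname{Cov}\bigl( E(X_{\beta_1}^{\alpha_1} - X_{\beta_1} \mid \mathcal{F}),\, E(X_{\beta_2}^{\alpha_2} - X_{\beta_2} \mid \mathcal{F}) \bigr) \\
&= \sum_{(\alpha_1,\alpha_2) \in D}\; \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} \operatorname{Cov}\bigl( E(X_{\beta_1}^{\alpha_1} - X_{\beta_1} \mid \mathcal{F}),\, E(X_{\beta_2}^{\alpha_2} - X_{\beta_2} \mid \mathcal{F}) \bigr),
\end{aligned}
\]
where we have applied (6.109) to obtain the last equality. By (6.112), the covariances are bounded by $M^2$, hence
\[
\begin{aligned}
\operatorname{Var}\bigl(E(Y^s - Y \mid Y)\bigr) \le \operatorname{Var}\bigl(E(Y^s - Y \mid \mathcal{F})\bigr)
&\le M^2 \sum_{(\alpha_1,\alpha_2) \in D}\; \sum_{(\beta_1,\beta_2) \in B_{\alpha_1} \times B_{\alpha_2}} p_{\alpha_1} p_{\alpha_2} \\
&= M^2 \sum_{(\alpha_1,\alpha_2) \in D} p_{\alpha_1} p_{\alpha_2} |B_{\alpha_1}| |B_{\alpha_2}| \\
&\le M^2 \sum_{(\alpha_1,\alpha_2) \in D} p^2 b^2 = p^2 b^2 M^2 |D|,
\end{aligned}
\]
by (6.107) and (6.108), thus yielding the second inequality in (6.110).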
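To get a feel for the size of the two terms in Corollary 6.1, the bound can be evaluated numerically. The following is a minimal sketch; the function name and the numerical inputs used in the accompanying check are hypothetical and are not taken from the text.

```python
import math

def corollary_6_1_bound(mu, sigma, b, M, p, d_size):
    """Evaluate 6*mu*b^2*M^2/sigma^3 + 2*mu*p*b*M*sqrt(|D|)/sigma^2.

    mu, sigma : mean and standard deviation of Y
    b         : maximal size of a dependency neighborhood, as in (6.108)
    M         : upper bound on the summands
    p         : maximal normalized expectation, as in (6.107)
    d_size    : |D|, the size of the 'dependence diagonal'
    """
    first = 6.0 * mu * b**2 * M**2 / sigma**3
    second = 2.0 * mu * p * b * M * math.sqrt(d_size) / sigma**2
    return first + second
```

For instance, with the illustrative values $\mu = 5 \times 10^5$, $\sigma = 500$, $b = 3$, $M = 1$, $p = 10^{-6}$ and $|D| = 7 \times 10^6$, the first term equals $0.216$, which shows how large $\sigma$ must be before the bound becomes informative.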
Though Corollary 6.1 provides bounds for finite problems, asymptotically, when the mean and variance of $Y$ grow such that $\mu/\sigma^2$ is bounded, and when $b$ and $M$ stay bounded, then the first term in the bound of the corollary is of order $1/\sigma$. Additionally, if the $X_\alpha$ have comparable expectations, so that $p$ is of order $1/|A|$, and if the 'dependence diagonal' $D \subset A \times A$ has size comparable to that of $A$, then the second term will also be of order $1/\sigma$. We next specialize to the case where the summand variables $\{X_\alpha, \alpha \in A\}$ are functions of independent random variables.

Corollary 6.2 With $G$ and $A$ index sets, let $\{C_g, g \in G\}$ be a collection of independent random elements taking values in an arbitrary set $C$, let $\{G_\alpha, \alpha \in A\}$ be a finite collection of subsets of $G$, and, for $\alpha \in A$, let
\[ X_\alpha = X_\alpha(C_g : g \in G_\alpha) \]
be a real valued function of the variables $\{C_g, g \in G_\alpha\}$, taking values in $[0, M]$. Then for $Y = \sum_{\alpha \in A} X_\alpha$ with mean $\mu$ and finite, positive variance $\sigma^2$, the variable $W = (Y - \mu)/\sigma$ satisfies
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{6\mu b^2 M^2}{\sigma^3} + \frac{2\mu p b M \sqrt{|D|}}{\sigma^2}, \]
where $p$ and $b$ are given in (6.107) and (6.108), respectively, for any
\[ B_\alpha \supset \{\beta \in A : G_\beta \cap G_\alpha \ne \emptyset\}, \tag{6.114} \]
and any $D$ for which
\[ D \supset \bigl\{ (\alpha_1, \alpha_2) : \text{there exists } (\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2} \text{ with } G_{\beta_1} \cap G_{\beta_2} \ne \emptyset \bigr\}. \tag{6.115} \]

Proof We apply Corollary 6.1. Since $X_\alpha$ and $X_\beta$ are functions of disjoint sets of independent variables whenever $G_\alpha \cap G_\beta = \emptyset$, the independence condition (6.106) holds when the dependency neighborhoods satisfy (6.114). To verify the remaining conditions of Corollary 6.1, for each $\alpha \in A$ we consider the following coupling of $\mathbf{X}$ and $\mathbf{X}^\alpha$. We may assume without loss of generality that $EX_\alpha > 0$. Given $\{C_g, g \in G\}$ upon which $\mathbf{X}$ depends, for every $\alpha \in A$ let $\{C_g^{(\alpha)}, g \in G_\alpha\}$ be independent of $\{C_g, g \in G\}$ and have distribution
\[ dF^\alpha(c_g, g \in G_\alpha) = \frac{X_\alpha(c_g, g \in G_\alpha)}{EX_\alpha(C_g, g \in G_\alpha)}\, dF(c_g, g \in G_\alpha), \]
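so that the random variables $\{C_g^{(\alpha)}, g \in G_\alpha\} \cup \{C_g, g \notin G_\alpha\}$ have distribution $dF^\alpha(c_g, g \in G_\alpha)\, dF(c_g, g \notin G_\alpha)$. Thus, letting $\mathbf{X}^\alpha$ have coordinates given by
\[ X_\beta^\alpha = X_\beta\bigl( C_g, g \in G_\beta \cap G_\alpha^c,\; C_g^{(\alpha)}, g \in G_\beta \cap G_\alpha \bigr), \quad \beta \in A, \]
for any bounded continuous function $f$ we find
\[
\begin{aligned}
E X_\alpha f(\mathbf{X}) &= \int x_\alpha f(\mathbf{x})\, dF(c_g, g \in G) \\
&= EX_\alpha \int f(\mathbf{x})\, \frac{x_\alpha\, dF(c_g, g \in G_\alpha)}{EX_\alpha(C_g, g \in G_\alpha)}\, dF(c_g, g \notin G_\alpha) \\
&= EX_\alpha \int f(\mathbf{x})\, dF^\alpha(c_g, g \in G_\alpha)\, dF(c_g, g \notin G_\alpha) \\
&= EX_\alpha\, E f(\mathbf{X}^\alpha).
\end{aligned}
\]
That is, $\mathbf{X}^\alpha$ has the $\mathbf{X}$ distribution biased in direction $\alpha$, as defined in (2.68).

Lastly, taking $\mathcal{F} = \sigma\{C_g : g \in G\}$, so that $Y$ is $\mathcal{F}$ measurable, we verify (6.109). Since $X_\beta^\alpha$ and $\{C_g, g \in G_\beta\}$ are independent of $\{C_g, g \notin G_\beta\}$,
\[ E(X_\beta^\alpha \mid \mathcal{F}) = E(X_\beta^\alpha \mid C_g, g \in G_\beta,\; C_g, g \notin G_\beta) = E(X_\beta^\alpha \mid C_g, g \in G_\beta), \]
and, since $E(X_\beta \mid \mathcal{F}) = X_\beta = E(X_\beta \mid C_g, g \in G_\beta)$, the difference $E(X_\beta^\alpha - X_\beta \mid \mathcal{F})$ is a function of $\{C_g, g \in G_\beta\}$ only. By choice of $D$, if $(\alpha_1, \alpha_2) \notin D$ then for all $\beta_1 \in B_{\alpha_1}$ and $\beta_2 \in B_{\alpha_2}$ we have $G_{\beta_1} \cap G_{\beta_2} = \emptyset$, and so $E(X_{\beta_1}^{\alpha_1} - X_{\beta_1} \mid \mathcal{F})$ and $E(X_{\beta_2}^{\alpha_2} - X_{\beta_2} \mid \mathcal{F})$ are independent, yielding (6.109). The verification of the conditions of Corollary 6.1 is now complete.

With the exception of Example 6.2, in the remainder of this section we consider graphs $G = (V, E)$ having random elements $\{C_g\}_{g \in V \cup E}$ assigned to their vertices $V$ and edges $E$, and applications of Corollary 6.2 to the sum $Y = \sum_{\alpha \in A} X_\alpha$ of bounded functions $X_\alpha = X_\alpha(C_g, g \in V_\alpha \cup E_\alpha)$, where $G_\alpha = (V_\alpha, E_\alpha)$, $\alpha \in A$, is a given finite family of subgraphs of $G$. We abuse notation slightly in that a graph $G$ is replaced by $V \cup E$ when used as an index set for the underlying variables $C_g$. When applying Corollary 6.2 in this setting, in (6.114) and (6.115) the intersection of the two graphs $(V_1, E_1)$ and $(V_2, E_2)$ is the graph $(V_1 \cap V_2, E_1 \cap E_2)$.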
Given a metric $d$ on $V$, for every $v \in V$ and $r \ge 0$ we can consider the restriction $G_{v,r}$ of $G$ to the vertices at most a distance $r$ from $v$, that is, the graph with vertex and edge sets
\[ V_{v,r} = \{ w \in V : d(v, w) \le r \} \qquad \text{and} \qquad E_{v,r} = \bigl\{ \{w, u\} \in E : w, u \in V_{v,r} \bigr\} \tag{6.116} \]
respectively. We say that a graph $G$ is distance $r$-regular if $G_{v,r}$ is isomorphic to some graph $(V_r, E_r)$ for all $v$. This notion of distance $r$-regular is related to, but not the same as, the notion of a distance-regular graph as given in Biggs (1993) and Brouwer et al. (1989). A graph of constant degree with no cliques of size 3 is distance 1-regular.
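When $V_\alpha$, $\alpha \in V$, is given by (6.116) for some fixed $r$, regarding the choice of the dependency neighborhoods $B_\alpha$, $\alpha \in A$, we note that if $d(\alpha_1, \alpha_2) > 2r$ and $(\beta_1, \beta_2) \in V_{\alpha_1} \times V_{\alpha_2}$, then rearranging yields
\[ 2r < d(\alpha_1, \alpha_2) \le d(\alpha_1, \beta_1) + d(\beta_1, \beta_2) + d(\beta_2, \alpha_2), \]
and using that $d(\alpha_i, \beta_i) \le r$ implies $d(\beta_1, \beta_2) > 0$, hence
\[ d(\alpha_1, \alpha_2) > 2r \quad \text{implies} \quad V_{\alpha_1} \cap V_{\alpha_2} = \emptyset. \tag{6.117} \]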
Natural families of graphs in $\mathbb{R}^p$ can be generated using the vertex set $V = \{1, \ldots, n\}^p$ with componentwise addition modulo $n$, and $d(\alpha, \beta)$ given by e.g. some $L^p$ distance between $\alpha$ and $\beta$. We apply the following result when the subgraphs are indexed by some subset of the vertices only, in which case we take $A \subset V$.

Corollary 6.3 Let $G$ be a finite graph with a family of isomorphic subgraphs $\{G_\alpha, \alpha \in A\}$ for some $A \subset V$, let $d$ be a metric on $A$, and set
\[ \rho = \min\bigl\{ \varrho : d(\alpha, \beta) > \varrho \text{ implies } V_\alpha \cap V_\beta = \emptyset \bigr\}. \tag{6.118} \]
For each $\alpha \in A$, let $X_\alpha$ be given by $X_\alpha = X(C_g, g \in G_\alpha)$ for a fixed function $X$ taking values in $[0, M]$, and let $\{C_g, g \in G\}$ be a collection of independent variables such that the distribution of $\{C_g : g \in G_\alpha\}$ is the same for all $\alpha \in A$. If $G$ is a distance-$3\rho$-regular graph, then with $Y = \sum_{\alpha \in A} X_\alpha$ having mean $\mu$ and finite, positive variance $\sigma^2$, the variable $W = (Y - \mu)/\sigma$ satisfies
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{6\mu V(\rho)^2 M^2}{\sigma^3} + \frac{2\mu V(\rho) M}{\sigma^2 |A|^{1/2}} \sqrt{V(3\rho)}, \]
where
\[ V(r) = |V_r|. \tag{6.119} \]
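Proof We verify that conditions (6.114) and (6.115) of Corollary 6.2 are satisfied with
\[ B_\alpha = \{ \beta : d(\alpha, \beta) \le \rho \} \qquad \text{and} \qquad D = \{ (\alpha_1, \alpha_2) : d(\alpha_1, \alpha_2) \le 3\rho \}. \tag{6.120} \]
First note that to show the intersection of two graphs is empty it suffices to show that the vertex sets of the graphs do not intersect. Since for any $\alpha \in A$, by (6.118),
\[ B_\alpha^c = \{ \beta : d(\beta, \alpha) > \rho \} \subset \{ \beta : V_\beta \cap V_\alpha = \emptyset \}, \]
we see that condition (6.114) is satisfied. To verify (6.115), note that rearranging
\[ d(\alpha_1, \alpha_2) \le d(\alpha_1, \beta_1) + d(\beta_1, \beta_2) + d(\beta_2, \alpha_2) \]
gives, for $(\alpha_1, \alpha_2) \notin D$ and $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$,
\[ d(\beta_1, \beta_2) \ge d(\alpha_1, \alpha_2) - \bigl( d(\alpha_1, \beta_1) + d(\alpha_2, \beta_2) \bigr) \ge d(\alpha_1, \alpha_2) - 2\rho > \rho, \]
and hence $V_{\beta_1} \cap V_{\beta_2} = \emptyset$. As $EX_\alpha$ is constant we have $p = \max_\alpha p_\alpha = 1/|A|$, and in addition, that
\[ b = \max_{\alpha \in A} |B_\alpha| = V(\rho) \qquad \text{and} \qquad |D| = |A| V(3\rho). \]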
Substituting these quantities into the bound of Corollary 6.2 now yields the result.

Example 6.1 (Sliding m-window) For $n \ge m \ge 1$, let $A = V = \{1, \ldots, n\}$ with addition modulo $n$, $\{C_g : g \in G\}$ i.i.d. real valued random variables, and for each $\alpha \in A$ set $G_\alpha = (V_\alpha, E_\alpha)$ where
\[ V_\alpha = \{ v \in V : \alpha \le v \le \alpha + m - 1 \} \qquad \text{and} \qquad E_\alpha = \emptyset. \tag{6.121} \]
Then for $X : \mathbb{R}^m \to [0, 1]$, Corollary 6.3 may be applied to the sum $Y = \sum_{\alpha \in A} X_\alpha$ of the $m$-dependent sequence $X_\alpha = X(C_\alpha, \ldots, C_{\alpha+m-1})$, formed by applying the function $X$ to the variables in the '$m$-window' $V_\alpha$. In this example, taking $d(\alpha, \beta) = |\alpha - \beta|$ the bound of Corollary 6.3 obtains with $\rho = m - 1$ by (6.118) and $V(r) \le 2r + 1$ by (6.119).

In Example 6.2 the underlying variables are not independent, so we turn to Corollary 6.1.

Example 6.2 (Relatively ordered sub-sequences of a random permutation) For $n \ge m \ge 1$, let $V$ and $(G_\alpha, V_\alpha)$, $\alpha \in V$, be as specified in (6.121). For $\pi$ and $\tau$ permutations of $\{1, \ldots, n\}$ and $\{1, \ldots, m\}$, respectively, we say the pattern $\tau$ appears at location $\alpha$ if the values $\{\pi(v)\}_{v \in V_\alpha}$ and $\{\tau(v)\}_{v \in V_1}$ are in the same relative order. Equivalently, the pattern $\tau$ appears at $\alpha$ if and only if $\pi(\tau^{-1}(v) + \alpha - 1)$, $v \in V_1$, is an increasing sequence. Letting $\pi$ be chosen uniformly from all permutations of $\{1, \ldots, n\}$, and setting $X_\alpha$ to be the indicator that $\tau$ appears at $\alpha$, we may write
\[ X_\alpha\bigl( \pi(v), v \in V_\alpha \bigr) = \mathbf{1}\bigl( \pi(\tau^{-1}(1) + \alpha - 1) < \cdots < \pi(\tau^{-1}(m) + \alpha - 1) \bigr), \]
and the sum $Y = \sum_{\alpha \in V} X_\alpha$ counts the number of $m$-element-long segments of $\pi$ that have the same relative order as $\tau$.

For $\alpha \in V$ we may generate $\mathbf{X}^\alpha = \{X_\beta^\alpha, \beta \in V\}$ with the $\mathbf{X} = \{X_\beta, \beta \in V\}$ distribution biased in direction $\alpha$ as follows. Let $\sigma_\alpha$ be the permutation of $\{1, \ldots, m\}$ for which
\[ \pi(\sigma_\alpha(1) + \alpha - 1) < \cdots < \pi(\sigma_\alpha(m) + \alpha - 1), \]
and set
\[
\pi^\alpha(v) =
\begin{cases}
\pi\bigl( \sigma_\alpha(\tau(v - \alpha + 1)) + \alpha - 1 \bigr) & v \in V_\alpha, \\
\pi(v) & v \notin V_\alpha.
\end{cases}
\]
In other words $\pi^\alpha$ is the permutation $\pi$ with values $\pi(v)$, $v \in V_\alpha$, reordered so that the values of $\pi^\alpha(\gamma)$ for $\gamma \in V_\alpha$ are in the same relative order as $\tau$. Now let
\[ X_\beta^\alpha = X_\beta\bigl( \pi^\alpha(v), v \in V_\beta \bigr), \]
the indicator that $\tau$ appears at position $\beta$ in the reordered permutation $\pi^\alpha$. Since the relative orders of non-overlapping segments of the values of $\pi$ are independent, (6.106) holds for $X_\alpha$, $\alpha \in V$, with
\[ B_\alpha = \{ \beta : |\beta - \alpha| \le m - 1 \}. \]
Next, note that with $\mathcal{F} = \sigma\{\pi\}$, for $\beta \in B_\alpha$ the random variables $E(X_\beta^\alpha \mid \mathcal{F})$ and $X_\beta$ depend only on the relative order of $\pi(v)$ for $v \in \bigcup_{\beta \in B_\alpha} V_\beta$. Since
\[ \Bigl( \bigcup_{\beta_1 \in B_{\alpha_1}} V_{\beta_1} \Bigr) \cap \Bigl( \bigcup_{\beta_2 \in B_{\alpha_2}} V_{\beta_2} \Bigr) = \emptyset \quad \text{when } |\alpha_1 - \alpha_2| > 3(m - 1), \]
for such $\alpha_1, \alpha_2$, and $(\beta_1, \beta_2) \in B_{\alpha_1} \times B_{\alpha_2}$, the variables $E(X_{\beta_1}^{\alpha_1} \mid \mathcal{F}) - X_{\beta_1}$ and $E(X_{\beta_2}^{\alpha_2} \mid \mathcal{F}) - X_{\beta_2}$ are independent. Hence (6.109) holds with
\[ D = \{ (\alpha_1, \alpha_2) : |\alpha_1 - \alpha_2| \le 3(m - 1) \}, \]
and Corollary 6.1 gives bounds of the same form as for Example 6.1.

When $\tau = \iota_m$, the identity permutation of length $m$, we say that $\pi$ has a rising sequence of length $m$ at position $\alpha$ if $X_\alpha = 1$. Rising sequences were studied in Bayer and Diaconis (1992) in connection with card tricks and card shuffling. Due to the regular-self-overlap property of rising sequences, namely that a non-empty intersection of two rising sequences is again a rising sequence, some improvement on the constant in the bound can be obtained by a more careful consideration of the conditional variance.

Example 6.3 (Coloring patterns and subgraph occurrences on a finite graph $G$) With $n, p \in \mathbb{N}$, let $V = A = \{1, \ldots, n\}^p$, again with addition modulo $n$, and for $\alpha, \beta \in V$ let $d(\alpha, \beta) = \|\alpha - \beta\|$ where $\|\cdot\|$ denotes the supremum norm. Further, let $E = \{\{w, v\} : d(w, v) = 1\}$, and, for each $\alpha \in A$, let $G_\alpha = (V_\alpha, E_\alpha)$ where
\[ V_\alpha = \{ v : d(v, \alpha) \le 1 \} \qquad \text{and} \qquad E_\alpha = \bigl\{ \{v, w\} : v, w \in V_\alpha,\ d(w, v) = 1 \bigr\}. \]
Let $C$ be a set (of e.g. colors) from which is formed a given pattern $\{c_g : g \in G_0\}$, let $\{C_g, g \in G\}$ be independent variables in $C$ with $\{C_g : g \in G_\alpha\}_{\alpha \in A}$ identically distributed, and let
\[ X(C_g, g \in G_0) = \prod_{g \in G_0} \mathbf{1}(C_g = c_g), \tag{6.122} \]
and $X_\alpha = X(C_g, g \in G_\alpha)$. Then $Y = \sum_{\alpha \in A} X_\alpha$ counts the number of times the pattern appears in the subgraphs $G_\alpha$. Taking $\rho = 2$ by (6.117), the conclusion of Corollary 6.3 holds with $M = 1$, $V(r) = (2r + 1)^p$ and $|A| = n^p$.
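The reordering construction of $\pi^\alpha$ in Example 6.2 can be made concrete with a short sketch. The code below uses 0-based positions and ranks instead of the 1-based convention of the text, and the function names are hypothetical. It reorders the values of $\pi$ inside the cyclic window starting at $\alpha$ so that their relative order matches $\tau$, leaving all values outside the window unchanged.

```python
def window(alpha, m, n):
    """Cyclic window of m positions starting at alpha (0-based, modulo n)."""
    return [(alpha + k) % n for k in range(m)]

def pattern_appears(pi, tau, alpha):
    """True if the values of pi on the window at alpha are ordered like tau.

    pi  : permutation of 0..n-1 given as a list
    tau : permutation of 0..m-1 given as a list of ranks, with m <= n
    """
    vals = [pi[v] for v in window(alpha, len(tau), len(pi))]
    order = sorted(vals)
    ranks = [order.index(x) for x in vals]  # rank of each window value
    return ranks == list(tau)

def bias_in_direction(pi, tau, alpha):
    """Reorder the window values so that the pattern tau appears at alpha.

    Position k of the window receives the tau[k]-th smallest window value,
    which mirrors pi^alpha(v) = pi(sigma_alpha(tau(v - alpha + 1)) + alpha - 1).
    """
    pos = window(alpha, len(tau), len(pi))
    order = sorted(pi[v] for v in pos)
    out = list(pi)
    for k, v in enumerate(pos):
        out[v] = order[tau[k]]
    return out

def count_pattern(pi, tau):
    """Y = number of window locations at which tau appears."""
    return sum(pattern_appears(pi, tau, a) for a in range(len(pi)))
```

Because only the values inside the window are permuted among themselves, the output is again a permutation, and it agrees with the input whenever the pattern already appears at $\alpha$.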
Such multi-dimensional pattern occurrences are a generalization of the well-studied case in which one-dimensional sequences are scanned for pattern occurrences; see, for instance, Glaz et al. (2001) and Naus (1982) for scan and window statistics, see Huang (2002) for applications of the normal approximation in this context to molecular sequence data, and see also Darling and Waterman (1985, 1986), where higher-dimensional extensions are considered.

Occurrences of subgraphs can be handled as a special case. For example, with $(V, E)$ the graph above, let $G$ be the random subgraph with vertex set $V$ and random edge set $\{e \in E : C_e = 1\}$ where $\{C_e\}_{e \in E}$ are independent and identically distributed Bernoulli variables. Then letting the function $X(C_g, g \in G_0)$ in (6.122) be the indicator of the occurrence of a distinguished subgraph of $G_0$, the sum $Y = \sum_{\alpha \in A} X_\alpha$ counts the number of times that copies of the subgraph appear in the random graph $G$; the same bounds hold as above.

Example 6.4 (Local extremes) For a given graph $G$, let $G_\alpha$, $\alpha \in A$, be a collection of subgraphs of $G$ isomorphic to some subgraph $G_0$ of $G$, and let $v \in V_0$ be a distinguished vertex in $G_0$. Let $\{C_g, g \in V\}$ be a collection of independent and identically distributed random variables, and let $X_\alpha = X(C_\beta, \beta \in V_\alpha)$ where
\[ X(C_\beta, \beta \in V_0) = \mathbf{1}(C_v \ge C_\beta,\ \beta \in V_0). \]
Then the sum $Y = \sum_{\alpha \in A} X_\alpha$ counts the number of times the vertex in $G_\alpha$ corresponding under the isomorphism to the distinguished vertex $v \in V_0$ is a local maximum. Corollary 6.3 holds with $M = 1$; the other quantities determining the bound are dependent on the structure of $G$. Consider, for example, the hypercube $V = \{0, 1\}^n$ and $E = \{\{v, w\} : \|v - w\| = 1\}$, where $\|\cdot\|$ is the Hamming distance (see also Baldi et al. 1989 and Baldi and Rinott 1989). Let $v = 0$ be the distinguished vertex, $A = V$, and, for each $\alpha \in A$, let
\[ V_\alpha = \{ \beta : \|\beta - \alpha\| \le 1 \} \qquad \text{and} \qquad E_\alpha = \bigl\{ \{v, w\} : v, w \in V_\alpha,\ \|v - w\| = 1 \bigr\}. \]
Corollary 6.3 applies with $\rho = 2$ by (6.117), $V(r) = \sum_{j=0}^r \binom{n}{j}$, and $|A| = 2^n$.
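The quantities entering the bound of Corollary 6.3 in Examples 6.3 and 6.4 are simple to compute. The sketch below, with hypothetical function names, evaluates $V(r)$ for the torus with the supremum norm and for the hypercube with Hamming distance, together with the bound itself. The torus formula is only meaningful under the assumption $2r + 1 \le n$, so that the ball does not wrap around onto itself.

```python
import math

def ball_size_torus(r, p):
    """V(r) = (2r + 1)^p for {1,...,n}^p with the sup norm (assumes 2r + 1 <= n)."""
    return (2 * r + 1) ** p

def ball_size_hypercube(r, n):
    """V(r) = sum_{j=0}^{r} C(n, j) for {0,1}^n with Hamming distance."""
    return sum(math.comb(n, j) for j in range(r + 1))

def corollary_6_3_bound(mu, sigma, M, size_A, V_rho, V_3rho):
    """6*mu*V(rho)^2*M^2/sigma^3 + 2*mu*V(rho)*M*sqrt(V(3rho))/(sigma^2*sqrt(|A|))."""
    first = 6.0 * mu * V_rho**2 * M**2 / sigma**3
    second = (2.0 * mu * V_rho * M * math.sqrt(V_3rho)
              / (sigma**2 * math.sqrt(size_A)))
    return first + second
```

The second term agrees with the corresponding term of Corollary 6.2 after the substitutions $p = 1/|A|$, $b = V(\rho)$ and $|D| = |A| V(3\rho)$ made in the proof.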
6.3 The Lightbulb Process The following problem arises from a study in the pharmaceutical industry on the effects of dermal patches designed to activate targeted receptors. An active receptor will become inactive, and an inactive one active, if it receives a dose of medicine released from the dermal patch. Let the number of receptors, all initially inactive, be denoted by n. On study day i over a period of n days, exactly i randomly selected receptors each will receive one dose of medicine, thus changing their status between inactive and active. The problem has the following, somewhat more colorful, though equivalent, formulation. Consider n toggle switches, each being connected to a lightbulb. Pressing the toggle switch connected to a bulb changes its status from off to on and vice versa. At each stage i = 1, . . . , n, exactly i of the n switches are randomly pressed.
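Interest centers on the random variable $Y$, which records the number of lightbulbs that are on at the terminal time $n$. The problem of determining the properties of $Y$ was first considered in Rao et al. (2007) where the following expressions for the mean $\mu = EY$ and variance $\sigma^2 = \operatorname{Var}(Y)$ were derived,
\[ \mu = \frac{n}{2} \left[ 1 - \prod_{i=1}^n \left( 1 - \frac{2i}{n} \right) \right], \tag{6.123} \]
and
\[
\sigma^2 = \frac{n}{4} \left[ 1 - \prod_{i=1}^n \left( 1 - \frac{4i}{n} + \frac{4i(i-1)}{n(n-1)} \right) \right]
+ \frac{n^2}{4} \left[ \prod_{i=1}^n \left( 1 - \frac{4i}{n} + \frac{4i(i-1)}{n(n-1)} \right) - \prod_{i=1}^n \left( 1 - \frac{2i}{n} \right)^2 \right]. \tag{6.124}
\]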
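The mean and variance formulas (6.123) and (6.124) can be checked against a brute-force enumeration of the lightbulb process for very small $n$. The following sketch uses exact rational arithmetic; the function names are hypothetical, and the enumeration is only feasible for $n$ up to about 5 or 6 since the number of switch configurations is $\prod_{r=1}^n \binom{n}{r}$.

```python
from fractions import Fraction
from itertools import combinations, product

def lightbulb_mean_var(n):
    """Mean and variance of Y from (6.123) and (6.124), for n >= 2."""
    lam1 = Fraction(1)  # product of (1 - 2i/n)
    lam2 = Fraction(1)  # product of (1 - 4i/n + 4i(i-1)/(n(n-1)))
    for i in range(1, n + 1):
        lam1 *= 1 - Fraction(2 * i, n)
        lam2 *= 1 - Fraction(4 * i, n) + Fraction(4 * i * (i - 1), n * (n - 1))
    mu = Fraction(n, 2) * (1 - lam1)
    var = Fraction(n, 4) * (1 - lam2) + Fraction(n * n, 4) * (lam2 - lam1**2)
    return mu, var

def lightbulb_exact(n):
    """Exact mean and variance of Y by enumerating every choice of switches."""
    stages = [list(combinations(range(n), r)) for r in range(1, n + 1)]
    total = s1 = s2 = 0
    for choice in product(*stages):  # one subset of size r per stage r
        state = [0] * n
        for subset in choice:
            for k in subset:
                state[k] ^= 1        # pressing a switch toggles the bulb
        y = sum(state)
        total += 1
        s1 += y
        s2 += y * y
    mean = Fraction(s1, total)
    return mean, Fraction(s2, total) - mean**2
```

Each configuration is equally likely because the subset pressed at stage $r$ is uniform over all subsets of size $r$ and the stages are independent.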
Other results, for instance, recursions for determining the exact finite sample distribution of Y , are derived in Rao et al. (2007). In addition, approximations to the distribution of Y , including by the normal, are also considered there, though the question of the asymptotic normality of Y was left open. Note that when n is even then μ = n/2 exactly, as the product in (6.123), containing the term i = n/2, is zero. By results in Rao et al. (2007), in the odd case μ = (n/2)(1 + O(e−n )), and in both the even and odd cases σ 2 = (n/2)(1 + O(e−n )). The following theorem of Goldstein and Zhang (2010) provides a bound to the normal which holds for all finite n, and which tends to zero as n tends to infinity at the rate n−1/2 , thus showing the asymptotic distribution of Y is normal as n → ∞. Though the results of Goldstein and Zhang (2010) provide a bound no matter the parity of n, for simplicity we only consider the case where n even. Theorem 6.5 With Y the number of bulbs on at the terminal time n and W = (Y − μ)/σ where μ = n/2 and σ 2 is given by (6.124), for all n even n 2 n supP (W ≤ z) − P (Z ≤ z) ≤ 2 + 1.64 3 + σ 2σ σ z∈R where 1 1 + e−n/2 ≤ √ + 2 n 2n
for n ≥ 6.
(6.125)
We now more formally describe the random variable $Y$. Let $\mathbf{Y} = \{Y_{ri} : r, i = 1, \ldots, n\}$ be the Bernoulli 'switch' variables which have the interpretation
\[
Y_{ri} =
\begin{cases}
1 & \text{if the status of bulb } i \text{ is changed at stage } r, \\
0 & \text{otherwise.}
\end{cases}
\]
We continue to suppress the dependence of $Y$, and also of $Y_{ri}$, on $n$. As the set of $r$ bulbs which have their status changed at stage $r$ is chosen uniformly over all sets of
size $r$, and as the stages are independent of each other, with $e_1, \ldots, e_n \in \{0, 1\}$ the joint distribution of $Y_{r1}, \ldots, Y_{rn}$ is given by
\[
P(Y_{r1} = e_1, \ldots, Y_{rn} = e_n) =
\begin{cases}
\binom{n}{r}^{-1} & \text{if } e_1 + \cdots + e_n = r, \\
0 & \text{otherwise,}
\end{cases}
\]
with the collections $\{Y_{r1}, \ldots, Y_{rn}\}$ independent for $r = 1, \ldots, n$. Clearly, for each stage $r$, the variables $(Y_{r1}, \ldots, Y_{rn})$ are exchangeable, and the marginal distribution for each $r, i = 1, \ldots, n$ is given by
\[ P(Y_{ri} = 1) = \frac{r}{n} \qquad \text{and} \qquad P(Y_{ri} = 0) = 1 - \frac{r}{n}. \]
For $r, i = 1, 2, \ldots, n$ the quantity $(\sum_{s=1}^r Y_{si}) \bmod 2$ is the indicator that bulb $i$ is on at time $r$, and therefore
\[ Y = \sum_{i=1}^n Y_i \qquad \text{where} \qquad Y_i = \Bigl( \sum_{r=1}^n Y_{ri} \Bigr) \bmod 2 \tag{6.126} \]
is the number of bulbs on at the terminal time. The lightbulb process, where the n individual states evolve according to the same marginal Markov chain, is a special case of a certain class of multivariate chains studied in Zhou and Lange (2009), termed ‘Composition Markov chains of multinomial type.’ As shown there, such chains admit explicit full spectral decompositions, and in particular, each transition matrix of the lightbulb process can be simultaneously diagonalized by a Hadamard matrix. These properties were, in fact, put to use in Rao et al. (2007) for the calculation of the moments needed for (6.123) and (6.124). We now describe the coupling given by Goldstein and Zhang (2010), which shows that when n is even, Y may be coupled monotonically to a variable Y s having the Y -size bias distribution, in particular, such that Y ≤ Y s ≤ Y + 2.
(6.127)
For every $i \in \{1, \ldots, n\}$ construct the collection of variables $\mathbf{Y}^i$ from $\mathbf{Y}$ as follows. If $Y_i = 1$, that is, if bulb $i$ is on, let $\mathbf{Y}^i = \mathbf{Y}$. Otherwise, with $J^i$ a uniformly chosen index over the set $\{j : Y_{n/2,j} = 1 - Y_{n/2,i}\}$, let $\mathbf{Y}^i = \{Y_{rk}^i : r, k = 1, \ldots, n\}$ where
\[
Y_{rk}^i =
\begin{cases}
Y_{rk} & r \ne n/2, \\
Y_{n/2,k} & r = n/2,\ k \notin \{i, J^i\}, \\
Y_{n/2,J^i} & r = n/2,\ k = i, \\
Y_{n/2,i} & r = n/2,\ k = J^i,
\end{cases}
\]
and let $Y^i = \sum_{k=1}^n Y_k^i$ where
\[ Y_k^i = \Bigl( \sum_{j=1}^n Y_{jk}^i \Bigr) \bmod 2. \]
In other words, if bulb $i$ is off, then the switch variable $Y_{n/2,i}$ of bulb $i$ at stage $n/2$ is interchanged with that of a bulb whose switch variable at this stage has the opposite status. With $I$ uniformly chosen from $\{1, \ldots, n\}$ and independent of all other variables, it is shown in Goldstein and Zhang (2010) that the mixture $Y^s = Y^I$ has the $Y$-size biased distribution, essentially due to the fact that
\[ \mathcal{L}(\mathbf{Y}^i) = \mathcal{L}(\mathbf{Y} \mid Y_i = 1) \quad \text{for all } i = 1, \ldots, n. \]
It is not difficult to see that $Y^s$ satisfies (6.127). If $Y_I = 1$ then $\mathbf{Y}^I = \mathbf{Y}$, and so in this case $Y^s = Y$. Otherwise $Y_I = 0$, and we obtain $\mathbf{Y}^I$ by interchanging, at stage $n/2$, the unequal switch variables $Y_{n/2,I}$ and $Y_{n/2,J^I}$, which changes the status of both bulbs $I$ and $J^I$. If bulb $J^I$ was on, that is, if $Y_{J^I} = 1$, then after the interchange $Y_I^I = 1$ and $Y_{J^I}^I = 0$, in which case $Y^s = Y$. Otherwise bulb $J^I$ was off, that is, $Y_{J^I} = 0$, in which case after the interchange we have $Y_I^I = 1$ and $Y_{J^I}^I = 1$, yielding $Y^s = Y + 2$.

As the coupling is both monotone and bounded, by (6.127) Theorem 5.7 may be invoked with $\delta = 2/\sigma$. In fact, the first two terms of the bound in Theorem 6.5 arise directly from Theorem 5.7 with this $\delta$. The bound (6.125) is calculated in Goldstein and Zhang (2010), making heavy use of the spectral decomposition provided by Zhou and Lange (2009) to determine various joint probabilities of fourth, but no higher, order.
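The coupling just described is easy to simulate, which gives a direct check of the monotone bound (6.127). The sketch below is a minimal illustration for even $n$; the function names are hypothetical and stages are stored 0-based, so stage $n/2$ is row `n // 2 - 1`.

```python
import random

def sample_switches(n, rng):
    """Row r-1 holds the stage-r switch variables: exactly r ones, uniformly placed."""
    rows = []
    for r in range(1, n + 1):
        chosen = set(rng.sample(range(n), r))
        rows.append([1 if k in chosen else 0 for k in range(n)])
    return rows

def bulbs_on(rows):
    """Final status of each bulb: parity of the number of times it was switched."""
    n = len(rows)
    return [sum(rows[r][k] for r in range(n)) % 2 for k in range(n)]

def size_bias_coupling(rows, rng):
    """Return (Y, Y^s) built from one realization of the switch variables."""
    n = len(rows)
    assert n % 2 == 0, "the construction is described for even n only"
    state = bulbs_on(rows)
    y = sum(state)
    i = rng.randrange(n)             # the uniform index I
    if state[i] == 1:
        return y, y                  # bulb I already on: nothing changes
    half = n // 2 - 1                # row index of stage n/2
    candidates = [j for j in range(n) if rows[half][j] != rows[half][i]]
    j = rng.choice(candidates)       # the uniform index J^I
    new_rows = [row[:] for row in rows]
    new_rows[half][i], new_rows[half][j] = rows[half][j], rows[half][i]
    return y, sum(bulbs_on(new_rows))
```

The candidate set is never empty, since at stage $n/2$ exactly half of the switch variables equal one. Swapping two unequal switch variables at a single stage flips the final status of exactly the two bulbs involved, which is why the difference $Y^s - Y$ can only be 0 or 2.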
6.4 Anti-voter Model

The anti-voter model was introduced by Matloff (1977) on infinite lattices. Donnelly and Welsh (1984), and Aldous and Fill (1994) consider, as we do here, the case of finite graphs; see also Liggett (1985), and references there. The treatment below closely follows Rinott and Rotar (1997), who deal with a discrete time version. Let $G = (V, E)$ be a graph with $n$ vertices $V$ and edges $E$, which we assume to be $r$-regular, that is, all vertices $v \in V$ have degree $r$. Consider the following transition rule for a Markov chain $\{\mathbf{X}^{(t)}, t = 0, 1, \ldots\}$ with state space $\{-1, 1\}^V$. At each time $t$, a vertex $v$ is chosen uniformly from $V$, and then a different vertex $w$ is chosen uniformly from the set
\[ N_v = \bigl\{ w : \{v, w\} \in E \bigr\} \]
of neighbors of $v$, and then we let
\[
X_u^{(t+1)} =
\begin{cases}
X_u^{(t)} & u \ne v, \\
-X_w^{(t)} & u = v.
\end{cases}
\]
That is, the configuration at time $t + 1$ is the same as at time $t$, except that vertex $v$ takes the sign opposite to that of its randomly chosen neighbor $w$. Following Donnelly and Welsh (1984), and Aldous and Fill (1994), when $G$ is neither an $n$ cycle nor bipartite, the chain is irreducible on the state space consisting
of the $2^n - 2$ configurations which exclude those where the $X_v^{(t)}$ are identical, and has a stationary distribution supported on this set. We suppose the distribution of $\mathbf{X}^{(0)}$, the chain at time zero, is this stationary distribution. The exchangeable pair coupling yields the following result on the quality of the normal approximation to the distribution of the standardized net sign of the stationary configuration.

Theorem 6.6 Let $\mathbf{X}$ have the stationary distribution of the anti-voter chain on an $n$ vertex, $r$-regular graph $G$, neither an $n$ cycle nor bipartite, and let $W$ be the standardized net sign $U$ of the configuration $\mathbf{X}$, that is, with $\sigma^2 = \operatorname{Var}(U)$ let
\[ W = U/\sigma \qquad \text{where} \qquad U = \sum_{v \in V} X_v. \tag{6.128} \]
Then, if $U'$ is the net sign obtained by applying the one step transition to the configuration $\mathbf{X}$, $(U, U')$ is a $2/n$-Stein pair that satisfies
\[ |U' - U| \le 2 \qquad \text{and} \qquad E\bigl[ (U' - U)^2 \mid \mathbf{X} \bigr] = 8(a + b)/(rn) \tag{6.129} \]
where $a$ and $b$ are the number of edges that are incident on vertices both of which are in state $+1$, or $-1$, respectively. In addition,
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{12n}{\sigma^3} + \frac{\sqrt{\operatorname{Var}(Q)}}{r\sigma^2} \]
where
\[ Q = \sum_{v \in V} \sum_{w \in N_v} X_v X_w. \tag{6.130} \]

When $\sigma^2$ and $\operatorname{Var}(Q)$ are of order $n$, the bound given in the theorem has order $1/\sqrt{n}$. Use of (5.13) results in a somewhat more complex, but superior bound.

The first order of business in proving Theorem 6.6 is the construction of an exchangeable pair. It is immediate that if $\mathbf{X}^{(t)}$ is a reversible Markov chain in stationarity then for any measurable function $f$ on the chain, $(f(\mathbf{X}^{(s)}), f(\mathbf{X}^{(t)}))$ is exchangeable, for any $s$ and $t$. Even when a chain is not reversible, as is the case for the anti-voter model, the following lemma may be invoked for functions of chains whose increments are the same as that of a birth death process.

Lemma 6.11 Let $\{\mathbf{X}^{(t)}, t = 0, 1, \ldots\}$ be a stationary process, and suppose that $T(\mathbf{X}^{(t)})$ assumes nonnegative integer values such that
\[ T(\mathbf{X}^{(t+1)}) - T(\mathbf{X}^{(t)}) \in \{-1, 0, 1\} \quad \text{for all } t = 0, 1, \ldots. \tag{6.131} \]
Then for any measurable function $f$,
\[ (W, W') = \bigl( f(T(\mathbf{X}^{(t)})),\, f(T(\mathbf{X}^{(t+1)})) \bigr) \]
is an exchangeable pair.
Proof The process $T^{(t)} = T(\mathbf{X}^{(t)})$ is stationary and has values in the nonnegative integers. For integers $i, j$ in the range of $T(\cdot)$, set
\[ \pi_i = P(T^{(t)} = i) \qquad \text{and} \qquad p_{ij} = P(T^{(t+1)} = j \mid T^{(t)} = i). \]
By stationarity, these probabilities do not depend on $t$. Using stationarity to obtain the second equality, and setting $\pi_i$ and $p_{ij} = 0$ for all $i < 0$, we have for all nonnegative integers $j$,
\[
\begin{aligned}
\pi_j = P(T^{(t)} = j) = P(T^{(t+1)} = j)
&= \sum_{i \in \mathbb{N}_0} P(T^{(t+1)} = j,\, T^{(t)} = i) \\
&= \sum_{i \in \mathbb{N}_0} P(T^{(t+1)} = j \mid T^{(t)} = i)\, P(T^{(t)} = i)
= \sum_{i:\, |i - j| \le 1} \pi_i p_{ij},
\end{aligned}
\]
where we have restricted the sum in the last equality due to the condition imposed by (6.131). This same system of equations arises in birth and death chains and it is well-known that if it has a solution then it is unique, can be written explicitly, and satisfies $\pi_i p_{ij} = \pi_j p_{ji}$ (which implies reversibility for birth and death chains). Here, the latter relation is equivalent to
\[ P(T^{(t)} = i,\, T^{(t+1)} = j) = P(T^{(t)} = j,\, T^{(t+1)} = i), \]
implying that $(T^{(t)}, T^{(t+1)})$ is an exchangeable pair.

With this result in hand, we may now proceed to the

Proof of Theorem 6.6 We apply Theorem 5.4. By Lemma 6.11,
\[ (W, W') = \bigl( W(\mathbf{X}^{(t)}),\, W(\mathbf{X}^{(t+1)}) \bigr) \]
is an exchangeable pair when $W(\mathbf{X})$ is the standardized net sign of the configuration $\mathbf{X}$, as in (6.128). With $U$ and $U'$ the net signs of $\mathbf{X}^{(t)}$ and $\mathbf{X}^{(t+1)}$, respectively, since at most a single $1$ becomes $-1$, or a $-1$ a $1$, in a one step transition, clearly the first claim of (6.129) holds. We next verify that $(W, W')$ satisfies the linearity condition (2.33) with $\lambda = 2/n$. Let
\[ T = \sum_{v \in V} \mathbf{1}_{\{X_v = 1\}}, \]
the number of vertices with sign 1, and let
\[
a = \sum_{\{u,v\} \in E} \mathbf{1}_{\{X_u = X_v = 1\}}, \qquad
b = \sum_{\{u,v\} \in E} \mathbf{1}_{\{X_u = X_v = -1\}} \qquad \text{and} \qquad
c = \sum_{\{u,v\} \in E} \mathbf{1}_{\{X_u \ne X_v\}},
\]
the number of edges both of whose incident vertices take the value $1$, the value $-1$, or both these values, respectively. For an $r$-regular graph,
\[ r \mathbf{1}_{\{X_v = 1\}} = \sum_{w \in N_v} \mathbf{1}_{\{X_v = 1, X_w = 1\}} + \sum_{w \in N_v} \mathbf{1}_{\{X_v = 1, X_w = -1\}}, \]
hence summing over $v \in V$ yields
\[ rT = \sum_{v \in V,\, w \in N_v} \mathbf{1}_{\{X_v = 1, X_w = 1\}} + \sum_{v \in V,\, w \in N_v} \mathbf{1}_{\{X_v = 1, X_w = -1\}} = 2a + c, \]
and so
\[ T = (2a + c)/r \qquad \text{and likewise} \qquad n - T = (2b + c)/r. \tag{6.132} \]
Note $U = 2T - n$ and $U' = 2T' - n$, with $T'$ the number of vertices with sign 1 after the transition, are the net signs of the configurations $\mathbf{X}^{(t)}$ and $\mathbf{X}^{(t+1)}$, respectively. When making a transition one first chooses a vertex uniformly, then one of its neighbors, uniformly, and so since the graph is regular the edge so chosen is uniform. As the net sign $U$ decreases by 2 in a transition if and only if a $1$ becomes a $-1$, and this event occurs if and only if the chosen edge is one of the $a$ edges, out of $rn/2$ in total, both of whose vertices are in state $1$, we have
\[ P(U' - U = -2 \mid \mathbf{X}) = \frac{2a}{rn} \qquad \text{and likewise} \qquad P(U' - U = 2 \mid \mathbf{X}) = \frac{2b}{rn}, \tag{6.133} \]
and therefore, by (6.132),
\[ E(U' - U \mid \mathbf{X}) = \frac{4b}{rn} - \frac{4a}{rn} = \frac{2(n - 2T)}{n} = -\frac{2}{n} U. \]
Hence, (2.33) is satisfied for $W = U/\sigma$ with $\lambda = 2/n$, and Theorem 5.4 obtains with this value of $\lambda$ and, as $|U' - U| \le 2$, with $\delta = 2/\sigma$.

Next we bound the quantity in (5.3). By (6.133) we have
\[ E\bigl[ (U' - U)^2 \mid \mathbf{X} \bigr] = 4 \left( \frac{2a}{rn} + \frac{2b}{rn} \right), \]
proving the second claim in (6.129). Next, recalling the definition of $Q$ in (6.130), note the relations
\[ 2a + 2b + 2c = rn \qquad \text{and} \qquad 2a + 2b - 2c = Q, \]
imply $4(a + b) = Q + rn$. Hence $E[(U' - U)^2 \mid \mathbf{X}] = 2(Q + rn)/(rn)$, and therefore, using that $W$ is a function of $\mathbf{X}$ and (4.143), the quantity in (5.3) is at most
\[ \sqrt{\operatorname{Var}\bigl( E[(W' - W)^2 \mid \mathbf{X}] \bigr)} = \sqrt{\operatorname{Var}\left( \frac{2Q}{rn\sigma^2} \right)} = \frac{2}{rn\sigma^2} \sqrt{\operatorname{Var}(Q)}. \]
Applying Theorem 5.4 along with (5.4) and this upper bound, and using that $\lambda = 2/n$ and $\delta = 2/\sigma$, the proof of the theorem is complete.

The quantities $\sigma$ and $\operatorname{Var}(Q)$ depend heavily on the particular graph under consideration. For details on how these quantities may be bounded for graphs having certain regularity properties, and examples which include the Hamming graph and the $k$-bipartite graph, see Rinott and Rotar (1997).
6.5 Binary Expansion of a Random Integer

Let $n \ge 2$ be a natural number and $x$ an integer in the set $\{0, 1, \ldots, n-1\}$. For $m = [\log_2(n-1)] + 1$, consider the binary expansion of $x$,
\[ x = \sum_{i=1}^{m} x_i 2^{m-i}. \]
Clearly any leading zeros contribute nothing to the sum $x$. With $X$ uniformly chosen from $\{0, 1, \ldots, n-1\}$, the sum $S = X_1 + \cdots + X_m$ is the number of ones in the expansion of $X$. When $n = 2^m$ a uniform random integer between 0 and $2^m - 1$ may be constructed by choosing its $m$ binary digits to be zeros and ones with equal probability, and independently, so the distribution of $S$ in this case has the symmetric binomial distribution with $m$ trials, which of course can be well approximated by the normal. Theorem 6.7 shows that the same is true for any large $n$, and provides an explicit bound. We follow Diaconis (1977).

We approach the problem using exchangeable pairs. For $x$ an integer in $\{0, \ldots, n-1\}$ let $Q(x, n)$ be the number of zeros in the $m$ long expansion of $x$ which, when changed to 1, result in an integer $n$ or larger, that is,
\[ Q(x, n) = |J_x| \quad \text{where} \quad J_x = \bigl\{ i \in \{1, \ldots, m\}: x_i = 0,\ x + 2^{m-i} \ge n \bigr\}. \quad (6.134) \]
For example, Q(10, 5) = 1. With $I$ a random index on $\{1, \ldots, m\}$ let
\[ X' = \begin{cases} X + (1 - 2X_I) 2^{m-I} & \text{if } I \notin J_X, \\ X & \text{otherwise.} \end{cases} \]
That is, the $I$th digit of $X$ is changed from $X_I$ to $1 - X_I$, if doing so produces a number between $\{0, \ldots, n-1\}$. Clearly $(X, X')$ are exchangeable, and $S'$, the number of ones in the expansion of $X'$, is given by
\[ S' = \begin{cases} S + 1 - 2X_I & \text{if } I \notin J_X, \\ S & \text{otherwise.} \end{cases} \]
As we see from the following lemma $(S, S')$ is not a Stein pair, as it fails to satisfy the linearity condition. Nevertheless, Theorem 3.5 applies. The lemma also provides the mean and variance of $S$.

Lemma 6.12 For $n \ge 2$ let $X$ be uniformly chosen from $\{0, 1, \ldots, n-1\}$ and $Q = Q(X, n)$. Then
\[ E(S' - S \mid X) = 1 - \frac{2S}{m} - \frac{Q}{m}, \qquad E\bigl((S' - S)^2 \mid X\bigr) = 1 - \frac{Q}{m}, \]
and
\[ ES = \frac{1}{2}(m - EQ) \quad \text{and} \quad \operatorname{Var}(S) = \frac{m}{4}\left(1 - \frac{EQ + 2\operatorname{Cov}(S, Q)}{m}\right). \]
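The mean and variance identities of Lemma 6.12 are exact for every n ≥ 2, so they can be confirmed by direct enumeration. The sketch below is an illustration only: the function names `digit_stats` and `lemma_holds` are hypothetical, exact rational arithmetic is used to avoid rounding, and bit positions are indexed by their weight 2^k rather than by the index i of the text.

```python
from fractions import Fraction


def digit_stats(n):
    """Exact m, ES, EQ, Var(S), Cov(S, Q) for X uniform on {0, ..., n-1}."""
    m = (n - 1).bit_length()  # equals [log2(n-1)] + 1 for n >= 2
    S = [bin(x).count("1") for x in range(n)]
    # Q(x, n): zero digits whose flip to 1 would produce an integer >= n
    Q = [
        sum(1 for k in range(m) if ((x >> k) & 1) == 0 and x + (1 << k) >= n)
        for x in range(n)
    ]
    ES = Fraction(sum(S), n)
    EQ = Fraction(sum(Q), n)
    var_S = Fraction(sum(s * s for s in S), n) - ES * ES
    cov_SQ = Fraction(sum(s * q for s, q in zip(S, Q)), n) - ES * EQ
    return m, ES, EQ, var_S, cov_SQ


def lemma_holds(n):
    """Check ES = (m - EQ)/2 and Var(S) = (m/4)(1 - (EQ + 2 Cov(S,Q))/m)."""
    m, ES, EQ, var_S, cov_SQ = digit_stats(n)
    mean_ok = ES == (m - EQ) / 2
    var_ok = var_S == Fraction(m, 4) * (1 - (EQ + 2 * cov_SQ) / m)
    return mean_ok and var_ok


print(all(lemma_holds(n) for n in range(2, 300)))
```

When n is a power of two, Q is identically zero and the formulas reduce to the mean m/2 and variance m/4 of the symmetric binomial distribution.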
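Proof To derive the first identity, write
\begin{align*}
E(S' - S \mid X) &= P(S' - S = 1 \mid X) - P(S' - S = -1 \mid X) \\
&= P(X_I = 0, X'_I = 1 \mid X) - P(X_I = 1 \mid X) \\
&= P(X_I = 0 \mid X) - P(X_I = 0, X'_I = 0 \mid X) - P(X_I = 1 \mid X) \\
&= 1 - P(X_I = 0, X'_I = 0 \mid X) - 2P(X_I = 1 \mid X) \\
&= 1 - \frac{1}{m}\sum_{i=1}^{m} 1_{\{X_i = 0,\ X + 2^{m-i} \ge n\}} - \frac{2}{m}\sum_{i=1}^{m} X_i \\
&= 1 - \frac{2S}{m} - \frac{Q}{m}. \quad (6.135)
\end{align*}
The expectation of $S$ can now be calculated using that $E(S' - S) = 0$. Similarly, since $S' - S \in \{-1, 0, 1\}$,
\begin{align*}
E\bigl((S' - S)^2 \mid X\bigr) &= P(S' - S = 1 \mid X) + P(S' - S = -1 \mid X) \\
&= P(X_I = 0, X'_I = 1 \mid X) + P(X_I = 1 \mid X) \\
&= P(X_I = 0 \mid X) - P(X_I = 0, X'_I = 0 \mid X) + P(X_I = 1 \mid X) \\
&= 1 - P(X_I = 0, X'_I = 0 \mid X) \\
&= 1 - \frac{1}{m}\sum_{i=1}^{m} 1_{\{X_i = 0,\ X + 2^{m-i} \ge n\}} \\
&= 1 - \frac{Q}{m}. \quad (6.136)
\end{align*}
To calculate the variance, note first
\[ 0 = E(S' - S)(S' + S) = E\bigl(2(S' - S)S + (S' - S)^2\bigr). \]
Now taking expectation in (6.136), using identity (6.135) and that the quantities involved have mean zero, we obtain
\begin{align*}
E\left(1 - \frac{Q}{m}\right) = E(S' - S)^2 &= 2E\,S(S - S') \\
&= 2E\,S\left(\frac{2S}{m} + \frac{Q}{m} - 1\right) \\
&= \frac{4\operatorname{Var}(S)}{m} + 2\operatorname{Cov}\left(S, \frac{Q}{m}\right).
\end{align*}
Solving for the variance now completes the proof.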
Theorem 6.7 For $n \ge 2$ let $X$ be a random integer chosen uniformly from $\{0, \ldots, n-1\}$. Then with $S$ the number of ones in the binary expansion of $X$ and $m = [\log_2(n-1)] + 1$, the random variable
\[ W = \frac{S - m/2}{\sqrt{m/4}} \quad (6.137) \]
satisfies
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - P(Z \le z)\bigr| \le \frac{6.9}{\sqrt{m}} + \frac{5.4}{m}. \]
Proof With $W'$ given by (6.137) with $S$ replaced by $S'$, the pair $(W, W')$ is exchangeable, and, with $\lambda = 2/m$, Lemma 6.12 yields
\[ E(W - W' \mid W) = \lambda\left(W + \frac{E(Q \mid W)}{\sqrt{m}}\right) \quad (6.138) \]
and
\[ \frac{1}{2\lambda} E\bigl((W' - W)^2 \mid W\bigr) = 1 - \frac{E(Q \mid W)}{m}. \quad (6.139) \]
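Further, we have
\begin{align*}
EQ &= \sum_{i=1}^{m} P\bigl(X_i = 0,\ X + 2^{m-i} \ge n\bigr) \\
&\le \sum_{i=1}^{m} P\bigl(X \ge n - 2^{m-i}\bigr) \\
&= \sum_{i=1}^{m} \frac{2^{m-i}}{n} = \frac{2^m - 1}{n} \le \frac{2^m - 1}{2^{m-1}} \le 2. \quad (6.140)
\end{align*}
Since $W$, $W'$ are exchangeable, for any function $f$ for which the expectations below exist, identity (2.32) yields
\begin{align*}
0 &= E(W - W')\bigl(f(W') + f(W)\bigr) \\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2E\,f(W)(W - W') \\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2E\bigl\{f(W)E(W - W' \mid W)\bigr\} \\
&= E(W - W')\bigl(f(W') - f(W)\bigr) + 2\lambda E\left\{\left(W + \frac{E(Q \mid W)}{\sqrt{m}}\right) f(W)\right\}.
\end{align*}
Solving for $EWf(W)$ and then reasoning as in the proof of Lemma 2.7 we obtain
\begin{align*}
EWf(W) &= \frac{1}{2\lambda} E(W' - W)\bigl(f(W') - f(W)\bigr) - \frac{1}{\sqrt{m}} E\,E(Q \mid W) f(W) \\
&= E\int_{-\infty}^{\infty} f'(W + t)\hat{K}(t)\,dt - \frac{1}{\sqrt{m}} E\,E(Q \mid W) f(W)
\end{align*}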
where ˆ = (1{−≤t≤0} − 1{0 0, let Zμ,σ 2 be the discretized normal with distribution given by k − μ − 1/2 k − μ + 1/2 P (Zμ,σ 2 = k) = P 1}
i=1
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_8, © Springer-Verlag Berlin Heidelberg 2011
Non-uniform Bounds for Independent Random Variables
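We prove Proposition 8.1 using the Bennett–Hoeffding inequality.

Lemma 8.1 For some $\alpha > 0$ let $\eta_1, \ldots, \eta_n$ be independent random variables satisfying $E\eta_i \le 0$, $\eta_i \le \alpha$ for all $1 \le i \le n$, and $\sum_{i=1}^{n} E\eta_i^2 \le B^2$. Then, with $S_n = \sum_{i=1}^{n} \eta_i$,
\[ Ee^{tS_n} \le \exp\bigl(\alpha^{-2}(e^{t\alpha} - 1 - t\alpha)B^2\bigr) \quad \text{for } t > 0, \quad (8.5) \]
\[ P(S_n \ge x) \le \exp\left(-\frac{B^2}{\alpha^2}\left[\left(1 + \frac{\alpha x}{B^2}\right)\log\left(1 + \frac{\alpha x}{B^2}\right) - \frac{\alpha x}{B^2}\right]\right), \quad (8.6) \]
and
\[ P(S_n \ge x) \le \exp\left(-\frac{x^2}{2(B^2 + \alpha x)}\right) \quad (8.7) \]
for $x > 0$.

Proof First, one may prove that $(e^s - 1 - s)/s^2$ is an increasing function for $s \in \mathbb{R}$, from which it follows that
\[ e^{ts} \le 1 + ts + (ts)^2 (e^{t\alpha} - 1 - t\alpha)/(t\alpha)^2 \quad (8.8) \]
for $s \le \alpha$, and $t > 0$. Using the properties of the $\eta_i$'s, for $t > 0$ we have
\begin{align*}
Ee^{tS_n} &= \prod_{i=1}^{n} Ee^{t\eta_i} \\
&\le \prod_{i=1}^{n} \bigl(1 + tE\eta_i + \alpha^{-2}(e^{t\alpha} - 1 - t\alpha)E\eta_i^2\bigr) \\
&\le \prod_{i=1}^{n} \bigl(1 + \alpha^{-2}(e^{t\alpha} - 1 - t\alpha)E\eta_i^2\bigr) \\
&\le \exp\bigl(\alpha^{-2}(e^{t\alpha} - 1 - t\alpha)B^2\bigr),
\end{align*}
proving (8.5). To prove inequality (8.6), with $x > 0$ let
\[ t = \frac{1}{\alpha}\log\left(1 + \frac{\alpha x}{B^2}\right). \]
Then by (8.5),
\begin{align*}
P(S_n \ge x) &\le e^{-tx} Ee^{tS_n} \\
&\le \exp\bigl(-tx + \alpha^{-2}(e^{t\alpha} - 1 - t\alpha)B^2\bigr) \\
&= \exp\left(-\frac{B^2}{\alpha^2}\left[\left(1 + \frac{\alpha x}{B^2}\right)\log\left(1 + \frac{\alpha x}{B^2}\right) - \frac{\alpha x}{B^2}\right]\right),
\end{align*}
demonstrating (8.6). Lastly, in view of the fact that
\[ (1 + s)\log(1 + s) - s \ge \frac{s^2}{2(1 + s)} \]
for $s > 0$, (8.7) now follows.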
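As a numerical illustration of the last step, the following sketch evaluates the right hand sides of (8.6) and (8.7) and confirms on a small grid that the bound (8.6) is never larger than (8.7). The function names `bennett_bound` and `bernstein_type_bound` and the grid values are assumptions made for this example; the argument `b2` stands for $B^2$.

```python
import math


def bennett_bound(x, alpha, b2):
    """Right hand side of (8.6), with b2 playing the role of B^2."""
    s = alpha * x / b2
    return math.exp(-(b2 / alpha**2) * ((1 + s) * math.log1p(s) - s))


def bernstein_type_bound(x, alpha, b2):
    """Right hand side of (8.7)."""
    return math.exp(-x**2 / (2 * (b2 + alpha * x)))


def compare_on_grid():
    """Return True if (8.6) <= (8.7) at every grid point, up to rounding."""
    xs = [0.1 * k for k in range(1, 101)]
    for alpha in (0.5, 1.0, 2.0):
        for b2 in (0.5, 1.0, 4.0):
            for x in xs:
                if bennett_bound(x, alpha, b2) > bernstein_type_bound(x, alpha, b2) + 1e-12:
                    return False
    return True


print(compare_on_grid())
```

The gap between the two bounds widens as $\alpha x / B^2$ grows, which is why (8.6) is the form used later to derive (8.10).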
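Although the hypotheses of Lemma 8.1 require that $\eta_i$ be bounded from above for all $i = 1, \ldots, n$, the following result shows how the lemma may nevertheless be applied to unbounded variables.

Lemma 8.2 Let $\eta_1, \eta_2, \ldots, \eta_n$ be independent random variables satisfying $E\eta_i \le 0$ for $1 \le i \le n$ and $\sum_{i=1}^{n} E\eta_i^2 \le B^2$. Then,
\[ P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \alpha\Bigr) + \exp\left(-\frac{B^2}{\alpha^2}\left[\left(1 + \frac{\alpha x}{B^2}\right)\log\left(1 + \frac{\alpha x}{B^2}\right) - \frac{\alpha x}{B^2}\right]\right), \quad (8.9) \]
for all $\alpha > 0$ and $x > 0$. In particular,
\[ P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \frac{x \vee B}{p}\Bigr) + e^p\left(1 + \frac{x^2}{pB^2}\right)^{-p} \quad (8.10) \]
for $x > 0$ and $p \ge 1$.

Proof Letting $\bar{\eta}_i = \eta_i 1_{\{\eta_i \le \alpha\}}$ we have
\[ P(S_n \ge x) \le P\Bigl(\max_{1 \le i \le n} \eta_i > \alpha\Bigr) + P\Bigl(\sum_{i=1}^{n} \bar{\eta}_i \ge x\Bigr). \]
As $\bar{\eta}_i \le \alpha$, $E\bar{\eta}_i = E\eta_i - E\eta_i 1_{\{\eta_i > \alpha\}} \le 0$ and $\sum_{i=1}^{n} E\bar{\eta}_i^2 \le \sum_{i=1}^{n} E\eta_i^2 \le B^2$ we may now apply Lemma 8.1 to yield (8.9). Inequality (8.10) is trivial when $0 < x < B$ and $p \ge 1$, since then $e^p(1 + x^2/(pB^2))^{-p} \ge 1$. For $x > B$, taking $\alpha = x/p$ in the second term in (8.9) yields
\begin{align*}
\exp\left(-\frac{B^2}{\alpha^2}\left[\left(1 + \frac{\alpha x}{B^2}\right)\log\left(1 + \frac{\alpha x}{B^2}\right) - \frac{\alpha x}{B^2}\right]\right)
&\le \exp\left(-\frac{x}{\alpha}\log\left(1 + \frac{\alpha x}{B^2}\right) + p\right) \\
&= e^p\left(1 + \frac{x^2}{pB^2}\right)^{-p}.
\end{align*}
This proves (8.10).

We now proceed to the proof of Proposition 8.1, the non-uniform concentration inequality.

Proof As $\bar{\xi}_i$, $i = 1, \ldots, n$ satisfies the hypotheses of Lemma 8.1 with $\alpha = 1$ and $B^2 = 1$, it follows from (8.5) with $t = 1/2$ that
\begin{align*}
P\bigl(a \le \bar{W}^{(i)} \le b\bigr) &\le P\bigl(0 \le (\bar{W}^{(i)} - a)/2\bigr) \\
&\le e^{-a/2} Ee^{\bar{W}^{(i)}/2} \le e^{-a/2}\exp\left(e^{1/2} - \frac{3}{2}\right) \le 1.2\,e^{-a/2}.
\end{align*}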
Thus, (8.4) holds if 6(β2 + β3 ) ≥ 1.2 or b − a ≥ 1. Hence, it suffices to prove the claim assuming that (β2 + β3 ) ≤ 1.2/6 = 0.2 and b − a < 1. Similar to the proof of the concentration inequality in Lemma 3.1, define δ = (β2 + β3 )/2 and set ⎧ if w < a − δ, ⎨0 w/2 f (w) = e (w − a + δ) if a − δ ≤ w ≤ b + δ, (8.11) ⎩ w/2 e (b − a + 2δ) if w > b + δ. Let ¯ (i) (t) = M¯ i (t) = ξi (1{−xi ¯ i ≤t≤0} − 1{0 0, y > 0, we obtain E|ξj | min δ, |xi ¯ j| ≥ E|ξj |1{|ξj |≤1} min δ, |ξj |1{|ξj |≤1} j =i
≥
j =i n j =1
E|ξj |1{|ξj |≤1} min δ, |ξj |1{|ξj |≤1} − δE|ξi |1{|ξi |≤1}
≥
n
Eξj2 1{|ξj |≤1}
−
E|ξj |3 1{|ξj |≤1}
j =1
4δ
1/3
− δβ3
4δβ2 + β3 1/3 − δβ3 4δ ≥ 0.5 − 0.1(0.2)1/3 ≥ 0.44,
=1−
(8.13)
where we have used δ = (β2 + β3 )/2, δ ≤ 0.1 and β2 + β3 ≤ 0.2 in the final inequality. Hence (8.14) H1 ≥ 0.44 ea/2 P a ≤ W¯ (i) ≤ b . Turning now to H2 , by the Bennett–Hoeffding inequality (8.5) with α = t = B = 1, ¯ (i)
EeW
≤ exp(e − 2).
(8.15)
Hence, by the Cauchy–Schwarz inequality, 1/2 ¯ (i) 1/2 H2 ≤ EeW |ξj | min δ, |xi ¯ j| Var
≤ exp(e/2 − 1)
j =i
2 Eξj2 min δ, |xi ¯ j|
j =i
≤ exp(e/2 − 1)δ
1/2
1/2 Eξj2
j =i
≤ exp(e/2 − 1)δ ≤ 1.44δ.
(8.16)
As to the left hand side of (8.12), we have ¯ (i) E W (i) f W¯ (i) ≤ (b − a + 2δ)E W (i) eW /2 2 1/2 W¯ (i) 1/2 ≤ (b − a + 2δ) E W (i) Ee ≤ (b − a + 2δ) exp(e/2 − 1) ≤ 1.44(b − a + 2δ). Combining the above inequalities and applying the bound δ ≤ 0.1 yields P a ≤ W¯ (i) ≤ b ≤ e−a/2 eδ/2 1.44(b − a + 2δ) + 1.44δ /0.44 ≤ e−a/2 4(b − a) + 12δ ≤ e−a/2 4(b − a) + 6(β2 + β3 ) , completing the proof of (8.4).
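8.2 Non-uniform Berry–Esseen Bounds

We begin our development of non-uniform bounds with the following lemma.

Lemma 8.3 Let $\xi_1, \ldots, \xi_n$ be independent random variables with mean zero satisfying $\sum_{i=1}^{n} \operatorname{Var}(\xi_i) = 1$ and let $W = \sum_{i=1}^{n} \xi_i$. Then for $z \ge 2$ and $p \ge 2$,
\[ P\Bigl(W \ge z, \max_{1 \le i \le n} \xi_i > 1\Bigr) \le 2\sum_{1 \le i \le n} P\left(|\xi_i| > \frac{z}{2p}\right) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2 \]
whenever $\beta_2$, given by (8.3), is bounded by 1.

Proof Beginning with the left hand side, we have
\begin{align*}
P\Bigl(W \ge z, \max_{1 \le i \le n} \xi_i > 1\Bigr)
&\le \sum_{i=1}^{n} P(W \ge z, \xi_i > 1) \\
&\le P\Bigl(\max_{1 \le i \le n} \xi_i > z/2\Bigr) + \sum_{i=1}^{n} P\bigl(W^{(i)} \ge z/2, \xi_i > 1\bigr) \\
&\le \sum_{i=1}^{n} P\bigl(|\xi_i| > z/p\bigr) + \sum_{i=1}^{n} P\bigl(W^{(i)} \ge z/2\bigr)P(\xi_i > 1) \\
&\le \sum_{i=1}^{n} P\bigl(|\xi_i| > z/p\bigr) + \sum_{i=1}^{n} \Bigl[P\Bigl(\max_{1 \le i \le n} |\xi_i| > z/(2p)\Bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\Bigr]P(\xi_i > 1) \quad \text{[by (8.10)]} \\
&\le \sum_{i=1}^{n} P\bigl(|\xi_i| > z/p\bigr) + P\Bigl(\max_{1 \le i \le n} |\xi_i| > z/(2p)\Bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2 \\
&\le 2\sum_{1 \le i \le n} P\bigl(|\xi_i| > z/(2p)\bigr) + e^p\bigl(1 + z^2/(4p)\bigr)^{-p}\beta_2, \quad (8.17)
\end{align*}
since $\beta_2 \le 1$.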
We are now ready to prove the following non-uniform Berry–Esseen inequality, generalizing (8.1).

Theorem 8.1 For every $p \ge 2$ there exists a finite constant $C_p$ depending only on $p$ such that for all $z \in \mathbb{R}$
\[ \bigl|P(W \le z) - \Phi(z)\bigr| \le 2\sum_{i=1}^{n} P\left(|\xi_i| > \frac{1 \vee |z|}{2p}\right) + C_p\bigl(1 + |z|\bigr)^{-p}(\beta_2 + \beta_3), \quad (8.18) \]
where $\beta_2$ and $\beta_3$ are given in (8.3).
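Inequality (8.1) follows from Theorem 8.1 using $(1 + |z|)/2 \le 1 \vee |z|$, and that we may bound the first sum by
\[ \sum_{i=1}^{n} P\left(|\xi_i| > \frac{1 + |z|}{4p}\right) \le \left(\frac{4p}{1 + |z|}\right)^{p}\sum_{i=1}^{n} E|\xi_i|^p, \]
and the remaining terms, since $p \ge 2$, using
\[ \beta_2 = \sum_{i=1}^{n} E\xi_i^2 1_{\{|\xi_i| > 1\}} \le \sum_{i=1}^{n} E|\xi_i|^p 1_{\{|\xi_i| > 1\}} \le \sum_{i=1}^{n} E|\xi_i|^p, \]
and
\[ \beta_3 = \sum_{i=1}^{n} E|\xi_i|^3 1_{\{|\xi_i| \le 1\}} \le \sum_{i=1}^{n} E|\xi_i|^{3 \wedge p} 1_{\{|\xi_i| \le 1\}} \le \sum_{i=1}^{n} E|\xi_i|^{3 \wedge p}. \]
Hence, Theorem 8.1 implies that there exists $C_p$ such that
\[ \bigl|P(W \le z) - \Phi(z)\bigr| \le C_p\bigl(1 + |z|\bigr)^{-p}\sum_{i=1}^{n}\bigl(E|\xi_i|^p + E|\xi_i|^{3 \wedge p}\bigr). \quad (8.19) \]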
Proof By replacing $W$ by $-W$ it suffices to consider $z \ge 0$. As $W$ is the sum of independent variables with mean zero and $\operatorname{Var}(W) \le 1$, by (8.10) with $B = 1$, for all $p \ge 2$ we obtain
\[ P(W > z) \le P\Bigl(\max_{1 \le i \le n} |\xi_i| > \frac{z \vee 1}{p}\Bigr) + e^p\left(1 + \frac{z^2}{p}\right)^{-p}. \]
Thus (8.18) holds if $\beta_2 + \beta_3 \ge 1$, and it suffices to prove the claim assuming that $\beta_2 + \beta_3 < 1$. We may also assume the lower bound $z \ge 2$ holds, as the fact that (8.18) holds for $z$ over any bounded range follows from the uniform bound (3.31) by choosing $C_p$ sufficiently large.

Let $\bar{\xi}_i$, $\bar{W}$ and $\bar{W}^{(i)}$ be defined as in (8.2). We first prove that $P(W > z)$ and $P(\bar{W} > z)$ are close and then prove a non-uniform bound for $\bar{W}$. Observing that
\begin{align*}
\{W > z\} &= \Bigl\{W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr\} \cup \Bigl\{W > z, \max_{1 \le i \le n} \xi_i \le 1\Bigr\} \\
&\subset \Bigl\{W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr\} \cup \{\bar{W} > z\}, \quad (8.20)
\end{align*}
we obtain the upper bound
\[ P(W > z) \le P(\bar{W} > z) + P\Bigl(W > z, \max_{1 \le i \le n} \xi_i > 1\Bigr). \quad (8.21) \]
Clearly $W \ge \bar{W}$, yielding the lower bound,
\[ P(\bar{W} > z) \le P(W > z). \quad (8.22) \]
Hence, in view of Lemma 8.3, to prove (8.18) it suffices to show that P (W¯ ≤ z) − (z) ≤ Ce−z/2 (β2 + β3 )
(8.23)
for some absolute constant C. For z ∈ R let fz be the solution to the Stein equation (2.2), and define ¯ i (1{0≤t≤xi K¯ i (t) = E xi ¯ i } − 1{xi ¯ i ≤t1} ,
(8.24)
i=1
recalling Eξi = 0 we have P (W¯ ≤ z) − (z) = Efz (W¯ ) − E W¯ fz (W¯ ) =
n E ξi2 1{ξi >1} Efz (W¯ ) i=1
+ +
n
1
i=1 −∞ n
E fz W¯ (i) + xi ¯ i − fz W¯ (i) + t K¯ i (t) dt
E{ξi 1{ξi >1} }Efz W¯ (i)
i=1
:= R1 + R2 + R3 .
(8.25)
By (2.80), (2.8) and (8.5), E fz (W¯ ) = E fz (W¯ )1{W¯ ≤z/2} + E fz (W¯ )1{W¯ >z/2} √ 2 ≤ 2π(z/2)ez /8 + 1 1 − (z) + P (W¯ > z/2) √ 2 ¯ ≤ 2π(z/2)ez /8 + 1 1 − (z) + e−z/2 EeW ≤ Ce−z/2 by a standard tail bound on 1 − (z), and hence |R1 | ≤ Cβ2 e−z/2 . Similarly, using (2.3) we obtain Efz
(W¯ (i) ) ≤ Ce−z/2
|R3 | ≤ Cβ2 e−z/2 .
(8.26) and (8.27)
To estimate R2 , use (2.2) to write R2 = R2,1 + R2,2 , where n
R2,1 =
1
i=1 −∞
¯ E(1{W¯ (i) +xi ¯ i ≤z} − 1{W¯ (i) +t≤z} )Ki (t) dt
and R2,2 =
n
1
i=1 −∞
E W¯ (i) + xi ¯ i fz W¯ (i) + xi ¯ i − W¯ (i) + t fz W¯ (i) + t K¯ i (t) dt.
By Proposition 8.1, with C not necessarily the same at each occurrence, n 1 ¯ (i) ≤ z − xi E 1{xi ¯ i | xi ¯ i K¯ i (t) dt R2,1 ≤ ¯ i ≤t} P z − t < W i=1 −∞ n 1
≤C
e−(z−t)/2 E min 1, |xi ¯ i | + |t| + β2 + β3 K¯ i (t) dt
i=1 −∞ n 1 −z/2
≤ Ce
i=1 −∞
E min 1, |xi ¯ i | + |t| + β2 + β3 K¯ i (t) dt.
From (8.24), Ce−z/2
n
1
i=1 −∞
(β2 + β2 )K¯ i (t)dt ≤ Ce−z/2 (β2 + β3 ).
Hence, to prove R2,1 ≤ Ce−z/2 (β2 + β3 ) it suffices to show that n 1 i=1 −∞
E min 1, |xi ¯ i | + |t| K¯ i (t) dt ≤ C(β2 + β3 ).
As 1{0≤t≤xi ¯ i } + 1{xi ¯ i ≤t1
n + E ξi3 1|ξi |≤1
i=1
i=1
= 4(β2 + β3 ), proving (8.29), and therefore (8.28). Similarly we may show R2,1 ≥ −Ce−z/2 (β2 + β3 ). Using Lemma 8.4 below for the second inequality, by the monotonicity of wfz (w) provided by (2.6), it follows that n 1 (i) ¯ + xi ¯i E 1{t≤xi ¯ i fz W¯ (i) + xi ¯ i | xi R2,2 ≤ ¯ i} E W i=1 −∞ (i) ¯
+ t fz W¯ (i) + t K¯ i (t) dt n 1 ≤ Ce−z/2 E min 1, |xi ¯ i | + |t| K¯ i (t) dt −E W
i=1 −∞
≤ Ce
−z/2
(β2 + β3 ),
(8.30)
where we have applied (8.29) for the last inequality. Therefore R2 ≤ Ce−z/2 (β2 + β3 ).
(8.31)
Similarly, we may demonstrate the lower bound R2 ≥ −Ce−z/2 (β2 + β3 ), thus proving the theorem.
It remains to prove the following lemma. Lemma 8.4 With W¯ (i) as in (8.2) and fz the solution to the Stein equation (2.2) for z > 0, for all s ≤ t ≤ 1 E W¯ (i) + t fz W¯ (i) + t − W¯ (i) + s fz W¯ (i) + s (8.32) ≤ Ce−z/2 min 1, |s| + |t| .
Proof Let g(w) = (wfz (w)) . Then E W¯ (i) + t fz W¯ (i) + t − W¯ (i) + s fz W¯ (i) + s t Eg W¯ (i) + u du. =
(8.33)
s
From the formula (2.81) for g(w) and the bound √ 2 2π 1 + w 2 ew /2 (w) + w ≤
2 1 + |w|3
for w ≤ 0,
(8.34)
from (5.4) of Chen and Shao (2001), we obtain the w ≤ 0 case of the first inequality in ⎧ ⎨ 4(1+z2 )(1+z3 ) ez2 /8 (1 − (z)) if w ≤ z/2, 1+|w|3 g(w) ≤ 2 ⎩ 8(1 + z2 )ez /2 (1 − (z)) if z/2 < w < z or w > z. For 0 < w ≤ z/2 we apply the simpler inequality √ 2 2 2 2π 1 + w 2 ew /2 (w) + w ≤ 3 1 + z2 ez /8 + z ≤ 4 1 + z2 ez /8 , and the same reasoning yields the case z/2 ≤ w < z. For w > z we apply (8.34) with −w replacing w, and the inequality e−z /2 4(1 + z2 ) 2
1 − (z) ≥
for z > 0.
Hence, for any u ∈ [s, t], we have Eg W¯ (i) + u = E g W¯ (i) + u 1{W¯ (i) +u≤z/2} + E g W¯ (i) + u 1{W¯ (i) +u>z/2} 4(1 + z2 )(1 + z3 ) z2 /8 ≤E 1 − (z) e 1 + |W¯ (i) + u|3 2 + 8 1 + z2 ez /2 1 − (z) P W¯ (i) + u > z/2 3 −1 ¯ (i) + C(1 + z)e−z+2u Ee2W ≤ Ce−z/2 E 1 + W¯ (i) + u 3 −1 ≤ Ce−z/2 E 1 + W¯ (i) + u + C(1 + z)e−z+2u [by (8.5)] ≤ Ce−z/2 since u ≤ t ≤ 1. Hence, for all s ≤ t ≤ 1 we have t Eg W¯ (i) + u du ≤ Ce−z/2 (t − s) ≤ Ce−z/2 |t| + |s| , s
while also, now using that g(w) ≥ 0 by (2.6), from (8.35),
(8.35)
(8.36)
8 t
s
Eg W¯ (i) + u du ≤
1 −∞
≤ Ce
Eg W¯ (i) + u du
−z/2
E
1
−∞
1
du + Ce 1 + |W¯ (i) + u|3
≤ Ce−z/2 . Using (8.36) and (8.37) in (8.33) now completes the proof.
−z/2
1 −∞
e2u du (8.37)
Chapter 9
Uniform and Non-uniform Bounds Under Local Dependence
In this chapter we continue the study of Stein's method under the types of local dependence that were first considered in Sect. 4.7 for the L1 distance, and also in Sect. 6.2 for the L∞ distance. We follow the work of Chen and Shao (2004), with the aim of establishing both uniform and non-uniform Berry–Esseen bounds having optimal asymptotic rates under various local dependence conditions.

Throughout this chapter, $J$ will denote an index set of cardinality $n$ and $\{\xi_i, i \in J\}$ a random field, that is, an indexed collection of random variables, with zero means and finite variances. Define $W = \sum_{i \in J} \xi_i$ and assume that $\operatorname{Var}(W) = 1$. For $A \subset J$, let $\xi_A = \{\xi_i, i \in A\}$, $A^c = \{j \in J: j \notin A\}$ and $|A|$ the cardinality of $A$.

We introduce the following four local dependence conditions, the first two of which appeared in Sect. 4.7. In each, the set $A_i$ can be thought of as a neighborhood of dependence for $\xi_i$.

(LD1) For each $i \in J$ there exists $A_i \subset J$ such that $\xi_i$ and $\xi_{A_i^c}$ are independent.
(LD2) For each $i \in J$ there exist $A_i \subset B_i \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$ and $\xi_{A_i}$ is independent of $\xi_{B_i^c}$.
(LD3) For each $i \in J$ there exist $A_i \subset B_i \subset C_i \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$, $\xi_{A_i}$ is independent of $\xi_{B_i^c}$, and $\xi_{B_i}$ is independent of $\xi_{C_i^c}$.
(LD4*) For each $i \in J$ there exist $A_i \subset B_i \subset B_i^* \subset C_i^* \subset D_i^* \subset J$ such that $\xi_i$ is independent of $\xi_{A_i^c}$, $\xi_{A_i}$ is independent of $\xi_{B_i^c}$, $\xi_{A_i}$ is independent of $\{\xi_{A_j}, j \in B_i^{*c}\}$, $\{\xi_{A_j}, j \in B_i^*\}$ is independent of $\{\xi_{A_j}, j \in C_i^{*c}\}$, and $\{\xi_{A_j}, j \in C_i^*\}$ is independent of $\{\xi_{A_j}, j \in D_i^{*c}\}$.

It is clear that each condition is implied by the one that follows it, that is, that (LD4*) ⇒ (LD3) ⇒ (LD2) ⇒ (LD1). Roughly speaking, (LD4*) is a version of (LD3) obtained by considering $\{\xi_{A_i}, i \in J\}$ as the basic random variables in the field.

Though the conditions listed are increasingly restrictive, in many cases the weakest one, (LD1), actually implies that (LD2), (LD3) or (LD4*) hold upon taking $B_i = \bigcup_{j \in A_i} A_j$, $C_i = \bigcup_{j \in B_i} A_j$, $B_i^* = \bigcup_{j \in A_i} B_j$, $C_i^* = \bigcup_{j \in B_i^*} B_j$ and $D_i^* = \bigcup_{j \in C_i^*} B_j$. For example, (LD1) implies (LD4*) when $\{\xi_i, i \in J\}$ is the m-dependent random field considered at the end of the next section. We note that Bulinski and Suquet (2002) obtain results for random fields having both negative and positive dependence by Stein's method.

L.H.Y. Chen et al., Normal Approximation by Stein's Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_9, © Springer-Verlag Berlin Heidelberg 2011
9.1 Uniform and Non-uniform Berry–Esseen Bounds

We first present a general uniform Berry–Esseen bound under assumption (LD2). Recall that $|J| = n$.

Theorem 9.1 Let $p \in (2, 4]$ and assume that there exists some $\kappa$ such that (LD2) is satisfied with $|N(B_i)| \le \kappa$ for all $i \in J$, where $N(B_i) = \{j \in J: B_j \cap B_i \ne \emptyset\}$. Then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le (13 + 11\kappa)\sum_{i \in J}\bigl(E|\xi_i|^{3 \wedge p} + E|\eta_i|^{3 \wedge p}\bigr) + 2.5\Bigl(\kappa\sum_{i \in J}\bigl(E|\xi_i|^p + E|\eta_i|^p\bigr)\Bigr)^{1/2}, \]
where $\eta_i = \sum_{j \in A_i} \xi_j$. In particular, if there is some $\theta > 0$ such that $E|\xi_i|^p + E|\eta_i|^p \le \theta^p$ for all $i \in J$, then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le (13 + 11\kappa)n\theta^{3 \wedge p} + 2.5\,\theta^{p/2}\sqrt{\kappa n}. \]
In typical asymptotic regimes $\kappa$ is bounded and $\theta$ is of order $n^{-1/2}$, yielding the order $\kappa n\theta^{3 \wedge p} + \theta^{p/2}\sqrt{\kappa n} = O(n^{-(p-2)/4})$. When fourth moments exist we may take $p = 4$ and obtain the best possible order of $n^{-1/2}$.

Assuming the stronger local dependence condition (LD3) allows us to relax the moment assumptions.

Theorem 9.2 Let $p \in (2, 3]$ and assume that there exists some $\kappa$ such that (LD3) is satisfied with $|N(C_i)| \le \kappa$ for all $i \in J$, where $N(C_i) = \{j \in J: C_i \cap B_j \ne \emptyset\}$. Then
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 75\kappa^{p-1}\sum_{i \in J} E|\xi_i|^p. \]
Under the strongest condition (LD4*) we have the following general non-uniform bound.

Theorem 9.3 Let $p \in (2, 3]$ and assume that (LD4*) is satisfied with $\kappa = \max_{i \in J}\max(|D_i^*|, |\{j: i \in D_j^*\}|)$. Then for all $z \in \mathbb{R}$,
\[ \bigl|P(W \le z) - \Phi(z)\bigr| \le C\kappa^p\bigl(1 + |z|\bigr)^{-p}\sum_{i \in J} E|\xi_i|^p, \]
where $C$ is an absolute constant.
The above results can immediately be applied to m-dependent random fields, indexed, for example, by elements of $\mathbb{N}^d$, the d-dimensional space of positive integers. Letting $|i - j|$ denote the L∞ distance
\[ |i - j| = \max_{1 \le l \le d} |i_l - j_l| \]
between two points $i = (i_1, \ldots, i_d)$ and $j = (j_1, \ldots, j_d)$ in $\mathbb{N}^d$, define the distance $\rho(A, B)$ between two subsets $A$ and $B$ of $\mathbb{N}^d$ by
\[ \rho(A, B) = \inf\bigl\{|i - j|: i \in A, j \in B\bigr\}. \]
For a given subset $J \subset \mathbb{N}^d$, a set of random variables $\{\xi_i, i \in J\}$ is said to be an m-dependent random field if $\{\xi_i, i \in A\}$ and $\{\xi_j, j \in B\}$ are independent whenever $\rho(A, B) > m$, for any subsets $A$ and $B$ of $J$. It is readily verified that if $\{\xi_i, i \in J\}$ is an m-dependent random field then (LD3) and (LD4*) are satisfied by choosing $A_i = \{j \in J: |j - i| \le m\}$, $B_i = \{j \in J: |j - i| \le 2m\}$, $C_i = \{j \in J: |j - i| \le 3m\}$, $B_i^* = \{j \in J: |j - i| \le 3m\}$, $C_i^* = \{j \in J: |j - i| \le 5m\}$, and $D_i^* = \{j \in J: |j - i| \le 7m\}$. Hence, Theorems 9.2 and 9.3 yield the following uniform and non-uniform bounds.

Theorem 9.4 If $\{\xi_i, i \in J\}$ is a zero mean m-dependent random field then for all $p \in (2, 3]$
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 75(10m + 1)^{(p-1)d}\sum_{i \in J} E|\xi_i|^p \quad (9.1) \]
and for all $z \in \mathbb{R}$,
\[ \bigl|P(W \le z) - \Phi(z)\bigr| \le C\bigl(1 + |z|\bigr)^{-p}(14m + 1)^{pd}\sum_{i \in J} E|\xi_i|^p \quad (9.2) \]
where $C$ is an absolute constant.
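The constants $(10m + 1)^d$ and $(14m + 1)^d$ in Theorem 9.4 are the sizes of L∞ balls of radius $5m$ and $7m$ in the integer lattice. The sketch below, with hypothetical helper names `ball_size` and `count_intersecting`, counts these neighborhoods by brute force for a point far from the boundary of $J$; for points near the boundary the counts can only be smaller, so the values serve as upper bounds for $\kappa$.

```python
from itertools import product


def ball_size(radius, d):
    """Number of points of Z^d within L-infinity distance `radius` of a fixed point."""
    return sum(1 for _ in product(range(-radius, radius + 1), repeat=d))


def count_intersecting(m, d):
    """Count j with C_0 ∩ B_j nonempty, where C_0 has radius 3m and B_j has radius 2m.

    Offsets j are searched in a box slightly larger than the expected radius 5m.
    """
    reach = 5 * m + 1
    c_ball = list(product(range(-3 * m, 3 * m + 1), repeat=d))
    count = 0
    for j in product(range(-reach, reach + 1), repeat=d):
        # B_j meets C_0 if some k in C_0 lies within distance 2m of j
        if any(max(abs(kl - jl) for kl, jl in zip(k, j)) <= 2 * m for k in c_ball):
            count += 1
    return count


print(count_intersecting(1, 2), (10 * 1 + 1) ** 2)  # size of N(C_i) for Theorem 9.2
print(ball_size(7 * 1, 2), (14 * 1 + 1) ** 2)       # size of D_i^* for Theorem 9.3
```

The first count corresponds to $\kappa$ in Theorem 9.2 and the second to $\kappa$ in Theorem 9.3, which is how the factors in (9.1) and (9.2) arise.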
9.2 Outline of Proofs

The main ideas behind the proofs of the results in Sect. 9.1 are similar to those in Sects. 3.4.1 and 8.2. First a Stein identity is derived, followed by uniform and non-uniform concentration inequalities. We outline the main steps under the local dependence condition (LD1), referring the reader to Chen and Shao (2004) for further details.

Assume that (LD1) is satisfied and let $\eta_i = \sum_{j \in A_i} \xi_j$. Define
\[ \hat{K}_i(t) = \xi_i\bigl(1(-\eta_i \le t < 0) - 1(0 \le t \le -\eta_i)\bigr), \quad K_i(t) = E\hat{K}_i(t), \]
\[ \hat{K}(t) = \sum_{i \in J} \hat{K}_i(t), \quad \text{and} \quad K(t) = E\hat{K}(t). \quad (9.3) \]
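We first derive a Stein identity for $W$. Let $f$ be a bounded absolutely continuous function. Then, by the independence of $\xi_i$ and $W - \eta_i$,
\begin{align*}
E\,Wf(W) &= \sum_{i \in J} E\,\xi_i\bigl(f(W) - f(W - \eta_i)\bigr) \\
&= \sum_{i \in J} E\,\xi_i\int_{-\eta_i}^{0} f'(W + t)\,dt \\
&= \sum_{i \in J} E\int_{-\infty}^{\infty} f'(W + t)\hat{K}_i(t)\,dt \\
&= E\int_{-\infty}^{\infty} f'(W + t)\hat{K}(t)\,dt. \quad (9.4)
\end{align*}
Now, by virtue of the fact that
\[ \int_{-\infty}^{\infty} K(t)\,dt = E\sum_{i \in J} \xi_i\eta_i = E\sum_{i \in J,\ j \in A_i} \xi_i\xi_j = E\sum_{i \in J,\ j \in J} \xi_i\xi_j = EW^2 = 1, \quad (9.5) \]
we have
\[ Ef'(W) - EWf(W) = E\int_{-\infty}^{\infty} f'(W)K(t)\,dt - E\int_{-\infty}^{\infty} f'(W + t)\hat{K}(t)\,dt. \]
Let
\[ r_1 = \sum_{i \in J} E|\xi_i\eta_i| 1_{\{|\eta_i| > 1\}}, \quad r_2 = \sum_{i \in J} E|\xi_i|\bigl(\eta_i^2 \wedge 1\bigr), \quad \text{and} \quad r_3 = \int_{|t| \le 1} \operatorname{Var}\bigl(\hat{K}(t)\bigr)\,dt. \quad (9.6) \]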
We record some useful inequalities involving integrals of the functions $K(t)$ and $\hat{K}(t)$ in the following lemma, the verification of which follows by simple computations and is therefore omitted.

Lemma 9.1 Let $K(t)$ and $\hat{K}(t)$ be given by (9.3). Then
\[ \left|\int_{|t| > 1} K(t)\,dt\right| \le \int_{|t| > 1} |K(t)|\,dt \le E\int_{|t| > 1} |\hat{K}(t)|\,dt \le r_1 \]
and
\[ \left|\int_{|t| \le 1} tK(t)\,dt\right| \le E\int_{|t| \le 1} |t\hat{K}(t)|\,dt \le 0.5\,r_2. \]
The concentration inequality given by Proposition 9.1 is used in the proof of Theorem 9.1. Similar ideas are applied to prove Theorems 9.2 and 9.3, requiring conditional and non-uniform concentration inequalities, respectively. In the following, sometimes without mention, we will make use of the inequality
\[ ab \le \frac{1}{2}\bigl(ca^2 + b^2/c\bigr) \quad \text{for all } c > 0. \quad (9.7) \]
Inequality (9.7) is an immediate consequence of the inequality resulting from replacing $a$ and $b$ in the simpler special case when $c = 1$ by $\sqrt{c}\,a$ and $b/\sqrt{c}$, respectively.
Proposition 9.1 Assume that (LD1) is satisfied. Then for any real numbers $a < b$,
\[ P(a \le W \le b) \le 0.625(b - a) + 4r_1 + 2.125r_2 + 4r_3, \quad (9.8) \]
where $r_1$, $r_2$ and $r_3$ are given in (9.6).

Proof Since $\hat{K}(t)$ is not necessarily non-negative we cannot use the function defined in (3.32) and must consider a modification. For $a < b$ arbitrary and $\alpha = r_2$ define
\[ f(w) = \begin{cases}
-(b - a + \alpha)/2 & \text{for } w \le a - \alpha, \\
\frac{1}{2\alpha}(w - a + \alpha)^2 - (b - a + \alpha)/2 & \text{for } a - \alpha < w \le a, \\
w - (a + b)/2 & \text{for } a < w \le b, \\
-\frac{1}{2\alpha}(w - b - \alpha)^2 + (b - a + \alpha)/2 & \text{for } b < w \le b + \alpha, \\
(b - a + \alpha)/2 & \text{for } w > b + \alpha.
\end{cases} \]
Then $f'$ is the continuous function given by
\[ f'(w) = \begin{cases}
1 & \text{for } a \le w \le b, \\
0 & \text{for } w \le a - \alpha \text{ or } w \ge b + \alpha, \\
\text{linear} & \text{for } a - \alpha \le w \le a \text{ or } b \le w \le b + \alpha.
\end{cases} \quad (9.9) \]
Clearly $|f(w)| \le (b - a + \alpha)/2$. With this choice of $f$, and $\eta_i$, $\hat{K}(t)$ and $K(t)$ as defined in (9.3), by the Cauchy–Schwarz inequality, $EW^2 = 1$ and (9.4),
\begin{align*}
(b - a + \alpha)/2 \ge EWf(W) &= E\int_{-\infty}^{\infty} f'(W + t)\hat{K}(t)\,dt \\
&= Ef'(W)\int_{|t| \le 1} K(t)\,dt + E\int_{|t| \le 1}\bigl(f'(W + t) - f'(W)\bigr)K(t)\,dt \\
&\quad + E\int_{|t| > 1} f'(W + t)\hat{K}(t)\,dt + E\int_{|t| \le 1} f'(W + t)\bigl(\hat{K}(t) - K(t)\bigr)\,dt \\
&:= H_1 + H_2 + H_3 + H_4. \quad (9.10)
\end{align*}
From (9.5), (9.9) and Lemma 9.1 we obtain
\[ H_1 \ge Ef'(W)(1 - r_1) \ge P(a \le W \le b) - r_1 \quad \text{and} \quad |H_3| \le r_1. \quad (9.11) \]
Moving on to $H_4$, we have
\begin{align*}
|H_4| &\le (1/8)E\int_{|t| \le 1}\bigl(f'(W + t)\bigr)^2\,dt + 2E\int_{|t| \le 1}\bigl(\hat{K}(t) - K(t)\bigr)^2\,dt \\
&\le (b - a + 2\alpha)/8 + 2r_3. \quad (9.12)
\end{align*}
Lastly to bound $H_2$, let
\[ L(\alpha) = \sup_{x \in \mathbb{R}} P(x \le W \le x + \alpha). \]
Then, noting that $f''(w) = \alpha^{-1}\bigl(1_{[a - \alpha, a]}(w) - 1_{[b, b + \alpha]}(w)\bigr)$ a.s., write
\[ H_2 = E\int_{0}^{1}\int_{0}^{t} f''(W + s)\,ds\,K(t)\,dt - E\int_{-1}^{0}\int_{t}^{0} f''(W + s)\,ds\,K(t)\,dt \]
as
\begin{align*}
&\alpha^{-1}\int_{0}^{1}\int_{0}^{t}\bigl(P(a - \alpha \le W + s \le a) - P(b \le W + s \le b + \alpha)\bigr)\,ds\,K(t)\,dt \\
&\quad - \alpha^{-1}\int_{-1}^{0}\int_{t}^{0}\bigl(P(a - \alpha \le W + s \le a) - P(b \le W + s \le b + \alpha)\bigr)\,ds\,K(t)\,dt.
\end{align*}
Now, by Lemma 9.1 and that $\alpha = r_2$,
\begin{align*}
|H_2| &\le \alpha^{-1}\int_{0}^{1}\int_{0}^{t} L(\alpha)\,ds\,|K(t)|\,dt + \alpha^{-1}\int_{-1}^{0}\int_{t}^{0} L(\alpha)\,ds\,|K(t)|\,dt \\
&= \alpha^{-1}L(\alpha)\int_{|t| \le 1} |tK(t)|\,dt \\
&\le \frac{1}{2}\alpha^{-1}r_2L(\alpha) = \frac{1}{2}L(\alpha). \quad (9.13)
\end{align*}
It follows from (9.10)–(9.13) that for all $a < b$
\[ P(a \le W \le b) \le 0.625(b - a) + 0.75\alpha + 2r_1 + 2r_3 + 0.5L(\alpha). \quad (9.14) \]
Substituting $a = x$ and $b = x + \alpha$ in (9.14) and taking supremum over $x$ we obtain
\[ L(\alpha) \le 1.375\alpha + 2r_1 + 2r_3 + 0.5L(\alpha), \]
and hence
\[ L(\alpha) \le 2.75\alpha + 4r_1 + 4r_3. \quad (9.15) \]
Finally combining (9.14) and (9.15), and again recalling $\alpha = r_2$, we obtain (9.8).

Using Proposition 9.1 we prove the following Berry–Esseen bound for random fields satisfying (LD1), which enables one to derive Theorem 9.1. We leave details to the reader.
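Theorem 9.5 Under (LD1) we have
\[ \sup_{z \in \mathbb{R}} \bigl|P(W \le z) - \Phi(z)\bigr| \le 3.9r_1 + 5.8r_2 + 4.6r_3 + r_4 + 0.5r_5 + 1.5r_6 \]
where $r_1$, $r_2$ and $r_3$ are defined in (9.6), and
\[ r_4 = E\Bigl|\sum_{i \in J}(\xi_i\eta_i - E\xi_i\eta_i)\Bigr|, \quad r_5 = \sum_{i \in J} E|W\xi_i|\bigl(\eta_i^2 \wedge 1\bigr) \quad \text{and} \quad r_6 = \Bigl(\int_{|t| \le 1} |t|\operatorname{Var}\bigl(\hat{K}(t)\bigr)\,dt\Bigr)^{1/2}. \]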
Proof For z ∈ R and α > 0 let f be the solution of Stein equation (2.4) for the smoothed indicator function hz,α (w) given in (2.14). Substituting f into identity (9.4) and using (9.5) we obtain
E f (W ) − Wf (W )
∞
ˆ ˆ =E f (W ) − f (W + t) K(t)dt f (W ) K(t) − K(t) dt + E |t|>1 −∞
ˆ − K(t) dt f (W ) − f (W + t) K(t) +E |t|≤1
f (W ) − f (W + t) K(t)dt +E |t|≤1
:= R1 + R2 + R3 + R4 . By calculating as in (9.5), and applying the second inequality in (2.15) of Lemma 2.5 we obtain |R1 | = Ef (W ) (ξi ηi − Eξi ηi ) ≤ r4 , i∈J
and by the final inequality in (2.15), and Lemma 9.1, we have
f (W ) − f (W + t)K(t) ˆ dt ≤ ˆ dt ≤ r1 . E K(t) |R2 | ≤ E |t|>1
|t|>1
Applying the simple change of variable u = rt to the bound (2.16) of Lemma 2.5 on the smoothed indicator solution, we have f (w + t) − f (w)
1 1 ≤ |t| 1 + |w| + 1[z,z+α] (w + rt)dr α 0 1 t = 1 + |w| |t| + 1(z ≤ w + u ≤ z + α)du (9.16) α 0 ≤ 1 + |w| |t| + 1(z − 0 ∨ t ≤ w ≤ z − 0 ∧ t + α). (9.17) For R3 , the bound (9.17) will produce two terms. For the first,
E
ˆ − K(t)dt 1 + |W | |t|K(t)
|t|≤1
=E
|t|≤1
ˆ − K(t)dt + E|W | |t|K(t)
|t|≤1
ˆ − K(t)dt. |t|K(t)
Applying the triangle inequality and the bounds from Lemma 9.1, the first term above is bounded by r2 . Similarly, the second term may be bounded by 0.5r5 +0.5r2 . Hence |R3 | ≤ 1.5r2 + 0.5r5 + R3,1 + R3,2 , where
R3,1 = E
1
ˆ − K(t)dt 1(z − t ≤ W ≤ z + α)K(t)
and
0
R3,2 = E
0
−1
ˆ − K(t)dt. 1(z ≤ W ≤ z − t + α)K(t)
Let δ = 0.625α + 4r1 + 2.125r2 + 4r3 . Then by Proposition 9.1, P (z − t ≤ W ≤ z + α) ≤ δ + 0.625t
(9.18)
for t ≥ 0. Hence, 1 0.5α(δ + 0.625t)−1 1(z − t ≤ W ≤ z + α) R3,1 ≤ E 0 ˆ − K(t)2 dt + 0.5α −1 (δ + 0.625t)K(t) ≤ 0.5α + 0.5α −1 δ
1
ˆ Var K(t) dt + 0.32α −1
0
1
ˆ t Var K(t) dt.
0
As a corresponding upper bound holds for R3,2 , we arrive at |R3 | ≤ α + 0.5α −1 δr3 + 0.32α −1 r62 + 1.5r2 + 0.5r5 . By (9.16), (9.18) with t = 0, and Lemma 9.1 we have
|R4 | ≤ E 1 + |W | tK(t)dt |t|≤1 t
P (z ≤ W + u ≤ z + α)duK(t)dt + α −1 |t|≤1 0
−1 δ tK(t)dt ≤ r2 + 0.5α −1 δr2 . ≤ r2 + α |t|≤1
Combining the above inequalities yields Ehz,α (W ) − N hz,α
≤ r4 + r1 + 2.5r2 + 0.5r5 + α + α −1 δ(0.5r3 + 0.5r2 ) + 0.32r62 ≤ r4 + r1 + 2.82r2 + 0.5r5 + 0.32r3 + α
+ α −1 (4r1 + 2.125r2 + 4r3 )(0.5r3 + 0.5r2 ) + 0.32r62 .
(9.19)
Using the fact that Ehz−α,α (W ) ≤ P (W ≤ z) ≤ Ehz,α (W ) and that |(z + α) − (z)| ≤ (2π)−1/2 α, we have supP (W ≤ z) − (z) ≤ supEhz,α (W ) − N hz,α + 0.5α. z∈R
z∈R
Letting 1/2 α = (4r1 + 2.125r2 + 4r3 )(0.5r3 + 0.5r2 ) + 0.32r62 and applying the inequality (a + b)1/2 ≤ a 1/2 + b1/2 yields supP (W ≤ z) − (z) z∈R
≤ r4 + r1 + 2.82r2 + 0.5r5 + 0.32r3 1/2 + 2.5 (4r1 + 2.125r2 + 4r3 )(0.5r3 + 0.5r2 ) + 0.32r62 ≤ r4 + r1 + 2.82r2 + 0.5r5 + 0.32r3 + 1.5r6 1/2 + 2 (4r1 + 2.125r2 + 4r3 )(r3 + r2 ) . Now, applying inequality (9.7) on the last term, we obtain supP (W ≤ z) − (z) z∈R
≤ r4 + r1 + 2.82r2 + 0.5r5 + 0.32r3 + 1.5r6 √ + 2 0.5(4r1 + 2.125r2 + 4r3 ) + 2(0.5r3 + 0.5r2 ) ≤ r4 + 3.9r1 + 5.8r2 + 0.5r5 + 4.6r3 + 1.5r6 , completing the proof of Theorem 9.5.
We remark that if we use the Stein solution for the indicator hz (w) = 1(−∞,z] (w) instead of the one for the smoothed indicator hz,α (w), then the final integral in (9.19) can be no more than δ |t|≤1 |K(t)|dt, a term which is not clearly bounded by ≤ 1 + r1 . |t|≤1 |K(t)|dt, though |t|≤1 K(t)dt Under (LD2), letting τi = j ∈Bi ξj , the proof of Theorem 9.2 is based on a conditionalconcentration inequality for P (aτi ≤ W ≤ bτi |τi ), where τi = (ξ, ηi , ζi ), ζi = j ∈Bi ξj and aτi ≤ bτi are measurable functions of τi , while the proof of Theorem 9.3 relies on a non-uniform concentration inequality for E((1 + W )3 1{aτi ≤W ≤bτi } |τi ). We refer to Chen and Shao (2004) for details.
9.3 Applications The following three applications of our local dependence results were considered in Chen and Shao (2004).
Example 9.1 (Dependency Graphs) This example was discussed in Baldi and Rinott (1989) and Rinott (1994) where some results on uniform bounds were obtained. Consider a set of random variables $\{X_i, i \in V\}$ indexed by the vertices of a graph $G = (V, E)$. $G$ is said to be a dependency graph if for any pair of disjoint sets $\Gamma_1$ and $\Gamma_2$ in $V$ such that no edge in $E$ has one endpoint in $\Gamma_1$ and the other in $\Gamma_2$, the sets of random variables $\{X_i, i \in \Gamma_1\}$ and $\{X_i, i \in \Gamma_2\}$ are independent. Let $D$ denote the maximal degree of $G$, i.e., the maximal number of edges incident to a single vertex. Let $A_i = \{j \in V: \text{there is an edge connecting } j \text{ and } i\}$, $B_i = \bigcup_{j \in A_i} A_j$, $C_i = \bigcup_{j \in B_i} A_j$, $B_i^* = \bigcup_{j \in A_i} B_j$, $C_i^* = \bigcup_{j \in B_i^*} B_j$ and $D_i^* = \bigcup_{j \in C_i^*} B_j$. Noting that
\[ |A_i| \le D, \quad |B_i| \le D^2, \quad |C_i| \le D^3, \quad |B_i^*| \le D^3, \quad |C_i^*| \le D^5 \quad \text{and} \quad |D_i^*| \le D^7, \]
we have that
\[ \kappa_1 = \bigl|\{j \in J: C_i \cap B_j \ne \emptyset\}\bigr| \le D^5 \quad \text{and} \quad \kappa_2 = \max_{i \in J}\bigl(|D_i^*|, |\{j: i \in D_j^*\}|\bigr) \le D^7. \]
Hence, applying Theorem 9.2 with $\kappa = \kappa_1$, and Theorem 9.3 with $\kappa = \kappa_2$, yields the following theorem.

Theorem 9.6 Let $\{X_i, i \in V\}$ be random variables indexed by the vertices of a dependency graph. Put $W = \sum_{i \in V} X_i$. Assume that $EW^2 = 1$, $EX_i = 0$ and $E|X_i|^p \le \theta^p$ for $i \in V$ and for some $\theta > 0$. Then
\[ \sup_{z} \bigl|P(W \le z) - \Phi(z)\bigr| \le 75D^{5(p-1)}|V|\theta^p \quad (9.20) \]
and for $z \in \mathbb{R}$,
\[ \bigl|P(W \le z) - \Phi(z)\bigr| \le C(1 + |z|)^{-p}D^{7p}|V|\theta^p. \]
The bound (9.20) compares favorably with those of Baldi and Rinott (1989).

Example 9.2 (Exceedances of the m-scans process) Let $X_1, X_2, \ldots$ be i.i.d. random variables and let $R_i = \sum_{k=0}^{m-1} X_{i+k}$, $i = 1, 2, \ldots, n$ be the m-scans process. For $a \in \mathbb{R}$ consider the number of exceedances of $a$ by $\{R_i: i = 1, \ldots, n\}$,
\[ Y = \sum_{i=1}^{n} 1\{R_i > a\}. \]
Assessing the statistical significance of exceedances of scan statistics in one and higher dimensions plays a key role in many areas of applied statistics, and is a well studied problem, see, for example Glaz et al. (2001) and Naus (1982). Scan statistics have been used, for example, for the evaluation of the significance of observed inhomogeneities in the distribution of markers along the length of long DNA sequences, see Dembo and Karlin (1992), and Karlub and Brede (1992). Dembo and Rinott
9.3 Applications
255
(1996) obtain a uniform Berry–Esseen bound for Y , of the best possible order, as n → ∞. Let p = P (R1 > a) and σ 2 = Var(Y ). From Dembo and Rinott (1996) we have 2 σ ≥ np(1 − p), and that {1{Ri > a}, 1 ≤ i ≤ n} are m-dependent. Let Y − np ξi = σ n
W=
where ξi = 1(Ri > a) − p /σ.
i=1
Since
σ2
≥ np(1 − p), we have
n 1 p(1 − p)3 + p 3 (1 − p) np(1 − p) . E ξi3 = n ≤ ≤√ 3 3 σ σ np(1 − p) i=1
Hence the following non-uniform bound is a consequence of Theorem 9.4. Theorem 9.7 There exists a universal constant C such that for all z ∈ R, P (W ≤ z) − (z) ≤
Cm3 . √ (1 + |z|)3 np(1 − p)
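To make the objects of Example 9.2 concrete, the sketch below counts exceedances of the m-scans process with a sliding window and evaluates the third absolute central moment of an indicator, which is the quantity $p(1-p)^3 + p^3(1-p)$ appearing in the display above. The function names are hypothetical, and the convention that a sample of length $n + m - 1$ yields $n$ scans is an assumption made for illustration.

```python
def scan_exceedances(x, m, a):
    """Return Y, the number of windows R_i = x[i] + ... + x[i+m-1] with R_i > a.

    Every full window of length m inside x is used, so len(x) = n + m - 1
    observations give n scans.
    """
    n = len(x) - m + 1
    if n <= 0:
        raise ValueError("need at least m observations")
    window = sum(x[:m])
    count = 1 if window > a else 0
    for i in range(1, n):
        # Slide the window one step: add the entering value, drop the leaving one.
        window += x[i + m - 1] - x[i - 1]
        if window > a:
            count += 1
    return count


def third_moment_factor(p):
    """E|1{R > a} - p|^3 for an indicator with success probability p."""
    return p * (1 - p) ** 3 + p ** 3 * (1 - p)
```

Since $(1-p)^2 + p^2 \le 1$, the value returned by `third_moment_factor(p)` never exceeds $p(1-p)$, which is the step used to pass to $np(1-p)/\sigma^3$.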
Chapter 10
Uniform and Non-uniform Bounds for Non-linear Statistics
In this chapter we consider uniform and non-uniform Berry–Esseen bounds for nonlinear statistics that can be written as a linear statistic plus an error term. We apply our results to U -statistics, multi-sample U -statistics, L-statistics, random sums, and functions of non-linear statistics, obtaining bounds with optimal asymptotic rates. The main tools are uniform and non-uniform randomized concentration inequalities. The work of Chen and Shao (2007) forms the basis of this chapter.
10.1 Introduction and Main Results

Let $X_1, X_2, \ldots, X_n$ be independent random variables and let $T := T(X_1, \ldots, X_n)$ be a general sampling statistic. In many cases of interest $T$ can be written as a linear statistic plus a manageable error term, that is, as $T = W + \Delta$ where

$$W = \sum_{i=1}^n g_{n,i}(X_i), \quad\text{and}\quad \Delta := \Delta(X_1, \ldots, X_n) = T - W,$$

for some functions $g_{n,i}$. Let $\xi_i = g_{n,i}(X_i)$. We assume that

$$E\xi_i = 0 \ \text{ for } i = 1, 2, \ldots, n, \quad\text{and}\quad \sum_{i=1}^n \operatorname{Var}(\xi_i) = 1, \tag{10.1}$$

and also that $\Delta$ depends on $X_i$ only through $g_{n,i}(X_i)$, that is, with slight abuse of notation, $\Delta = \Delta(\xi_1, \ldots, \xi_n)$. It is clear that if $\Delta \to 0$ in probability as $n \to \infty$ then the central limit theorem holds for $T$ provided the Lindeberg condition is satisfied. If in addition $E|\Delta|^p < \infty$ for some $p > 0$, then by the Chebyshev inequality followed by a simple minimization, one can obtain the following uniform bound,

$$\sup_{z \in \mathbb{R}} \big|P(T \le z) - \Phi(z)\big| \le \sup_{z \in \mathbb{R}} \big|P(W \le z) - \Phi(z)\big| + 2\big(E|\Delta|^p\big)^{1/(1+p)}, \tag{10.2}$$
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_10, © Springer-Verlag Berlin Heidelberg 2011
where the first term on the right hand side of (10.2) may be readily estimated by the Berry–Esseen inequality. However, after the addition of the second term the resulting bound will not generally be sharp for many commonly used statistics. Taking a different approach, by developing randomized versions of the concentration inequalities in Sects. 3.4.1 and 8.1, we can establish uniform and non-uniform Berry–Esseen bounds for $T$ with optimal asymptotic rates. Let $\delta > 0$ satisfy

$$\sum_{i=1}^n E|\xi_i| \min\big(\delta, |\xi_i|\big) \ge 1/2 \tag{10.3}$$

and recall that

$$\beta_2 = \sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| > 1\}} \quad\text{and}\quad \beta_3 = \sum_{i=1}^n E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}. \tag{10.4}$$
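The following approximation of $T$ by $W$ provides our uniform Berry–Esseen bound for $T$.

Theorem 10.1 Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. For each $i = 1, \ldots, n$, let $\Delta_i$ be a random variable such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent. Then for any $\delta$ satisfying (10.3),

$$\sup_{z \in \mathbb{R}} \big|P(T \le z) - P(W \le z)\big| \le 4\delta + E|W\Delta| + \sum_{i=1}^n E\big|\xi_i(\Delta - \Delta_i)\big|. \tag{10.5}$$

In particular,

$$\sup_{z \in \mathbb{R}} \big|P(T \le z) - P(W \le z)\big| \le 2(\beta_2 + \beta_3) + E|W\Delta| + \sum_{i=1}^n E\big|\xi_i(\Delta - \Delta_i)\big| \tag{10.6}$$

and

$$\sup_{z \in \mathbb{R}} \big|P(T \le z) - \Phi(z)\big| \le 6.1(\beta_2 + \beta_3) + E|W\Delta| + \sum_{i=1}^n E\big|\xi_i(\Delta - \Delta_i)\big|. \tag{10.7}$$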
With $\|X\|_2$ denoting the $L^2$ norm $\|X\|_2 = (EX^2)^{1/2}$ of a random variable $X$, we now provide a corresponding non-uniform bound.

Theorem 10.2 Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. For each $1 \le i \le n$, let $\Delta_i$ be a random variable such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent. Then for $\delta$ satisfying (10.3), and any $p \ge 2$,

$$\big|P(T \le z) - P(W \le z)\big| \le \gamma_{z,p} + e^{-|z|/3}\tau \quad\text{for all } z \in \mathbb{R}, \tag{10.8}$$

where

$$\gamma_{z,p} = P\big(|\Delta| > (|z| + 1)/3\big) + 2\sum_{i=1}^n P\big(|\xi_i| > (|z| + 1)/(6p)\big) + e^p\big(1 + z^2/(36p)\big)^{-p}\beta_2 \tag{10.9}$$

and

$$\tau = 22\delta + 8.6\|\Delta\|_2 + 3.6\sum_{i=1}^n \|\xi_i\|_2 \|\Delta - \Delta_i\|_2.$$

If $E|\xi_i|^p < \infty$ for some $p > 2$, then for some constant $C_p$ depending on $p$ only,

$$\big|P(T \le z) - \Phi(z)\big| \le P\big(|\Delta| > (|z| + 1)/3\big) + \frac{C_p}{(|z| + 1)^p}\left(\|\Delta\|_2 + \sum_{i=1}^n \|\xi_i\|_2\|\Delta - \Delta_i\|_2 + \sum_{i=1}^n \big(E|\xi_i|^p + E|\xi_i|^{3 \wedge p}\big)\right). \tag{10.10}$$

The following remark shows how to choose $\delta$ so that (10.3) is satisfied.

Remark 10.1 (i) When $E|\xi_i|^p < \infty$ for $p > 2$ then one may verify that

$$\delta = \left(\frac{2(p-2)^{p-2}}{(p-1)^{p-1}} \sum_{i=1}^n E|\xi_i|^p\right)^{1/(p-2)} \tag{10.11}$$

satisfies (10.3) using the inequality

$$\min(x, y) \ge y - \frac{(p-2)^{p-2}\, y^{p-1}}{(p-1)^{p-1}\, x^{p-2}} \quad\text{for } x > 0,\ y \ge 0. \tag{10.12}$$

Inequality (10.12) is trivial when $y \le x$. For $y > x$ the inequality follows by replacing $x$ and $y$ by $x/(p-1)$ and $y/(p-2)$, respectively, resulting in the inequality

$$1 \le \frac{p-2}{p-1}\left(\frac{x}{y}\right) + \frac{1}{p-1}\left(\frac{y}{x}\right)^{p-2},$$

which holds as the function

$$\frac{p-2}{p-1}\,a + \frac{1}{p-1}\,a^{2-p} \quad\text{for } a > 0$$

has a minimum of 1 at $a = 1$.

(ii) If $\beta_2 + \beta_3 \le 1/2$, then (10.3) holds with $\delta = (\beta_2 + \beta_3)/2$. In fact, as (10.12) for $p = 3$ yields $\min(x, y) \ge y - y^2/(4x)$, we have

$$\sum_{i=1}^n E|\xi_i|\min\big(\delta, |\xi_i|\big) \ge \sum_{i=1}^n E|\xi_i| 1_{\{|\xi_i| \le 1\}}\min\big(\delta, |\xi_i|\big) \ge \sum_{i=1}^n \left(E\xi_i^2 1_{\{|\xi_i| \le 1\}} - \frac{E|\xi_i|^3 1_{\{|\xi_i| \le 1\}}}{4\delta}\right)$$

$$= 1 - \frac{4\delta\beta_2 + \beta_3}{4\delta} \ge 1 - \frac{\beta_2 + \beta_3}{4\delta} = 1/2.$$

(iii) Recalling (10.1), we see that if $\delta > 0$ satisfies

$$\sum_{i=1}^n E\xi_i^2 1_{\{|\xi_i| \ge \delta\}} < 1/2,$$

then (10.3) holds. In particular, when $\xi_i$, $1 \le i \le n$, are standardized i.i.d. random variables, then $\delta$ may be taken to be of the order $1/\sqrt{n}$, which may be much smaller than $\beta_2 + \beta_3$.

We turn now to our applications, deferring the proofs of Theorems 10.1 and 10.2 to Sect. 10.3.
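As a numerical companion to part (i) of Remark 10.1, the sketch below computes $\delta$ from (10.11) and evaluates the left hand side of (10.3) for $n$ i.i.d. copies of a discrete random variable. The function names and the choice of scaled Rademacher summands $\xi_i = \pm 1/\sqrt{n}$ are assumptions made only for illustration.

```python
import math


def delta_from_moments(abs_moment_sum, p):
    """delta of (10.11); abs_moment_sum is sum_i E|xi_i|^p and p > 2."""
    c = 2.0 * (p - 2) ** (p - 2) / (p - 1) ** (p - 1)
    return (c * abs_moment_sum) ** (1.0 / (p - 2))


def condition_10_3_lhs(values, probs, n, delta):
    """sum_i E|xi_i| min(delta, |xi_i|) for n i.i.d. copies of a discrete xi."""
    single = sum(q * abs(v) * min(delta, abs(v)) for v, q in zip(values, probs))
    return n * single


n = 400
values = [-1 / math.sqrt(n), 1 / math.sqrt(n)]  # xi_i = +-1/sqrt(n), so sum Var = 1
probs = [0.5, 0.5]
p = 3
moment_sum = n * sum(q * abs(v) ** p for v, q in zip(values, probs))
delta = delta_from_moments(moment_sum, p)
lhs = condition_10_3_lhs(values, probs, n, delta)
```

For this example (10.11) with $p = 3$ reduces to $\delta = \tfrac12 \sum_i E|\xi_i|^3 = 1/(2\sqrt{n})$, and the left hand side of (10.3) equals $1/2$ up to rounding, so the choice is exactly on the boundary of the condition; this matches the order $1/\sqrt{n}$ mentioned in part (iii).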
10.2 Applications Theorems 10.1 and 10.2 can be applied to a wide range of different statistics, providing bounds of the best possible order in many instances. To illustrate the usefulness and generality of these results we present the following five applications.
10.2.1 U -statistics Let X1 , X2 , . . . , Xn be a sequence of i.i.d. random variables, and for some m ≥ 2 let h(x1 , . . . , xm ) be a symmetric, real-valued function, where m < n/2 may depend on n. Introduced by Hoeffding (1948), the class of U -statistics are those random variables that can be written as −1 n Un = h(Xi1 , . . . , Xim ). (10.13) m 1≤i1 c0 σ1 ) ≤ σ12 /2. If in addition E|g(X1 )|p < ∞ for some 2 < p ≤ 3, then √ √ 6.1E|g(X1 )|p n (1 + 2)(m − 1)σ Un ≤ z − (z) ≤ (p−2)/2 p + , supP mσ1 (m(n − m + 1))1/2 σ1 n σ1 z∈R (10.15) and there exists a universal constant C such that for all z ∈ R √ n P U ≤ z − (z) n mσ1 C E|g(X1 )|p m1/2 σ 9mσ 2 + + ≤ p . (|z| + 1)2 (n − m + 1)σ12 (|z| + 1)p (n − m + 1)1/2 σ1 n(p−2)/2 σ1 (10.16) Note that for the error bound in (10.14) to be of order O(n−1/2 ) it is necessary that σ 2 , the second moment of h, be finite. However, requiring σ 2 < ∞ is not the weakest assumption under which the uniform bound at this rate is known to hold; Friedrich (1989) obtained the order O(n−1/2 ) when E|h|5/3 < ∞. It would be interesting to use Stein’s method to obtain this same result. We refer to Benkus et al. (1994) and Jing and Zhou (2005) for a discussion regarding the necessity of the moment condition. For 1 ≤ k ≤ m, let hk (x1 , . . . , xk ) = E h(X1 , . . . , Xm )|X1 = x1 , . . . , Xk = xk , h¯ k (x1 , . . . , xk ) = hk (x1 , . . . , xk ) − √ −1 n n = mσ1 m and for l ∈ {1, . . . , n}, √ −1 n n l = mσ1 m
k
g(xi ),
i=1
h¯ m (Xi1 , . . . , Xim ),
(10.17)
1≤i1 0.5nν + supP Tn (Z1 ) ≤ z − P Tn (Z1 , Z2 ) ≤ z
z∈R
σ E|Z1 | ≤ 2 +C + 1/2 √ nν n μ ν 2 3 EY σ τ ≤ Cn−1/2 2 + 31 + √ . ν τ μ ν 4τ 2
EY13 n1/2 τ 3
(10.42)
As Tn (Z1 , Z2 ) has a standard normal distribution, (10.39) follows from (10.40) and (10.42).
10.2.5 Functions of Non-linear Statistics

Let $X_1, X_2, \ldots, X_n$ be a random sample and $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ a weakly consistent estimator of an unknown parameter $\theta$. Assume that $\hat\theta_n$ can be written as

$$\hat\theta_n = \theta + \frac{1}{\sqrt{n}}\left(\sum_{i=1}^n \xi_i + \Delta\right) \tag{10.43}$$

where $\xi_i = g_{n,i}(X_i)$ are functions of $X_i$ satisfying $E\xi_i = 0$ and $\sum_{i=1}^n E\xi_i^2 = 1$, and $\Delta := \Delta_n(X_1, \ldots, X_n) \to 0$ in probability. Under these conditions,

$$\sqrt{n}(\hat\theta_n - \theta) \to_d N(0, 1) \quad\text{as } n \to \infty.$$

The classes of U-statistics, multi-sample U-statistics and L-statistics discussed in previous subsections fit into this setting. The so called 'Delta Method' in statistics (see Theorem 7 of Ferguson 1996, for instance) allows us to determine the asymptotic distribution of functions of the estimator $\hat\theta_n$. In particular, if $h$ is differentiable in a neighborhood of $\theta$ with $h'$ continuous at $\theta$ with $h'(\theta) \ne 0$, then

$$\frac{\sqrt{n}\big(h(\hat\theta_n) - h(\theta)\big)}{h'(\theta)} \to_d N(0, 1). \tag{10.44}$$

Of course, results that give some idea as to the accuracy of the Delta Method are of interest. When $\hat\theta_n$ is the sample mean, the Berry–Esseen bound and Edgeworth expansion have been well studied (see Bhattacharya and Ghosh 1978). The next theorem shows that the results in Sect. 10.1 can be extended to functions of non-linear statistics.

Theorem 10.7 Suppose the statistic $\hat\theta_n$ may be expressed in the form (10.43) where $\xi_1, \ldots, \xi_n$ are independent random variables with mean zero and satisfy $\operatorname{Var}(W) = 1$ where $W = \sum_{i=1}^n \xi_i$. Assume that $h'(\theta) \ne 0$ and $\delta(c_0) = \sup_{|x - \theta| \le c_0} |h''(x)| < \infty$ for some $c_0 > 0$. Then for all $2 < p \le 3$,

$$\sup_{z \in \mathbb{R}}\left|P\left(\frac{\sqrt{n}\big(h(\hat\theta_n) - h(\theta)\big)}{h'(\theta)} \le z\right) - \Phi(z)\right| \le \left(1 + \frac{2c_0\delta(c_0)}{|h'(\theta)|}\right)\left(E|W\Delta| + \sum_{i=1}^n E\big|\xi_i(\Delta - \Delta_i)\big|\right)$$

$$+ \ 6.1\sum_{i=1}^n E|\xi_i|^p + \frac{4}{c_0^2 n} + \frac{2E|\Delta|}{c_0 n^{1/2}} + \frac{3.4\,c_0^{3-p}\,\delta(c_0)}{|h'(\theta)|\,n^{(p-2)/2}} + \frac{n^{-1/2}\delta(c_0)}{|h'(\theta)|}\sum_{i=1}^n E\xi_i^2\, E|\Delta_i|, \tag{10.45}$$

for any $\Delta_i$, $i = 1, \ldots, n$, such that $\xi_i$ and $(W - \xi_i, \Delta_i)$ are independent.
Naturally, Theorem 10.7 may be applied as well to functions of linear statistics, and before turning to the proof we present the following simple example. Suppose one is making inference based on a random sample $X_1, \ldots, X_n$ from the Poisson distribution with unknown parameter $\lambda > 0$. Letting

$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i$$

be the sample mean, the central limit theorem yields

$$\sqrt{n}(\bar X_n - \lambda) \to_d N(0, \lambda). \tag{10.46}$$

As the limiting distribution depends on the very parameter one is trying to estimate, a confidence interval based directly on (10.46) would depend on the estimate $\bar X_n$ of $\lambda$ not only, naturally, for its centering, but also for its length, thus contributing some additional, unwanted, uncertainty. However, here, as in other important cases of interest, there exists a variance stabilizing transformation, that is, a function $g$ such that the standardized asymptotic distribution of $g(\bar X_n)$ does not depend on $\lambda$. For the case at hand, a direct application of the Delta method with $g(x) = 2\sqrt{x}$ yields

$$\sqrt{n}\big(2\sqrt{\bar X_n} - 2\sqrt{\lambda}\big) \to_d N(0, 1). \tag{10.47}$$

In (10.47) the limiting distribution is known even though $\lambda$ is not, and the resulting confidence interval, for $2\sqrt{\lambda}$, will use an estimate of $\lambda$ only for centering. To apply Theorem 10.7 to calculate a bound on the error in the normal approximation justified by (10.47), we note that (10.43) holds with $\Delta = 0$ upon letting

$$\hat\theta_n = \frac{\bar X_n}{\sqrt{\lambda}}, \quad \theta = \sqrt{\lambda} \quad\text{and}\quad \xi_i = \frac{X_i - \lambda}{\sqrt{\lambda n}}.$$

With these choices (10.47) is equivalent to (10.44) when $h(x) = 2\lambda^{1/4}\sqrt{x}$; in particular, note that $h'(\sqrt{\lambda}) = 1$. With $\Delta_i = 0$, Theorem 10.7 yields for, say, $p = 3$ and $c_0 = \sqrt{\lambda}/2$, that

$$\sup_{z \in \mathbb{R}}\Big|P\big(\sqrt{n}(2\sqrt{\bar X_n} - 2\sqrt{\lambda}) \le z\big) - \Phi(z)\Big| \le \frac{6.1\,E|X_1 - \lambda|^3}{\lambda^{3/2}\sqrt{n}} + \frac{4.9}{\sqrt{\lambda n}} + \frac{16}{\lambda n}.$$
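The right hand side of the bound just displayed is explicit once $E|X_1 - \lambda|^3$ is known, so it can be evaluated numerically. The sketch below does this under the assumption that $\lambda$ is moderate (so that $e^{-\lambda}$ does not underflow) and that truncating the Poisson series about twenty standard deviations above the mean is accurate enough; both function names are hypothetical.

```python
import math


def poisson_abs_third_central_moment(lam):
    """Approximate E|X - lam|^3 for X ~ Poisson(lam) by a truncated series."""
    upper = int(lam + 20.0 * math.sqrt(lam) + 50.0)  # truncation point (assumption)
    pmf = math.exp(-lam)  # P(X = 0)
    total = 0.0
    for k in range(upper + 1):
        total += pmf * abs(k - lam) ** 3
        pmf *= lam / (k + 1)  # recursion giving P(X = k + 1)
    return total


def delta_method_bound(lam, n):
    """Right hand side of the bound for sqrt(n) * (2 sqrt(mean) - 2 sqrt(lam))."""
    m3 = poisson_abs_third_central_moment(lam)
    return (
        6.1 * m3 / (lam ** 1.5 * math.sqrt(n))
        + 4.9 / math.sqrt(lam * n)
        + 16.0 / (lam * n)
    )
```

For fixed $\lambda$ the first two terms decay like $n^{-1/2}$ and the last like $n^{-1}$, so the bound is of the usual Berry–Esseen order; for small samples it can exceed 1 and is then uninformative.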
We now proceed to the proof of the theorem.

Proof Let $p \in (2, 3]$. Since (10.45) is trivial if $\sum_{i=1}^n E|\xi_i|^p > 1/6$, we assume

$$\sum_{i=1}^n E|\xi_i|^p \le 1/6. \tag{10.48}$$
Similar to the proof of Theorem 10.6, let ⎧ ⎨ −c0 /2 for x < −c0 /2, for −c0 /2 ≤ x ≤ c0 /2, x= x ⎩ for x > c0 /2. c0 /2 Observe that √ n(h(θˆn ) − h(θ )) h (θ ) √ θˆn −θ n ˆ = h (θ + t) − h (θ ) dt h (θ )(θn − θ ) + h (θ ) 0 √ n−1/2 (W +) n =W ++ h (θ + t) − h (θ ) dt h (θ ) 0 := W + + R, where √ n−1/2 W +n−1/2 n h (θ + t) − h (θ ) dt =+ h (θ ) 0 √ n−1/2 (W +) n h (θ + t) − h (θ ) dt. R= −1/2 −1/2 h (θ ) n W +n
and
We will apply Theorem 10.1 with and i , defined in (10.53), playing the role of and i , respectively. But first, in order to handle the remainder term R, note that if |n−1/2 W | ≤ c0 /2 and |n−1/2 | ≤ c0 /2 then R = 0. Hence P |R| > 0 ≤ P |W | > c0 n1/2 /2 + P || > c0 n1/2 /2 ≤ 4/ c02 n + 2E||/ c0 n1/2 . (10.49) n Recall W = i=1 ξi . We prove in the Appendix that under (10.48), for all 2 < p ≤ 3, n p/2 + E|ξi |p ≤ 2.2. E|W |p ≤ 2 EW 2
(10.50)
i=1
With W (i) denoting W − ξi as usual, we have
h (θ + t) − h (θ ) dt
n−1/2 W +n−1/2 0
2 ≤ 0.5δ(c0 ) n−1/2 W + n−1/2 2 2 ≤ δ(c0 ) n−1/2 W + n−1/2 p−1 ≤ δ(c0 ) (c0 /2)3−p n−1/2 |W | + (c0 /2)n−1/2 || ,
and therefore
(10.51)
(c0 /2)3−p δ(c0 ) c0 δ(c0 ) E|W |p + E|W | |h (θ )| |h (θ )|n(p−2)/2 3−p δ(c0 ) 2.2c c0 δ(c0 ) ≤ 1+ E|W | + 0 (p−2)/2 , |h (θ )| |h (θ )|n
E|W | ≤ E|W | +
(10.52)
where for the last term we have applied inequality (10.50). Now introducing √ n−1/2 W (i) +n−1/2 i n (10.53) h (θ + t) − h (θ ) dt, h (θ ) 0 √ the difference − i will equal − i plus n/ h (θ ) times the term in the absolute value of (10.54), which we bound in a manner similar to (10.51). In particular, applying the bound b δ(c0 ) h (θ + t) − h (θ ) dt ≤ |b − a| |a| + |b| for a, b ∈ [−c0 , c0 ], 2 a i = i +
we obtain n−1/2 W +n−1/2 −1/2 (i) −1/2 h (θ + t) − h (θ ) dt n
W
+n
i
δ(c0 ) −1/2 n W − n−1/2 W (i) n−1/2 W + n−1/2 W (i) ≤ 2 + n−1/2 + n−1/2 i + 2c0 n−1/2 − n−1/2 i p−2 3−p p−2 δ(c0 ) −1/2 3−p ≤ n W − n−1/2 W (i) c0 n−1/2 W + c0 n−1/2 W (i) 2 + n−1/2 || + n−1/2 |i | + 2c0 n−1/2 | − i | p−2 3−p ≤ δ(c0 ) c0 n−(p−1)/2 |ξi | W (i) + |ξi |p−2 + n−1 |ξi ||i | + 2c0 n−1/2 | − i | . (10.54)
Now, to attend to the final term in the bound (10.7), where, again, and i are playing the role of and i , from the inequality above we obtain n E ξi ( − i ) i=1
≤
n E ξi ( − i ) i=1
√ n nδ(c0 ) 3−p −(p−1)/2 2 (i) p−2 E |ξi | W + |ξi |p−2 c0 n + |h (θ )| i=1 n n −1 2 −1/2 Eξ |i | + 2c0 n E ξi ( − i ) +n i
i=1
i=1
n 3−p n c0 δ(c0 ) 2 2c0 δ(c0 ) + Eξi + E|ξi |p ≤ 1+ E ξ ( − ) i i (p−2)/2 |h (θ )| |h (θ )|n i=1
+
n−1/2 δ(c |h (θ )|
≤ 1+ +
n 0)
2c0 δ(c0 ) |h (θ )|
i=1
Eξi2 E|i |
i=1
n
E ξi ( − i )
i=1 3−p 1.2c0 δ(c0 ) + |h (θ )|n(p−2)/2
n−1/2 δ(c0 ) 2 Eξi E|i |, |h (θ )| n
(10.55)
i=1
recalling (10.48) for the last inequality. The theorem now follows by combining (10.7), (10.49), (10.52) and (10.55).
10.3 Uniform and Non-uniform Randomized Concentration Inequalities

As the previous chapters have demonstrated, the concentration inequality approach is a powerful tool for deriving sharp Berry–Esseen bounds for independent random variables. In this section we develop uniform and non-uniform randomized concentration inequalities which we will use to prove Theorems 10.1 and 10.2. Let $\xi_1, \ldots, \xi_n$ be independent random variables satisfying (10.1), $W = \sum_{i=1}^n \xi_i$ and $T = W + \Delta$. The simple inequality

$$-P\big(z - |\Delta| \le W \le z\big) \le P(T \le z) - P(W \le z) \le P\big(z \le W \le z + |\Delta|\big) \tag{10.56}$$

provides lower and upper bounds for the difference between the distribution functions of $T$ and its approximation $W$, and involves the probability that $W$ lies in an interval of random length. Hence, we are led to consider concentration inequalities that bound quantities of the form $P(\Delta_1 \le W \le \Delta_2)$.

Proposition 10.1 Let $\delta > 0$ satisfy (10.3). Then

$$P(\Delta_1 \le W \le \Delta_2) \le 4\delta + E\big|W(\Delta_2 - \Delta_1)\big| + \sum_{i=1}^n \Big(E\big|\xi_i(\Delta_1 - \Delta_{1,i})\big| + E\big|\xi_i(\Delta_2 - \Delta_{2,i})\big|\Big), \tag{10.57}$$

whenever $\xi_i$ is independent of $(W - \xi_i, \Delta_{1,i}, \Delta_{2,i})$ for all $i = 1, \ldots, n$.

When both $\Delta_1$ and $\Delta_2$ are not random, say, $\Delta_1 = a$ and $\Delta_2 = b$ with $a \le b$, then, by (ii) of Remark 10.1, whenever $\beta_2 + \beta_3 \le 1/2$ Proposition 10.1 recovers (3.38) by letting $\Delta_{1,i} = a$ and $\Delta_{2,i} = b$ for each $i = 1, \ldots, n$.
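Proof As the probability $P(\Delta_1 \le W \le \Delta_2)$ is zero if $\Delta_1 > \Delta_2$ we may assume without loss of generality that $\Delta_1 \le \Delta_2$ a.s. We follow the proof of (3.28). For $a \le b$ let

$$f_{a,b}(w) = \begin{cases} -\tfrac12(b - a) - \delta & \text{for } w < a - \delta, \\ w - \tfrac12(a + b) & \text{for } a - \delta \le w \le b + \delta, \\ \tfrac12(b - a) + \delta & \text{for } w > b + \delta, \end{cases}$$

and set

$$\hat K_i(t) = \xi_i\big\{1(-\xi_i \le t \le 0) - 1(0 < t \le -\xi_i)\big\} \quad\text{and}\quad \hat K(t) = \sum_{i=1}^n \hat K_i(t).$$

Since $\xi_i$ and $f_{\Delta_{1,i},\Delta_{2,i}}(W - \xi_i)$ are independent for $1 \le i \le n$ and $E\xi_i = 0$, we have

$$EWf_{\Delta_1,\Delta_2}(W) = \sum_{i=1}^n E\,\xi_i\big(f_{\Delta_1,\Delta_2}(W) - f_{\Delta_1,\Delta_2}(W - \xi_i)\big) + \sum_{i=1}^n E\,\xi_i\big(f_{\Delta_1,\Delta_2}(W - \xi_i) - f_{\Delta_{1,i},\Delta_{2,i}}(W - \xi_i)\big) := H_1 + H_2. \tag{10.58}$$

Using the fact that $\hat K(t) \ge 0$ and $f'_{\Delta_1,\Delta_2}(w) \ge 0$, we have

$$H_1 = \sum_{i=1}^n E\left\{\xi_i \int_{-\xi_i}^0 f'_{\Delta_1,\Delta_2}(W + t)\,dt\right\} = \sum_{i=1}^n E\left\{\int_{-\infty}^{\infty} f'_{\Delta_1,\Delta_2}(W + t)\hat K_i(t)\,dt\right\} = E\left\{\int_{-\infty}^{\infty} f'_{\Delta_1,\Delta_2}(W + t)\hat K(t)\,dt\right\}$$

$$\ge E\left\{\int_{|t| \le \delta} f'_{\Delta_1,\Delta_2}(W + t)\hat K(t)\,dt\right\} \ge E\left\{1_{\{\Delta_1 \le W \le \Delta_2\}}\int_{|t| \le \delta}\hat K(t)\,dt\right\} = E\left\{1_{\{\Delta_1 \le W \le \Delta_2\}}\sum_{i=1}^n |\xi_i|\min\big(\delta, |\xi_i|\big)\right\} \ge H_{1,1} - H_{1,2}, \tag{10.59}$$

where

$$H_{1,1} = P(\Delta_1 \le W \le \Delta_2)\sum_{i=1}^n E|\xi_i|\min\big(\delta, |\xi_i|\big) \ge \frac12 P(\Delta_1 \le W \le \Delta_2), \tag{10.60}$$

by (10.3), and

$$H_{1,2} = E\left|\sum_{i=1}^n \Big(|\xi_i|\min\big(\delta, |\xi_i|\big) - E|\xi_i|\min\big(\delta, |\xi_i|\big)\Big)\right| \le \left(\operatorname{Var}\left(\sum_{i=1}^n |\xi_i|\min\big(\delta, |\xi_i|\big)\right)\right)^{1/2} \le \delta. \tag{10.61}$$

As to $H_2$, first, one verifies

$$\big|f_{\Delta_1,\Delta_2}(w) - f_{\Delta_{1,i},\Delta_{2,i}}(w)\big| \le |\Delta_1 - \Delta_{1,i}|/2 + |\Delta_2 - \Delta_{2,i}|/2,$$

which then yields

$$|H_2| \le \frac12\sum_{i=1}^n \Big(E\big|\xi_i(\Delta_1 - \Delta_{1,i})\big| + E\big|\xi_i(\Delta_2 - \Delta_{2,i})\big|\Big). \tag{10.62}$$

It follows from the definition of $f_{\Delta_1,\Delta_2}$ that

$$\big|f_{\Delta_1,\Delta_2}(w)\big| \le \frac12(\Delta_2 - \Delta_1) + \delta.$$

Hence, by (10.58)–(10.62),

$$P(\Delta_1 \le W \le \Delta_2) \le 2EWf_{\Delta_1,\Delta_2}(W) + 2\delta + \sum_{i=1}^n \Big(E\big|\xi_i(\Delta_1 - \Delta_{1,i})\big| + E\big|\xi_i(\Delta_2 - \Delta_{2,i})\big|\Big)$$

$$\le E\big|W(\Delta_2 - \Delta_1)\big| + 2\delta E|W| + 2\delta + \sum_{i=1}^n \Big(E\big|\xi_i(\Delta_1 - \Delta_{1,i})\big| + E\big|\xi_i(\Delta_2 - \Delta_{2,i})\big|\Big)$$

$$\le E\big|W(\Delta_2 - \Delta_1)\big| + 4\delta + \sum_{i=1}^n \Big(E\big|\xi_i(\Delta_1 - \Delta_{1,i})\big| + E\big|\xi_i(\Delta_2 - \Delta_{2,i})\big|\Big),$$

as desired.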
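The step in (10.59) that replaces $\int_{|t| \le \delta} \hat K(t)\,dt$ by $\sum_i |\xi_i| \min(\delta, |\xi_i|)$ rests on a pointwise identity for each $\hat K_i$, which is easy to check numerically. The sketch below compares a midpoint-rule approximation of the integral with the closed form for a fixed realized value of $\xi_i$; the function names, the number of quadrature steps and the sample values are assumptions chosen for illustration.

```python
def K_hat(xi, t):
    """K_i(t) = xi * (1{-xi <= t <= 0} - 1{0 < t <= -xi}) for a realized value xi."""
    first = 1 if -xi <= t <= 0 else 0
    second = 1 if 0 < t <= -xi else 0
    return xi * (first - second)


def integral_K_hat(xi, delta, steps=20_000):
    """Midpoint-rule approximation of the integral of K_i over |t| <= delta."""
    h = 2.0 * delta / steps
    return h * sum(K_hat(xi, -delta + (j + 0.5) * h) for j in range(steps))


def closed_form(xi, delta):
    """The value |xi| * min(delta, |xi|) used in (10.59)."""
    return abs(xi) * min(delta, abs(xi))


# Largest discrepancy over a few positive and negative values of xi.
max_error = max(
    abs(integral_K_hat(xi, 0.5) - closed_form(xi, 0.5))
    for xi in (0.3, -0.3, 0.8, -0.8)
)
```

The check also makes visible why $\hat K_i \ge 0$: for $\xi_i > 0$ only the first indicator can be active, and for $\xi_i < 0$ only the second, where the factor $\xi_i$ and the minus sign cancel.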
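Proof of Theorem 10.1 Claim (10.5) follows from applying (10.56) and Proposition 10.1 with

$$(\Delta_1, \Delta_2, \Delta_{1,i}, \Delta_{2,i}) = \begin{cases} (z + \Delta, z, \Delta_i, z) & \Delta < 0, \\ (z, z + \Delta, z, \Delta_i) & \Delta \ge 0. \end{cases}$$

Next, (10.6) is trivial when $\beta_2 + \beta_3 > 1/2$, and otherwise follows from (10.5) and (ii) of Remark 10.1. Lastly, (10.7) is a direct corollary of (10.6) and (3.31).

Theorem 10.2 is based on the following non-uniform randomized concentration inequality.

Proposition 10.2 Let $\delta > 0$ satisfy (10.3). If $\xi_i$ is independent of $(W - \xi_i, \Delta_{1,i}, \Delta_{2,i})$ for all $i = 1, \ldots, n$, then for all $a \in \mathbb{R}$ and $p \ge 2$,

$$P(\Delta_1 \le W \le \Delta_2, \Delta_1 \ge a) \le 2\sum_{1 \le i \le n} P\big(|\xi_i| > (1 \vee a)/(2p)\big) + e^p\big(1 + a^2/(4p)\big)^{-p}\beta_2 + e^{-a/2}\tau_1, \tag{10.63}$$

where $\beta_2$ is given in (10.4) and

$$\tau_1 = 18\delta + 7.2\|\Delta_2 - \Delta_1\|_2 + 3\sum_{i=1}^n \|\xi_i\|_2\big(\|\Delta_1 - \Delta_{1,i}\|_2 + \|\Delta_2 - \Delta_{2,i}\|_2\big). \tag{10.64}$$

Proof When $a \le 2$, (10.63) follows from Proposition 10.1. For $a > 2$, without loss of generality assume that

$$a \le \Delta_1 \le \Delta_2, \tag{10.65}$$

as otherwise we may consider $\bar\Delta_1 = \max(a, \Delta_1)$ and $\bar\Delta_2 = \max(a, \Delta_1, \Delta_2)$ and use the fact that $|\bar\Delta_2 - \bar\Delta_1| \le |\Delta_2 - \Delta_1|$. We follow the lines of argument in the proofs of Propositions 8.1 and 10.1. Let

$$\bar\xi_i = \xi_i 1_{\{\xi_i \le 1\}}, \quad \bar W = \sum_{i=1}^n \bar\xi_i, \quad\text{and}\quad \bar W^{(i)} = \bar W - \bar\xi_i.$$

As in (8.20), we have

$$\{\Delta_1 \le W \le \Delta_2\} \subset \{\Delta_1 \le \bar W \le \Delta_2\} \cup \Big\{\Delta_1 \le W \le \Delta_2, \max_{1 \le i \le n}\xi_i > 1\Big\} \subset \{\Delta_1 \le \bar W \le \Delta_2\} \cup \Big\{W \ge a, \max_{1 \le i \le n}\xi_i > 1\Big\}$$

by (10.65). Invoking Lemma 8.3 for the second term above, it only remains to show

$$P(\Delta_1 \le \bar W \le \Delta_2) \le e^{-a/2}\tau_1. \tag{10.66}$$

We can assume that $\delta \le 0.065$ since otherwise, by (8.5) of Lemma 8.1 with $\alpha = 1$,

$$P(\Delta_1 \le \bar W \le \Delta_2) \le P(\bar W \ge a) \le e^{-a/2}Ee^{\bar W/2} \le e^{-a/2}\exp\big(e^{0.5} - 1.5\big) \le 1.17e^{-a/2} \le 18\delta e^{-a/2},$$

implying (10.66). For $\alpha, \beta \in \mathbb{R}$ let

$$f_{\alpha,\beta}(w) = \begin{cases} 0 & \text{for } w < \alpha - \delta, \\ e^{w/2}(w - \alpha + \delta) & \text{for } \alpha - \delta \le w \le \beta + \delta, \\ e^{w/2}(\beta - \alpha + 2\delta) & \text{for } w > \beta + \delta, \end{cases}$$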
and set ¯ M¯ i (t) = ξi (1{−xi = ¯ i ≤t≤0} − 1{0 2.
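Applying Proposition 10.2 with $(\Delta_1, \Delta_2, \Delta_{1,i}, \Delta_{2,i}) = (z - |\Delta|, z, z - |\Delta_i|, z)$ and $a = (2z - 1)/3$ yields

$$P\big(z - |\Delta| \le W \le z, |\Delta| \le (z + 1)/3\big) \le 2\sum_{1 \le i \le n} P\Big(|\xi_i| > \big(1 \vee ((2z - 1)/3)\big)/(2p)\Big) + e^p\big(1 + (2z - 1)^2/(36p)\big)^{-p}\beta_2$$

$$+ \ e^{-(2z-1)/6}\left(18\delta + 7.2\|\Delta\|_2 + 3\sum_{i=1}^n \|\xi_i\|_2\|\Delta - \Delta_i\|_2\right)$$

$$\le 2\sum_{1 \le i \le n} P\big(|\xi_i| > (z + 1)/(6p)\big) + e^p\big(1 + z^2/(36p)\big)^{-p}\beta_2 + e^{-z/3}\tau.$$

Now combining the bound above with (10.56) and the inequality

$$P\big(z - |\Delta| \le W \le z\big) \le P\big(|\Delta| > (z + 1)/3\big) + P\big(z - |\Delta| \le W \le z, |\Delta| \le (z + 1)/3\big)$$

yields

$$-\gamma_{z,p} - e^{-z/3}\tau \le P(T \le z) - P(W \le z).$$

Similarly showing the corresponding upper bound completes the proof of (10.8). When $\beta_2 + \beta_3 \le 1/2$, in light of (ii) of Remark 10.1, choosing $\delta = (\beta_2 + \beta_3)/2$ and noting that $\beta_2 \le \sum_{i=1}^n E|\xi_i|^p$, $\beta_3 \le \sum_{i=1}^n E|\xi_i|^{3 \wedge p}$ and

$$\sum_{i=1}^n P\left(|\xi_i| > \frac{|z| + 1}{6p}\right) \le \frac{(6p)^p}{(|z| + 1)^p}\sum_{i=1}^n E|\xi_i|^p,$$

we see (10.10) holds by (10.8) and Theorem 8.1. If $\beta_2 + \beta_3 > 1/2$, then

$$\sum_{i=1}^n \big(E|\xi_i|^p + E|\xi_i|^{3 \wedge p}\big) \ge 1/2$$

and

$$P(T \ge z) \le P\big(W \ge (2z - 1)/3\big) + P\big(|\Delta| > (z + 1)/3\big) \le \frac{C_p}{(1 + z)^p}\left(1 + \sum_{i=1}^n E|\xi_i|^p\right) + P\big(|\Delta| > (z + 1)/3\big)$$

by (8.10). Therefore (10.10) remains valid.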
Appendix Proof of Lemma 10.1 It is known (see, e.g., Koroljuk and Borovskich 1994, p. 271) that 2 ¯ hm (Xi1 , . . . , Xim ) E 1≤i1 z + δ) − 1 − (z) ≤ d0 I1 + δ1 I2 + δ1 I3 + δ2 I4 , where I1 = E (W + δ)f (W + δ) − Wf (W ) , I2 = E |W | 1 + W 2 f (W ) ,
$$I_3 = E\big|1 - \Phi(z) - 1(W > z + \delta)\big|\big(1 + W^2\big) \quad\text{and}\quad I_4 = E\big(1 + |W|\big)f(W).$$

We will consider $I_2$ first, and again using $f(w) \ge 0$, apply

$$|w|\big(1 + w^2\big)f(w) \le 2\big(1 + |w|^3\big)f(w). \tag{11.31}$$

Recalling inequality (2.11),

$$e^{z^2/2}\big(1 - \Phi(z)\big) \le \min\left(\frac12, \frac{1}{z\sqrt{2\pi}}\right) \quad\text{for } z > 0, \tag{11.32}$$

and the form (2.3) of the solution $f = f_z$ from Lemma 2.2, to bound the first term arising from the expectation of (11.31) we have

$$Ef(W) \le \sqrt{\pi/2}\,P(W > z) + \sqrt{\pi/2}\,\big(1 - \Phi(z)\big)P(W \le 0) + \sqrt{2\pi}\,\big(1 - \Phi(z)\big)Ee^{W^2/2}1(0 < W \le z)$$

$$\le \sqrt{\pi/2}\,P(W > z) + \sqrt{\pi/2}\,\big(1 - \Phi(z)\big) + \sqrt{2\pi}\,\big(1 - \Phi(z)\big)Ee^{W^2/2}1(0 < W \le z). \tag{11.33}$$

Note that (11.29) implies $\max(\delta, \delta_1, \delta_2) \le 1/256$. Hence the hypotheses of Lemma 11.2 are satisfied, and therefore also the conclusion of Lemma 11.3. Now note that since $c_0$ in (11.20) is bounded over the given range of $z$, it follows from Lemma 11.2 that

$$Ee^{zW} \le Ce^{z^2/2} \quad\text{and hence}\quad P(W > z) \le e^{-z^2}Ee^{zW} \le Ce^{-z^2/2}, \tag{11.34}$$
where $C$ denotes an absolute constant, not necessarily the same at each occurrence. This last inequality handles the first term in (11.33). We will apply the identities, for any absolutely continuous function $g$, that

$$\int_0^z g'(y)P(W > y)\,dy = g(z)P(W > z) - g(0)P(W > 0) + Eg(W)1(0 < W \le z),$$

and

$$\int_z^\infty g'(y)P(W > y)\,dy = -g(z)P(W > z) + Eg(W)1(W > z). \tag{11.35}$$

Now, to handle the last term in (11.33), by Lemma 11.3,

$$Ee^{W^2/2}1(0 < W \le z) \le P(0 < W \le z) + \int_0^z ye^{y^2/2}P(W > y)\,dy \le C(1 + z).$$
For the second term in (11.31), similarly, by (2.7), (11.32) and (2.3),

$$E|W|^3f(W) \le EW^21(W > z) + \big(1 - \Phi(z)\big)EW^21(W < 0) + \sqrt{2\pi}\,\big(1 - \Phi(z)\big)EW^3e^{W^2/2}1(0 < W \le z).$$

The second term is clearly bounded by $2(1 - \Phi(z))$, and we may bound the last expectation as

$$EW^3e^{W^2/2}1(0 < W \le z) \le \int_0^z \big(y^4 + 3y^2\big)e^{y^2/2}P(W > y)\,dy \le C\big(1 + z^4\big), \tag{11.36}$$

applying Lemma 11.3 again. As to $EW^21(W > z)$, first, using (11.34),

$$\int_z^\infty yP(W > y)\,dy \le Ee^{zW}\int_z^\infty ye^{-zy}\,dy = Ee^{zW}z^{-2}\big(1 + z^2\big)e^{-z^2} \le Ce^{-z^2/2}z^{-2}\big(1 + z^2\big) \le Ce^{-z^2/2}$$

for $z > 1$. Thus, for all such $z$, by (11.35) and (11.34),

$$EW^21(W > z) = z^2P(W > z) + \int_z^\infty 2yP(W > y)\,dy \le C\big(1 + z^2\big)e^{-z^2/2} \le C\big(1 + z^3\big)\big(1 - \Phi(z)\big). \tag{11.37}$$

Now, by (11.3) with $f(w) = w$ and (11.6) and (11.5), we have

$$EW^2 = E\int_{|t| \le \delta}\hat K(t)\,dt + E(RW) \le E(\hat K_1) + \delta_2E\big(|W| + W^2\big) \le E(\hat K_1) + \delta_2E\big(1 + 2W^2\big)$$

$$\le (1 + \delta_1 + \delta_2) + (\delta_1 + 2\delta_2)EW^2 \le 5/4 + EW^2/4,$$

yielding $EW^2 \le 2$. Hence (11.37) remains valid for $0 \le z \le 1$ since $EW^21(W > z) \le EW^2 \le 2$. Summarizing, we have

$$I_2 \le C\big(1 + z^4\big)\big(1 - \Phi(z)\big),$$

and in a similar fashion one may demonstrate

$$I_4 \le C\big(1 + z^2\big)\big(1 - \Phi(z)\big) \tag{11.38}$$

and

$$I_3 \le 3\big(1 - \Phi(z)\big) + E1(W \ge \delta + z)\big(1 + W^2\big) \le C\big(1 + z^3\big)\big(1 - \Phi(z)\big).$$

Lastly, to handle $I_1$, letting $g(w) = (wf(w))'$ and recalling (2.81),

$$g(w) = \begin{cases} \big(\sqrt{2\pi}(1 + w^2)e^{w^2/2}(1 - \Phi(w)) - w\big)\Phi(z), & w > z, \\ \big(\sqrt{2\pi}(1 + w^2)e^{w^2/2}\Phi(w) + w\big)\big(1 - \Phi(z)\big), & w < z, \end{cases}$$

and the inequality

$$0 \le \sqrt{2\pi}\big(1 + w^2\big)e^{w^2/2}\big(1 - \Phi(w)\big) - w \le \frac{2}{1 + w^3} \quad\text{for } w \ge 0$$

from (5.4) of Chen and Shao (2001), we have for $0 \le t \le \delta$,

$$Eg(W + t) = Eg(W + t)1\{W + t \ge z\} + Eg(W + t)1\{W + t \le 0\} + Eg(W + t)1\{0 < W + t < z\}$$

$$\le \frac{2}{1 + z^3}P(W + t \ge z) + 2\big(1 - \Phi(z)\big)P(W + t \le 0) + \sqrt{2\pi}\,\big(1 - \Phi(z)\big)E\Big\{\big(1 + (W + t)^2 + (W + t)\big)e^{(W+t)^2/2}1\{0 < W + t < z\}\Big\}$$

$$\le C\big(1 + z^3\big)\big(1 - \Phi(z)\big),$$

by arguing as in (11.36) for the final term. Now writing $I_1 = \int_0^\delta Eg(W + t)\,dt$, putting everything together and using the continuity of the right hand side in $z$ to replace the strict inequality in (11.30) by a non-strict one, we obtain

$$P(W \ge z + \delta) - \big(1 - \Phi(z)\big) \le C\big(1 - \Phi(z)\big)\Big(d_0\big(1 + z^3\big)\delta + \big(1 + z^4\big)\delta_1 + \big(1 + z^2\big)\delta_2\Big). \tag{11.39}$$

Now note that for $\delta z \le 1$ and $z \ge 0$,

$$\big(1 - \Phi(z - \delta)\big) - \big(1 - \Phi(z)\big) = \frac{1}{\sqrt{2\pi}}\int_{z-\delta}^z e^{-t^2/2}\,dt \le \frac{1}{\sqrt{2\pi}}\delta e^{-(z-\delta)^2/2} \le \frac{1}{\sqrt{2\pi}}\delta e^{-z^2/2 + z\delta} \le \frac{1}{\sqrt{2\pi}}\delta e^{-z^2/2 + 1}$$

$$\le e\delta(1 + z)\big(1 - \Phi(z)\big) \le 3(1 + z)\delta\big(1 - \Phi(z)\big) \le 6\big(1 + z^3\big)\delta\big(1 - \Phi(z)\big).$$
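For the third to last inequality we have used the fact that $g(z) \ge 0$ for all $z \ge 0$, where

$$g(z) = 1 - \Phi(z) - \frac{1}{\sqrt{2\pi}(1 + z)}e^{-z^2/2},$$

which can be shown by verifying $g'(z) \le 0$ for all $z \ge 0$, and $\lim_{z \to \infty} g(z) = 0$. Hence

$$P(W \ge z) - \big(1 - \Phi(z)\big) = P(W \ge z) - \big(1 - \Phi(z - \delta)\big) + \big(1 - \Phi(z - \delta)\big) - \big(1 - \Phi(z)\big) \le P(W \ge z) - \big(1 - \Phi(z - \delta)\big) + 6\big(1 + z^3\big)\delta\big(1 - \Phi(z)\big).$$

Now, from (11.39), with $C$ not necessarily the same at each occurrence,

$$P(W \ge z) - \big(1 - \Phi(z)\big) \le C\big(1 - \Phi(z)\big)\Big(d_0\big(1 + z^3\big)\delta + \big(1 + z^4\big)\delta_1 + \big(1 + z^2\big)\delta_2\Big).$$

As a corresponding lower bound may be shown in the same manner, the proof of Theorem 11.1 is complete.

The proof of Theorem 11.2 follows the same lines as the proof of Theorem 11.1, with $\hat K_1 = 1$, $\delta_1 = \delta_2 = 0$ and $d_0 = 1$; we omit the details.

We now prove our moderate deviation result for the Curie–Weiss model.

Proof of (11.17) For each $i \in \{1, \ldots, n\}$ let $\sigma_i'$ be a random sample from the conditional distribution of $\sigma_i$ given $\{\sigma_j, j \ne i, 1 \le j \le n\}$. Let $I$ be a random index uniformly distributed over $\{1, \ldots, n\}$ independent of $\{\sigma_i, \sigma_i' : 1 \le i \le n\}$. Recalling that $\sigma^2$ is the variance of the total spin $\sum_{i=1}^n \sigma_i$, and that $W = \sum_{i=1}^n \sigma_i/\sigma$, define $W' = W - (\sigma_I - \sigma_I')/\sigma$. Then $(W, W')$ is an exchangeable pair. Let

$$A(w) = \frac{\exp(-\beta\sigma w/n + \beta/n)}{\exp(\beta\sigma w/n - \beta/n) + \exp(-\beta\sigma w/n + \beta/n)},$$

and

$$B(w) = \frac{\exp(\beta\sigma w/n + \beta/n)}{\exp(\beta\sigma w/n + \beta/n) + \exp(-\beta\sigma w/n - \beta/n)}.$$

With $\boldsymbol\sigma = (\sigma_1, \ldots, \sigma_n)$, from (11.16) we obtain

$$E(W - W' \mid \boldsymbol\sigma) = \frac{1}{n\sigma}\sum_{i=1}^n E\big(\sigma_i - \sigma_i' \mid \boldsymbol\sigma\big) = \frac{1}{n\sigma}\left(\sum_{i:\,\sigma_i = 1} 2P\big(\sigma_i' = -1 \mid \boldsymbol\sigma\big) + \sum_{i:\,\sigma_i = -1}(-2)P\big(\sigma_i' = 1 \mid \boldsymbol\sigma\big)\right)$$

$$= \frac{1}{n\sigma}\Big((n + \sigma W)A(W) - (n - \sigma W)B(W)\Big) = \frac{A(W) + B(W)}{n}W + \frac{A(W) - B(W)}{\sigma},$$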
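and hence

$$E(W - W' \mid W) = \frac{A(W) + B(W)}{n}W + \frac{A(W) - B(W)}{\sigma}.$$

Similarly,

$$E\big((W - W')^2 \mid W\big) = E\Big(E\big((W - W')^2 \mid \boldsymbol\sigma\big) \,\Big|\, W\Big) = E\left(\frac{1}{n\sigma^2}\sum_{i=1}^n E\big((\sigma_i - \sigma_i')^2 \mid \boldsymbol\sigma\big) \,\Bigg|\, W\right)$$

$$= \frac{1}{n\sigma^2}\Big((n + \sigma W)2A(W) + (n - \sigma W)2B(W)\Big) = \frac{2\big(A(W) + B(W)\big)}{\sigma^2} + \frac{2\big(A(W) - B(W)\big)}{n\sigma}W.$$

It is easy to see that

$$\frac{e^{-\beta\sigma w/n}}{e^{-\beta\sigma w/n} + e^{\beta\sigma w/n}} \le A(w) = \frac{1}{1 + \exp(2\beta\sigma w/n - 2\beta/n)} \le \frac{e^{2\beta/n}}{1 + \exp(2\beta\sigma w/n)} = \frac{e^{-\beta\sigma w/n}e^{2\beta/n}}{e^{-\beta\sigma w/n} + e^{\beta\sigma w/n}}$$

and similarly,

$$\frac{e^{\beta\sigma w/n}}{e^{-\beta\sigma w/n} + e^{\beta\sigma w/n}} \le B(w) = \frac{1}{1 + \exp(-2\beta\sigma w/n - 2\beta/n)} \le \frac{e^{\beta\sigma w/n}e^{2\beta/n}}{e^{-\beta\sigma w/n} + e^{\beta\sigma w/n}}.$$

Therefore

$$A(W) + B(W) = 1 + O(1)\frac{1}{n} \quad\text{and}\quad A(W) - B(W) = -\tanh(\beta\sigma W/n) + O(1)\frac{1}{n}.$$

Hence we have

$$E(W - W' \mid W) = \left(1 + O(1)\frac{1}{n}\right)\frac{W}{n} + \frac{1}{\sigma}\left(-\tanh(\beta\sigma W/n) + O(1)\frac{1}{n}\right)$$

$$= \frac{W}{n} + O(1)\frac{W}{n^2} + O(1)\frac{1}{n\sigma} + \frac{1}{\sigma}\left(-\beta\sigma W/n - \frac{\tanh''(\xi)}{2}(\beta\sigma W/n)^2\right)$$

$$= \frac{1 - \beta}{n}W - \frac{\tanh''(\xi)\beta^2\sigma W^2}{2n^2} + O(1)\frac{1}{n\sigma}, \tag{11.40}$$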
using the fact that $|\sigma W| \le n$, and likewise

$$E\big((W - W')^2 \mid W\big) = \frac{2}{\sigma^2} + O(1)\frac{1}{n\sigma^2} + O(1)\frac{W}{n^2\sigma} + \frac{2W}{n\sigma}\big(-\tanh'(\eta)\beta\sigma W/n\big) = \frac{2}{\sigma^2} - \frac{2\tanh'(\eta)\beta W^2}{n^2} + O(1)\frac{1}{n\sigma^2}, \tag{11.41}$$

where $\xi$ and $\eta$ lie between 0 and $\beta\sigma W/n$. From (11.40) and Remark 11.1, $W$ satisfies (11.3) with $\lambda = (1 - \beta)/n$, $\hat K_1 = (W - W')^2/(2\lambda)$ and

$$R = \frac{\tanh''(\xi)\beta^2\sigma W^2}{2n(1 - \beta)} + O(1)\frac{1}{\sigma}. \tag{11.42}$$

Further, from (11.41),

$$E[\hat K_1 \mid W] - 1 = \frac{1}{2\lambda}E\big((W - W')^2 \mid W\big) - 1 = \frac{n}{(1 - \beta)\sigma^2} - 1 - \frac{\tanh'(\eta)\beta W^2}{n(1 - \beta)} + O(1)\frac{1}{\sigma^2}. \tag{11.43}$$

Since (11.9) holds, the expected value of the left hand side of (11.43) is $-E[RW]$. Hence, using that $EW = 0$, making the second term in (11.42) vanish after multiplying by $W$ and taking expectation, we obtain

$$\frac{n}{(1 - \beta)\sigma^2} - 1 - \frac{E\big(\tanh'(\eta)\beta W^2\big)}{n(1 - \beta)} + O(1)\frac{1}{\sigma^2} = -E\left(\frac{\tanh''(\xi)\beta^2\sigma W^3}{2n(1 - \beta)}\right). \tag{11.44}$$

On the left hand side, since $\tanh'(x)$ is bounded on $\mathbb{R}$ and $EW^2 = 1$, the third term is $O(1/n)$, and the last term is of smaller order than the first. On the right hand side, as $\tanh''(x)$ has sign opposite that of $x$, we conclude $\tanh''(\xi)W^3 \le 0$, as $\xi$ lies between 0 and $\beta\sigma W/n$. Hence the right hand side above is nonnegative. As $\tanh''(x)$ is bounded on $\mathbb{R}$, $|W^3| \le nW^2/\sigma$ and $EW^2 = 1$, the right hand side is also bounded. Hence $n/((1 - \beta)\sigma^2)$ is of order 1, and $\sigma/\sqrt{n}$ is bounded away from 0 and infinity. Note now from (11.44) that if $E|W^3| \le C$ then

$$\frac{n}{(1 - \beta)\sigma^2} - 1 = O(1/\sqrt{n}),$$

implying, by (11.43), that

$$\left|\frac{1}{2\lambda}E\big((W - W')^2 \mid W\big) - 1\right| \le Cn^{-1/2}\big(1 + |W|\big). \tag{11.45}$$
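The two approximations $A(W) + B(W) = 1 + O(1)/n$ and $A(W) - B(W) = -\tanh(\beta\sigma W/n) + O(1)/n$ that drive the computation above can be checked numerically. The sketch below implements $A$ and $B$ as defined in the proof; the parameter values, and in particular the choice $\sigma^2 = n/(1 - \beta)$ (suggested by, but not identical to, the order-one statement derived from (11.44)), are assumptions made for illustration.

```python
import math


def A(w, beta, sigma, n):
    """A(w) from the proof of (11.17)."""
    x = beta * sigma * w / n
    e = beta / n
    return math.exp(-x + e) / (math.exp(x - e) + math.exp(-x + e))


def B(w, beta, sigma, n):
    """B(w) from the proof of (11.17)."""
    x = beta * sigma * w / n
    e = beta / n
    return math.exp(x + e) / (math.exp(x + e) + math.exp(-x - e))


beta, n = 0.5, 1000
sigma = math.sqrt(n / (1 - beta))  # illustrative value only
grid = [k / 10 for k in range(-30, 31)]  # values of w in [-3, 3]

# Worst-case deviation of A + B from 1, and of A - B from -tanh(beta*sigma*w/n).
sum_error = max(abs(A(w, beta, sigma, n) + B(w, beta, sigma, n) - 1) for w in grid)
diff_error = max(
    abs(A(w, beta, sigma, n) - B(w, beta, sigma, n) + math.tanh(beta * sigma * w / n))
    for w in grid
)
```

Writing $A = 1/(1 + e^{x - \varepsilon})$ and $B = 1/(1 + e^{-x - \varepsilon})$ with $x = 2\beta\sigma w/n$ and $\varepsilon = 2\beta/n$, each function has derivative in $\varepsilon$ bounded by $1/4$, so both errors computed here are at most $\beta/n$.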
Next we prove E|W 3 | ≤ C. Letting f (w) = w|w|, for which f (w) = 2|w|, substitution into (11.3), and (11.43) and (11.42) yield
E W 3
ˆ = EWf (W ) = E 2|W |K(t)dt + E Rf (W ) |t|≤δ
= 2E|W | + 2E |W | E[Kˆ 1 |W ] − 1 + E Rf (W ) tanh (η)βW 2 1 n − 1 − + O(1) = 2E|W | + 2E|W | n(1 − β) (1 − β)σ 2 σ2 tanh (ξ )β 2 σ W 2 1 +E + O(1) f (W ) 2n(1 − β) σ n 1 1 = 2E|W | + 2E|W | −1 +O E|W |3 + O(1) 2 E|W | n (1 − β)σ 2 σ tanh (ξ )β 2 σ W 3 1 +E |W | + O(1) Ef (W ). 2n(1 − β) σ As tanh (ξ )W 3 ≤ 0, and n/((1 − β)σ 2 ) − 1 = O(1), the right hand side is O(1) + O(1/n)E|W 3 |, hence E|W 3 | = O(1), √ as desired. By (11.42) and the fact that σ/ n is bounded away from zero and infinity, in place of Condition (11.6) we have instead that
C
E(R|W ) ≤ δ2 1 + W 2 where δ2 = √ . n
(11.46)
However, simple modifications can be made in the proofs of Lemma 11.2 and Theorem 11.1 so that (11.17) holds. First, note that the inequality (1 + |w|) ≤ 2(1 + w 2 ) is used in (11.21) to bound the first application of (11.6) in Lemma 11.2. Next, since ξ is between 0 and βσ W/n, the terms tanh (ξ ) and W have opposite signs. Hence, in the display (11.23) in Lemma 11.2, for the first term of the remainder R in (11.42) we have tanh (ξ )β 2 σ W 2 E W es(W ∧a) ≤ 0, 2n(1 − β) √ while the second term, of order 1/σ , that is, order 1/ n, can be absorbed after the indicated multiplication by W in the existing term δ2 Ef (W )(1 + |W |)|W |, with δ2 √ of order 1/ n. Hence (11.24), and Lemma 11.2 remain valid. In the proof of Theorem 11.1, the present case can be handled by replacing I4 = (1 + |W |)f (W ) by I4 = (1 + W 2 )f (W ), resulting in the bound I4 ≤ C 1 + z3 1 − (z) in place of (11.38). By (11.43) we √ may take d0 = O(1), and since |W − W | = |σI − σI |/σ , we have δ = O(1/ n). Likewise, √ by (11.46) and (11.45) we may take δ2 and δ1 respectively, both of order O(1/ n). Hence, in view of (11.45) and Remark 11.2, we have the following moderate deviation result for W
$$\frac{P(W \ge z)}{1 - \Phi(z)} = 1 + O(1)d_0\big(1 + z^3\big)\delta + O(1)\big(1 + z^3\big)\delta_1 + O(1)\big(1 + z^3\big)\delta_2.$$

This completes the proof of (11.17).
Appendix

Proof of Lemma 11.1 Write

$$n - 1 = \sum_{i \ge 1} 2^{m - p_i},$$

with $1 = p_1 < p_2 < \cdots \le m_1$ the positions of the ones in the binary expansion of $n - 1$, where $m_1 \le m$. Recall that $X$ is uniformly distributed over $\{0, 1, \ldots, n - 1\}$, and that

$$X = \sum_{i=1}^m X_i 2^{m-i},$$

with exactly $S$ of the indicator variables $X_1, \ldots, X_m$ equal to 1. We say that $X$ falls in category $i$, $i = 1, \ldots, m_1$, when

$$X_{p_1} = 1, \quad X_{p_2} = 1, \quad \ldots, \quad X_{p_{i-1}} = 1 \quad\text{and}\quad X_{p_i} = 0, \tag{11.47}$$

and in category $m_1 + 1$ if $X = n - 1$. This last category is nonempty only when $S = m_1$ and in this case, $Q = m - m_1$, which gives the last term in (11.48). Note that if $X$ is in category $i$ for $i \le m_1$, then, since $X$ can be no greater than $n - 1$, the digits of $X$ and $n - 1$ match up to the $p_i$th, except for the digit in place $p_i$, where $n - 1$ has a one, and $X$ a zero. Further, up to this digit, $n - 1$ has $p_i - i$ zeros, and so $X$ has $a_i = p_i - i + 1$ zeros. Changing any of these $a_i$ zeros of $X$, except the zero in position $p_i$, to one results in a number $n - 1$ or greater, while changing any other zeros, since digit $p_i$ of $n - 1$ is one and that same digit of $X$ is zero, does not. Hence $Q$ is at most $a_i$ when $X$ falls in category $i$. Since $X$ has $S$ ones in its expansion, $i - 1$ of which are accounted for by (11.47), conditional on $S$ the remaining $S - (i - 1)$ ones are uniformly distributed over the $m - p_i = m - (i - 1) - a_i$ remaining digits $\{X_{p_i + 1}, \ldots, X_m\}$. Thus, we have the inequality

$$E(Q \mid S) \le \frac{1}{A}\sum_{i \ge 1}\binom{m - (i - 1) - a_i}{S - (i - 1)}a_i + \frac{I(S = m_1)}{A}(m - m_1) \tag{11.48}$$

where

$$A = \sum_{i \ge 1}\binom{m - (i - 1) - a_i}{S - (i - 1)} + I(S = m_1),$$

and $1 = a_1 \le a_2 \le a_3 \le \cdots$. Note that if $m_1 = m$, the last term of (11.48) equals 0. When $m_1 < m$, we have

$$\frac{I(S = m_1)}{A}(m - m_1) \le \binom{m - 1}{m_1}^{-1}(m - m_1) \le 1,$$
so we may consider only the remaining terms of (11.48) in the following argument. We consider two cases; constants C may not necessarily be the same at each occurrence.

Case 1: $S \ge m/2$. As $a_i \ge 1$ for all i, there are at most $m + 1$ nonzero terms in the sum (11.48). Divide the summands into two groups, those for which $a_i \le 2\log_2 m$ and those with $a_i > 2\log_2 m$. The first group can sum to no more than $2\log_2 m$, as the sum is a weighted average of the $a_i$ terms, with weights summing to less than 1. For the second group, note that
\[
\binom{m-(i-1)-a_i}{S-(i-1)} \Big/ A
\le \binom{m-(i-1)-a_i}{S-(i-1)} \Big/ \binom{m-1}{S}
= \prod_{j=1}^{a_i-1} \frac{m-S-j}{m-j} \prod_{j=0}^{i-2} \frac{S-j}{m-(a_i-1)-1-j}
\le \frac{1}{2^{a_i-1}} \le \frac{2}{m^2}, \tag{11.49}
\]
where the second to last inequality follows from $S \ge m/2$ and the fact that the term considered is nonzero only when $S \le m - a_i$, and the last from $a_i > 2\log_2 m$. As $a_i \le m$ and there are at most $m + 1$ terms in the sum, the terms in the second group can sum to no more than 4.

Case 2: $S < m/2$. Divide the sum in (11.48) into two groups according as to whether $i > 2\log_2 m$ or $i \le 2\log_2 m$. Reordering the product in (11.49),
\[
\binom{m-(i-1)-a_i}{S-(i-1)} \Big/ A
\le \prod_{j=0}^{i-2} \frac{S-j}{m-1-j} \prod_{j=1}^{a_i-1} \frac{m-S-j}{m-(i-1)-j}
\le 1/2^{i-1}
\]
using the assumption $S < m/2$, and noting that the term considered is zero unless $S \ge i - 1$. The above inequality is true for all i, so in particular the summation over i satisfying $i > 2\log_2 m$ is bounded by 4. Next consider $i \le 2\log_2 m$. For $a_i \ge 2$ the inequality
\[
S \ge m\left(\frac{\log a_i}{a_i - 1}\right) + 2\log_2 m \tag{11.50}
\]
implies $S \ge \left(1 - e^{-\frac{\log a_i}{a_i-1}}\right)m - 1 + e^{-\frac{\log a_i}{a_i-1}}\, i$, which is equivalent to $a_i\left(\frac{m-S-1}{m-(i-1)-1}\right)^{a_i-1} \le 1$, which clearly holds also for $a_i = 1$. Hence,
\[
a_i \binom{m-(i-1)-a_i}{S-(i-1)} \Big/ A
\le a_i \binom{m-(i-1)-a_i}{S-(i-1)} \Big/ \binom{m-1}{S}
= a_i \prod_{j=0}^{i-2} \frac{S-j}{m-1-j} \prod_{j=1}^{a_i-1} \frac{m-S-j}{m-(i-1)-j}
\le \frac{1}{2^{i-1}}\, a_i \left(\frac{m-S-1}{m-(i-1)-1}\right)^{a_i-1} \le \frac{1}{2^{i-1}},
\]
using the fact that $a_i\left(\frac{m-S-1}{m-(i-1)-1}\right)^{a_i-1} \le 1$. On the other hand, if $S < m\left(\frac{\log a_i}{a_i-1}\right) + 2\log_2 m$ then $a_i S/(m-1) \le C\log_2 m$, which implies
\[
a_i \binom{m-(i-1)-a_i}{S-(i-1)} \Big/ A
\le \frac{a_i S}{m-1} \prod_{j=1}^{i-2} \frac{S-j}{m-1-j} \prod_{j=1}^{a_i-1} \frac{m-S-j}{m-(i-1)-j}
\le C\log_2 m / 2^{i-2}.
\]
Hence the sum over i is bounded by some constant times $\log_2 m$. Combining the two cases we have that the right hand side of (11.48), and therefore $E(Q|S)$, is bounded by $C\log_2 m$. To complete the proof of the lemma, that is, to prove $E(Q|W) \le C(1 + |W|)$, we only need to show
\[
E(Q|S) \le C \quad\text{when } |W| \le \log_2 m, \tag{11.51}
\]
as when $|W| > \log_2 m$ we already have $E(Q|W) \le C\log_2 m \le C|W|$. In case 1 we have shown $E(Q|S)$ is bounded, and in case 2 that the contribution of the summands where $i > 2\log_2 m$ is bounded. Hence we need only consider summands where $i \le 2\log_2 m$. Note that $|W| \le \log_2 m$ implies $S \ge m/2 - \sqrt{m/4}\,\log_2 m$. When $a_i$, m are bigger than some universal constant, $m/2 - \sqrt{m/4}\,\log_2 m \ge m\left(\frac{\log a_i}{a_i-1}\right) + 2\log_2 m$, which implies that (11.50) holds. Hence, as in case 2, we have that $\left(\frac{m-S-1}{m-(i-1)-1}\right)^{a_i-1} \times a_i \le 1$ and $\binom{m-(i-1)-a_i}{S-(i-1)} \times a_i / A \le 1/2^{i-1}$. Summing, we see the contribution from the remaining terms is also bounded, completing the proof of (11.51), and the lemma.
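The factorization of the binomial ratio in (11.49), and its reordering in Case 2, can be checked in exact arithmetic. The following is a minimal sketch, assuming the helper names `ratio`, `product_case1` and `product_case2` (hypothetical, introduced here only for illustration) and restricting to $i - 1 \le S \le m - a_i$, the range on which the term is nonzero.

```python
from fractions import Fraction
from math import comb

def ratio(m, S, i, a):
    """binom(m-(i-1)-a, S-(i-1)) / binom(m-1, S) as an exact fraction."""
    return Fraction(comb(m - (i - 1) - a, S - (i - 1)), comb(m - 1, S))

def product_case1(m, S, i, a):
    """Ordering of the product displayed in (11.49)."""
    out = Fraction(1)
    for j in range(1, a):        # j = 1, ..., a - 1
        out *= Fraction(m - S - j, m - j)
    for j in range(0, i - 1):    # j = 0, ..., i - 2
        out *= Fraction(S - j, m - (a - 1) - 1 - j)
    return out

def product_case2(m, S, i, a):
    """Reordered product used in Case 2."""
    out = Fraction(1)
    for j in range(0, i - 1):    # j = 0, ..., i - 2
        out *= Fraction(S - j, m - 1 - j)
    for j in range(1, a):        # j = 1, ..., a - 1
        out *= Fraction(m - S - j, m - (i - 1) - j)
    return out

def check(max_m=12):
    """Verify both factorizations and the two power-of-two bounds."""
    for m in range(3, max_m + 1):
        for a in range(1, m):
            for i in range(1, m - a + 2):
                for S in range(i - 1, m - a + 1):
                    r = ratio(m, S, i, a)
                    assert r == product_case1(m, S, i, a)
                    assert r == product_case2(m, S, i, a)
                    if 2 * S >= m:   # Case 1
                        assert r <= Fraction(1, 2 ** (a - 1))
                    else:            # Case 2
                        assert r <= Fraction(1, 2 ** (i - 1))
    return True

check()
```

Using `Fraction` avoids any floating point doubt about the equalities; the check covers only small m and is an illustration, not a substitute for the argument above.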
Chapter 12
Multivariate Normal Approximation
In this chapter we consider multivariate normal approximation. We begin with the extension of the ideas in Sect. 4.8 on bounds for smooth functions, using the results in Sect. 2.3.4 which may be applied in the multivariate setting. The first goal is to develop smooth function bounds in $\mathbb{R}^p$. In Sect. 12.1 we obtain such bounds using multivariate size bias couplings, and in Sect. 12.3 by multivariate exchangeable pairs. In Sect. 12.4 we turn to local dependence, and bounds in the Kolmogorov distance. We consider applications of these results to questions in random graphs.

Generalizing notions from Sect. 4.8, for
\[
k = (k_1, \ldots, k_p) \in \mathbb{N}_0^p \quad\text{let}\quad |k| = \sum_{i=1}^p k_i,
\]
and for functions $h : \mathbb{R}^p \to \mathbb{R}$ whose partial derivatives
\[
h^{(k)}(x) = \frac{\partial^{k_1 + \cdots + k_p} h}{\partial^{k_1} x_1 \cdots \partial^{k_p} x_p}
\]
exist for all $0 \le |k| \le m$, and $\|\cdot\|$ the supremum norm, recall that $L^\infty_m(\mathbb{R}^p)$ is the collection of all functions $h : \mathbb{R}^p \to \mathbb{R}$ with
\[
\|h\|_{L^\infty_m(\mathbb{R}^p)} = \max_{0 \le |k| \le m} \|h^{(k)}\| \tag{12.1}
\]
finite. Now, for random vectors X and Y in $\mathbb{R}^p$, letting
\[
\mathcal{H}_{m,\infty,p} = \{h \in L^\infty_m(\mathbb{R}^p) : \|h\|_{L^\infty_m(\mathbb{R}^p)} \le 1\} \tag{12.2}
\]
define
\[
\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}_{m,\infty,p}} = \sup_{h \in \mathcal{H}_{m,\infty,p}} |Eh(X) - Eh(Y)|.
\]
For a vector, matrix, or more generally, any array $A = (a_\alpha)_{\alpha \in \mathcal{A}}$ with $\mathcal{A}$ finite, let
\[
\|A\| = \max_{\alpha \in \mathcal{A}} |a_\alpha|. \tag{12.3}
\]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_12, © Springer-Verlag Berlin Heidelberg 2011
12.1 Multivariate Normal Approximation via Size Bias Couplings
The following theorem gives a smooth function bound via multivariate size bias couplings.

Theorem 12.1 Let Y be a random vector in $\mathbb{R}^p$ with nonnegative components, mean $\mu = EY$, and invertible covariance matrix $\mathrm{Var}(Y) = \Sigma$. For each $i = 1, \ldots, p$ let $(Y, Y^i)$ be random vectors defined on a joint probability space such that $Y^i$ has the Y-size biased distribution in direction i, as in (2.68). Then, with Z a mean zero, covariance I normal vector in $\mathbb{R}^p$,
\[
\|\mathcal{L}(\Sigma^{-1/2}(Y - \mu)) - \mathcal{L}(Z)\|_{\mathcal{H}_{3,\infty,p}}
\le \frac{p^2}{2} \|\Sigma^{-1/2}\|^2 \sum_{i=1}^p \sum_{j=1}^p \mu_i \sqrt{\mathrm{Var}\, E\left[Y^i_j - Y_j \mid Y\right]}
+ \frac{1}{2}\,\frac{p^3}{3} \|\Sigma^{-1/2}\|^3 \sum_{i=1}^p \sum_{j=1}^p \sum_{k=1}^p \mu_i E\left|(Y^i_j - Y_j)(Y^i_k - Y_k)\right|. \tag{12.4}
\]
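Note that the theorem does not require the joint construction of $(Y^1, \ldots, Y^p)$.

Proof Given h with $\|h\|_{L^\infty_3(\mathbb{R}^p)} \le 1$, let f be the solution of (2.22) given by (2.21) and (2.20). Writing out the expressions in (2.22),
\[
E\left[h\left(\Sigma^{-1/2}(Y - \mu)\right) - Nh\right]
= E\left[\sum_{i=1}^p \sum_{j=1}^p \sigma_{ij} \frac{\partial^2}{\partial y_i \partial y_j} f(Y) - \sum_{i=1}^p (Y_i - \mu_i) \frac{\partial}{\partial y_i} f(Y)\right]. \tag{12.5}
\]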
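Recall from (2.68) that $Y^i$ is characterized by the fact that
\[
E Y_i G(Y) = \mu_i E G(Y^i) \tag{12.6}
\]
for all functions $G : \mathbb{R}^p \to \mathbb{R}$ for which the expectations exist. For the coordinate function $G(y) = y_j$, (12.6) gives
\[
\sigma_{ij} = \mathrm{Cov}(Y_i, Y_j) = E Y_i Y_j - \mu_i \mu_j = E \mu_i (Y^i_j - Y_j). \tag{12.7}
\]
Subtracting $\mu_i E G(Y)$ from both sides of (12.6), we obtain
\[
E(Y_i - \mu_i) G(Y) = \mu_i E\left[G(Y^i) - G(Y)\right]. \tag{12.8}
\]
Equation (12.5), and (12.8) with $G = (\partial/\partial y_i) f$, yield
\[
E\left[h\left(\Sigma^{-1/2}(Y - \mu)\right) - Nh\right]
= E\left[\sum_{i=1}^p \sum_{j=1}^p \sigma_{ij} \frac{\partial^2}{\partial y_i \partial y_j} f(Y) - \sum_{i=1}^p \mu_i \left(\frac{\partial}{\partial y_i} f(Y^i) - \frac{\partial}{\partial y_i} f(Y)\right)\right]. \tag{12.9}
\]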
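Taylor expanding $(\partial/\partial y_i) f(Y^i)$ about Y, with remainder in integral form, and simple calculations show that the right hand side of (12.9) equals
\[
-E \sum_{i=1}^p \sum_{j=1}^p \left[\mu_i (Y^i_j - Y_j) - \sigma_{ij}\right] \frac{\partial^2}{\partial y_i \partial y_j} f(Y)
- E \sum_{i=1}^p \sum_{j=1}^p \sum_{k=1}^p \mu_i \int_0^1 (1 - t) \frac{\partial^3}{\partial y_i \partial y_j \partial y_k} f\left(Y + t(Y^i - Y)\right) (Y^i_j - Y_j)(Y^i_k - Y_k)\, dt. \tag{12.10}
\]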
In the first term, we condition on Y, apply the Cauchy–Schwarz inequality and use (12.7), and then apply the bound (2.23) with k = 2 to obtain the first term in (12.4). The second term in (12.10) gives the second term in (12.4) by applying (2.23) with k = 3.
12.2 Degrees of Random Graphs In the classical Erdös and Rényi (1959b) random graph model (see also Bollobás 1985) for n ∈ N and ∈ (0, 1), K = Kn, is the random graph on the vertex set V = {1, . . . , n} with random edge set E where each pair of vertices has probability of being connected, independently of all other such pairs. For v ∈ V let 1{v,w}∈E , D(v) = w∈V
the degree of vertex v, and for d ∈ {0, 1, 2, . . .} let Xv where Xv = 1{D(v)=d} , Y= v∈V
the number of vertices with degree d. Karo´nski and Ruci´nski (1987) proved asymptotic normality of Y when n(d+1)/d → ∞ and n → 0, or n → ∞ and n − log n − d log log n → −∞; see also Palka (1984) and Bollobás (1985). Asymptotic normality when n → c > 0, was obtained by Barbour et al. (1989); see also Kordecki (1990) for the case d = 0, for nonsmooth h. Goldstein (2010b) gives a Berry–Esseen theorem for Y for all d by applying the size bias coupling in Bolthausen’s (1984) inductive method. Other univariate results on asymptotic normality of counts on random graphs, including counts of the type discussed in Theorems 12.2, are given in Janson and Nowicki (1991), and references therein. Based on the work of Goldstein and Rinott (1996) we consider the joint asymptotic normality of a vector of degree counts. For p ∈ N let di for i = 1, . . . , p be distinct, fixed nonnegative integers, and let Y ∈ Rp have ith coordinate Xvi where Xvi = 1{D(v)=di } , Yi = v∈V
the number of vertices of the graph with degree di . For simplicity we assume 0 < = c/(n − 1) < 1 in what follows, though the results below can be weakened to
cover the case nn → c > 0 as n → ∞. To keep track of asymptotic constants, for a sequence an and a sequence of positive numbers bn write an = (bn ) if lim supn→∞ |an |/bn ≤ 1. Theorem 12.2 If = n = c/(n − 1) for some c > 0 and Z ∈ Rp is a mean zero normal vector with identity covariance matrix, then L( −1/2 (Y − μ) − L(Z) ≤ n−1/2 (r1 + r2 ), (12.11) H 3,∞,p
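where
\[
r_1 = \frac{p^3 b}{2} \sum_{i=1}^p \beta_i \sqrt{24c + 48c^2 + 144c^3 + 48d_i^2 + 144cd_i^2 + 12}
\quad\text{and}\quad
r_2 = \frac{p^5 b^{3/2}}{3} \sum_{i=1}^p \beta_i \left(c + c^2 + (d_i + 1)^2\right),
\]
where the components $\mu_i$, $\sigma_{ij}$, $i, j = 1, \ldots, p$ of the mean vector $\mu = EY$ and covariance matrix $\Sigma = \mathrm{Var}(Y)$ respectively, are given by
\[
\mu_i = n\beta_i \quad\text{and}\quad
\sigma_{ij} = n\beta_i\beta_j \left[\frac{(d_i - c)(d_j - c)}{c(1 - c/(n-1))} - 1\right] + 1_{\{i=j\}}\, n\beta_i, \tag{12.12}
\]
and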
n − 1 di βi = (1 − )n−1−di di 1 . b= p minj βj (1 − i=1 βi )
and (12.13)
p Note that i=1 βi < 1 when {d1 , . . . , dp } = {0, 1, . . . , n − 1}, and then the quantities r1 and r2 are both of order O(1). Proof As for any v ∈ V the degree D(v) is the sum of n − 1 independent Bernoulli variables with success probability , we have D(v) ∼ Bin(n − 1, ). In particular, βi in (12.13) equals P D(v) = di = EXvi , yielding the expression for μi in (12.12). To calculate the covariance σij for i = j , with v = u write EXvi Xuj = E Xvi Xuj |{v, u} ∈ E + E Xvi Xuj |{v, u} ∈ / E (1 − ).
(12.14)
Given that there is an edge connecting v and u, Xvi Xuj = 1 if and only if v is connected to di − 1 vertices in V \ {u}, and u to dj − 1 vertices in V \ {v}, which are functions of independent Bernoulli variables. Hence n−2 n − 2 di +dj −2 (1 − )2n−2−di −dj E Xvi Xuj |{v, u} ∈ E = di − 1 dj − 1 di dj = βi βj 2 (n − 1)2 di dj = βi βj 2 . c Likewise, given that there is no edge between v and u, Xvi Xuj = 1 if and only if v is connected to di vertices in V \ {u}, and u to dj vertices in V \ {v}, and so n − 2 n − 2 di +dj /E = (1 − )2n−4−di −dj E Xvi Xuj |{v, u} ∈ di dj (n − 1 − di )(n − 1 − dj ) = βi βj (1 − )2 (n − 1)2 (n − 1 − di )(n − 1 − dj ) = βi βj . (n − 1 − c)2 Adding these expressions according to (12.14) yields di dj + c(n − 1) − cdi − cdj EXvi Xuj = βi βj . c(n − 1 − c) Now, multiplying by n2 − n, as Xvi Xvj = 0 for di = dj , we have di dj + c(n − 1) − cdi − cdj , EYi Yj = nβi βj c(1 − c/(n − 1)) and subtracting n2 βi βj yields (12.12) for i = j . When i = j the calculation is the same, but for the addition in the second moment of the expectation of n diagonal 2 =X . terms of the form Xvi vi We may write the covariance matrix more compactly as follows. Let 1/2 1/2 , b = β1 , . . . , βp 1/2 1/2 β1 (d1 − c) βp (dp − c) g= √ , ,..., √ c(1 − c/(n − 1)) c(1 − c/(n − 1)) and 1/2 1/2 D = diag β1 , . . . , βp , that is, the diagonal matrix whose diagonal elements are the components of b. Then it is not difficult to see that n−1 = D I + gg − bb D.
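The covariance formula (12.12) can be checked exactly on a small graph by enumerating every edge configuration and weighting each by its probability. The sketch below is illustrative only: the function names, and the choices n = 5, c = 1.2 and degrees (0, 1, 3), are assumptions made for the example, with the edge probability taken to be c/(n − 1) as in the text.

```python
from itertools import combinations, product
from math import comb

def beta(n, c, d):
    """P(Bin(n-1, c/(n-1)) = d), the probability a vertex has degree d."""
    q = c / (n - 1)
    return comb(n - 1, d) * q**d * (1 - q)**(n - 1 - d)

def sigma_formula(n, c, degs):
    """Covariance matrix of the degree counts as given by (12.12)."""
    b = [beta(n, c, d) for d in degs]
    denom = c * (1 - c / (n - 1))
    p = len(degs)
    return [[n * b[i] * b[j] * ((degs[i] - c) * (degs[j] - c) / denom - 1)
             + (n * b[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def sigma_enumerated(n, c, degs):
    """Same covariance matrix by brute-force enumeration of all graphs."""
    q = c / (n - 1)
    pairs = list(combinations(range(n), 2))
    p = len(degs)
    mean = [0.0] * p
    second = [[0.0] * p for _ in range(p)]
    for present in product((0, 1), repeat=len(pairs)):
        k = sum(present)
        weight = q**k * (1 - q)**(len(pairs) - k)
        deg = [0] * n
        for (u, v), e in zip(pairs, present):
            if e:
                deg[u] += 1
                deg[v] += 1
        y = [sum(1 for x in deg if x == d) for d in degs]
        for i in range(p):
            mean[i] += weight * y[i]
            for j in range(p):
                second[i][j] += weight * y[i] * y[j]
    return [[second[i][j] - mean[i] * mean[j] for j in range(p)]
            for i in range(p)]

formula = sigma_formula(5, 1.2, (0, 1, 3))
brute = sigma_enumerated(5, 1.2, (0, 1, 3))
max_gap = max(abs(formula[i][j] - brute[i][j])
              for i in range(3) for j in range(3))
```

With n = 5 there are only 2^10 graphs, so the enumeration is immediate; `max_gap` should be at the level of floating point rounding.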
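For nonnegative definite matrices A and B, write
\[
A \preceq B \quad\text{when } x^\top A x \le x^\top B x \text{ for all } x. \tag{12.15}
\]
It is clear that $D(I - bb^\top)D \preceq n^{-1}\Sigma$. Letting $\lambda_1(A) \le \cdots \le \lambda_p(A)$ be the eigenvalues of A in non-decreasing order, then, see e.g. Horn and Johnson (1985), $\lambda_k(D(I - bb^\top)D) \le \lambda_k(n^{-1}\Sigma)$. It is simple to verify that the eigenvalues of $B = I - bb^\top$ are 1, with multiplicity $p - 1$, and, corresponding to the eigenvector b, $\lambda_1(B) = 1 - b^\top b$. Now, by the Rayleigh-Ritz characterization of eigenvalues we obtain
\[
\lambda_1(DBD) = \min_{x \in \mathbb{R}^p,\, x \ne 0} \frac{x^\top DBD x}{x^\top x}
= \min_{y \in \mathbb{R}^p,\, y \ne 0} \frac{y^\top B y}{y^\top D^{-2} y}
\ge \frac{\lambda_1(B)}{\lambda_p(D^{-2})}.
\]
Hence
\[
\lambda_1(n^{-1}\Sigma) \ge \min_j \beta_j \left(1 - \sum_{i=1}^p \beta_i\right) = b^{-1},
\]
and
\[
\|\Sigma^{-1/2}\| \le \lambda_p(\Sigma^{-1/2}) = \frac{n^{-1/2}}{\lambda_1((n^{-1}\Sigma)^{1/2})} \le n^{-1/2} b^{1/2}. \tag{12.16}
\]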
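The eigenvalue bound leading to (12.16) says that every Rayleigh quotient of $n^{-1}\Sigma$ is at least $b^{-1} = \min_j \beta_j (1 - \sum_i \beta_i)$. A rough numerical sanity check, assuming the hypothetical helper `smallest_rayleigh_quotient` and illustrative parameters n = 50, c = 2 and degrees (0, 1, 2, 3), samples random directions and compares the smallest quotient found with the bound. Random sampling can only fail to contradict the bound; it does not prove it.

```python
import random
from math import comb

def smallest_rayleigh_quotient(n, c, degs, trials=2000, seed=0):
    """Return (smallest sampled x'Mx / x'x, lower bound 1/b) for M = Sigma/n."""
    q = c / (n - 1)                       # edge probability c/(n-1)
    beta = [comb(n - 1, d) * q**d * (1 - q)**(n - 1 - d) for d in degs]
    denom = c * (1 - c / (n - 1))
    p = len(degs)
    # Entries of Sigma/n taken from (12.12).
    M = [[beta[i] * beta[j] * ((degs[i] - c) * (degs[j] - c) / denom - 1)
          + (beta[i] if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b_inv = min(beta) * (1 - sum(beta))   # lower bound on lambda_1(Sigma/n)
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        quad = sum(x[i] * M[i][j] * x[j] for i in range(p) for j in range(p))
        worst = min(worst, quad / sum(v * v for v in x))
    return worst, b_inv

worst, b_inv = smallest_rayleigh_quotient(50, 2.0, (0, 1, 2, 3))
```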
To apply Theorem 12.1, for all i ∈ {1, . . . , p} we need to couple Y to a vector Yi having the size bias distribution of Y in direction i. Let A = {vi, v ∈ V, i = 1, . . . , p} so that X = {Xvi , v ∈ V, i = 1, . . . , p} = {Xα , α ∈ A}. We will apply Proposition 2.2 to yield Yi from Xα for α ∈ A. To achieve Xα for α ∈ A, we follow the outline given after Proposition 2.2. First we generate Xαα from the Xα -size bias distribution. Since Xα is a nontrivial Bernoulli variable, we have Xαα = 1. Then we must generate the remaining variables with distribution L(Xβα |Xαα = 1). That is, for α = vi, say, we need to have D(v) = di , the degree of v equal to di , and the remaining variables so conditioned. We can achieve such variables as follows. If D(v) > di let K vi be the graph obtained by removing D(v) − di edges from K, selected uniformly from the D(v) edges of v. If D(v) < di let K vi be the graph obtained by adding di − D(v) edges of the form {v, u} to K, where the vertices u are selected uniformly from the n − 1 − D(v) vertices not connected to v. If D(v) = di let K vi = K. Using exchangeability, it is easy to see that the distribution of the graph K vi is the conditional distribution of K given that the degree of v is di . Now, for j = 1, . . . , p letting Xα . Bj = {vj : v ∈ V} we may write Yj = α∈Bj
By Proposition 2.2, to construct Yji , we first choose a summand of Yj according to the distribution given in (2.71), that is, proportional to its expectation. As EXvj is constant and |Bj | = n, we set P (V = v) = 1/n, so that V uniform over V, with V i be the indicator that vertex v has degree d V independent of K. Then letting Xvj j V i in K , Proposition 2.2 yields that the vector Yi with components Yji =
n
Vi Xvj ,
j = 1, . . . , p
v=1
has the Y-size biased distribution in direction i. In other words, for the given i, one vertex of K is chosen uniformly to have edges added or removed as necessary in order for it to have degree di , and then Yji counts the number of vertices of degree dj in the graph that results. We now proceed to obtain a bound for the last term in (12.4) of Theorem 12.1. Note that since exactly |D(V ) − di | edges are either added or removed from K to form K V i , and that the vertex degrees can only change on vertices incident to these edges and on vertex V itself, we have i Y − Yj ≤ D(V ) − di + 1. j This upper bound is achieved, for example, when i = j, di < dj and the degree of V and the degrees of all the D(V ) vertices connected to V have degree dj . Hence, as D(V ) ∼ Bin(n − 1, ), and = c/(n − 1), 2 E Yji − Yj Yki − Yk ≤ E D(V ) − di + 1 ≤ 2E D 2 (V ) + (di + 1)2 ≤ 2 (n − 1) + (n − 1)2 2 + (di + 1)2 = 2 c + c2 + (di + 1)2 . Now, considering the last term in (12.4), since the bound above depends only on i, applying (12.16) and that μi = nβi from (12.12), we obtain p p p 1 p3 −1/2 3 μi E Yji − Yj Yki − Yk 2 3
≤
i=1 j =1 k=1 p p 5 −1/2 3/2 b βi c + c2 n 3 i=1
+ (di + 1)2 ,
yielding the term r2 in the bound (12.11). Since Y is measurable with respect to K, following (4.143) we obtain the upper bound
Var E Yji − Yj |Y ≤ Var E Yji − Yj |K , and will demonstrate
Var E Yji − Yj |K = n−1 24c + 48c2 + 144c3 + 48di2 + 144cdi2 + 12
(12.17)
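The coupling used above, in which a vertex is forced to have degree $d_i$ by removing or adding uniformly chosen edges, can be sketched directly, together with the deterministic bound that each degree count changes by at most $|D(V) - d_i| + 1$. The function names below are hypothetical and the graph is stored as a dictionary of neighbour sets; this is an illustration under those assumptions, not the authors' implementation.

```python
import random

def random_graph(n, prob, rng):
    """Erdos-Renyi graph on vertices 0..n-1 as a dict of neighbour sets."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < prob:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def force_degree(adj, v, d, rng):
    """Copy of adj in which v has degree d, obtained by removing uniformly
    chosen edges at v, or adding edges to uniformly chosen non-neighbours."""
    new = {u: set(nb) for u, nb in adj.items()}
    excess = len(new[v]) - d
    if excess > 0:
        for u in rng.sample(sorted(new[v]), excess):
            new[v].remove(u)
            new[u].remove(v)
    elif excess < 0:
        candidates = [u for u in new if u != v and u not in new[v]]
        for u in rng.sample(candidates, -excess):
            new[v].add(u)
            new[u].add(v)
    return new

def degree_count(adj, d):
    """Number of vertices with degree exactly d."""
    return sum(1 for nb in adj.values() if len(nb) == d)

rng = random.Random(1)
graph = random_graph(20, 0.15, rng)
V = rng.randrange(20)                    # uniformly chosen vertex
coupled = force_degree(graph, V, 3, rng)
changes = [abs(degree_count(coupled, d) - degree_count(graph, d))
           for d in range(20)]
```

Only the endpoints of the changed edges and the vertex V itself can change degree, which is why every entry of `changes` is bounded by `abs(len(graph[V]) - 3) + 1`.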
Then, for the first term in (12.4), again applying (12.12) to give μi = nβi , and (12.16), we obtain p p
p2 −1/2 2 μi Var E Yji − Yj | Y 2
≤ n−1/2
i=1 j =1 n
p3 b 2
βi
24c + 48c2 + 144c3 + 48di2 + 144cdi2 + 12 ,
i=1
yielding r1 . To obtain (12.17) we first condition on V = v. Recalling V is uniform and letting | · | denote cardinality, in this way we obtain
E Yji − Yj |K 1 D(u) = dj + 1 − D(u) = dj D(v) − di = n D(v) u: {u,v}∈E v: D(v)>di
+
1 n
D(u) = dj − 1 − D(u) = dj
u=v, {u,v}∈ / E , D(v) di , then i − X = 1 if {u, v} ∈ E , D(u) = d + 1, and {u, v} is one of the d − D(v) Xuj uj j i edges removed at v at random, chosen with probability (D(v) − di )/D(v). Note that the factor 1/n multiplies all terms in (12.18), which provides a factor of 1/n2 in the variance. Breaking the two sums into two separate sums, so that six terms result, we will bound the variance of each term separately and then apply the bound k k Var Uj ≤ k Var(Uj ) (12.19) ×
j =1
j =1
for k = 6. The first term of (12.18) yields two sums, both of the form D(v) − di 1A (u, v) D(v) u,v for A = (u, v): {u, v} ∈ E: D(v) > di , D(u) = dj + a , the first with a = 1, and the second with a = 0. We show D(v) − di Var 1A (u, v) = 2cn + 4c2 n + 12c3 n . D(v) u,v
(12.20)
(12.21)
To calculate this variance requires the consideration of terms all of the form D(v ) − di D(v) − di Cov 1A (u, v) , 1A u , v . (12.22) D(v) D(v ) Let N be the number of distinct vertices among u, u , v, v . From the definition of A in (12.20), and that no edge connects a vertex to itself, we see that we need only consider cases where u = v and u = v . Hence N may only take on the values 2, 3 and 4, leading to the three terms in (12.21). There are two cases for N = 2. The n(n − 1) diagonal variance terms with (u, v) = (u , v ) can be bounded by their second moments as D(v) − di D(v) − di 2 ≤ E 1A (u, v) Var 1A (u, v) D(v) D(v) ≤ P {u, v} ∈ E c , = n−1 leading to a factor of n(c). Handling the case (u, v) = (v , u ) in the same manner gives an overall contribution of 2n(c) for the case N = 2, and the first term in (12.21). For N = 3 there are four subcases, all of which may be handled in a similar way. Consider, for example, the case u = u , v = v . Using the inequality Cov(X, Y ) ≤ EXY , valid for nonnegative X and Y , we obtain D(v ) − di D(v) − di Cov 1A (u, v) , 1A u, v D(v) D(v ) ≤ P {u, v} ∈ E, {u, v } ∈ E = c2 /(n − 1)2 . Handling the three other cases similarly and noting that the total number of N = 3 terms is no more than 4n3 leads to a contribution of 4n(c2 ) from the case N = 3 and the second term in (12.21). In the case N = 4 the vertices u, u , v, v are distinct, and we have D(v) − di D(v ) − di , 1A (u , v ) Cov 1A (u, v) D(v) D(v ) D(v) − di D(v ) − di (12.23) 1A (u , v ) − β 2, = E 1A (u, v) D(v) D(v ) where
D(v) − di β = E 1A (u, v) D(v) (D(v) − di )+ . = E 1{{u,v}∈E } 1{D(u)=dj +a} D(v)
With C the event that
{u, u }, {v, v }, {u, v }, {u , v} ∩ E = ∅
(12.24)
we have P (C) = (1 − c/(n − 1))4 = 1 − 4(c/n). This estimate implies, noting that the events {u, v}, {u , v } ∈ E each have probability c/(n − 1) and are independent of C, that D(v) − di D(v ) − di E 1A (u, v) 1A (u , v ) D(v) D(v ) 3 D(v) − di c D(v ) − di = E 1A (u, v) 1A (u , v ) C P (C) + 4 3 D(v) D(v ) n 3 c 2 ≤ α + 4 3 , (12.25) n with
(D(v) − di )+ α = E 1{{u,v}∈E } 1{D(u)=dj +a} C , D(v)
where in the last inequality we used the conditional independence given C of the events indicated, for (u, v) and (u , v ). Bounding both α and β by the probability that {u, v} ∈ E , an event independent of C, we bound the covariance term (12.23) as 3 3 2 α + 4 c − β 2 = (α + β)(α − β) + 4 c n3 n3 3 c c + 4 3 . (12.26) ≤ 2|α − β| n n To handle α − β, letting R = {u, v} ∈ E ,
S = 1{D(u)=dj +a}
and T =
(D(v) − di )+ , D(v)
we have α − β = E[1R ST |C] − E[1R ST ] = E[ST |CR]
P (RC) − E[ST |R]P (R). P (C)
As R and C are independent and P (R) = c/(n − 1), |α − β| = E[ST |CR] − E[ST |R](c/n). Since S and T are conditionally independent given R or given CR, we have |α − β| = E[S|CR]E[T |CR] − E[S|R]E[T |R](c/n). Let X, Y ∼ Binomial(n − 4, ) and X , Y ∼ Binomial(n − 2, ), all independent. In α, conditioning on CR, D(u) − 1 and D(v) − 1 are equal in distribution to X and Y respectively; in β, conditioning on R, the same variables are distributed as X , Y . Hence,
(Y + 1 − di )+ |α − β| = E1{X=dj +a−1} E Y +1 (Y + 1 − di )+ c − E1{X =dj +a−1} E . Y +1 n
(12.27)
Next, note |E1{X=dj +a−1} − E1{X =dj +a−1} | = 2(c/n) and E (Y + 1 − di )+ − E (Y + 1 − di )+ = 2(c/n), Y +1 Y +1 which can be easily understood by defining X and X jointly, with X = X + ξ , with ξ ∼ Binomial(2, ), independently of X, so that P (X = X ) = 2(c/n), and constructing Y and Y similarly. Hence, by (12.27), 2 c |α − β| = 4 2 , n 3
and the N = 4 covariance term (12.26) is 12( nc 3 ). As there are no more than n4 where u, u , v, v are all distinct, their total contribution is (12c3 n), yielding the final term in (12.21). We apply a similar argument to the third and fourth terms arising from (12.18), both of the form di − D(v) 1D (u, v) n − 1 − D(v) u=v (12.28) for D = (u, v): {u, v} ∈ / E: D(v) < di , D(u) = dj − a , the first with a = 1, the second with a = 0, and show di − D(v) 1D (u, v) = 2di2 + 4di2 n + 12cdi2 n . Var n − 1 − D(v)
(12.29)
u=v
With N again counting the number of distinct indices among u, u , v, v , for the cases N = 2 and N = 3 it will suffice to apply the inequality di di − D(v) . ≤ 1{D(v) 0, which depends only on p, such that for all t ∈ (0, 1) sup hd(Q − ): h ∈ H p R √ ≤ C sup hd(Q − )t : h ∈ H + a t , Rp
where a is any constant that satisfies (12.55). For $f : \mathbb{R}^p \to \mathbb{R}$, let $\nabla f$ and $D^2 f$ denote the gradient and Hessian matrix of f, respectively. Consider the multivariate Stein equation (2.22) with $\mu = 0$ and $\Sigma = I$, that is,
\[
\mathrm{Tr}\, D^2 f(w) - w \cdot \nabla f(w) = h(w) - Nh. \tag{12.67}
\]
By Götze (1991), the function
\[
f_t(x) = -\frac{1}{2} \int_t^1 \left[h_s(x) - Nh\right] \frac{ds}{1 - s} \tag{12.68}
\]
solves the Stein equation (12.67) for $h_t$. Again by Götze (1991), when $|h| \le 1$, there exists a constant C such that
\[
\|\nabla f_t\| \le C \quad\text{and}\quad \|D^2 f_t\| \le C \log t^{-1}. \tag{12.69}
\]
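Setting $K_i = X_i U_i^\top$, by (12.67),
\[
E h_t(W) - Nh = E\left[\mathrm{Tr}\, D^2 f_t(W) - W \cdot \nabla f_t(W)\right] = A - B - C + D,
\]
where
\[
A = E\, \mathrm{Tr}\left[D^2 f_t(W)\left(I - \sum_{i=1}^n K_i\right)\right], \qquad
B = \sum_{i=1}^n E\left[X_i \cdot \nabla f_t(V_i)\right],
\]
\[
C = \sum_{i=1}^n E\left[X_i \cdot \left(\nabla f_t(W) - \nabla f_t(V_i) - D^2 f_t(V_i) U_i\right)\right] \quad\text{and}\quad
D = \sum_{i=1}^n E\, \mathrm{Tr}\left[K_i \left(D^2 f_t(W) - D^2 f_t(V_i)\right)\right]. \tag{12.70}
\]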
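The next lemma is used to bound Taylor series remainder terms arising from the decomposition of terms (12.70). We will let $f^{(3)}_{t(jkl)}(x)$ denote the third derivative of $f_t$ at x, with respect to $x_j, x_k, x_l$, with a similar notation for the partial derivatives of the normal density $\phi(z)$, and derivatives of lower orders.

Lemma 12.2 Let W, V and U be any random vectors in $\mathbb{R}^p$ satisfying $W = V + U$, and let Y be any random variable. Suppose there exist constants $C_1$ and $C_2$ such that $|U| \le C_1$ and $|Y| \le C_2$. Set
\[
\kappa = \sup\left\{|Eh(W) - Nh| : h \in \mathcal{H}\right\}. \tag{12.71}
\]
Then there exists a constant C, depending only on p, such that for all $\tau \in [0, 1]$ and $h \in \mathcal{H}$,
\[
\left|E Y f^{(3)}_{t(jkl)}(V + \tau U)\right| \le C C_2 \left(\kappa/\sqrt{t} + aC_1/\sqrt{t} + a|\log t|\right),
\]
where a satisfies (12.55).

Proof By replacing $h_t$ by $h_t - Nh$ we may assume $Nh_t = 0$. Differentiation of (12.66), using a change of variable, and (12.68), yield
\[
f^{(3)}_{t(jkl)}(x) = \frac{1}{2} \int_t^1 \frac{(1-s)^{1/2}}{s^{3/2}}\, ds \int_{\mathbb{R}^p} h\left(\sqrt{s}\,z + \sqrt{1-s}\,x\right) \phi^{(3)}_{jkl}(z)\, dz.
\]
Observe that
\[
\int_{\mathbb{R}^p} \phi^{(3)}_{jkl}(z)\, dz = \frac{\partial^3}{\partial_j \partial_k \partial_l} \int_{\mathbb{R}^p} \phi(z + x)\, dz \Big|_{x=0} = \frac{\partial^3}{\partial_j \partial_k \partial_l} 1 \Big|_{x=0} = 0. \tag{12.72}
\]
Now,
\[
\left|E Y f^{(3)}_{t(jkl)}(V + \tau U)\right|
= \left|\frac{1}{2} \int_t^1 \frac{(1-s)^{1/2}}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E Y h\left(\sqrt{s}\,z + \sqrt{1-s}\,(V + \tau U)\right) \phi^{(3)}_{jkl}(z)\, dz\right|
\]
\[
= \left|\frac{1}{2} \int_t^1 \frac{(1-s)^{1/2}}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E Y h\left(\sqrt{1-s}\,W - \sqrt{1-s}\,(1-\tau)U + \sqrt{s}\,z\right) \phi^{(3)}_{jkl}(z)\, dz\right|
\]
\[
= \left|\frac{1}{2} \int_t^1 \frac{(1-s)^{1/2}}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E Y \left[h\left(\sqrt{1-s}\,W - \sqrt{1-s}\,(1-\tau)U + \sqrt{s}\,z\right) - h\left(\sqrt{1-s}\,W - \sqrt{1-s}\,(1-\tau)U\right)\right] \phi^{(3)}_{jkl}(z)\, dz\right| \tag{12.73}
\]
\[
\le \frac{1}{2} \int_t^1 \frac{1}{s^{3/2}}\, ds\; C_2 \int_{\mathbb{R}^p} E\left[\sup_{|u| \le C_1 + \sqrt{s}|z|} h\left(\sqrt{1-s}\,W + u\right) - \inf_{|u| \le C_1 + \sqrt{s}|z|} h\left(\sqrt{1-s}\,W + u\right)\right] \left|\phi^{(3)}_{jkl}(z)\right| dz
\]
\[
= \frac{C_2}{2} \int_t^1 \frac{1}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E\, \tilde{h}\left(\sqrt{1-s}\,W;\, C_1 + \sqrt{s}|z|\right) \left|\phi^{(3)}_{jkl}(z)\right| dz, \tag{12.74}
\]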
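where we have used (12.72) to obtain (12.73), and recalled the definition of $\tilde{h}$ in (5.28). Let Z denote an independent standard normal vector in $\mathbb{R}^p$. Adding and subtracting, the quantity (12.74) equals
\[
\frac{C_2}{2} \int_t^1 \frac{1}{s^{3/2}}\, ds \int_{\mathbb{R}^p} \left\{E\left[\tilde{h}\left(\sqrt{1-s}\,W;\, C_1 + \sqrt{s}|z|\right) - \tilde{h}\left(\sqrt{1-s}\,Z;\, C_1 + \sqrt{s}|z|\right)\right] + E\, \tilde{h}\left(\sqrt{1-s}\,Z;\, C_1 + \sqrt{s}|z|\right)\right\} \left|\phi^{(3)}_{jkl}(z)\right| dz. \tag{12.75}
\]
Again in view of definition (5.28) of $\tilde{h}$, for any $\epsilon > 0$,
\[
\left|E\, \tilde{h}\left(\sqrt{1-s}\,W; \epsilon\right) - E\, \tilde{h}\left(\sqrt{1-s}\,Z; \epsilon\right)\right|
\le \left|E\left[h^+\left(\sqrt{1-s}\,W; \epsilon\right) - h^+\left(\sqrt{1-s}\,Z; \epsilon\right)\right]\right|
+ \left|E\left[h^-\left(\sqrt{1-s}\,W; \epsilon\right) - h^-\left(\sqrt{1-s}\,Z; \epsilon\right)\right]\right|. \tag{12.76}
\]
By the closure conditions on the class $\mathcal{H}$ and the definition (12.71) of $\kappa$, we see that for any $\epsilon > 0$ the expression (12.76) is bounded by $2\kappa$. As
\[
\int_t^1 \frac{1}{s^{3/2}}\, ds \le \frac{C}{\sqrt{t}},
\]
we conclude, for some C,
\[
\left|\int_t^1 \frac{1}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E\left[\tilde{h}\left(\sqrt{1-s}\,W;\, C_1 + \sqrt{s}|z|\right) - \tilde{h}\left(\sqrt{1-s}\,Z;\, C_1 + \sqrt{s}|z|\right)\right] \left|\phi^{(3)}_{jkl}(z)\right| dz\right| \le C\kappa/\sqrt{t}. \tag{12.77}
\]
Turning now to the last term of (12.75), by (12.55),
\[
E\, \tilde{h}\left(\sqrt{1-s}\,Z;\, C_1 + \sqrt{s}|z|\right) \le a\left(C_1 + \sqrt{s}|z|\right).
\]
Hence
\[
\int_t^1 \frac{1}{s^{3/2}}\, ds \int_{\mathbb{R}^p} E\, \tilde{h}\left(\sqrt{1-s}\,Z;\, C_1 + \sqrt{s}|z|\right) \left|\phi^{(3)}_{jkl}(z)\right| dz
\le a \int_t^1 \frac{1}{s^{3/2}}\, ds \int_{\mathbb{R}^p} \left(C_1 + \sqrt{s}|z|\right) \left|\phi^{(3)}_{jkl}(z)\right| dz
\le Ca\left(C_1/\sqrt{t} + |\log t|\right).
\]
Lemma 12.2 now follows by collecting terms.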
We are now ready for the proof of Theorem 12.8.

Proof Consider the decomposition in (12.70), starting with the term C. Let $X_{ij}$ and $U_{ij}$ denote the jth components of $X_i$ and $U_i$, respectively. For $i = 1, \ldots, n$, Taylor expansion of $\nabla f_t(W)$ about $V_i$ shows that C equals
\[
\sum_{i=1}^n E \int_0^1 (1 - \tau) \sum_{j=1}^p \sum_{k=1}^p \sum_{l=1}^p X_{ij} U_{ik} U_{il}\, f^{(3)}_{t(jkl)}(V_i + \tau U_i)\, d\tau. \tag{12.78}
\]
Applying Lemma 12.2 for each i, with $U = U_i$ and $Y = X_{ij} U_{ik} U_{il}$, recalling $|X_i| \le B$ and $|U_i| \le A_1$, we obtain
\[
|C| \le CnA_1^2 B\left(\kappa/\sqrt{t} + aA_1/\sqrt{t} + a|\log t|\right). \tag{12.79}
\]
Next consider the term D in (12.70). A first order Taylor expansion yields
\[
f^{(2)}_{t(jk)}(W) - f^{(2)}_{t(jk)}(V_i) = \sum_{l=1}^p \int_0^1 f^{(3)}_{t(jkl)}(V_i + \tau U_i)\, U_{il}\, d\tau. \tag{12.80}
\]
The term D is obtained by multiplying (12.80) by the entries of $K_i$, and it is easy to see from their definition that this leads to a term which is similar to (12.78), allowing us to conclude that |D| is bounded by the right hand side of (12.79), with a possibly different constant. Next, note that we may write B of (12.70) as
\[
B = \sum_{i=1}^n E\left[\nabla f_t(V_i) \cdot E(X_i | V_i)\right].
\]
By the bound (12.69) on the solution to (12.67), the components of $\nabla f_t(V_i)$ are uniformly bounded, implying that for some positive constant C
\[
|B| \le C \sum_{i=1}^n \sum_{j=1}^p E\left|E(X_{ij} | V_i)\right|. \tag{12.81}
\]
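Finally, consider A. With $\delta_{jk} = 1$ when $j = k$ and 0 otherwise, we have
\[
\mathrm{Tr}\left[D^2 f_t(W)\left(I - \sum_{i=1}^n K_i\right)\right]
= \sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W)\left(\delta_{jk} - \sum_{i=1}^n X_{ik} U_{ij}\right)
\]
\[
= \sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W) \times \left\{\delta_{jk} - \sum_{i=1}^n E(X_{ik} U_{ij}) + \sum_{i=1}^n E(X_{ik} U_{ij}) - \sum_{i=1}^n X_{ik} U_{ij}\right\}. \tag{12.82}
\]
By the bound (12.69), we have that $|D^2 f_{t(jk)}(W)| \le C\log(t^{-1})$ for all $j, k = 1, \ldots, p$ and $t \in (0, 1)$. Hence, for the expectation of the first two terms of the last line of (12.82), we obtain the bound
\[
\left|E \sum_{j=1}^p \sum_{k=1}^p D^2 f_{t(jk)}(W)\left(\delta_{jk} - \sum_{i=1}^n E(X_{ik} U_{ij})\right)\right|
\le C|\log t| \sum_{j=1}^p \sum_{k=1}^p \left|\delta_{jk} - \sum_{i=1}^n E(X_{ik} U_{ij})\right|. \tag{12.83}
\]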
Now write the expression involving the last two terms in the last line of (12.82) in the form
\[
\sum_{j=1}^p \sum_{k=1}^p \left[D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i) + D^2 f_{t(jk)}(T_i)\right] \times \left[E(X_{ik} U_{ij}) - X_{ik} U_{ij}\right]. \tag{12.84}
\]
Taylor expansion of the difference $D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i)$, and Lemma 12.2 applied for each i with $U = R_i$ and $Y = R_{il}(X_{ik} U_{ij} - E(X_{ik} U_{ij}))$, imply
\[
\left|\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\left[\left(D^2 f_{t(jk)}(W) - D^2 f_{t(jk)}(T_i)\right)\left(E(X_{ik} U_{ij}) - X_{ik} U_{ij}\right)\right]\right|
\le CnA_1 A_2 B\left(\kappa/\sqrt{t} + aA_2/\sqrt{t} + a|\log t|\right). \tag{12.85}
\]
Returning to (12.84), we apply (12.69) to bound the last term by
\[
\left|\sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\left[D^2 f_{t(jk)}(T_i)\left(E(X_{ik} U_{ij}) - X_{ik} U_{ij}\right)\right]\right|
\le C|\log t| \sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\left|E(X_{ik} U_{ij}) - E(X_{ik} U_{ij} | T_i)\right|. \tag{12.86}
\]
Combining Lemma 12.1, the decomposition (12.70), and the bounds (12.79), (12.81), (12.83), (12.85) and (12.86), noting that since $A_1 \le A_2$ the term (12.79) may be ignored, being of smaller order than (12.85), we obtain
\[
\kappa \le CnA_1 A_2 B\kappa/\sqrt{t} + CnaA_1 A_2 B\left(A_2/\sqrt{t} + |\log t|\right)
+ C \sum_{i=1}^n \sum_{j=1}^p E\left|E(X_{ij} | V_i)\right|
\]
\[
+\; C|\log t| \left(\sum_{j=1}^p \sum_{k=1}^p \left|\delta_{jk} - \sum_{i=1}^n E(X_{ij} U_{ik})\right|
+ \sum_{i=1}^n \sum_{j=1}^p \sum_{k=1}^p E\left|E(X_{ij} U_{ik}) - E(X_{ij} U_{ik} | T_i)\right|\right) + Ca\sqrt{t}. \tag{12.87}
\]
Setting $\sqrt{t} = 2CnA_1 A_2 B$, provided it is less than 1, simple manipulations yield (12.65) after observing that the last term in (12.87) is of lower order than the second term. If $t > 1$ for the choice above, then by enlarging C in (12.65) as necessary, the theorem is trivial.
Chapter 13
Non-normal Approximation
Though the principal theme of this book concerns the normal distribution, in this chapter we explore how Stein’s method can be applied to approximations by non-normal distributions as well. There are already many well known distributions other than the normal where Stein’s method works, the Poisson case being the most notable. Here we focus on approximation by continuous distributions where an analysis parallel to that for the normal, such as the method of exchangeable pairs, may proceed. Denoting the random variable of interest as $W = W_n$, it may be the case that appropriate approximating or limiting distributions of $W_n$ are not known a priori. In this chapter, following Chatterjee and Shao (2010) we first develop a method of exchangeable pairs which identifies an appropriate approximating distribution, and which obtains $L^1$ and Berry–Esseen type bounds for that approximation. As applications, in Sect. 13.3 we obtain error bounds of order $1/\sqrt{n}$ in both the $L^1$ and Kolmogorov distance for the non-central limit theorem for the magnetization in the Curie–Weiss model at the critical temperature. In Sect. 13.4, and also using different methods, we derive bounds for approximations by the exponential distribution, and, following Chatterjee et al. (2008), apply the results to the spectrum of the Bernoulli–Laplace diffusion model, and following Peköz and Röllin (2009), to first passage times of Markov chains.
13.1 Stein’s Method via the Density Approach
One way of looking at Stein’s characterization for the normal is the following. Since the standard normal density function $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ satisfies
\[
\phi'(z) + z\phi(z) = 0, \quad\text{we have}\quad \frac{\phi'(z)}{\phi(z)} = -z, \tag{13.1}
\]
and integration by parts now yields a kind of ‘dual’ equation for the distribution of Z having density $\phi$, that is, the Stein characterization
\[
E\left[f'(Z) - Zf(Z)\right] = 0
\]
holding for all functions for which the expectations above exist. We will now see that a number of arguments used for the normal hold more generally for distributions with density $p(y)$ when replacing the ratio $-z$ in (13.1) by $p'(y)/p(y)$. We will consider approximations by the distribution of Y, a random variable with probability density function p satisfying the following condition.

Condition 13.1 For some $-\infty \le a < b \le \infty$, the density function p is strictly positive and absolutely continuous over the interval $(a, b)$, zero on $(a, b)^c$, and possesses a right-hand limit $p(a+)$ at a and a left-hand limit $p(b-)$ at b. Furthermore, the derivative $p'$ of p satisfies
\[
\int_a^b |p'(y)|\, dy < \infty. \tag{13.2}
\]
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_13, © Springer-Verlag Berlin Heidelberg 2011
The key step for applying the Stein method for approximation by the distribution of Y is the development of a Stein identity and the derivation of bounds on solutions to the Stein equation.
13.1.1 The Stein Characterization and Equation
Let Y have density p satisfying Condition 13.1. Then, letting f be an absolutely continuous function satisfying $f(a+) = f(b-) = 0$, whenever the expectations below exist, on the interval $(a, b)$ we have
\[
E\left[f'(Y) + f(Y)p'(Y)/p(Y)\right] = E\left[\left(f(Y)p(Y)\right)'/p(Y)\right]
= \int_a^b \left(f(y)p(y)\right)' dy = f(b-)p(b-) - f(a+)p(a+) = 0, \tag{13.3}
\]
that is,
\[
E\left[f'(Y) + f(Y)p'(Y)/p(Y)\right] = 0. \tag{13.4}
\]
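Identity (13.4) is easy to confirm numerically for a concrete density. The sketch below assumes, purely for illustration, the density $p(y) = 6y(1 - y)$ on $(0, 1)$, which satisfies Condition 13.1, and the test function $f(y) = \sin(\pi y)$, which vanishes at both endpoints; neither choice comes from the text. The expectation is written as the integral of $f'p + fp'$ to avoid dividing by p at the endpoints.

```python
from math import sin, cos, pi

def simpson(g, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3

def p(y):            # illustrative density on (0, 1)
    return 6 * y * (1 - y)

def dp(y):           # its derivative p'
    return 6 - 12 * y

def f(y):            # test function with f(0) = f(1) = 0
    return sin(pi * y)

def df(y):           # its derivative f'
    return pi * cos(pi * y)

# E[f'(Y) + f(Y) p'(Y)/p(Y)] = integral over (0, 1) of f'(y) p(y) + f(y) p'(y)
stein_expectation = simpson(lambda y: df(y) * p(y) + f(y) * dp(y), 0.0, 1.0)
total_mass = simpson(p, 0.0, 1.0)
```

Here `stein_expectation` should vanish up to quadrature error, and `total_mass` confirms that the chosen p integrates to one.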
For any measurable function h with $E|h(Y)| < \infty$, let $f = f_h$ be the solution to the Stein equation
\[
f'(w) + f(w)p'(w)/p(w) = h(w) - Eh(Y). \tag{13.5}
\]
Rewriting (13.5) we have that
\[
\left(f(w)p(w)\right)' = \left(h(w) - Eh(Y)\right)p(w)
\]
and hence the solution for $w \in (a, b)$ is given by
\[
f_h(w) = 1/p(w) \int_a^w \left(h(y) - Eh(Y)\right)p(y)\, dy
= -1/p(w) \int_w^b \left(h(y) - Eh(Y)\right)p(y)\, dy. \tag{13.6}
\]
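As a worked instance of (13.6), consider the exponential density $p(y) = \lambda e^{-\lambda y}$ on $(0, \infty)$ with $h(y) = y$, so that $Eh(Y) = 1/\lambda$; integrating by parts gives the closed form $f_h(w) = -w/\lambda$. The sketch below, with the rate $\lambda = 2$ and the function names chosen only for illustration, evaluates (13.6) by quadrature and checks the Stein equation (13.5) with a central difference.

```python
from math import exp

def simpson(g, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * g(lo + k * h)
    return total * h / 3

LAM = 2.0                      # illustrative rate parameter

def p(y):                      # exponential density on (0, infinity)
    return LAM * exp(-LAM * y)

def h(y):                      # test function with E h(Y) = 1 / LAM
    return y

EH = 1.0 / LAM

def f_h(w):
    """Stein solution (13.6), first form with lower limit a = 0."""
    return simpson(lambda y: (h(y) - EH) * p(y), 0.0, w) / p(w)

def stein_residual(w, eps=1e-4):
    """f'(w) + f(w) p'(w)/p(w) - (h(w) - E h(Y)), with p'/p = -LAM."""
    deriv = (f_h(w + eps) - f_h(w - eps)) / (2 * eps)
    return deriv - LAM * f_h(w) - (h(w) - EH)

value = f_h(1.5)               # compare with the closed form -1.5 / LAM
residual = stein_residual(1.5)
```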
As the Stein equation, and its solution $f$, are valid only over the interval $(a, b)$, we consider, without further mention, approximation of the distributions of random variables $W$ by that of $Y$, having density $p$ on $(a, b)$, only when the support of $W$ is contained in the closure of $(a, b)$.

Example 13.1 (Exponential Distribution) Let $Y \sim \mathcal{E}(\lambda)$, where $\mathcal{E}(\lambda)$ denotes the exponential distribution with parameter $\lambda$, that is, $Y$ is a random variable with density function $p(y) = \lambda e^{-\lambda y}\mathbf{1}(y > 0)$. Then $p'(y)/p(y) = -\lambda$ and identity (13.4) becomes
$$Ef'(Y) - \lambda Ef(Y) = 0, \tag{13.7}$$
for any absolutely continuous $f$ for which the expectation above exists, satisfying $f(0) = \lim_{y\to\infty} f(y) = 0$. Similar to the case of the normal, (13.7) is a characterization of the exponential distribution in that if (13.7) holds for all such functions $f$ then $Y \sim \mathcal{E}(\lambda)$. The exponential distribution is a special case of the Gamma, which we turn to next.

Example 13.2 (The Gamma and $\chi^2$ distributions) With $\alpha$ and $\beta$ positive numbers, we say $Y$ has the $\Gamma(\alpha, \beta)$ distribution when $Y$ has density
$$p(y) = \frac{y^{\alpha-1}e^{-y/\beta}}{\beta^\alpha\Gamma(\alpha)}\mathbf{1}\{y>0\}.$$
Then $p'(y)/p(y) = \frac{\alpha-1}{y} - \frac{1}{\beta}$ and identity (13.4) becomes
$$E\left[f'(Y) + \left(\frac{\alpha-1}{Y} - \frac{1}{\beta}\right)f(Y)\right] = 0.$$
In the special case where $Y$ has the $\chi^2_k$ distribution, that is, the $\Gamma(k/2, 2)$ distribution, the identity specializes to
$$E\left[f'(Y) + \left(\frac{k-2}{2Y} - \frac{1}{2}\right)f(Y)\right] = 0.$$
Approximation by Gamma and $\chi^2$ distributions has been considered by Luk (1994) and Pickett (2004). Gamma approximation of the distribution of stochastic integrals of Wiener processes is handled in Nourdin and Peccati (2009), the normal version of which is explored in Chap. 14.

Example 13.3 Let $W(\alpha, \beta)$ denote the density function
$$p(y) = \frac{\alpha e^{-|y|^\alpha/\beta}}{2\beta^{1/\alpha}\Gamma(1/\alpha)} \quad \text{for } y \in \mathbb{R}, \text{ with } \alpha > 0,\ \beta > 0. \tag{13.8}$$
For any $\alpha > 0$, by the change of variable $u = y^\alpha$ we have
$$\int_{-\infty}^\infty e^{-|y|^\alpha}\,dy = 2\int_0^\infty e^{-y^\alpha}\,dy = \frac{2}{\alpha}\int_0^\infty e^{-u}u^{1/\alpha-1}\,du = \frac{2}{\alpha}\Gamma(1/\alpha),$$
hence, scaling by $\beta^{1/\alpha} > 0$, the family of functions (13.8) are densities on $\mathbb{R}$. Note that the mean zero normal distributions with variance $\sigma^2$ are the special case $W(2, 2\sigma^2)$. Of special interest will be the distribution $W(4, 12)$ with density
$$p(y) = c_1e^{-y^4/12} \quad \text{for } y \in \mathbb{R}, \text{ where } c_1 = \frac{\sqrt{2}}{3^{1/4}\Gamma(1/4)}. \tag{13.9}$$
For $W(4, 12)$ the ratio of the derivative to the density is given by
$$\frac{p'(y)}{p(y)} = -\frac{y^3}{3}.$$
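The normalization in Example 13.3 can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `w_density` and `simpson` are hypothetical, and the truncation of the integral to $[-12, 12]$ is an assumption justified only by the rapid decay of the integrands.

```python
import math

def w_density(y, alpha, beta):
    """Density (13.8) of the W(alpha, beta) family evaluated at y."""
    norm = alpha / (2.0 * beta ** (1.0 / alpha) * math.gamma(1.0 / alpha))
    return norm * math.exp(-abs(y) ** alpha / beta)

def simpson(func, lo, hi, steps=20000):
    """Composite Simpson rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3.0

# Closed form for c_1 in (13.9), to compare with (13.8) at alpha = 4, beta = 12.
c1_closed_form = math.sqrt(2.0) / (3.0 ** 0.25 * math.gamma(0.25))

# Total mass of W(4, 12) and of W(2, 2), the latter being the standard normal.
mass_w412 = simpson(lambda y: w_density(y, 4.0, 12.0), -12.0, 12.0)
mass_normal = simpson(lambda y: w_density(y, 2.0, 2.0), -12.0, 12.0)
```

Both masses should be close to one, and `w_density(0.0, 4.0, 12.0)` should agree with `c1_closed_form`, reflecting the identity $12^{1/4} = \sqrt{2}\cdot 3^{1/4}$.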
13.1.2 Properties of the Stein Solution

As in the normal case, in order to determine error bounds for approximations by the distribution of $Y$, we need to understand the basic properties of the Stein solution. As we consider the approximation of a random variable $W$ whose support is contained in the closure of $(a, b)$, in this chapter for a function $f$ on $\mathbb{R}$ we take $\|f\|$ to be the supremum of $|f(w)|$ over $w \in (a, b)$.

Lemma 13.1 Let $p$ be a density function satisfying Condition 13.1 for some $-\infty \le a < b \le \infty$ and let
$$F(y) = \int_a^y p(x)\,dx$$
be the associated distribution function. Further, let $h$ be a measurable function and $f_h$ the Stein solution given by (13.6).

(i) Suppose there exist $d_1 > 0$ and $d_2 > 0$ such that for all $y \in (a, b)$ we have
$$\min\big(1 - F(y), F(y)\big) \le d_1p(y) \tag{13.10}$$
and
$$|p'(y)|\min\big(F(y), 1 - F(y)\big) \le d_2p^2(y). \tag{13.11}$$
Then if $h$ is bounded,
$$\|f_h\| \le 2d_1\|h\|, \tag{13.12}$$
$$\|f_hp'/p\| \le 2d_2\|h\| \tag{13.13}$$
and
$$\|f_h'\| \le (2 + 2d_2)\|h\|. \tag{13.14}$$

(ii) Suppose in addition to (13.10) and (13.11), there exist $d_3 \ge 0$ such that
$$\left|\left(\frac{p'(y)}{p(y)}\right)'\right|\min\big(E|Y|\mathbf{1}\{Y \le y\} + E|Y|F(y),\ E|Y|\mathbf{1}\{Y > y\} + E|Y|(1 - F(y))\big) \le d_3p(y) \tag{13.15}$$
and $d_4(y)$ such that for all $y \in (a, b)$ we have
$$\min\big(E|Y|\mathbf{1}\{Y \le y\} + E|Y|F(y),\ E|Y|\mathbf{1}\{Y > y\} + E|Y|(1 - F(y))\big) \le d_4(y)p(y). \tag{13.16}$$
Then if $h$ is absolutely continuous with bounded derivative $h'$,
$$\|f_h''\| \le (1 + d_2)(1 + d_3)\|h'\|, \tag{13.17}$$
$$|f_h(y)| \le d_4(y)\|h'\| \quad \text{for all } y \in (a, b), \tag{13.18}$$
and
$$\|f_h'\| \le (1 + d_3)d_1\|h'\|. \tag{13.19}$$
The proof of the lemma is deferred to the Appendix.
13.2 $L^1$ and $L^\infty$ Bounds via Exchangeable Pairs

Let $W$ be a random variable of interest and $(W, W')$ an exchangeable pair. Write
$$E(W - W'|W) = g(W) + r(W), \tag{13.20}$$
where we consider $g(W)$ to be the dominant term and $r(W)$ some negligible remainder. When $g(W) = \lambda W$, and $\lambda^{-1}E((W' - W)^2|W)$ is nearly constant, the results in Sect. 5.2 show that the distribution of $W$ can be approximated by the normal, subject to some additional conditions. Here we use the function $g(w)$ to determine an appropriate approximating distribution for $W$, or, more particularly, identify its density function $p$. Once $p$ is determined, we can parallel the development of Stein's method of exchangeable pairs for normal approximation.

As a case in point, the proofs in this section depend on the following exchangeable pair identity, analogous to the one applied in the proof of Lemma 2.7 for the normal. That is, when (13.20) holds, for any absolutely continuous function $f$ for which the expectations below exist, recalling $\Delta = W - W'$, by exchangeability we have
$$\begin{aligned}
0 &= E(W - W')\big(f(W') + f(W)\big)\\
&= 2Ef(W)(W - W') + E(W - W')\big(f(W') - f(W)\big)\\
&= 2E\big\{f(W)E\big((W - W')|W\big)\big\} - E\Big\{(W - W')\int_{-\Delta}^0 f'(W + t)\,dt\Big\}\\
&= 2Ef(W)g(W) + 2Ef(W)r(W) - E\int_{-\infty}^\infty f'(W + t)\hat{K}(t)\,dt
\end{aligned} \tag{13.21}$$
where
$$\hat{K}(t) = E\big\{\Delta\big(\mathbf{1}\{-\Delta \le t \le 0\} - \mathbf{1}\{0 < t \le -\Delta\}\big)\,\big|\,W\big\}. \tag{13.22}$$
Note that here, similar to (2.39), we have
$$\int_{-\infty}^\infty \hat{K}(t)\,dt = E\big(\Delta^2|W\big). \tag{13.23}$$
For a given function $g(y)$ defined on $(a, b)$ let $Y$ be a random variable with density function $p(y) = 0$ for $y \notin (a, b)$, and for $y \in (a, b)$,
$$p(y) = c_1e^{-c_0G(y)} \quad \text{where} \quad G(y) = \begin{cases}\int_0^y g(s)\,ds & \text{if } 0 \in [a, b),\\ \int_a^y g(s)\,ds & \text{if } a > 0,\\ \int_b^y g(s)\,ds & \text{if } b \le 0,\end{cases} \tag{13.24}$$
with $c_0 > 0$ and
$$c_1^{-1} = \int_a^b e^{-c_0G(y)}\,dy < \infty. \tag{13.25}$$
Note that (13.24) implies
$$p'(y) = -c_0g(y)p(y) \quad \text{for all } y \in (a, b). \tag{13.26}$$
Theorem 13.1 shows that for deriving $L^1$ bounds for approximations by distributions with densities $p$ of the form (13.24), it suffices that there exist a function $b_0(y)$, and constants $b_1$ and $b_2$, such that
$$|f(y)| \le b_0(y) \text{ for all } y \in (a, b), \quad \text{and} \quad \|f'\| \le b_1 \text{ and } \|f''\| \le b_2 \tag{13.27}$$
for all solutions $f$ to the Stein equation (13.5) for absolutely continuous functions $h$ with $\|h'\| \le 1$. For some cases, the following two conditions will help verify the hypotheses of Lemma 13.1 for densities of the form (13.24), thus implying bounds of the form (13.27).

Condition 13.2 On the interval $(a, b)$ the function $g(y)$ is non-decreasing and $yg(y) \ge 0$.

Condition 13.3 On the interval $(a, b)$ the function $g$ is absolutely continuous, and there exists $c_2 < \infty$ such that
$$\min\left(\frac{1}{c_1}, \frac{1}{|c_0g(y)|}\right)\left(|y| + \frac{3}{c_1}\right)\max\big(1, c_0|g'(y)|\big) \le c_2.$$

Lemma 13.2 Suppose that the density $p$ is given by (13.24) for some $c_0 > 0$, and $g$ satisfying Conditions 13.2 and 13.3, and $E|g(Y)| < \infty$ for $Y$ having density $p$. Then Condition 13.1 and all the bounds in Lemma 13.1 on the solution $f$ and its derivatives hold, with $d_1 = 1/c_1$, $d_2 = 1$, $d_3 = c_2$ and $d_4(y) = c_2$ for all $y \in (a, b)$.

We refer the reader to the Appendix for a proof of Lemma 13.2. Equipped with bounds on the solution, we can now provide the following $L^1$ result.
Theorem 13.1 Let $(W, W')$ be an exchangeable pair satisfying (13.20) and set $\Delta = W - W'$. Let $Y$ have density $p$ of the form (13.24), on an interval $(a, b)$, with $c_0 > 0$, and $g$ in (13.20) satisfying $E|g(Y)| < \infty$. Suppose that the solution $f$ to the Stein equation (13.5), for all absolutely continuous functions $h$ with $\|h'\| \le 1$, satisfies (13.27) for some function $b_0(w)$ and constants $b_1$ and $b_2$. Then
$$\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le b_1E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + \frac{c_0b_2}{4}E|\Delta|^3 + c_0E\big|r(W)b_0(W)\mathbf{1}\{a < W < b\}\big|. \tag{13.28}$$

… $> 1$, we assume
$$c_1c_2\delta \le 1. \tag{13.32}$$
Let $F$ be the distribution function of $Y$ and for $z \in \mathbb{R}$ let $f = f_z$ be the solution to the Stein equation
$$f'(w) + f(w)p'(w)/p(w) = \mathbf{1}(w \le z) - F(z)$$
or, by (13.26), equivalently, to
$$f'(w) - c_0f(w)g(w) = \mathbf{1}(w \le z) - F(z). \tag{13.33}$$
By Lemma 13.2, the bound (13.12) of Lemma 13.1 holds, so $\|f\| < \infty$. Letting $\hat{K}(t)$ be given by (13.22), in view of identities (13.21) and (13.23), and that $|W - W'| \le \delta$ and $\hat{K}(t) \ge 0$, we obtain
$$\begin{aligned}
2Ef(W)g(W) + 2Ef(W)r(W) &= E\int_{-\infty}^\infty f'(W + t)\hat{K}(t)\,dt\\
&= E\int_{-\delta}^\delta \big\{c_0f(W + t)g(W + t) + \mathbf{1}(W + t \le z) - F(z)\big\}\hat{K}(t)\,dt\\
&\ge E\int_{-\delta}^\delta c_0f(W + t)g(W + t)\hat{K}(t)\,dt + E\mathbf{1}\{W \le z - \delta\}\Delta^2 - F(z)E\Delta^2.
\end{aligned}$$
Rewriting, and adding and subtracting using (13.23) again, we have
$$\begin{aligned}
E\mathbf{1}\{W \le z - \delta\}\Delta^2 - F(z)E\Delta^2 &\le 2Ef(W)g(W) + 2Ef(W)r(W) - E\int_{-\delta}^\delta c_0f(W + t)g(W + t)\hat{K}(t)\,dt\\
&= 2E\Big\{f(W)g(W)\Big(1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\Big)\Big\} + 2Ef(W)r(W)\\
&\quad + c_0E\int_{-\delta}^\delta \big\{f(W)g(W) - f(W + t)g(W + t)\big\}\hat{K}(t)\,dt\\
&:= J_1 + J_2 + J_3.
\end{aligned} \tag{13.34}$$
Lemma 13.2 and (13.12), (13.13), and (13.14) of Lemma 13.1 yield, along with (13.26), that
$$\|f\| \le 2/c_1, \quad \|fg\| \le 2/c_0 \quad \text{and} \quad \|f'\| \le 4. \tag{13.35}$$
Therefore
$$|J_1| \le (4/c_0)E\left|1 - (c_0/2)E\big(\Delta^2|W\big)\right| \tag{13.36}$$
and
$$|J_2| \le (4/c_1)E|r(W)|. \tag{13.37}$$
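To bound $J_3$, we first show that
$$\sup_{|t| \le \delta}|g(w + t) - g(w)| \le \frac{c_1c_2\delta}{2c_0}\big(c_1 + c_0|g(w)|\big). \tag{13.38}$$
From Condition 13.3 it follows that
$$|g'(w)| \le \frac{c_1c_2}{3c_0\min(1/c_1, 1/|c_0g(w)|)} = \frac{c_1c_2}{3c_0}\max\big(c_1, c_0|g(w)|\big) \le \frac{c_1c_2}{3c_0}\big(c_1 + c_0|g(w)|\big). \tag{13.39}$$
Thus by the mean value theorem
$$\begin{aligned}
\sup_{|t| \le \delta}|g(w + t) - g(w)| &\le \delta\sup_{|t| \le \delta}|g'(w + t)|\\
&\le \frac{c_1c_2\delta}{3c_0}\Big(c_1 + c_0\sup_{|t| \le \delta}|g(w + t)|\Big)\\
&\le \frac{c_1c_2\delta}{3c_0}\Big(c_1 + c_0|g(w)| + c_0\sup_{|t| \le \delta}|g(w + t) - g(w)|\Big)\\
&= \frac{c_1c_2\delta}{3c_0}\big(c_1 + c_0|g(w)|\big) + \frac{c_1c_2\delta}{3}\sup_{|t| \le \delta}|g(w + t) - g(w)|\\
&\le \frac{c_1c_2\delta}{3c_0}\big(c_1 + c_0|g(w)|\big) + \frac{1}{3}\sup_{|t| \le \delta}|g(w + t) - g(w)|,
\end{aligned}$$
by (13.32). This proves (13.38). Now, by (13.35) and (13.38), when $|t| \le \delta$,
$$\begin{aligned}
|f(w)g(w) - f(w + t)g(w + t)| &\le |g(w)||f(w + t) - f(w)| + |f(w + t)||g(w + t) - g(w)|\\
&\le 4|g(w)||t| + \frac{2}{c_1}\cdot\frac{c_1c_2\delta}{2c_0}\big(c_1 + c_0|g(w)|\big)\\
&\le (4 + c_2)\delta|g(w)| + \delta c_1c_2/c_0.
\end{aligned}$$
Therefore
$$|J_3| \le c_0(4 + c_2)\delta E\big(|g(W)|\Delta^2\big) + \delta c_1c_2E\Delta^2 \le (4 + c_2)\delta^3E|c_0g(W)| + c_1c_2\delta^3. \tag{13.40}$$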
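Combining (13.34), (13.36), (13.37) and (13.40) shows that
$$E\mathbf{1}\{W \le z - \delta\}\Delta^2 - F(z)E\Delta^2 \le \frac{4}{c_0}E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + \frac{4}{c_1}E|r(W)| + (4 + c_2)\delta^3E|c_0g(W)| + c_1c_2\delta^3. \tag{13.41}$$
On the other hand, using $F'(z) = p(z) \le c_1$ by (13.24), we have
$$\begin{aligned}
E\mathbf{1}\{W \le z - \delta\}\Delta^2 - F(z)E\Delta^2 &= \frac{2}{c_0}\big\{E\mathbf{1}\{W \le z - \delta\} - F(z - \delta)\big\}\\
&\quad - \frac{2}{c_0}E\Big\{\big(\mathbf{1}\{W \le z - \delta\} - F(z)\big)\Big(1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\Big)\Big\} + \frac{2}{c_0}\big(F(z - \delta) - F(z)\big)\\
&\ge \frac{2}{c_0}\big\{P(W \le z - \delta) - F(z - \delta)\big\} - \frac{2}{c_0}E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| - \frac{2c_1\delta}{c_0},
\end{aligned} \tag{13.42}$$
which together with (13.41) yields
$$\begin{aligned}
P(W \le z - \delta) - F(z - \delta) &\le E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + c_1\delta\\
&\quad + \frac{c_0}{2}\Big\{(4/c_0)E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + (4/c_1)E|r(W)| + (4 + c_2)\delta^3E|c_0g(W)| + c_1c_2\delta^3\Big\}\\
&= 3E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + c_1\delta + 2c_0E|r(W)|/c_1 + \delta^3c_0\big\{(2 + c_2/2)E|c_0g(W)| + c_1c_2/2\big\}.
\end{aligned} \tag{13.43}$$
Similarly, one can demonstrate
$$F(z + \delta) - P(W \le z + \delta) \le 3E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + c_1\delta + 2c_0E|r(W)|/c_1 + \delta^3c_0\big\{(2 + c_2/2)E|c_0g(W)| + c_1c_2/2\big\}. \tag{13.44}$$
As $c_1\delta \le c_1\max(1, c_2)\delta$, the proof of (13.31) is complete.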
13.3 The Curie–Weiss Model

The Curie–Weiss model, or the Ising model on the complete graph, was introduced in Sect. 11.2 and is a simple statistical mechanical model of ferromagnetic interaction. We recall that for $n \in \mathbb{N}$ the vector $\sigma = (\sigma_1, \ldots, \sigma_n)$ of 'spins' in $\{-1, 1\}^n$ is assigned probability
$$p(\sigma) = C_\beta\exp\left(\frac{\beta}{n}\sum_{i<j}\sigma_i\sigma_j\right) \tag{13.45}$$
for a given 'inverse temperature' $\beta > 0$, with $C_\beta$ the appropriate normalizing constant. For a detailed mathematical treatment of the Curie–Weiss model in general we refer to the book by Ellis (1985). When $0 < \beta < 1$ the total spin $\sum_{i=1}^n \sigma_i$, properly standardized, converges to a standard normal distribution as $n \to \infty$, see, e.g., Ellis and Newman (1978a, 1978b). For $\beta = 1$ it was proved by Ellis and Newman (1978a, 1978b) that as $n \to \infty$, the law of
$$W = n^{-3/4}\sum_{i=1}^n \sigma_i \tag{13.46}$$
converges to the distribution $W(4, 12)$ of Example 13.3. For various interesting extensions and refinements of their results, see Ellis et al. (1980), and Papangelou (1989). Here we present the following $L^1$ and Berry–Esseen bounds for the critical $\beta = 1$ non-central limit theorem, obtained via Theorems 13.1 and 13.2, respectively.

Theorem 13.3 Let $W$ be the scaled total spin (13.46) in the Curie–Weiss model, where the vector $\sigma$ of spins has distribution (13.45) at the critical inverse temperature $\beta = 1$, and let $Y$ be a random variable with distribution $W(4, 12)$ as in (13.9). Then there exists a universal constant $C$ such that for all $n \in \mathbb{N}$,
$$\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le Cn^{-1/2} \tag{13.47}$$
and
$$\sup_{z \in \mathbb{R}}|P(W \le z) - P(Y \le z)| \le Cn^{-1/2}. \tag{13.48}$$

The required exchangeable pair is constructed following Example 2.2. Given $\sigma$ having distribution (13.45), construct $\sigma'$ by choosing $I$ uniformly and independently of $\sigma$, and replacing $\sigma_I$ by $\sigma_I'$, where $\sigma_I'$ is generated from the conditional distribution of $\sigma_I$ given $\{\sigma_j, j \ne I\}$. It is easy to see that $(\sigma, \sigma')$ is an exchangeable pair. Let $W' = W - n^{-3/4}(\sigma_I - \sigma_I')$, the scaled total spin of the configuration when $\sigma_I$ is replaced by $\sigma_I'$. Considering the sequence of distributions (13.45) indexed by $n \in \mathbb{N}$, the key step is to show (13.20), or, more specifically for the case at hand, that
$$E(W - W'|W) = \frac{1}{3}n^{-3/2}W^3 + O\big(n^{-2}\big) \quad \text{as } n \to \infty. \tag{13.49}$$
To explain (13.49), roughly, a simple computation shows that at any inverse temperature,
$$E(W - W'|W) = n^{-3/4}\big(M - \tanh(\beta M)\big) + O\big(n^{-2}\big) \quad \text{as } n \to \infty,$$
where $M = n^{-1/4}W$ is known as the magnetization. Since $M \approx 0$ with high probability when $\beta \le 1$, and a Taylor expansion about zero yields $\tanh x = x - x^3/3 + O(x^5)$, we see that $n^{-3/4}(M - \tanh(\beta M))$ behaves like $n^{-3/4}M(1 - \beta)$ when $\beta < 1$, and like $n^{-3/4}M^3/3$ when $\beta = 1$. This is what distinguishes the high temperature regime $\beta < 1$ from the critical case $\beta = 1$, and how we arrive at (13.49). Comparing (13.49) with (13.20), we find that if on $(-\infty, \infty)$ we take
$$g(y) = \frac{1}{3}n^{-3/2}y^3 \quad \text{and} \quad c_0 = n^{3/2}, \tag{13.50}$$
then, following (13.24), the density function
$$p(y) = c_1\exp\left(-c_0\int_0^y g(s)\,ds\right) = c_1e^{-y^4/12}$$
results, that is, the distribution $W(4, 12)$ of the family considered in Example 13.3. We note that though $c_0$ depends on $n$, the constant $c_1$ given in (13.9) does not. We now make (13.49) precise, as well as verify the remainder of the hypotheses required in order to invoke Theorem 13.2.

Lemma 13.3 Let $W$ be the scaled total spin (13.46) in the Curie–Weiss model, where the spins $\sigma$ have distribution given by (13.45) at the critical inverse temperature $\beta = 1$, and let $W'$ be given by (13.46) when a uniformly chosen spin from $\sigma$ has been replaced by one having its conditional distribution given the others. Then for all $n \in \mathbb{N}$,
$$|W - W'| \le 2n^{-3/4}, \tag{13.51}$$
$$E\left|E(W - W'|W) - \frac{1}{3}n^{-3/2}W^3\right| \le 15n^{-2}, \tag{13.52}$$
$$E\left|1 - \frac{n^{3/2}}{2}E\big((W - W')^2|W\big)\right| \le \frac{15}{2}n^{-1/2} \tag{13.53}$$
and
$$E|W|^3 \le 15. \tag{13.54}$$
Proof As $W$ and $W'$ differ in at most one coordinate, (13.51) is immediate. Next, let $M = n^{-1}\sum_{i=1}^n \sigma_i = n^{-1/4}W$ be the magnetization, and for each $i$, let
$$M_i = n^{-1}\sum_{j \ne i}\sigma_j.$$
It is easy to see that for every $i = 1, 2, \ldots, n$, if one chooses a variable $\sigma_i'$ from the conditional distribution of the $i$th spin given $\sigma_j$, $j \ne i$, independently of $\sigma_i$, then for $\tau \in \{-1, 1\}$,
$$P\big(\sigma_i' = \tau\,|\,\sigma_j, j \ne i\big) = P\big(\sigma_i' = \tau\,|\,\sigma\big) = \frac{e^{M_i\tau}}{e^{M_i} + e^{-M_i}}, \tag{13.55}$$
and so
$$E\big(\sigma_i'|\sigma\big) = \frac{e^{M_i}}{e^{M_i} + e^{-M_i}} - \frac{e^{-M_i}}{e^{M_i} + e^{-M_i}} = \tanh M_i.$$
Hence
$$E(W - W'|\sigma) = n^{-1}\sum_{i=1}^n n^{-3/4}\big(\sigma_i - E(\sigma_i'|\sigma)\big) = n^{-3/4}M - n^{-7/4}\sum_{i=1}^n \tanh M_i. \tag{13.56}$$
Now it is easy to verify that the second derivative
$$\frac{d^2}{dx^2}\tanh x = -2(\tanh x)\big(1 - \tanh^2 x\big)$$
has exactly two extrema on the real line, the solutions $x$ to the equation $\tanh^2 x = 1/3$, and is bounded in absolute value by $4/3^{3/2}$. Thus, for all $x, y \in \mathbb{R}$,
$$\big|\tanh x - \tanh y - (x - y)(\cosh y)^{-2}\big| \le \frac{2(x - y)^2}{3^{3/2}}.$$
Applying this bound with $x = M_i$ and $y = M$, where $M_i - M = -\sigma_i/n$, and summing over $i = 1, \ldots, n$, it follows that
$$\left|\sum_{i=1}^n \tanh M_i - n\tanh M + n^{-1}(\cosh M)^{-2}\sum_{i=1}^n \sigma_i\right| \le \frac{2n^{-1}}{3^{3/2}},$$
and therefore
$$\left|\sum_{i=1}^n \tanh M_i - n\tanh M\right| \le |M| + \frac{2n^{-1}}{3^{3/2}}.$$
Applying this inequality and the relation $M = n^{-1/4}W$ in (13.56), we obtain
$$\big|E(W - W'|\sigma) + n^{-3/4}(\tanh M - M)\big| \le n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}}. \tag{13.57}$$
Now consider the function $f(x) = \tanh x - x + x^3/3$. Since $f'(x) = (\cosh x)^{-2} - 1 + x^2 \ge 0$ for all $x$, the function $f$ is increasing, and as $f(0) = 0$ we obtain $f(x) \ge 0$ for all $x \ge 0$. Now, it can be easily verified that the first four derivatives of $f$ vanish at zero, and for all $x \ge 0$,
$$\frac{d^5f}{dx^5} = \frac{16}{\cosh^2 x} - \frac{120\sinh^2 x}{\cosh^4 x} + \frac{120\sinh^4 x}{\cosh^6 x} \le \frac{16}{\cosh^2 x} \le 16.$$
Thus, for all $x \ge 0$,
$$0 \le f(x) \le \frac{16}{5!}x^5 = \frac{2}{15}x^5.$$
Since $f$ is an odd function, we obtain
$$\left|\tanh x - x + \frac{1}{3}x^3\right| \le \frac{2|x|^5}{15} \quad \text{for all } x \in \mathbb{R}.$$
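The Taylor remainder bound just derived lends itself to a quick numerical sanity check. The sketch below is illustrative only: the grid on $(0, 10]$, the tolerance `tol`, and the function names are assumptions introduced here, and a finite grid of course proves nothing beyond the sampled points.

```python
import math

def cubic_taylor_gap(x):
    """f(x) = tanh x - x + x^3/3, the function studied in the proof."""
    return math.tanh(x) - x + x ** 3 / 3.0

def fifth_derivative(x):
    """Closed form for the fifth derivative of tanh, which is also that of f."""
    c, s = math.cosh(x), math.sinh(x)
    return 16.0 / c ** 2 - 120.0 * s ** 2 / c ** 4 + 120.0 * s ** 4 / c ** 6

grid = [0.05 * k for k in range(1, 201)]  # sample points in (0, 10]
tol = 1e-12                               # slack for floating point rounding

# 0 <= f(x) <= 2 x^5 / 15 on the grid.
bound_holds = all(
    -tol <= cubic_taylor_gap(x) <= 2.0 * x ** 5 / 15.0 + tol for x in grid
)
# The fifth derivative is dominated by 16 / cosh^2 x on the grid.
derivative_ok = all(
    fifth_derivative(x) <= 16.0 / math.cosh(x) ** 2 + tol for x in grid
)
```

Very small values of $x$ are left out of the grid on purpose, since there the gap between $f(x)$ and $2x^5/15$ falls below double precision rounding error.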
Using this inequality in (13.57), we arrive at
$$\left|E(W - W'|\sigma) - \frac{1}{3}n^{-3/4}M^3\right| \le \frac{2n^{-3/4}|M|^5}{15} + n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}},$$
or, by the relation $M = n^{-1/4}W$, equivalently,
$$\left|E(W - W'|\sigma) - \frac{1}{3}n^{-3/2}W^3\right| \le \frac{2n^{-2}}{15}|W|^5 + n^{-2}|W| + \frac{2n^{-11/4}}{3^{3/2}}. \tag{13.58}$$
The latter inequality implies, in particular, that
$$\left|E\big((W - W')W^3\big) - \frac{1}{3}n^{-3/2}E\big(W^6\big)\right| \le \frac{2n^{-2}E(W^8)}{15} + n^{-2}E\big(W^4\big) + \frac{2n^{-11/4}E|W|^3}{3^{3/2}}. \tag{13.59}$$
Thus,
$$E\big(W^6\big) \le 3n^{3/2}\big|E\big((W - W')W^3\big)\big| + \frac{2n^{-1/2}E(W^8)}{5} + 3n^{-1/2}E\big(W^4\big) + \frac{2n^{-5/4}E|W|^3}{3^{1/2}}. \tag{13.60}$$
Regarding the first term on the right hand side of (13.60), note that by the exchangeability of $(W, W')$,
$$E\big((W - W')W^3\big) = \frac{1}{2}E\big((W - W')(W^3 - W'^3)\big) = \frac{1}{2}E\big((W - W')^2(W^2 + WW' + W'^2)\big).$$
Now, by (13.51) and the Cauchy–Schwarz inequality,
$$\big|E\big((W - W')W^3\big)\big| \le 6n^{-3/2}E\big(W^2\big). \tag{13.61}$$
For the remaining terms in (13.60), using the crude bound $|W| \le n^{1/4}$, we obtain
$$\frac{2n^{-1/2}E(W^8)}{5} + 3n^{-1/2}E\big(W^4\big) + \frac{2n^{-5/4}E|W|^3}{3^{1/2}} \le \frac{2E(W^6)}{5} + 3E\big(W^2\big) + \frac{2n^{-1}E(W^2)}{3^{1/2}}. \tag{13.62}$$
Combining (13.60), (13.61), and (13.62), we obtain
$$E\big(W^6\big) \le \left(21 + \frac{2n^{-1}}{3^{1/2}}\right)E\big(W^2\big) + \frac{2E(W^6)}{5},$$
and therefore, for all $n \in \mathbb{N}$,
$$E\big(W^6\big) \le \frac{5}{3}\left(21 + \frac{2n^{-1}}{3^{1/2}}\right)E\big(W^2\big) \le 36.925\,E\big(W^2\big).$$
Since $E(W^2) \le [E(W^6)]^{1/3}$, this gives
$$E\big(W^6\big) \le (36.925)^{3/2} \le 224.4 < (15)^2, \tag{13.63}$$
and hence (13.54) holds. Applying the bound (13.63) in (13.58) yields, for all $n \in \mathbb{N}$,
$$E\left|E(W - W'|W) - \frac{1}{3}n^{-3/2}W^3\right| \le n^{-2}\left(\frac{2(224.4)^{5/6}}{15} + (224.4)^{1/6}\right) + \frac{2n^{-11/4}}{3^{3/2}} \le 15n^{-2}, \tag{13.64}$$
completing the proof of (13.52). Lastly, to prove (13.53), by (13.55) we have
$$\begin{aligned}
E\big((W - W')^2|\sigma\big) &= n^{-3/2}\,\frac{1}{n}\sum_{i=1}^n \frac{4e^{-\sigma_iM_i}}{e^{\sigma_iM_i} + e^{-\sigma_iM_i}}\\
&= 2n^{-5/2}\sum_{i=1}^n \big(1 - \tanh(\sigma_iM_i)\big)\\
&= 2n^{-3/2} - 2n^{-5/2}\sum_{i=1}^n \sigma_i\tanh M_i.
\end{aligned}$$
Using $|\tanh M_i - \tanh M| \le |M_i - M| \le n^{-1}$, we obtain
$$\big|E\big((W - W')^2|\sigma\big) - 2n^{-3/2}\big| \le 2n^{-5/2} + 2n^{-3/2}M\tanh M \le 2n^{-5/2} + 2n^{-3/2}M^2 = 2n^{-5/2} + 2n^{-2}W^2.$$
Using (13.63), we obtain, for all $n \in \mathbb{N}$, that
$$E\big|E\big((W - W')^2|W\big) - 2n^{-3/2}\big| \le 2n^{-5/2} + 2n^{-2}(224.4)^{1/3} \le 15n^{-2}.$$
Now multiplying by $n^{3/2}/2$ completes the proof of (13.53), and the lemma.
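The moment bounds (13.54) and (13.63) can be compared against exact values for specific $n$. The sketch below is not taken from the text: it assumes only that under (13.45) the law of the total spin $S = \sum_i \sigma_i$ is proportional to $\binom{n}{k}\exp(\beta S^2/(2n))$ with $S = 2k - n$, which follows from $\sum_{i<j}\sigma_i\sigma_j = (S^2 - n)/2$. The function name `curie_weiss_moments` is hypothetical.

```python
import math

def curie_weiss_moments(n, beta=1.0, powers=(2, 3, 6)):
    """Exact E|W|^q for W = n^{-3/4} S, with S the total spin under (13.45)."""
    log_weights = []
    values = []
    for k in range(n + 1):  # k counts the spins equal to +1
        s = 2 * k - n       # total spin for this k
        # log of C(n, k) * exp(beta * s^2 / (2n)); the constant -beta/2 cancels
        log_w = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                 + beta * s * s / (2.0 * n))
        log_weights.append(log_w)
        values.append(abs(s) * n ** -0.75)
    shift = max(log_weights)  # subtract the maximum to avoid overflow
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return {q: sum(w * v ** q for w, v in zip(weights, values)) / total
            for q in powers}

moments_by_n = {n: curie_weiss_moments(n) for n in (5, 50, 500)}
```

For each $n$ computed, the entries for $q = 6$ and $q = 3$ can be compared with the constants $224.4$ and $15$ of the lemma; the proof's constants are far from tight, so a wide margin is to be expected.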
Proof of Theorem 13.3 We apply Theorems 13.1 and 13.2 with the coupling given in Lemma 13.3. First, inequality (13.52) of Lemma 13.3 shows that the exchangeable pair $(W, W')$ satisfies (13.20) with
$$g(y) = \frac{1}{3}n^{-3/2}y^3 \quad \text{and} \quad E|r(W)| \le 15n^{-2}. \tag{13.65}$$
Recall that the density $p(y)$ of $Y$ on $(-\infty, \infty)$ is given by (13.9), or, equivalently, by (13.24) with $g(y)$ and $c_0 = n^{3/2}$ as in (13.50), and that $c_1$ is a constant not depending on $n$. It is clear that $Y$ has moments of all orders, so in particular $E|g(Y)| < \infty$, and that $g(y)$ satisfies Condition 13.2. As the quantity
$$\min\big(1/c_1, 3/|y|^3\big)\big(|y| + 3/c_1\big)\big(1 + y^2\big)$$
is bounded near zero and has finite limits at plus and minus infinity, Condition 13.3 is satisfied with a constant $c_2$ not depending on $n$.

Regarding the $L^1$ bound (13.47), we have that the hypotheses of Theorem 13.1 are satisfied, and need only verify that all terms in the bound (13.28) of that theorem are of order $O(n^{-1/2})$. Inequality (13.53) shows that the first term in the bound is no more than a constant multiple of $(15/2)n^{-1/2}$. Inequality (13.51) gives $E|\Delta|^3 \le 8n^{-9/4}$, so with $c_0 = n^{3/2}$ the second term is $O(n^{-3/4})$. By (13.65), and as $b_0 = c_2$ is constant by Lemma 13.2, the last term is of order $O(n^{3/2}n^{-2}) = O(n^{-1/2})$.

Regarding the supremum norm bound (13.48), Lemma 13.3 shows the coupling of $W$ and $W'$ is bounded, and therefore the hypotheses of Theorem 13.2 are satisfied; similarly, we need only verify that all terms in the bound (13.31) are $O(n^{-1/2})$. We have already shown the first term in the bound is of this order. Inequality (13.51) allows us to choose $\delta = 2n^{-3/4}$, showing the second term in the bound is of order $o(n^{-1/2})$. By (13.65) the third term is $O(n^{3/2}n^{-2}) = O(n^{-1/2})$. For the coefficient of the last term, we find $\delta^3c_0 = O(n^{-9/4}n^{3/2}) = O(n^{-3/4})$. As $c_1$ and $c_2$ do not depend on $n$, and $E|c_0g(W)| = E|W|^3/3 \le 5$ by (13.54), the final term is $o(n^{-1/2})$. As all terms in the bound are $O(n^{-1/2})$ the claim is shown.
13.4 Exponential Approximation

In this section we focus on approximation by the exponential distribution $\mathcal{E}(\lambda)$ for $\lambda > 0$, that is, the distribution with density $p(x) = \lambda e^{-\lambda x}\mathbf{1}\{x > 0\}$ as in Example 13.1. We consider two examples, the spectrum of the Bernoulli–Laplace diffusion model, and first passage times of Markov chains. The first example is handled using exchangeable pairs, and the second by introducing the equilibrium distribution.
13.4.1 Spectrum of the Bernoulli–Laplace Markov Chain

Since for the exponential distribution the ratio $p'(w)/p(w)$ is constant, following (13.20) we hope to construct an exchangeable pair $(W, W')$ satisfying
$$E(W - W'|W) = 1/c_0 + r(W) \tag{13.66}$$
for some constant $c_0 > 0$ and small remainder term. Taking $(a, b) = (0, \infty)$, $g(y) = 1/c_0$ and $G(y) = y/c_0$ in (13.24) yields $p(y) = e^{-y}\mathbf{1}\{y > 0\}$, and so $Y \sim \mathcal{E}(1)$, the unit exponential distribution; clearly $c_1 = 1$ and $E|g(Y)| < \infty$. We now obtain bounds on the solution to the Stein equation for the unit exponential.
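As a concrete instance of the solution formula (13.6) for the unit exponential, consider the test function $h(w) = w$, chosen here purely for illustration; it satisfies $\|h'\| \le 1$ and $Eh(Y) = 1$. With $p(y) = e^{-y}$ on $(0, \infty)$,

$$f_h(w) = e^{w}\int_0^w (y - 1)e^{-y}\,dy = e^{w}\big[-ye^{-y}\big]_0^w = -w, \qquad w > 0.$$

One can check directly that $f_h'(w) + f_h(w)p'(w)/p(w) = -1 + w = h(w) - Eh(Y)$, so (13.5) holds. For this particular $h$ we have $|f_h(w)| = w$, $\|f_h'\| = 1$ and $\|f_h''\| = 0$, which gives a sense of the size of bounds of the form (13.27) one can hope for in the exponential case.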
Lemma 13.4 If $f$ is the solution to the Stein equation (13.5) with $p(y) = e^{-y}\mathbf{1}(y > 0)$, the unit exponential density, and $h$ any absolutely continuous function with $\|h'\| \le 1$, then the bounds (13.27) hold with $b_0(y) = 3.5y$ for $y > 0$, $b_1 = 1$ and $b_2 = 2$.

Proof We verify the hypotheses of Lemma 13.1 are satisfied for $p(y) = e^{-y}$ and $F(y) = 1 - e^{-y}$ for $y > 0$. Clearly (13.10) and (13.11) are satisfied with $d_1 = 1$ and $d_2 = 1$, respectively. As $(p'/p)' = 0$, (13.15) is satisfied with $d_3 = 0$. Regarding (13.16), we have
$$EY\mathbf{1}(Y \le y) + EY\,P(Y \le y) = 2\big(1 - e^{-y}\big) - ye^{-y} \le 2\big(1 - e^{-y}\big),$$
and similarly,
$$EY\mathbf{1}(Y > y) + EY\,P(Y > y) \le (1 + 2y)e^{-y}.$$
Hence, (13.16) is satisfied with $d_4(y) = 3.5y$ as
$$e^y\min\big(2\big(1 - e^{-y}\big), (1 + 2y)e^{-y}\big) = \min\big(2\big(e^y - 1\big), (1 + 2y)\big) \le 3.5y,$$
where the final inequality is shown by considering the cases $0 < y < 2/3$ and $y \ge 2/3$. Invoking Lemma 13.1 with $d_1 = 1$, $d_2 = 1$, $d_3 = 0$ and $d_4(y) = 3.5y$ now completes the proof of the claim.

Theorem 13.1 now immediately yields the following result for approximation by the exponential distribution.

Theorem 13.4 Let $(W, W')$ be an exchangeable pair satisfying (13.66) for some $c_0 > 0$, let $\Delta = W - W'$, and $Y \sim \mathcal{E}(1)$. Then
$$\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le E\left|1 - \frac{c_0}{2}E\big(\Delta^2|W\big)\right| + \frac{c_0}{2}E|\Delta|^3 + 3.5c_0E\big|Wr(W)\mathbf{1}\{W > 0\}\big|.$$

We apply the result above to the Bernoulli–Laplace Markov chain, a simple model of diffusion, following the work of Chatterjee et al. (2008). Two urns contain $n$ balls each. Initially the balls in each urn are all of a single color, with urn 1 containing all white balls, and urn 2 all black. At each stage two balls are picked at random, one from each urn, and interchanged. Let the state of the chain be the number of white balls in urn 1. Diaconis and Shahshahani (1987) proved that $(n/4)\log(2n) + cn$ steps suffice for this process to reach equilibrium, in the sense that the total variation distance to stationarity is at most $ae^{-dc}$, for some positive universal constants $a$ and $d$.
To prove this result they used the fact that the spectrum of the chain consists of the numbers
$$\lambda_i = 1 - \frac{i(2n - i + 1)}{n^2} \quad \text{for } i = 0, \ldots, n \tag{13.67}$$
occurring with multiplicities
$$m_i = \binom{2n}{i} - \binom{2n}{i-1} \quad \text{for } i = 0, 1, \ldots, n,$$
where we adopt the convention that $\binom{n}{k} = 0$ when $k < 0$, so that the multiplicity $m_0$ of the eigenvalue $\lambda_0 = 1$ is 1.

The Bernoulli–Laplace Markov chain is equivalent to a function of a certain random walk on the Johnson graph $J(2n, n)$. The vertices of the Johnson graph $J(2n, n)$ are all size $n$ subsets of $\{1, 2, \ldots, 2n\}$, and two subsets are connected by an edge if they differ by exactly one element. From a given vertex, the random walk moves to a neighbor chosen uniformly at random. Numbering the balls of the Bernoulli–Laplace model 1 through $2n$, with the white balls corresponding to the odd numbers $1, 3, \ldots, 2n - 1$ and the black balls to the even values $2, 4, \ldots, 2n$, the state of the random walk on the Johnson graph is simply the labels of the balls in urn 1.

We apply Stein's method to study an approximation to the distribution of a randomly chosen eigenvalue, that is, to the values $\lambda_i$ given in (13.67), chosen in proportion to their multiplicities $m_i$, $i = 0, \ldots, n$. As the sum $\sum_{i=0}^n m_i$ telescopes, this means we choose $\lambda_i$ with probability
$$\pi_i = \frac{\binom{2n}{i} - \binom{2n}{i-1}}{\binom{2n}{n}}, \quad i = 0, 1, \ldots, n. \tag{13.68}$$
Letting $I$ have distribution $P(I = i) = \pi_i$, translating and scaling the distribution which chooses $\lambda_i$ with probability $\pi_i$ to be comparable to the unit exponential, we are led to study the random variable
$$W = \mu_I \quad \text{where} \quad \mu_i = \frac{(n - i)(n + 1 - i)}{n}. \tag{13.69}$$
We construct an exchangeable pair $(W, W')$ using a reversible Markov chain on $\{0, 1, \ldots, n\}$, as outlined at the end of Sect. 2.3.2. For the chain to be reversible with respect to $\pi$, its transition kernel $K$ must satisfy the detailed balance equation
$$\pi_iK(i, j) = \pi_jK(j, i) \quad \text{for all } i, j \in \{0, \ldots, n\}. \tag{13.70}$$
Given such a $K$, one obtains the pair $(W, W')$ by letting $W = \mu_I$ where $I$ is chosen from the equilibrium distribution $\pi$, and $W' = \mu_J$ where $J$ is determined by taking one step from state $I$ according to the transition mechanism $K$. One can verify that the following transition kernel $K$ satisfies (13.70). For the upward transitions,
$$K(i, i + 1) = \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)} \quad \text{for } i = 0, \ldots, n - 1,$$
for the downward transitions,
$$K(i, i - 1) = \frac{i}{4n(2n - 2i + 1)(n - i + 1)} \quad \text{for } i = 0, \ldots, n,$$
for the probability of returning to the same state, $K(i, i) = 1 - K(i, i + 1) - K(i, i - 1)$, and $K(i, j) = 0$ otherwise. The following lemma summarizes some of the properties of the exchangeable pair so constructed.

Lemma 13.5 Let $W = \mu_I$ as in (13.69) with $P(I = i) = \pi_i$, specified in (13.68), and let $W' = \mu_J$ where $J$ is obtained from $I$ by taking one step from $I$ according to the transition kernel $K$. Then $(W, W')$ is exchangeable, and, with $\Delta = W - W'$,
$$E(\Delta|W) = \frac{1}{2n^2} - \frac{n + 1}{2n^2}\mathbf{1}\{W = 0\}, \tag{13.71}$$
$$E(W) = 1, \quad E\big(\Delta^2|W\big) = \frac{1}{n^2} \tag{13.72}$$
and
$$E|\Delta|^3 \le 4n^{-5/2}. \tag{13.73}$$
Proof Since $K$ is reversible with respect to $\pi$, the pair $(I, J)$, and therefore the pair $(\mu_I, \mu_J) = (W, W')$, is exchangeable. The mapping $i \to \mu_i$ given by (13.69) is strictly decreasing, and is therefore invertible, for $i \in \{0, 1, \ldots, n\}$. Hence, conditioning on $W$ is equivalent to conditioning on $I$. For $i \in \{0, 1, \ldots, n - 1\}$ we have
$$\begin{aligned}
E(\Delta|I = i) &= K(i, i + 1)(\mu_i - \mu_{i+1}) + K(i, i - 1)(\mu_i - \mu_{i-1})\\
&= \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)}\cdot\frac{2n - 2i}{n} + \frac{i}{4n(2n - 2i + 1)(n - i + 1)}\cdot\frac{2i - 2n - 2}{n}\\
&= \frac{1}{2n^2},
\end{aligned}$$
and for $i = n$,
$$E(\Delta|I = n) = K(n, n - 1)(\mu_n - \mu_{n-1}) = -\frac{1}{2n}.$$
As $\{I = n\} = \{W = 0\}$, the claim (13.71) is shown. To show that $E(W) = 1$, argue as in the proof of (13.71) to compute that
$$E\big(\Delta^3|W\big) = \frac{2}{n^3}(W - 1). \tag{13.74}$$
Since $W$ and $W'$ are exchangeable, $E(\Delta^3) = 0$, and taking expectation in (13.74) yields $E(W) = 1$. Similarly one checks that $E(\Delta^2|W) = 1/n^2$, proving (13.72). Lastly, to show (13.73), since $\mu_i$ is decreasing in $i$, for $i \in \{0, 1, \ldots, n - 1\}$ we have
$$\begin{aligned}
E\big(|\Delta|^3\,\big|\,I = i\big) &= K(i, i + 1)(\mu_i - \mu_{i+1})^3 + K(i, i - 1)(\mu_{i-1} - \mu_i)^3\\
&= \frac{2n - i + 1}{4n(n - i)(2n - 2i + 1)}\cdot\frac{(2n - 2i)^3}{n^3} + \frac{i}{4n(2n - 2i + 1)(n - i + 1)}\cdot\frac{(2n + 2 - 2i)^3}{n^3}\\
&= \frac{2\big((2n - i + 1)(n - i)^2 + i(n - i + 1)^2\big)}{n^4(2n - 2i + 1)}\\
&\le \frac{2\big((2n - i + 1)(n - i)(n - i + 1) + i(n - i + 1)^2\big)}{n^4(2n - 2i + 1)}\\
&= \frac{2(n - i + 1)}{n^3},
\end{aligned}$$
and when $i = n$,
$$E\big(|\Delta|^3\,\big|\,I = n\big) = K(n, n - 1)|\mu_n - \mu_{n-1}|^3 = \frac{2}{n^3} = \frac{2(n - i + 1)}{n^3}.$$
Thus,
$$E\big(|\Delta|^3\,\big|\,W\big) \le \frac{2(n - I + 1)}{n^3} \le \frac{2\sqrt{n(W + 2)}}{n^3} = 2n^{-5/2}\sqrt{W + 2},$$
and, by Jensen's inequality,
$$E|\Delta|^3 \le 2n^{-5/2}\sqrt{E(W + 2)} = 2\sqrt{3}\,n^{-5/2} \le 4n^{-5/2}.$$
This proves (13.73).
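The identities of Lemma 13.5 are finite algebraic statements, so for any fixed $n$ they can be confirmed in exact rational arithmetic. The following sketch does this with Python's `fractions` module; the function names `bernoulli_laplace_chain` and `delta_moment` are hypothetical helpers introduced here, and the code simply transcribes (13.68), (13.69) and the kernel $K$ as stated above.

```python
from fractions import Fraction
from math import comb

def bernoulli_laplace_chain(n):
    """Return pi (13.68), mu (13.69) and the up/down entries of K, all exact."""
    total = comb(2 * n, n)
    pi = [Fraction(comb(2 * n, i) - (comb(2 * n, i - 1) if i > 0 else 0), total)
          for i in range(n + 1)]
    mu = [Fraction((n - i) * (n + 1 - i), n) for i in range(n + 1)]
    up = [Fraction(2 * n - i + 1, 4 * n * (n - i) * (2 * n - 2 * i + 1))
          if i < n else Fraction(0) for i in range(n + 1)]
    down = [Fraction(i, 4 * n * (2 * n - 2 * i + 1) * (n - i + 1))
            for i in range(n + 1)]
    return pi, mu, up, down

def delta_moment(mu, up, down, i, power, absolute=False):
    """E(Delta^power | I = i), with Delta = W - W' = mu_I - mu_J."""
    n = len(mu) - 1
    steps = []
    if i < n:
        steps.append((up[i], mu[i] - mu[i + 1]))    # move up: J = i + 1
    if i > 0:
        steps.append((down[i], mu[i] - mu[i - 1]))  # move down: J = i - 1
    return sum((prob * (abs(d) if absolute else d) ** power for prob, d in steps),
               Fraction(0))

n = 6
pi, mu, up, down = bernoulli_laplace_chain(n)
mean_w = sum(p * m for p, m in zip(pi, mu))                 # E(W) in (13.72)
third_abs = sum(pi[i] * delta_moment(mu, up, down, i, 3, absolute=True)
                for i in range(n + 1))                      # E|Delta|^3 in (13.73)
```

Holding states contribute nothing to any moment of $\Delta$, which is why only the two neighbouring transitions appear in `delta_moment`.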
Applying Theorem 13.4 with $c_0 = 2n^2$ and $r(w) = -\big((n + 1)/(2n^2)\big)\mathbf{1}(w = 0)$, as given by (13.71), now yields the following result. Note that $Wr(W) = 0$, and that $E(\Delta^2|W) = 1/n^2$ makes the first term of the bound vanish, so only the term $(c_0/2)E|\Delta|^3 \le 4n^{-1/2}$ remains.

Theorem 13.5 Let $W$ be a scaled, translated randomly chosen eigenvalue of the Bernoulli–Laplace diffusion model given by (13.69). Then, with $Y$ having the unit exponential distribution,
$$\|\mathcal{L}(W) - \mathcal{L}(Y)\|_1 \le 4n^{-1/2}. \tag{13.75}$$

As the difference between $W$ and $W'$ is large when $I$ is close to zero, Theorem 13.2 does not provide a useful bound for the Kolmogorov distance between $W$ and $Y$. However, using a completely different approach and some heavy machinery, Chatterjee et al. (2008) are able to show that
$$\sup_{z \in \mathbb{R}}|P(W \le z) - P(Y \le z)| \le Cn^{-1/2} \tag{13.76}$$
where $C$ is a universal constant.
13.4.2 First Passage Times

We now consider an approach to exponential approximation different from that of the previous sections, following Peköz and Röllin (2009). In the nomenclature of renewal theory, for a non-negative random variable $X$ with finite mean, $X^e$ is said to have the equilibrium distribution with respect to $X$ if for all Lipschitz functions $f$,
$$Ef(X) - f(0) = EX\,Ef'\big(X^e\big). \tag{13.77}$$
Clearly (13.77) holds for all Lipschitz functions if and only if it holds for all Lipschitz functions $f$ with $f(0) = 0$. The equilibrium distribution has a close connection with both the size biased and zero biased distributions. For the first connection, let $X^s$ have the $X$-size bias distribution of $X$, that is,
$$EXf(X) = EX\,Ef\big(X^s\big)$$
for all functions $f$ for which these expectations exist. Then, if $U$ has the uniform distribution on $[0, 1]$ independent of $X^s$, the variable $UX^s$ has the equilibrium distribution $X^e$ with respect to $X$. Indeed, if $f$ is any Lipschitz function then
$$Ef(X) - f(0) = EXf'(UX) = EX\,Ef'\big(UX^s\big) = EX\,Ef'\big(X^e\big).$$
For the second connection, recall how a random variable with the zero bias distribution can likewise be formed by multiplying a square bias variable by an independent uniform. Note also that for any $a > 0$, parallel to (2.59), we have
$$(aX)^e = U(aX)^s = aUX^s = aX^e. \tag{13.78}$$
Additionally, a simple calculation shows that in general $X^e$ is absolutely continuous with distribution function $\int_0^x P(X > t)\,dt/EX$. As identity (13.7) characterizes the exponential distribution we see that if $X =_d X^e$ then $X \sim \mathcal{E}(\lambda)$, where $\lambda = 1/EX$. Hence the equilibrium distribution of $X$ operates for the exponential distributions in the same way that the zero bias distributions do for the mean zero normals. By analogy then, if the distributions of $X$ and $X^e$ are close then $X$ should be approximately exponential. This intuition is made precise by the following result of Peköz and Röllin (2009).

Theorem 13.6 Let $W$ be a non-negative random variable with $EW = 1$ and let $W^e$ have the equilibrium distribution with respect to $W$. Then, for $Y \sim \mathcal{E}(1)$ and any $\beta > 0$,
$$\sup_{z \in \mathbb{R}}|P(W \le z) - P(Y \le z)| \le 12\beta + 2P\big(|W^e - W| > \beta\big) \tag{13.79}$$
and
$$\sup_{z \in \mathbb{R}}\big|P\big(W^e \le z\big) - P(Y \le z)\big| \le \beta + P\big(|W^e - W| > \beta\big). \tag{13.80}$$
If in addition $W$ has a finite second moment, then
$$\|\mathcal{L}(W) - \mathcal{E}(1)\|_1 \le 2E|W^e - W| \tag{13.81}$$
and
$$\sup_{z \in \mathbb{R}}\big|P\big(W^e \le z\big) - P(Y \le z)\big| \le E|W^e - W|. \tag{13.82}$$
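The defining identity (13.77) can be verified exactly for a lattice variable. The sketch below is an illustration under stated assumptions: it uses the fact that $X^e$ has density $P(X > x)/EX$, so that for a non-negative integer valued $X$ one has $EX\,Ef'(X^e) = \sum_{k \ge 0} P(X > k)(f(k + 1) - f(k))$. The probability mass function and the test function are arbitrary choices made here, and `equilibrium_identity` is a hypothetical helper.

```python
from fractions import Fraction

def equilibrium_identity(pmf, f):
    """Return both sides of (13.77) for a non-negative integer valued X.

    pmf maps each value of X to its probability. The right side integrates
    f' against P(X > x), which is constant on each interval [k, k + 1).
    """
    left = sum(f(x) * p for x, p in pmf.items()) - f(0)
    right = Fraction(0)
    for k in range(max(pmf)):
        tail = sum(p for x, p in pmf.items() if x > k)  # P(X > k)
        right += tail * (f(k + 1) - f(k))
    return left, right

# An arbitrary example distribution and a polynomial test function.
example_pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 4: Fraction(1, 6)}
lhs, rhs = equilibrium_identity(example_pmf, lambda x: x ** 3 - 2 * x)
```

The two returned values coincide, since summing $P(X > k)(f(k+1) - f(k))$ over $k$ telescopes to $E f(X) - f(0)$; the same interchange of summation appears in the proof of (13.88).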
The time until the occurrence of a rare event can often be well approximated by an exponential distribution. Aldous (1989) gives a wide survey of some settings where this phenomenon can occur, and Aldous and Fill (1994) summarize many results in the setting of Markov chain hitting times. We consider exponential approximation for first passage times, following the paper by Peköz and Röllin (2009). If $X_0, X_1, \ldots$ is a Markov chain taking values in a denumerable space $\mathcal{X}$, for $j \in \mathcal{X}$ let
$$T_{\pi,j} = \inf\{t \ge 0: X_t = j\}, \tag{13.83}$$
the time of the first visit to state $j$ when the chain is initialized at time 0 with distribution $\pi$, and let
$$T_{i,j} = \inf\{t > 0: X_t = j\} \tag{13.84}$$
be the first time the Markov chain started in state $i$ at time 0 next visits state $j$.

Theorem 13.7 Let $X_0, X_1, \ldots$ be an ergodic, stationary Markov chain with stationary distribution $\pi$, and $T_{\pi,j}$ and $T_{i,j}$ the first passage times as given in (13.83) and (13.84), respectively. Then, with $Y \sim \mathcal{E}(1)$, the unit exponential distribution, for every $i \in \mathcal{X}$ we have
$$\sup_{z \in \mathbb{R}}|P(\pi_iT_{\pi,i} \le z) - P(Y \le z)| \le 1.5\pi_i + \pi_iE|T_{\pi,i} - T_{i,i}|, \tag{13.85}$$
$$\sup_{z \in \mathbb{R}}|P(\pi_iT_{\pi,i} \le z) - P(Y \le z)| \le 2\pi_i + P(T_{\pi,i} \ne T_{i,i}) \tag{13.86}$$
and
$$\sup_{z \in \mathbb{R}}|P(\pi_iT_{\pi,i} \le z) - P(Y \le z)| \le 2\pi_i + \sum_{n=1}^\infty \big|P^{(n)}_{i,i} - \pi_i\big|, \tag{13.87}$$
where $P^{(n)}_{i,i} = P(X_n = i|X_0 = i)$.

Proof We first claim that if $U$ is a uniform $[0, 1]$ random variable independent of all else, then the equilibrium distribution of $T_{i,i}$ is given by
$$T^e_{i,i} =_d T_{\pi,i} + U. \tag{13.88}$$
To prove (13.88) we first demonstrate
$$P(T_{\pi,i} = k) = \pi_iP(T_{i,i} > k). \tag{13.89}$$
Consider the renewal–reward process which has renewals at visits to state i and a reward of 1 when times between renewals is greater than k. Then, with (X1 , R1 ), (X2 , R2 ), . . . the sequence of renewal interarrival times and rewards, the renewal–reward theorem (see, e.g., Grimmett and Stirzaker 2001) yields
13.4 Exponential Approximation
365
lim
t→∞
ER(t) ER1 , = t EX1
(13.90)
where R(t) is the total reward received by time t . As the mean length between renewals is ETi,i = 1/πi , the right hand side of (13.90) equals the right hand side of (13.89). On the other hand, there is precisely one time t when the waiting time to the next renewal Yt = inf{s ≥ t: Xs = i} − t
is exactly k if and only if the cycle length is greater than k. Hence R(t) = si=1 Yi , and R(t)/t → EYs , which is the left hand side of (13.89). Hence, letting f be a Lipschitz function with f (0) = 0, using (13.89) for the first equality, we have Ef (Tπ,i + U ) = E f (Tπ,i + 1) − f (Tπ,i ) = πi
∞
P (Ti,i > k) f (k + 1) − f (k)
k=0
= πi
∞ ∞
P (Ti,i = j ) f (k + 1) − f (k)
k=0 j =k+1
= πi = πi
j −1 ∞
P (Ti,i = j ) f (k + 1) − f (k)
j =0 k=0 ∞
P (Ti,i = j )f (j )
j =0
= πi Ef (Ti,i ). The claim (13.88) now follows from definition (13.77). Writing the L∞ norm between random variables ξ and η more compactly as L(ξ ) − L(η) ∞ , by the triangle inequality, we have L(πi Tπ,i ) − E(1) ∞ ≤ L(πi Tπ,i ) − L πi (Tπ,i + U ) ∞ + L πi (Tπ,i + U ) − E(1) ∞ ≤ πi + L πi (Tπ,i + U ) − E(1) ∞ . (13.91) To justify the final inequality (13.91), with [z] denoting the greatest integer not greater than z and z ≥ 0, note that P (Tπ,i + U ≤ z) = P Tπ,i ≤ [z] − 1 + P Tπ,i = [z], U ≤ z − [z] = P Tπ,i ≤ [z] − 1 + z − [z] P Tπ,i = [z] , that P (Tπ,i ≤ z) = P Tπ,i ≤ [z] − 1 + P Tπ,i = [z] ,
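and with $X_0, X_1, \ldots$ in the equilibrium distribution $\pi$, that $P(T_{\pi,i} = [z]) \le P(X_{[z]} = i) = \pi_i$. Continuing from (13.91) and applying (13.78), (13.88), and (13.82) of Theorem 13.6, and then (13.88) again, we obtain
\begin{align*}
\|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{E}(1)\|_\infty
&\le \pi_i + \|\mathcal{L}(\pi_i T^e_{i,i}) - \mathcal{E}(1)\|_\infty \\
&\le \pi_i + \|\mathcal{L}((\pi_i T_{i,i})^e) - \mathcal{E}(1)\|_\infty \\
&\le \pi_i + \pi_i E|T_{\pi,i} + U - T_{i,i}| \\
&\le 1.5\pi_i + \pi_i E|T_{\pi,i} - T_{i,i}|, \tag{13.92}
\end{align*}
proving (13.85). Taking $\beta = \pi_i$ in (13.80), from (13.92) and (13.88) we obtain
\[ \|\mathcal{L}(\pi_i T_{\pi,i}) - \mathcal{E}(1)\|_\infty \le 2\pi_i + P\bigl(|T_{\pi,i} + U - T_{i,i}| > 1\bigr), \]
and we now obtain (13.86) by noting that $|T_{\pi,i} + U - T_{i,i}| > 1$ implies $T_{\pi,i} \ne T_{i,i}$.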
Lastly, to show (13.87), let $X_0, X_1, \ldots$ be the chain in equilibrium and $Y_0, Y_1, \ldots$ a coupled copy started in state $i$ at time 0, according to the maximal coupling, see Griffeath (1974/1975), so that $P(X_n = Y_n = i) = \pi_i \wedge P^{(n)}_{i,i}$. Let $T_{\pi,i}$ and $T_{i,i}$ be the hitting times given by (13.83) and (13.84) defined on the $X$ and $Y$ chain, respectively. Then
\[ P(T_{\pi,i} \ne T_{i,i}) \le \sum_{n=0}^{\infty} \bigl[ P(X_n = i, Y_n \ne i) + P(Y_n = i, X_n \ne i) \bigr], \]
and since
\[ P(X_n = i, Y_n \ne i) = \pi_i - P(X_n = i, Y_n = i) = \pi_i - \pi_i \wedge P^{(n)}_{i,i} = \bigl(\pi_i - P^{(n)}_{i,i}\bigr)^+, \]
and a similar calculation yields
\[ P(Y_n = i, X_n \ne i) = \bigl(P^{(n)}_{i,i} - \pi_i\bigr)^+, \]
we obtain (13.87) from (13.86).
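The identity (13.89) used at the start of this proof can be checked numerically on a small chain. The sketch below is only an illustration under stated assumptions: the 3-state transition matrix `P` is made up, and the hitting times are taken to be $T_{\pi,i} = \inf\{t \ge 0: X_t = i\}$ for the chain started from $\pi$ and $T_{i,i} = \inf\{t \ge 1: X_t = i\}$ for the chain started at $i$, which is the reading of (13.83) and (13.84) consistent with (13.89) at $k = 0$. The helper names `step`, `stationary` and `hitting_pmf` are hypothetical.

```python
def step(mass, P):
    """One step of the chain: row vector `mass` times transition matrix `P`."""
    n = len(P)
    return [sum(mass[a] * P[a][b] for a in range(n)) for b in range(n)]


def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = step(pi, P)
    return pi


def hitting_pmf(P, start, i, kmax, allow_zero):
    """Return [P(T = k) for k = 0..kmax] for the first visit T to state i.

    allow_zero=True  : T = inf{t >= 0 : X_t = i}
    allow_zero=False : T = inf{t >= 1 : X_t = i}
    """
    mass = list(start)          # mass of paths that have not yet hit i
    pmf = [0.0] * (kmax + 1)
    if allow_zero:
        pmf[0], mass[i] = mass[i], 0.0
    for k in range(1, kmax + 1):
        mass = step(mass, P)
        pmf[k], mass[i] = mass[i], 0.0
    return pmf


# Illustrative irreducible, aperiodic chain (an assumption, not from the text).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
i, kmax = 0, 30

pi = stationary(P)
t_pi = hitting_pmf(P, pi, i, kmax, allow_zero=True)
t_ii = hitting_pmf(P, [1.0, 0.0, 0.0], i, kmax, allow_zero=False)

# Largest discrepancy between P(T_pi = k) and pi_i * P(T_ii > k) over k <= kmax.
gap = max(abs(t_pi[k] - pi[i] * (1.0 - sum(t_ii[:k + 1]))) for k in range(kmax + 1))
print(gap)
```

The printed discrepancy should be at the level of floating point rounding, since both sides are computed exactly up to the power iteration used for $\pi$.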
From, say, Billingsley (1968) Theorem 8.9, we know that for a finite, irreducible, aperiodic Markov chain, convergence to the unique stationary distribution is exponential, that is, that there exist $A \ge 0$ and $0 \le \rho < 1$ such that
\[ \bigl| P^{(n)}_{i,i} - \pi_i \bigr| \le A\rho^n \quad \text{for all } n \in \mathbb{N}. \]
In this case, (13.87) of Theorem 13.7 immediately yields the bound
\[ \sup_{z \in \mathbb{R}} \bigl| P(\pi_i T_{\pi,i} \le z) - P(Y \le z) \bigr| \le \inf_{k \in \mathbb{N}} \bigl\{ (k+1)\pi_i + A\rho^k/(1-\rho) \bigr\} \]
on the Kolmogorov distance between $\pi_i T_{\pi,i}$ and the unit exponential.
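To get a feel for the size of this bound, the infimum over $k$ can simply be searched. The following is a minimal sketch; the function name `kolmogorov_bound`, the search cap `kmax`, and the numerical values of $\pi_i$, $A$ and $\rho$ are all illustrative assumptions, and $\mathbb{N}$ is taken to start at 1.

```python
def kolmogorov_bound(pi_i, A, rho, kmax=10_000):
    """Minimise (k + 1) * pi_i + A * rho**k / (1 - rho) over k = 1, ..., kmax.

    Returns the minimising k and the value of the bound there.
    """
    best_k, best = None, float("inf")
    for k in range(1, kmax + 1):
        value = (k + 1) * pi_i + A * rho ** k / (1.0 - rho)
        if value < best:
            best_k, best = k, value
    return best_k, best


# Illustrative values only: a rarely visited state and a fast-mixing chain.
k_star, bound = kolmogorov_bound(pi_i=0.01, A=1.0, rho=0.5)
print(k_star, bound)
```

The first term grows linearly in $k$ while the second decays geometrically, so the minimiser is of order $\log(1/\pi_i)/\log(1/\rho)$ and the bound is of order $\pi_i \log(1/\pi_i)$ for fixed $A$ and $\rho$.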
Using the results of Theorem 13.7, Peköz and Röllin (2009) give bounds to the exponential for the times of appearance of patterns in independent coin tosses. In addition, they consider exponential approximations for random sums, and the asymptotic behavior of the scaled population size in a critical Galton–Watson branching process, conditioned on non-extinction.
Appendix

Proof of Lemma 13.1 (i) Let $Y'$ be an independent copy of $Y$. Then, for $y \in (a, b)$, we can rewrite $f_h$ in (13.6) as
\[ f(y) = \frac{1}{p(y)} E\bigl[(h(Y) - h(Y')) \mathbf{1}\{Y \le y\}\bigr] = -\frac{1}{p(y)} E\bigl[(h(Y) - h(Y')) \mathbf{1}\{Y > y\}\bigr], \tag{13.93} \]
which yields
\[ |f(y)| \le 2\|h\| \min\{F(y), 1 - F(y)\}/p(y). \tag{13.94} \]
Inequality (13.12) now follows from (13.10) and (13.94). Inequalities (13.94) and (13.11) imply $|f_h p'/p| \le 2 d_2 \|h\|$, that is, (13.13), and now (13.14) follows from (13.5).

(ii) Let $g_1(y) = p'(y)/p(y)$ for $y \in (a, b)$. Differentiating (13.5) we obtain
\[ f'' = h' - f' g_1 - f g_1'. \tag{13.95} \]
To prove (13.17), it suffices to show that
\[ \|f g_1'\| \le d_3 \|h'\| \tag{13.96} \]
and
\[ \|f' g_1\| \le (1 + d_3) d_2 \|h'\|. \tag{13.97} \]
By (13.93) again, we have
\begin{align*}
|f(y)| p(y) &\le \|h'\| \min\bigl\{ E(|Y| + |Y'|)\mathbf{1}\{Y \le y\},\ E(|Y| + |Y'|)\mathbf{1}\{Y > y\} \bigr\} \\
&= \|h'\| \min\bigl\{ E|Y|\mathbf{1}\{Y \le y\} + E|Y| F(y),\ E|Y|\mathbf{1}\{Y > y\} + E|Y|(1 - F(y)) \bigr\}. \tag{13.98}
\end{align*}
This inequality proves (13.96) by assumption (13.15); the claim (13.18) follows in the same way from (13.16). Rearranging (13.95) and multiplying by $p(y)$ yields
\[ (h' - f g_1') p = p(f'' + f' g_1) = f'' p + f' p' = (f' p)'. \]
Thus, noting that the boundary terms vanish by (13.6),
\[ f'(y) p(y) = \int_a^y (h' - f g_1') p \, dx = -\int_y^b (h' - f g_1') p \, dx, \]
and hence, using (13.96), $|f'(y) p(y)| \le \|h'\| (1 + d_3) \min\{F(y), 1 - F(y)\}$, yielding (13.97), as well as (13.19), by (13.11) and (13.10), respectively.
Proof of Lemma 13.2 First, we see that Condition 13.1 is satisfied for densities of the form (13.24) whenever $E|Y| < \infty$ for $Y$ having density $p$, by (13.26). To prove the remaining claims, it suffices to verify that hypotheses (13.10), (13.11), (13.15) and (13.16) of Lemma 13.1 hold with $d_1 = 1/c_1$, $d_2 = 1$, $d_3 = c_2$ and $d_4(y) = c_2$. Let $g_2(y) = c_0 g(y)$ and $F(y) = P(Y \le y)$, the distribution function of $Y$. Consider the case $0 \in [a, b]$, so that $G(0) = 0$ and $p(0) = c_1$. We first show that (13.10) is satisfied with $d_1 = 1/c_1$. It suffices to show that
\[ F(y) \le F(0) p(y)/c_1 \quad \text{for } a < y < 0, \tag{13.99} \]
and
\[ 1 - F(y) \le (1 - F(0)) p(y)/c_1 \quad \text{for } 0 \le y < b. \tag{13.100} \]
Consider the case $a < y < 0$ and let $H(y) = F(y) - F(0) p(y)/c_1$. Differentiating,
\[ H'(y) = p(y) - F(0) p'(y)/c_1 = p(y) + F(0) g_2(y) p(y)/c_1 = p(y)\bigl(1 + F(0) g_2(y)/c_1\bigr). \]
Since $g_2(y)$ is non-decreasing by Condition 13.2, if $H'(0) > 0$, then $H'(y)$ has at most one sign change on $(a, 0)$, and therefore $H$ achieves its maximum either at $a$ or at 0. If $H'(0) \le 0$, then $H'(y) \le 0$ for all $y < 0$, and $H$ achieves its maximum at $a$. However, as $H(0) = 0$ and $H(a) = -F(0) p(a)/c_1 \le 0$, we conclude that $H(y) \le 0$ for all $y \le 0$. This proves (13.99). Inequality (13.100) can be shown in a similar fashion.

Next we prove (13.11) holds with $d_2 = 1$. First consider $a < y < 0$. Inequality (13.11) is trivial when $p'(y) = 0$, so consider $y$ such that $p'(y) = -p(y) g_2(y) \ne 0$. By Condition 13.2, we have $g_2(x) \le g_2(y) < 0$ for all $x \le y$, and therefore
\begin{align*}
F(y) = \int_a^y p(x) \, dx &\le \int_a^y \frac{p(x) g_2(x)}{g_2(y)} \, dx = \int_a^y \frac{-p'(x)}{g_2(y)} \, dx \\
&= \frac{p(y) - p(a)}{-g_2(y)} \le \frac{p(y)}{|g_2(y)|}. \tag{13.101}
\end{align*}
Similarly, we have
\[ 1 - F(y) \le p(y)/g_2(y) \quad \text{for } 0 \le y < b. \tag{13.102} \]
Hence (13.11) is satisfied with $d_2 = 1$.
Note that (13.100) and (13.102) imply that
\[ 1 - F(y) \le p(y) \min\{1/c_1, 1/g_2(y)\} \quad \text{for } 0 \le y < b, \tag{13.103} \]
and that likewise, by (13.99) and (13.101), we have
\[ F(y) \le p(y) \min\{1/c_1, 1/|g_2(y)|\} \quad \text{for } a < y < 0. \tag{13.104} \]
To verify (13.15), with $0 \le y < b$ write
\begin{align*}
E|Y|\mathbf{1}\{Y > y\} &= y P(Y > y) + \int_y^b P(Y > t) \, dt \\
&\le y p(y) \min\{1/c_1, 1/g_2(y)\} + \int_y^b p(t) \min\{1/c_1, 1/g_2(t)\} \, dt \\
&\le y p(y) \min\{1/c_1, 1/g_2(y)\} + \min\{1/c_1, 1/g_2(y)\} \int_y^b p(t) \, dt \\
&= \min\{1/c_1, 1/g_2(y)\} \bigl( y p(y) + 1 - F(y) \bigr) \\
&\le \min\{1/c_1, 1/g_2(y)\} \bigl( y p(y) + p(y)/c_1 \bigr) \\
&= p(y) \min\{1/c_1, 1/g_2(y)\} \{y + 1/c_1\}. \tag{13.105}
\end{align*}
Similarly, for $a < y < 0$ we obtain
\[ E|Y|\mathbf{1}\{Y \le y\} \le p(y) \min\{1/c_1, 1/|g_2(y)|\} \bigl( |y| + 1/c_1 \bigr). \tag{13.106} \]
Applying inequalities (13.105) and (13.106) at $y = 0$, and noting again that $G(0) = 0$ so that $p(0) = c_1$, gives $E|Y| \le 2/c_1$. Hence, recalling (13.103),
\[ E|Y|\mathbf{1}\{Y > y\} + E|Y|(1 - F(y)) \le p(y) \min\{1/c_1, 1/g_2(y)\} \{y + 3/c_1\} \quad \text{for } 0 \le y < b, \tag{13.107} \]
and, using (13.104),
\[ E|Y|\mathbf{1}\{Y \le y\} + E|Y| F(y) \le p(y) \min\{1/c_1, 1/|g_2(y)|\} \bigl( |y| + 3/c_1 \bigr) \quad \text{for } a < y < 0. \tag{13.108} \]
Thus, (13.15) holds with $d_3 = c_2$ by Condition 13.3. Inequalities (13.107) and (13.108) also show that (13.16) is satisfied with $d_4(y) = c_2$, completing the proof of Lemma 13.2 for the case $0 \in [a, b]$. The other cases follow similarly, noting that for $a > 0$ we have $G(a) = 0$ and $p(a) = c_1$, and likewise when $b \le 0$ we have $G(b) = 0$ and $p(b) = c_1$.
Chapter 14
Group Characters and Malliavin Calculus
In this chapter we outline two recent developments in the area of Stein’s method, one in the area of algebraic combinatorics, and the other a deep connection to the Malliavin calculus. We provide some background material in these areas in order to help make our presentation more self contained, but refer the reader to more complete sources for a comprehensive picture. The material in Sect. 14.1 is based on Fulman (2009), and that in Sect. 14.2 on Nourdin and Peccati (2009).
14.1 Normal Approximation for Group Characters

In the combinatorial central limit theorem studied in Sects. 4.4 and 6.1, one considers
\[ Y = \sum_{i=1}^{n} a_{i\pi(i)} \quad \text{where } \pi \sim \mathcal{U}(S_n), \tag{14.1} \]
that is, the sum of matrix elements, one from each row, where the column index is chosen according to a permutation $\pi$ with the uniform distribution on the symmetric group $S_n$ of $\{1, \ldots, n\}$. If one were interested, say, in the distribution of the number of fixed points of $\pi$, one could take $a_{ij} = \mathbf{1}\{i = j\}$, that is, let the matrix $(a_{ij})$ be the identity. Or, to look at the matter another way, if $e_1, \ldots, e_n$ are the standard (column) basis vectors in $\mathbb{R}^n$, and if $P_\pi = [e_{\pi(1)}, \ldots, e_{\pi(n)}]$, the permutation matrix associated to $\pi$, then the number of fixed points $Y$ of $\pi$ is the trace $\mathrm{Tr}(P_\pi)$ of $P_\pi$. More generally, we may write $Y$ in (14.1) as $\mathrm{Tr}(A P_\pi)$. Asking similar questions for other groups leads us fairly directly to the study of the distribution of traces, or group characters on random matrices, and generalizations thereof. One of the earliest results on traces of random matrices is due to Diaconis and Shahshahani (1994), who applied the method of moments to prove joint convergence in distribution for traces of powers in a number of classical compact groups, including the orthogonal group $O(n, \mathbb{R})$. For this case, Stein (1995) showed that the error in the normal approximation decreases faster than the rate $n^{-r}$ for any fixed $r$, and Johansson (1997), working more generally, showed the rate is actually exponential, validating a conjecture of Diaconis.

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4_14, © Springer-Verlag Berlin Heidelberg 2011

The theory of representations and group characters being a rich one, below we only provide the most basic relevant definitions, but for omitted details and proofs, refer the reader to Weyl (1997) and Fulton and Harris (1991) for in depth treatments, and in particular to Serre (1997) for finite groups, and Sagan (1991) for the symmetric group. For $V$ a finite dimensional vector space, let $GL(V)$ denote the set of all invertible linear transformations from $V$ to itself. When taking $V$ to be some subspace of $\mathbb{C}^n$, each such transformation may be considered as an invertible $n \times n$ matrix with complex entries. A representation of a group $G$ is a map $\tau: G \to GL(V)$ which preserves the group structure, that is, which obeys $\tau(e) = I$
and
τ (gh) = τ (g)τ (h),
where $e$ is the identity element of $G$, and $I$ the identity matrix. For groups where $G$ itself has a topology we also require $\tau$ to be continuous. The map $\tau(g) = 1$ for all $g \in G$, clearly a representation, is called the trivial representation. We define the dimension $\dim(\tau)$ of the representation $\tau$ to be the dimension of the vector space $V$. The defining representation, for groups such as the ones we consider below which are already presented as matrices, is just the map that sends each matrix to itself. If $\tau$ is a given representation over the vector space $V$, we say that a subspace $W \subset V$ is invariant if
\[ w \in W \quad \text{implies} \quad \tau(g) w \in W \quad \text{for all } g \in G. \]
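The motivating example (14.1) and the defining property of a representation can be checked by brute force on the symmetric group $S_3$. In the sketch below, permutations are 0-indexed tuples, `perm_matrix` builds $P_\pi = [e_{\pi(1)}, \ldots, e_{\pi(n)}]$ as in the text, and the matrix `A` is an arbitrary illustrative choice; all function names are hypothetical helpers, not notation from the text.

```python
from itertools import permutations


def perm_matrix(p):
    """Matrix whose j-th column is e_{p[j]}: entry (k, j) equals 1 iff k == p[j]."""
    n = len(p)
    return [[1 if k == p[j] else 0 for j in range(n)] for k in range(n)]


def matmul(A, B):
    n = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(n)) for c in range(n)] for r in range(n)]


def trace(A):
    return sum(A[r][r] for r in range(len(A)))


def compose(s, p):
    """The permutation j -> s[p[j]], that is, apply p first and then s."""
    return tuple(s[p[j]] for j in range(len(p)))


S3 = list(permutations(range(3)))
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # arbitrary illustrative matrix

# tau(gh) = tau(g) tau(h) for the map p -> P_p.
homomorphism_ok = all(
    matmul(perm_matrix(s), perm_matrix(p)) == perm_matrix(compose(s, p))
    for s in S3 for p in S3
)

# Tr(A P_p) equals the sum of a_{i, p(i)} in (14.1).
trace_ok = all(
    trace(matmul(A, perm_matrix(p))) == sum(A[i][p[i]] for i in range(3))
    for p in S3
)

# Tr(P_p) counts fixed points; summed over S_3 this gives 3 + 1 + 1 + 1 + 0 + 0.
total_fixed_points = sum(trace(perm_matrix(p)) for p in S3)
print(homomorphism_ok, trace_ok, total_fixed_points)
```

With this column convention the map $\pi \mapsto P_\pi$ respects composition in the order "apply $\pi$ first, then $\sigma$", which is the convention `compose` implements.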
If $\tau$ has no invariant subspaces other than the trivial ones $W = V$ and $W = 0$, we say $\tau$ is irreducible. Recall that $Y$ in the motivating example (14.1) could be written in terms of the trace of a matrix. Given a representation $\tau$, the character $\chi^\tau$ of $\tau$, also written simply as $\chi$ when $\tau$ is implicit, is given by $\chi(g) = \mathrm{Tr}\, \tau(g)$. We let the dimension, or degree, of $\chi$ be the dimension of $\tau$, and say that $\chi$ is irreducible whenever $\tau$ is. The character $\chi$ inherits the properties of the trace, and in particular is the sum of the eigenvalues of $\tau(g)$, counting multiplicities. As a representation always sends the identity element to the identity matrix, the dimension of any representation may be calculated by evaluating its character on the identity. A few more simple facts regarding characters are in order. For a complex number $z \in \mathbb{C}$, let $\overline{z}$ denote the complex conjugate of $z$. For $\chi$ a character of the representation $\tau$, we have
(i) $\chi(g^{-1}) = \overline{\chi(g)}$ for all $g \in G$.
(ii) $\chi(hgh^{-1}) = \chi(g)$ for all $g, h \in G$.
The first property may be shown by arguing that the transformations $\tau(g)$ may be taken to be unitary without loss of generality. Hence, any eigenvalue of $\tau(g)$ has modulus 1, so its inverse and complex conjugate are equal. Clearly then the same
holds for their sum. The second property is simply a consequence of the cyclic invariance of the trace. Recalling that $C$ is a conjugacy class of the group $G$ if $g^{-1} h g \in C$ for all $g \in G$ and $h \in C$, property (ii) may be stated as the fact that characters are constant on the conjugacy classes of $G$, that is, they are what are known as ‘class functions.’ Though the theory of group representations and group characters is more general, we closely follow a portion of the work of Fulman (2009), focusing on the following three compact Lie groups:
1. $O(n, \mathbb{R})$, the orthogonal group, of all $n \times n$ real valued matrices $U$ such that $U^{\mathsf{T}} U = I$.
2. $SO(n, \mathbb{R})$, the special orthogonal group, of all elements in $O(n, \mathbb{R})$ with determinant 1.
3. $USp(2n, \mathbb{C})$, the unitary symplectic group of all $2n \times 2n$ complex matrices $U$ that satisfy
\[ U^\dagger \Upsilon U = \Upsilon \quad \text{where } \Upsilon = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}, \]
with $U^\dagger$ the conjugate transpose of $U$.
We will not require a precise definition of a Lie group, and refer the reader to Hall (2003). To consider the analogs of the uniform measure in (14.1) on these groups, we select a group element with distribution according to Haar measure, that is, the unique measure $\mu$ on $G$ of total mass one that satisfies
\[ \mu(gS) = \mu(S) \quad \text{for all } g \in G \text{ and all measurable subsets } S \text{ of } G. \]
Using the Haar measure, we can define an inner product on complex valued functions $\chi$ and $\psi$ on $G$ by $\langle \chi, \psi \rangle = E\chi\overline{\psi}$, that is, as
\[ \langle \chi, \psi \rangle = \int_G \chi(g) \overline{\psi(g)} \, d\mu. \]
If $\chi$ and $\psi$ are irreducible characters, then they satisfy the orthogonality relation
\[ \langle \chi, \psi \rangle = \delta_{\chi,\psi}. \tag{14.2} \]
In particular, if $W = \chi^\tau(g)$ where $\chi^\tau$ is a nontrivial irreducible character and $g$ is chosen according to Haar measure, this orthogonality relation implies
\[ EW = 0 \quad \text{and} \quad EW^2 = 1, \]
where for the first equality we have used the orthogonality of $\chi^\tau$ to the character of the trivial representation 1. We apply Stein’s method to study the distribution of $W = \chi^\tau(g)$ for $\chi^\tau$ an irreducible character and $g$ chosen according to the Haar measure of some compact Lie group. In particular, we construct appropriate Stein pairs $(W, W')$ so that Theorem 5.5 may be applied in this context. First, to show our choices are Stein pairs,
and for many subsequent calculations, we rely on Lemma 14.1 below, a result of Helgason (2000). Next, we must be able to bound the variance of a conditional expectation in order to handle the first term of Theorem 5.5. As $W$ is a function of $g$, (4.143) yields
\[ \mathrm{Var}\bigl( E[(W' - W)^2 \mid W] \bigr) \le \mathrm{Var}\bigl( E[(W' - W)^2 \mid g] \bigr). \tag{14.3} \]
The conditional expectation of $(W')^2$ given $g$ is computed in Lemma 14.2 with the help of Lemma 14.1, allowing for the computation of the conditional expectation on the right hand side of (14.3), and subsequently, its variance, in Lemma 14.3. The higher order moments required for the evaluation of the second term in the bound of Theorem 5.5 are handled in Lemma 14.4; as our constructions will lead to Stein pairs, the last term of that bound will be zero.

Focusing now on the real case, let $G$ be a compact Lie group and $\chi^\tau$ a nontrivial, real valued irreducible character of $G$. To create an appropriate $W'$, let $\alpha$ be chosen independently and uniformly from some fixed self-inverse conjugacy class $C$ of $G$, that is, a conjugacy class $C$ such that $h \in C$ implies $h^{-1} \in C$. (When all the characters of $G$ are real valued, all conjugacy classes are self inverse, see Fulman 2009.) We claim that the pair
\[ (W, W') = \bigl( \chi^\tau(g), \chi^\tau(\alpha g) \bigr) \tag{14.4} \]
is exchangeable. Since for all $\alpha \in G$ the product $\alpha^{-1} g$ is distributed according to Haar measure whenever $g$ is, and $\alpha^{-1} =_d \alpha$ when $\alpha$ is chosen uniformly over $C$, by the independence of $g$ and $\alpha$ we obtain
\[ \bigl( \chi^\tau(g), \chi^\tau(\alpha g) \bigr) =_d \bigl( \chi^\tau(\alpha^{-1} g), \chi^\tau(g) \bigr) =_d \bigl( \chi^\tau(\alpha g), \chi^\tau(g) \bigr). \]
Furthermore, since $C$ is self-inverse and characters are constant on conjugacy classes, for $\phi$ any representation, $\chi^\phi(\alpha) = \chi^\phi(\alpha^{-1}) = \overline{\chi^\phi(\alpha)}$, implying $\chi^\phi(\alpha)$ is real. Recall that the class functions on $G$ are the ones which are constant on the conjugacy classes of $G$, so in particular they form a vector space. We have seen that the characters themselves are class functions, but more is true. The irreducible characters of a group form an orthonormal basis for the class functions.
Indeed, the calculation of the bounds in the theorems that follow hinge on the expansion of given characters in terms of the irreducibles. For the calculation of the higher order moments required for the evaluation of the second term in the bound of Theorem 5.5 we require some additional facts regarding characters and tensor products. If τ and ρ are representations of groups G and H , then we may define the tensor product representation on the product group G × H by (τ ⊗ ρ)(g, h) = τ (g) ⊗ ρ(h)
for all g ∈ G, h ∈ H .
Letting $\chi$, $\psi$ and $\chi \otimes \psi$ be the characters associated to $\tau$, $\rho$ and $\tau \otimes \rho$, respectively, the properties of the tensor product directly imply that
\[ (\chi \otimes \psi)(g, h) = \chi(g)\psi(h). \tag{14.5} \]
When $H = G$ the mapping from $g$ to $\tau(g) \otimes \tau(g)$, denoted $\tau^2$, is again a representation of $G$, and more generally we may define the $r$-fold tensor product representation $\tau^r$ for $r \in \mathbb{N}$; when $r = 0$ we let the product $\tau^0$ be the trivial representation. If $\tau$ has character $\chi$, then by (14.5) the representation $\tau^r$ has character $\chi^r$. As the irreducible characters form a basis for the class functions, the character $\chi^r$ can be so decomposed; in particular, all that is needed to specify the decomposition of $\chi^r$ in terms of irreducible characters is the multiplicity $m_\phi(\tau^r)$ of the irreducible representation $\phi$ in $\tau^r$. The following lemma from Helgason (2000) is key in the sequel.
Lemma 14.1 Let $G$ be a compact Lie group and $\chi^\phi$ the character induced by the irreducible representation $\phi$ of $G$. Then
\[ \int_G \chi^\phi(h \alpha h^{-1} g) \, dh = \frac{\chi^\phi(\alpha)}{\dim(\phi)} \chi^\phi(g) \]
for all $\alpha, g \in G$.

We can now show that $W, W'$ is a Stein pair, and develop a number of its properties.

Lemma 14.2 On the compact Lie group $G$ let $(W, W')$ be given by (14.4) with $g$ chosen from Haar measure, independently of $\alpha$ having the uniform distribution over some fixed self inverse conjugacy class. Then for all $r \in \mathbb{N}_0$,
\[ W^r = \sum_\phi m_\phi(\tau^r) \chi^\phi(g) \quad \text{and} \quad E\bigl[(W')^r \mid g\bigr] = \sum_\phi m_\phi(\tau^r) \frac{\chi^\phi(\alpha)}{\dim(\phi)} \chi^\phi(g), \tag{14.6} \]
where the sum is over all irreducible representations of $G$,
\[ E(W' \mid W) = (1 - \lambda) W \quad \text{where } \lambda = 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)}, \tag{14.7} \]
and
\[ E(W' - W)^2 = 2\left( 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} \right). \tag{14.8} \]

Proof The first claim is simply the decomposition of the character $W^r = \chi^\tau(g)^r$ of $\tau^r$ of the product group $G^r$ over the basis of irreducible representations of $G$. Using this decomposition on $g' = \alpha g$ we obtain
\[ (W')^r = \sum_\phi m_\phi(\tau^r) \chi^\phi(g'). \]
Now, since $C = \{h \alpha h^{-1}: h \in G\}$ is the conjugacy class of $\alpha$, and when $h$ is distributed according to Haar measure then $h \alpha h^{-1}$ is uniform over $C$, we have
\[ E\bigl[ \chi^\phi(g') \mid g \bigr] = \int_G \chi^\phi(h \alpha h^{-1} g) \, dh = \frac{\chi^\phi(\alpha)}{\dim(\phi)} \chi^\phi(g), \]
by Lemma 14.1, proving the second equality in (14.6). When $r = 1$ only the summand $\phi = \tau$ is non-vanishing, yielding $E(W' \mid g)$ as a function of $W$, hence (14.7). The last claim is now immediate from (2.34).

We now calculate the right hand side of (14.3) for use in bounding the first term in Theorem 5.5.

Lemma 14.3 With $W, W'$ and $g$ as in Lemma 14.2,
\[ \mathrm{Var}\bigl( E[(W' - W)^2 \mid g] \bigr) = \sum_\phi{}^{*}\, m_\phi(\tau^2)^2 \left( 1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)} \right)^2, \]
where $*$ signifies that the sum is over all nontrivial irreducible representations of $G$.

Proof Expanding the square and using the measurability of $W$ with respect to $g$, then applying (14.7), and (14.6) with $r = 2$, of Lemma 14.2, we obtain
\begin{align*}
E\bigl[(W' - W)^2 \mid g\bigr] &= E\bigl[(W')^2 \mid g\bigr] - 2W E(W' \mid g) + W^2 \\
&= E\bigl[(W')^2 \mid g\bigr] + \left( 1 - \frac{2\chi^\tau(\alpha)}{\dim(\tau)} \right) W^2 \\
&= \sum_\phi m_\phi(\tau^2) \left( 1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)} \right) \chi^\phi(g).
\end{align*}
Now squaring and taking expectation, the orthogonality relation (14.2) for irreducible characters yields the second moment of the conditional expectation,
\[ E\bigl( E[(W' - W)^2 \mid g] \bigr)^2 = \sum_\phi m_\phi(\tau^2)^2 \left( 1 + \frac{\chi^\phi(\alpha)}{\dim(\phi)} - \frac{2\chi^\tau(\alpha)}{\dim(\tau)} \right)^2. \tag{14.9} \]
We note that the square of $E(W' - W)^2$, as given in Lemma 14.2, is the summand of (14.9) corresponding to the trivial representation of multiplicity 1, and therefore subtraction of this term to yield the variance completes the proof.

We now focus on the calculation of the fourth moment of $W' - W$ in order to bound the second term in Theorem 5.5.

Lemma 14.4 With $W, W'$ as in Lemma 14.2,
\[ E(W' - W)^4 = \sum_\phi m_\phi(\tau^2)^2 \left[ 8\left( 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} \right) - 6\left( 1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)} \right) \right]. \]
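Proof Expanding and using the measurability of $W$ with respect to $g$, and then applying Lemma 14.2, we obtain
\begin{align*}
E\bigl[(W' - W)^4 \mid g\bigr] &= \sum_{r=0}^{4} (-1)^r \binom{4}{r} \chi^\tau(g)^{4-r} E\bigl[(W')^r \mid g\bigr] \\
&= \sum_{r=0}^{4} (-1)^r \binom{4}{r} \chi^\tau(g)^{4-r} \sum_\phi m_\phi(\tau^r) \frac{\chi^\phi(\alpha)}{\dim(\phi)} \chi^\phi(g).
\end{align*}
Now, taking expectation yields
\begin{align*}
E(W' - W)^4 &= \sum_{r=0}^{4} (-1)^r \binom{4}{r} \sum_\phi m_\phi(\tau^r) \frac{\chi^\phi(\alpha)}{\dim(\phi)} \int \chi^\tau(g)^{4-r} \chi^\phi(g) \, dg \\
&= \sum_{r=0}^{4} (-1)^r \binom{4}{r} \sum_\phi m_\phi(\tau^r) m_\phi(\tau^{4-r}) \frac{\chi^\phi(\alpha)}{\dim(\phi)},
\end{align*}
where
\[ \int \chi^\tau(g)^{4-r} \chi^\phi(g) \, dg = m_\phi(\tau^{4-r}) \]
by the decomposition of $\chi^\tau(g)^{4-r}$ in terms of irreducible characters, and applying the orthogonality relation they satisfy. When $\alpha$ is the identity element of $G$ then $W' = W$ and $\chi^\phi(\alpha) = \dim(\phi)$, so
\[ 0 = \sum_{r=0}^{4} (-1)^r \binom{4}{r} \sum_\phi m_\phi(\tau^r) m_\phi(\tau^{4-r}), \]
so we may write, for all $\alpha$,
\[ E(W' - W)^4 = -\sum_{r=0}^{4} (-1)^r \binom{4}{r} \sum_\phi m_\phi(\tau^r) m_\phi(\tau^{4-r}) \left( 1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)} \right). \]
The $r = 0, 4$ terms contribute zero, since the only $\phi$ which might contribute to the sum is the trivial representation, for which the last term vanishes. For $r = 2$ the contribution is
\[ -6 \sum_\phi m_\phi(\tau^2)^2 \left( 1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)} \right). \]
For both the $r = 1$ and $r = 3$ terms, the only nonzero summand is the one where $\phi = \tau$, with $m_\tau(\tau) = 1$, and hence these contributions sum to
\[ 8 \left( 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} \right) m_\tau(\tau^3). \tag{14.10} \]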
Taking the inner product of the character of $\tau^3$ with that of $\tau$ to find $m_\tau(\tau^3)$, and then using the decomposition of $\chi^\tau(g)^2$ in terms of irreducible characters, we have
\[ m_\tau(\tau^3) = \int \chi^\tau(g)^4 \, dg = \int \Bigl( \sum_\phi m_\phi(\tau^2) \chi^\phi(g) \Bigr)^2 dg = \sum_\phi m_\phi(\tau^2)^2. \]
Now substitution of this expression into (14.10) and the collection of terms yields the result.

We now state a normal approximation theorem for general compact Lie groups with real valued characters.

Theorem 14.1 Let $G$ be a compact Lie group and let $\tau$ be a non-trivial irreducible representation of $G$ with real valued character $\chi^\tau$. Let $W = \chi^\tau(g)$ where $g$ is chosen from the Haar measure of $G$. Then, if $\alpha$ is any non-identity element of $G$ with the property that $\alpha$ and $\alpha^{-1}$ are conjugate,
\begin{align*}
\sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr|
&\le \frac{1}{2} \left[ \sum_\phi{}^{*}\, m_\phi(\tau^2)^2 \left( 2 - \frac{1}{\lambda} \left( 1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)} \right) \right)^2 \right]^{1/2} \\
&\quad + \left( \frac{1}{\pi} \right)^{1/4} \left[ \sum_\phi m_\phi(\tau^2)^2 \left( 8 - \frac{6}{\lambda} \left( 1 - \frac{\chi^\phi(\alpha)}{\dim(\phi)} \right) \right) \right]^{1/4},
\end{align*}
where $\lambda = 1 - \chi^\tau(\alpha)/\dim(\tau)$, the first sum is over all non-trivial irreducible representations of $G$, and the second sum over all irreducible representations of $G$.

Proof Let $(W, W')$ be given by (14.4) for $g$ and an element chosen uniformly from the conjugacy class containing $\alpha$. By Lemma 14.2 we may invoke Theorem 5.5 with the given $\lambda$ and $R = 0$. The first term in the bound of Theorem 5.5 is handled using (5.4), (14.3) and Lemma 14.3, and the last term by (14.8) of Lemma 14.2, Lemma 14.4 and
\[ E|W' - W|^3 \le \sqrt{E(W' - W)^2 \, E(W' - W)^4}. \]
We apply Theorem 14.1 to characters of individual Lie groups with τ their defining representation. In order to apply the bounds, for each example we need information about the decomposition of τ 2 in terms of irreducible characters. We include details for the calculation of the bound for the character of O(2n, R), and omit the similar steps for the remaining examples. For additional details see Fulman (2009).
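Lemma 14.1, on which the moment computations above rest, can be sanity-checked by brute force in a finite setting. The sketch below is an illustration only, and it makes a substitution that should be stated loudly: it replaces the compact Lie group and its Haar measure by the finite group $S_3$ with the uniform measure, and uses the two dimensional irreducible character $\chi(\sigma) = \#\{\text{fixed points of } \sigma\} - 1$. The helper names are hypothetical.

```python
from itertools import permutations

G = list(permutations(range(3)))  # the symmetric group S_3 as 0-indexed tuples


def compose(s, p):
    """The permutation j -> s[p[j]]."""
    return tuple(s[p[j]] for j in range(len(p)))


def inverse(p):
    inv = [0] * len(p)
    for j, v in enumerate(p):
        inv[v] = j
    return tuple(inv)


def chi(p):
    """Character of the 2-dimensional irreducible representation of S_3."""
    return sum(1 for j, v in enumerate(p) if j == v) - 1


DIM = chi(tuple(range(3)))  # character at the identity, equal to 2


def averaged(alpha, g):
    """Uniform average over h of chi(h alpha h^{-1} g), the finite analog of the Haar integral."""
    total = sum(chi(compose(compose(compose(h, alpha), inverse(h)), g)) for h in G)
    return total / len(G)


# Largest deviation from the right hand side chi(alpha) chi(g) / dim over all pairs.
max_gap = max(abs(averaged(a, g) - chi(a) * chi(g) / DIM) for a in G for g in G)
print(max_gap)
```

Taking $\alpha$ to be a transposition gives $\chi(\alpha) = 0$, so in this toy case $\lambda = 1$ in (14.7) and $E(W' \mid g) = 0$ for every $g$.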
14.1.1 O(2n, R)

Let $\tau$ be the $2n$-dimensional defining representation and let $x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}$ be the eigenvalues of an element of $O(2n, \mathbb{R})$. The following lemma is the $k = 2$ case of a result of Proctor (1990).

Lemma 14.5 For $n \ge 2$, the square of the defining representation of $O(2n, \mathbb{R})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + \overline{x_i}) \bigr)^2 - \frac{1}{2} \sum_i (x_i^2 + \overline{x_i}^2)$.
(iii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + \overline{x_i}) \bigr)^2 + \frac{1}{2} \sum_i (x_i^2 + \overline{x_i}^2) - 1$.

Armed with Lemma 14.5 we may now prove the following result.

Theorem 14.2 Let $g$ be chosen from the Haar measure of $O(2n, \mathbb{R})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{1}{\sqrt{2}(n - 1)}. \]

Proof We apply Theorem 14.1 with $\tau$ the defining representation and $\alpha$ a rotation by some angle $\theta$, that is, an element conjugate to the diagonal matrix with entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}\}$ where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and
\[ \lambda = 1 - \frac{\chi^\tau(\alpha)}{\dim(\tau)} = 1 - \frac{2(n-1) + 2\cos(\theta)}{2n} = \frac{1 - \cos(\theta)}{n}. \]
To calculate the first error term in Theorem 14.1, we apply Lemma 14.5 to write the decomposition of $\tau^2$ into non-trivial irreducibles. With $\phi_1$ the non-trivial irreducible character given in (ii),
\[ \chi^{\phi_1}(\alpha) = \frac{1}{2} \bigl( 2(n-1) + 2\cos(\theta) \bigr)^2 - \bigl( n - 1 + \cos(2\theta) \bigr) \tag{14.11} \]
with $\dim(\phi_1) = (2n)^2/2 - 2n/2 = 2n^2 - n$, and with $\phi_2$ as in (iii)
\[ \chi^{\phi_2}(\alpha) = \frac{1}{2} \bigl( 2(n-1) + 2\cos(\theta) \bigr)^2 + \bigl( n - 1 + \cos(2\theta) \bigr) - 1, \tag{14.12} \]
with $\dim(\phi_2) = (2n)^2/2 + 2n/2 - 1 = 2n^2 + n - 1$. Hence, substituting using (14.11) and (14.12), we find the first error term is
\begin{align*}
&\frac{1}{2} \left[ \left( 2 - \frac{1}{\lambda} \left( 1 - \frac{\chi^{\phi_1}(\alpha)}{2n^2 - n} \right) \right)^2 + \left( 2 - \frac{1}{\lambda} \left( 1 - \frac{\chi^{\phi_2}(\alpha)}{2n^2 + n - 1} \right) \right)^2 \right]^{1/2} \\
&\qquad = \frac{1}{2} \, \frac{\sqrt{8 \bigl( n^2 \cos(2\theta) - 2n(n-1)\cos(\theta) + 2n^2 + 1 \bigr)}}{(n+1)(2n-1)}.
\end{align*}
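Similarly, the second term in the error bound equals
\begin{align*}
&\left( \frac{1}{\pi} \right)^{1/4} \left[ 8 + \left( 8 - \frac{6}{\lambda} \left( 1 - \frac{\chi^{\phi_1}(\alpha)}{2n^2 - n} \right) \right) + \left( 8 - \frac{6}{\lambda} \left( 1 - \frac{\chi^{\phi_2}(\alpha)}{2n^2 + n - 1} \right) \right) \right]^{1/4} \\
&\qquad = \left( \frac{24n(1 - \cos\theta)}{\pi(n+1)(2n-1)} \right)^{1/4}.
\end{align*}
Since the bound given by the sum of these two terms holds for all $\theta$, and is continuous in $\theta$, the bound holds in the limit as $\theta \to 0$. The limiting value of the first term is $\sqrt{2}/(2n - 1) < 1/(\sqrt{2}(n - 1))$, while the second term vanishes in the limit, thus completing the argument.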
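The two closed forms in this proof can be checked numerically against the general expressions of Theorem 14.1. The sketch below evaluates both error terms directly from $\lambda$, (14.11) and (14.12), using the multiplicity free decomposition of Lemma 14.5 (all $m_\phi(\tau^2) = 1$), and compares them with the simplified formulas; the function names and the test values of $n$ and $\theta$ are illustrative assumptions.

```python
import math


def o2n_terms(n, theta):
    """Both error terms of Theorem 14.1 for O(2n, R), computed from (14.11) and (14.12)."""
    c, c2 = math.cos(theta), math.cos(2 * theta)
    s = 2 * (n - 1) + 2 * c                    # chi^tau(alpha)
    lam = 1 - s / (2 * n)                      # lambda = (1 - cos(theta)) / n
    chi1, d1 = 0.5 * s * s - (n - 1 + c2), 2 * n * n - n
    chi2, d2 = 0.5 * s * s + (n - 1 + c2) - 1, 2 * n * n + n - 1
    a1 = (1 - chi1 / d1) / lam
    a2 = (1 - chi2 / d2) / lam
    first = 0.5 * math.sqrt((2 - a1) ** 2 + (2 - a2) ** 2)
    # The trivial representation contributes 8 to the second sum.
    total = 8 + (8 - 6 * a1) + (8 - 6 * a2)
    second = (max(total, 0.0) / math.pi) ** 0.25
    return first, second


def o2n_closed_forms(n, theta):
    """The simplified expressions stated in the proof of Theorem 14.2."""
    c, c2 = math.cos(theta), math.cos(2 * theta)
    numerator = 8 * (n * n * c2 - 2 * n * (n - 1) * c + 2 * n * n + 1)
    first = 0.5 * math.sqrt(numerator) / ((n + 1) * (2 * n - 1))
    second = (24 * n * (1 - c) / (math.pi * (n + 1) * (2 * n - 1))) ** 0.25
    return first, second


direct = o2n_terms(3, 1.0)
closed = o2n_closed_forms(3, 1.0)
limit_first = o2n_closed_forms(5, 1e-4)[0]   # should be close to sqrt(2) / (2n - 1)
print(direct, closed, limit_first, math.sqrt(2) / 9)
```

For small $\theta$ the direct evaluation divides two quantities that both tend to zero, so the closed forms are the numerically safer choice when approaching the limit $\theta \to 0$.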
14.1.2 SO(2n + 1, R)

We follow the same lines of argument as in Sect. 14.1.1, with corresponding notation. Let $\tau$ be the $2n + 1$ dimensional defining representation of $SO(2n + 1, \mathbb{R})$ and $W = \chi^\tau(g)$ with $g$ chosen according to Haar measure. The following result, following from Sundaram (1990), gives the decomposition of $\tau^2$ into irreducibles.

Lemma 14.6 For $n \ge 2$, the square of the defining representation of $SO(2n + 1, \mathbb{R})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + x_i^{-1}) \bigr)^2 + \frac{1}{2} \sum_i (x_i^2 + x_i^{-2}) + \sum_i (x_i + x_i^{-1})$.
(iii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + x_i^{-1}) \bigr)^2 - \frac{1}{2} \sum_i (x_i^2 + x_i^{-2}) + \sum_i (x_i + x_i^{-1})$.

With the help of Lemma 14.6, we have the following result.

Theorem 14.3 Let $g$ be chosen from the Haar measure of $SO(2n + 1, \mathbb{R})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{1}{\sqrt{2}\, n}. \]

Proof Let $\alpha$ be a rotation by some angle $\theta$, that is, an element conjugate to a diagonal matrix with entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}, 1\}$ along the diagonal, where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and $\lambda = 2(1 - \cos(\theta))/(2n + 1)$. Using Lemma 14.6 we decompose $\tau^2$ into irreducibles. Taking the limit as $\theta \to 0$ of the first error term of Theorem 14.1 yields $1/(\sqrt{2}\, n)$. Taking the limit of the second term gives 0, as in the proof of Theorem 14.2.
14.1.3 USp(2n, C)

Let $\tau$ be the $2n$ dimensional defining representation of $USp(2n, \mathbb{C})$ and $W = \chi^\tau(g)$ with $g$ chosen according to Haar measure. The following result, following from Sundaram (1990), gives the decomposition of $\tau^2$ into irreducibles.

Lemma 14.7 For $n \ge 2$, the square of the defining representation of $USp(2n, \mathbb{C})$ decomposes in a multiplicity free way as the sum of the following three irreducible representations:
(i) The trivial representation, with character 1.
(ii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + x_i^{-1}) \bigr)^2 + \frac{1}{2} \sum_i (x_i^2 + x_i^{-2})$.
(iii) The representation with character $\frac{1}{2} \bigl( \sum_i (x_i + x_i^{-1}) \bigr)^2 - \frac{1}{2} \sum_i (x_i^2 + x_i^{-2}) - 1$.

Using Lemma 14.7, we are able to prove the following result.

Theorem 14.4 Let $g$ be chosen from the Haar measure of $USp(2n, \mathbb{C})$ with $n \ge 2$, and let $W$ be the trace of $g$. Then
\[ \sup_{z \in \mathbb{R}} \bigl| P(W \le z) - P(Z \le z) \bigr| \le \frac{1}{\sqrt{2}\, n}. \]

Proof Let $\alpha$ be an element conjugate to the diagonal matrix with diagonal entries $\{x_1, x_1^{-1}, \ldots, x_n, x_n^{-1}\}$ where $x_1 = \cdots = x_{n-1} = 1$ and $x_n = e^{i\theta}$. Then $\alpha$ is conjugate to $\alpha^{-1}$ and $\lambda = 2(1 - \cos(\theta))/(2n) = (1 - \cos(\theta))/n$. Using Lemma 14.7 we decompose $\tau^2$ into irreducibles, and obtain the limit of $\sqrt{2}/(2n + 1) < 1/(\sqrt{2}\, n)$ for the first term, as $\theta \to 0$, and a limit of zero for the second term, as in the proof of Theorem 14.2.
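The three decompositions in Lemmas 14.5, 14.6 and 14.7 share a simple consistency check: the trivial character plus the characters in (ii) and (iii) must add up to the square of the character of the defining representation, and evaluating at the identity must reproduce the dimensions. The sketch below verifies both numerically. It assumes eigenvalues of the form $x_j = e^{i\theta_j}$, so that $x_j + x_j^{-1} = 2\cos\theta_j$; the function name `characters`, the group labels, and the sample angles are illustrative.

```python
import math


def characters(group, angles):
    """Return (chi_tau, chi_ii, chi_iii) for eigenvalue angles theta_1, ..., theta_n."""
    s = sum(2 * math.cos(t) for t in angles)        # sum_i (x_i + x_i^{-1})
    p = sum(2 * math.cos(2 * t) for t in angles)    # sum_i (x_i^2 + x_i^{-2})
    if group == "O(2n)":
        return s, 0.5 * s * s - 0.5 * p, 0.5 * s * s + 0.5 * p - 1
    if group == "SO(2n+1)":
        return 1 + s, 0.5 * s * s + 0.5 * p + s, 0.5 * s * s - 0.5 * p + s
    if group == "USp(2n)":
        return s, 0.5 * s * s + 0.5 * p, 0.5 * s * s - 0.5 * p - 1
    raise ValueError(group)


groups = ["O(2n)", "SO(2n+1)", "USp(2n)"]
angles = [0.3, 1.1, 2.0, 2.9]  # arbitrary sample with n = 4

# Residual of chi_tau^2 = 1 + chi_ii + chi_iii for each group.
residuals = []
for name in groups:
    tau, ii, iii = characters(name, angles)
    residuals.append(abs(tau * tau - (1 + ii + iii)))

# Dimensions are the character values at the identity (all angles zero), here n = 4.
dims = {name: characters(name, [0.0] * 4) for name in groups}
print(residuals, dims)
```

For $n = 4$ the identity values for $O(2n)$ should be $2n = 8$, $2n^2 - n = 28$ and $2n^2 + n - 1 = 35$, matching the dimensions quoted in the proof of Theorem 14.2.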
14.2 Stein’s Method and Malliavin Calculus

In some real sense, the most fundamental identity underlying Stein’s method for the normal, that for all absolutely continuous functions $f$ such that the expectations below exist,
\[ E[Z f(Z)] = E[f'(Z)] \quad \text{if and only if } Z \text{ is standard normal}, \tag{14.13} \]
can be seen as really nothing more than integration by parts. Proceeding along these same lines in more general spaces, the fact that (14.13) is a consequence of the Malliavin calculus integration by parts formula has some profound consequences. In this section we scratch the surface of the deep connection that exists between Stein’s method and the Malliavin calculus, as unveiled in Nourdin and Peccati (2009). Working exclusively below with one dimensional Brownian motion, we do not attempt to cover the generality of Nourdin and Peccati (2009), whose
framework includes Gaussian fields in higher dimensions, fractional Brownian motion, and parallel results for the Gamma distribution. Neither is our presentation one which approaches a treatment with complete technical details. We refer the reader to Nourdin and Peccati (2009), the lecture notes of Peccati (2009), and the text Nourdin and Peccati (2011) for full coverage of the material here, and to the standard reference, Nualart (2006), for the needed elements of the Malliavin calculus.

Let $B(t)$ be a standard Brownian motion on $[0, 1]$ on a probability space $(\Omega, \mathcal{F}, P)$, and let $L^2(B)$ be the Hilbert space of square integrable functionals of $B$, endowed with inner product $\langle X, Y \rangle_{L^2(B)} = EXY$. Starting with the definition $\int \psi \, dB = \int_a^b dB_t = B(b) - B(a)$ of the stochastic integral for the indicator $\psi = \mathbf{1}_{(a,b]}$ of an interval in $[0, 1]$, we may extend to all $\psi \in L^2(\lambda)$, the collection of square integrable functions on $[0, 1]$ with respect to Lebesgue measure $\lambda$, by taking $L^2(B)$ limits to obtain $\int \psi \, dB$; this integral will be denoted by $I(\psi)$. The associated map $\psi \to I(\psi)$ from $L^2(\lambda) \to L^2(B)$ satisfies the isometry property
\[ \langle \psi, \phi \rangle_{L^2(\lambda)} = \langle I(\psi), I(\phi) \rangle_{L^2(B)}. \tag{14.14} \]
In addition, $I(\psi)$ is a mean zero normal random variable, which by (14.14) has variance $\|\psi\|^2_{L^2(\lambda)}$. In particular, the collection $\{I(\psi): \psi \in L^2(\lambda)\}$ is a real valued Gaussian process. For $\psi = \mathbf{1}_A$, the indicator of the measurable set $A$ in $[0, 1]$, we also write $B(A) = I(\mathbf{1}_A)$, and may think of $B(A)$ as a measure on $[0, 1]$ with values in the space of random variables. For $A = (0, t]$ the definition recovers the Brownian motion through $B((0, t]) = B(t)$.

To consider higher order integrals, for $m \ge 1$ and $L^2(\lambda^m)$, the collection of square integrable functions on $[0, 1]^m$ with respect to $m$ dimensional Lebesgue measure $\lambda^m$, the higher order stochastic integrals $I_m(\psi)$ are defined as follows. Consider first elementary functions of the form
\[ \psi(t_1, \ldots, t_m) = \sum_{i_1, \ldots, i_m = 1}^{n} a_{i_1, \ldots, i_m} \mathbf{1}(A_{i_1} \times \cdots \times A_{i_m})(t_1, \ldots, t_m) \]
where $A_1, \ldots, A_n$ are disjoint measurable sets, and the coefficients $a_{i_1, \ldots, i_m}$ are zero if any two of the indices $i_1, \ldots, i_m$ are equal. For a function $\psi$ of this form, define
\[ I_m(\psi) = \sum_{i_1, \ldots, i_m = 1}^{n} a_{i_1, \ldots, i_m} B(A_{i_1}) \cdots B(A_{i_m}), \]
and extend to $L^2(\lambda^m)$ by taking $L^2(B)$ limits. It is clear that $I(\psi)$ and $I_1(\psi)$ agree. For a function $\psi: [0, 1]^m \to \mathbb{R}$, define the symmetrization of $\psi$ by
\[ \widetilde{\psi}(t_1, \ldots, t_m) = \frac{1}{m!} \sum_\sigma \psi(t_{\sigma(1)}, \ldots, t_{\sigma(m)}), \tag{14.15} \]
where the sum is over all permutations $\sigma$ of $\{1, \ldots, m\}$. Letting $L^2_s(\lambda^m)$ be the closed subspace of $L^2(\lambda^m)$ of symmetric, square integrable functions on $[0, 1]^m$
∈ L2s (λm ) by with respect to Lebesgue measure, we see that ψ ∈ L2 (λm ) implies ψ the triangle inequality, L2 (λ) ≤ ψL2 (λ) . ψ One can verify that the stochastic integrals Im (·) have the following properties: ) for all ψ ∈ L2 (λm ). (i) EIm (ψ) = 0 and Im (ψ) = Im (ψ 2 p 2 (ii) For all ψ ∈ L (λ ) and φ ∈ L (λq ), 0 p = q, E Ip (ψ)Iq (φ) = p!ψ , φ L2 (λp ) p = q.
(14.16)
(iii) The mapping $\psi \to I_m(\psi)$ from $L^2(\lambda^m)$ to $L^2(B)$ is linear.

The goal of this section, achieved in Theorem 14.6, is to obtain a bound in the total variation distance to the normal for integrals $I_q(\psi)$ for $q \ge 2$. For this purpose we require the multiplication formula, which expresses the product of the stochastic integrals of $\psi \in L^2(\lambda^p)$ and $\phi \in L^2(\lambda^q)$ in terms of sums of integrals of contractions of $\psi$ and $\phi$,
\[
I_p(\psi)I_q(\phi) = \sum_{r=0}^{p\wedge q} r!\binom{p}{r}\binom{q}{r} I_{p+q-2r}(\psi\otimes_r\phi), \tag{14.17}
\]
where, for $r = 1,\ldots,p\wedge q$ and $(t_1,\ldots,t_{p-r},s_1,\ldots,s_{q-r}) \in [0,1]^{p+q-2r}$, the contraction $\otimes_r$ is given by
\[
(\psi\otimes_r\phi)(t_1,\ldots,t_{p-r},s_1,\ldots,s_{q-r}) = \int_{[0,1]^r}\psi(z_1,\ldots,z_r,t_1,\ldots,t_{p-r})\,\phi(z_1,\ldots,z_r,s_1,\ldots,s_{q-r})\,\lambda^r(dz_1,\ldots,dz_r),
\]
and for $r = 0$, denoting $\otimes_0$ also by $\otimes$, by
\[
(\psi\otimes\phi)(t_1,\ldots,t_p,s_1,\ldots,s_q) = \psi(t_1,\ldots,t_p)\,\phi(s_1,\ldots,s_q).
\]
Even when $\psi$ and $\phi$ are symmetric, $\psi\otimes_r\phi$ may not be, and we let $\psi\,\widetilde{\otimes}_r\,\phi$ be the symmetrization of $\psi\otimes_r\phi$ as given in (14.15).

Using the multiple stochastic integrals $I_q$, $q \in \mathbb{N}$, any $F \in L^2(B)$, that is, any square integrable function of the Brownian motion $B$, can be represented by the following Wiener chaos decomposition. For any such $F$, there exists a unique sequence $\{\psi_q : q \ge 1\}$ with $\psi_q \in L_s^2(\lambda^q)$ such that
\[
F = \sum_{q=0}^{\infty} I_q(\psi_q) \tag{14.18}
\]
where $I_0(\psi_0) = EF$, and the series converges in $L^2$. When all terms but one in the sum vanish so that $F = I_q(\psi_q)$ for some $q$, we say $F$ belongs to the $q$th Wiener chaos of $B$. Applying the orthogonality relation (14.16) to the symmetric 'kernels' $\psi_q$ for $F$ of the form (14.18),
\[
\|F\|^2_{L^2(B)} = \sum_{q=0}^{\infty} q!\,\|\psi_q\|^2_{L^2(\lambda^q)}. \tag{14.19}
\]
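The multiplication formula (14.17) and the orthogonality relation (14.16) can be checked by hand in one special case. The sketch below relies on a standard fact that is not stated in the text above and is therefore an assumption here: for $\psi \in L^2(\lambda)$ with $\|\psi\|_{L^2(\lambda)} = 1$, one has $I_q(\psi^{\otimes q}) = H_q(I(\psi))$, where $H_q$ is the $q$th probabilists' Hermite polynomial. Under that assumption every contraction $\psi^{\otimes p}\otimes_r\psi^{\otimes q}$ equals $\psi^{\otimes(p+q-2r)}$, so (14.17) reduces to a polynomial identity for $H_pH_q$, and (14.16) reduces to $E[H_p(Z)H_q(Z)] = p!\,\mathbf{1}(p = q)$. The helper names are hypothetical.

```python
from math import comb, factorial, prod


def hermite(n):
    """Coefficients (lowest degree first) of the probabilists' Hermite polynomial H_n."""
    prev, cur = [1], [0, 1]  # H_0 = 1, H_1 = x
    if n == 0:
        return prev
    for k in range(1, n):
        # Three-term recurrence: H_{k+1}(x) = x * H_k(x) - k * H_{k-1}(x)
        nxt = [0] + cur
        for i, c in enumerate(prev):
            nxt[i] -= k * c
        prev, cur = cur, nxt
    return cur


def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_add(a, b):
    """Sum of two polynomials, padding the shorter list with zeros."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def product_via_formula(p, q):
    """Right-hand side of (14.17) in the assumed Hermite special case."""
    total = [0]
    for r in range(min(p, q) + 1):
        coeff = factorial(r) * comb(p, r) * comb(q, r)
        total = poly_add(total, [coeff * c for c in hermite(p + q - 2 * r)])
    return total


def gaussian_moment(k):
    """E Z^k for a standard normal Z: zero for odd k, (k-1)!! for even k."""
    return 0 if k % 2 else prod(range(1, k, 2))


def gaussian_mean(poly):
    """E[poly(Z)] for a standard normal Z."""
    return sum(c * gaussian_moment(k) for k, c in enumerate(poly))


if __name__ == "__main__":
    for p in range(5):
        for q in range(5):
            lhs = poly_mul(hermite(p), hermite(q))
            print(p, q, lhs == product_via_formula(p, q), gaussian_mean(lhs))
```

The last printed column is the Gaussian mean of $H_pH_q$, which can be compared with $p!$ on the diagonal and zero off it, the pattern that (14.16) and (14.19) predict for unit-norm kernels.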
We now briefly describe two of the basic operators of the Malliavin calculus: the Malliavin derivative $D$, and the Ornstein–Uhlenbeck generator $L$. Beginning with $D$, for $g: \mathbb{R}^n \to \mathbb{R}$ a smooth function with compact support, consider a random variable of the form
\[
F = g\bigl(I(\psi_1),\ldots,I(\psi_n)\bigr) \tag{14.20}
\]
with $\psi_1,\ldots,\psi_n \in L^2(\lambda)$. For such an $F$, the Malliavin derivative is defined as
\[
DF = \sum_{i=1}^{n}\frac{\partial}{\partial x_i}g\bigl(I(\psi_1),\ldots,I(\psi_n)\bigr)\,\psi_i.
\]
Note in particular
\[
DI(\psi) = \psi \quad\text{for every } \psi \in L^2(\lambda). \tag{14.21}
\]
In general then, the $m$th derivative $D^mF$, given by
\[
D^mF = \sum_{i_1,\ldots,i_m=1}^{n}\frac{\partial^m}{\partial x_{i_1}\cdots\partial x_{i_m}}g\bigl(I(\psi_1),\ldots,I(\psi_n)\bigr)\,\psi_{i_1}\otimes\cdots\otimes\psi_{i_m},
\]
maps random variables $F$ to the Hilbert space $L^2(B, L^2(\lambda^m))$ of $L^2(\lambda^m)$ valued functionals of $B$, endowed with the inner product $\langle u, v\rangle_{L^2(B,L^2(\lambda^m))} = E\langle u, v\rangle_{L^2(\lambda^m)}$. Letting $\mathcal{S}$ denote the set of random variables of the form (14.20), for every $m \ge 1$ the domain of $D^m$ may be extended to $\mathbb{D}^{m,2}$, the closure of $\mathcal{S}$ with respect to the norm $\|\cdot\|_{m,2}$ given by
\[
\|F\|^2_{m,2} = EF^2 + \sum_{i=1}^{m}E\|D^iF\|^2_{L^2(\lambda^i)}.
\]
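A random variable $F \in L^2(B)$ having chaotic expansion (14.18) is an element of $\mathbb{D}^{m,2}$ if and only if the kernels $\psi_q$, $q = 1, 2, \ldots$ satisfy
\[
\sum_{q=1}^{\infty} q^m\,q!\,\|\psi_q\|^2_{L^2(\lambda^q)} < \infty,
\]
in which case
\[
E\|D^mF\|^2_{L^2(\lambda^m)} = \sum_{q=m}^{\infty}(q)_m\,q!\,\|\psi_q\|^2_{L^2(\lambda^q)}, \tag{14.22}
\]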
where $(q)_m$ is the falling factorial. In particular, any $F$ having a finite Wiener chaos expansion is an element of $\mathbb{D}^{m,2}$ for all $m \ge 1$.

The Malliavin derivative obeys a chain rule. If $g: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function with bounded derivative, and $F_i \in \mathbb{D}^{1,2}$ for $i = 1,\ldots,n$, then $g(F_1,\ldots,F_n) \in \mathbb{D}^{1,2}$ and
\[
Dg(F_1,\ldots,F_n) = \sum_{i=1}^{n}\frac{\partial}{\partial x_i}g(F_1,\ldots,F_n)\,DF_i. \tag{14.23}
\]
As we are considering Brownian motion on $[0,1]$, and the indexed family $\{I(\psi): \psi \in L^2(\lambda)\}$ with $\lambda$ non-atomic, the derivatives of random variables $F$ of the form (14.18) can be identified with the element of $L^2([0,1]\times\Omega)$ given by
\[
D_tF = \sum_{q=1}^{\infty} q\,I_{q-1}\bigl(\psi_q(\cdot,t)\bigr), \quad t \in [0,1]. \tag{14.24}
\]
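As a small worked illustration (not part of the original text), take $F = B(1)^2$. The kernels below follow from (14.17) with $p = q = 1$, reading $I_0$ of a constant as that constant; the symbol $\psi = \mathbf{1}_{(0,1]}$ is introduced here only for the example.

```latex
% Worked example (illustrative): chaos expansion and derivative of F = B(1)^2.
% Assumption: \psi = 1_{(0,1]}, so that I(\psi) = B(1) and \|\psi\|_{L^2(\lambda)} = 1.
\begin{align*}
B(1)^2 &= I_1(\psi)\,I_1(\psi)
        = I_2(\psi\otimes\psi) + \langle\psi,\psi\rangle_{L^2(\lambda)}
        = I_2(\psi\otimes\psi) + 1
        && \text{by (14.17)},\\
\psi_0 &= 1, \qquad \psi_2 = \psi\otimes\psi = \mathbf{1}_{(0,1]^2}, \qquad \psi_q = 0 \text{ otherwise}
        && \text{in (14.18)},\\
E B(1)^4 &= 0!\,\psi_0^2 + 2!\,\|\psi_2\|^2_{L^2(\lambda^2)} = 1 + 2 = 3
        && \text{by (14.19)},\\
D_tF &= 2\,I_1\bigl(\psi_2(\cdot,t)\bigr) = 2\,I_1(\psi) = 2B(1)
        && \text{by (14.24)},\\
E\|DF\|^2_{L^2(\lambda)} &= (2)_1\,2!\,\|\psi_2\|^2_{L^2(\lambda^2)} = 4 = E\bigl[4B(1)^2\bigr]
        && \text{by (14.22) with } m = 1.
\end{align*}
```

The value $D_tF = 2B(1)$ is what the chain-rule form $2B(1)\,DB(1)$ would suggest, although $g(x) = x^2$ does not have a bounded derivative, so (14.23) as stated does not apply directly and the chaos formula (14.24) is the one used here.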
We next introduce the Ornstein–Uhlenbeck generator $L$. For a square integrable random variable $F$ represented as in (14.18), let
\[
LF = \sum_{q=0}^{\infty} -q\,I_q(\psi_q) \tag{14.25}
\]
and, when $EF = 0$, let
\[
L^{-1}F = \sum_{q=1}^{\infty} -\frac{1}{q}\,I_q(\psi_q).
\]
In view of (14.19) and (14.22), we see that the operator $L^{-1}$ takes values in $\mathbb{D}^{2,2}$. As the Malliavin derivative $D$ maps random variables to the Hilbert space $L^2(B, L^2(\lambda))$ endowed with the inner product $E\langle u, v\rangle_{L^2(\lambda)}$, by definition, the adjoint operator $\delta$ satisfies the (integration by parts) identity
\[
E\bigl[F\delta(u)\bigr] = E\langle DF, u\rangle_{L^2(\lambda)}, \tag{14.26}
\]
for every $F \in \mathbb{D}^{1,2}$, when $u$ lies in the domain $\operatorname{dom}(\delta)$ of $\delta$. One of the key consequences of this identity is that for every $F \in \mathbb{D}^{1,2}$ with $EF = 0$,
\[
E\bigl[Ff(F)\bigr] = E\bigl[\langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\,f'(F)\bigr], \tag{14.27}
\]
for all real valued differentiable functions $f$ with bounded derivative; identity (14.27) also holds when $f$ is only a.e. differentiable if $F$ has an absolutely continuous law. The Stein identity (14.13) is the special case where $F = I(\psi)$ for $\psi \in L^2(\lambda)$ with $\|\psi\|_{L^2(\lambda)} = 1$. For then $F$ is standard normal, and by (14.21) and (14.25), respectively, we have
\[
DF = \psi \quad\text{and}\quad L^{-1}F = L^{-1}I(\psi) = -I(\psi) = -F,
\]
so
\[
\langle DF, -DL^{-1}F\rangle_{L^2(\lambda)} = \langle\psi, DF\rangle_{L^2(\lambda)} = \langle\psi,\psi\rangle_{L^2(\lambda)} = \|\psi\|^2_{L^2(\lambda)} = 1. \tag{14.28}
\]
Hence (14.27) implies (14.13).
Though the theory of Nourdin and Peccati (2009) supplies results in the Wasserstein, Fortet–Mourier, and the Kolmogorov distance, recalling (4.1) and (4.3), we confine ourselves to the total variation norm. In this case, by the bounds (2.12), for any random variable $F$,
\[
\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le \sup_{f\in\mathcal{F}_{TV}}\bigl|E\bigl[f'(F) - Ff(F)\bigr]\bigr|, \tag{14.29}
\]
where $\mathcal{F}_{TV}$ is the collection of piecewise continuously differentiable functions that are bounded by $\sqrt{\pi/2}$ and whose derivatives are bounded by 2.

Theorem 14.5 Let $F \in \mathbb{D}^{1,2}$ have mean zero and an absolutely continuous law with respect to Lebesgue measure. Then
\[
\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le 2E\bigl|1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\bigr|.
\]

Proof By (14.27),
\[
\bigl|E\bigl[f'(F) - Ff(F)\bigr]\bigr| = \bigl|E\bigl[f'(F)\bigl(1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\bigr)\bigr]\bigr|,
\]
and the proof is completed by applying (14.29).
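To see how the inner product in Theorem 14.5 can be evaluated in a concrete case, the following worked example (an illustration added here, not taken from the text) treats a normalized chi-square variable. The functions $\psi_i$ and the variables $Z_i$, $F_n$ are assumptions introduced only for this sketch; $F_n$ has mean zero, a finite chaos expansion, and an absolutely continuous law, so the theorem applies.

```latex
% Worked example (illustrative): Theorem 14.5 for a normalized chi-square variable.
% Assumption: \psi_i = \sqrt{n}\,1_{((i-1)/n,\,i/n]}, i = 1,...,n, an orthonormal family in L^2(\lambda).
\begin{align*}
Z_i &= I(\psi_i) = \sqrt{n}\,\bigl(B(i/n) - B((i-1)/n)\bigr), \qquad Z_1,\ldots,Z_n \text{ i.i.d. } N(0,1),\\
F_n &= \frac{1}{\sqrt{2n}}\sum_{i=1}^{n}(Z_i^2 - 1) = I_2(\psi),
      \qquad \psi = \frac{1}{\sqrt{2n}}\sum_{i=1}^{n}\psi_i\otimes\psi_i
      && \text{by (14.17)},\\
D_tF_n &= 2\,I_1\bigl(\psi(\cdot,t)\bigr) = \sqrt{\tfrac{2}{n}}\sum_{i=1}^{n} Z_i\,\psi_i(t)
      && \text{by (14.24)},\\
-L^{-1}F_n &= \tfrac12 F_n
      && \text{since } F_n = I_2(\psi),\\
\langle DF_n, -DL^{-1}F_n\rangle_{L^2(\lambda)} &= \tfrac12\|DF_n\|^2_{L^2(\lambda)} = \frac{1}{n}\sum_{i=1}^{n} Z_i^2,\\
\|\mathcal{L}(F_n) - \mathcal{L}(Z)\|_{TV}
    &\le 2E\Bigl|1 - \frac{1}{n}\sum_{i=1}^{n} Z_i^2\Bigr|
     \le 2\sqrt{\operatorname{Var}\Bigl(\frac{1}{n}\sum_{i=1}^{n} Z_i^2\Bigr)}
     = 2\sqrt{\frac{2}{n}}.
\end{align*}
```

The last inequality is Cauchy–Schwarz together with $\operatorname{Var}(Z_i^2) = 2$, so in this example the bound decays at the rate $n^{-1/2}$.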
To make use of Theorem 14.5 it is necessary to handle the inner product appearing in the bound. We note that (14.28) gives the simplest case, and also shows the upper bound to be tight in the sense that it is zero when $F =_d Z$. Theorem 14.6 gives a much more substantial illustration of a case where computation with the inner product is possible. We now follow Nourdin et al. (2009), which simplifies the calculations in Nourdin and Peccati (2009), as well as generalizes the results from integrals $I_2(\psi)$ to $I_q(\psi)$ for all $q \ge 2$.

Theorem 14.6 Let $F$ belong to the $q$th Wiener chaos of $B$ for some $q \ge 2$. Then
\[
\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} \le 2\bigl|1 - EF^2\bigr| + 2\sqrt{\frac{q-1}{3q}\Bigl(EF^4 - 3\bigl(EF^2\bigr)^2\Bigr)}.
\]
The following proof shows that it is always the case that $EF^4 \ge 3(EF^2)^2$.

Proof Writing $F = I_q(\psi)$ with $\psi \in L_s^2(\lambda^q)$, by (14.19) we obtain
\[
EF^2 = q!\,\|\psi\|^2_{L^2(\lambda^q)}. \tag{14.30}
\]
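Now, applying (14.30) to identify the $r = q$ term in the final equality below, by (14.24) and the multiplication formula (14.17), we have
\begin{align*}
\frac{1}{q}\|DF\|^2_{L^2(\lambda)} &= q\int_0^1 I_{q-1}\bigl(\psi(\cdot,a)\bigr)^2\,\lambda(da)\\
&= q\int_0^1\sum_{r=0}^{q-1} r!\binom{q-1}{r}^2 I_{2q-2-2r}\bigl(\psi(\cdot,a)\,\widetilde{\otimes}_r\,\psi(\cdot,a)\bigr)\,\lambda(da)\\
&= q\sum_{r=0}^{q-1} r!\binom{q-1}{r}^2 I_{2q-2-2r}\bigl(\psi\,\widetilde{\otimes}_{r+1}\,\psi\bigr)\\
&= q\sum_{r=1}^{q}(r-1)!\binom{q-1}{r-1}^2 I_{2q-2r}\bigl(\psi\,\widetilde{\otimes}_r\,\psi\bigr)\\
&= EF^2 + q\sum_{r=1}^{q-1}(r-1)!\binom{q-1}{r-1}^2 I_{2q-2r}\bigl(\psi\,\widetilde{\otimes}_r\,\psi\bigr). \tag{14.31}
\end{align*}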
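Subtracting $EF^2$ from both sides and applying (14.16) yields
\[
E\biggl[\Bigl(\frac{1}{q}\|DF\|^2_{L^2(\lambda)} - EF^2\Bigr)^2\biggr] = q^2\sum_{r=1}^{q-1}\bigl((r-1)!\bigr)^2\binom{q-1}{r-1}^4(2q-2r)!\,\bigl\|\psi\,\widetilde{\otimes}_r\,\psi\bigr\|^2_{L^2(\lambda^{2q-2r})}. \tag{14.32}
\]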
Next, again by (14.17),
\[
F^2 = \sum_{r=0}^{q} r!\binom{q}{r}^2 I_{2q-2r}\bigl(\psi\,\widetilde{\otimes}_r\,\psi\bigr). \tag{14.33}
\]
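Applying (14.27) and (14.25) for the second equality below, and (14.31), (14.33) and (14.16) for the third, we obtain
\begin{align*}
EF^4 = E\bigl[F\times F^3\bigr] &= 3E\Bigl[F^2\times\frac{1}{q}\|DF\|^2_{L^2(\lambda)}\Bigr]\\
&= 3\bigl(EF^2\bigr)^2 + \frac{3}{q}\sum_{r=1}^{q-1} r\,(r!)^2\binom{q}{r}^4(2q-2r)!\,\bigl\|\psi\,\widetilde{\otimes}_r\,\psi\bigr\|^2_{L^2(\lambda^{2q-2r})}. \tag{14.34}
\end{align*}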
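Comparing (14.32) and (14.34) term by term, using $\binom{q-1}{r-1} = \frac{r}{q}\binom{q}{r}$ and $r \le q-1$, leads to
\[
E\biggl[\Bigl(\frac{1}{q}\|DF\|^2_{L^2(\lambda)} - EF^2\Bigr)^2\biggr] \le \frac{q-1}{3q}\Bigl(EF^4 - 3\bigl(EF^2\bigr)^2\Bigr).
\]
Lastly, by Theorem 14.5 and (14.25), the latter giving $-L^{-1}F = F/q$ and hence $\langle DF, -DL^{-1}F\rangle_{L^2(\lambda)} = \frac{1}{q}\|DF\|^2_{L^2(\lambda)}$,
\begin{align*}
\|\mathcal{L}(F) - \mathcal{L}(Z)\|_{TV} &\le 2E\bigl|1 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\bigr|\\
&\le 2\bigl|1 - EF^2\bigr| + 2E\bigl|EF^2 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\bigr|\\
&\le 2\bigl|1 - EF^2\bigr| + 2\sqrt{E\Bigl[\bigl(EF^2 - \langle DF, -DL^{-1}F\rangle_{L^2(\lambda)}\bigr)^2\Bigr]}\\
&\le 2\bigl|1 - EF^2\bigr| + 2\sqrt{\frac{q-1}{3q}\Bigl(EF^4 - 3\bigl(EF^2\bigr)^2\Bigr)}.
\end{align*}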
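The bound of Theorem 14.6 depends on $F$ only through $q$, $EF^2$ and $EF^4$, so it is easy to evaluate once those moments are known. The sketch below is an illustration under stated assumptions, not part of the original development: the function names are hypothetical, and the test case is the normalized chi-square variable $F_n = (2n)^{-1/2}\sum_{i=1}^n (Z_i^2 - 1)$ with $Z_i$ i.i.d. standard normal, which lies in the second Wiener chaos when the $Z_i$ are taken to be normalized Brownian increments. Its moments are computed from $E(Z^2-1)^2 = 2$ and $E(Z^2-1)^4 = 60$.

```python
from math import sqrt


def fourth_moment_bound(q, m2, m4):
    """Right-hand side of Theorem 14.6 for F in the q-th chaos, with m2 = EF^2 and m4 = EF^4."""
    if q < 2:
        raise ValueError("Theorem 14.6 is stated for q >= 2")
    # The proof shows EF^4 >= 3 (EF^2)^2; clip at zero only to absorb rounding error.
    excess = max(m4 - 3.0 * m2 ** 2, 0.0)
    return 2.0 * abs(1.0 - m2) + 2.0 * sqrt((q - 1) / (3.0 * q) * excess)


def normalized_chi_square_moments(n):
    """Exact (EF^2, EF^4) for F_n = (2n)^(-1/2) * sum_{i<=n} (Z_i^2 - 1), Z_i i.i.d. N(0,1)."""
    var_x, fourth_x = 2.0, 60.0  # E X^2 and E X^4 for X = Z^2 - 1
    # For a sum S of n i.i.d. mean zero terms: E S^2 = n E X^2 and
    # E S^4 = n E X^4 + 3 n (n - 1) (E X^2)^2.
    s2 = n * var_x
    s4 = n * fourth_x + 3.0 * n * (n - 1) * var_x ** 2
    scale = 2.0 * n  # F_n = S / sqrt(2n)
    return s2 / scale, s4 / scale ** 2


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        m2, m4 = normalized_chi_square_moments(n)
        # Here EF^2 = 1 and EF^4 = 3 + 12/n, so the bound should agree with 2 * sqrt(2 / n).
        print(n, fourth_moment_bound(2, m2, m4), 2.0 * sqrt(2.0 / n))
```

For small $n$ the value exceeds the trivial range of a distance between probability measures and carries no information; the point of the example is the $n^{-1/2}$ decay of the bound as the fourth moment approaches 3.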
We note that, as one consequence of Theorem 14.6, Stein’s method provides a streamlined proof of the Nualart–Peccati criterion, that is, if $F_n = I_q(\psi_n)$ for some $q \ge 2$, such that $E[F_n^2] \to \sigma^2 > 0$ as $n \to \infty$, then the following are equivalent:

(i) $\|\mathcal{L}(F_n) - \mathcal{L}(\sigma Z)\|_{TV} \to 0$.
(ii) $F_n \to_d \sigma Z$.
(iii) $EF_n^4 \to 3\sigma^4$.

Though in this section we have considered the case of Brownian motion on $[0,1]$, with corresponding Gaussian process $\{I(\psi): \psi \in L^2(\lambda)\}$, much here carries over with no essential changes when considering Gaussian processes $\{X(\psi): \psi \in \mathfrak{H}\}$ indexed by more general Hilbert spaces $\mathfrak{H}$; see Nourdin and Peccati (2009) and Nourdin and Peccati (2011) for details.
Appendix
Notation

$\mathbf{1}$    indicator function
$=_d$    equality in distribution
$\to_d$    convergence in distribution
$\to_p$    convergence in probability
$\mathbb{Z}$    $\{\ldots,-1,0,1,\ldots\}$
$\mathbb{N}$    $\{1,2,\ldots\}$
$\mathbb{N}_0$    $\{0,1,\ldots\}$
$\mathbb{R}$    $(-\infty,\infty)$
$\mathbb{R}^+$    $[0,\infty)$
$S_n$    the symmetric group on $n$ symbols
$N(\mu,\sigma^2)$    normal distribution with mean $\mu$ and variance $\sigma^2$
$U[a,b]$    uniform distribution over $[a,b]$
$Z$    standard normal variable
$\Phi(z)$    standard normal distribution function
$Nh$    $Eh(Z)$
$K_i(t)$    K function, (prototype) page 19
$X^*$    zero bias, page 26
$X^s$    size bias, page 31
$X^\square$    square bias, page 34
$\|f\|$    supremum norm of $f$
$\|F - G\|_1$    $L^1$ distance, page 64
$\Gamma(\alpha)$    Gamma function
$\Gamma(\alpha,\beta)$    Gamma distribution
$B(\alpha,\beta)$    Beta distribution
$A^T$    transpose of the matrix $A$
$\operatorname{Tr}(A)$    trace of the matrix $A$
$\mathcal{L}(\cdot)$    law, or distribution, of a random variable
$\|h\|_{L_m^\infty(\mathbb{R})}$    page 136
$\mathcal{H}_{m,\infty}$    page 136
$\mathcal{H}_{m,\infty,p}$    page 313
$\|\mathcal{L}(X) - \mathcal{L}(Y)\|_{\mathcal{H}_{m,\infty,p}}$    page 313

L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4, © Springer-Verlag Berlin Heidelberg 2011
References
Aldous, D. (1989). Applied mathematical sciences: Vol. 77. Probability approximations via the Poisson clumping heuristic. New York: Springer.
Aldous, D., & Fill, J. A. (1994). Reversible Markov chains and random walks on graphs. Monograph in preparation. http://www.stat.berkeley.edu/aldous/RWG/book.html.
Arratia, R., Goldstein, L., & Gordon, L. (1989). Two moments suffice for Poisson approximations: the Chen–Stein method. Annals of Probability, 17, 9–25.
Baldi, P., & Rinott, Y. (1989). Asymptotic normality of some graph-related statistics. Journal of Applied Probability, 26, 171–175.
Baldi, P., Rinott, Y., & Stein, C. (1989). A normal approximation for the number of local maxima of a random function on a graph. In T. W. Anderson, K. B. Athreya, & D. L. Iglehart (Eds.), Probability, statistics and mathematics, papers in honor of Samuel Karlin (pp. 59–81). San Diego: Academic Press.
Barbour, A. D. (1990). Stein’s method for diffusion approximations. Probability Theory and Related Fields, 84, 297–322.
Barbour, A. D., & Chen, L. H. Y. (2005a). In A. D. Barbour & L. H. Y. Chen (Eds.), The permutation distribution of matrix correlation statistics. Stein’s method and applications. Singapore: Singapore University Press.
Barbour, A. D., & Chen, L. H. Y. (2005b). In A. D. Barbour & L. H. Y. Chen (Eds.), An introduction to Stein’s method. Singapore: Singapore University Press.
Barbour, A. D., & Chen, L. H. Y. (2005c). In A. D. Barbour & L. H. Y. Chen (Eds.), Stein’s method and applications. Singapore: Singapore University Press.
Barbour, A. D., & Eagleson, G. (1986). Random association of symmetric arrays. Stochastic Analysis and Applications, 4, 239–281.
Barbour, A. D., & Xia, A. (1999). Poisson perturbations. ESAIM, P&S, 3, 131–150.
Barbour, A. D., Karoński, M., & Ruciński, A. (1989). A central limit theorem for decomposable random variables with applications to random graphs. Journal of Combinatorial Theory. Series B, 47, 125–145.
Barbour, A. D., Holst, L., & Janson, S. (1992). Poisson approximation. London: Oxford University Press.
Bayer, D., & Diaconis, P. (1992). Trailing the dovetail shuffle to its lair. The Annals of Applied Probability, 2, 294–313.
Bentkus, V., Götze, F., & Zitikis, R. (1994). Lower estimates of the convergence rate for U-statistics. Annals of Probability, 22, 1707–1714.
Berry, A. (1941). The accuracy of the Gaussian approximation to the sum of independent variates. Transactions of the American Mathematical Society, 49, 122–136.
Bhattacharya, R. N., & Ghosh, J. (1978). On the validity of the formal Edgeworth expansion. Annals of Statistics, 6, 434–451.
Bhattacharya, R. N., & Rao, R. (1986). Normal approximation and asymptotic expansion. Melbourne: Krieger.
Bickel, P., & Doksum, K. (1977). Mathematical statistics: basic ideas and selected topics. Oakland: Holden-Day.
Biggs, N. (1993). Algebraic graph theory. Cambridge: Cambridge University Press.
Bikjalis, A. (1966). Estimates of the remainder term in the central limit theorem. Lietuvos Matematikos Rinkinys, 6, 323–346 (in Russian).
Billingsley, P. (1968). Convergence of probability measures. New York: Wiley.
Bollobás, B. (1985). Random graphs. San Diego: Academic Press.
Bolthausen, E. (1984). An estimate of the remainder in a combinatorial central limit theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 66, 379–386.
Borovskich, Yu. V. (1983). Asymptotics of U-statistics and von Mises’ functionals. Soviet Mathematics. Doklady, 27, 303–308.
Breiman, L. (1986). Probability. Reading: Addison–Wesley.
Brouwer, A. E., Cohen, A. M., & Neumaier, A. (1989). Distance-regular graphs. Berlin: Springer.
Bulinski, A., & Suquet, C. (2002). Normal approximation for quasi-associated random fields. Statistics & Probability Letters, 54, 215–226.
Cacoullos, T., & Papathanasiou, V. (1992). Lower variance bounds and a new proof of the central limit theorem. Journal of Multivariate Analysis, 43, 173–184.
Chatterjee, S. (2007). Stein’s method for concentration inequalities. Probability Theory and Related Fields, 138, 305–321.
Chatterjee, S. (2008). A new method of normal approximation. Annals of Probability, 4, 1584–1610.
Chatterjee, S., & Meckes, E. (2008). Multivariate normal approximation using exchangeable pairs. ALEA. Latin American Journal of Probability and Mathematical Statistics, 4, 257–283.
Chatterjee, S., & Shao, Q. M. (2010, to appear). Non-normal approximation by Stein’s method of exchangeable pairs with application to the Curie–Weiss model. The Annals of Applied Probability.
Chatterjee, S., Fulman, J., & Röllin, A. (2008, to appear). Exponential approximation by Stein’s method and spectral graph theory.
Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Annals of Probability, 3, 534–545.
Chen, L. H. Y. (1998). Stein’s method: some perspectives with applications. In L. Accardi & C. C. Heyde (Eds.), Lecture Notes in Statistics: Vol. 128. Probability towards 2000. Berlin: Springer.
Chen, L. H. Y., & Leong, Y. K. (2010). From zero-bias to discretized normal approximation (Preprint).
Chen, L. H. Y., & Röllin, A. (2010). Stein couplings for normal approximation.
Chen, L. H. Y., & Shao, Q. M. (2001). A non-uniform Berry–Esseen bound via Stein’s method. Probability Theory and Related Fields, 120, 236–254.
Chen, L. H. Y., & Shao, Q. M. (2004). Normal approximation under local dependence. Annals of Probability, 32, 1985–2028.
Chen, L. H. Y., & Shao, Q. M. (2005). Stein’s method for normal approximation. In Lecture Notes Series, Institute for Mathematical Sciences, National University of Singapore: Vol. 4. An introduction to Stein’s method (p. 159). Singapore: Singapore University Press.
Chen, L. H. Y., & Shao, Q. M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli, 13, 581–599.
Chen, L. H. Y., Fang, X., & Shao, Q. M. (2009). From Stein identities to moderate deviations.
Conway, J. H., & Sloane, N. J. A. (1999). Sphere packings, lattices and groups (3rd ed.). New York: Springer.
Cox, D. R. (1970). The continuity correction. Biometrika, 57, 217–219.
Darling, R. W. R., & Waterman, M. S. (1985). Matching rectangles in d dimensions: algorithms and laws of large numbers. Advances in Applied Mathematics, 55, 1–12.
Darling, R. W. R., & Waterman, M. S. (1986). Extreme value distribution for the largest cube in a random lattice. SIAM Journal on Applied Mathematics, 46, 118–132.
DeGroot, M. (1986). A conversation with Charles Stein. Statistical Science, 1, 454–462.
Dembo, A., & Karlin, S. (1992). Poisson approximations for r-scan processes. The Annals of Applied Probability, 2, 329–357.
Dembo, A., & Rinott, Y. (1996). Some examples of normal approximations by Stein’s method. In The IMA Volumes in Mathematics and Its Applications: Vol. 76. Random discrete structures (pp. 25–44). New York: Springer.
Dharmadhikari, S., & Joag-Dev, K. (1998). Unimodality, convexity and applications. San Diego: Academic Press.
Diaconis, P. (1977). The distribution of leading digits uniform distribution mod 1. Annals of Probability, 5, 72–81.
Diaconis, P., & Freedman, D. (1987). A dozen de Finetti-style results in search of a theory. Annales de L’I.H.P. Probabilités Et Statistiques, 23, 397–423.
Diaconis, P., & Holmes, S. (2004). Institute of mathematical statistics lecture notes, monograph series: Vol. 46. Stein’s method: expository lectures and applications. Beachwood: Institute of Mathematical Statistics.
Diaconis, P., & Shahshahani, M. (1987). Time to reach stationarity in the Bernoulli–Laplace diffusion model. SIAM Journal on Mathematical Analysis, 18, 208–218.
Diaconis, P., & Shahshahani, M. (1994). On the eigenvalues of random matrices. Journal of Applied Probability, 31A, 49–62.
Donnelly, P., & Welsh, D. (1984). The antivoter problem: random 2-colourings of graphs. In B. Bollobás (Ed.), Graph theory and combinatorics (pp. 133–144). San Diego: Academic Press.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.
Ellis, R. (1985). Grundlehren der Mathematischen Wissenschaften. Entropy, large deviations, and statistical mechanics. New York: Springer.
Ellis, R., & Newman, C. (1978a). The statistics of Curie–Weiss models. Journal of Statistical Physics, 19, 149–161.
Ellis, R., & Newman, C. (1978b). Limit theorems for sums of dependent random variables occurring in statistical mechanics. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 44, 117–139.
Ellis, R., Newman, C., & Rosen, J. (1980). Limit theorems for sums of dependent random variables occurring in statistical mechanics. II. Conditioning, multiple phases, and metastability. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 51, 153–169.
Erdős, P., & Rényi, A. (1959a). On the central limit theorem for samples from a finite population. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 4, 49–61.
Erdős, P., & Rényi, A. (1959b). On random graphs. Publicationes Mathematicae Debrecen, 6, 290–297.
Erickson, R. (1974). L1 bounds for asymptotic normality of m-dependent sums using Stein’s technique. Annals of Probability, 2, 522–529.
Esseen, C. (1942). On the Liapounoff limit of error in the theory of probability. Arkiv För Matematik, Astronomi Och Fysik, 28A, 19 pp.
Esseen, C. (1945). Fourier analysis of distribution functions. A mathematical study of the Laplace–Gaussian law. Acta Mathematica, 77, 1–125.
Ethier, S., & Kurtz, T. (1986). Markov processes: characterization and convergence. New York: Wiley.
Feller, W. (1935). Über den Zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 40, 512–559.
Feller, W. (1968a). An introduction to probability theory and its applications (Vol. 1). New York: Wiley.
Feller, W. (1968b). An introduction to probability theory and its applications (Vol. 2). New York: Wiley.
Ferguson, T. (1996). A course in large sample theory. New York: Chapman & Hall.
Finkelstein, M., Kruglov, V., & Tucker, H. (1994). Convergence in law of random sums with nonrandom centering. Journal of Theoretical Probability, 7, 565–598.
Friedrich, K. (1989). A Berry–Esseen bound for functions of independent random variables. Annals of Statistics, 17, 170–183.
Fulman, J. (2006). An inductive proof of the Berry–Esseen theorem for character ratios. Annals of Combinatorics, 10, 319–332.
Fulman, J. (2009). Communications in Mathematical Physics, 288, 1181–1201.
Fulton, W., & Harris, J. (1991). Graduate texts in mathematics. Representation theory. New York: Springer.
Geary, R. (1954). The continuity ratio and statistical mapping. Incorporated Statistician, 5, 115–145.
Ghosh, S. (2009). Lp bounds for a combinatorial central limit theorem with involutions (Preprint).
Glaz, J., Naus, J., & Wallenstein, S. (2001). Springer series in statistics. Scan statistics. New York: Springer.
Goldstein, L. (2004). Normal approximation for hierarchical sequences. The Annals of Applied Probability, 14, 1950–1969.
Goldstein, L. (2005). Berry Esseen bounds for combinatorial central limit theorems and pattern occurrences, using zero and size biasing. Journal of Applied Probability, 42, 661–683.
Goldstein, L. (2007). L1 bounds in normal approximation. Annals of Probability, 35, 1888–1930.
Goldstein, L. (2010a). Bounds on the constant in the mean central limit theorem. Annals of Probability, 38, 1672–1689.
Goldstein, L. (2010b). A Berry–Esseen bound with applications to counts in the Erdős–Rényi random graph (Preprint).
Goldstein, L., & Penrose, M. (2010). Normal approximation for coverage models over binomial point processes. The Annals of Applied Probability, 20, 696–721.
Goldstein, L., & Reinert, G. (1997). Stein’s method and the zero bias transformation with application to simple random sampling. The Annals of Applied Probability, 7, 935–952.
Goldstein, L., & Reinert, G. (2005). Distributional transformations, orthogonal polynomials, and Stein characterizations. Journal of Theoretical Probability, 18, 237–260.
Goldstein, L., & Rinott, Y. (1996). Multivariate normal approximations by Stein’s method and size bias couplings. Journal of Applied Probability, 33, 1–17.
Goldstein, L., & Rinott, Y. (2003). A permutation test for matching and its asymptotic distribution. Metron, 61, 375–388.
Goldstein, L., & Shao, Q. M. (2009). Berry–Esseen bounds for projections of coordinate symmetric random vectors. Electronic Communications in Probability, 14, 474–485.
Goldstein, L., & Zhang, H. (2010). A Berry–Esseen theorem for the lightbulb problem. Submitted for publication.
Götze, F. (1991). On the rate of convergence in the multivariate CLT. The Annals of Applied Probability, 19, 724–739.
Griffeath, D. (1974/1975). A maximal coupling for Markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 31, 95–106.
Griffiths, R., & Kaufman, M. (1982). Spin systems on hierarchical lattices. Introduction and thermodynamic limit. Physical Review. B, Solid State, 26, 5022–5032.
Grimmett, G., & Stirzaker, D. (2001). Probability and random processes. London: Oxford University Press.
Haagerup, U. (1982). The best constants in the Khintchine inequality. Studia Mathematica, 70, 231–283.
Hájek, J. (1960). Limiting distributions in simple random sampling from a finite population. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 5, 361–374.
Hall, P. (1988). Introduction to the theory of coverage processes. New York: Wiley.
Hall, B. (2003). Lie groups, Lie algebras, and representations: an elementary introduction. Berlin: Springer.
Hall, P., & Barbour, A. D. (1984). Reversing the Berry–Esseen inequality. Proceedings of the American Mathematical Society, 90(1), 107–110.
Helgason, S. (2000). Groups and geometric analysis. Providence: American Mathematical Society.
Helmers, R. (1977). The order of the normal approximation for linear combinations of order statistics with smooth weight functions. Annals of Probability, 5, 940–953.
Helmers, R., & van Zwet, W. (1982). The Berry–Esseen bound for U-statistics. Statistical decision theory and related topics, III, West Lafayette, Ind. (Vol. 1, pp. 497–512). New York: Academic Press.
Helmers, R., Janssen, P., & Serfling, R. (1990). Berry–Esséen and bootstrap results for generalized L-statistics. Scandinavian Journal of Statistics, 17, 65–77.
Henze, N. (1988). A multivariate two-sample test based on the number of nearest neighbor type coincidences. Annals of Statistics, 16, 772–783.
Ho, S. T., & Chen, L. H. Y. (1978). An Lp bound for the remainder in a combinatorial central limit theorem. Annals of Probability, 6, 231–249.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325.
Hoeffding, W. (1951). A combinatorial central limit theorem. Annals of Mathematical Statistics, 22, 558–566.
Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Huang, H. (2002). Error bounds on multivariate normal approximations for word count statistics. Advances in Applied Probability, 34, 559–586.
Hubert, L. (1987). Assignment methods in combinatorial data analysis. New York: Dekker.
Janson, S., & Nowicki, K. (1991). The asymptotic distributions of generalized U-statistics with applications to random graphs. Probability Theory and Related Fields, 90, 341–375.
Jing, B., & Zhou, W. (2005). A note on Edgeworth expansions for U-statistics under minimal conditions. Lietuvos Matematikos Rinkinys, 45, 435–440; translation in Lithuanian Math. J., 45, 353–358.
Johansson, K. (1997). On random matrices from the compact classical groups. Annals of Mathematics, 145, 519–545.
Jordan, J. H. (2002). Almost sure convergence for iterated functions of independent random variables. Journal of Applied Probability, 12, 985–1000.
Karlin, S., & Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science, 257, 39–49.
Karoński, M., & Ruciński, A. (1987). Poisson convergence of semi-induced properties of random graphs. Mathematical Proceedings of the Cambridge Philosophical Society, 101, 291–300.
Klartag, B. (2007). A central limit theorem for convex sets. Inventiones Mathematicae, 168, 91–131.
Klartag, B. (2009). A Berry–Esseen type inequality for convex bodies with an unconditional basis. Probability Theory and Related Fields, 145, 1–33.
Knox, G. (1964). Epidemiology of childhood leukemia in Northumberland and Durham. British Journal of Preventive & Social Medicine, 18, 17–24.
Kolchin, V. F., & Chistyakov, V. P. (1973). On a combinatorial limit theorem. Theory of Probability and Its Applications, 18, 728–739.
Kordecki, W. (1990). Normal approximation and isolated vertices in random graphs. In M. Karoński, J. Jaworski, & A. Ruciński (Eds.), Random graphs 1987. New York: Wiley.
Koroljuk, V., & Borovskich, Y. (1994). Mathematics and its applications: Vol. 273. Theory of U-statistics (translated from the 1989 Russian original by P. V. Malyshev & D. V. Malyshev and revised by the authors). Dordrecht: Kluwer Academic.
LeCam, L. (1986). The central limit theorem around 1935. Statistical Science, 1, 78–96.
Leech, J., & Sloane, N. (1971). Sphere packings and error-correcting codes. Canadian Journal of Mathematics, 23, 718–745.
Lévy, P. (1935). Propriétés asymptotiques des sommes de variables indépendantes ou enchaînées. Journal de Mathématiques Pures Et Appliquées, 14, 347–402.
Li, D., & Rogers, T. D. (1999). Asymptotic behavior for iterated functions of random variables. The Annals of Applied Probability, 9, 1175–1201.
Liggett, T. (1985). Interacting particle systems. New York: Springer.
Luk, M. (1994). Stein’s method for the Gamma distribution and related statistical applications. Ph.D. dissertation, University of Southern California, Los Angeles, USA.
Madow, W. G. (1948). On the limiting distributions of estimates based on samples from finite universes. Annals of Mathematical Statistics, 19, 535–545.
Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209–220.
Matloff, N. (1977). Ergodicity conditions for a dissonant voting model. Annals of Probability, 5, 371–386.
Mattner, L., & Roos, B. (2007). A shorter proof of Kanter’s Bessel function concentration bound. Probability Theory and Related Fields, 139, 191–205.
Meckes, M., & Meckes, E. (2007). The central limit problem for random vectors with symmetries. Journal of Theoretical Probability, 20, 697–720.
Midzuno, H. (1951). On the sampling system with probability proportionate to sum of sizes. Annals of the Institute of Statistical Mathematics, 2, 99–108.
Moran, P. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society. Series B. Methodological, 10, 243–251.
Motoo, M. (1957). On the Hoeffding’s combinatorial central limit theorem. Annals of the Institute of Statistical Mathematics, 8, 145–154.
Nagaev, S. (1965). Some limit theorems for large deviations. Theory of Probability and its Applications, 10, 214–235.
Naus, J. I. (1982). Approximations for distributions of scan statistics. Journal of the American Statistical Association, 77, 177–183.
Nourdin, I., & Peccati, G. (2009). Stein’s method on Wiener chaos. Probability Theory and Related Fields, 145, 75–118.
Nourdin, I., & Peccati, G. (2011). Normal approximations with Malliavin calculus. From Stein’s method to universality.
Nourdin, I., Peccati, G., & Reinert, G. (2009, to appear). Invariance principles of homogeneous sums: Universality of Gaussian Wiener chaos. Annals of Probability.
Nualart, D. (2006). The Malliavin calculus and related topics (2nd ed.). Berlin: Springer.
Palka, Z. (1984). On the number of vertices of a given degree in a random graph. Journal of Graph Theory, 8, 167–170.
Papangelou, F. (1989). On the Gaussian fluctuations of the critical Curie–Weiss model in statistical mechanics. Probability Theory and Related Fields, 83, 265–278.
Peccati, G. (2009). Stein’s method, Malliavin calculus and infinite-dimensional Gaussian analysis. Lecture notes from Progress in Stein’s method. In http://www.ims.nus.edu.sg/Programs/stein09/files/Lecture_notes_5.pdf.
Peköz, E., & Röllin, A. (2009). New rates for exponential approximation and the theorems of Rényi and Yaglom (Preprint).
Penrose, M. (2003). Random geometric graphs. Oxford: Oxford University Press.
Petrov, V. (1995). Oxford studies in probability: Vol. 4. Limit theorems of probability theory: sequences of independent random variables. London: Oxford University Press.
Pickett, A. (2004). Rates of convergence of $\chi^2$ approximations via Stein’s method. Ph.D. dissertation, Oxford.
Proctor, R. (1990). A Schensted algorithm which models tensor representations of the orthogonal group. Canadian Journal of Mathematics, 42, 28–49.
Rachev, S. (1984). The Monge–Kantorovich transference problem and its stochastic applications. Theory of Probability and Its Applications, 29, 647–676.
Raič, M. (2004). A multivariate CLT for decomposable random vectors with finite second moment. Journal of Theoretical Probability, 17, 573–603.
Raič, M. (2007). CLT related large deviation bounds based on Stein’s method. Advances in Applied Probability, 39, 731–752.
Rao, R., Rao, M., & Zhang, H. (2007). One bulb? Two bulbs? How many bulbs light up? A discrete probability problem involving dermal patches. Sankhyā. The Indian Journal of Statistics, 69, 137–161.
Reinert, G., & Röllin, A. (2009). Multivariate normal approximation with Stein’s method of exchangeable pairs under a general linearity condition. Annals of Probability, 37, 2150–2173.
Reinert, G., & Röllin, A. (2010). Random subgraph counts and U-statistics: multivariate normal approximation via exchangeable pairs and embedding. Journal of Applied Probability, 47(2), 378–393.
Rinott, Y. (1994). On normal approximation rates for certain sums of dependent random variables. Journal of Computational and Applied Mathematics, 55, 135–143.
Rinott, Y., & Rotar, V. (1996). A multivariate CLT for local dependence with $n^{-1/2}\log n$ rate and applications to multivariate graph related statistics. Journal of Multivariate Analysis, 56, 333–350.
Rinott, Y., & Rotar, V. (1997). On coupling constructions and rates in the CLT for dependent summands with applications to the antivoter model and weighted U-statistics. The Annals of Applied Probability, 7, 1080–1105.
Robbins, H. (1948). The asymptotic distribution of the sum of a random number of random variables. Bulletin of the American Mathematical Society, 54, 1151–1161.
Ross, S., & Peköz, E. (2007). A second course in probability. United States: Pekozbooks.
Sagan, B. (1991). The symmetric group. Belmont: Wadsworth.
Sazonov, V. (1968). On the multi-dimensional central limit theorem. Sankhyā Series A, 30, 181–204.
Schechtman, G., & Zinn, J. (1990). On the volume of the intersection of two $L_p^n$ balls. Proceedings of the American Mathematical Society, 110, 217–224.
Schiffman, A., Cohen, S., Nowik, R., & Sellinger, D. (1978). Initial diagnostic hypotheses: factors which may distort physicians judgement. Organizational Behavior and Human Performance, 21, 305–315.
Schilling, M. (1986). Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81, 799–806.
Schlösser, T., & Spohn, H. (1992). Sample to sample fluctuations in the conductivity of a disordered medium. Journal of Statistical Physics, 69, 955–967.
Seber, G., & Lee, A. (2003). Wiley series in probability and statistics. Linear regression analysis (2nd ed.). New York: Wiley.
Serfling, R. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
Serre, J.-P. (1997). Graduate texts in mathematics. Linear representations of finite groups. New York: Springer.
Shao, Q. M., & Su, Z. (2005). The Berry–Esseen bound for character ratios. Proceedings of the American Mathematical Society, 134, 2153–2159.
Shneiberg, I. (1986). Hierarchical sequences of random variables. Theory of Probability and Its Applications, 31, 137–141.
Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Annals of Statistics, 14, 753–758.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955 (Vol. I, pp. 197–206). Berkeley: University of California Press.
Stein, E. M. (1970). Singular integrals and differentiability properties of functions. Princeton: Princeton University Press.
Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 2, pp. 586–602). Berkeley: University of California Press.
Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9, 1135–1151.
Stein, C. (1986). Approximate computation of expectations. Hayward: IMS.
Stein, C. (1995). The accuracy of the normal approximation to the distribution of the traces of powers of random orthogonal matrices (Technical Report no. 470). Stanford University, Statistics Department.
Stroock, D. (2000). Probability theory, an analytic view. Cambridge: Cambridge University Press.
Sundaram, S. (1990). Tableaux in the representation theory of compact Lie groups. In IMA volumes in mathematics: Vol. 19. Invariant theory and tableaux (pp. 191–225). New York: Springer.
Tyurin, I. (2010, to appear). New estimates on the convergence rate in the Lyapunov theorem.
von Bahr, B. (1976). Remainder term estimate in a combinatorial limit theorem. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 35, 131–139.
Wald, A., & Wolfowitz, J. (1944). Statistical tests based on permutations of the observations. Annals of Mathematical Statistics, 15, 358–372.
Wehr, J. (1997). A strong law of large numbers for iterated functions of independent random variables. Journal of Statistical Physics, 86, 1373–1384.
Wehr, J. (2001). Erratum on: A strong law of large numbers for iterated functions of independent random variables [Journal of Statistical Physics, 86(1997), 1373–1384]. Journal of Statistical Physics, 104, 901.
Wehr, J., & Woo, J. M. (2001). Central limit theorems for nonlinear hierarchical sequences of random variables. Journal of Statistical Physics, 104, 777–797.
Weyl, H. (1997). The classical groups. Princeton: Princeton University Press.
Zhao, L., Bai, Z., Chao, C.-C., & Liang, W.-Q. (1997). Error bound in the central limit theorem of double-indexed permutation statistics. Annals of Statistics, 25, 2210–2227.
Zhou, H., & Lange, K. (2009). Composition Markov chains of multinomial type. Advances in Applied Probability, 41, 270–291.
Zong, C. (1999). Sphere packings. New York: Springer.
Author Index
A Accardi, L., see Chen, L. H. Y., 54 Aldous, D., 213, 364 Anderson, T. W., see Baldi, P., 135, 210 Arratia, R., 3 Athreya, K. B., see Baldi, P., 135, 210 B Bai, Z., see Zhao, L., 202 Baldi, P., 135, 210, 254 Barbour, A. D., vii, 3, 4, 18, 64, 201, 231, 315 Barbour, A. D., see Barbour, A. D., vii, 4, 201 Barbour, A. D., see Hall, P., 59 Bayer, D., 209 Bentkus, V., 261 Berry, A., 2, 45 Bhattacharya, R. N., 161, 163, 273, 332, 337 Bickel, P., 95 Biggs, N., 206 Bikjalis, A., 233 Billingsley, P., 366 Bollobás, B., 315 Bollobás, B., see Donnelly, P., 213 Bolthausen, E., 41, 167, 169, 315 Borovskich, Y., see Koroljuk, V., 260, 263, 284, 287 Borovskich, Yu. V., 263 Brede, V., see Karlub, S., 254 Breiman, L., 6 Brouwer, A. E., 206 Bulinski, A., 246 C Cacoullos, T., 117 Chao, C.-C., see Zhao, L., 202 Chatterjee, S., 37, 117, 122, 293, 326, 343, 359, 362
Chen, L. H. Y., 3, 16, 54, 55, 125, 126, 135, 221, 231, 233, 243, 245, 247, 253, 257, 293, 305 Chen, L. H. Y., see Barbour, A. D., vii, 4, 201 Chen, L. H. Y., see Ho, S. T., 167 Chistyakov, V. P., see Kolchin, V. F., 183 Cohen, A. M., see Brouwer, A. E., 206 Cohen, S., see Schiffman, A., 183 Conway, J. H., 123 Cox, D. R., 221 D Darling, R. W. R., 210 DeGroot, M., 4 Dembo, A., 135, 254, 255 Dharmadhikari, S., 231 Diaconis, P., vii, 88, 217, 359, 371 Diaconis, P., see Bayer, D., 209 Doksum, K., see Bickel, P., 95 Donnelly, P., 213 E Eagleson, G., see Barbour, A. D., 201 Efron, B., 130 Ellis, R., 298, 353 Erdös, P., 112, 315 Erickson, R., 47, 63 Esseen, C., 2, 45, 233 Ethier, S., 18 F Fang, X., see Chen, L. H. Y., 293 Feller, W., 2, 32, 59, 221 Ferguson, T., 273 Fill, J. A., see Aldous, D., 213, 364 Finkelstein, M., 271 Freedman, D., see Diaconis, P., 88
L.H.Y. Chen et al., Normal Approximation by Stein’s Method, Probability and Its Applications, DOI 10.1007/978-3-642-15007-4, © Springer-Verlag Berlin Heidelberg 2011
Friedrich, K., 261 Fulman, J., 169, 371, 373, 374, 378 Fulman, J., see Chatterjee, S., 343, 359, 362 Fulton, W., 372 G Geary, R., 201 Ghosh, J., see Bhattacharya, R. N., 273 Ghosh, S., 169, 183 Glaz, J., 210, 254 Goldstein, L., 26, 34, 64, 67, 68, 88, 124, 125, 157, 160, 167, 169, 183, 211–213, 315, 334 Goldstein, L., see Arratia, R., 3 Gordon, L., see Arratia, R., 3 Götze, F., 18, 161, 163, 337 Götze, F., see Bentkus, V., 261 Griffeath, D., 366 Griffiths, R., 73 Grimmett, G., 364 H Haagerup, U., 121 Hájek, J., 112 Hall, B., 373 Hall, P., 59, 122 Harris, J., see Fulton, W., 372 Helgason, S., 374, 375 Helmers, R., 263, 266 Henze, N., 335, 336 Heyde, C. C., see Chen, L. H. Y., 54 Ho, S. T., 167 Hoeffding, W., 100, 260 Holmes, S., see Diaconis, P., vii Holst, L., see Barbour, A. D., 3, 18, 64 Horn, R., 318 Huang, H., 210 Hubert, L., 201 I Iglehart, D. L., see Baldi, P., 135, 210 J Janson, S., 315 Janson, S., see Barbour, A. D., 3, 18, 64 Janssen, P., see Helmers, R., 266 Jaworski, J., see Kordecki, W., 315 Jing, B., 261 Joag-Dev, K., see Dharmadhikari, S., 231 Johansson, K., 371 Johnson, C., see Horn, R., 318 Jordan, J. H., 74
K Karlin, S., see Dembo, A., 254 Karlub, S., 254 Karoński, M., 315 Karoński, M., see Barbour, A. D., 315 Karoński, M., see Kordecki, W., 315 Kaufman, M., see Griffiths, R., 73 Klartag, B., 88, 138 Knox, G., 201 Kolchin, V. F., 183 Kordecki, W., 315 Koroljuk, V., 260, 263, 284, 287 Kruglov, V., see Finkelstein, M., 271 Kurtz, T., see Ethier, S., 18 L Lange, K., see Zhou, H., 212, 213 LeCam, L., 2, 59 Lee, A., see Seber, G., 119 Leech, J., 123 Leong, Y. K., see Chen, L. H. Y., 221 Lévy, P., 59 Li, D., 72, 74 Liang, W.-Q., see Zhao, L., 202 Liggett, T., 213 Luk, M., 18, 345 M Madow, W. G., 112 Mantel, N., 201 Matloff, N., 213 Mattner, L., 231 Meckes, E., see Chatterjee, S., 326 Meckes, E., see Meckes, M., 88 Meckes, M., 88 Midzuno, H., 32 Moran, P., 201 Motoo, M., 100 N Nagaev, S., 233 Naus, J. I., 210, 254 Naus, J., see Glaz, J., 210, 254 Neumaier, A., see Brouwer, A. E., 206 Newman, C., see Ellis, R., 298, 353 Nourdin, I., 345, 371, 381, 382, 386, 388 Nowicki, K., see Janson, S., 315 Nowik, R., see Schiffman, A., 183 Nualart, D., 382 P Palka, Z., 315 Papangelou, F., 353 Papathanasiou, V., see Cacoullos, T., 117 Peccati, G., 382
Peccati, G., see Nourdin, I., 345, 371, 381, 382, 386, 388 Peköz, E., 343, 363, 364, 367 Peköz, E., see Ross, S., vii Penrose, M., 122 Penrose, M., see Goldstein, L., 124, 125, 157 Petrov, V., 293 Pickett, A., 345 Proctor, R., 379 R Rachev, S., 64, 65 Raič, M., 37, 293 Rao, M., see Rao, R., 211, 212 Rao, R., 211, 212 Rao, R., see Bhattacharya, R. N., 161, 163, 332, 337 Reinert, G., 325, 326, 328, 329 Reinert, G., see Goldstein, L., 26, 34 Reinert, G., see Nourdin, I., 386 Rényi, A., see Erdös, P., 112, 315 Rinott, Y., 135, 149, 161–163, 213, 216, 254, 260, 325, 331–333, 335, 336 Rinott, Y., see Baldi, P., 135, 210, 254 Rinott, Y., see Dembo, A., 135, 254, 255 Rinott, Y., see Goldstein, L., 183, 315, 334 Robbins, H., 270 Rogers, T. D., see Li, D., 72, 74 Röllin, A., see Chatterjee, S., 343, 359, 362 Röllin, A., see Chen, L. H. Y., 125, 126 Röllin, A., see Peköz, E., 343, 363, 364, 367 Röllin, A., see Reinert, G., 325, 326, 328, 329 Roos, B., see Mattner, L., 231 Rosen, J., see Ellis, R., 353 Ross, S., vii Rotar, V., see Rinott, Y., 149, 161–163, 213, 216, 260, 325, 331–333, 335, 336 Ruciński, A., see Barbour, A. D., 315 Ruciński, A., see Karoński, M., 315 Ruciński, A., see Kordecki, W., 315 S Sagan, B., 184, 372 Sazonov, V., 332 Schechtman, G., 98 Schiffman, A., 183 Schilling, M., 335, 336 Schlösser, T., 73 Seber, G., 119 Sellinger, D., see Schiffman, A., 183 Serfling, R., 266, 267 Serfling, R., see Helmers, R., 266 Serre, J.-P., 372
Shahshahani, M., see Diaconis, P., 359, 371 Shao, Q. M., 150 Shao, Q. M., see Chatterjee, S., 343 Shao, Q. M., see Chen, L. H. Y., 16, 55, 135, 231, 233, 243, 245, 247, 253, 257, 293, 305 Shao, Q. M., see Goldstein, L., 88 Shneiberg, I., 73, 87 Sloane, N. J. A., see Conway, J. H., 123 Sloane, N., see Leech, J., 123 Spohn, H., see Schlösser, T., 73 Steele, J. M., 130 Stein, C., vii, 3, 5, 13, 21, 37, 154, 371 Stein, C., see Baldi, P., 135, 210 Stein, C., see Efron, B., 130 Stein, E. M., 136 Stirzaker, D., see Grimmett, G., 364 Stroock, D., vii, 37, 57 Su, Z., see Shao, Q. M., 150 Sundaram, S., 380, 381 Suquet, C., see Bulinski, A., 246 T Tucker, H., see Finkelstein, M., 271 Tyurin, I., 2, 45, 53, 67 V van Zwet, W., see Helmers, R., 263 von Bahr, B., 167 W Wald, A., 100, 112 Wallenstein, S., see Glaz, J., 210, 254 Waterman, M. S., see Darling, R. W. R., 210 Wehr, J., 74, 75, 77, 78, 81, 82, 84, 86 Welsh, D., see Donnelly, P., 213 Weyl, H., 372 Wolfowitz, J., see Wald, A., 100, 112 Woo, J. M., see Wehr, J., 75, 77, 78, 81, 82, 84, 86 X Xia, A., see Barbour, A. D., 231 Z Zhang, H., see Goldstein, L., 160, 211–213 Zhang, H., see Rao, R., 211, 212 Zhao, L., 202 Zhou, H., 212, 213 Zhou, W., see Jing, B., 261 Zinn, J., see Schechtman, G., 98 Zitikis, R., see Bentkus, V., 261 Zong, C., 123
Subject Index
A adjoint, 385 anti-voter model, 213, 295 antisymmetric function, 21 averaging function, 74 B Bennett–Hoeffding inequality, 234, 237 Bernoulli distribution, 67 Bernoulli–Laplace model, 358 Berry–Esseen constant, 45 Berry–Esseen inequality, 45, 53 Beta distribution, 94, 95 binary expansion of a random integer, 217, 297 bounds on the Stein equation, 16 Brownian motion, 382 C Cauchy’s formula, 184 chi squared distribution, 345 class functions, 373, 374 combinatorial central limit theorem, 24, 100, 167–169, 183, 295, 371 complete graph, 296 composition Markov chains of multinomial type, 212 concentration inequality, 53, 57, 150, 159, 249 concentration inequality, non-uniform, 233 concentration inequality, randomized, 277 conditional variance formula, 111 conductivity of random media, 72 cone measure, 88 conjugacy class, 373 continuity correction, 221 contraction, 383 contraction principle, 69 convergence determining class, 136
coordinate symmetric, 88 coverage processes, 122 covered volume, 122 Cramér series, 293 Curie–Weiss model, 297, 353 cycle type, 183 D delta method, 273 dependency graph, 135, 254 dependency neighborhoods, random, 334 diamond lattice, 73 discretized normal distribution, 221 distance r-regular graph, 206 distribution constant on cycle type, 183, 187, 196 E Efron–Stein inequality, 130 embedding method, 325 equilibrium distribution, 363 Erdös–Rényi random graph, 315 exchangeable pair, 21, 102, 111, 113, 149, 151, 153, 155, 217, 347, 358 exchangeable pair, multivariate, 325, 329 exponential distribution, 345 F fast rates of convergence, 136 first passage times, 364 G Gamma distribution, 94, 95, 345 Gamma function, 94 generator method, 17, 18, 25 Gini’s mean difference, 265
graph patterns, 209 graphical rule, 121, 130 group characters, 371, 372 H Haar measure, 373 hierarchical variables, 72 Hoeffding decomposition, 260 homogeneous, 73 I inductive method, 57, 169 integer valued variables, 227 integration by parts, 385 interaction rule, 122 involution, 183 irreducible representation, 372 isolated points, 123 J Johnson graph, 360 K Kantorovich metric, 64 K function, 19 Khintchine’s inequality, 121 kissing number, 123, 124, 335 Kolmogorov distance, 45, 63, 331 L L-statistics, 265 L1 distance, 63, 64 L1 distance, dual form, 64 L1 distance, sup form, 64 L∞ distance, 45, 63 Lie group, 373 lightbulb process, 210 Lindeberg condition, 48, 59 Lipschitz function, 46 local dependence, 133, 202, 205, 207, 245 local dependence, multivariate, 331 local dependence conditions, 245 local extremes, 210 local maxima, 210 M m-dependence, 133, 208 m-scans process, 254 Malliavin calculus, 381 Malliavin derivative, 384, 385 moderate deviations, 293 multi-sample U-statistics, 262 multiplication formula, 383 multivariate approximation, 313
multivariate size bias coupling, 314 N nearest neighbor graph, 334 non-linear statistics, 258 non-uniform Berry–Esseen bounds, 237 non-uniform bounds, 233 O Ornstein–Uhlenbeck generator, 385 Ornstein–Uhlenbeck process, 25 orthogonal group, 373 P patterns in graphs, 209 permutation statistics, 201 permutation test, 100, 183, 201 Poisson binomial distribution, 222 Polish space, 64 positively homogeneous, 73 Q quadratic forms, 119 R random field, 133, 247 random graph, coloring, 334 random graph coloring, 333 random graphs, triangles, 326 random sums, 270 renewal–reward process, 364 representation, 372 resistor network, 77 rising sequence, 209 S scaled averaging function, 76 scaling conditional, 92 scan statistics, 208 simple random sampling, 25, 111, 112 size bias, 31, 32, 156, 157, 160, 162, 318 size biased coupling, multivariate, 314 smooth function bound, 314 smooth functions, 136 smoothed indicator, 153, 155, 158, 171, 251 smoothed indicator, bounds, 17 smoothing inequalities, 333 smoothing inequality, 161, 337 special orthogonal group, 373 square bias, 35, 90 Stein characterization, 13 Stein equation, 14, 15
Stein equation, general, 344 Stein identity, 36 Stein pair, 22, 192, 375 stochastic integral, 382, 383 strictly averaging function, 75, 77 strongly unimodal, 231 sub-sequences of random permutations, 208 supremum norm, 45, 63 symmetrization, of functions, 382, 383 T tensor product representation, 374 total variation distance, 63, 221, 386 trimmed mean, 266 two-sample ω²-statistic, 263 two-sample Wilcoxon statistic, 263
U U-statistics, 260 uniform distribution, 67 unimodal, 230 unitary symplectic group, 373 universal L1 constant, 67 V variance stabilizing transformation, 274 W Wasserstein distance, 64 Wiener chaos, 383, 386 Z zero bias, 26, 27, 29, 64, 102, 136, 147, 222, 228