with ||x^(i) - x^(j)|| > ε if i ≠ j. For large enough n, x_n must lie in some one of the spheres of radius ε/3 about the x^(i), i = 1, 2, ..., N. Since ||x_{n+1} - x_n|| → 0, this implies in fact that all the x_n eventually lie in some one fixed sphere, since to jump to another sphere one requires ||x_{n+1} - x_n|| > ε/3, which is never true for large n. Although we shall improve this result later, we have proved the following theorem.
THEOREM 4.2.3. If ∇f(x) is norm-continuous in x, if ∇f(x) = 0 has only finitely many solutions in W(x₀), and if {x_n} is a criticizing sequence with ||x_{n+1} - x_n|| → 0, then {x_n} either has no norm-limit points or x_n → x* with ∇f(x*) = 0.

We do not wish to give the impression that the only way to treat minimization is from the criticizing-sequence point of view; other approaches also
can be taken. For example [Yakovlev (1965)], suppose the directions p_n are generated via p_n = -H_n∇f(x_n), where for each n, H_n is a bounded, positive-definite, self-adjoint linear operator from the Hilbert space E into itself. Suppose

0 < a ≤ ⟨H_n y, y⟩/||y||² ≤ A for all n and all y in E,

and suppose we take

x_{n+1} = x_n + t_n p_n, 0 < ε₁ ≤ t_n ≤ 2/A - ε₂
Then f of course is uniquely minimized at a point x* and one can prove that x_n → x*. Arguing much as we shall in Section 4.6, one can show that f(x_{n+1}) - f(x_n) < 0. Define

t̄_n = sup{t; ⟨∇f(x_n + τp_n), p_n⟩ - α_n⟨∇f(x_n), p_n⟩ < 0 for 0 < τ < t}
We assume 0 < α_n ≤ α < 1. Let

x_{n+1} = x_n + t̄_n p_n

then {x_n} is a criticizing sequence and

f(x_n) - f(x_{n+1}) ≥ s(cγ_n)(1 - c)γ_n for all c ∈ (0, 1 - α), with γ_n = ⟨-∇f(x_n), p_n/||p_n||⟩

Proof. In any determination of t̄_n, clearly ⟨∇f(x_n + t̄_n p_n), p_n⟩ - α_n⟨∇f(x_n), p_n⟩ = 0 and

(d/dt)[f(x_n + tp_n) - α_n t⟨∇f(x_n), p_n⟩] ≤ 0

for 0 < t < t̄_n, implying f(x_n + t̄_n p_n) - α_n t̄_n⟨∇f(x_n), p_n⟩ ≤ f(x_n); since t̄_n||p_n|| ≥ s(cγ_n), this yields f(x_n) - f(x_{n+1}) ≥ s(cγ_n)(1 - c)γ_n
for all c in (0, 1 - α).

Proof. From the proof of the preceding theorem we know that t̄_n||p_n|| ≥ s(cγ_n) for all c in (0, 1 - α); therefore,

t̄_n||p_n|| ≥ t_n||p_n|| ≥ d(γ_n)s(cγ_n)

Therefore, ||x_{n+1} - x_n|| = t_n||p_n|| ≥ d(γ_n)s(cγ_n) yields a criticizing sequence by part A of Theorem 4.2.4 with c₁(t) = d(t)s(ct) and c₂(t) = ct. Since
(d/dt)[f(x_n + tp_n) - α_n t⟨∇f(x_n), p_n⟩] ≤ 0 for 0 < t < t_n, then

f(x_{n+1}) - α_n t_n⟨∇f(x_n), p_n⟩ ≤ f(x_n)

Then {x_n} is a criticizing sequence and ||x_{n+1} - x_n|| → 0.
Proof: By Proposition 1.5.1 we have f(x_n) - f(x_{n+1}) ≥ τ_n δ_n. Thus {f(x_n)} is decreasing and hence convergent, and τ_n δ_n tends to zero. If infinitely often we have δ_n ≥ ε > 0, then this cannot occur under condition 1, since then we have

f(x_n) - f(x_{n+1}) ≥ ε/2 · ε₁

infinitely often, in contradiction to the boundedness below of f. Under condition 2, however, we have

ε(1 - b₂) ≤ (1 - a₂)δ_n ≤ ...

the right-hand side of which tends to zero, since ∇f is uniformly continuous and since τ_n must tend to zero if δ_n does not. This gives a contradiction, leading us to conclude that δ_n → 0 and hence ||∇f(x_n)|| tends to zero. Finally,

||x_{n+1} - x_n|| = ||τ_n p_n|| = |τ_n| ||p_n|| → 0

Q.E.D.
General references: Altman (1966a), Elkin (1968), Goldstein (1964b, 1965, 1966, 1967), Levitin-Poljak (1966a).
GENERAL BANACH-SPACE METHODS OF GRADIENT TYPE, SEC. 4.6
4.6. A RANGE FUNCTION ALONG THE LINE
We shall now describe another way of selecting t_n by making use of a function g(x, t, p) which will determine the range of values t can assume. The method is similar to that in Theorem 4.5.3 except that a different measure of the distance to be moved is used. The main idea is to pick t_n to guarantee that the decrease in f dominates t_n d(⟨-∇f(x_n), p_n⟩), as discussed in Section 4.2. We shall determine admissible values of t_n in terms of the range function

g(x, t, p) = [f(x) - f(x + tp)] / (-t⟨∇f(x), p⟩)

which is continuous at t = 0 if we define g(x, 0, p) = 1. We shall assume that an admissible sequence of directions p_n is given satisfying ||p_n|| = 1. Given a number δ satisfying 0 < δ < ½ and a forcing function d satisfying d(t) ≤ δt,
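The range function can be evaluated directly; the following is a minimal numerical sketch (the function names are ours, not the text's) for a quadratic on R², where along the steepest-descent direction from x = (1, 0) one finds g(x, t, p) = 1 - t/2 exactly.

```python
import numpy as np

def range_function(f, grad_f, x, t, p):
    """g(x, t, p) = [f(x) - f(x + t*p)] / (-t * <grad f(x), p>).

    For a descent direction p and small t > 0, g is near 1; the conditions
    of this section ask that g neither fall below d(gamma)/gamma nor exceed
    1 - d(gamma)/gamma."""
    return (f(x) - f(x + t * p)) / (-t * np.dot(grad_f(x), p))

# Quadratic example: f(x) = ||x||^2 / 2, grad f(x) = x.
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x = np.array([1.0, 0.0])
p = np.array([-1.0, 0.0])   # unit steepest-descent direction at x
```

Here g(x, 1, p) = 1/2, the boundary value for δ = 1/2, which is why the text restricts δ to (0, 1/2).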
we shall attempt to move from x_n to x_{n+1} = x_n + t_n p_n as follows: if, for t = 1, we find

g(x_n, 1, p_n) ≥ d(⟨-∇f(x_n), p_n⟩)/⟨-∇f(x_n), p_n⟩    (4.6.1)

then we set t_n = 1; otherwise we choose t_n ∈ (0, 1) satisfying Equation 4.6.1 and also

g(x_n, t_n, p_n) ≤ 1 - d(⟨-∇f(x_n), p_n⟩)/⟨-∇f(x_n), p_n⟩    (4.6.2)

THEOREM 4.6.1. Under these conditions {x_n} is a criticizing sequence and

f(x_n) - f(x_{n+1}) ≥ λ_n d(⟨-∇f(x_n), p_n⟩)

where λ_n = 1 if t_n = 1 and λ_n = s(d(⟨-∇f(x_n), p_n⟩)) if t_n ≠ 1, where s is the reverse modulus of continuity of ∇f.

Proof: By Equation 4.6.1, f(x_n) is decreasing and

f(x_n) - f(x_{n+1}) ≥ t_n d(⟨-∇f(x_n), p_n⟩)    (4.6.3)

If t = 1 does not satisfy Equation 4.6.1, then t_n ∈ (0, 1). For these n, we write, by the mean-value theorem,

f(x_{n+1}) - f(x_n) = ⟨∇f(x_n + λ_n t_n p_n), t_n p_n⟩ for some λ_n ∈ (0, 1)

Thus, from Equation 4.6.2,

d(⟨-∇f(x_n), p_n⟩) ≤ ⟨-∇f(x_n), p_n⟩[1 - g(x_n, t_n, p_n)] = ⟨∇f(x_n + λ_n t_n p_n) - ∇f(x_n), p_n⟩ ≤ ||∇f(x_n + λ_n t_n p_n) - ∇f(x_n)||

and hence

t_n = ||x_{n+1} - x_n|| ≥ ||λ_n t_n p_n|| ≥ s[||∇f(x_n + λ_n t_n p_n) - ∇f(x_n)||] ≥ s[d(⟨-∇f(x_n), p_n⟩)]    (4.6.4)

Hence, using Equation 4.6.3, we conclude that

f(x_n) - f(x_{n+1}) ≥ d(⟨-∇f(x_n), p_n⟩)s[d(⟨-∇f(x_n), p_n⟩)]

Thus f(x_n) - f(x_{n+1}) ≥ λ_n d(⟨-∇f(x_n), p_n⟩) as asserted, which implies, as before, ⟨-∇f(x_n), p_n⟩ → 0. Q.E.D.
Computationally one needs a procedure for computing a t_n ∈ (0, 1) satisfying Equations 4.6.1 and 4.6.2 if t_n = 1 does not satisfy Equation 4.6.1. We consider doing this [Armijo (1966), Elkin (1968)] by successively trying the values

t_n = a, a², a³, ..., for some a ∈ (0, 1)

THEOREM 4.6.2. Under the hypotheses of Theorem 4.6.1, t_n may be chosen as the first of the numbers a⁰, a¹, a², ... satisfying Equation 4.6.1, and then {x_n} is a criticizing sequence,

f(x_n) - f(x_{n+1}) ≥ λ_n d(⟨-∇f(x_n), p_n⟩)

where

λ_n = 1 if t_n = 1, λ_n = a s[(1 - δ)⟨-∇f(x_n), p_n⟩] if t_n ≠ 1.

Proof. As in the previous theorem, t_n = 1 yields no problem. In the other case, we have x_{n+1} = x_n + a^j p_n, j ≥ 1. Let x'_n = x_n + a^{j-1}p_n. Then, since a^{j-1} failed to satisfy Equation 4.6.1, we have

f(x_n) - f(x'_n) < ||x'_n - x_n|| d(⟨-∇f(x_n), p_n⟩)

while

f(x_n) - f(x_{n+1}) ≥ ||x_{n+1} - x_n|| d(⟨-∇f(x_n), p_n⟩)

Therefore,

f(x_{n+1}) - f(x'_n) < (1 - a)||x'_n - x_n|| d(⟨-∇f(x_n), p_n⟩)

We can write, by the mean-value theorem,

f(x'_n) - f(x_n) = ⟨∇f[λ̄_n x_n + (1 - λ̄_n)x'_n], x'_n - x_n⟩ for some λ̄_n ∈ (0, 1)

This leads to

⟨∇f[λ̄_n x_n + (1 - λ̄_n)x'_n], p_n⟩ > -d(⟨-∇f(x_n), p_n⟩)

Hence, since d(t) ≤ δt,

||∇f[λ̄_n x_n + (1 - λ̄_n)x'_n] - ∇f(x_n)|| ≥ ⟨∇f[λ̄_n x_n + (1 - λ̄_n)x'_n] - ∇f(x_n), p_n⟩ > (1 - δ)⟨-∇f(x_n), p_n⟩    (4.6.5)

We then have

||x_{n+1} - x_n|| = a||x'_n - x_n|| ≥ a||λ̄_n x_n + (1 - λ̄_n)x'_n - x_n|| ≥ a s[(1 - δ)⟨-∇f(x_n), p_n⟩]
Therefore, from this and from Equation 4.6.1 we have

f(x_n) - f(x_{n+1}) ≥ a s[(1 - δ)⟨-∇f(x_n), p_n⟩] d(⟨-∇f(x_n), p_n⟩)

which implies that ⟨-∇f(x_n), p_n⟩ → 0. Q.E.D.
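The successive-trial rule of Theorem 4.6.2 can be sketched in a few lines for the simplest choice d(t) = δt (the function names are ours; this is an illustration, not the book's program):

```python
import numpy as np

def armijo_step(f, grad_f, x, p, delta=0.25, a=0.5, max_trials=60):
    """Return the first t among 1, a, a^2, ... satisfying
    f(x) - f(x + t*p) >= delta * t * <-grad f(x), p>,
    i.e. Equation 4.6.1 with the forcing function d(t) = delta * t."""
    gamma = -np.dot(grad_f(x), p)   # <-grad f(x), p>, positive for a descent direction
    t = 1.0
    for _ in range(max_trials):
        if f(x) - f(x + t * p) >= delta * t * gamma:
            return t
        t *= a
    return t
```

On f(x) = ½||x||² with p = -∇f(x), the full step t = 1 is accepted at once, while a quadratic with larger curvature forces one or more reductions by the factor a.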
In particular, one can consider this algorithm with d(t) = δt and, instead of Equation 4.6.2, the stronger condition

g(x_n, t_n, p_n) ≤ 1 - δ

This method has been considered often [Goldstein (1964b, 1965, 1966, 1967)].
The two theorems above can be extended somewhat. For example, rather than demanding, in Theorem 4.6.1, that ||p_n|| = 1, suppose we assume that

||p_n|| ≥ d₁(⟨-∇f(x_n), p_n/||p_n||⟩)

for some forcing function d₁, and that

⟨-∇f(x_n), p_n/||p_n||⟩

tends to zero whenever

d(⟨-∇f(x_n), p_n/||p_n||⟩)||p_n||

tends to zero.

EXERCISE. Show that the latter condition immediately above is valid, for example, if ||p_n|| is bounded above or d(t) = qt, q ≠ 0.
Looking at the proof of Theorem 4.6.1, we see that under these conditions Equation 4.6.3 with t_n = 1 becomes

f(x_n) - f(x_{n+1}) ≥ d(⟨-∇f(x_n), p_n⟩) = d(⟨-∇f(x_n), p_n/||p_n||⟩ ||p_n||)

so that either ⟨-∇f(x_n), p_n/||p_n||⟩ or ||p_n|| ≥ d₁(⟨-∇f(x_n), p_n/||p_n||⟩) must tend to zero, yielding ||∇f(x_n)|| → 0. For t_n ∈ (0, 1), Equation 4.6.4 becomes instead

t_n||p_n|| ≥ s[d(⟨-∇f(x_n), p_n⟩)/||p_n||]
and thus

f(x_n) - f(x_{n+1}) ≥ s[d(⟨-∇f(x_n), p_n⟩)/||p_n||] d(⟨-∇f(x_n), p_n⟩)/||p_n||

which implies that d(⟨-∇f(x_n), p_n/||p_n||⟩)||p_n|| and thereby ||∇f(x_n)|| tends to zero. Thus we have proved the following corollary.

COROLLARY 4.6.1. Theorem 4.6.1 is valid [except for the bound on f(x_{n+1})] with the assumption that ||p_n|| = 1 being replaced by

1. ||p_n|| ≥ d₁(⟨-∇f(x_n), p_n/||p_n||⟩), and

2. ⟨-∇f(x_n), p_n/||p_n||⟩ → 0 whenever d(⟨-∇f(x_n), p_n/||p_n||⟩)||p_n|| → 0.

In the situation of Theorem 4.6.2 we find similarly that

||x_{n+1} - x_n|| ≥ a s[(1 - δ)⟨-∇f(x_n), p_n/||p_n||⟩]

and

f(x_n) - f(x_{n+1}) ≥ a s[(1 - δ)⟨-∇f(x_n), p_n/||p_n||⟩] d(⟨-∇f(x_n), p_n/||p_n||⟩)

If, in addition, c||∇f(x_n)|| ≤ ||p_n|| for some c > 0, then ||x_{n+1} - x_n|| → 0, since t_n is bounded; this is true in particular for p_n = -∇f(x_n), as we saw above.
General references: Altman (1966a), Elkin (1968), Goldstein (1964b, 1965, 1966, 1967).

4.7. SEARCH METHODS ALONG THE LINE
In actual computation it is of course necessary to deal with discrete data; this means, for example, that one cannot generally minimize f(x_n + tp_n) over all t > 0 but only over some discrete set of t-values. In this section we shall indicate how, in some cases, we can guarantee convergence for practical, computationally convenient choices of step size. For theoretical analysis, we shall restrict ourselves to strictly unimodal functions, that is, to those that have a unique minimizing point along each straight line; from Section 1.5 we know that this is equivalent to strong quasi-convexity.

EXERCISE. Prove the equivalence of strict unimodality and strong quasi-convexity as asserted above.

This equivalence implies that if we have three t-values t₁ < t₂ < t₃ such that f(x + t₂p) < f(x + t₁p) and f(x + t₂p) < f(x + t₃p), then f(x + tp) is minimized at a value of t between t₁ and t₃.

EXERCISE. Prove the preceding assertion concerning the location of the t-value minimizing the strictly unimodal function f(x + tp).
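The bracketing fact just stated is easy to check numerically; in the sketch below (our names) phi(t) stands for f(x + tp) along a fixed line.

```python
def brackets_minimum(phi, t1, t2, t3):
    """For a strictly unimodal phi, the minimizer lies between t1 and t3
    whenever the middle value is below both end values."""
    return t1 < t2 < t3 and phi(t2) < phi(t1) and phi(t2) < phi(t3)

# Unimodal restriction of f to a line, with line minimizer t* = 1.3.
phi = lambda t: (t - 1.3) ** 2
```

Once such a triple is found, repeatedly shrinking the bracket (as in the search routine of this section) localizes the line minimizer without any derivative information.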
We combine this fact with Theorem 4.4.2 for α_n ≡ α = 0 to prove the following.

THEOREM 4.7.1. Let f be strongly quasi-convex and bounded below on W(x₀), let ∇f be uniformly continuous on W(x₀), and let {p_n} define an admissible direction sequence. Suppose that for each n there are values t_{n,1}, ..., t_{n,k_n} such that
f(x_n) > f(x_n + t_{n,1}p_n) > ... > f(x_n + t_{n,k_n}p_n), f(x_n + t_{n,k_n+1}p_n) ≥ f(x_n + t_{n,k_n}p_n)

and t_{n,k_n-1}/t_{n,k_n+1} ≥ λ for some fixed λ, 0 < λ < 1. Then Theorem 4.4.2 with d(t) = λt implies our theorem for t_n = t_{n,k_n-1}.

COROLLARY 4.7.1. If the t-values are equally spaced, t_{n,i} = ih_n, then k_n ≥ 2 is sufficient to guarantee that t_n = (k_n - 1)h_n or t_n = k_nh_n will make {x_n} a criticizing sequence.

Proof. In this case,

t_{n,k_n-1}/t_{n,k_n+1} = (k_n - 1)/(k_n + 1) ≥ 1/3    Q.E.D.
COROLLARY 4.7.2 [Cea (1969)]. Under the hypotheses of Theorem 4.7.1, if in addition f is convex and for each n there is an h_n > 0 such that

f(x_n + h_np_n) < f(x_n + 2h_np_n) ≤ f(x_n)

then t_n = h_n makes {x_n} a criticizing sequence, and f(x_n) - f(x_{n+1}) ≥ ½s(cγ_n)(1 - c)γ_n for all c in (0, 1), with

γ_n = ⟨-∇f(x_n), p_n/||p_n||⟩

Proof. The point t'_n providing the global minimum for f(x_n + tp_n) must satisfy 0 ≤ t'_n ≤ 2h_n, and of course t_n = t'_n would yield a criticizing sequence. Since f is convex, for 0 ≤ t ≤ h_n we have

f(x_n + tp_n) ≥ f(x_n + h_np_n) + [f(x_n + h_np_n) - f(x_n + 2h_np_n)](h_n - t)/h_n ≥ 2f(x_n + h_np_n) - f(x_n + 2h_np_n) ≥ 2f(x_n + h_np_n) - f(x_n)

while, arguing similarly for h_n ≤ t ≤ 2h_n, we deduce

f(x_n + tp_n) ≥ 2f(x_n + h_np_n) - f(x_n)

Setting t = t'_n thus gives

f(x_n + t'_np_n) ≥ 2f(x_n + h_np_n) - f(x_n)
and therefore

f(x_n) - f(x_n + h_np_n) ≥ ½[f(x_n) - f(x_n + t'_np_n)]

The theorem follows from part B of Theorem 4.2.4 with β = ½. Q.E.D.

We can now give a practical algorithm of a search type to locate a suitable value of t_n. We assume that the algorithm is entered with a point x_n, direction p_n, and a number h > 0 given. We write in a pseudo-ALGOL language for convenience.
Search routine [Cea (1969)]

    if f(x_n + hp_n) < f(x_n) then go to first;
reduce:
    h := h/2;
    if f(x_n + hp_n) ≥ f(x_n) then go to reduce;
first:
    if f(x_n + 2hp_n) ≥ f(x_n + hp_n) then EXIT FROM ROUTINE NOW WITH t_n = h;
    if f IS CONVEX then EXIT FROM ROUTINE NOW WITH t_n = 2h;
    t := 2h;
oldway:
    while f(x_n + (t + h)p_n) < f(x_n + tp_n) do t := t + h;
    EXIT FROM ROUTINE NOW WITH t_n = t;

Let a₁ > 0 and a₂ > 0 be chosen, choose t_n to satisfy a₁ ≤ t_n ≤ ..., and set x_{n+1} = x_n - t_n∇f(x_n). Then x_n → x*, the unique point minimizing f over E, starting at any x₀. Given any ε > 0 there exists an N such that for n > N,

||x* - x_{n+1}|| ≤ ...
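The halve-then-advance control flow of the Cea search routine above can be sketched as runnable code (this is our reconstruction of the routine's logic, with phi(t) standing for f(x_n + tp_n)):

```python
def cea_search(phi, h, max_halvings=50):
    """Halve h until phi(h) < phi(0); then advance t in steps of h while phi
    keeps decreasing.  The returned t satisfies phi(t + h) >= phi(t), so
    (t - h, t, t + h) brackets the line minimizer for unimodal phi."""
    halvings = 0
    while phi(h) >= phi(0) and halvings < max_halvings:
        h *= 0.5
        halvings += 1
    t = h
    while phi(t + h) < phi(t):
        t += h
    return t
```

For phi(t) = (t - 1)² and an initial h = 4, the routine halves down to h = 1 and exits with t = 1, which here is exactly the line minimizer; for a small initial h it walks forward in equal steps to the same point.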
Proof: As in Theorem 4.2.4, we easily find

f(x_n) - f(x_n + t_np_n) ≥ t_n||p_n||[γ_n - c₂(γ_n)] ≥ c₁(γ_n)[γ_n - c₂(γ_n)]

where

γ_n = ⟨-∇f(x_n), p_n/||p_n||⟩

If t_n = t''_n, we then have f(x_n) - f(x_{n+1}) ≥ c₁(γ_n)[γ_n - c₂(γ_n)]. If t_n = t'_n, then t'_n||p_n|| ≤ t''_n||p_n||, and, arguing as for t''_n, we get

f(x_n) - f(x_n + t'_np_n) ≥ t'_n||p_n||[γ_n - c₂(γ_n)] ≥ d₂(γ_n)d₁(||p_n||)[γ_n - c₂(γ_n)]

Thus γ_n → 0 or ||p_n|| → 0; since ||p_n|| = ||x̄_n - x_n|| and ||∇f(x_n)|| are bounded, this gives ..., which implies t_n = t''_n and thus t_n ≥ t'_n; in the second case, clearly t_n ≥ t'_n. In either case, by the defining property of t_n and the fact that t_n ≥ t'_n, we have

f(x_n) - f(x_n + t_np_n) ≥ f(x_n) - f(x_n + t'_np_n) + α_n(t_n - t'_n)⟨-∇f(x_n), p_n⟩ ≥ f(x_n) - f(x_n + t'_np_n)

so that the theorem follows from part 2 of Theorem 4.9.2 with β = 1. Q.E.D.

EXERCISE. Fill in the details in the above Proof.
Remark. Setting α_n ≡ α = 0 yields the usual method.

THEOREM 4.9.4. Let f and C be as in Theorem 4.9.3 and let C be norm-closed. Let t_n be either: (1) the smallest positive t providing a local minimum for

f(x_n + tp_n) - α_n t⟨∇f(x_n), p_n⟩

over the set of t such that x_n + tp_n ∈ C, t ≥ 0; or (2) the first positive root r of

⟨∇f(x_n + tp_n), p_n⟩ - α_n⟨∇f(x_n), p_n⟩ = 0

if x_n + rp_n ∈ C, otherwise t_n = sup{t; x_n + tp_n ∈ C}; or (3) the following:

t_n = sup{t; ...}

... The theorem then follows with d₁(t) = t, d₂(t) = d(t), β = 1. Q.E.D.

EXERCISE. Fill in the details in the above Proof.
As in the unconstrained case, if ∇f is Lipschitz-continuous, then, while the above theorems define a range of t-values leading to ⟨-∇f(x_n), p_n/||p_n||⟩ → 0, it is possible to double the size of this range by a more careful analysis.

THEOREM 4.9.6. Let f be bounded below on C, let ∇f satisfy

||∇f(x) - ∇f(y)|| ≤ L||x - y||

for all x, y in C. For each n let x_{n+1} = x_n + t_np_n, where t_n is defined via

t_n = min{1, γ_n/(L||p_n||²)}, γ_n = ⟨-∇f(x_n), p_n⟩

Then f(x_n) decreases to a limit. If ||p_n|| is uniformly bounded, for example, if C is bounded, then

lim ⟨-∇f(x_n), p_n⟩ = 0

If ||p_n|| → 0 implies ⟨-∇f(x_n), p_n/||p_n||⟩ → 0, then

lim ⟨-∇f(x_n), p_n/||p_n||⟩ = 0
Proof. f(x_{n+1}) - f(x_n) ≤ ...; there exists a' > 0 such that f(x_n) - f(x_{n+1}) ≥ a' for either r = 1 or r = 2, which implies lim ... = 0. Since

... ≤ ... + M||x'_n - x*||[...]^{1/2}

using Equation 4.10.2 and the positive-definiteness of A_n, we conclude ... → 0, where f'_x denotes the derivative of f at x. If x'_n is chosen to minimize g_n(x), then

⟨f'_{x_n}, x'_n - x_n⟩ ≤ 0

for convex f, implying that p_n = x'_n - x_n is a direction of nonincreasing f-values, as needed.
THEOREM 4.10.3. Let f be convex, bounded below on the norm-closed, bounded convex set C, and attain its minimum over C at x*; let f'_x exist in C and ||f'_x|| ≤ B, B > 0, for all x in C. Let {x_n} be a sequence in C such that lim ⟨-f'_{x_n}, p_n⟩ = 0, where p_n = x'_n - x_n and x'_n minimizes

g_n(x) = f(x_n) + ⟨f'_{x_n}, x - x_n⟩

over C. Then lim f(x_n) = f(x*).

Proof: By the definition of x'_n, as we saw above, we have

⟨x - x'_n, f'_{x_n}⟩ ≥ 0 for all x in C ...

since ... ≥ 0 and ||s(x, t)|| ≤ Kt² for some K. Thus only a small perturbation of the linear motion keeps us in C. Algorithms have been given for determining t-values, and convergence proofs are known. The methods for computing s(x, t), however, are very complex and do not appear to lend themselves to practical computation; therefore, we consider the method no further.

One further type of method for constrained problems which we wish to consider is the penalty-function method. We have met this approach before in Sections 3.2 and 3.3 in a more specialized form. In fact, the whole approach fits into the discretization analysis if one makes some extensions in those results, but this adds but little to the general applicability of those theorems; therefore, we treat the penalty-function method briefly in the more classical fashion. We seek to minimize f(x) over C = {x; g(x) ≤ 0}, where g is some nonlinear functional. Instead, we shall approximately minimize f(x) + P_p[g(x)] over E, where the penalty functions P_p are such that, for t > 0,

lim_{p→∞} P_p(t) = ∞, uniformly for t ≥ δ > 0 for all δ > 0

Thus P_p will penalize us for having an x with g(x) > 0.

EXERCISE. Give some examples of penalty functions that satisfy the conditions immediately above.
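One standard family answering the exercise (our choice of example, not the text's) is P_p(t) = p·max(t, 0)², which vanishes on the feasible side and, for each δ > 0, grows without bound uniformly on t ≥ δ as p → ∞:

```python
def penalty(p, t):
    """P_p(t) = p * max(t, 0)**2: zero for t <= 0, and P_p(t) -> infinity
    as p -> infinity, uniformly for t >= delta for each delta > 0."""
    return p * max(t, 0.0) ** 2

# Illustration: minimize f(x) = x over C = {x : g(x) <= 0} with g(x) = -x,
# i.e. over x >= 0.  The penalized objective x + P_p(-x) is minimized at
# x_p = -1/(2p), which tends to the constrained minimizer 0 as p -> infinity.
def penalized(p, x):
    return x + penalty(p, -x)
```

Note that each x_p is slightly infeasible (g(x_p) = 1/(2p) > 0), which is why the notion of a correct constraint, introduced next, is needed to conclude d(x_p, C) → 0.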
What we can hope will occur, then, is that our computed sequence x_p will satisfy

lim sup_{p→∞} g(x_p) ≤ 0

This, however, is not enough in general to guarantee that

d(x_p, C) = inf_{x∈C} ||x_p - x||

is tending to zero.

DEFINITION 4.11.1 [Levitin-Poljak (1966a)]. The constraint defined by g is called correct if lim sup_{p→∞} g(x_p) ≤ 0 implies lim_{p→∞} d(x_p, C) = 0.

EXERCISE. Find some explicit conditions under which constraints are correct.
THEOREM 4.11.1. Let g define a correct constraint; for some ε > 0 let |f(x) - f(y)| ≤ L||x - y|| if d(x, C) ≤ ε and d(y, C) ≤ ε; let P_p[g(x)] ≥ 0 for all x ∈ E; let lim P_p(t) = ∞ for t > 0, uniformly for t ≥ δ > 0 for all δ > 0; and let lim P_p[g(x)] = 0 for all x ∈ C₀, a dense subset of C. Define

m_p = inf_{x∈E} {f(x) + P_p[g(x)]}, m = inf_{x∈C} f(x)

and assume inf_{x∈E} f(x) = m̄ > -∞. For a sequence ε_p > 0, ε_p → 0, let x_p ∈ E satisfy

f(x_p) + P_p[g(x_p)] ≤ m_p + ε_p

Then {x_p} is an approximate minimizing sequence for f over C in the sense of Definition 1.6.1.
Proof: Let w_p ∈ C, lim f(w_p) = m. Since f is continuous, there exists w'_p ∈ C₀ with

|f(w'_p) - f(w_p)| ≤ ...

... Since {E(x_n)} is a decreasing sequence bounded below by zero, we write

⟨x_m - x_n, N(x_m - x_n)⟩ = E(x_m) - E(x_n) + 2⟨M*Hr_n, x_m - x_n⟩

Since n > m, x_m - x_n is in B_n; since x_n minimizes E(x) over B_n, ∇E(x_n), which equals -2M*Hr_n, is orthogonal to B_n and therefore to x_m - x_n. Thus we conclude that

⟨x_m - x_n, N(x_m - x_n)⟩ = E(x_m) - E(x_n)

which tends to zero, since {E(x_n)} is a Cauchy sequence. Since N is positive-definite, {x_n} is a Cauchy sequence and there exists x' ∈ B such that x_n → x'. By a continuity argument we find, setting r' = k - Mx', that ⟨M*Hr', z⟩ = 0 for all z ∈ B. Since again ∇E(x') = -2M*Hr', this implies that x' minimizes E over B. Q.E.D.

EXERCISE. Prove that ⟨M*Hr', z⟩ = 0 for all z ∈ B, where r' is defined in the Proof of Theorem 5.2.1 above.
As a practical matter, the minimization is easier if B_n is finite-dimensional, if, say, B_n is spanned by the linearly independent vectors p₀, p₁, ..., p_{n-1} for all n. Then of course x_n = Σ_j a_{n,j}p_j. It would be convenient if the a_{n,j} were independent of n, so that x_{n+1} = x_n + a_np_n. It is a simple matter to show that this is the case if and only if ⟨p_i, Np_j⟩ = 0 for i = 0, 1, ..., j - 1 [Antosiewicz-Rheinboldt (1962)]. We include the proof in one direction as a part of the following.
CONJUGATE-GRADIENT AND SIMILAR METHODS IN HILBERT SPACE, SEC. 5.2
THEOREM 5.2.2. Let {p_i} be a sequence of linearly independent vectors satisfying

⟨p_i, Np_j⟩ = 0 if i ≠ j

and let x₀ be arbitrary. Let

x_{n+1} = x_n + c_np_n, c_n = ⟨M*Hr_n, p_n⟩/⟨p_n, Np_n⟩, r_n = k - Mx_n

Let B_n be spanned by p₀, ..., p_{n-1} and let B = closure of ∪B_n. Then x_n → x' minimizing E over B.
Proof: For i ≤ n - 1,

⟨M*Hr_{n+1}, p_i⟩ = ⟨M*Hr_n, p_i⟩ - c_n⟨Np_n, p_i⟩ = ...

... and for any such a, A, we have

a ≤ μ(p_i) ≤ A, a ≤ μ(Kg_i) ≤ A, and a ≤ ν(g_i)

so that

E(x_i) - E(x_{i+1}) = c_i⟨M*Hr_i, p_i⟩ = E(x_i)c_iν(g_i)

The estimates of the theorem follow from c_i ≥ 1/A, ν(g_i) ≥ a. If K and N commute, then

c_iν(g_i) = [Kg_i, Kg_i]² / ([Kg_i, TKg_i][Kg_i, T⁻¹Kg_i])

where [x, y] = ⟨x, K⁻¹y⟩. It is easy to see that T is self-adjoint and positive-definite relative to [·, ·] with spectral bounds a, A; thus

c_iν(g_i) ≥ 4aA/(A + a)²

by the inequality of Kantorovich [Faddeev-Faddeeva (1963), Kantorovich (1948)]. Now let β > 0 be the lower spectral bound for N. Then

β||x_n - x'||² ≤ E(x_n)
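The Kantorovich inequality used above is easy to check numerically; in this sketch (our setup, not the book's) T is diagonal with spectral bounds a = 1, A = 5, and equality is attained by a vector splitting its weight equally between the extreme eigenvectors.

```python
import numpy as np

T = np.diag([1.0, 2.0, 5.0])        # self-adjoint positive-definite, bounds a = 1, A = 5
T_inv = np.diag([1.0, 0.5, 0.2])
a, A = 1.0, 5.0
kantorovich_bound = 4 * a * A / (A + a) ** 2

def ratio(x):
    """<x, x>^2 / (<Tx, x> <T^{-1}x, x>), bounded below by 4aA/(A + a)^2."""
    return np.dot(x, x) ** 2 / ((x @ T @ x) * (x @ T_inv @ x))
```

Eigenvectors give ratio 1; the vector (1, 0, 1) attains the bound 4·1·5/36 exactly, showing the estimate cannot be improved for general directions.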
In some cases the method can be shown to converge even when a = 0, but examples are known in which we then have ||x_n - x*|| ≥ (λn)⁻¹ for some λ > 0, showing that no geometric convergence rate is possible [Odloleskal (1969), Poljak (1969a)]. General references: Antosiewicz-Rheinboldt (1962), Daniel (1965, 1967b), Hayes (1954), Hestenes (1956), Hestenes-Stiefel (1952).

5.4. CONJUGATE GRADIENTS AS AN OPTIMAL PROCESS

Much-improved bounds on the convergence rate can be obtained by viewing the conjugate-gradient method in a different light, one which shows more clearly the great power of the method, as opposed, say, to the steepest-descent method, which also has a convergence factor like (A - a)/(A + a).
Suppose we seek to solve Mx = k, that is, M*HMx = M*Hk, by some sort of gradient method; for more generality we allow ourselves to multiply gradients also by an operator K, where M, H, N, K, T are as defined earlier. If at each step we allow ourselves to make use of all previous information, we are led to consider iterations of the form

x_{n+1} = x₀ + P_n(T)T(h - x₀), h = M⁻¹k    (5.4.1)

where P_n(λ) is a polynomial of degree less than or equal to n. If we should by chance have x₀ = h, we would want x_n = h for all n; this leads, since h should be considered arbitrary, to the requirement of the form of Equation 5.4.1.

We wish to use methods of spectral analysis to discuss such methods, so we are forced to assume that

N = ρ(T)

where ρ(λ) is a positive function continuous on some neighborhood of the spectrum of T. As we shall later see, this is satisfied in the practical methods, where usually ρ(λ) = λ or ρ(λ) ≡ 1. For each n, we wish to choose P_n so that E(x_{n+1}) is the least possible under any method of the form of Equation 5.4.1. According to the spectral theorem, we can write

E(x_{n+1}) = ∫ₐ^A λρ(λ)[1 - λP_n(λ)]² ds(λ)    (5.4.2)

where s(λ) is a known increasing function. The fact that there is a polynomial P_n(λ) yielding this least value follows from a straightforward generalization
[Daniel (1965, 1967b)] of the theorem in finite dimensions as proved in Stiefel (1954, 1955).
PROPOSITION 5.4.1. The error measure E(x_{n+1}) is minimized by setting 1 - λP_n(λ) to be the (n + 1)st element of the set of polynomials R_i(λ), orthogonal on [a, A] relative to the weight function λρ(λ) ds(λ), satisfying R_i(0) = 1.

EXERCISE. Prove Proposition 5.4.1.
We shall now show that, for each n, the vectors generated by the conjugate-gradient method are precisely those generated by this optimal process.

THEOREM 5.4.1. For each n, the vector x_n generated by the conjugate-gradient (CG) method coincides with that generated by the optimal process of the form in Equation 5.4.1.

Proof: Given n, the vectors p₀, ..., p_{n-1} in the CG method are independent. Since p₀ = Kg₀ and p_{i+1} = Kg_{i+1} + b_ip_i, it is clear that any linear combination of p₀, ..., p_{n-1} can be written as a linear combination of Kg₀, ..., Kg_{n-1}. Thus the n vectors Kg₀, ..., Kg_{n-1} span at least the n-dimensional space B_n = sp[p₀, ..., p_{n-1}]; hence B_n = sp[Kg₀, ..., Kg_{n-1}].

Now Kg₀ = T⁰Kg₀; assume that for j ≤ i, Kg_j can be written as a linear combination of T⁰Kg₀, T¹Kg₀, ..., T^jKg₀. Then

Kg_{i+1} = K(g_i - c_iNp_i) = Kg_i - c_iTp_i

We can write p_i as a linear combination of Kg₀, ..., Kg_i, each of which, by the inductive assumption, is a linear combination of T⁰Kg₀, ..., T^iKg₀. Therefore, Kg_{i+1} is a linear combination of T⁰Kg₀, ..., T^{i+1}Kg₀. Reasoning as above, we have

B_n = sp[T⁰Kg₀, ..., T^{n-1}Kg₀]

Now x_n minimizes E(x) on x₀ + B_n if x_n is generated by the CG method. By what we have shown above, this says that the x_n generated by the CG method minimizes E(x) on the set of points

x = x₀ + Σ_{i=0}^{n-1} s_iT^iKg₀ = x₀ + P_{n-1}(T)T(h - x₀)

where P_{n-1}(λ) is the (n - 1)st-degree polynomial Σ_{i=0}^{n-1} s_iλ^i. That is, among all iterations of the form

x_n = x₀ + P_{n-1}(T)T(h - x₀)

the CG method makes E(x_n) the least. Q.E.D.
Thus, if we insert any polynomial into Equation 5.4.2, we can get a bound for E(x_{n+1}), where x_{n+1} is generated by the conjugate-gradient method, since that method gives the least value of E(x_{n+1}). If we choose for comparison 1 - λP_n(λ) as the (n + 1)st Chebyshev polynomial relative to λρ(λ) ds(λ) on [a, A], we find the following bound.

PROPOSITION 5.4.2. Let σ = a/A. Then, for the conjugate-gradient method,

E(x_n) ≤ w_nE(x₀) ≤ 4[(1 - √σ)/(1 + √σ)]^{2n}E(x₀)

and ||x_n - h||² converges to zero at this same rate, where

w_n = 4/(c^{-n} + c^n)², c = (1 - √σ)/(1 + √σ)

EXERCISE. Prove Proposition 5.4.2.
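The gap between the two convergence factors discussed here is easy to quantify; the helper names below are ours:

```python
import math

def steepest_descent_factor(sigma):
    """(1 - sigma)/(1 + sigma), with sigma = a/A."""
    return (1 - sigma) / (1 + sigma)

def conjugate_gradient_factor(sigma):
    """(1 - sqrt(sigma))/(1 + sqrt(sigma)), the Chebyshev factor of
    Proposition 5.4.2."""
    s = math.sqrt(sigma)
    return (1 - s) / (1 + s)
```

For sigma = 0.01, for instance, the factors are roughly 0.980 and 0.818; since the error bound involves the square of the factor per step, the conjugate-gradient bound gains an order of magnitude about every six steps where the steepest-descent bound needs on the order of a hundred.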
By this result we have reduced our estimate of the convergence factor from (1 - σ)/(1 + σ) to at least (1 - √σ)/(1 + √σ). When one uses the steepest-descent algorithm to solve Mx = k by minimizing E, one moves from x_n to x_{n+1} in the direction M*Hr_n. Therefore, the steepest-descent method has the form of Equation 5.4.1 and, therefore, reduces the error E(x) by less than the conjugate-gradient method for every n. Since the best-known and in certain cases best possible convergence estimates for steepest descent [Akaike (1959)] are of the form (1 - σ)/(1 + σ), while we have at least (1 - √σ)/(1 + √σ), we see that the convergence of the conjugate-gradient method is also asymptotically better.

For clarity, we now state the form that the conjugate-gradient algorithm takes in certain special cases. The iteration takes its simplest form in the case in which the operator M is itself positive-definite and self-adjoint; it was this case for which the method was originally developed. Here we may now take H = M⁻¹ and K = I. Thus

N = T = M, E(x) = ⟨h - x, M(h - x)⟩

Since N = T, we have ρ(λ) = λ, and the analysis of this section applies. The iteration becomes as follows:
Given x₀, let p₀ = r₀ = k - Mx₀. For n = 0, 1, ..., let

c_n = ||r_n||²/⟨p_n, Mp_n⟩, x_{n+1} = x_n + c_np_n

r_{n+1} = r_n - c_nMp_n, p_{n+1} = r_{n+1} + b_np_n

where

b_n = ||r_{n+1}||²/||r_n||²
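This simplest form is directly runnable in finite dimensions as a model for the Hilbert-space iteration (the names are ours):

```python
import numpy as np

def conjugate_gradient(M, k, x0, tol=1e-12):
    """CG for Mx = k with M symmetric positive-definite
    (the case H = M^{-1}, K = I, so N = T = M)."""
    x = np.array(x0, dtype=float)
    r = k - M @ x
    p = r.copy()
    rr = r @ r
    for _ in range(len(k)):
        if np.sqrt(rr) < tol:
            break
        Mp = M @ p
        c = rr / (p @ Mp)             # c_n = ||r_n||^2 / <p_n, M p_n>
        x = x + c * p
        r = r - c * Mp
        rr_new = r @ r
        p = r + (rr_new / rr) * p     # b_n = ||r_{n+1}||^2 / ||r_n||^2
        rr = rr_new
    return x
```

In exact arithmetic the loop terminates in at most dim steps, consistent with the optimal-process view of this section.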
A second special case which is simple enough for practical use arises from setting H = K = I, so that T = N = M*M. Again ρ(λ) = λ, and we have E(x) = ||r||². Fortunately, for computational purposes one can avoid the actual calculation of M*M and can put the iteration in the following form:

Given x₀, let r₀ = k - Mx₀, p₀ = g₀ = M*r₀. For n = 0, 1, ..., let

c_n = ||g_n||²/||Mp_n||², x_{n+1} = x_n + c_np_n

r_{n+1} = r_n - c_nMp_n, g_{n+1} = M*r_{n+1}

p_{n+1} = g_{n+1} + b_np_n

where

b_n = -⟨Mp_n, Mg_{n+1}⟩/||Mp_n||² = ||g_{n+1}||²/||g_n||²
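The second special case as runnable code (our names); note that M*M never appears explicitly:

```python
import numpy as np

def cg_normal_equations(M, k, x0):
    """Second special case (H = K = I, T = N = M*M): minimizes
    E(x) = ||k - Mx||^2 without forming M*M."""
    x = np.array(x0, dtype=float)
    r = k - M @ x
    g = M.T @ r                         # g_n = M* r_n
    p = g.copy()
    for _ in range(len(k)):
        if np.sqrt(g @ g) < 1e-14:
            break
        Mp = M @ p
        c = (g @ g) / (Mp @ Mp)         # c_n = ||g_n||^2 / ||M p_n||^2
        x = x + c * Mp * 0 + c * p      # x_{n+1} = x_n + c_n p_n
        r = r - c * Mp
        g_new = M.T @ r
        p = g_new + ((g_new @ g_new) / (g @ g)) * p   # b_n = ||g_{n+1}||^2 / ||g_n||^2
        g = g_new
    return x
```

Because the iteration is CG applied to M*Mx = M*k, it applies to nonsymmetric (and in finite dimensions, rectangular) M at the price of squaring the condition number.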
A third special case arises from H = (M*M)⁻¹, K = M*M, so that N = I, T = M*M, ρ(λ) ≡ 1, E(x) = ||h - x||². By some manipulation, the iteration takes the following form:

Given x₀, let r₀ = k - Mx₀, p₀ = M*r₀. For n = 0, 1, ..., let

c_n = ||r_n||²/||p_n||², x_{n+1} = x_n + c_np_n

r_{n+1} = r_n - c_nMp_n, p_{n+1} = M*r_{n+1} + b_np_n

where

b_n = ||r_{n+1}||²/||r_n||²
EXERCISE. Show that the last two algorithms above generate the desired iterates.
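As a numerical check on the exercise, the third special case can be run in finite dimensions (the sketch and names are ours):

```python
import numpy as np

def cg_minimal_error(M, k, x0):
    """Third special case (H = (M*M)^{-1}, K = M*M, N = I, T = M*M):
    minimizes E(x) = ||h - x||^2, h = M^{-1} k."""
    x = np.array(x0, dtype=float)
    r = k - M @ x
    p = M.T @ r
    for _ in range(len(k)):
        rr = r @ r
        if rr < 1e-28:
            break
        c = rr / (p @ p)                    # c_n = ||r_n||^2 / ||p_n||^2
        x = x + c * p
        r = r - c * (M @ p)
        p = M.T @ r + ((r @ r) / rr) * p    # b_n = ||r_{n+1}||^2 / ||r_n||^2
    return x
```

Since this variant minimizes the error ||h - x|| itself rather than the residual, it is the natural choice when the solution, not the residual, is the quantity of interest.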
General references: Daniel (1965, 1967b), Faddeev-Faddeeva (1963).

5.5. THE PROJECTED-GRADIENT VIEWPOINT
It has been widely believed that the CG method exhibits superlinear convergence, that is, that ||x_n - h|| tends to zero faster than any geometric
sequence λⁿ with λ > 0, although the best error estimates in general only yield the factor

(1 - √σ)/(1 + √σ)

If we view the method as one of projecting the gradient direction onto the space conjugate to all preceding directions, we obtain an indication that the convergence might in fact be superlinear; the result we obtain in this way is also needed later for the analysis of nonquadratic functionals. For simplicity of notation, we restrict ourselves to the simplest special case of the CG method with M itself positive-definite and self-adjoint, with N = T = M, K = I. Without loss of generality, we consider the CG iteration starting with a first guess x₀ = 0. Suppose we are given a vector d ≠ 0 such that ⟨d, k⟩ = 0. We define an equivalent inner product [x, y] = ⟨x, My⟩. Then we have [h, d] = 0, that is, h is M-conjugate to d. Let P_d be the orthogonal (in the sense of the inner product [·, ·]) projection onto the linear subspace spanned by d, and let P₁ = I - P_d. Define the Hilbert space ℋ₁ = P₁ℋ with inner product [·, ·], and define the operator M₁ = P₁M in ℋ₁.

EXERCISE. Prove that M₁ is a bounded, self-adjoint, positive-definite linear operator from ℋ₁ onto ℋ₁ and that, therefore, h is the unique solution of the equation

M₁x = k₁ = P₁k

Show that the spectral bounds a₁, A₁ of M₁ are related to those a, A of M by a ≤ a₁ ≤ A₁ ≤ A. Hint: For example, to solve M₁x = k' for k' ∈ ℋ₁, let x' = M⁻¹k' + αM⁻¹d and choose α so that [x', d] = 0; then M₁x' = k' and x = x'.
To solve M₁x = k₁ in ℋ₁ we consider the general form of the CG method obtained by letting

K = M₁, H = M₁⁻², so that N = I, T = M₁

All the theory of the CG method applies here, and we can in particular deduce that

E₁(x_n) ≤ w_{n,1}E₁(x₀)

where

w_{n,1} ≤ 4[(1 - √σ₁)/(1 + √σ₁)]^{2n}, σ₁ = a₁/A₁, E₁(x) = [h - x, h - x]

A straightforward calculation shows that the iterates x_n generated by this general algorithm on M₁ in ℋ₁ are precisely the same as the iterates generated by using the standard simple algorithm on M in ℋ if the initial direction p₀ in the simple algorithm is not chosen as r₀ = k - Mx₀ = k as usual, but by the formula

p₀ = P₁r₀ = r₀ + b₋₁d, b₋₁ = -⟨r₀, Md⟩/⟨d, Md⟩

that is, by the usual way of generating CG directions if we identify d with p₋₁.

EXERCISE. Prove the assertion in the preceding paragraph.

All that the preceding paragraph says is that the standard CG method, modified to require the first direction p₀ to be conjugate to d, is equivalent to a general CG method in a space M-conjugate to d; therefore, the modification of the standard method converges and, in fact, since

E₁(x) = [h - x, h - x] = E(x)

we have

E(x_n) ≤ w_{n,1}E(x₀)

More generally, if we have proceeded through standard CG directions
More generally, if we have proceeded through standard CG directions
p₀, p₁, ..., p_{L-1} to arrive at x_L, then h - x_L is M-conjugate to p_i, 0 ≤ i ≤ L - 1, and we can define P̂ as the orthogonal projection (in the [·, ·] sense) onto the span of {p₀, ..., p_{L-1}}, P₁ = I - P̂, ℋ₁ = P₁ℋ, M₁ = P₁M. Then the remainder of the standard CG iterates are precisely the same as those generated by the more general CG method applied to M₁ in ℋ₁, and, therefore, our convergence estimates can make use of the spectral bounds of M₁ on ℋ₁ rather than of M on ℋ. Since the projections P₁ are "contracting" as we do this analysis after each new standard CG step, the spectral bounds of the operators M₁ might be contracting, allowing a proof of superlinear convergence. While we have not been successful in accomplishing this, it seems a worthwhile approach.

5.6. CONJUGATE GRADIENTS FOR GENERAL FUNCTIONALS
We now wish to consider minimizing a general functional f(x) over a Hilbert space ℋ by some analogue of the conjugate-gradient method. In this case, ∇f(x) plays the role of 2(Mx - k) and f''_x plays the role of 2M. For notational convenience we shall write J(x) = ∇f(x), J'_x = f''_x; we shall also write r_n = -J(x_n), J'_n = J'_{x_n}. Thus, in analogy to the quadratic problem, given x₀, let p₀ = r₀ = -J(x₀); for n = 0, 1, ..., let x_{n+1} = x_n + c_np_n with c_n to be determined; set r_{n+1} = -J(x_{n+1}) and p_{n+1} = r_{n+1} + b_np_n, where

b_n = -⟨r_{n+1}, J'_{n+1}p_n⟩/⟨p_n, J'_{n+1}p_n⟩

If the sequence of vectors p_n that we generate in this manner is admissible, then all the results of Chapter 4 apply to determine the choice of c_n; we consider the admissibility. If we desire

⟨r_n, p_n⟩ ≥ α||r_n||², α > 0

precisely what we need is

b_{n-1}⟨r_n, p_{n-1}⟩ ≥ -(1 - α)||r_n||²

This follows, for example, if

|b_{n-1}| ≤ (1 - α)||r_n||/||p_{n-1}||
for which

⟨r_n, p_n⟩ = ||r_n||² + b_{n-1}⟨r_n, p_{n-1}⟩ ≥ ||r_n||² - |b_{n-1}| ||r_n|| ||p_{n-1}|| ≥ α||r_n||²

and

⟨J(x) - J(y), x - y⟩ ≥ a||x - y||²

as does the error estimate with x = x*, y = x_n. Q.E.D.

This theorem by itself does not indicate any special value for the method; all of the methods of Chapter 4 behave essentially in this fashion. The advantage of the method for quadratic functionals is its rapid convergence rate; we show that, asymptotically, this same rate is obtained in general.

5.7. LOCAL-CONVERGENCE RATES
In examining the local-convergence rate, we discover that estimates can be found simultaneously for a larger class of methods, namely, without
choosing b_n via the conjugacy requirement. We assume instead that ||b_{n-1}p_{n-1}|| ≤ D||r_n|| for some D; then

||p_n||² = ||r_n||² + ||b_{n-1}p_{n-1}||² ≤ (1 + D²)||r_n||²

which yields

⟨r_n, p_n⟩/(||r_n|| ||p_n||) = ||r_n||/||p_n|| ≥ (1 + D²)^{-1/2}

so that the p_n are admissible directions. (This assumption can be weakened via Remark 1 following Theorem 4.2.4.) If we examine the effect of this change on the Proof of Theorem 5.6.1, we find instead that

f(x_{n+1}) ≤ f(x_n) - [1/(A(1 + D²))]||r_n||²

so that the conclusions of the theorem follow. Thus we have proved the following.

THEOREM 5.7.1. Let 0 < aI ≤ J'_x ≤ AI for x ∈ ℋ, J = ∇f. Given x₀, let p₀ = r₀ = -J(x₀). For n = 0, 1, ..., let

x_{n+1} = x_n + c_np_n, ||b_np_n|| ≤ D||r_{n+1}||    (5.7.1)

Then x_n converges to the unique x* minimizing f over ℋ, and

||x_n - x*|| ≤ ...

To determine c_n more precisely, we write

⟨J(x_n + cp_n), p_n⟩ = -||r_n||² + c⟨J'_np_n, p_n⟩ + ...

This gives

⟨J(x_n + cp_n), p_n⟩ ≤ -||r_n||² + c⟨J'_np_n, p_n⟩ + c²B||p_n||³

On the interval ...
for δ > 0 small enough (independent of x₀), and with c_n and b_n determined as described above, it follows that x_n → x*.

Proof.

f(x_n) - f(x_{n+1}) = ⟨J(x_{n+1}), -c_np_n⟩ + ... ≥ c_n||p_n||²[... - c_nα] ...

Because of the lower bound for c_n, if δ is small enough, then

f(x_n) - f(x_{n+1}) ≥ d₁||p_n||²
for some d₁ > 0; hence ||p_n|| tends to zero, and since ||p_n|| ≥ (1 - δD) ||r_n||, also ||r_n|| → 0, implying x_n → x*. Q.E.D.
The above theorem is somewhat similar to Theorem 4.5.1. In order to obtain good estimates of the local-convergence rate, we need to determine c_n more accurately. According to Lemma 5.7.2, c_n is approximately given by
c_n ≈ ||r_n||² / ⟨J'_n p_n, p_n⟩;
then the asymptotic convergence rate, that is, for ε_n small enough, is described as follows: for every m there exists N_m such that for n ≥ N_m we have
E_{n+m}(x_{n+m}) ≤ [ω_m + O(ε_n^{1/(4m-3)})] E_n(x_n),
where ω_m is given in Theorem 5.7.3. When J(x) is linear, we know that
b_n = ||r_{n+1}||² / ||r_n||².
Since this formula does not involve J'_n in any way, it is computationally useful and has been used in practice for general problems; a computer program can be found in Fletcher-Reeves (1964). If b_n satisfies ||b_n p_n|| ≤ D ||r_{n+1}||, then convergence is guaranteed by previous theorems; such an inequality does not appear to be valid in general, however. It can be guaranteed by setting
b_n = min { ||r_{n+1}||² / ||r_n||²,  D ||r_{n+1}|| / ||p_n|| }.
Another way to compute a b_n which is just as convenient from the computational viewpoint as that above, but more easily analyzed, is via the formula [Poljak (1969a)]
b_n = ⟨r_{n+1}, r_{n+1} - r_n⟩ / ||r_n||²,
which is a correct formula for quadratics.

EXERCISE. Prove that the three determinations of b_n, namely
||r_{n+1}||² / ||r_n||²,   ⟨r_{n+1}, r_{n+1} - r_n⟩ / ||r_n||²,   and   -⟨r_{n+1}, J'_n p_n⟩ / ⟨p_n, J'_n p_n⟩,
are equivalent on quadratics.
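On a quadratic the three formulas coincide because successive residuals are orthogonal; the sketch below (function and variable names are ours, not the text's) runs the conjugate-gradient iteration with the Fletcher-Reeves and the Poljak choices of b_n on a small quadratic f(x) = ½⟨x, Mx⟩ - ⟨q, x⟩ and compares the iterates.

```python
import numpy as np

def cg_general(M, q, x0, beta_rule, iters):
    """Conjugate-gradient iteration with a pluggable formula for b_n.

    r_n = q - M x_n is the residual (negative gradient), p_n the direction,
    and c_n the exact minimizing step along p_n for f(x) = 0.5<x,Mx> - <q,x>.
    """
    x = np.asarray(x0, dtype=float)
    r = q - M @ x
    p = r.copy()
    xs = [x.copy()]
    for _ in range(iters):
        Mp = M @ p
        c = (r @ p) / (p @ Mp)          # exact line minimization along p_n
        x = x + c * p
        r_new = q - M @ x
        p = r_new + beta_rule(r, r_new) * p
        r = r_new
        xs.append(x.copy())
    return np.array(xs)

fletcher_reeves = lambda r, r_new: (r_new @ r_new) / (r @ r)
poljak          = lambda r, r_new: (r_new @ (r_new - r)) / (r @ r)

M = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
q = np.array([1.0, 2.0, 3.0])
xs_fr = cg_general(M, q, np.zeros(3), fletcher_reeves, 3)
xs_p = cg_general(M, q, np.zeros(3), poljak, 3)
```

For l = 3 the third iterate already solves Mx = q, and the two b_n rules produce the same trajectory up to rounding.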
For the global-convergence question, suppose that ||∇f(x_n)|| ≥ ε > 0 for all n. Since
||b_n p_n|| = ||p_{n+1} - r_{n+1}||,
the bounds on b_n yield, for n ≥ N, estimates on ||p_n|| in terms of ||p_N||; but then, according to Remark 1 after Theorem 4.2.4, this implies that ||∇f(x_n)|| → 0, a contradiction. Therefore ∇f(x_{n_j}) → 0 for some subsequence. Theorem 4.2.1 applied to this subsequence implies that the subsequence is minimizing, while the inequality f(x_{n+1}) ≤ f(x_n) implies that {x_n} is a minimizing sequence. Q.E.D.
6

GRADIENT METHODS IN R^l

6.1. INTRODUCTION
Since R^l under any norm (all of which are equivalent) is a Banach space, and is in fact a Hilbert space under the usual inner product, all the results of Chapters 4 and 5 apply here. Of course, more detailed results can be obtained for gradient methods in R^l because of the especially simple structure of this space; in this chapter we examine some of these results.

First, because of the finite dimensionality of R^l, the weak and norm topologies coincide, and any closed, bounded set is (sequentially) compact and vice versa; thus the existence theory of Chapter 1 is simplified, the precise simplifications being left to the reader. Second, because of the nature of the topology in R^l, criticizing sequences {x_n} for a functional f are generally more valuable since, if W(x₀) is bounded (see Section 4.2), then limit points x' of {x_n} exist and must be critical points of f; in the following sections we shall examine the consequences of this more closely.

Finally, the asymptotic convergence rates of particular methods can be studied in more detail in R^l; we describe some of these results.

6.2. CONVERGENCE OF x_{n+1} - x_n TO ZERO
We mentioned in Section 4.2, particularly in Theorem 4.2.3, that the convergence of x_{n+1} - x_n to zero could be of great value; in Section 6.3 we shall examine this in some detail. In the present section we shall examine situations in which one can assert that x_{n+1} - x_n does converge to zero. We have already seen in Chapter 4, according to Theorems 4.6.1, 4.6.2, and 4.6.3, that x_{n+1} - x_n tends to zero when c_n is determined by use of
simple intervals along the line. For the methods of Section 4.7 involving a range function along the line, we could not in general prove that x_{n+1} - x_n → 0, as indicated by Theorems 4.7.1 and 4.7.2 and their extended versions in Corollaries 4.7.1 and 4.7.2. As shown in Corollary 4.7.3, more generally, whenever ||p_n|| → 0, where p_n is the direction used, we can assert in many special cases of this general method that ||x_{n+1} - x_n|| → 0. It is not true in general, however, that the algorithms of Sections 4.3, 4.4, and 4.5 involving minimization along the line necessarily yield ||x_{n+1} - x_n|| → 0; contradicting examples can be created. We can, however, show that for many methods and certain kinds of functions we must always have ||x_{n+1} - x_n|| → 0.
If W(x₀) is compact, then {x_n} has limit points x', and ∇f(x') = 0. Hence the following proposition follows.

PROPOSITION 6.2.1. If ∇f is continuous on the compact set W(x₀) and ∇f(x) = 0 has only one solution x*, then x_n → x*.

We seek more significant results.
THEOREM 6.2.1 [Elkin (1968)]. If W(x₀) is compact and if there exists a σ > 0 such that
f(x_{n+1}) ≤ f(x_n) - σ ||x_{n+1} - x_n||²  for all n,
then x_{n+1} - x_n tends to zero.

Thus, if ⟨x₂ - x₁, ∇f(x₁)⟩ > 0, we conclude f(x₂) > f(x₁); a function satisfying this property for x₁ ≠ x₂ is called strictly pseudo-convex [Elkin (1968), Ponstein (1967)].
THEOREM 6.2.4 [Elkin (1968)]. For all x, y in the compact set W(x₀), let ⟨x - y, ∇f(y)⟩ > 0 imply f(x) > f(y); let f(x_{n+1}) ≤ f(x_n) for all n, and let ||x_{n+1} - x_n|| → 0. For n ≥ 0, let n₀ be the greatest integer not exceeding n which is congruent to zero modulo m.
The minimum of g lies in [a₀, t₀,₂] if g(t₀,₁) < g(t₀,₂), in [t₀,₁, b₀] if g(t₀,₁) > g(t₀,₂), and in [t₀,₁, t₀,₂] if g(t₀,₁) = g(t₀,₂). Thus we have located the minimum in an interval [a₁, b₁] smaller than [a₀, b₀], and we can proceed iteratively. The method is most efficient if we need to evaluate g at only one new point each time, that is, if either t₁,₁ or t₁,₂ equals whichever of t₀,₁ and t₀,₂ lies in (a₁, b₁); to allow this, we never choose a₁ = t₀,₁, b₁ = t₀,₂, but in the case g(t₀,₁) = g(t₀,₂) we define a₁ = a₀, b₁ = t₀,₂. If one seeks the smallest final interval [a_m, b_m] for a given number m of evaluations, then it is known [Kiefer (1957), Spang (1962)] that one should choose
t_{i,1} = (F_{m-1-i} / F_{m+1-i}) (b_i - a_i) + a_i,
t_{i,2} = (F_{m-i} / F_{m+1-i}) (b_i - a_i) + a_i,
where F₀ = 1, F₁ = 1, F_i = F_{i-1} + F_{i-2} are the Fibonacci numbers. This Fibonacci search always requires only one evaluation of g per step. On the final step, one takes
t_{m-1,1} = (½ - ε)(b_{m-1} - a_{m-1}) + a_{m-1},
t_{m-1,2} = (½ + ε)(b_{m-1} - a_{m-1}) + a_{m-1}
in order to isolate the minimum best. The final interval has width
b_m - a_m = (b₀ - a₀)(1 + ε) / (2 F_m).
Since F₂₀ > 10⁴, we see that the intervals shrink rapidly. It is known that for large i we have
F_{i-1} / F_{i+1} ≈ 0.382,   F_i / F_{i+1} ≈ 0.618,
which allows one to use the simpler formulas
t_{i,1} = 0.382 (b_i - a_i) + a_i,   t_{i,2} = 0.618 (b_i - a_i) + a_i.
The final interval in this way satisfies
b_m - a_m = (0.618)^m (b₀ - a₀).
Thus one can isolate the minimum in this way as accurately as desired.
Next we turn to methods using interpolation, although some of our remarks apply to direct-search techniques as well. Some of the procedures first seek an interval in which the minimizing point t* lies. Usually this is done by taking some number t₁ as an estimate of t* and then evaluating g at 0, a₁t₁, a₂t₁, ..., for some increasing sequence a_i (often a_i = 2^i), stopping at the first instance at which the values of g do not decrease; if one is willing to evaluate g'(a_i t₁) as well, one can also stop whenever g'(a_i t₁) becomes positive. If the termination occurs at t = t₁, then t₁ is reduced and the process restarted. Thus we finally find a_i t₁ with g(a_i t₁) ≤ g(a_{i-1} t₁), g(a_i t₁) ≤ g(a_{i+1} t₁), and t* is isolated in [a_{i-1} t₁, a_{i+1} t₁]. The number of evaluations of g will be reduced if t₁, at least near the solution x*, is a good estimate, for then one would expect to isolate t* in [0, a₁t₁] every time. In fact, if near the solution x* one sets t₁' = ½t₁, where t₁ is asymptotically correct, then we should isolate t* easily in [t₁', 3t₁'], and a choice of t* = t₁' or 2t₁', whichever gives the lower f-value, would lead to convergence, as we saw at the start of this section. In this light we see that Theorem 5.8.1 on the convergence of the conjugate-gradient method with c_n determined as
c_n = ⟨r_n, p_n⟩ / ⟨J'_n p_n, p_n⟩
can be considered as providing a good estimate t₁ which is asymptotically the correct t*; this has been used [Daniel (1967a)] as t₁ and has given good results. If {p_n} is any admissible sequence of directions and the functional f on R^l satisfies 0 < aI ≤ f''_x ≤ AI, the analogous choice for t₁ is
t₁ = -⟨∇f(x_n), p_n⟩ / ⟨f''(x_n) p_n, p_n⟩.
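The bracketing procedure with a_i = 2^i described above can be sketched as follows (the function name is ours; we assume g(t₁) < g(0), since otherwise the text reduces t₁ and restarts):

```python
def bracket_minimum(g, t1):
    """Bracket the minimizer t* of g along t > 0 by step doubling.

    Evaluates g at 0, t1, 2*t1, 4*t1, ... and stops at the first instance
    at which the values of g do not decrease, so that t* is isolated in
    [a_{i-1} t1, a_{i+1} t1].  Assumes g(t1) < g(0).
    """
    prev_t = 0.0
    t, gt = t1, g(t1)
    while True:
        nxt, gn = 2.0 * t, g(2.0 * t)
        if gn >= gt:                 # values stopped decreasing
            return prev_t, nxt       # bracket [a_{i-1} t1, a_{i+1} t1]
        prev_t = t
        t, gt = nxt, gn

lo, hi = bracket_minimum(lambda t: (t - 5.0) ** 2, 1.0)
```

For g(t) = (t - 5)² and t₁ = 1 the values at 1, 2, 4 decrease and the value at 8 does not, so the minimizer t* = 5 is isolated in [2, 8].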
It has been shown [Elkin (1968)] that one obtains global convergence with t_n = β_n t₁, where 0 < ε ≤ β_n ≤ 2 - ε, and of course that t₁ is asymptotically correct. Thus linearization can always be used to get a good estimate t₁ if one can afford to evaluate f''. If one cannot compute f'' but has an estimate f* for the minimum value of f, then one can use
t₁ = 2[f(x_n) - f*] / ⟨∇f(x_n), -p_n⟩.
7

VARIABLE-METRIC GRADIENT METHODS IN R^l
Suppose Q₀ is some self-adjoint positive-definite operator (an l × l matrix) on R^l. Then we can write
f(x₀ + tp) = f(x₀) + t⟨∇f(x₀), p⟩ + o(t),
and the gradient with respect to the Q₀ inner product can be read off from this expansion.

The new gradient is orthogonal to p_n and, asymptotically of course, to p₀, ..., p_{n-1} also. Thus we can consider that the power of the conjugate-gradient method compared to the steepest-descent method comes from the former's use of a
good variable metric. In Yakovlev (1965), gradient-type methods are considered strictly in the setting of variable-metric methods, that is,
x_{n+1} = x_n - t_n H_n ∇f(x_n)
for some sequence of operators H_n and steps t_n. Most of the results there concern convergence under various choices of t_n given certain properties of H_n, such as 0 < aI ≤ H_n ≤ AI. These correspond, with some minor changes, to the methods of Chapter 4, although more detailed convergence rates are often given in Yakovlev (1965). Thus we consider the methods in this completely general setting no further.
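A minimal sketch of this general iteration (the quadratic test problem and the constant choices H_n = M⁻¹, t_n = 1, which reproduce Newton's step, are our illustrative assumptions):

```python
import numpy as np

def variable_metric_descent(grad, x0, H_seq, t_seq, iters):
    """Iterate x_{n+1} = x_n - t_n H_n grad(x_n) for given operator/step sequences."""
    x = np.asarray(x0, dtype=float)
    for n in range(iters):
        x = x - t_seq(n) * (H_seq(n) @ grad(x))
    return x

# Quadratic example f(x) = 0.5 <x, Mx>; taking H_n = M^{-1}, t_n = 1 gives the
# Newton step, which minimizes the quadratic in one iteration.
M = np.array([[3.0, 1.0], [1.0, 2.0]])
Minv = np.linalg.inv(M)
x = variable_metric_descent(lambda v: M @ v, [1.0, -2.0],
                            lambda n: Minv, lambda n: 1.0, 1)
```

With H_n = I this reduces to steepest descent; the whole point of the chapter is the gain obtained from better choices of H_n.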
In a sense the best metric would be one which turns the level curves f(x) = c into spheres, so that the interior normal direction to the surface, that is, -∇f(x), points to the point minimizing f. For quadratic functionals
f(x) = ½⟨h - x, M(h - x)⟩ = ½[h - x, h - x],
where [u, v] = ⟨u, Mv⟩, this metric generates the direction
p_n = -J'(x_n)⁻¹ J(x_n).
This is the direction of Newton's method. Because of this intuitive viewpoint, and because Newton's method leads to quadratic convergence [Kantorovich-Akilov (1964), Rall (1969)], one often tries to pick the variable-metric formulation to mimic Newton's method; thus variable-metric methods are also called quasi-Newton methods [Broyden (1965, 1967), Zeleznik (1968)]. Because of the situation in the constrained case (see Section 4.10), one might not expect quadratic convergence from mimicking the Newton process if one proceeds along the Newton direction to the minimum of f along that line rather than using the pure Newton step x_{n+1} = x_n - J'(x_n)⁻¹J(x_n). However, the value of t_n which minimizes f(x_n + t p_n) is asymptotically
t_n = ⟨r_n, p_n⟩ / ⟨J'_n p_n, p_n⟩ = 1
in this case, and thus near the solution x* the minimization along x_n + t p_n nearly leads to the normal Newton step. While one should then hope for quadratic convergence, most results known to us guarantee only superlinear convergence [Levitin-Poljak (1966a), Yakovlev (1965)]. From what we have done, this can most easily be seen from the viewpoint of conjugate gradients. In Sections 5.3, 5.4, and 5.5 we considered a very general form of conjugate-gradient methods involving arbitrary self-adjoint positive-definite operators H and K, while in Section 5.6 such extra operators were missing. Clearly one may define a general method using operators H_n, K_n at each point x_n and develop convergence theory and error estimates in terms of the associated operator T_n, just as in the quadratic-functional case; this is done in Daniel (1965, 1967a, b), and the convergence rates are given via the spectral bounds a, A of T_n as usual. If one takes H_n = K_n = J'_n⁻¹, where J'_n is self-adjoint, uniformly positive-definite, and uniformly bounded, one gets T_n = I and a = A = 1, which implies superlinear convergence. In this case, of course, p_n = J'_n⁻¹ r_n = -J'_n⁻¹ J(x_n), and we have the minimization modification of Newton's method and a proof of superlinear convergence. It is possible, however, to show that the convergence is actually quadratic, using 0 < aI ≤ J'_x ≤ AI and the estimate a||x_{n+1} - x*||² ≤ f(x_{n+1}) - f(x*).

COROLLARY 7.3.1 [Vercoustre (1969)]. Suppose that H_n δ_i = x_{i+1} - x_i for 0 ≤ i ≤ n - 1, 0 ≤ n ≤ l - 1, where δ_i = M(x_{i+1} - x_i); suppose that x_{n+1} ≠ x_n for 0 ≤ n < l. Then x_l = h, the solution, for quadratics
f(x) = ½⟨x - h, M(x - h)⟩.
Proof. If for some n we had
δ_n = Σ_{i=0}^{n-1} τ_i δ_i,
then, using H_n δ_i = x_{i+1} - x_i = M⁻¹δ_i, we would obtain
Σ_{i=0}^{n-1} τ_i M⁻¹ δ_i = M⁻¹ δ_n = x_{n+1} - x_n,
which is a contradiction. Thus {δ₀, ..., δ_n} is linearly independent for all n. Q.E.D.

We now suppose that H_{n+1} is symmetric and that t_n is always chosen to minimize f(x_n + t p_n), so that
0 = ⟨∇f(x_{n+1}), p_n⟩  for n = 0, 1, ..., r.
Under these hypotheses, for the two-parameter methods of Equation 7.3.1, it can then be proved [Broyden (1967)] that ⟨p_i, M p_j⟩ = 0 if i ≠ j for 0 ≤ i, j ≤ r and, therefore, that we have a conjugate-direction method. The proof goes roughly as follows. From the definitions of H_{n+1} and p_{n+1} it easily follows that
p_{n+1} = a_n p_n + β_n H_n ∇f(x_{n+1})
for some scalars a_n, β_n.
Now,
⟨p_n, M p_{n+1}⟩ = 0,  n = 0, 1, ..., r.
The induction then proceeds easily to give
⟨p_i, M p_n⟩ = 0,  i = 0, 1, ..., n - 1,  n = 0, 1, ..., r.
Thus we have found a two-parameter class of exact variable-metric methods; the admissibility of the direction sequence for nonquadratic functionals still remains unknown, however. Since a study of the admissibility requires considerable specialization of the vectors z_n, we consider this question for special methods, although little information is available even in special cases.

7.4. SOME PARTICULAR METHODS
We consider first the class of variable-metric methods [Broyden (1967)] defined via
(7.4.1)  H_{n+1} = H_n + (a rank-two correction built from the step δ_n = x_{n+1} - x_n, the vector H_n σ_n with σ_n = ∇f(x_{n+1}) - ∇f(x_n), and a free parameter β_n),
where β_n is arbitrary, t_n is chosen so that ⟨∇f(x_{n+1}), p_n⟩ = 0, and H₀ is symmetric. By a straightforward inductive argument [Broyden (1967)] or by using Corollary 7.3.1, one can show that, if M is symmetric and nonsingular, then the δ_i are linearly independent and hence the method is exact.
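Equation 7.4.1 survives here only schematically; as one concrete member of the class we sketch the standard Davidon-Fletcher-Powell update (our choice of representative, using the δ_n, σ_n notation above), which the chapter refers to as Davidon's method.

```python
import numpy as np

def dfp_update(H, delta, sigma):
    """Davidon-Fletcher-Powell update of the inverse-Hessian estimate H.

    delta = x_{n+1} - x_n,  sigma = grad f(x_{n+1}) - grad f(x_n):
    H_{n+1} = H + delta delta^T/<delta,sigma> - (H sigma)(H sigma)^T/<sigma,H sigma>.
    """
    Hs = H @ sigma
    return (H + np.outer(delta, delta) / (delta @ sigma)
              - np.outer(Hs, Hs) / (sigma @ Hs))

# Quasi-Newton (secant) property: the updated H maps sigma back to delta.
H0 = np.eye(3)
delta = np.array([0.3, -0.1, 0.2])
M = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
sigma = M @ delta            # for a quadratic with Hessian M, sigma = M delta
H1 = dfp_update(H0, delta, sigma)
```

The update preserves symmetry and, as Theorem 7.4.1 below asserts for the whole class, positive-definiteness when ⟨δ_n, σ_n⟩ > 0.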
THEOREM 7.4.1. If H₀ and M are positive-definite and β_n ≥ 0, then it follows that H_n is positive-definite for each n.

Proof. The proof goes by induction. If H_n is positive-definite, write H_n = LL*, and let v = L*x, w = L*σ_n. Then ⟨x, H_{n+1}x⟩ is a sum of nonnegative terms together with
⟨v, v⟩ - ⟨v, w⟩² / ⟨w, w⟩,
which is nonnegative by the Schwarz inequality and is positive unless v = λw; in that case the remaining terms are nonzero. Therefore ⟨x, H_{n+1}x⟩ > 0 if x ≠ 0, and hence H_{n+1} is positive-definite. Q.E.D.
Since only finitely many iterations need be used for quadratics, we have bounds of the form ≥ ε > 0 in that case. If we use the algorithm for more general functionals, however, we cannot immediately conclude such bounds (see, however, Theorems 7.4.2 and 7.4.5). Similarly, it is not known whether or not such bounds exist for quadratics in infinite-dimensional spaces, a result which would at least give some indications for the nonquadratic case in R^l. At this time numerical experience testing various choices of the parameters β_n is rather limited; thus the method for arbitrary β_n requires further study both theoretically and computationally. In practice, when one uses this method one seldom actually performs an exact minimization along the direction p_n to reach x_{n+1}, that is, one seldom has ⟨∇f(x_{n+1}), p_n⟩ = 0; it is striking to note, however, that if one does have ⟨∇f(x_{n+1}), p_n⟩ = 0 for all n, then the directions are independent of the parameters {β_n}. Computationally, however, one finds great dependence on the choice of these parameters.
THEOREM 7.4.2. Under the assumptions of this section, if ⟨∇f(x_{n+1}), p_n⟩ = 0, then the normalized direction
p_{n+1} / ||p_{n+1}||
determined by Equation 7.4.1 is independent of β_n.
Proof. Writing out -p_{n+1} = H_{n+1}∇f(x_{n+1}) from Equation 7.4.1 and using ⟨∇f(x_{n+1}), p_n⟩ = 0, one finds that a change in β_n changes p_{n+1} only by a positive scalar factor, so that the normalized direction is unaffected. Q.E.D.
Writing g_n = ∇f(x_n), the Davidon direction recursion expresses H_n g_m, for n < m, through H_{n-1}g_m together with the quantities ⟨g_m, p_{n-1}⟩ and ⟨g_m, H_{n-1}g_n⟩; combining these relations, the cross terms cancel.
Therefore, since H_n is positive-definite, we must have
⟨g_m, p_n⟩ = 0 for n < m.
But then, from the definitions of H_n and β_n, we have for n < m that
H_n g_m = H_{n-1} g_m = ... = H₀ g_m,
and hence g_m is an eigenvector with eigenvalue 1 for H₀⁻¹H_n if m > n. Therefore, we find that p_n reduces to the form of the conjugate-gradient direction in the H₀ inner product.
For the conjugate-gradient method, p₀' = -H₀g₀ = p₀, and therefore x₁' = x₁. If p_i = λ_i p_i' and x_{i+1}' = x_{i+1} for i = 0, 1, ..., n, as is true for n = 0, then a computation with the relations above shows that
p_{n+1} = λ_{n+1} p_{n+1}'
for some scalar λ_{n+1} (defined by the equality), and hence x_{n+2}' = x_{n+2}. Q.E.D.
Thus, because of Theorems 7.4.2 and 7.4.3 and the results of Chapter 5 and Section 6.4, we know that the methods of Equation 7.4.1 (and in particular, Davidon's method) yield global convergence when implemented with exact minimization along x_n + t p_n and applied to uniformly convex
quadratic functionals (for another proof see [Horwitz-Sarachik (1968)]), and we have estimates on the local-convergence rate. Similarly, for x₀ near x*, for uniformly convex nonquadratic functionals, their relationship to conjugate-gradient methods gives a local-convergence result for these methods as well. Only very recently [Powell (1970)] has a global-convergence result for the Davidon method been found; we give this result and outline the proof.

THEOREM 7.4.4 [Powell (1970)]. Let f be twice continuously differentiable, let f''_x ≥ aI, a > 0, and let the Davidon algorithm be used with exact minimization along x_n + t p_n, starting with an arbitrary x₀ and positive-definite symmetric H₀. Then the sequence {x_n} converges to the unique solution x̂.
Proof. Since H_n is positive-definite for each n [Fletcher-Powell (1963)], we can define Γ_n = H_n⁻¹. Writing γ_n = x_{n+1} - x_n and σ_n = ∇f(x_{n+1}) - ∇f(x_n), one can check that
Γ_{n+1} = (I - σ_n γ_n* / ⟨γ_n, σ_n⟩) Γ_n (I - γ_n σ_n* / ⟨γ_n, σ_n⟩) + σ_n σ_n* / ⟨γ_n, σ_n⟩.
Using the facts that ⟨∇f(x_{n+1}), γ_n⟩ = 0 (exact minimization) and γ_n = -t_n H_n ∇f(x_n), we conclude that
tr(Γ_{n+1}) ≤ tr(Γ_n) + ||∇f(x_{n+1}) - ∇f(x_n)||² / ⟨γ_n, σ_n⟩ - ||∇f(x_n)||² / ⟨∇f(x_n), H_n ∇f(x_n)⟩.
Solving this recursion thus yields
tr(Γ_{n+1}) ≤ tr(Γ₀) + Σ_{i=0}^{n} [ ||∇f(x_{i+1}) - ∇f(x_i)||² / ⟨γ_i, σ_i⟩ - ||∇f(x_i)||² / ⟨∇f(x_i), H_i ∇f(x_i)⟩ ];
the remainder of the argument, which we omit, bounds these sums so as to force ∇f(x_n) → 0.
8

OPERATOR-EQUATION METHODS

Recall that a matrix A = (a_ij) is an M-matrix if A is nonsingular with A⁻¹ ≥ 0, a_ii > 0, and a_ij ≤ 0 if i ≠ j; and that φ: R^l → R^l is isotone if x ≤ y implies φ(x) ≤ φ(y), is diagonal if, for 1 ≤ i ≤ l, the ith component of φ(x) depends only on the ith component of x, and is convex if
φ(λx + (1 - λ)y) ≤ λφ(x) + (1 - λ)φ(y)
for all x, y in R^l and λ ∈ [0, 1]. Such operators commonly arise in the numerical solution of boundary-value problems for mildly nonlinear differential equations [Bers (1953), Greenspan-Parter (1965), Schechter (1962)]. The following two results are typical of what is known for the full nonlinear methods.

PROPOSITION 8.3.2 [Bers (1953), Ortega-Rheinboldt (1967a, 1968)]. Let A be an M-matrix, φ: R^l → R^l a continuous, diagonal, isotone mapping, and set
G(x) = Ax + φ(x).
Then, for any x₀ in R^l, the nonlinear Jacobi and nonlinear SOR [for
0 < ω ≤ 1 (and hence, for ω = 1, the nonlinear Gauss-Seidel)] methods all yield sequences x_n converging to the unique solution x̂ of G(x) = 0.
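A sketch of one such method, the nonlinear Gauss-Seidel iteration for G(x) = Ax + φ(x) - b = 0 (the right-hand side b, the bisection scalar solver, and the test data are our illustrative assumptions):

```python
import math

def nonlinear_gauss_seidel(A, phi, b, x0, sweeps, lo=-50.0, hi=50.0):
    """Nonlinear Gauss-Seidel for G(x) = A x + phi(x) - b = 0.

    phi is diagonal: phi[i] is a nondecreasing scalar function of x[i].
    Each sweep solves the ith scalar equation by bisection, which is safe
    because the ith component of G is increasing in x[i] when A[i][i] > 0.
    """
    l = len(b)
    x = list(x0)
    for _ in range(sweeps):
        for i in range(l):
            def gi(t):
                s = sum(A[i][j] * x[j] for j in range(l) if j != i)
                return A[i][i] * t + s + phi[i](t) - b[i]
            a_, b_ = lo, hi
            for _ in range(200):       # bisection on the monotone scalar map
                m = 0.5 * (a_ + b_)
                if gi(m) > 0.0:
                    b_ = m
                else:
                    a_ = m
            x[i] = 0.5 * (a_ + b_)
    return x

A = [[2.0, -1.0], [-1.0, 2.0]]         # an M-matrix
phi = [lambda t: math.exp(t) - 1.0, lambda t: t ** 3]
b = [1.0, 1.0]
x = nonlinear_gauss_seidel(A, phi, b, [0.0, 0.0], 60)
```

Here the diagonal maps e^t - 1 and t³ are isotone, so Proposition 8.3.2 applies and the sweeps converge to the unique solution.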
PROPOSITION 8.3.3 [Caspar (1968), Kellogg (1969), Ortega-Rheinboldt (1968)]. Let G(x) = H(x) + V(x) with H and V continuous, let G(x̂) = 0, let
⟨V(x) - V(y), x - y⟩ ≥ 0 for all x, y in R^l,
and for each bounded set B let there exist positive constants L_B and a_B such that
||H(x) - H(y)|| ≤ L_B ||x - y||  and  ⟨H(x) - H(y), x - y⟩ ≥ a_B ||x - y||²
for all x, y in B.
PROPOSITION 8.3.6 [Caspar (1968)]. Let G(x) = H(x) + V(x) with H and V continuously differentiable, let G(x) - G(y) ≤ G'_x(x - y) if x ≤ y or y ≤ x, let rI + H'_x and rI + V'_x be M-matrices for all r > 0 and all x, let G(x̂) = 0 and x₀ ≥ x̂, G(x₀) ≥ 0. Then the Newton-1-step-ADI method yields iterates converging to x̂.
9

AVOIDING CALCULATION OF DERIVATIVES

Here e_i is the vector with (e_i)_j = δ_ij, the Kronecker delta. It is clear that if the d_i for i = 1, 2, ..., l are small enough, this method will behave essentially as well as the method with derivatives; the problem here, as with all the other methods wherein derivatives are replaced by differences, lies in how to choose the d_i. An excellent analysis of this problem has been given for Davidon's method [Stewart (1967)]; since the viewpoint is of interest for use on any method, we present the ideas here once and for all.

The whole basis of most gradient methods is to treat f(x) locally as a quadratic; thus we shall consider the problem of approximating the derivative γ = f'(0) of the quadratic
f(t) = f(0) + γt + ½αt²
by the difference
[f(d) - f(0)] / d = γ_d.
The two sources of error in approximating the scalar γ by the scalar γ_d are the truncation error in the divided difference and the cancellation produced in computing f(d) - f(0) for small d; clearly we should balance these errors. We can estimate the relative truncation error by
|γ_d - γ| / |γ| = |αd / (2γ)|.
If we assume that, in computing f(t), we actually compute
f⁺(t) = f(t)(1 + ε),  |ε| ≤ η,  η known,
then the relative cancellation error can be estimated as
2η |f(0)| / |f(d) - f(0)|.
If we equate the two estimates and solve for the number d, we find that |d| should be the positive root of
3α²z³ + |α||γ|z² - 4|f(0)||γ|η = 0,  with sign(d) = sign(αγ).
EXERCISE. Show that the above choice of d is correct.

To avoid solving the above cubic, experimentally it has been satisfactory to ignore the cubic or quadratic term, depending on which gives a smaller root, to solve the resulting simpler equation, and to refine the result by one Newton step applied to the original cubic. Ignoring the cubic term gives
|d| = 2 {η |f(0)| / |α|}^{1/2}   if γ² ≥ |αf(0)|η,
while ignoring the quadratic term gives
|d| = {4η |f(0)| |γ| / (3α²)}^{1/3}   if γ² < |αf(0)|η;
in either case the result is then refined by one Newton step applied to the cubic.
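A sketch of the simplified rule, omitting the final Newton refinement on the cubic (the function name is ours; η is the assumed relative accuracy of the f-evaluations):

```python
def difference_step(f0, gamma, alpha, eta):
    """Choose the difference step d balancing truncation against cancellation.

    f0 = f(0), gamma ~ f'(0), alpha ~ f''(0), eta = relative error in f.
    Ignoring the cubic term of the balancing equation gives the square-root
    rule; ignoring the quadratic term gives the cube-root rule.  The Newton
    refinement on the full cubic is omitted in this sketch.
    """
    if gamma * gamma >= abs(alpha * f0) * eta:
        d = 2.0 * (eta * abs(f0) / abs(alpha)) ** 0.5
    else:
        d = (4.0 * eta * abs(f0) * abs(gamma) / (3.0 * alpha * alpha)) ** (1.0 / 3.0)
    return d if alpha * gamma >= 0 else -d    # sign(d) = sign(alpha * gamma)

d = difference_step(f0=2.0, gamma=1.0, alpha=4.0, eta=1e-15)
```

For double precision (η ≈ 10⁻¹⁵) this yields steps near the square root of the machine precision, the familiar rule of thumb for forward differences.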
This computation requires a crude estimate of γ, which can easily be obtained, and also an estimate α of f''(0). For the Davidon method we are considering,
f(x_n + t e_i) = f(x_n) + t ⟨∇f(x_n), e_i⟩ + ½t²α_i + o(t²),
where α_i is the ith diagonal element of f''(x_n). Recall that H_n approximates (f'')⁻¹; thus we seek the diagonal elements of H_n⁻¹. These can in fact be computed recursively (see the proof of Theorem 7.4.4) without knowledge of the off-diagonal elements, from update equations given in [Stewart (1967)] involving only p_n, the gradient difference, and ⟨∇f(x_n), p_n⟩; we replace ∇f(x_n) by its difference approximation.

EXERCISE. Show that the recursion of Stewart (1967) for the diagonal elements of H_{n+1}⁻¹ is correct.

Thus we have a rule for determining the size of the numbers d_i in the difference approximation to ∇f(x_n). Certainly we have ignored many problems, implying that our analysis is far from rigorous, but the ideas in practice appear to lead to good results. For further computational details and examples the reader is referred to Stewart (1967), where the method is shown to be quite powerful in practice. A similar approach has been applied for a Davidon-like method used to minimize ||J(x)||² for a nonlinear operator J [Fletcher (1968)]; here J'⁻¹ usually does not exist, so one is led to finding a Davidon-type approximation to a pseudo-inverse J'⁺ using differences. The method as proposed in Fletcher (1968) is exact for quadratics.

EXAMPLE [Stewart (1967)]. The Davidon modification without derivatives was used to minimize
f(x, y) = 100(y - x²)² + (1 - x)²,
starting with x₀ = -1.2, y₀ = 1.0. After 163 function evaluations, f was reduced to 9 × 10⁻¹² with (x, y) = (1.000002, 1.006003).
9.3. MODIFYING NEWTON'S METHOD
Computationally, at least two problems are involved in Newton's method: the need for solving a linear system at each step, and the need for evaluating roughly l² derivatives at each step. One can eliminate most of the derivative evaluation by using the derivatives at one fixed point throughout, but this eliminates the powerful feature of quadratic convergence. An alternative, of course, is to use differences to evaluate the derivatives. If the step size used for the differences is small enough, one should maintain the rapid convergence, it would seem. The local-convergence properties can be analyzed by means of Proposition 8.2.1. We replace the derivative J'(x) by the difference approximation ΔJ(x, ξ), the matrix whose ith column is
γ_i = (1/ξ) [J(x + ξe_i) - J(x)].
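The difference-Newton iteration built from this approximation can be sketched as follows (the test system and the fixed step ξ are our illustrative choices):

```python
import numpy as np

def difference_newton(J, x0, xi, iters):
    """Newton's method with the Jacobian replaced by forward differences.

    Column i of DeltaJ(x, xi) is [J(x + xi * e_i) - J(x)] / xi.
    """
    x = np.asarray(x0, dtype=float)
    l = len(x)
    for _ in range(iters):
        Jx = J(x)
        D = np.empty((l, l))
        for i in range(l):
            e = np.zeros(l)
            e[i] = xi
            D[:, i] = (J(x + e) - Jx) / xi   # ith difference column
        x = x - np.linalg.solve(D, Jx)
    return x

# Solve J(x) = 0 for J(x) = (x0^2 + x1 - 3, x0 + x1^2 - 5); a root is (1, 2).
J = lambda x: np.array([x[0] ** 2 + x[1] - 3.0, x[0] + x[1] ** 2 - 5.0])
root = difference_newton(J, [0.5, 1.5], 1e-7, 25)
```

As the local result below indicates, tying the step ξ_n to ||J(x_n)|| would recover quadratic convergence; a fixed small ξ already gives rapid linear convergence.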
As in Section 9.2, we then have the following local result [Dennis (1969)].

PROPOSITION 9.3.1. Let
A₀⁻¹ ≡ [ΔJ(x₀, ξ₀)]⁻¹
exist with
||A₀⁻¹J'_x - A₀⁻¹J'_y|| ≤ K ||x - y||  for x, y ∈ S(x₀, r).
Let ε > 0 be such that K(ε + ξ₀) < 1. Suppose that
½ ≥ h ≡ K ||A₀⁻¹J(x₀)|| / (1 - Kε - Kξ₀)²
and
r ≥ r₀ ≡ [(1 - (1 - 2h)^{1/2}) / h] × ||A₀⁻¹J(x₀)|| / (1 - Kε - Kξ₀),
and that {ξ_n} is a sequence of numbers such that |ξ_n| ≤ ε and x_n + ξ_n e_i ∈ S(x₀, r) for i = 1, ..., l. Then the sequence
x_{n+1} = x_n - [ΔJ(x_n, ξ_n)]⁻¹ J(x_n)
is well defined and converges to x̂, solving J(x) = 0. If, for a constant C, we have |ξ_n| ≤ C ||J(x_n)||, then the convergence is quadratic.

For the global properties the situation is somewhat less simple; we have to be sure that the directions generated are descent directions. Thus let us suppose that J(x) = ∇f(x) and that ΔJ(x, ξ) is as defined before, and let us consider the following direction-generating algorithm [Goldstein-Price (1967)]: if ΔJ(x_n, ξ_n) is singular, or if the resulting direction fails a descent test, the direction -∇f(x_n) is used instead.

A related coordinate-descent procedure uses step sizes δ₁, ..., δ_l and an exploratory-move operator EM(x) defined as follows. Setting x' = x and starting with i = 1 up to i = l: if f(x' + δ_i e_i) < f(x'), then x' is replaced by x' + δ_i e_i and i by i + 1; if f(x' + δ_i e_i) ≥ f(x') but f(x' - 2δ_i e_i) < f(x'), then x' is replaced by x' - 2δ_i e_i and i by i + 1; otherwise x' is not changed. Finally, when i reaches l + 1, we set EM(x) = x'. The entire algorithm proceeds from x_n to x_{n+1} as follows, starting with some initial x₀ and x₁ = EM(x₀) ≠ x₀. We compute x̃ = EM(x_n); if x̃ = x_n, then δ_i is cut in half for i = 1, 2, ..., l and the iteration restarts at x_n. Otherwise, if f[EM(2x̃ - x_n)] ≤ f(x̃), we set x_{n+1} = EM(2x̃ - x_n); if the latter inequality is invalid, we set x_{n+1} = EM(x̃). If f is strictly convex with ∇f continuous and lim f(x) = +∞ as ||x|| → ∞, then it can be shown [Cea (1969)] that ||x_n - x̂|| → 0, where x̂ minimizes f over R^l. A computer program implementing this method can be found in Kaupe (1963).
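A compact sketch of this exploratory-move scheme with the acceleration step through 2x̃ - x_n (function names and the test quadratic are ours):

```python
def exploratory_move(f, x, delta):
    """Cycle once through the coordinate directions, accepting descent steps.

    Tries x' + delta_i e_i first and x' - 2 delta_i e_i second, as in the text.
    """
    x = list(x)
    for i in range(len(x)):
        trial = list(x)
        trial[i] += delta[i]
        if f(trial) < f(x):
            x = trial
            continue
        trial = list(x)
        trial[i] -= 2.0 * delta[i]
        if f(trial) < f(x):
            x = trial
    return x

def pattern_search(f, x0, delta, iters):
    """Exploratory moves plus the acceleration step through 2*x_tilde - x_n."""
    x = list(x0)
    for _ in range(iters):
        xt = exploratory_move(f, x, delta)
        if xt == x:
            delta = [0.5 * d for d in delta]   # no progress: halve the steps
            continue
        accel = [2.0 * a - b for a, b in zip(xt, x)]
        xa = exploratory_move(f, accel, delta)
        x = xa if f(xa) <= f(xt) else exploratory_move(f, xt, delta)
    return x

x = pattern_search(lambda v: (v[0] - 1.0) ** 2 + 2.0 * (v[1] + 0.5) ** 2,
                   [0.0, 0.0], [1.0, 1.0], 60)
```

No derivatives are evaluated anywhere; progress is driven entirely by function comparisons and step halving.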
Since the coordinate directions, which are used in the above algorithm, need not be the best ones, the process has been modified as follows [Rosenbrock (1960)]. Given a vector x, l orthonormal directions d₁(x), ..., d_l(x), l step sizes δ₁(x), ..., δ_l(x), and two parameters α > 1, β ∈ (0, 1), the exploratory-move operator EM(x) is defined as follows. For i cycling through the values 1, 2, ..., l, setting x' = x: if f(x' + δ_i d_i) < f(x'), then we replace x' by x' + δ_i d_i, i by i + 1 (or l by 1), δ_i by αδ_i, and record a success; otherwise δ_i is replaced by -βδ_i and a failure is recorded. After one success and one failure have been recorded for each value of i, we set EM(x) = x'. The iteration, starting with x₀ and directions d₁(x₀), ..., d_l(x₀), now proceeds from x_n to x_{n+1} = EM(x_n) as follows. Let λ_i be the sum of the steps taken in the direction d_i(x_n), and define the vectors q_i = Σ_{j=i}^{l} λ_j d_j(x_n). New directions d_i(x_{n+1}) are now obtained by orthonormalizing the vectors q_i; this completes the description of the method. Roughly speaking, we can say that d₁(x_{n+1}) is the most successful motion found so far, d₂(x_{n+1}) is the most successful direction orthogonal to d₁(x_{n+1}), and so on. A further modification of this method [Swann (1964)] is, for each direction d_i, to move to the point minimizing f in that direction, and then to compute new directions as before. We do not consider these methods further, since we believe the methods to be considered next to be of greater importance and usefulness.

We have seen several times in earlier chapters that there is great advantage in using conjugate directions of some type; the above methods, however,
all deal with orthogonal directions, that is, I-conjugate directions. It is possible, however, to generate directions that are conjugate (at least for quadratics) without dealing with derivatives. These methods, which appear to be the best of those that ignore derivatives, are based on the fact that if one minimizes a quadratic
f(x) = ⟨x - h, M(x - h)⟩
in the direction p from two points x₁ and x₂, arriving at the points x₁' and x₂', then x₁' - x₂' is M-conjugate to p, since ⟨M(x_i' - h), p⟩ = 0 for i = 1, 2, and hence ⟨M(x₁' - x₂'), p⟩ = 0.
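This parallel-line property is easy to check numerically; the sketch below (data ours) minimizes a quadratic along the same direction p from two points and verifies that the difference of the minimizers is M-conjugate to p.

```python
import numpy as np

def line_minimize_quadratic(M, h, x, p):
    """Exact minimizer of f(y) = <y - h, M (y - h)> along y = x + t p."""
    t = (p @ (M @ (h - x))) / (p @ (M @ p))
    return x + t * p

M = np.array([[5.0, 1.0], [1.0, 2.0]])
h = np.array([1.0, -1.0])
p = np.array([1.0, 1.0])
x1p = line_minimize_quadratic(M, h, np.array([4.0, 0.0]), p)
x2p = line_minimize_quadratic(M, h, np.array([-3.0, 2.0]), p)
conj = (x1p - x2p) @ (M @ p)      # vanishes: x1' - x2' is M-conjugate to p
```

The conjugacy follows because ⟨M(x' - h), p⟩ = 0 at each line minimizer, and no derivative of f is ever formed.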
let d_{k+1,r} = d_{k,r} for r ≠ s, let d_{k+1,s} = d_{k,l+1}, and let
γ_{k+1} = t_{k,s} γ_k / α_k.
If, however, t_{k,s} γ_k < ε α_k, let d_{k+1,r} = d_{k,r} for r = 1, 2, ..., l and set γ_{k+1} = γ_k.

Consider the method applied to minimize the quadratic
f(x) = ⟨x - h, M(x - h)⟩.
Suppose each time that the last direction d_{k,l} is replaced by d_{k,l+1}. Then the last step of the k-iteration and that of the (k + 1)-iteration are in the same direction and, therefore, because of our earlier remark, we shall next introduce a conjugate direction. After l + 1 steps we would have l conjugate directions and, if they are linearly independent, we shall therefore get the correct solution on the next iteration. The method of choosing the direction to be
eliminated is the technique that keeps the directions d_{k,1}, ..., d_{k,l} independent, as we shall see; it also determines which direction, if any, is eliminated at each step, and thereby invalidates the above argument. Thus it does not appear possible to prove that this method is exact for quadratics, although we can prove convergence. First we show that the directions d_{k,r} (for arbitrary functionals) are linearly independent.
THEOREM 9.6.1 [Zangwill (1967)]. The directions d_{k,1}, ..., d_{k,l} are linearly independent; in fact, their determinant satisfies
|det[(d_{k,1}, ..., d_{k,l})]| = γ_k > ε.

Proof. The result is true for k = 1; assume it for k. Since x_{k,l} - x_{k,0} = α_k d_{k,l+1}, we have
det[(d_{k,1}, ..., d_{k,s-1}, d_{k,l+1}, d_{k,s+1}, ..., d_{k,l})] = (t_{k,s}/α_k) det[(d_{k,1}, ..., d_{k,l})] = t_{k,s} γ_k / α_k
for all s. The choice of s, that is, of the direction to try to replace, gives us the greatest chance of replacement, while the criterion for replacing or not yields γ_{k+1} > ε. Q.E.D.
Having the above fact, we can prove convergence.

THEOREM 9.6.2. Let f be a continuous and strongly quasi-convex functional on R^l, and let the above method be applied starting with an arbitrary x_{1,0}; suppose the sequence {x_{k,r}}, r = 0, 1, ..., l, k = 1, 2, ..., is bounded. Then any limit point x' of x_{k,r₀} as k → ∞, for any r₀ = 0, 1, ..., l, is also a limit point of x_{k,r}, r ≠ r₀, as k → ∞; and for each such limit point there exist l linearly independent directions d₁, ..., d_l such that f(x') ≤ f(x' + t d_r) for all t and r = 1, 2, ..., l.

Proof. Since ||d_{k,r}|| = 1 also, given any subsequence K of integers k, we can find a further subsequence K₁ such that d_{k,r} → d_r for r = 1, 2, ..., l as k → ∞ with k ∈ K₁, and x_{k,r} → x_r for r = 0, 1, ..., l as k → ∞ with k ∈ K₁; we show next that x_{r+1} = x_r for r = 0, 1, ..., l - 1.

Since ⟨∇f(x'), d_r⟩ = 0 for each r and the d_r are linearly independent, we have ∇f(x') = 0. Q.E.D.

COROLLARY 9.6.2. If, in addition to the hypotheses of Corollary 9.6.1,
we know that {x; ∇f(x) = 0} contains no continuum, then the sequence x_{k,r} converges.

Proof. We have shown that the difference of successive elements in the sequence x_{k,0}, ..., x_{k,l} tends to zero; the same argument shows that
||x_{k,l} - x_{k+1,0}|| → 0.
Thus we may apply Theorem 6.3.1. Q.E.D.

COROLLARY 9.6.3. If the continuously differentiable, strongly quasi-convex functional f is strictly pseudo-convex, that is, if ⟨x - y, ∇f(y)⟩ ≥ 0
implies f(x) > f(y) for x ≠ y, and if {x_{k,r}} is bounded, then x_{k,r} → x̂, the unique minimizer of f.

Proof. Limit points x' exist with ∇f(x') = 0 by Corollary 9.6.1; then, by the strict pseudo-convexity, for any x ≠ x' we have ⟨x - x', ∇f(x')⟩ = 0 ≥ 0 and hence f(x) > f(x'), so x' minimizes f. By the strong quasi-convexity, such a minimizer is unique. Q.E.D.

COROLLARY 9.6.4. If f is uniformly quasi-convex, strictly pseudo-convex, and continuously differentiable, then for any x_{1,0} the sequence {x_{k,r}} converges to the unique x̂ minimizing f.

COROLLARY 9.6.5 [Zangwill (1967)]. If 0 < aI ≤ f''_x in R^l, then for any x_{1,0} the sequence {x_{k,r}} converges to the unique x̂ minimizing f.

EXERCISE. Prove Corollaries 9.6.4 and 9.6.5.
We have not, however, been able to prove that the method is exact for quadratics; Zangwill (1967) has developed a modification of the method that is exact. The Zangwill method is as follows. Let e_i, i = 1, ..., l, be the coordinate directions and let an initial point x_{0,l} and directions d_{1,1}, ..., d_{1,l}, with ||d_{1,r}|| = 1, be given. Let t_{0,l} minimize f(x_{0,l} + t d_{1,l}) and let x_{0,l+1} = x_{0,l} + t_{0,l} d_{1,l}. Set n = 1 and iteratively apply the basic k-iteration starting with k = 1; the basic k-iteration is as follows, given x_{k−1,l+1}, d_{k,1}, ..., d_{k,l}, and n:

(1) Compute a' to minimize f(x_{k−1,l+1} + a e_n), let n' = n, and replace n by n (modulo l) + 1. If a' ≠ 0, let x_{k,0} = x_{k−1,l+1} + a' e_{n'}; if, however, a' = 0, return to the start of step 1, noting that if step 1 is performed l times, we may consider x_{k−1,l+1} to be the solution.

(2) For r = 1, ..., l, compute t_{k,r} to minimize f(x_{k,r−1} + t d_{k,r}) and let x_{k,r} = x_{k,r−1} + t_{k,r} d_{k,r}.

(3) Define

d_{k,l+1} = (x_{k,l} − x_{k−1,l+1}) / ||x_{k,l} − x_{k−1,l+1}||,

compute t_{k,l+1} to minimize f(x_{k,l} + t d_{k,l+1}), and set x_{k,l+1} = x_{k,l} + t_{k,l+1} d_{k,l+1}.

(4) Define directions d_{k+1,r} ≡ d_{k,r+1} for r = 1, 2, ..., l.
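The four steps above can be sketched in code. The following is a minimal illustration of our own, not the book's notation: we specialize to the quadratic f(x) = ½⟨x − h, M(x − h)⟩ so the line minimizations can be done exactly (t = −d'M(x − h) / d'Md), and take the coordinate directions as the initial d_{1,r}; the names zangwill_quadratic, step, and x_final are all ours.

```python
import numpy as np

def zangwill_quadratic(M, h, x, tol=1e-12):
    """Sketch of Zangwill's modified method on f(x) = 0.5*(x-h)' M (x-h)."""
    l = len(x)
    E = np.eye(l)                              # coordinate directions e_1..e_l
    D = [E[:, r].copy() for r in range(l)]     # initial unit directions d_{1,r}

    def step(x, d):                            # exact minimizer of f(x + t d)
        return -(d @ (M @ (x - h))) / (d @ (M @ d))

    x = x + step(x, D[-1]) * D[-1]             # preliminary move along d_{1,l}
    n = 0
    for k in range(1, l + 2):                  # basic k-iterations
        x_prev = x.copy()                      # x_{k-1,l+1}
        # step 1: coordinate moves; l zero moves in a row means we are done
        for _ in range(l):
            a, nprime = step(x, E[:, n]), n
            n = (n + 1) % l
            if abs(a) > tol:
                x = x + a * E[:, nprime]       # x_{k,0}
                break
        else:
            return x                           # all l coordinate steps vanish
        # step 2: minimize in turn along d_{k,1}, ..., d_{k,l}
        for d in D:
            x = x + step(x, d) * d             # ends at x_{k,l}
        # step 3: new unit direction from the overall displacement
        disp = x - x_prev
        d_new = disp / np.linalg.norm(disp)
        x = x + step(x, d_new) * d_new         # x_{k,l+1}
        # step 4: shift the direction set, dropping d_{k,1}
        D = D[1:] + [d_new]
    return x

M = np.array([[4.0, 1.0], [1.0, 3.0]])
h = np.array([1.0, -2.0])
x_final = zangwill_quadratic(M, h, np.array([0.0, 0.0]))
```

On this 2-by-2 example the iteration reaches the minimizer h exactly and then stops in step 1, consistent with the exactness for quadratics claimed below.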
This method differs from the preceding primarily in its feature of minimizing over the coordinate directions as well as the directions d_{k,r}; this feature allows us to revise the directions d_{k,r} in the simple manner of the algorithm and thus obtain exact convergence for quadratics, as the following theorem shows.

THEOREM 9.6.3 [Zangwill (1967)]. Let f(x) = ⟨x − h, M(x − h)⟩, where M is self-adjoint and positive-definite, and let the initial point x_{0,l} be given. Then the iteration stops during step 1, with x_{k−1,l+1} = h, the solution, for some k ≤ l.

Proof: Assume that at the start of the basic k-iteration, for k ≤ l − 1, the directions

d_{k,l−k+1}, d_{k,l−k+2}, ..., d_{k,l}

are mutually M-conjugate and linearly independent; clearly this is true for k = 1, starting the induction. If we do not stop in step 1 this time, then x_{k−1,l+1} ≠ x_{k,0} and, since M is positive-definite, f(x_{k,0})