# Gradient-Based Optimization

No gradient information was needed in any of the methods discussed in Section 4.1. In some optimization problems, it is possible to compute the gradient of the objective function, and this information can be used to guide the optimizer toward more efficient optimization. Like the non-gradient optimization methods, the gradient-based methods are in general iterative methods that use the same update equation, Eq. (4.3). The step size $a$ is computed in a similar way to the non-gradient methods. The search direction vector, however, is computed from gradient information when using gradient-based methods. This section presents some of the most popular gradient-based optimization methods.

## Steepest Descent Method

The search direction vector at an iterate $k$ is set to the negative of the objective function gradient at that iterate; that is:

$$d^k = -\nabla f(x^k) = -g^k$$
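The idea can be sketched in a few lines of Python. This is a minimal sketch, not the text's implementation: the function name `steepest_descent` and the fixed step size are my assumptions; the text instead computes the step by a line search (e.g. the Golden Section Method).

```python
import numpy as np

def steepest_descent(grad, x0, step=0.1, tol=1e-6, max_iter=10_000):
    """Minimize f by stepping along d^k = -grad f(x^k).

    A fixed step size is used for simplicity; a line search
    (e.g. the Golden Section Method) would normally choose it.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # termination condition
            break
        x = x - step * g              # move along the negative gradient
    return x

# Illustrative bowl f(x) = (x1 - 1)^2 + (x2 + 2)^2 with minimum at (1, -2)
xmin = steepest_descent(lambda x: np.array([2.0 * (x[0] - 1.0),
                                            2.0 * (x[1] + 2.0)]),
                        x0=[0.0, 0.0])
```
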

## Conjugate Gradient Method

Consider a quadratic function $f(x)$ of $n$ variables; $f$ can be expressed in matrix form as:

$$f(x) = \frac{1}{2} x^T [Q] x + b^T x + c$$

In Section 1.1.3.2, it was shown that it is possible to construct $n$ Q-conjugate (Q-orthogonal) directions $d^i,\ i = 1,\cdots,n$. Let us express $x$ as a linear combination of these directions:

$$x = \sum_{i=1}^{n} a^i d^i \quad (4.16)$$

Hence we can write the function $f(x)$ as:

$$f(x) = \frac{1}{2} \left( \sum_{i=1}^{n} a^i d^i \right)^T [Q] \left( \sum_{j=1}^{n} a^j d^j \right) + b^T \sum_{i=1}^{n} a^i d^i + c$$

The term $f$ can be expanded as follows:

$$f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a^i a^j\, d^{i,T} [Q] d^j + \sum_{i=1}^{n} a^i b^T d^i + c = \frac{1}{2} \sum_{i=1}^{n} (a^i)^2\, d^{i,T} [Q] d^i + \sum_{i=1}^{n} a^i b^T d^i + c \quad (4.17)$$

Note that it was possible to obtain the final result in Eq. (4.17) because $d^{i,T} [Q] d^j = 0,\ \forall i \neq j$ (the vectors $d^i$ and $d^j$ are Q-orthogonal). Hence we can express the function $f(x)$ as:

$$f(x) = \sum_{i=1}^{n} \left( \frac{1}{2} (a^i)^2\, d^{i,T} [Q] d^i + a^i b^T d^i \right) + c \quad (4.18)$$

The result obtained in Eq. (4.18) is significant; it shows that $f(x)$ is separable in $a$; that is, $f(x)$ is a summation of terms, each of which is a function of its own variable. Hence, when minimizing $f(x)$, it is possible to minimize each term independently. In other words, we can write:

$$\min_{x} f(x) = \sum_{i=1}^{n} \min_{a^i} K_i(a^i) + c, \quad \text{where } K_i(a^i) = \frac{1}{2} (a^i)^2\, d^{i,T} [Q] d^i + a^i b^T d^i$$

So, we have converted the optimization of one function of $n$ variables into the optimization of $n$ functions, each of one variable (a one-dimensional problem).

The Conjugate Directions Method for optimization uses Eq. (4.18) in a recursive manner, starting by optimizing $K_1(a^1)$, then $K_2(a^2)$, and so on. In each iteration, the optimization of the variable $a^k$ can be carried out numerically using methods such as the Golden Section Method described in Section 3.2. The solution in Eq. (4.16) can be constructed recursively, and hence the update equation for the solution can be written in the following form:

$$x^{k+1} = x^k + a^k d^k \quad (4.20)$$

It is also possible to assume that $x = x^0 + \sum_{i=1}^{n} a^i d^i$, and we can show that the above conclusion is still valid; that is, we can solve the optimization problem by solving $n$ one-dimensional optimization problems.
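The separability argument can be checked numerically. The sketch below is illustrative, with an assumed 2×2 symmetric positive-definite $[Q]$ and vector $b$: it builds two Q-conjugate directions, minimizes each one-dimensional term independently, and reassembles the full minimizer.

```python
import numpy as np

# Assumed example data: f(x) = 0.5 x^T [Q] x + b^T x with a 2x2 SPD [Q]
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])

# Build Q-conjugate directions by Gram-Schmidt Q-orthogonalization of e1, e2
d1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
d2 = e2 - (e2 @ Q @ d1) / (d1 @ Q @ d1) * d1
assert abs(d1 @ Q @ d2) < 1e-12       # d1 and d2 are Q-orthogonal

# Each 1-D term K_i(a^i) = 0.5 (a^i)^2 d^iT [Q] d^i + a^i b^T d^i is
# minimized independently at a^i = -(b^T d^i) / (d^iT [Q] d^i)
a = [-(b @ d) / (d @ Q @ d) for d in (d1, d2)]
x = a[0] * d1 + a[1] * d2             # reassemble x = sum_i a^i d^i
```

The reassembled `x` satisfies the full first-order condition $[Q]x + b = 0$, even though each coefficient was found in isolation.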

*Expanding Manifold Property*

Define the subspace $G^k = \left\{ x \mid x = x^0 + \sum_{i=1}^{k} a^i d^i \right\}$ after we have performed $k$ steps of the Conjugate Directions Method ($k < n$). Then we can show that:

At any step (iteration) $k$, the current iterate $x^k$ is the optimal solution for the function $f(x)$ in the subspace $G^k$. If we define the vector $a \in \mathbb{R}^k$, $k < n$, and $[D] = [d^1\ d^2\ \cdots\ d^k]$, then we can write:

$$x = x^0 + [D] a$$

Hence,

$$x^k = x^0 + [D] a^*$$

where $a^*$ is the optimal $a$. Let $\nabla_a f$ be the gradient of $f$ over $a$. Then the necessary condition for optimality is:

$$\nabla_a f = [D]^T \nabla f\left(x^0 + [D] a\right) = 0 \quad (4.24)$$

Expanding Eq. (4.24), we get:

$$d^{i,T} g^k = 0, \quad i = 1, \cdots, k \quad (4.25)$$

The result in Eq. (4.25) shows that at an iteration $k$, the gradient of the function $f$ at the current iterate $x^k$ is orthogonal to all previous directions $d^i,\ \forall i = 1,\cdots,k$.

*Using the Gradient Information*

In the above development, we constructed the Q-orthogonal vectors $d^i,\ \forall i = 1,\cdots,n$. As discussed in Section 1.1.3.2, it is possible to construct the $d$ vectors given a set of linearly independent vectors in $\mathbb{R}^n$. These vectors are here selected to be the negatives of the gradient vectors at each iteration, as detailed below. Consider the minimization of the quadratic function $f(x) = \frac{1}{2} x^T [Q] x + b^T x + c$; then the gradient of $f$ at $x^k$ is:

$$g^k = \nabla f(x^k) = [Q] x^k + b \quad (4.26)$$

In such a case, at each new iterate computed using Eq. (4.20), the gradient vector $g^{k+1}$ is computed and used to compute $d^{k+1}$ in a Gram-Schmidt Q-orthogonalization process (see Section 1.1.3.2) as follows:

$$d^{k+1} = -g^{k+1} + \sum_{i=1}^{k} \frac{g^{k+1,T} [Q] d^i}{d^{i,T} [Q] d^i}\, d^i \quad (4.27)$$

Note that in evaluating the search step *a* in Eq. (4.20), no numerical approach is needed; an exact line search can be performed to find the optimal *a.* The above process for optimizing the function / is the Conjugate Gradient Method.
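The exact line search for a quadratic has a closed form, which the short sketch below illustrates; the function name `exact_step` and the example data are assumptions for demonstration only.

```python
import numpy as np

def exact_step(Q, g, d):
    """Exact line-search step along d for f(x) = 0.5 x^T Q x + b^T x + c.

    Setting d/da f(x + a d) = (g + a Q d)^T d = 0 gives
    a* = -(g^T d) / (d^T Q d), where g is the gradient at x.
    """
    return -(g @ d) / (d @ Q @ d)

# Assumed example: steepest-descent direction from the origin
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([-2.0, -8.0])            # gradient is g(x) = Q x + b
x = np.zeros(2)
g = Q @ x + b
d = -g
x_new = x + exact_step(Q, g, d) * d   # slope along d vanishes at x_new
```

At `x_new` the directional derivative along `d` is zero, which is exactly the optimality condition the one-dimensional search enforces.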

*Simplifications*

Equation (4.27) can be rewritten so that the matrix $[Q]$ is eliminated. From the update equation, $x^{k+1} = x^k + a^k d^k$ (Eq. (4.29)), we can solve for $d^k$:

$$d^k = \frac{1}{a^k}\left(x^{k+1} - x^k\right) \quad (4.30)$$

Using Eq. (4.26), $[Q]\left(x^{k+1} - x^k\right) = g^{k+1} - g^k$. Hence,

$$[Q] d^k = \frac{1}{a^k}\left(g^{k+1} - g^k\right) \quad (4.31)$$

Substituting this result into Eq. (4.27), we get:

$$d^{k+1} = -g^{k+1} + \sum_{i=1}^{k} \frac{g^{k+1,T}\left(g^{i+1} - g^i\right)}{d^{i,T}\left(g^{i+1} - g^i\right)}\, d^i \quad (4.32)$$

Note that each coefficient in the summation in Eq. (4.32) is a scalar; call the coefficient of the term $i = k$ $\theta^k$. The expanding manifold property discussed above showed that $g^{k+1}$ is orthogonal to all of $d^1, \cdots, d^k$, and also that $g^{k+1}$ is orthogonal to all of $g^1, g^2, \cdots, g^k$. (To see this, consider a 3-dimensional space $\mathbb{R}^3$: the gradient $g^2$ is orthogonal to $d^0$ and $d^1$. The vector $g^0$ is $-d^0$, hence $g^2$ is also orthogonal to $g^0$. From Eq. (4.32), the vector $g^1$ lies in the same plane as the vectors $d^0$ and $d^1$; since $g^2$ is orthogonal to both $d^0$ and $d^1$, then $g^2$ is orthogonal to $g^1$. Hence, $g^2$ is orthogonal to both $g^1$ and $g^0$.) As a result of this orthogonality, all the terms in the summation in Eq. (4.32) vanish except the last term, where $i = k$. Hence we can rewrite Eq. (4.32) as:

$$d^{k+1} = -g^{k+1} + \frac{g^{k+1,T}\left(g^{k+1} - g^k\right)}{d^{k,T}\left(g^{k+1} - g^k\right)}\, d^k = -g^{k+1} + \theta^k d^k \quad (4.33)$$

Note further that in the denominator of Eq. (4.33), the vector $g^{k+1}$ is orthogonal to $d^k$, and hence their scalar product vanishes. The remaining term in the denominator is $-d^{k,T} g^k = -\left(-g^k + \theta^{k-1} d^{k-1}\right)^T g^k = g^{k,T} g^k$ ($g^k$ is orthogonal to $d^{k-1}$).

Hence we can update the search direction using the equation:

$$d^{k+1} = -g^{k+1} + \theta^k d^k, \quad \theta^k = \frac{g^{k+1,T}\left(g^{k+1} - g^k\right)}{g^{k,T} g^k} \quad (4.34)$$

Using Eq. (4.34) to compute the search directions is known as the Polak-Ribière Method.

The numerator of Eq. (4.34) can be simplified further by realizing that $g^{k+1}$ is orthogonal to $g^k$, and hence their scalar product vanishes. Hence we can write:

$$\theta^k = \frac{g^{k+1,T} g^{k+1}}{g^{k,T} g^k} \quad (4.35)$$

Using Eq. (4.35) to compute the search directions is known as the Fletcher-Reeves Method.

Both the Fletcher-Reeves and the Polak-Ribière methods are Conjugate Gradient methods; as can be seen from the above details, the two are equivalent when optimizing a quadratic function. In the more general case of a non-quadratic objective function, the two methods behave differently. **Algorithm 4.1** shows the Conjugate Gradient Algorithm.

**Algorithm 4.1** Conjugate Gradient Algorithm

Select an initial guess for the solution $x^0$, and set $k = 0$
Compute $g^0 = \nabla f(x^0)$
Compute $d^0 = -g^0$
**while** a termination condition is not satisfied **do**
Compute $x^{k+1}(a^k) = x^k + a^k d^k$
Find the optimal $a^k$ that optimizes $f(a^k) = f(x^{k+1})$
Compute the new point $x^{k+1}$ and the new gradient $g^{k+1} = \nabla f(x^{k+1})$
Use the Fletcher-Reeves or the Polak-Ribière method to compute $\theta^k$
Compute the new search direction $d^{k+1} = -g^{k+1} + \theta^k d^k$
$k = k + 1$
**end while**
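Algorithm 4.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the name `conjugate_gradient` is mine, and the exact line search of the text is replaced by a bisection on the directional derivative, which suffices for illustration.

```python
import numpy as np

def conjugate_gradient(grad, x0, variant="fletcher-reeves",
                       tol=1e-8, max_iter=100, ls_steps=60):
    """Nonlinear conjugate gradient in the spirit of Algorithm 4.1.

    The exact line search is approximated by bisection on the slope
    d/da f(x + a d) = grad(x + a d) @ d, assuming f is convex along d.
    """
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:          # termination condition
            break
        lo, hi = 0.0, 1.0
        while grad(x + hi * d) @ d < 0.0:    # bracket the minimizer along d
            hi *= 2.0
        for _ in range(ls_steps):            # bisect on the slope sign
            mid = 0.5 * (lo + hi)
            if grad(x + mid * d) @ d < 0.0:
                lo = mid
            else:
                hi = mid
        a = 0.5 * (lo + hi)
        x_new = x + a * d
        g_new = grad(x_new)
        if variant == "fletcher-reeves":     # Fletcher-Reeves theta^k
            theta = (g_new @ g_new) / (g @ g)
        else:                                # Polak-Ribiere theta^k
            theta = (g_new @ (g_new - g)) / (g @ g)
        d = -g_new + theta * d               # new search direction
        x, g = x_new, g_new
    return x

# Illustrative quadratic: f(x) = 0.5 x^T Q x + b^T x, minimizer solves Q x = -b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, -1.0])
xmin = conjugate_gradient(lambda x: Q @ x + b, np.zeros(2))
```

On a quadratic, both variants reach the minimizer in $n$ iterations (up to the line-search tolerance), consistent with the derivation above.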

## Variable Metric Methods

At any iteration $k$, the Conjugate Gradient methods use gradient information from one previous iteration. It makes sense that algorithms that use the whole history of gradients (which are already computed) would be more efficient. The category of methods that do so is called Variable Metric Methods. The *Metric* matrix is a matrix that collects the gradient information computed over all the previous iterations. The Variable Metric Methods stand on a solid theoretical foundation, and they have practical convergence properties.

**Section 4.3** presents second order methods that need the Hessian matrix for optimization; they have quadratic convergence characteristics. The Variable Metric Methods behave like second order methods, yet the Hessian matrix need not be computed. In fact, the Hessian matrix can be a by-product of a Variable Metric Method, as will be seen below. The Variable Metric Methods are also called quasi-Newton Methods. As before, they update the solution at each iteration using the following equation:

$$x^{k+1} = x^k + a^k s^k \quad (4.36)$$

where the search direction $s^k$ is computed as a function of the gradients.

*Davidon-Fletcher-Powell (DFP) Method*

The search direction is updated at an iteration $k$ using the equation:

$$s^k = -[A^k] g^k$$

where the matrix $[A]$ is initialized by the user and is updated at each iteration using the process outlined in Algorithm 4.2.

**Algorithm 4.2** Davidon-Fletcher-Powell (DFP) Algorithm

Select an initial guess for the solution $x^0$, and the matrix $[A^0]$. Set $k = 0$
Compute $g^0 = \nabla f(x^0)$
Compute $s^0 = -[A^0] g^0$
**while** a termination condition is not satisfied **do**
Compute $x^{k+1}(a^k) = x^k + a^k s^k$
Find the optimal $a^k$ that optimizes $f(a^k) = f(x^{k+1})$
Compute the new solution $x^{k+1}$, and the new gradient $g^{k+1} = \nabla f(x^{k+1})$
Compute the vector $\Delta x = x^{k+1} - x^k$
Compute the vector $\Delta g = g^{k+1} - g^k$
Compute the matrix $[D] = \dfrac{\Delta x\, \Delta x^T}{\Delta x^T \Delta g}$
Compute the matrix $[C] = -\dfrac{[A^k] \Delta g\, \Delta g^T [A^k]^T}{\Delta g^T [A^k] \Delta g}$
Update the current $[A]$ matrix: $[A^{k+1}] = [A^k] + [C] + [D]$
$k = k + 1$
**end while**

The [A] matrix converges to the inverse of the Hessian matrix in the DFP method.
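The DFP update can be illustrated with a short sketch. This is not the textbook's code: the name `dfp` is an assumption, and the exact line search is evaluated in closed form, which assumes a quadratic objective with gradient $g(x) = [Q]x + b$.

```python
import numpy as np

def dfp(grad, x0, A0=None, tol=1e-8, max_iter=50):
    """Sketch of the DFP update; [A] approximates the inverse Hessian.

    The exact line-search step is computed in closed form, which
    assumes the objective is quadratic (gradient is affine in x).
    """
    x = np.asarray(x0, dtype=float)
    A = np.eye(x.size) if A0 is None else np.asarray(A0, dtype=float)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:       # termination condition
            break
        s = -A @ g                        # s^k = -[A^k] g^k
        Qs = grad(x + s) - g              # equals [Q] s for a quadratic
        a = -(g @ s) / (s @ Qs)           # exact line-search step
        x_new = x + a * s
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g     # Delta x and Delta g
        D = np.outer(dx, dx) / (dx @ dg)
        C = -(A @ np.outer(dg, dg) @ A.T) / (dg @ A @ dg)
        A = A + C + D                     # [A^{k+1}] = [A^k] + [C] + [D]
        x, g = x_new, g_new
    return x, A

# Assumed quadratic example: gradient g(x) = Q x + b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, -1.0])
xmin, A = dfp(lambda x: Q @ x + b, np.zeros(2))
```

After convergence on this quadratic, `A` matches the inverse of `Q`, illustrating the remark above about the $[A]$ matrix.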

*Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method*

The search direction is updated at an iteration $k$ using the equation:

$$s^k = -[A^k]^{-1} g^k$$

where the matrix $[A]$ is initialized by the user and is updated at each iteration using the process outlined in **Algorithm 4.3**. The $[A]$ matrix converges to the Hessian matrix in the BFGS method.

**Algorithm 4.3** Broyden-Fletcher-Goldfarb-Shanno (BFGS) Algorithm

Select an initial guess for the solution $x^0$, and the matrix $[A^0]$. Set $k = 0$
Compute $g^0 = \nabla f(x^0)$
Compute $s^0 = -[A^0]^{-1} g^0$
**while** a termination condition is not satisfied **do**
Compute $x^{k+1}(a^k) = x^k + a^k s^k$
Find the optimal $a^k$ that optimizes $f(a^k) = f(x^{k+1})$
Compute the new solution $x^{k+1}$, and the new gradient $g^{k+1} = \nabla f(x^{k+1})$
Compute the vector $\Delta x = x^{k+1} - x^k$
Compute the vector $\Delta g = g^{k+1} - g^k$
Compute the matrix $[D] = \dfrac{\Delta g\, \Delta g^T}{\Delta g^T \Delta x}$
Compute the matrix $[C] = \dfrac{g^k g^{k,T}}{g^{k,T} s^k}$
Update the current $[A]$ matrix: $[A^{k+1}] = [A^k] + [C] + [D]$
$k = k + 1$
**end while**
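A BFGS sketch mirrors the DFP one, with $[A]$ now approximating the Hessian itself. As before, this is a minimal illustration under assumptions: the name `bfgs` is mine, the line search is evaluated in closed form for a quadratic, and the rank-one $[C]$ term uses the simplification that holds under an exact line search.

```python
import numpy as np

def bfgs(grad, x0, A0=None, tol=1e-8, max_iter=50):
    """Sketch of the BFGS update; [A] approximates the Hessian.

    The exact line-search step is computed in closed form, which
    assumes the objective is quadratic (gradient is affine in x).
    """
    x = np.asarray(x0, dtype=float)
    A = np.eye(x.size) if A0 is None else np.asarray(A0, dtype=float)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:       # termination condition
            break
        s = np.linalg.solve(A, -g)        # s^k = -[A^k]^{-1} g^k
        Qs = grad(x + s) - g              # equals [Q] s for a quadratic
        a = -(g @ s) / (s @ Qs)           # exact line-search step
        x_new = x + a * s
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g     # Delta x and Delta g
        D = np.outer(dg, dg) / (dg @ dx)
        C = np.outer(g, g) / (g @ s)      # exact-line-search form of [C]
        A = A + C + D                     # [A^{k+1}] = [A^k] + [C] + [D]
        x, g = x_new, g_new
    return x, A

# Assumed quadratic example: gradient g(x) = Q x + b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, -1.0])
xmin, A = bfgs(lambda x: Q @ x + b, np.zeros(2))
```

On this quadratic, the converged `A` equals `Q`, matching the statement that the $[A]$ matrix converges to the Hessian in the BFGS method.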

# Second Order Methods

The update equation in a second order method is the same as the update equation in the Variable Metric Methods, Eq. (4.36). The update matrix $[A]$ in the DFP and BFGS algorithms (both are first order methods) is computed as a function of the gradient information. In a second order method, however, the update matrix is the Hessian matrix; that is, the search direction is updated as follows:

$$s^k = -[H(x^k)]^{-1} g^k$$

Clearly, this method requires computing the Hessian matrix; hence it may not be practical in many applications due to the associated computational cost. The second order method is outlined in **Algorithm 4.4**.

**Example 4.2.** Find the minimum of the function $f(x_1, x_2)$, where

**Algorithm 4.4** Algorithm for the Second Order Method

Select an initial guess for the solution $x^0$. Set $k = 0$
**while** a termination condition is not satisfied **do**
Compute $g^k = \nabla f(x^k)$
Compute the Hessian matrix $[H(x^k)]$
Compute $s^k = -[H(x^k)]^{-1} g^k$
Compute $x^{k+1}(a^k) = x^k + a^k s^k$
Find the optimal $a^k$ that optimizes $f(a^k) = f(x^{k+1})$
Compute the new solution $x^{k+1}$, and the new gradient $g^{k+1} = \nabla f(x^{k+1})$
$k = k + 1$
**end while**

Solution: The gradient of the function $f(x_1, x_2)$ is:

The Hessian of the function $f(x_1, x_2)$ is:

Assume an initial guess of $x_1 = -2$ and $x_2 = -3$. The first update is:

The solution is obtained after one iteration for this second order (quadratic) function. Try using any other initial guess.
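The one-iteration behavior can be demonstrated numerically. The function of Example 4.2 is not reproduced here, so the sketch below uses an assumed stand-in quadratic $f(x_1, x_2) = x_1^2 + 2x_2^2 + x_1 x_2 - x_2$; any quadratic exhibits the same behavior.

```python
import numpy as np

# Stand-in quadratic (NOT Example 4.2's function, which is assumed here):
# f(x1, x2) = x1^2 + 2 x2^2 + x1 x2 - x2
def grad(x):
    return np.array([2.0 * x[0] + x[1], 4.0 * x[1] + x[0] - 1.0])

def hessian(x):
    return np.array([[2.0, 1.0], [1.0, 4.0]])   # constant for a quadratic

x = np.array([-2.0, -3.0])                      # initial guess as in the text
s = np.linalg.solve(hessian(x), -grad(x))       # s^0 = -[H]^{-1} g^0
x = x + s                                       # optimal a^0 = 1 for a quadratic

# A single Newton step lands on the point where the gradient vanishes.
```

The gradient at the updated point is zero to machine precision, regardless of the initial guess, which is the behavior the example highlights.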