### Edit CGD derivation

parent 3d0842f9
...

$$\nabla f(x_i) \cdot u_j \neq 0$$

This says that the directional derivative at $$x_i$$ along $$u_j$$ is nonzero. So each successive line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle back and minimize along the previous direction again. This can lead to a very inefficient search.

This raises the question: can we find directions $$u_i$$ for which successive line minimizations don't disturb the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want

$$\nabla f(x_i) \cdot u_j = 0$$

If we can do this, after performing $$k$$ line minimizations, the resulting point $$x_k$$ will still be minimal along all $$k$$ directions considered so far. This implies that $$x_k$$ is the minimum within the subspace spanned by $$\{ u_1, \ldots, u_k \}$$. So each successive line search expands the space within which we've minimized by one dimension. And after $$n$$ minimizations, we've covered the whole space — the directional derivative at $$x_n$$ must be zero in all directions (i.e. $$\nabla f(x_n) = 0$$), so $$x_n$$ is a local minimum.

It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for other functions (after all, every function without a vanishing Hessian looks quadratic close to a local extremum). In the latter case, repeating the whole process multiple times yields better and better solutions.
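To make the inefficiency concrete, here is a small numerical illustration (my own toy example, not from the text), using NumPy and the coordinate directions $$e_1, e_2$$ on a skewed quadratic: after minimizing along $$e_1$$ and then $$e_2$$, the directional derivative along $$e_1$$ is nonzero again, so the search would have to revisit that direction.

```python
import numpy as np

# Toy quadratic f(x) = 1/2 x^T H x - b^T x (H and b chosen arbitrarily for illustration)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0])
grad = lambda x: H @ x - b

def line_min(x, u):
    # Exact line minimization of the quadratic along direction u
    alpha = -(grad(x) @ u) / (u @ H @ u)
    return x + alpha * u

e1, e2 = np.eye(2)
x1 = line_min(np.zeros(2), e1)
x2 = line_min(x1, e2)
print(grad(x2) @ e1)  # nonzero (roughly 0.833): the e2 step disturbed the e1 minimization
```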
### Derivation

...

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)$$

$$H$$ is the Hessian matrix at $$x_0$$. I assume we don't have any way to compute it directly, but it's important to consider its presence as we derive the algorithm. Notationally, I don't attach an $$x_0$$ to it since we will never consider the Hessian at any other location.

By differentiating, we find that

$$\nabla f(x_0 + x) = \nabla f(x_0) + H x$$

Now if we could compute $$H$$, we could set the gradient to zero and find the minimum $$x^*$$ directly by computing $$x^* = x_0 - H^{-1} \nabla f(x_0)$$. By our assumption, this is forbidden to us, but it still brings up an important point: if $$H$$ isn't invertible, there isn't a unique solution. The easiest way out is to assume that $$H$$ is positive definite. However, to handle the case of cubic or higher order minima, we need to relax this. Everything in the following derivation works even when the Hessian vanishes, as long as the line searches terminate — i.e. there's no direction where the search can go downhill forever, and you aren't unlucky enough to shoot one exactly along the floor of a flat valley. Better yet, if your line search is smart enough to quit for flat functions, then you just need to ensure you can't go downhill forever — i.e. $$H$$ is positive semidefinite.
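As a numerical aside (my own sketch with NumPy; the quadratic is made up), this is the one-step jump the text describes: with $$H$$ in hand, a single linear solve lands exactly on the minimum of a quadratic.

```python
import numpy as np

# Positive definite quadratic f(x) = 1/2 x^T H x - b^T x, so grad f(x) = H x - b
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0])
grad = lambda x: H @ x - b

x0 = np.zeros(2)
# If H were computable, x* = x0 - H^{-1} grad f(x0) in one step
# (solve the linear system rather than forming the inverse):
x_star = x0 - np.linalg.solve(H, grad(x0))
print(grad(x_star))  # essentially zero: x_star is the exact minimum
```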
For any$$1 \leq i \leq n$$, Moving along, define$$x_i = x_{i - 1} + \alpha_i u_i$$as before. For any$$1 \leq i \leq n$$,$$ \begin{aligned} ... ... @@ -150,7 +165,7 @@ $$\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0$$ So trivially, the following properties hold: So trivially, $$\{ u_1 \}$$ spans a subspace of dimension one, and the following properties hold: $$\begin{cases} ... ... @@ -162,8 +177,9 @@$$ #### Induction Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$, and that the following properties hold. Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$, that $$\{ u_1, \ldots, u_k \}$$ span a subspace of dimension $$k$$, and that the following properties hold: $$\begin{cases} ... ... @@ -180,9 +196,9 @@ u_{k + 1} = \nabla f(x_k) + \gamma_k u_k$$ for some undetermined scalar $$\gamma_k$$. As before, if the gradient at $$x_k$$ is zero, then that is the minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, for no value of $$\gamma_k$$ is $$u_{k + 1}$$ zero. Finally, since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, $$u_{k + 1}$$ is not a linear combination of the prior $$u_j$$. is a minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, no value of $$\gamma_k$$ can make $$u_{k + 1}$$ zero. And since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, no value of $$\gamma$$ can make $$u_{k + 1}$$ a linear combination of the prior $$u_j$$. Our primary concern is that $$u_{k + 1}^T H u_k = 0$$. So we expand ... ... @@ -196,8 +212,9 @@ u_{k + 1}^T H u_k \\ \end{aligned} $$If$$H$$is positive definite,$$u_k^T H u_k \neq 0$$. And even if$$H$$is not positive definite, we will soon rewrite the denominator, and it will be clear that it is nonzero. So we can solve If$$H$$is positive definite, by definition$$u_k^T H u_k \neq 0$$. 
Even without this assumption, though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and solve

$$\gamma_k = - \frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}$$

...

for $$1 \leq j \leq k$$. So finally we must show the orthogonality of the gradients. For any $$1 \leq j \leq k$$,

$$\nabla f(x_{k + 1}) \cdot \nabla f(x_j) = \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j) = 0$$

And

$$\nabla f(x_{k + 1}) \cdot \nabla f(x_0) = \nabla f(x_{k + 1}) \cdot u_1 = 0$$

Thus we have proven

...
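The whole construction can be exercised end to end on a quadratic. This is a hedged sketch (NumPy, with a random positive definite $$H$$ of my choosing): it applies $$u_{k+1} = \nabla f(x_k) + \gamma_k u_k$$ with the $$\gamma_k$$ just derived, and uses $$H$$ directly only to verify the conclusions numerically — in practice the derivation goes on to rewrite $$\gamma_k$$ in terms of gradients alone.

```python
import numpy as np

# Run the construction on f(x) = 1/2 x^T H x - b^T x with exact line searches,
# then check that the directions came out H-conjugate and that x_n is the minimum.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A.T @ A + n * np.eye(n)    # symmetric positive definite Hessian
b = rng.standard_normal(n)
grad = lambda x: H @ x - b

x = np.zeros(n)
u = grad(x)                    # u_1 = grad f(x_0)
dirs = []
for _ in range(n):
    alpha = -(grad(x) @ u) / (u @ H @ u)      # exact line minimization along u
    x = x + alpha * u
    dirs.append(u)
    gamma = -(grad(x) @ (H @ u)) / (u @ H @ u)  # gamma_k from the derivation
    u = grad(x) + gamma * u                     # u_{k+1} = grad f(x_k) + gamma_k u_k

# All constructed directions are pairwise H-conjugate...
for i in range(n):
    for j in range(i):
        assert abs(dirs[i] @ H @ dirs[j]) < 1e-8
# ...and after n line minimizations we've reached the exact minimum.
assert np.allclose(x, np.linalg.solve(H, b))
```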