Commit 9ea9cf49 authored by Erik Strand

Edit CGD derivation

parent 3d0842f9
@@ -40,24 +40,28 @@ $$
\nabla f(x_i) \cdot u_j \neq 0
$$
So each successive line minimization undoes the work of the previous ones, in the sense that it's
necessary to cycle back and minimize along the previous direction again. This can lead to a very
inefficient search.
This says that the directional derivative at $$x_i$$ along $$u_j$$ is nonzero. So each successive
line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle
back and minimize along the previous direction again. This can lead to a very inefficient search.
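To see how bad this can get, here's a small numerical sketch (my own toy example, not part of the text): cyclic exact line minimizations along the coordinate axes crawl down a narrow quadratic valley.

```python
import numpy as np

# Toy illustration: cyclic exact line minimization along the coordinate axes
# on f(x) = 1/2 x^T A x, a quadratic with a long, narrow valley. Every step is
# optimal along its own axis, yet each one disturbs the previous minimization,
# so progress toward the true minimum at the origin is very slow.
A = np.array([[1.0, 0.99],
              [0.99, 1.0]])
x = np.array([1.0, 1.0])
for sweep in range(20):
    for j in range(2):
        # exact minimizer of f along e_j with the other coordinate held fixed
        x[j] = -(A[j] @ x - A[j, j] * x[j]) / A[j, j]
print(x)  # after 40 line minimizations, still roughly 2/3 of the way from the minimum
```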
This begs the question: can we find directions $$u_i$$ for which successive line minimizations don't
undo the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want
disturb the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want
$$
\nabla f(x_i) \cdot u_j = 0
$$
If we can do this, after performing all $$n$$ line minimizations, the resulting point will still be
minimal along all $$n$$ directions. So as long as the $$u_i$$ span $$\mathbb{R}^n$$, the function
will be minimized along all possible directions; i.e. $$x_n$$ is a local minimum.
If we can do this, after performing $$k$$ line minimizations, the resulting point $$x_k$$ will still
be minimal along all $$k$$ directions considered so far. This implies that $$x_k$$ is the minimum
within the subspace spanned by $$\{u_1, \ldots, u_k\}$$. So each successive line search expands the
space within which we've minimized by one dimension. And after $$n$$ minimizations, we've covered
the whole space &mdash; the directional derivative at $$x_n$$ must be zero in all directions (i.e.
$$\nabla f(x_n) = 0$$), so $$x_n$$ is a local minimum.
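To spell out the subspace claim: any direction $$v$$ in the span of the first $$k$$ search directions is a linear combination $$v = \sum_{j = 1}^{k} c_j u_j$$, so by the property above its directional derivative at $$x_k$$ vanishes as well.
$$
\nabla f(x_k) \cdot v = \sum_{j = 1}^{k} c_j \, \nabla f(x_k) \cdot u_j = 0
$$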
It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for
other functions (after all, every function looks quadratic close to a local extremum). In the latter
case, repeating the whole process multiple times yields better and better solutions.
other functions (after all, any smooth function with a nonvanishing Hessian looks quadratic
close to a local extremum). In the latter case, repeating the whole process multiple times yields
better and better solutions.
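To make this concrete before the derivation, here's a minimal numpy sketch of the quadratic case. It's my own illustration, not code from the text: it uses the usual descent-direction sign convention and the Fletcher-Reeves form of the coefficient (a standard Hessian-free choice), and with exact line searches it lands, up to floating point error, on the minimum of an $$n$$-dimensional quadratic after $$n$$ steps.

```python
import numpy as np

# Minimal sketch: conjugate gradient descent on the quadratic
# f(x) = 1/2 x^T A x - b^T x, where the exact line minimizer is available in
# closed form. After n line minimizations it lands on the minimum.
rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # a positive definite Hessian
b = rng.standard_normal(n)

x = np.zeros(n)
grad = A @ x - b                      # gradient of f at x
d = -grad                             # first direction: steepest descent
for k in range(n):
    alpha = -(grad @ d) / (d @ A @ d)             # exact line minimization along d
    x = x + alpha * d
    new_grad = A @ x - b
    beta = (new_grad @ new_grad) / (grad @ grad)  # Fletcher-Reeves coefficient
    d = -new_grad + beta * d                      # next (conjugate) search direction
    grad = new_grad

print(np.linalg.norm(A @ x - b))      # near machine precision: x is the minimum
```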
### Derivation
@@ -71,6 +75,10 @@ $$
f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)
$$
$$H$$ is the Hessian matrix at $$x_0$$. I assume we don't have any way to compute it directly, but
it still plays an important role in the derivation. Notationally, I don't attach an
$$x_0$$ to it since we will never consider the Hessian at any other location.
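As a quick sanity check of this expansion, here's a toy numerical example of my own (an arbitrary smooth function, nothing to do with the algorithm itself): the quadratic model built from the gradient and Hessian at $$x_0$$ matches $$f$$ up to an error that shrinks like the cube of the step size.

```python
import numpy as np

# Sanity check of the second order model: for a smooth (non-quadratic) function,
# the quadratic model built at x0 matches f to third order in the step size.
def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2 + np.sin(x[1])

def grad_f(x):
    return np.array([np.exp(x[0]) + x[1] ** 2, 2 * x[0] * x[1] + np.cos(x[1])])

def hess_f(x):
    return np.array([[np.exp(x[0]), 2 * x[1]],
                     [2 * x[1], 2 * x[0] - np.sin(x[1])]])

x0 = np.array([0.3, -0.7])
d = np.array([1.0, 2.0])
for t in [1e-1, 1e-2, 1e-3]:
    step = t * d
    model = f(x0) + grad_f(x0) @ step + 0.5 * step @ hess_f(x0) @ step
    print(t, abs(f(x0 + step) - model))   # error shrinks by ~1000x per 10x smaller t
```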
By differentiating, we find that
$$
@@ -83,12 +91,19 @@ $$
\nabla f(x_0 + x) = \nabla f(x_0) + H x
$$
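One step worth spelling out in that differentiation: the gradient of the quadratic term involves $$H + H^T$$, which collapses because the Hessian is symmetric.
$$
\nabla \left( \frac{1}{2} (x - x_0)^T H (x - x_0) \right) = \frac{1}{2} \left( H + H^T \right) (x - x_0) = H (x - x_0)
$$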
Now if we could compute the Hessian matrix $$H$$, we could set the gradient to zero and compute the
minimum $$x^* $$ exactly by solving $$H (x^* - x_0) = -\nabla f(x_0)$$. We will assume we can only
compute function values and gradients, but this still brings up an important point: if $$H$$ doesn't
have full rank, there isn't a unique solution. We will assume that $$H$$ is positive definite.
Now if we could compute $$H$$, we could set the gradient to zero and find the minimum $$x^* $$
directly: $$x^* = x_0 - H^{-1} \nabla f(x_0)$$. Our assumption rules this out, but it still brings
up an important point: if $$H$$ isn't invertible, there isn't a unique solution.
The easiest way out is to assume that $$H$$ is positive definite. However, to handle the case of
cubic or higher order minima, we need to relax this. Everything in the following derivation works
even when the Hessian vanishes, as long as the line searches terminate &mdash; i.e. there's no
direction along which the function decreases forever, and you aren't unlucky enough to shoot a
search direction exactly along the floor of a flat valley. Better yet, if your line search is smart
enough to quit on flat functions, then you just need to ensure the function can't decrease forever
&mdash; i.e. $$H$$ is positive semidefinite.
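For contrast, here's what the direct route we've just ruled out would look like on a small hypothetical example of my own: with $$H$$ in hand, minimizing the model is a single linear solve, which is exactly the computation we're trying to avoid.

```python
import numpy as np

# Hypothetical example of the direct approach we've ruled out: if H were
# available, minimizing the quadratic model is one linear solve (a Newton step).
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])               # a positive definite Hessian
grad0 = np.array([1.0, -1.0])            # gradient of f at x0
x0 = np.array([0.0, 0.0])
x_star = x0 - np.linalg.solve(H, grad0)  # solves H (x* - x0) = -grad f(x0)
print(x_star)
```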
For any $$1 \leq i \leq n$$,
Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ as before. For any $$1 \leq i \leq n$$,
$$
\begin{aligned}
@@ -150,7 +165,7 @@ $$
\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0
$$
So trivially, the following properties hold:
So trivially, $$\{ u_1 \}$$ spans a subspace of dimension one, and the following properties hold:
$$
\begin{cases}
@@ -162,8 +177,9 @@ $$
#### Induction
Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$, and
that the following properties hold.
Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$,
that $$\{ u_1, \ldots, u_k \}$$ spans a subspace of dimension $$k$$, and that the following
properties hold:
$$
\begin{cases}
@@ -180,9 +196,9 @@ u_{k + 1} = \nabla f(x_k) + \gamma_k u_k
$$
for some undetermined scalar $$\gamma_k$$. As before, if the gradient at $$x_k$$ is zero, then that
is the minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, for no value of $$\gamma_k$$ is
$$u_{k + 1}$$ zero. Finally, since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$,
$$u_{k + 1}$$ is not a linear combination of the prior $$u_j$$.
is a minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, no value of $$\gamma_k$$ can make
$$u_{k + 1}$$ zero. And since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, no value
of $$\gamma_k$$ can make $$u_{k + 1}$$ a linear combination of the prior $$u_j$$.
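To make the independence argument explicit: dotting with $$\nabla f(x_k)$$ distinguishes $$u_{k + 1}$$ from anything in the span of the prior directions, since by hypothesis $$\nabla f(x_k) \cdot u_j = 0$$ for $$1 \leq j \leq k$$, while
$$
\nabla f(x_k) \cdot u_{k + 1} = \nabla f(x_k) \cdot \nabla f(x_k) + \gamma_k \, \nabla f(x_k) \cdot u_k = \nabla f(x_k) \cdot \nabla f(x_k) \neq 0
$$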
Our primary concern is ensuring that $$u_{k + 1}^T H u_k = 0$$. So we expand
@@ -196,8 +212,9 @@ u_{k + 1}^T H u_k \\
\end{aligned}
$$
If $$H$$ is positive definite, $$u_k^T H u_k \neq 0$$. And even if $$H$$ is not positive definite, we
will soon rewrite the denominator, and it will be clear that it is nonzero. So we can solve
If $$H$$ is positive definite, then since $$u_k \neq 0$$ we know $$u_k^T H u_k \neq 0$$. Even without this assumption,
though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and
solve
$$
\gamma_k = - \frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}
@@ -258,14 +275,18 @@ $$
for $$1 \leq j \leq k$$.
So finally we must show the orthogonality of the gradients. For any $$0 \leq j \leq k$$,
So finally we must show the orthogonality of the gradients. For any $$1 \leq j \leq k$$,
$$
\begin{aligned}
\nabla f(x_{k + 1}) \cdot \nabla f(x_j)
&= \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j) \\
&= 0
\end{aligned}
= \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j)
= 0
$$
And
$$
\nabla f(x_{k + 1}) \cdot \nabla f(x_0) = \nabla f(x_{k + 1}) \cdot u_1 = 0
$$
Thus we have proven
......