Erik Strand / nmm_2020_site · Commits
Commit 9ea9cf49, authored Apr 14, 2020 by Erik Strand
Edit CGD derivation
parent 3d0842f9
Showing 1 changed file with 48 additions and 27 deletions (+48 −27)
_notes/optimization.md @ 9ea9cf49
...
...
@@ -40,24 +40,28 @@ $$
\nabla f(x_i) \cdot u_j \neq 0
$$

This says that the directional derivative at $$x_i$$ along $$u_j$$ is nonzero. So each successive line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle back and minimize along the previous direction again. This can lead to a very inefficient search.

This raises the question: can we find directions $$u_i$$ for which successive line minimizations don't disturb the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want

$$
\nabla f(x_i) \cdot u_j = 0
$$

If we can do this, after performing $$k$$ line minimizations, the resulting point $$x_k$$ will still be minimal along all $$k$$ directions considered so far. This implies that $$x_k$$ is the minimum within the subspace spanned by $$\{ u_1, \ldots, u_k \}$$. So each successive line search expands the space within which we've minimized by one dimension. And after $$n$$ minimizations, we've covered the whole space: the directional derivative at $$x_n$$ must be zero in all directions (i.e. $$\nabla f(x_n) = 0$$), so $$x_n$$ is a local minimum.

It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for other functions (after all, every function without vanishing gradient and Hessian looks quadratic close to a local extremum). In the latter case, repeating the whole process multiple times yields better and better solutions.
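To make this concrete, here is a small numerical sketch (an assumed example, not part of the note): on a 2D quadratic, two exact line searches along conjugate directions land exactly on the minimum, while the same two searches along the coordinate axes leave a nonzero gradient.

```python
import numpy as np

# Illustrative sketch (assumed example): exact line minimization of a
# quadratic f(x) = 1/2 x^T H x - b^T x along H-conjugate directions
# reaches the minimum in n steps; coordinate directions leave work undone.

H = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite Hessian
b = np.array([1.0, 1.0])
grad = lambda x: H @ x - b
x_star = np.linalg.solve(H, b)          # true minimum, for comparison only

def line_minimize(x, u):
    # Exact minimizer of the quadratic along the line x + alpha * u.
    alpha = -(grad(x) @ u) / (u @ H @ u)
    return x + alpha * u

# Two H-conjugate directions: u2^T H u1 = 0.
u1 = np.array([1.0, 0.0])
Hu1 = H @ u1
u2 = np.array([-Hu1[1], Hu1[0]])        # perpendicular to H u1

x = np.zeros(2)
for u in (u1, u2):
    x = line_minimize(x, u)
assert np.allclose(x, x_star)           # done after n = 2 line searches

# Coordinate directions: the second search disturbs the first.
y = np.zeros(2)
for e in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    y = line_minimize(y, e)
assert not np.allclose(y, x_star)       # still not at the minimum
```

The conjugate pair here is built by hand from $$H$$; the derivation below shows how to construct such directions without access to $$H$$.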
### Derivation
...
...
@@ -71,6 +75,10 @@ $$
f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)
$$

$$H$$ is the Hessian matrix at $$x_0$$. I assume we don't have any way to compute it directly, but it's important to consider its presence as we derive the algorithm. Notationally, I don't attach an $$x_0$$ to it, since we will never consider the Hessian at any other location.
By differentiating, we find that
$$
...
...
@@ -83,12 +91,19 @@ $$
\nabla f(x_0 + x) = \nabla f(x_0) + H x
$$

Now if we could compute $$H$$, we could set the gradient to zero and find the minimum $$x^*$$ directly by solving $$x^* = x_0 - H^{-1} \nabla f(x_0)$$. By our assumption, this is forbidden to us, but it still brings up an important point: if $$H$$ isn't invertible, there isn't a unique solution. The easiest way out is to assume that $$H$$ is positive definite. However, to handle the case of cubic or higher order minima, we need to relax this. Everything in the following derivation works even when the Hessian vanishes, as long as the line searches terminate, i.e. there's no direction where they can go downhill forever, and you aren't unlucky enough to shoot one exactly along the floor of a flat valley. Better yet, if your line search is smart enough to quit for flat functions, then you just need to ensure you can't go downhill forever, i.e. $$H$$ is positive semidefinite.
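As a quick sanity check of the forbidden-but-instructive direct solve (a toy setup with made-up numbers, not the note's algorithm), one step of $$x^* = x_0 - H^{-1} \nabla f(x_0)$$ does minimize the quadratic model exactly:

```python
import numpy as np

# Sketch of the step the derivation rules out: with H known and invertible,
# x* = x_0 - H^{-1} grad f(x_0) minimizes the quadratic model in one shot.
# H, x0, and g0 below are made-up values for illustration.

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite Hessian
x0 = np.array([3.0, -2.0])
g0 = np.array([1.0, 4.0])                # gradient at x0, taken as given

# Gradient of f(x0) + g0^T (x - x0) + 1/2 (x - x0)^T H (x - x0)
grad = lambda x: g0 + H @ (x - x0)

x_star = x0 - np.linalg.solve(H, g0)     # the direct (Newton) step
assert np.allclose(grad(x_star), 0)      # gradient vanishes: exact minimum
```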
Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ as before. For any $$1 \leq i \leq n$$,

$$
\begin{aligned}
...
...
@@ -150,7 +165,7 @@ $$
\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0
$$

So trivially, $$\{ u_1 \}$$ spans a subspace of dimension one, and the following properties hold:

$$
\begin{cases}
...
...
@@ -162,8 +177,9 @@ $$
#### Induction
Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$, that $$\{ u_1, \ldots, u_k \}$$ span a subspace of dimension $$k$$, and that the following properties hold:

$$
\begin{cases}
...
...
@@ -180,9 +196,9 @@ u_{k + 1} = \nabla f(x_k) + \gamma_k u_k
$$

for some undetermined scalar $$\gamma_k$$. As before, if the gradient at $$x_k$$ is zero, then that is a minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, no value of $$\gamma_k$$ can make $$u_{k + 1}$$ zero. And since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, no value of $$\gamma_k$$ can make $$u_{k + 1}$$ a linear combination of the prior $$u_j$$.

Our primary concern is that $$u_{k + 1}^T H u_k = 0$$. So we expand
...
...
@@ -196,8 +212,9 @@ u_{k + 1}^T H u_k \\
\end{aligned}
$$

If $$H$$ is positive definite, by definition $$u_k^T H u_k \neq 0$$. Even without this assumption, though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and solve

$$
\gamma_k = -\frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}
...
...
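This update can be sanity-checked numerically. In the sketch below (an assumed setup; $$H$$ is used directly only to verify the claim, which the real algorithm avoids), each $$\gamma_k$$ computed from this formula makes $$u_{k+1}$$ conjugate to $$u_k$$, and $$n$$ line searches minimize an $$n$$-dimensional quadratic:

```python
import numpy as np

# Sketch (assumed setup): verify that gamma_k from the formula above gives
# u_{k+1}^T H u_k = 0, and that n conjugate line searches minimize an
# n-dimensional quadratic f(x) = 1/2 x^T H x - b^T x.

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
H = A @ A.T + 4.0 * np.eye(4)            # random positive definite Hessian
b = rng.standard_normal(4)
grad = lambda x: H @ x - b

def line_minimize(x, u):
    # Exact line search for a quadratic.
    alpha = -(grad(x) @ u) / (u @ H @ u)
    return x + alpha * u

x = np.zeros(4)
u = grad(x)                              # u_1 = grad f(x_0)
x = line_minimize(x, u)
for _ in range(3):
    g = grad(x)
    gamma = -(g @ H @ u) / (u @ H @ u)   # the formula just derived
    u_next = g + gamma * u               # u_{k+1} = grad f(x_k) + gamma_k u_k
    assert abs(u_next @ H @ u) < 1e-8    # conjugacy: u_{k+1}^T H u_k = 0
    u = u_next
    x = line_minimize(x, u)

assert np.linalg.norm(grad(x)) < 1e-8    # minimized after n = 4 searches
```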
@@ -258,14 +275,18 @@ $$
for $$1 \leq j \leq k$$.

So finally we must show the orthogonality of the gradients. For any $$1 \leq j \leq k$$,

$$
\nabla f(x_{k + 1}) \cdot \nabla f(x_j) = \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j) = 0
$$

And

$$
\nabla f(x_{k + 1}) \cdot \nabla f(x_0) = \nabla f(x_{k + 1}) \cdot u_1 = 0
$$

Thus we have proven
...
...
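The proven invariants can also be checked end to end on a random quadratic. This is a sketch under assumed test data ($$H$$ appears only to define the function and to test conjugacy), confirming that all directions are pairwise conjugate and all intermediate gradients pairwise orthogonal:

```python
import numpy as np

# Sketch: run the construction on a random 5D quadratic and check the
# invariants proven above -- directions pairwise H-conjugate, gradients
# pairwise orthogonal, and x_n a minimum. (Assumed test data.)

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)            # positive definite Hessian
b = rng.standard_normal(5)
grad = lambda x: H @ x - b

x = np.zeros(5)
u = grad(x)                              # u_1 = grad f(x_0)
grads, dirs = [grad(x)], []
for _ in range(5):
    alpha = -(grad(x) @ u) / (u @ H @ u)  # exact line search
    x = x + alpha * u
    dirs.append(u)
    g = grad(x)
    grads.append(g)
    u = g - (g @ H @ u) / (u @ H @ u) * u  # u_{k+1} = grad f(x_k) + gamma_k u_k

for i in range(5):
    for j in range(i):
        assert abs(dirs[i] @ H @ dirs[j]) < 1e-8   # conjugate directions
        assert abs(grads[i] @ grads[j]) < 1e-8     # orthogonal gradients
assert np.linalg.norm(grad(x)) < 1e-8              # x_5 is the minimum
```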