Commit 278f11a9 authored by Erik Strand

Add eighth pset

parents 15a787be 8a1ae21a
import sympy as sp

# Verifying that the factored parabolic interpolation formula (as presented in Numerical Recipes
# for line minimization) is equivalent to the minimum of the Lagrange parabola through three points.

x = sp.symbols("x", real=True)
x_1, x_2, x_3 = sp.symbols("x_1 x_2 x_3", real=True)
y_1, y_2, y_3 = sp.symbols("y_1 y_2 y_3", real=True)

# Lagrange interpolating parabola through (x_1, y_1), (x_2, y_2), and (x_3, y_3).
term_1 = y_1 * (x - x_2) * (x - x_3) / (x_1 - x_2) / (x_1 - x_3)
term_2 = y_2 * (x - x_1) * (x - x_3) / (x_2 - x_1) / (x_2 - x_3)
term_3 = y_3 * (x - x_1) * (x - x_2) / (x_3 - x_1) / (x_3 - x_2)
poly = term_1 + term_2 + term_3

# The extremum is where the derivative vanishes. Since the derivative is linear in x, there is
# exactly one solution.
dpoly = sp.diff(poly, x)
sol = sp.solve(dpoly, x)
assert len(sol) == 1
sol = sol[0]

# Print the latex forms (used in the optimization notes).
print("lagrange polynomial")
print(sp.latex(poly))
print("derivative")
print(sp.latex(dpoly))
print("minimum")
print(sp.latex(sol))

# The factored form, written in terms of differences of coordinates and function values.
q = (x_2 - x_3) * (y_2 - y_1)
r = (x_2 - x_1) * (y_2 - y_3)
alt = x_2 - ((x_2 - x_3) * q - (x_2 - x_1) * r) / 2 / (q - r)
print(sp.latex(alt))

# Prints 0 if the two expressions agree.
print(sp.simplify(sol - alt))
---
title: Optimization Algorithms
---
## Parabolic Interpolation
A task that comes up frequently in optimization problems is guessing a function minimum based on a
parabola fit through three points. So I'll derive this technique here.
First, we need a parabola that goes through three points, say $$(x_1, y_1)$$, $$(x_2, y_2)$$, and
$$(x_3, y_3)$$. In the interest of generality, I'll construct it using Lagrange polynomials.
$$
p(x) = \frac{y_1 (x - x_2) (x - x_3)}{(x_1 - x_2) (x_1 - x_3)}
+ \frac{y_2 (x - x_1) (x - x_3)}{(x_2 - x_1) (x_2 - x_3)}
+ \frac{y_3 (x - x_1) (x - x_2)}{(x_3 - x_1) (x_3 - x_2)}
$$
Differentiating with respect to $$x$$, we find the following.
$$
\begin{aligned}
p'(x) &=
\frac{y_{1} \left(x - x_{2}\right)}{\left(x_{1} - x_{2}\right) \left(x_{1} - x_{3}\right)}
+ \frac{y_{1} \left(x - x_{3}\right)}{\left(x_{1} - x_{2}\right) \left(x_{1} - x_{3}\right)} \\
&+ \frac{y_{2} \left(x - x_{1}\right)}{\left(- x_{1} + x_{2}\right) \left(x_{2} - x_{3}\right)}
+ \frac{y_{2} \left(x - x_{3}\right)}{\left(- x_{1} + x_{2}\right) \left(x_{2} - x_{3}\right)} \\
&+ \frac{y_{3} \left(x - x_{1}\right)}{\left(- x_{1} + x_{3}\right) \left(- x_{2} + x_{3}\right)}
+ \frac{y_{3} \left(x - x_{2}\right)}{\left(- x_{1} + x_{3}\right) \left(- x_{2} + x_{3}\right)}
\end{aligned}
$$
Setting this to zero, we find that the solution is surprisingly simple.
$$
x_\text{min} = \frac{1}{2} \cdot \frac{x_1^2 (y_2 - y_3) + x_2^2 (y_3 - y_1) + x_3^2 (y_{1} - y_2)}
{x_1 (y_2 - y_3) + x_2 (y_3 - y_1) + x_3 (y_1 - y_2)}
$$
This can also be factored so that it only involves differences of coordinates and function values.
$$
x_\text{min} =
x_{2} - \frac{1}{2} \cdot
\frac{(x_2 - x_3)^2 (y_2 - y_1) - (x_2 - x_1)^2 (y_2 - y_3)}{(x_2 - x_3) (y_2 - y_1) - (x_2 - x_1) (y_2 - y_3)}
$$
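For concreteness, here's a minimal numerical sketch of this formula (the function name
`parabolic_min` is mine, for illustration; it's not part of the repository code).
```
def parabolic_min(x1, y1, x2, y2, x3, y3):
    # Abscissa of the vertex of the parabola through (x1, y1), (x2, y2), (x3, y3).
    q = (x2 - x3) * (y2 - y1)
    r = (x2 - x1) * (y2 - y3)
    # The denominator vanishes when the three points are collinear.
    return x2 - ((x2 - x3) * q - (x2 - x1) * r) / (2.0 * (q - r))

# Sampling y = x^2 at x = 0, 1, 3 recovers the vertex at x = 0 exactly.
print(parabolic_min(0.0, 0.0, 1.0, 1.0, 3.0, 9.0))
```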
## Nelder-Mead
TODO: Demonstrate that the Numerical Recipes formulation is equivalent to the standard formulation.
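In the meantime, here's a minimal sketch of the reflection step as it's usually presented: reflect
the worst vertex through the centroid of the remaining ones. The expansion, contraction, and shrink
steps are omitted, and this is illustrative code rather than the repository implementation.
```
import numpy as np

def reflect(simplex, values, alpha=1.0):
    # simplex has shape (n + 1, n); values[i] = f(simplex[i]).
    order = np.argsort(values)                   # best to worst
    worst = simplex[order[-1]]
    centroid = simplex[order[:-1]].mean(axis=0)  # centroid of every vertex except the worst
    return centroid + alpha * (centroid - worst)
```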
## Conjugate Gradient Descent
### Motivation
Since one dimensional optimization problems can be approached systematically, it's reasonable to
consider schemes for multidimensional optimization that consist of iterative line searches.
Say we have a function $$f$$ that maps $$\mathbb{R}^n$$ to $$\mathbb{R}$$, and some arbitrary
starting point $$x_0$$. If we choose $$n$$ linearly independent directions $$u_1$$, $$\ldots$$,
$$u_n$$, we can perform $$n$$ successive line minimizations. This will give us $$n$$ points
$$
\begin{aligned}
x_1 &= x_0 + \alpha_1 u_1 \\
x_2 &= x_1 + \alpha_2 u_2 \\
&\vdots \\
x_n &= x_{n-1} + \alpha_n u_n = x_0 + \sum_{i = 1}^n \alpha_i u_i \\
\end{aligned}
$$
where the coefficients $$\alpha_i$$ are chosen via some one dimensional optimization routine. Since
each point $$x_i$$ is a local minimum along the direction $$u_i$$, it must be true that
$$
\nabla f(x_i) \cdot u_i = 0
$$
Otherwise we could have gone farther along $$u_i$$ and made $$f(x_i)$$ smaller. The problem with
choosing random directions $$u_i$$ is that, for $$i \neq j$$, in general
$$
\nabla f(x_i) \cdot u_j \neq 0
$$
This says that the directional derivative at $$x_i$$ along $$u_j$$ is nonzero. So each successive
line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle
back and minimize along the previous direction again. This can lead to a very inefficient search.
This raises the question: can we find directions $$u_i$$ for which successive line minimizations
don't disturb the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want
$$
\nabla f(x_i) \cdot u_j = 0
$$
If we can do this, after performing $$k$$ line minimizations, the resulting point $$x_k$$ will still
be minimal along all $$k$$ directions considered so far. This implies that $$x_k$$ is the minimum
within the subspace spanned by $$\{u_1, \ldots, u_k\}$$. So each successive line search expands the
space within which we've minimized by one dimension. And after $$n$$ minimizations, we've covered
the whole space &mdash; the directional derivative at $$x_n$$ must be zero in all directions (i.e.
$$\nabla f(x_n) = 0$$), so $$x_n$$ is a local minimum.
It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for
other functions (after all, any smooth function with a nonvanishing Hessian looks quadratic close
enough to a local extremum). In the latter case, repeating the whole process multiple times yields
better and better solutions.
### Derivation
#### Preliminaries
Let's express $$f$$ as a second order Taylor expansion about $$x_0$$. I'll assume this is exact in
the following discussion, but if you have higher order terms as well, you can view everything as an
approximation.
$$
f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)
$$
$$H$$ is the Hessian matrix at $$x_0$$. I assume we don't have any way to compute it directly, but
it's important to consider its presence as we derive the algorithm. Notationally, I don't attach an
$$x_0$$ to it since we will never consider the Hessian at any other location.
By differentiating, we find that
$$
\nabla f(x) = \nabla f(x_0) + H (x - x_0)
$$
or equivalently,
$$
\nabla f(x_0 + x) = \nabla f(x_0) + H x
$$
Now if we could compute $$H$$, we could set the gradient to zero and find the minimum directly:
$$x^* = x_0 - H^{-1} \nabla f(x_0)$$. By our assumption this is forbidden to us, but it still brings
up an important point: if $$H$$ isn't invertible, there isn't a unique solution.
The easiest way out is to assume that $$H$$ is positive definite. However, to handle the case of
cubic or higher order minima, we need to relax this. Everything in the following derivation works
even when the Hessian is degenerate, as long as the line searches terminate &mdash; i.e. there's no
direction along which the function decreases forever, and you aren't unlucky enough to shoot a
search exactly along the floor of a flat valley. Better yet, if your line search is smart enough to
quit for flat functions, then you just need to ensure you can't go downhill forever &mdash; i.e.
$$H$$ is positive semidefinite.
Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ as before. For any $$1 \leq i \leq n$$,
$$
\begin{aligned}
\nabla f(x_i)
&= \nabla f \left( x_0 + \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + H \left( \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + \sum_{k = 1}^i \alpha_k H u_k \\
\end{aligned}
$$
From this it follows that
$$
\nabla f(x_i) = \nabla f(x_{i - 1}) + \alpha_i H u_i
$$
or more generally, for $$0 \leq j < i \leq n$$,
$$
\nabla f(x_i) = \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i
$$
Thus for $$1 \leq j < i \leq n$$,
$$
\begin{aligned}
\nabla f(x_i) \cdot u_{j}
&= \left( \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i \right) \cdot u_j \\
&= \nabla f(x_j) \cdot u_j + \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
&= \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
\end{aligned}
$$
(Recall that $$\nabla f(x_i) \cdot u_i = 0$$ because $$\alpha_i$$ is chosen to make $$x_i$$ a local
minimum along $$u_i$$.) The whole point of this exercise is to make this inner product zero, since
then each line minimization won't spoil the previous ones. So our goal will be achieved if we can
ensure that
$$
u_i^T H u_j = 0
$$
for all $$1 \leq j < i \leq n$$. Such vectors are called *conjugate vectors*, from which this
algorithm derives its name. (Though the careful reader will notice that the name is a little
misleading: the gradients aren't conjugate, and the vectors that are conjugate aren't gradients.)
#### Base Case
We will now construct a set of conjugate vectors inductively. If $$\nabla f(x_0) = 0$$, we are done.
So take $$u_1 = \nabla f(x_0) \neq 0$$ as the first direction, and find $$x_1 = x_0 + \alpha_1 u_1$$
via a line search. Yes, $$u_1$$ points uphill; I assume the line search is smart enough to find a
negative $$\alpha_1$$. Also note that if the line search fails, then clearly $$f$$ wasn't strictly
convex. Finally, since the gradient at $$x_0$$ isn't zero, $$\alpha_1$$ will not be zero.
By construction,
$$
\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0
$$
So trivially, $$\{ u_1 \}$$ spans a subspace of dimension one, and the following properties hold:
$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq 1 \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq 1
\end{cases}
$$
#### Induction
Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$,
that $$\{ u_1, \ldots, u_k \}$$ span a subspace of dimension $$k$$, and that the following
properties hold:
$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k
\end{cases}
$$
We will choose a new direction of the form
$$
u_{k + 1} = \nabla f(x_k) + \gamma_k u_k
$$
for some undetermined scalar $$\gamma_k$$. As before, if the gradient at $$x_k$$ is zero, then that
is a minimum. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, no value of $$\gamma_k$$ can make
$$u_{k + 1}$$ zero. And since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, no value
of $$\gamma_k$$ can make $$u_{k + 1}$$ a linear combination of the prior $$u_j$$.
Our primary concern is that $$u_{k + 1}^T H u_k = 0$$. So
we expand
$$
\begin{aligned}
0 &=
u_{k + 1}^T H u_k \\
&= (\nabla f(x_k) + \gamma_k u_k)^T H u_k \\
&= \nabla f(x_k)^T H u_k + \gamma_k u_k^T H u_k
\end{aligned}
$$
If $$H$$ is positive definite, by definition $$u_k^T H u_k \neq 0$$. Even without this assumption,
though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and
solve
$$
\gamma_k = - \frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}
$$
Since we can't compute $$H$$, we can't use this equation directly. But recall
$$
\nabla f(x_k) - \nabla f(x_{k - 1}) = \alpha_k H u_k
$$
Thus
$$
\begin{aligned}
\gamma_k
&= - \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T (\nabla f(x_k) - \nabla f(x_{k - 1}))} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{(\nabla f(x_{k - 1}) + \gamma_{k - 1} u_{k - 1})^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
\end{aligned}
$$
We can simplify one step further and write
$$
\gamma_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
$$
This makes sense if the function we're minimizing really is quadratic. But when that assumption is
only approximately true, empirically it tends to work better to use the first form. (The first form
is the Polak-Ribière update, and the simplified one is the Fletcher-Reeves update.) Also, note that
the denominator (in both forms) is guaranteed to be nonzero, since if $$\nabla f(x_{k - 1})$$ were
zero we would already have found our minimum.
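In code, the two update rules look like this (a sketch; `g` and `g_prev` stand for
$$\nabla f(x_k)$$ and $$\nabla f(x_{k - 1})$$, and the function names are mine):
```
import numpy as np

def gamma_polak_ribiere(g, g_prev):
    # The first form: more robust when the quadratic model is only approximate.
    return np.dot(g, g - g_prev) / np.dot(g_prev, g_prev)

def gamma_fletcher_reeves(g, g_prev):
    # The simplified form: equivalent when the function really is quadratic.
    return np.dot(g, g) / np.dot(g_prev, g_prev)
```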
Now that we have determined $$\gamma_k$$, we can compute $$u_{k + 1}$$ explicitly and find $$x_{k +
1} = x_k + \alpha_{k + 1} u_{k + 1}$$ by line minimization. Because $$x_{k + 1}$$ is a minimum along
$$u_{k + 1}$$, $$\nabla f(x_{k + 1}) \cdot u_{k + 1} = 0$$. And our choice of $$\gamma_k$$ ensures
that $$u_{k + 1}^T H u_k = 0$$, which as we saw earlier implies that $$\nabla f(x_{k + 1}) \cdot u_k
= 0$$. Now we must show that these relations hold for $$j < k$$ as well.
First we show the conjugacy conditions. For any $$1 \leq j < k$$,
$$
\begin{aligned}
u_{k + 1}^T H u_j
&= (\nabla f(x_k) + \gamma_k u_k)^T H u_j \\
&= \nabla f(x_k)^T H u_j \\
&= \nabla f(x_k)^T \frac{1}{\alpha_j} (\nabla f(x_j) - \nabla f(x_{j - 1})) \\
&= 0
\end{aligned}
$$
As we have seen, this implies that
$$
\nabla f(x_{k + 1}) \cdot u_j = 0
$$
for $$1 \leq j \leq k$$.
So finally we must show the orthogonality of the gradients. For any $$1 \leq j \leq k$$,
$$
\nabla f(x_{k + 1}) \cdot \nabla f(x_j)
= \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j)
= 0
$$
And
$$
\nabla f(x_{k + 1}) \cdot \nabla f(x_0) = \nabla f(x_{k + 1}) \cdot u_1 = 0
$$
Thus we have proven
$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k + 1\\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k + 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k + 1
\end{cases}
$$
By induction this shows that these properties hold up to $$n$$. (We cannot induct further than this
since the gradient at $$x_n$$ will be zero, and so $$u_{n + 1}$$ would be in the span of $$u_n$$.)
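Putting the pieces together, here's a minimal sketch of the whole algorithm. It flips the sign
convention so the search directions point downhill (the derivation above takes
$$u_1 = \nabla f(x_0)$$ and relies on negative $$\alpha$$), uses the Polak-Ribière form of
$$\gamma_k$$, and borrows `scipy.optimize.minimize_scalar` as a stand-in line search; the
repository code uses its own golden section / Brent line search instead.
```
import numpy as np
from scipy.optimize import minimize_scalar, rosen, rosen_der

def conjugate_gradient_descent(f, grad, x0, tol=1e-8, max_iters=1000):
    # Minimize f by successive line minimizations along conjugate directions.
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    u = -g  # first direction: downhill gradient
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        # Line minimization along u (stand-in for a golden section or Brent search).
        alpha = minimize_scalar(lambda a: f(x + a * u)).x
        x = x + alpha * u
        g_new = grad(x)
        # Polak-Ribiere update for gamma.
        gamma = np.dot(g_new, g_new - g) / np.dot(g, g)
        u = -g_new + gamma * u
        g = g_new
    return x

# Example usage on the 2d Rosenbrock function.
print(conjugate_gradient_descent(rosen, rosen_der, np.array([-1.0, 1.0])))
```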
---
title: Problem Set 8 (Optimization)
---
This week's questions are all about exploring various optimization algorithms, so I won't write them
here specifically. This page summarizes my results. For notes on derivation and implementation, see
[here](../notes/optimization.html). Code is [here](https://gitlab.cba.mit.edu/erik/optimization).
I implemented Nelder-Mead, naive gradient descent (always step a constant factor times the
gradient), and conjugate gradient descent. For the latter I initially used a golden section line
search, but replaced it with Brent's method. I still want to update it to use Brent's method with
derivatives. I also played around with the reference CMA-ES implementation for comparison.
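For reference, the naive gradient descent above can be sketched in a few lines (the step size and
names here are illustrative, not the repository's):
```
import numpy as np

def naive_gradient_descent(grad, x0, step=1e-3, tol=1e-8, max_iters=1_000_000):
    # Always step a constant factor times the (negative) gradient.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - step * g
    return x
```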
## Visualizations
Nelder-Mead 2d
![Nelder-Mead 2d](../assets/img/08_nelder_mead_plot_2d.png)
Nelder-Mead 10d
![Nelder-Mead 10d](../assets/img/08_nelder_mead_plot_10d.png)
Naive Gradient Descent 2d
![Naive gradient descent 2d](../assets/img/08_gradient_descent_plot_2d.png)
Naive Gradient Descent 10d
![Naive gradient descent 10d](../assets/img/08_gradient_descent_plot_10d.png)
Conjugate Gradient Descent 2d
![Conjugate gradient descent 2d](../assets/img/08_conjugate_gradient_descent_plot_2d.png)
Conjugate Gradient Descent 10d
![Conjugate gradient descent 10d](../assets/img/08_conjugate_gradient_descent_plot_10d.png)
CMA-ES 2d
![CMA-ES 2d](../assets/img/08_cma_es_plot_2d.png)
CMA-ES 10d
![CMA-ES 10d](../assets/img/08_cma_es_plot_10d.png)
## Performance Comparisons
I haven't made a pretty chart for this yet since I'm still playing around with the parameters (to
make it as fair a fight as possible). But initial results suggest that on the multi-dimensional
Rosenbrock function, conjugate gradient descent is a clear winner in scaling, at least through
one thousand dimensions.
Here are the raw numbers. Nelder-Mead and CMA-ES are asked to reach a tolerance of 1e-8, while the
gradient descent methods are asked to find points where the gradient norm is less than 1e-8. The
numbers below are the function evaluation counts required to reach those goals, capped at 1e6.
| Dimensions | Nelder-Mead | Gradient Descent | Conjugate Gradient Descent | CMA-ES |
|-----------:|------------:|-----------------:|---------------------------:|--------:|
| 2 | 193 | 28527 | 1021 | 870 |
| 4 | 562 | 35724 | 9448 | 2760 |
| 8 | 1591 | 45049 | 5987 | 5790 |
| 16 | 27440 | 124637 | 13083 | 14292 |
| 32 | 78969 | 200567 | 13898 | 57876 |
| 64 | 561195 | 455578 | 15194 | 211616 |
| 128 | 1000001 | 564451 | 19565 | 953694 |
| 256 | 1000001 | 782196 | 28110 | 1000000 |
| 512 | 1000001 | 1000000 | 46918 | 1000000 |
| 1024 | 1000001 | 1000000 | 10055 | 1000000 |
As a final note, all the algorithms except CMA-ES spent the majority of their time evaluating the
function. But CMA-ES imposes large computational costs of its own, since it essentially has to
invert a covariance matrix (or at least solve an eigenvalue problem).