### Add eighth pset

```python
import sympy as sp

# Verify that the factored parabolic-interpolation formula (the "alt" expression
# below) is equivalent to the standard presentation, i.e. the minimum of the
# Lagrange polynomial through three points.
x = sp.symbols("x", real=True)
x_1, x_2, x_3 = sp.symbols("x_1 x_2 x_3", real=True)
y_1, y_2, y_3 = sp.symbols("y_1 y_2 y_3", real=True)

# Lagrange interpolating polynomial through (x_1, y_1), (x_2, y_2), (x_3, y_3).
term_1 = y_1 * (x - x_2) * (x - x_3) / (x_1 - x_2) / (x_1 - x_3)
term_2 = y_2 * (x - x_1) * (x - x_3) / (x_2 - x_1) / (x_2 - x_3)
term_3 = y_3 * (x - x_1) * (x - x_2) / (x_3 - x_1) / (x_3 - x_2)
poly = term_1 + term_2 + term_3

# The extremum is where the derivative vanishes.
dpoly = sp.diff(poly, x)
sol = sp.solve(dpoly, x)
assert len(sol) == 1
sol = sol[0]

print("lagrange polynomial")
print(sp.latex(poly))
print("derivative")
print(sp.latex(dpoly))
print("minimum")
print(sp.latex(sol))

# Factored form that only involves differences of coordinates and function values.
q = (x_2 - x_3) * (y_2 - y_1)
r = (x_2 - x_1) * (y_2 - y_3)
alt = x_2 - ((x_2 - x_3) * q - (x_2 - x_1) * r) / 2 / (q - r)
print(sp.latex(alt))

# Should print 0, confirming the two expressions agree.
print(sp.simplify(sol - alt))
```
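As a companion to the symbolic check above, here is a minimal numeric sketch (a hypothetical helper written for this page, not part of the repository) showing how the factored formula can be used as a single interpolation step.

```python
def parabolic_min(x1, y1, x2, y2, x3, y3):
    # Vertex of the parabola through (x1, y1), (x2, y2), (x3, y3),
    # written in the difference form verified symbolically above.
    q = (x2 - x3) * (y2 - y1)
    r = (x2 - x1) * (y2 - y3)
    denom = 2.0 * (q - r)
    if denom == 0.0:
        raise ValueError("points are collinear; no unique parabola vertex")
    return x2 - ((x2 - x3) * q - (x2 - x1) * r) / denom

# Example: sampling f(x) = (x - 1.5)^2 + 2 at three points recovers 1.5 exactly.
f = lambda x: (x - 1.5) ** 2 + 2.0
xs = (0.0, 1.0, 3.0)
print(parabolic_min(xs[0], f(xs[0]), xs[1], f(xs[1]), xs[2], f(xs[2])))  # 1.5
```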
---
title: Optimization Algorithms
---

## Parabolic Interpolation

A task that comes up frequently in optimization problems is guessing a function minimum based on a parabola fit through three points, so I'll derive this technique here. First, we need a parabola that goes through three points, say \$\$(x_1, y_1)\$\$, \$\$(x_2, y_2)\$\$, and \$\$(x_3, y_3)\$\$. In the interest of generality, I'll construct it using Lagrange polynomials.

\$\$
p(x) = \frac{y_1 (x - x_2) (x - x_3)}{(x_1 - x_2) (x_1 - x_3)} + \frac{y_2 (x - x_1) (x - x_3)}{(x_2 - x_1) (x_2 - x_3)} + \frac{y_3 (x - x_1) (x - x_2)}{(x_3 - x_1) (x_3 - x_2)}
\$\$

Differentiating, we find that the derivative is equal to the following.

\$\$
\begin{aligned}
p'(x) &= \frac{y_{1} \left(x - x_{2}\right)}{\left(x_{1} - x_{2}\right) \left(x_{1} - x_{3}\right)} + \frac{y_{1} \left(x - x_{3}\right)}{\left(x_{1} - x_{2}\right) \left(x_{1} - x_{3}\right)} \\
&+ \frac{y_{2} \left(x - x_{1}\right)}{\left(- x_{1} + x_{2}\right) \left(x_{2} - x_{3}\right)} + \frac{y_{2} \left(x - x_{3}\right)}{\left(- x_{1} + x_{2}\right) \left(x_{2} - x_{3}\right)} \\
&+ \frac{y_{3} \left(x - x_{1}\right)}{\left(- x_{1} + x_{3}\right) \left(- x_{2} + x_{3}\right)} + \frac{y_{3} \left(x - x_{2}\right)}{\left(- x_{1} + x_{3}\right) \left(- x_{2} + x_{3}\right)}
\end{aligned}
\$\$

Setting this to zero, we find that the solution is surprisingly simple.

\$\$
x_\text{min} = \frac{1}{2} \cdot \frac{x_1^2 (y_2 - y_3) + x_2^2 (y_3 - y_1) + x_3^2 (y_{1} - y_2)}{x_1 (y_2 - y_3) + x_2 (y_3 - y_1) + x_3 (y_1 - y_2)}
\$\$

This can also be factored so that it only involves differences of coordinates and function values.

\$\$
x_\text{min} = x_{2} - \frac{1}{2} \cdot \frac{(x_2 - x_3)^2 (y_2 - y_1) - (x_2 - x_1)^2 (y_2 - y_3)}{(x_2 - x_3) (y_2 - y_1) - (x_2 - x_1) (y_2 - y_3)}
\$\$

## Nelder-Mead

TODO: Demonstrate that the Numerical Recipes formulation is equivalent to the standard formulation.

## Conjugate Gradient Descent

### Motivation

Since one-dimensional optimization problems can be approached systematically, it's reasonable to consider schemes for multidimensional optimization that consist of iterative line searches. Say we have a function \$\$f\$\$ that maps \$\$\mathbb{R}^n\$\$ to \$\$\mathbb{R}\$\$, and some arbitrary starting point \$\$x_0\$\$. If we choose \$\$n\$\$ linearly independent directions \$\$u_1\$\$, \$\$\ldots\$\$, \$\$u_n\$\$, we can perform \$\$n\$\$ successive line minimizations. This will give us \$\$n\$\$ points

\$\$
\begin{aligned}
x_1 &= x_0 + \alpha_1 u_1 \\
x_2 &= x_1 + \alpha_2 u_2 \\
&\vdots \\
x_n &= x_{n-1} + \alpha_n u_n = x_0 + \sum_{i = 1}^n \alpha_i u_i \\
\end{aligned}
\$\$

where the coefficients \$\$\alpha_i\$\$ are chosen via some one-dimensional optimization routine. Since each point \$\$x_i\$\$ is a local minimum along the direction \$\$u_i\$\$, it must be true that

\$\$
\nabla f(x_i) \cdot u_i = 0
\$\$

Otherwise we could have gone farther along \$\$u_i\$\$ and made \$\$f(x_i)\$\$ smaller. The problem with choosing random directions \$\$u_i\$\$ is that, for \$\$i \neq j\$\$, in general

\$\$
\nabla f(x_i) \cdot u_j \neq 0
\$\$

This says that the directional derivative at \$\$x_i\$\$ along \$\$u_j\$\$ is nonzero. So each successive line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle back and minimize along the previous directions again. This can lead to a very inefficient search.
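To make this concrete, here's a small numerical illustration (a toy quadratic chosen for demonstration, not code from the repository): after exact line minimizations along the two coordinate axes, the gradient is no longer orthogonal to the first axis, so that direction would have to be searched again.

```python
import numpy as np

# Quadratic with correlated variables: f(x) = 0.5 * x^T A x, so grad f(x) = A x.
A = np.array([[3.0, 2.0], [2.0, 3.0]])
grad = lambda x: A @ x

def line_min(x, u):
    # Exact minimizer of f along x + alpha * u for a quadratic.
    alpha = -(grad(x) @ u) / (u @ A @ u)
    return x + alpha * u

x0 = np.array([2.0, -1.0])
u1, u2 = np.eye(2)  # coordinate directions

x1 = line_min(x0, u1)
x2 = line_min(x1, u2)

print(grad(x1) @ u1)  # ~0: x1 is minimal along u1
print(grad(x2) @ u2)  # ~0: x2 is minimal along u2
print(grad(x2) @ u1)  # nonzero: the second search spoiled the first
```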
This raises the question: can we find directions \$\$u_i\$\$ for which successive line minimizations don't disturb the previous ones? In particular, for \$\$1 \leq j < i \leq n\$\$ we want

\$\$
\nabla f(x_i) \cdot u_j = 0
\$\$

If we can do this, then after performing \$\$k\$\$ line minimizations, the resulting point \$\$x_k\$\$ will still be minimal along all \$\$k\$\$ directions considered so far. This implies that \$\$x_k\$\$ is the minimum within the subspace spanned by \$\$\{u_1, \ldots, u_k\}\$\$. So each successive line search expands the space within which we've minimized by one dimension. And after \$\$n\$\$ minimizations we've covered the whole space: the directional derivative at \$\$x_n\$\$ must be zero in all directions (i.e. \$\$\nabla f(x_n) = 0\$\$), so \$\$x_n\$\$ is a local minimum.

It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for other functions (after all, any function with a nondegenerate Hessian looks quadratic close enough to a local extremum). In the latter case, repeating the whole process multiple times yields better and better solutions.

### Derivation

#### Preliminaries

Let's express \$\$f\$\$ as a second order Taylor expansion about \$\$x_0\$\$. I'll assume this is exact in the following discussion, but if you have higher order terms as well, you can view everything as an approximation.

\$\$
f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)
\$\$

\$\$H\$\$ is the Hessian matrix at \$\$x_0\$\$. I assume we don't have any way to compute it directly, but it's important to consider its presence as we derive the algorithm. Notationally, I don't attach an \$\$x_0\$\$ to it since we will never consider the Hessian at any other location. By differentiating, we find that

\$\$
\nabla f(x) = \nabla f(x_0) + H (x - x_0)
\$\$

or equivalently,

\$\$
\nabla f(x_0 + x) = \nabla f(x_0) + H x
\$\$

Now if we could compute \$\$H\$\$, we could set the gradient to zero and find the minimum \$\$x^* \$\$ directly by solving \$\$x^* = x_0 - H^{-1} \nabla f(x_0)\$\$. By our assumption this is forbidden to us, but it still brings up an important point: if \$\$H\$\$ isn't invertible, there isn't a unique solution. The easiest way out is to assume that \$\$H\$\$ is positive definite. However, to handle minima of higher than second order (where the Hessian vanishes), we need to relax this. Everything in the following derivation works even when the Hessian vanishes, as long as the line searches terminate; that is, there's no direction along which the function decreases forever, and you aren't unlucky enough to shoot one search exactly along the floor of a flat valley. Better yet, if your line search is smart enough to quit for flat functions, then you just need to ensure you can't go downhill forever, i.e. that \$\$H\$\$ is positive semidefinite.
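As a quick check of that closed-form step (on an assumed toy quadratic, purely for illustration):

```python
import numpy as np

# For an exactly quadratic f, the minimizer has the closed form
# x* = x0 - H^{-1} grad f(x0). Toy quadratic: f(x) = 0.5 (x - m)^T H (x - m),
# whose true minimizer is m.
H = np.array([[4.0, 1.0], [1.0, 3.0]])  # positive definite Hessian
m = np.array([1.0, -2.0])               # true minimum location

grad = lambda x: H @ (x - m)

x0 = np.array([5.0, 5.0])               # arbitrary starting point
x_star = x0 - np.linalg.solve(H, grad(x0))
print(x_star)                            # recovers m (up to rounding)
```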
Moving along, define \$\$x_i = x_{i - 1} + \alpha_i u_i\$\$ as before. For any \$\$1 \leq i \leq n\$\$,

\$\$
\begin{aligned}
\nabla f(x_i) &= \nabla f \left( x_0 + \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + H \left( \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + \sum_{k = 1}^i \alpha_k H u_k \\
\end{aligned}
\$\$

From this it follows that

\$\$
\nabla f(x_i) = \nabla f(x_{i - 1}) + \alpha_i H u_i
\$\$

or more generally, for \$\$0 \leq j < i \leq n\$\$,

\$\$
\nabla f(x_i) = \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i
\$\$

Thus for \$\$1 \leq j < i \leq n\$\$,

\$\$
\begin{aligned}
\nabla f(x_i) \cdot u_{j} &= \left( \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i \right) \cdot u_j \\
&= \nabla f(x_j) \cdot u_j + \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
&= \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
\end{aligned}
\$\$

(Recall that \$\$\nabla f(x_j) \cdot u_j = 0\$\$ because \$\$\alpha_j\$\$ is chosen to make \$\$x_j\$\$ a local minimum along \$\$u_j\$\$.)

The whole point of this exercise is to make this inner product zero, since then each line minimization won't spoil the previous ones. So our goal will be achieved if we can ensure that

\$\$
u_i^T H u_j = 0
\$\$

for all \$\$1 \leq j < i \leq n\$\$. Such vectors are called *conjugate vectors*, from which this algorithm derives its name. (Though the careful reader will notice that it's not the gradients that are conjugate, and the vectors that are conjugate aren't gradients.)

#### Base Case

We will now construct a set of conjugate vectors inductively. If \$\$\nabla f(x_0) = 0\$\$, we are done. So take \$\$u_1 = \nabla f(x_0) \neq 0\$\$ as the first direction, and find \$\$x_1 = x_0 + \alpha_1 u_1\$\$ via a line search. Yes, \$\$u_1\$\$ points uphill; I assume the line search is smart enough to find a negative \$\$\alpha_1\$\$. Also note that if the line search fails, then clearly \$\$f\$\$ wasn't strictly convex. Finally, since the gradient at \$\$x_0\$\$ isn't zero, \$\$\alpha_1\$\$ will not be zero. By construction,

\$\$
\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0
\$\$

So trivially, \$\$\{ u_1 \}\$\$ spans a subspace of dimension one, and the following properties hold:

\$\$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq 1 \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq 1
\end{cases}
\$\$

#### Induction

Now assume that we've constructed \$\$u_1\$\$, \$\$\ldots\$\$, \$\$u_k\$\$ and \$\$x_1\$\$, \$\$\ldots\$\$, \$\$x_k\$\$, that \$\$\{ u_1, \ldots, u_k \}\$\$ span a subspace of dimension \$\$k\$\$, and that the following properties hold:

\$\$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k
\end{cases}
\$\$

We will choose a new direction of the form

\$\$
u_{k + 1} = \nabla f(x_k) + \gamma_k u_k
\$\$

for some undetermined scalar \$\$\gamma_k\$\$. As before, if the gradient at \$\$x_k\$\$ is zero, then that point is a minimum and we are done. Additionally, since \$\$\nabla f(x_k) \cdot u_k = 0\$\$, no value of \$\$\gamma_k\$\$ can make \$\$u_{k + 1}\$\$ zero. And since \$\$\nabla f(x_k) \cdot u_j = 0\$\$ for all \$\$1 \leq j \leq k\$\$, no value of \$\$\gamma_k\$\$ can make \$\$u_{k + 1}\$\$ a linear combination of the prior \$\$u_j\$\$.
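Before pinning down \$\$\gamma_k\$\$, here's a small numerical aside (reusing the toy quadratic from the motivation section, chosen for illustration) showing why conjugacy is the condition we want: two exact line searches along \$\$H\$\$-conjugate directions land exactly on the minimum, in contrast to the coordinate-direction example earlier.

```python
import numpy as np

# Same toy quadratic as before: f(x) = 0.5 * x^T A x, gradient A x, Hessian A.
A = np.array([[3.0, 2.0], [2.0, 3.0]])
grad = lambda x: A @ x

def line_min(x, u):
    alpha = -(grad(x) @ u) / (u @ A @ u)
    return x + alpha * u

# Build a pair of A-conjugate directions: start from e1, then remove the
# A-component of e2 along u1 (Gram-Schmidt in the inner product <a, b> = a^T A b).
u1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
u2 = e2 - (e2 @ A @ u1) / (u1 @ A @ u1) * u1

print(u1 @ A @ u2)  # ~0: the directions are conjugate

x0 = np.array([2.0, -1.0])
x2 = line_min(line_min(x0, u1), u2)
print(x2, grad(x2))  # lands on the minimum at the origin; gradient ~0
```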
Our primary concern is that \$\$u_{k + 1}^T H u_k = 0\$\$. So we expand

\$\$
\begin{aligned}
0 &= u_{k + 1}^T H u_k \\
&= (\nabla f(x_k) + \gamma_k u_k)^T H u_k \\
&= \nabla f(x_k)^T H u_k + \gamma_k u_k^T H u_k
\end{aligned}
\$\$

If \$\$H\$\$ is positive definite, by definition \$\$u_k^T H u_k \neq 0\$\$. Even without this assumption, though, we will soon rewrite the denominator and show that it is nonzero. So I will go ahead and solve

\$\$
\gamma_k = - \frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}
\$\$

Since we can't compute \$\$H\$\$, we can't use this equation directly. But recall

\$\$
\nabla f(x_k) - \nabla f(x_{k - 1}) = \alpha_k H u_k
\$\$

Thus

\$\$
\begin{aligned}
\gamma_k &= - \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T (\nabla f(x_k) - \nabla f(x_{k - 1}))} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{(\nabla f(x_{k - 1}) + \gamma_{k - 1} u_{k - 1})^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
\end{aligned}
\$\$

We can simplify one step further and write

\$\$
\gamma_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
\$\$

This makes sense if the function we're minimizing really is quadratic. But when that assumption is only approximately true, empirically it tends to work better to use the first form. Also, note that the denominator (in both forms) is guaranteed to be nonzero, since if \$\$\nabla f(x_{k - 1})\$\$ were zero we would already have found our minimum.

Now that we have determined \$\$\gamma_k\$\$, we can compute \$\$u_{k + 1}\$\$ explicitly and find \$\$x_{k + 1} = x_k + \alpha_{k + 1} u_{k + 1}\$\$ by line minimization. Because \$\$x_{k + 1}\$\$ is a minimum along \$\$u_{k + 1}\$\$, \$\$\nabla f(x_{k + 1}) \cdot u_{k + 1} = 0\$\$. And our choice of \$\$\gamma_k\$\$ ensures that \$\$u_{k + 1}^T H u_k = 0\$\$, which as we saw earlier implies that \$\$\nabla f(x_{k + 1}) \cdot u_k = 0\$\$.

Now we must show that these relations hold for \$\$j < k\$\$ as well. First we show the conjugacy conditions. For any \$\$1 \leq j < k\$\$,

\$\$
\begin{aligned}
u_{k + 1}^T H u_j &= (\nabla f(x_k) + \gamma_k u_k)^T H u_j \\
&= \nabla f(x_k)^T H u_j \\
&= \nabla f(x_k)^T \frac{1}{\alpha_j} (\nabla f(x_j) - \nabla f(x_{j - 1})) \\
&= 0
\end{aligned}
\$\$

As we have seen, this implies that

\$\$
\nabla f(x_{k + 1}) \cdot u_j = 0
\$\$

for \$\$1 \leq j \leq k\$\$. So finally we must show the orthogonality of the gradients. For any \$\$1 \leq j \leq k\$\$,

\$\$
\nabla f(x_{k + 1}) \cdot \nabla f(x_j) = \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j) = 0
\$\$

And

\$\$
\nabla f(x_{k + 1}) \cdot \nabla f(x_0) = \nabla f(x_{k + 1}) \cdot u_1 = 0
\$\$

Thus we have proven

\$\$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k + 1 \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k + 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k + 1
\end{cases}
\$\$

By induction this shows that these properties hold up to \$\$n\$\$. (We cannot induct further than this, since the gradient at \$\$x_n\$\$ will be zero, and so \$\$u_{n + 1}\$\$ would be in the span of \$\$u_n\$\$.)
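Putting the pieces together, here is a minimal sketch of the resulting algorithm. It's an illustration written for these notes rather than the implementation in the linked repository: the line search is delegated to scipy's scalar minimizer (Brent's method), the directions follow \$\$-\nabla f\$\$ downhill (the same scheme as above with the signs flipped), and \$\$\gamma_k\$\$ uses the first form derived above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient_descent(f, grad, x0, tol=1e-8, max_iters=1000):
    """Nonlinear conjugate gradient descent, following the derivation above
    but with downhill directions d = -grad f."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iters):
        if np.linalg.norm(g) < tol:
            break
        # Line minimization along d (Brent's method on the 1d slice).
        alpha = minimize_scalar(lambda a: f(x + a * d)).x
        x = x + alpha * d
        g_new = grad(x)
        # gamma_k = grad_k^T (grad_k - grad_{k-1}) / (grad_{k-1}^T grad_{k-1})
        gamma = g_new @ (g_new - g) / (g @ g)
        d = -g_new + gamma * d
        g = g_new
    return x

# Quick check on a quadratic: f(x) = 0.5 x^T H x + b^T x, minimized where H x = -b.
H = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
f = lambda x: 0.5 * x @ H @ x + b @ x
grad = lambda x: H @ x + b
print(conjugate_gradient_descent(f, grad, np.zeros(2)))  # ~ [-2, 2]
```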
_psets/08.md 0 → 100644
---
title: Problem Set 8 (Optimization)
---

This week's questions are all about exploring various optimization algorithms, so I won't write them here specifically. This page summarizes my results. For notes on derivation and implementation, see [here](../notes/optimization.html). Code is [here](https://gitlab.cba.mit.edu/erik/optimization).

I implemented Nelder-Mead, naive gradient descent (always step a constant factor times the gradient), and conjugate gradient descent. For the latter I initially used a golden section line search, but replaced it with Brent's method. I still want to update it to use Brent's method with derivatives. I also played around with the reference CMA-ES implementation for comparison.

## Visualizations

Nelder-Mead 2d

![psi](../assets/img/08_nelder_mead_plot_2d.png)

Nelder-Mead 10d

![psi](../assets/img/08_nelder_mead_plot_10d.png)

Naive Gradient Descent 2d

![psi](../assets/img/08_gradient_descent_plot_2d.png)

Naive Gradient Descent 10d

![psi](../assets/img/08_gradient_descent_plot_10d.png)

Conjugate Gradient Descent 2d

![psi](../assets/img/08_conjugate_gradient_descent_plot_2d.png)

Conjugate Gradient Descent 10d

![psi](../assets/img/08_conjugate_gradient_descent_plot_10d.png)

CMA-ES 2d

![psi](../assets/img/08_cma_es_plot_2d.png)

CMA-ES 10d

![psi](../assets/img/08_cma_es_plot_10d.png)

## Performance Comparisons

I haven't made a pretty chart for this yet, since I'm still playing around with the parameters (to make it as fair a fight as possible). But initial results suggest that on the multidimensional Rosenbrock function, conjugate gradient descent is a clear winner in scaling, at least through one thousand dimensions.

Here are some raw numbers. Nelder-Mead and CMA-ES are asked to reach a tolerance of 1e-8, while the gradient descent methods are asked to find points where the gradient norm is less than 1e-8. These are the number of evaluations required to reach those goals, capped at 1e6.

```
Dim 2     nm: 193      gd: 28527    cgd: 1021   cma: 870
Dim 4     nm: 562      gd: 35724    cgd: 9448   cma: 2760
Dim 8     nm: 1591     gd: 45049    cgd: 5987   cma: 5790
Dim 16    nm: 27440    gd: 124637   cgd: 13083  cma: 14292
Dim 32    nm: 78969    gd: 200567   cgd: 13898  cma: 57876
Dim 64    nm: 561195   gd: 455578   cgd: 15194  cma: 211616
Dim 128   nm: 1000001  gd: 564451   cgd: 19565  cma: 953694
Dim 256   nm: 1000001  gd: 782196   cgd: 28110  cma: 1000000
Dim 512   nm: 1000001  gd: 1000000  cgd: 46918  cma: 1000000
Dim 1024  nm: 1000001  gd: 1000000  cgd: 10055  cma: 1000000
```

As a final note, all the algorithms except CMA-ES spent the majority of their time evaluating the function. But CMA-ES imposes large computational costs of its own, since it essentially has to invert a covariance matrix (or at least solve an eigenvalue problem).
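For reference, the benchmark objective above is the standard multidimensional Rosenbrock function. Here is a minimal numpy sketch of it and its gradient (my own rendering of the usual definition, not necessarily the exact code in the repository):

```python
import numpy as np

def rosenbrock(x):
    # Standard n-dimensional Rosenbrock; minimum of 0 at x = (1, ..., 1).
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rosenbrock_grad(x):
    # Analytic gradient, as used by the gradient-based methods.
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
    return g

# Both vanish at the global minimum.
print(rosenbrock(np.ones(10)), np.linalg.norm(rosenbrock_grad(np.ones(10))))  # 0.0 0.0
```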
