### Add a full derivation of conjugate gradient descent

parent e3709e73
---
title: Optimization Algorithms
---

## Nelder-Mead

TODO: Demonstrate that the Numerical Recipes formulation is equivalent to the standard formulation.

## Conjugate Gradient Descent

### Motivation

Since one dimensional optimization problems can be approached systematically, it's reasonable to consider schemes for multidimensional optimization that consist of iterative line searches. Say we have a function $$f$$ that maps $$\mathbb{R}^n$$ to $$\mathbb{R}$$, and some arbitrary starting point $$x_0$$. If we choose $$n$$ linearly independent directions $$u_1$$, $$\ldots$$, $$u_n$$, we can perform $$n$$ successive line minimizations. This will give us $$n$$ points

$$
\begin{aligned}
x_1 &= x_0 + \alpha_1 u_1 \\
x_2 &= x_1 + \alpha_2 u_2 \\
&\vdots \\
x_n &= x_{n-1} + \alpha_n u_n = x_0 + \sum_{i = 1}^n \alpha_i u_i \\
\end{aligned}
$$

where the coefficients $$\alpha_i$$ are chosen via some one dimensional optimization routine. Since each point $$x_i$$ is a local minimum along the direction $$u_i$$, it must be true that

$$
\nabla f(x_i) \cdot u_i = 0
$$

Otherwise we could have gone farther along $$u_i$$ and made $$f(x_i)$$ smaller. The problem with choosing random directions $$u_i$$ is that, for $$i \neq j$$, in general

$$
\nabla f(x_i) \cdot u_j \neq 0
$$

So each successive line minimization undoes the work of the previous ones, in the sense that it's necessary to cycle back and minimize along the previous directions again. This can lead to a very inefficient search. This raises the question: can we find directions $$u_i$$ for which successive line minimizations don't undo the previous ones? In particular, for $$1 \leq j < i \leq n$$ we want

$$
\nabla f(x_i) \cdot u_j = 0
$$

If we can do this, after performing all $$n$$ line minimizations, the resulting point will still be minimal along all $$n$$ directions. So as long as the $$u_i$$ span $$\mathbb{R}^n$$, the function will be minimized along all possible directions; i.e. $$x_n$$ is a local minimum. It turns out it's possible to do this exactly for quadratic minima, and it can be approximated for other functions (after all, every function looks quadratic close to a local extremum). In the latter case, repeating the whole process multiple times yields better and better solutions.

### Derivation

#### Preliminaries

Let's express $$f$$ as a second order Taylor expansion about $$x_0$$. I'll assume this is exact in the following discussion, but if you have higher order terms as well, you can view everything as an approximation.

$$
f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T H (x - x_0)
$$

By differentiating, we find that

$$
\nabla f(x) = \nabla f(x_0) + H (x - x_0)
$$

or equivalently,

$$
\nabla f(x_0 + x) = \nabla f(x_0) + H x
$$

Now if we could compute the Hessian matrix $$H$$, we could set the gradient to zero and compute the minimum $$x^*$$ exactly by solving $$H (x^* - x_0) = -\nabla f(x_0)$$. We will assume we can only compute function values and gradients, but this still brings up an important point: if $$H$$ doesn't have full rank, there isn't a unique solution. We will assume that $$H$$ is positive definite.
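To make that observation concrete, here is a small numeric sketch (the matrix $$H$$, vector $$b$$, and starting point are made up for illustration): for a positive definite quadratic, solving $$H (x^* - x_0) = -\nabla f(x_0)$$ lands exactly on the point where the gradient vanishes.

```python
import numpy as np

# A made-up positive definite quadratic: f(x) = 1/2 x^T H x - b^T x
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def f(x):
    return 0.5 * x @ H @ x - b @ x

def grad_f(x):
    return H @ x - b

x0 = np.array([2.0, -1.0])

# Newton step: solve H (x_star - x0) = -grad f(x0)
x_star = x0 + np.linalg.solve(H, -grad_f(x0))

# The gradient vanishes at x_star, so it is the unique minimum.
print(np.allclose(grad_f(x_star), 0.0))  # True
```

Of course, this single-solve shortcut is exactly what we are about to give up by assuming we can only evaluate $$f$$ and $$\nabla f$$.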
For any $$1 \leq i \leq n$$,

$$
\begin{aligned}
\nabla f(x_i) &= \nabla f \left( x_0 + \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + H \left( \sum_{k = 1}^i \alpha_k u_k \right) \\
&= \nabla f(x_0) + \sum_{k = 1}^i \alpha_k H u_k \\
\end{aligned}
$$

From this it follows that

$$
\nabla f(x_i) = \nabla f(x_{i - 1}) + \alpha_i H u_i
$$

or more generally, for $$0 \leq j < i \leq n$$,

$$
\nabla f(x_i) = \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i
$$

Thus for $$1 \leq j < i \leq n$$,

$$
\begin{aligned}
\nabla f(x_i) \cdot u_{j} &= \left( \nabla f(x_j) + \alpha_{j + 1} H u_{j + 1} + \ldots + \alpha_i H u_i \right) \cdot u_j \\
&= \nabla f(x_j) \cdot u_j + \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
&= \alpha_{j + 1} u_{j + 1}^T H u_j + \ldots + \alpha_i u_i^T H u_j \\
\end{aligned}
$$

(Recall that $$\nabla f(x_j) \cdot u_j = 0$$ because $$\alpha_j$$ is chosen to make $$x_j$$ a local minimum along $$u_j$$.) The whole point of this exercise is to make this inner product zero, since then each line minimization won't spoil the previous ones. So our goal will be achieved if we can ensure that

$$
u_i^T H u_j = 0
$$

for all $$1 \leq j < i \leq n$$. Such vectors are called *conjugate vectors*, from which this algorithm derives its name. (Though the careful reader will notice that the name is a bit of a misnomer: the gradients aren't conjugate, and the vectors that are conjugate aren't gradients.)

#### Base Case

We will now construct a set of conjugate vectors inductively. If $$\nabla f(x_0) = 0$$, we are done. So take $$u_1 = \nabla f(x_0) \neq 0$$ as the first direction, and find $$x_1 = x_0 + \alpha_1 u_1$$ via a line search. Yes, $$u_1$$ points uphill; I assume the line search is smart enough to find a negative $$\alpha_1$$. Also note that if the line search fails, then clearly $$f$$ wasn't strictly convex. Finally, since the gradient at $$x_0$$ isn't zero, $$\alpha_1$$ will not be zero. By construction,

$$
\nabla f(x_1) \cdot \nabla f(x_0) = \nabla f(x_1) \cdot u_1 = 0
$$

So trivially, the following properties hold:

$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq 1 \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq 1
\end{cases}
$$

#### Induction

Now assume that we've constructed $$u_1$$, $$\ldots$$, $$u_k$$ and $$x_1$$, $$\ldots$$, $$x_k$$, and that the following properties hold.

$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k
\end{cases}
$$

We will choose a new direction of the form

$$
u_{k + 1} = \nabla f(x_k) + \gamma_k u_k
$$

for some undetermined scalar $$\gamma_k$$. As before, if the gradient at $$x_k$$ is zero, then $$x_k$$ is the minimum and we are done. Additionally, since $$\nabla f(x_k) \cdot u_k = 0$$, there is no value of $$\gamma_k$$ for which $$u_{k + 1}$$ is zero. Finally, since $$\nabla f(x_k) \cdot u_j = 0$$ for all $$1 \leq j \leq k$$, $$u_{k + 1}$$ is not a linear combination of the prior $$u_j$$.

Our primary concern is that $$u_{k + 1}^T H u_k = 0$$. So we expand

$$
\begin{aligned}
0 &= u_{k + 1}^T H u_k \\
&= (\nabla f(x_k) + \gamma_k u_k)^T H u_k \\
&= \nabla f(x_k)^T H u_k + \gamma_k u_k^T H u_k
\end{aligned}
$$

Since $$H$$ is positive definite, $$u_k^T H u_k \neq 0$$. (And even if $$H$$ were not positive definite, we will soon rewrite the denominator in a form that is clearly nonzero.) So we can solve

$$
\gamma_k = - \frac{\nabla f(x_k)^T H u_k}{u_k^T H u_k}
$$

Since we can't compute $$H$$, we can't use this equation directly.
But recall

$$
\nabla f(x_k) - \nabla f(x_{k - 1}) = \alpha_k H u_k
$$

Thus

$$
\begin{aligned}
\gamma_k &= - \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T (\nabla f(x_k) - \nabla f(x_{k - 1}))} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{u_k^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{(\nabla f(x_{k - 1}) + \gamma_{k - 1} u_{k - 1})^T \nabla f(x_{k - 1})} \\
&= \frac{\nabla f(x_k)^T (\nabla f(x_k) - \nabla f(x_{k - 1}))}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
\end{aligned}
$$

Since $$\nabla f(x_k) \cdot \nabla f(x_{k - 1}) = 0$$ by the inductive hypothesis, we can simplify one step further and write

$$
\gamma_k = \frac{\nabla f(x_k)^T \nabla f(x_k)}{\nabla f(x_{k - 1})^T \nabla f(x_{k - 1})}
$$

This makes sense if the function we're minimizing really is quadratic. But when that assumption is only approximately true, empirically it tends to work better to use the first form. Also, note that the denominator (in both forms) is guaranteed to be nonzero, since if $$\nabla f(x_{k - 1})$$ were zero we would have already found our minimum.

Now that we have determined $$\gamma_k$$, we can compute $$u_{k + 1}$$ explicitly and find $$x_{k + 1} = x_k + \alpha_{k + 1} u_{k + 1}$$ by line minimization. Because $$x_{k + 1}$$ is a minimum along $$u_{k + 1}$$, $$\nabla f(x_{k + 1}) \cdot u_{k + 1} = 0$$. And our choice of $$\gamma_k$$ ensures that $$u_{k + 1}^T H u_k = 0$$, which as we saw earlier implies that $$\nabla f(x_{k + 1}) \cdot u_k = 0$$.

Now we must show that these relations hold for $$j < k$$ as well. First we show the conjugacy conditions. For any $$1 \leq j < k$$,

$$
\begin{aligned}
u_{k + 1}^T H u_j &= (\nabla f(x_k) + \gamma_k u_k)^T H u_j \\
&= \nabla f(x_k)^T H u_j \\
&= \nabla f(x_k)^T \frac{1}{\alpha_j} (\nabla f(x_j) - \nabla f(x_{j - 1})) \\
&= 0
\end{aligned}
$$

As we have seen, this implies that

$$
\nabla f(x_{k + 1}) \cdot u_j = 0
$$

for $$1 \leq j \leq k$$. So finally we must show the orthogonality of the gradients. For any $$0 \leq j \leq k$$,

$$
\begin{aligned}
\nabla f(x_{k + 1}) \cdot \nabla f(x_j) &= \nabla f(x_{k + 1}) \cdot (u_{j + 1} - \gamma_j u_j) \\
&= 0
\end{aligned}
$$

(Here we've rearranged $$u_{j + 1} = \nabla f(x_j) + \gamma_j u_j$$, with the convention $$\gamma_0 = 0$$ so that $$\nabla f(x_0) = u_1$$; both dot products vanish because $$\nabla f(x_{k + 1}) \cdot u_j = 0$$ for all $$1 \leq j \leq k + 1$$.) Thus we have proven

$$
\begin{cases}
u_i^T H u_j = 0 &\text{for } 1 \leq j < i \leq k + 1 \\
\nabla f(x_i) \cdot u_j = 0 &\text{for } 1 \leq j \leq i \leq k + 1 \\
\nabla f(x_i) \cdot \nabla f(x_j) = 0 &\text{for } 0 \leq j < i \leq k + 1
\end{cases}
$$

By induction this shows that these properties hold up to $$n$$. (We cannot induct further than this, since the gradient at $$x_n$$ will be zero, and so $$u_{n + 1}$$ would be in the span of $$u_n$$.)
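The derivation above translates directly into code. Below is a minimal sketch in Python (my own illustration, with a made-up test problem): it uses the first, Hessian-free form of $$\gamma_k$$ (the Polak-Ribière choice), and a secant-method line minimization on the directional derivative stands in for whatever one dimensional routine you prefer. On an exactly quadratic $$f$$ with positive definite $$H$$, it reaches the minimum in $$n$$ line searches.

```python
import numpy as np

def line_minimize(f_grad, x, u, iters=20):
    """Find alpha minimizing f(x + alpha*u) by running the secant method
    on the directional derivative g(a) = grad f(x + a*u) . u.
    (Exact after one update when f is quadratic.)"""
    a0, a1 = 0.0, 1.0
    g0 = f_grad(x + a0 * u) @ u
    g1 = f_grad(x + a1 * u) @ u
    for _ in range(iters):
        if abs(g1 - g0) < 1e-15:
            break
        a0, a1 = a1, a1 - g1 * (a1 - a0) / (g1 - g0)
        g0, g1 = g1, f_grad(x + a1 * u) @ u
    return a1

def conjugate_gradient_descent(f_grad, x0, n_steps):
    """Nonlinear CG using only gradient evaluations."""
    x = x0
    g = f_grad(x)
    u = g  # u_1 = grad f(x_0); the line search finds a negative alpha
    for _ in range(n_steps):
        if np.linalg.norm(g) < 1e-12:
            break  # gradient is zero: x is the minimum
        alpha = line_minimize(f_grad, x, u)
        x = x + alpha * u
        g_new = f_grad(x)
        # Polak-Ribiere: gamma_k = g_k . (g_k - g_{k-1}) / (g_{k-1} . g_{k-1})
        gamma = g_new @ (g_new - g) / (g @ g)
        u = g_new + gamma * u  # u_{k+1} = grad f(x_k) + gamma_k u_k
        g = g_new
    return x

# Demo on a made-up positive definite quadratic f(x) = 1/2 x^T H x - b^T x.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, -2.0, 3.0])
grad = lambda x: H @ x - b

x_min = conjugate_gradient_descent(grad, np.zeros(3), n_steps=3)
print(np.allclose(x_min, np.linalg.solve(H, b)))  # True
```

The secant line search is exact for quadratics, which is what makes the $$n$$-step convergence visible here; for a general $$f$$ you would substitute a more robust bracketing search, and restart the whole process as described above.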