### Add tenth pset

`_code/pset_10/py/pade_approximants.py`

```python
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

x = sp.symbols("x", real=True)

def pade_approximant(function, N, M):
    # Compute relevant taylor series terms
    derivative = function
    taylor_coefficients = [function.subs(x, 0)]
    for i in range(1, N + M + 1):
        derivative = sp.diff(derivative, x)
        taylor_coefficients.append(derivative.subs(x, 0) / sp.factorial(i))

    # Build matrix (first row enforces b_0 = 1; remaining rows kill the
    # coefficients of x^(N+1) through x^(N+M))
    one_one = [1] + [0] * M
    matrix_rows = [one_one]
    for i in range(1, M + 1):
        new_row = []
        for j in range(0, min(N + i, M) + 1):
            new_row.append(taylor_coefficients[N + i - j])
        for j in range(min(N + i, M) + 1, M + 1):
            new_row.append(0)
        matrix_rows.append(new_row)
    A = sp.Matrix(matrix_rows)
    b = sp.Matrix(one_one)
    sp.pprint(A)
    sp.pprint(b)

    # Solve (linsolve returns a FiniteSet containing one solution tuple)
    answer = sp.linsolve((A, b))
    b_coeffs = list(answer.args[0])
    a_coeffs = [
        sum(b_coeffs[m] * taylor_coefficients[n - m] for m in range(0, min(n, M) + 1))
        for n in range(0, N + 1)
    ]
    print(a_coeffs)
    print(b_coeffs)
    print("")
    return lambda x_val: sum(a_coeffs[n] * x_val ** n for n in range(0, N + 1)) \
        / sum(b_coeffs[m] * x_val ** m for m in range(0, M + 1))

# print pade approximant values
function = sp.exp(x)
pade_approximations = [pade_approximant(function, i, i) for i in range(1, 6)]
for approx in pade_approximations:
    print(approx(1))
print("")
for approx in pade_approximations:
    print(approx(1.0))
print("")
pade_errors = [abs(approx(1.0) - function.subs(x, 1.0)) for approx in pade_approximations]

# Create polynomial approximations and print their values
poly_approximations = [
    #lambda x_val: sum(taylor_coefficients[n] * x**n for n in range(0, order)) for order in [3, 5, 7, 9, 11]
    sum(x**n / sp.factorial(n) for n in range(0, order))
    for order in [3, 5, 7, 9, 11]
]
for approx in poly_approximations:
    print(approx)
    print(approx.subs(x, 1))
    print(approx.subs(x, 1.0))
    print("")
poly_errors = [abs(approx.subs(x, 1.0) - function.subs(x, 1.0)) for approx in poly_approximations]

# Graph errors
fig1 = plt.figure()
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax1 = fig1.add_axes([left, bottom, width, height])
x_vals = [3, 5, 7, 9, 11]  # number of free parameters
ax1.set_yscale("log")
ax1.plot(x_vals, pade_errors, label="padé")
ax1.plot(x_vals, poly_errors, label="poly")
ax1.set_xlabel("free parameters")
ax1.set_ylabel("absolute error")
ax1.legend()
ax1.set_title("Absolute errors of Padé and polynomial approximations")
fig1.savefig("../../../assets/img/10_errors.png", transparent=True)
```
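As a sanity check of the script above, the smallest interesting case can be worked out by hand. This is a sketch of my own (not part of the commit): for \$\$f(x) = e^x\$\$ the Taylor coefficients are \$\$c_l = 1/l!\$\$, and solving the resulting three-equation system for \$\$N = M = 2\$\$ recovers the \$\$[2/2]\$\$ denominator and numerator coefficients.

```python
import sympy as sp

# Solve the (M + 1)-equation system for f(x) = exp(x) with N = M = 2.
# Here c_l = 1 / l! since every derivative of exp at 0 is 1.
N = M = 2
c = [sp.Integer(1) / sp.factorial(l) for l in range(N + M + 1)]
A = sp.Matrix([
    [1, 0, 0],           # b_0 = 1
    [c[3], c[2], c[1]],  # coefficient of x^3 vanishes
    [c[4], c[3], c[2]],  # coefficient of x^4 vanishes
])
rhs = sp.Matrix([1, 0, 0])
b = list(A.solve(rhs))
# Then a_n = sum_m b_m c_{n - m} gives the numerator coefficients.
a = [sum(b[m] * c[n - m] for m in range(min(n, M) + 1)) for n in range(N + 1)]
print(b)  # [1, -1/2, 1/12]
print(a)  # [1, 1/2, 1/12]
```

That is, \$\$(1 + x/2 + x^2/12) / (1 - x/2 + x^2/12)\$\$, as expected.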
---
title: Backpropagation
---

Backpropagation is the ubiquitous method for performing gradient descent in artificial neural networks. Ultimately it allows us to compute the derivative of a loss function with respect to every free parameter in the network (i.e. every weight and bias). It does so layer by layer. At the end of the day, it's just the chain rule, but there's a decent amount of bookkeeping, so I want to write it out explicitly.

Suppose we have an artificial neural network with \$\$N\$\$ layers. Let the number of neurons in layer \$\$n\$\$ be \$\$N_n\$\$. Let layer \$\$n\$\$ have weights \$\$W^n\$\$ (an \$\$N_n\$\$ by \$\$N_{n - 1}\$\$ matrix) and biases \$\$b^n\$\$ (a vector with \$\$N_n\$\$ components). Call its output \$\$a^n\$\$. Define the vector

\$\$
z^n = W^n a^{n-1} + b^n
\$\$

This is just the output of layer \$\$n\$\$ before the activation function \$\$\theta\$\$ is applied. So

\$\$
a^n = \theta(z^n)
\$\$

(where the function application is performed on each element independently). We'll treat layer zero as the inputs. So \$\$a^0\$\$ is defined, but not \$\$z^0\$\$, \$\$W^0\$\$, or \$\$b^0\$\$. Then \$\$a^1\$\$ through \$\$a^N\$\$ are the outputs of our layers. Finally, we have some loss function \$\$L\$\$, which I assume is a function only of the last layer's outputs (\$\$a^N\$\$).

At the end of the day, to update the parameters of the network we need to know the partial derivative of \$\$L\$\$ with respect to all entries of all \$\$W^n\$\$ and \$\$b^n\$\$. To get there, it's helpful to consider as a stepping stone the partial derivatives of \$\$L\$\$ with respect to the entries of a particular \$\$z^n\$\$. I'll write this as \$\$\nabla_{z^n} L\$\$ (a vector with \$\$N_n\$\$ elements).
In component form,

\$\$
z^n_i = \sum_j W^n_{i, j} a^{n-1}_j + b^n_i
\$\$

So \$\$\partial z_i^n / \partial b_i^n = 1\$\$ and by the chain rule,

\$\$
\begin{aligned}
\frac{\partial L}{\partial b_i^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial b_i^n} \\
&= \frac{\partial L}{\partial z_i^n}
\end{aligned}
\$\$

Thus

\$\$
\nabla_{b^n} L = \nabla_{z^n} L
\$\$

Similarly,

\$\$
\begin{aligned}
\frac{\partial L}{\partial W_{i,j}^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial W_{i,j}^n} \\
&= \frac{\partial L}{\partial z_i^n} a_j^{n-1}
\end{aligned}
\$\$

So

\$\$
\nabla_{W^n} L = (\nabla_{z^n} L) (a^{n-1})^T
\$\$

So it's easy to go from \$\$\nabla_{z^n} L\$\$ to \$\$\nabla_{W^n} L\$\$ and \$\$\nabla_{b^n} L\$\$. It's also easy to get from one \$\$\nabla_{z^n} L\$\$ to the next. In particular,

\$\$
\begin{aligned}
\frac{\partial L}{\partial a_j^{n-1}} &= \sum_{i} \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial a_j^{n-1}} \\
&= \sum_{i} \frac{\partial L}{\partial z_i^n} W_{i, j}^n
\end{aligned}
\$\$

So

\$\$
\nabla_{a^{n-1}} L = (W^n)^T (\nabla_{z^n} L)
\$\$

Finally, since \$\$a_i^{n-1} = \theta(z_i^{n-1})\$\$,

\$\$
\begin{aligned}
\frac{\partial L}{\partial z_i^{n-1}} &= \frac{\partial L}{\partial a_i^{n - 1}} \frac{\partial a_i^{n-1}}{\partial z_i^{n-1}} \\
&= \frac{\partial L}{\partial a_i^{n - 1}} \theta'(z_i^{n-1})
\end{aligned}
\$\$

and so

\$\$
\nabla_{z^{n-1}} L = (W^n)^T (\nabla_{z^n} L) \odot (\nabla_{z^{n-1}} \theta)
\$\$

Here \$\$\nabla_{z^{n-1}} \theta\$\$ means the derivative of \$\$\theta\$\$ evaluated at each \$\$z_i^{n-1}\$\$ (so strictly speaking it's more of a Jacobian than a gradient), and \$\$\odot\$\$ indicates the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) (i.e. element-wise product).
So starting from the output of the network,

\$\$
\nabla_{z^N} L = \nabla_{a^N} L \odot \nabla_{z^N} \theta
\$\$

And from here we just apply the equation above repeatedly to compute \$\$\nabla_{z^{N - 1}} L\$\$, \$\$\nabla_{z^{N - 2}} L\$\$, etc. At each step we can easily compute \$\$\nabla_{b^n} L\$\$ and \$\$\nabla_{W^n} L\$\$ as well. When we get to the first layer, note that \$\$\nabla_{W^1} L\$\$ depends on the inputs of the network \$\$a^0\$\$, rather than the outputs of some other layer.

To implement this efficiently, note that we don't need to store all the gradients we've computed so far. We just need to keep the most recent one, and have some memory in which to calculate the next. So if you allocate two arrays with as many elements as the largest layer has nodes, you can keep reusing these for the whole computation. For standard gradient descent, updates to the weights and biases can be done in place, so computing those gradients requires no additional storage.
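To make the bookkeeping concrete, here's a minimal NumPy sketch of these recursions. The activation (\$\$\tanh\$\$), the squared-error loss, and the layer sizes are all my own choices for illustration; the derivation above doesn't fix them. The last few lines compare one computed weight derivative against a central finite difference as a sanity check.

```python
import numpy as np

# Assumptions (mine, for illustration): theta = tanh, L = ||a^N - y||^2 / 2,
# layer sizes [3, 4, 2], random weights and data.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
Ws = [rng.standard_normal((sizes[n + 1], sizes[n])) for n in range(len(sizes) - 1)]
bs = [rng.standard_normal(sizes[n + 1]) for n in range(len(sizes) - 1)]
a0 = rng.standard_normal(sizes[0])
y = rng.standard_normal(sizes[-1])

def forward(a0):
    zs, activations = [], [a0]
    for W, b in zip(Ws, bs):
        zs.append(W @ activations[-1] + b)   # z^n = W^n a^{n-1} + b^n
        activations.append(np.tanh(zs[-1]))  # a^n = theta(z^n)
    return zs, activations

def backward(zs, activations):
    grad_Ws, grad_bs = [None] * len(Ws), [None] * len(Ws)
    # nabla_{z^N} L = nabla_{a^N} L (Hadamard) theta'(z^N), with tanh' = 1 - tanh^2
    grad_z = (activations[-1] - y) * (1 - np.tanh(zs[-1]) ** 2)
    for n in reversed(range(len(Ws))):
        grad_bs[n] = grad_z                            # nabla_{b^n} L = nabla_{z^n} L
        grad_Ws[n] = np.outer(grad_z, activations[n])  # (nabla_{z^n} L)(a^{n-1})^T
        if n > 0:  # propagate to the previous layer's z
            grad_z = (Ws[n].T @ grad_z) * (1 - np.tanh(zs[n - 1]) ** 2)
    return grad_Ws, grad_bs

zs, activations = forward(a0)
grad_Ws, grad_bs = backward(zs, activations)

# Sanity check one weight derivative against a central finite difference.
eps = 1e-6
Ws[0][1, 2] += eps
loss_plus = 0.5 * np.sum((forward(a0)[1][-1] - y) ** 2)
Ws[0][1, 2] -= 2 * eps
loss_minus = 0.5 * np.sum((forward(a0)[1][-1] - y) ** 2)
Ws[0][1, 2] += eps
assert abs((loss_plus - loss_minus) / (2 * eps) - grad_Ws[0][1, 2]) < 1e-5
```

Note that only `grad_z` is carried between layers, which is exactly the storage observation made above.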
```diff
@@ -146,7 +146,8 @@ the floor of a flat valley. Better yet, if your line search is smart enough to q
 functions, then you just need to ensure you can't go downhill forever — i.e. \$\$H\$\$ is positive semidefinite.
-Moving along, define \$\$x_i = x_{i - 1} + \alpha_i u_i\$\$ as before. For any \$\$1 \leq i \leq n\$\$,
+Moving along, define \$\$x_i = x_{i - 1} + \alpha_i u_i\$\$ via line minimizations as before. For any \$\$1 \leq i \leq n\$\$,
 \$\$ \begin{aligned}
@@ -191,8 +192,8 @@ u_i^T H u_j = 0
 \$\$
 for all \$\$1 \leq j < i \leq n\$\$. Such vectors are called *conjugate vectors*, from which this
-algorithm derives its name. (Though the careful reader will notice that it's not the gradients that are conjugate — and the vectors that are conjugate aren't gradients.)
+algorithm derives its name. (Though perhaps it's applied sloppily, since it's not the gradients themselves that are conjugate.)
 #### Base Case
```
_psets/10.md 0 → 100644
---
title: Problem Set 10 (Functions)
---

## 1

{:.question}
Find the first five diagonal Padé approximants [1/1], ..., [5/5] to \$\$e^x\$\$ around the origin. Remember that the numerator and denominator can be multiplied by a constant to make the numbers as convenient as possible. Evaluate the approximations at \$\$x = 1\$\$ and compare with the correct value of \$\$e = 2.718281828459045\$\$. How is the error improving with the order? How does that compare to the polynomial error?

For a fixed function \$\$f\$\$ and integers \$\$N \geq 0\$\$ and \$\$M \geq 0\$\$, the Padé approximant is the function

\$\$
[N/M]_ f(x) = \frac{\sum_{n = 0}^N a_n x^n}{1 + \sum_{m = 1}^M b_m x^m}
\$\$

that matches as many terms as possible of the Taylor series of \$\$f\$\$. (We can define the approximant about any point, but without loss of generality we'll only consider the origin.) There are \$\$N + M + 1\$\$ parameters in the formula above (\$\$a_0\$\$, \$\$\ldots\$\$, \$\$a_N\$\$, and \$\$b_1\$\$, \$\$\ldots\$\$, \$\$b_M\$\$), so in general this means we can match \$\$f\$\$ up to order \$\$N + M\$\$, i.e. the first \$\$N + M + 1\$\$ terms. (Though by coincidence, or otherwise, we may end up matching some higher order terms as well.)

We can write the resulting system of equations explicitly by expanding the Taylor series of the approximant to order \$\$L = N + M\$\$ (since this gives us \$\$N + M + 1\$\$ terms). To make the equations simpler to write I will introduce a constant \$\$b_0 = 1\$\$.

\$\$
\frac{\sum_{n = 0}^N a_n x^n}{\sum_{m = 0}^M b_m x^m} = \sum_{l = 0}^L c_l x^l
\$\$

Then we multiply by the denominator.

\$\$
\begin{aligned}
\sum_{n = 0}^N a_n x^n &= \left(\sum_{m = 0}^M b_m x^m \right) \left( \sum_{l = 0}^L c_l x^l \right) \\
&= \sum_{m = 0}^M \sum_{l = 0}^L b_m c_l x^{m + l}
\end{aligned}
\$\$

By setting the different powers of \$\$x\$\$ equal, this gives us \$\$L + 1 = N + M + 1\$\$ equations.
\$\$
\begin{cases}
a_n = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } 0 \leq n \leq N \\
0 = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } N < n \leq L = N + M
\end{cases}
\$\$

We know what \$\$c_0\$\$, \$\$\ldots\$\$, \$\$c_L\$\$ are, since they must match the Taylor series of \$\$f\$\$. (In particular, \$\$c_l = f^{(l)}(0) / l!\$\$.) So by solving this system of equations we can determine all of the unknown parameters. In particular, the last \$\$M\$\$ equations (i.e. those for \$\$N < n\$\$) can be solved to determine \$\$b_1\$\$, \$\$\ldots\$\$, \$\$b_M\$\$. Then the first \$\$N + 1\$\$ equations immediately give values for \$\$a_0\$\$, \$\$\ldots\$\$, \$\$a_N\$\$.

The first step can be written as a system of \$\$M + 1\$\$ equations.

\$\$
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
c_{N + 1} & c_{N} & c_{N - 1} & \cdots & c_{N - M + 2} & c_{N - M + 1} \\
c_{N + 2} & c_{N + 1} & c_{N} & \cdots & c_{N - M + 3} & c_{N - M + 2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
c_{N + M - 1} & c_{N + M - 2} & c_{N + M - 3} & \cdots & c_{N} & c_{N - 1} \\
c_{N + M} & c_{N + M - 1} & c_{N + M - 2} & \cdots & c_{N + 1} & c_{N} \\
\end{bmatrix}
\begin{bmatrix}
b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{M - 1} \\ b_M
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0
\end{bmatrix}
\$\$

Note that if \$\$N + 1 < M\$\$, then some upper diagonal chunk of the matrix will be zero, corresponding to those entries where the subscript of \$\$c\$\$ would be negative. Finally, there's no guarantee that this matrix will be nonsingular. (For example, consider the case where \$\$N = M\$\$ and all derivatives of \$\$f\$\$ up to order \$\$N + M\$\$ are zero. Then all rows of the matrix except the first will be zero.) In such a situation, one can try to throw out degenerate equations and pull new ones from higher order terms (i.e. consider \$\$L > N + M\$\$).
However it's possible that all additional rows will be degenerate; in this case, the approximant is legitimately underdetermined. This means it can match the function exactly, and you can reduce \$\$M\$\$ until your system is nonsingular.

Ok, so now to actually answer the question. I wrote a SymPy [script](https://gitlab.cba.mit.edu/erik/nmm_2020_site/-/tree/master/_code/pset_10/py/pade_approximants.py) that uses the strategy we just derived to build approximants for me. For \$\$f(x) = e^x\$\$, I get the following approximations:

\$\$
\begin{aligned}
[1/1]_ f(x) &= \frac{1 + x/2}{1 - x/2} \\
[2/2]_ f(x) &= \frac{1 + x/2 + x^2/12}{1 - x/2 + x^2/12} \\
[3/3]_ f(x) &= \frac{1 + x/2 + x^2/10 + x^3/120}{1 - x/2 + x^2/10 - x^3/120} \\
[4/4]_ f(x) &= \frac{1 + x/2 + 3x^2/28 + x^3/84 + x^4/1,680}{1 - x/2 + 3x^2/28 - x^3/84 + x^4/1,680} \\
[5/5]_ f(x) &= \frac{1 + x/2 + x^2/9 + x^3/72 + x^4/1,008 + x^5/30,240}{1 - x/2 + x^2/9 - x^3/72 + x^4/1,008 - x^5/30,240} \\
\end{aligned}
\$\$

These give the corresponding approximations for \$\$e\$\$:

\$\$
\begin{aligned}
[1/1]_ f(1) &= 3 &= 3.00000000000000 \\
[2/2]_ f(1) &= \frac{19}{7} &\approx 2.71428571428571 \\
[3/3]_ f(1) &= \frac{193}{71} &\approx 2.71830985915493 \\
[4/4]_ f(1) &= \frac{2,721}{1,001} &\approx 2.71828171828172 \\
[5/5]_ f(1) &= \frac{49,171}{18,089} &\approx 2.71828182873569
\end{aligned}
\$\$

Polynomial approximations, on the other hand (to equivalent orders, evaluated at \$\$x = 1\$\$), give

\$\$
\begin{aligned}
\sum_{n = 0}^2 \frac{1}{n!} &= \frac{5}{2} &= 2.50000000000000 \\
\sum_{n = 0}^4 \frac{1}{n!} &= \frac{65}{24} &\approx 2.70833333333333 \\
\sum_{n = 0}^6 \frac{1}{n!} &= \frac{1,957}{720} &\approx 2.71805555555556 \\
\sum_{n = 0}^8 \frac{1}{n!} &= \frac{109,601}{40,320} &\approx 2.71827876984127 \\
\sum_{n = 0}^{10} \frac{1}{n!} &= \frac{9,864,101}{3,628,800} &\approx 2.71828180114638
\end{aligned}
\$\$

Here are the different errors.
![errors](../assets/img/10_errors.png)

## 3

{:.question}
Train a neural network on the output from an order 4 maximal LFSR and learn to reproduce it. How do the results depend on the network depth and architecture?

I have previously implemented backpropagation from scratch, to train my own [VAEs](https://gitlab.cba.mit.edu/erik/gears/-/tree/master/gears/custom_networks). So rather than write this again, I wrote up a better [derivation of backpropagation](../notes/backpropagation.html) which I can use as a reference for future modifications.
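For reference, the training data the question describes is easy to generate. Below is a minimal sketch of my own, assuming the primitive feedback polynomial \$\$x^4 + x^3 + 1\$\$ (the pset doesn't pin down which maximal tap set to use); any primitive choice gives a bit stream with the maximal period \$\$2^4 - 1 = 15\$\$.

```python
def lfsr_bits(n, state=(1, 0, 0, 0)):
    # Fibonacci LFSR with feedback polynomial x^4 + x^3 + 1 (primitive over
    # GF(2), so the cycle is maximal: period 2^4 - 1 = 15).
    s = list(state)
    out = []
    for _ in range(n):
        out.append(s[3])        # output the last stage
        feedback = s[3] ^ s[2]  # taps at stages 4 and 3
        s = [feedback] + s[:3]  # shift the register
    return out

bits = lfsr_bits(30)
print(bits[:15] == bits[15:])  # True: the stream repeats with period 15
```

A maximal sequence visits every nonzero state once per period, so the first 15 bits contain eight ones and seven zeros.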

