Commit 2740192e authored by Erik Strand

Add tenth pset
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

x = sp.symbols("x", real=True)

def pade_approximant(function, N, M):
    # Compute relevant taylor series terms
    derivative = function
    taylor_coefficients = [ function.subs(x, 0) ]
    for i in range(1, N + M + 1):
        derivative = sp.diff(derivative, x)
        taylor_coefficients.append(derivative.subs(x, 0) / sp.factorial(i))

    # Build matrix
    one_one = [1] + [0] * M
    matrix_rows = [one_one]
    for i in range(1, M + 1):
        new_row = []
        for j in range(0, min(N + i, M) + 1):
            new_row.append(taylor_coefficients[N + i - j])
        for j in range(min(N + i, M) + 1, M + 1):
            new_row.append(0)
        matrix_rows.append(new_row)
    A = sp.Matrix(matrix_rows)
    b = sp.Matrix(one_one)
    sp.pprint(A)
    sp.pprint(b)

    # Solve
    answer = sp.linsolve((A, b))
    b_coeffs = list(answer.args[0])
    a_coeffs = [
        sum(b_coeffs[m] * taylor_coefficients[n - m] for m in range(0, min(n, M) + 1))
        for n in range(0, N + 1)
    ]
    print(a_coeffs)
    print(b_coeffs)
    print("")
    return lambda x_val: sum(a_coeffs[n] * x_val ** n for n in range(0, N + 1)) \
        / sum(b_coeffs[m] * x_val ** m for m in range(0, M + 1))

# print pade approximant values
function = sp.exp(x)
pade_approximations = [ pade_approximant(function, i, i) for i in range(1, 6) ]
for approx in pade_approximations:
    print(approx(1))
print("")
for approx in pade_approximations:
    print(approx(1.0))
print("")
pade_errors = [ abs(approx(1.0) - function.subs(x, 1.0)) for approx in pade_approximations ]

# Create polynomial approximations and print their values
poly_approximations = [
    #lambda x_val: sum(taylor_coefficients[n] * x**n for n in range(0, order)) for order in [3, 5, 7, 9, 11]
    sum(x**n / sp.factorial(n) for n in range(0, order)) for order in [3, 5, 7, 9, 11]
]
for approx in poly_approximations:
    print(approx)
    print(approx.subs(x, 1))
    print(approx.subs(x, 1.0))
    print("")
poly_errors = [ abs(approx.subs(x, 1.0) - function.subs(x, 1.0)) for approx in poly_approximations ]

# Graph errors
fig1 = plt.figure()
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax1 = fig1.add_axes([left, bottom, width, height])
x_vals = [3, 5, 7, 9, 11] # number of free parameters
ax1.set_yscale("log")
ax1.plot(x_vals, pade_errors, label="padé")
ax1.plot(x_vals, poly_errors, label="poly")
ax1.set_xlabel("free parameters")
ax1.set_ylabel("absolute error")
ax1.legend()
ax1.set_title("Absolute errors of Padé and polynomial approximations")
fig1.savefig("../../../assets/img/10_errors.png", transparent=True)
---
title: Backpropagation
---
Backpropagation is the ubiquitous method for performing gradient descent in artificial neural
networks. Ultimately it allows us to compute the derivative of a loss function with respect to every
free parameter in the network (i.e. every weight and bias). It does so layer by layer. At the end of
the day, it's just the chain rule, but there's a decent amount of bookkeeping so I want to write it
out explicitly.
Suppose we have an artificial neural network with $$N$$ layers. Let the number of neurons in layer
$$n$$ be $$N_n$$. Let layer $$n$$ have weights $$W^n$$ (an $$N_n$$ by $$N_{n - 1}$$ matrix) and biases
$$b^n$$ (a vector with $$N_n$$ components). Call its output $$a^n$$. Define the vector
$$
z^n = W^n a^{n-1} + b^n
$$
This is just the output of layer $$n$$ before the activation function $$\theta$$ is applied. So
$$
a^n = \theta(z^n)
$$
(where the function application is performed on each element independently).
We'll treat layer zero as the inputs. So $$a^0$$ is defined, but not $$z^0$$, $$W^0$$, or $$b^0$$.
Then $$a^1$$ through $$a^N$$ are the outputs of our layers. Finally, we have some loss function
$$L$$, which I assume is a function only of the last layer's outputs ($$a^N$$).
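To make the notation concrete, here is a minimal NumPy sketch of the forward pass. It's only an
illustration of the definitions above; the layer sizes, random seed, and sigmoid activation are
arbitrary choices, not anything tied to a particular network.
```python
import numpy as np

def theta(z):
    # activation function, applied element-wise (sigmoid, as an arbitrary example)
    return 1.0 / (1.0 + np.exp(-z))

# a toy network with N = 2 layers: N_0 = 3 inputs, N_1 = 4, N_2 = 2
rng = np.random.default_rng(0)
W = [None, rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]  # W[n] is N_n by N_{n-1}
b = [None, rng.normal(size=4), rng.normal(size=2)]            # b[n] has N_n components

a = [rng.normal(size=3)]  # a[0] is the input
z = [None]                # z[0] is not defined
for n in range(1, 3):
    z.append(W[n] @ a[n - 1] + b[n])  # z^n = W^n a^{n-1} + b^n
    a.append(theta(z[n]))             # a^n = theta(z^n)
```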
To update the parameters of the network, we need to know the partial
derivative of $$L$$ with respect to all entries of all $$W^n$$ and $$b^n$$. To get there, it's
helpful to consider as a stepping stone the partial derivatives of $$L$$ with respect to the entries
of a particular $$z^n$$. I'll write this as $$\nabla_{z^n} L$$ (a vector with $$N_n$$ elements).
In component form,
$$
z^n_i = \sum_j W^n_{i, j} a^{n-1}_j + b^n_i
$$
So $$\partial z_i^n / \partial b_i^n = 1$$ and by the chain rule,
$$
\begin{aligned}
\frac{\partial L}{\partial b_i^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial b_i^n} \\
&= \frac{\partial L}{\partial z_i^n}
\end{aligned}
$$
Thus
$$
\nabla_{b^n} L = \nabla_{z^n} L
$$
Similarly,
$$
\begin{aligned}
\frac{\partial L}{\partial W_{i,j}^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial W_{i,j}^n} \\
&= \frac{\partial L}{\partial z_i^n} a_j^{n-1}
\end{aligned}
$$
So
$$
\nabla_{W^n} L = (\nabla_{z^n} L) (a^{n-1})^T
$$
So it's easy to go from $$\nabla_{z^n} L$$ to $$\nabla_{W^n} L$$ and $$\nabla_{b^n} L$$.
It's also easy to get from one $$\nabla_{z^n} L$$ to the next. In particular,
$$
\begin{aligned}
\frac{\partial L}{\partial a_j^{n-1}} &= \sum_{i} \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial a_j^{n-1}} \\
&= \sum_{i} \frac{\partial L}{\partial z_i^n} W_{i, j}^n
\end{aligned}
$$
So
$$
\nabla_{a^{n-1}} L = (W^n)^T (\nabla_{z^n} L)
$$
Finally, since $$a_i^{n-1} = \theta(z_i^{n-1})$$,
$$
\begin{aligned}
\frac{\partial L}{\partial z_i^{n-1}} &= \frac{\partial L}{\partial a_i^{n - 1}} \frac{\partial a_i^{n-1}}{\partial z_i^{n-1}} \\
&= \frac{\partial L}{\partial a_i^{n - 1}} \theta'(z_i^{n-1}) \\
\end{aligned}
$$
and so
$$
\nabla_{z^{n-1}} L = (W^n)^T (\nabla_{z^n} L) \odot (\nabla_{z^{n-1}} \theta)
$$
Here $$\nabla_{z^{n-1}} \theta$$ means the derivative of $$\theta$$ evaluated at each $$z_i^{n-1}$$
(so strictly speaking it's more of a Jacobian than a gradient), and $$\odot$$ indicates the
[Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) (i.e. element-wise
product).
So starting from the output of the network,
$$
\nabla_{z^N} L = \nabla_{a^N} L \odot \nabla_{z^N} \theta
$$
And from here we just apply the equation above repeatedly to compute $$\nabla_{z^{N - 1}} L$$,
$$\nabla_{z^{N - 2}} L$$, etc. At each step we can easily compute $$\nabla_{b^n} L$$ and
$$\nabla_{W^n} L$$ as well. When we get to the first layer, note that $$\nabla_{W^1} L$$ depends on
the inputs of the network $$a^0$$, rather than the outputs of some other layers.
To implement this efficiently, note that we don't need to store all the gradients we've computed so
far. We just need to keep the most recent one, and have some memory in which to calculate the next.
So if you allocate two arrays with as many elements as the largest layer has nodes, then you can
keep reusing these for the whole computation. For standard gradient descent, updates to the weights
and biases can be done in-place, so computing those gradients requires no additional storage.
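Putting the pieces together, here is a minimal sketch of one backward pass, continuing the toy
forward pass above. It assumes a squared-error loss and the same sigmoid activation, again just as
arbitrary illustrations, and it stores every gradient rather than reusing two buffers, for clarity.
```python
def theta_prime(z):
    # derivative of the sigmoid, evaluated element-wise
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# loss L = 0.5 * ||a^N - y||^2, so grad_{a^N} L = a^N - y
y = np.zeros(2)  # arbitrary target
grad_z = (a[2] - y) * theta_prime(z[2])      # grad_{z^N} L = grad_{a^N} L ⊙ theta'(z^N)

grad_W = [None] * 3
grad_b = [None] * 3
for n in range(2, 0, -1):
    grad_b[n] = grad_z                       # grad_{b^n} L = grad_{z^n} L
    grad_W[n] = np.outer(grad_z, a[n - 1])   # grad_{W^n} L = (grad_{z^n} L) (a^{n-1})^T
    if n > 1:
        # grad_{z^{n-1}} L = (W^n)^T (grad_{z^n} L) ⊙ theta'(z^{n-1})
        grad_z = (W[n].T @ grad_z) * theta_prime(z[n - 1])
```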
@@ -146,7 +146,8 @@ the floor of a flat valley. Better yet, if your line search is smart enough to q
 functions, then you just need to ensure you can't go downhill forever — i.e. $$H$$ is positive
 semidefinite.
-Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ as before. For any $$1 \leq i \leq n$$,
+Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ via line minimizations as before. For any
+$$1 \leq i \leq n$$,
 $$
 \begin{aligned}
@@ -191,8 +192,8 @@ u_i^T H u_j = 0
 $$
 for all $$1 \leq j < i \leq n$$. Such vectors are called *conjugate vectors*, from which this
-algorithm derives its name. (Though the careful reader will notice that it's not the gradients that
-are conjugate &mdash; and the vectors that are conjugate aren't gradients.)
+algorithm derives its name. (Though perhaps it's applied sloppily, since it's not the gradients
+themselves that are conjugate.)
 #### Base Case
---
title: Problem Set 10 (Functions)
---
## 1
{:.question}
Find the first five diagonal Pad&eacute; approximants [1/1], ..., [5/5] to $$e^x$$ around the
origin. Remember that the numerator and denominator can be multiplied by a constant to make the
numbers as convenient as possible. Evaluate the approximations at $$x = 1$$ and compare with the
correct value of $$e = 2.718281828459045$$. How is the error improving with the order? How does that
compare to the polynomial error?
For a fixed function $$f$$ and integers $$N \geq 0$$ and $$M \geq 0$$, the Pad&eacute; approximant
is the function
$$
[N/M]_ f(x) = \frac{\sum_{n = 0}^N a_n x^n}{1 + \sum_{m = 1}^M b_m x^m}
$$
that matches as many terms as possible of the Taylor series of $$f$$. (We can define the approximant
about any point, but without loss of generality we'll only consider the origin.) There are $$N + M +
1$$ parameters in the formula above ($$a_0$$, $$\ldots$$, $$a_N$$, and $$b_1$$, $$\ldots$$,
$$b_M$$), so in general this means we can match the Taylor series of $$f$$ through order $$N + M$$. (Though by
coincidence, or otherwise, we may end up matching some higher order terms as well.)
We can write the resulting system of equations explicitly by expanding the Taylor series of the
approximant to order $$L = N + M$$ (since this gives us $$N + M + 1$$ terms). To make the equations
simpler to write I will introduce a constant $$b_0 = 1$$.
$$
\frac{\sum_{n = 0}^N a_n x^n}{\sum_{m = 0}^M b_m x^m} = \sum_{l = 0}^L c_l x^l
$$
Then we multiply by the denominator.
$$
\begin{aligned}
\sum_{n = 0}^N a_n x^n &= \left(\sum_{m = 0}^M b_m x^m \right) \left( \sum_{l = 0}^L c_l x^l \right) \\
&= \sum_{m = 0}^M \sum_{l = 0}^L b_m c_l x^{m + l}
\end{aligned}
$$
By equating the coefficients of the different powers of $$x$$, this gives us $$L + 1 = N + M + 1$$ equations.
$$
\begin{cases}
a_n = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } 0 \leq n \leq N \\
0 = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } N < n \leq L = N + M
\end{cases}
$$
We know what $$c_0$$, $$\ldots$$, $$c_L$$ are, since they must match the Taylor series of $$f$$. (In
particular, $$c_l = f^{(l)}(0) / l!$$.) So by solving this system of equations we can determine all of
the unknown parameters. In particular, the last $$M$$ equations (i.e. for $$N < n$$) can be solved
to determine $$b_1$$, $$\ldots$$, $$b_M$$. Then the first $$N + 1$$ equations immediately give
values for $$a_0$$, $$\ldots$$, $$a_N$$.
The first step can be written as a system of $$M + 1$$ equations.
$$
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
c_{N + 1} & c_{N} & c_{N - 1} & \cdots & c_{N - M + 2} & c_{N - M + 1} \\
c_{N + 2} & c_{N + 1} & c_{N} & \cdots & c_{N - M + 3} & c_{N - M + 2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
c_{N + M - 1} & c_{N + M - 2} & c_{N + M - 3} & \cdots & c_{N} & c_{N - 1} \\
c_{N + M} & c_{N + M - 1} & c_{N + M - 2} & \cdots & c_{N + 1} & c_{N} \\
\end{bmatrix}
\begin{bmatrix}
b_0 \\
b_1 \\
b_2 \\
\vdots \\
b_{M - 1} \\
b_M
\end{bmatrix}
=
\begin{bmatrix}
1 \\
0 \\
0 \\
\vdots \\
0 \\
0
\end{bmatrix}
$$
Note that if $$N + 1 < M$$, then an upper right triangular chunk of the matrix will be zero,
corresponding to those entries where the subscript of $$c$$ would be negative.
Finally, there's no guarantee that this matrix will be nonsingular. (For example, consider the case
where $$N = M$$ and all derivatives of $$f$$ up to $$N + M$$ are zero. Then all rows of the matrix
except for the first will be zero.) In such a situation, one can try to throw out degenerate
equations, and pull new ones from higher order terms (i.e. consider $$L > N + M$$). However it's
possible that all additional rows will be degenerate; in this case, the approximant is legitimately
underdetermined. This means it can match the function exactly, and you can reduce $$M$$ until your
system is nonsingular.
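As a sanity check before turning to the code, the $$[1/1]$$ approximant of $$e^x$$ is easy to work
by hand. Here $$c_0 = 1$$, $$c_1 = 1$$, and $$c_2 = 1/2$$, so the equations above read
$$
\begin{aligned}
a_0 &= b_0 c_0 = 1 \\
a_1 &= b_0 c_1 + b_1 c_0 = 1 + b_1 \\
0 &= b_0 c_2 + b_1 c_1 = \frac{1}{2} + b_1
\end{aligned}
$$
Thus $$b_1 = -1/2$$ and $$a_1 = 1/2$$, which gives $$(1 + x/2) / (1 - x/2)$$ and agrees with the
first approximant below.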
Ok, so now to actually answer the question. I wrote a SymPy
[script](https://gitlab.cba.mit.edu/erik/nmm_2020_site/-/tree/master/_code/pset_10/py/pade_approximants.py)
that uses the strategy we just derived to build approximants for me. For $$f(x) = e^x$$, I get the
following approximations:
$$
\begin{aligned}
[1/1]_ f(x) &= \frac{1 + x/2}{1 - x/2} \\
[2/2]_ f(x) &= \frac{1 + x/2 + x^2/12}{1 - x/2 + x^2/12} \\
[3/3]_ f(x) &= \frac{1 + x/2 + x^2/10 + x^3/120}{1 - x/2 + x^2/10 - x^3/120} \\
[4/4]_ f(x) &= \frac{1 + x/2 + 3x^2/28 + x^3/84 + x^4/1,680}{1 - x/2 + 3x^2/28 - x^3/84 + x^4/1,680} \\
[5/5]_ f(x) &= \frac{1 + x/2 + x^2/9 + x^3/72 + x^4/1,008 + x^5/30,240}{1 - x/2 + x^2/9 - x^3/72 + x^4/1,008 - x^5/30,240} \\
\end{aligned}
$$
These give the corresponding approximations for $$e$$:
$$
\begin{aligned}
[1/1]_ f(1) &= 3 &= 3.00000000000000 \\
[2/2]_ f(1) &= \frac{19}{7} &\approx 2.71428571428571 \\
[3/3]_ f(1) &= \frac{193}{71} &\approx 2.71830985915493 \\
[4/4]_ f(1) &= \frac{2,721}{1,001} &\approx 2.71828171828172 \\
[5/5]_ f(1) &= \frac{49,171}{18,089} &\approx 2.71828182873569
\end{aligned}
$$
Polynomial approximations, on the other hand (to equivalent orders), give
$$
\begin{aligned}
\sum_{n = 0}^2 \frac{x^n}{n!} &= \frac{5}{2} &= 2.50000000000000 \\
\sum_{n = 0}^4 \frac{x^n}{n!} &= \frac{65}{24} &\approx 2.70833333333333 \\
\sum_{n = 0}^6 \frac{x^n}{n!} &= \frac{1,957}{720} &\approx 2.71805555555556 \\
\sum_{n = 0}^8 \frac{x^n}{n!} &= \frac{109,601}{40,320} &\approx 2.71827876984127 \\
\sum_{n = 0}^{10} \frac{x^n}{n!} &= \frac{9,864,101}{3,628,800} &\approx 2.71828180114638
\end{aligned}
$$
Here are the different errors. Both fall off rapidly as the number of free parameters grows, but
the Padé error falls faster: with eleven free parameters it is roughly two orders of magnitude
smaller than the polynomial error.
![errors](../assets/img/10_errors.png)
## 3
{:.question}
Train a neural network on the output from an order 4 maximal LFSR and learn to reproduce it. How do
the results depend on the network depth and architecture?
I have previously implemented backpropagation from scratch, to train my own
[VAEs](https://gitlab.cba.mit.edu/erik/gears/-/tree/master/gears/custom_networks). So rather than
write this again, I wrote up a better [derivation of backpropagation](../notes/backpropagation.html)
which I can use as a reference for future modifications.
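For reference, here is a minimal sketch of generating the training data. The taps $$(4, 3)$$ (i.e.
the feedback polynomial $$x^4 + x^3 + 1$$), the all-ones seed, and the sequence length are my own
choices for illustration; any maximal order 4 configuration works.
```python
def lfsr_bits(taps=(4, 3), n_bits=4, length=45):
    # Fibonacci LFSR: output the last bit, then shift in the XOR of the tapped bits.
    state = [1] * n_bits  # any nonzero seed works
    out = []
    for _ in range(length):
        out.append(state[-1])
        feedback = state[taps[0] - 1] ^ state[taps[1] - 1]
        state = [feedback] + state[:-1]
    return out

bits = lfsr_bits()  # repeats with period 15, since the LFSR is maximal
```
A network could then be trained to predict each bit from the four bits that precede it.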