Erik Strand / nmm_2020_site / Commits

Commit 2740192e, authored May 04, 2020 by Erik Strand

Add tenth pset

Parents: 939034e2, 6363b29c

Showing 5 changed files with 361 additions and 3 deletions:

- _code/pset_10/py/pade_approximants.py (+81, −0)
- _notes/backpropagation.md (+124, −0)
- _notes/optimization.md (+4, −3)
- _psets/10.md (+152, −0)
- assets/img/10_errors.png (+0, −0)

_code/pset_10/py/pade_approximants.py (new file, mode 100644)
```python
import matplotlib.pyplot as plt
import numpy as np
import sympy as sp

x = sp.symbols("x", real=True)


def pade_approximant(function, N, M):
    # Compute relevant taylor series terms
    derivative = function
    taylor_coefficients = [function.subs(x, 0)]
    for i in range(1, N + M + 1):
        derivative = sp.diff(derivative, x)
        taylor_coefficients.append(derivative.subs(x, 0) / sp.factorial(i))

    # Build matrix
    one_one = [1] + [0] * M
    matrix_rows = [one_one]
    for i in range(1, M + 1):
        new_row = []
        for j in range(0, min(N + i, M) + 1):
            new_row.append(taylor_coefficients[N + i - j])
        for j in range(min(N + i, M) + 1, M + 1):
            new_row.append(0)
        matrix_rows.append(new_row)
    A = sp.Matrix(matrix_rows)
    b = sp.Matrix(one_one)
    sp.pprint(A)
    sp.pprint(b)

    # Solve
    answer = sp.linsolve((A, b))
    b_coeffs = list(answer.args[0])
    a_coeffs = [
        sum(b_coeffs[m] * taylor_coefficients[n - m] for m in range(0, min(n, M) + 1))
        for n in range(0, N + 1)
    ]
    print(a_coeffs)
    print(b_coeffs)
    print("")
    return lambda x_val: sum(a_coeffs[n] * x_val**n for n in range(0, N + 1)) \
        / sum(b_coeffs[m] * x_val**m for m in range(0, M + 1))


# print pade approximant values
function = sp.exp(x)
pade_approximations = [pade_approximant(function, i, i) for i in range(1, 6)]
for approx in pade_approximations:
    print(approx(1))
print("")
for approx in pade_approximations:
    print(approx(1.0))
print("")
pade_errors = [abs(approx(1.0) - function.subs(x, 1.0)) for approx in pade_approximations]

# Create polynomial approximations and print their values
poly_approximations = [
    #lambda x_val: sum(taylor_coefficients[n] * x**n for n in range(0, order)) for order in [3, 5, 7, 9, 11]
    sum(x**n / sp.factorial(n) for n in range(0, order))
    for order in [3, 5, 7, 9, 11]
]
for approx in poly_approximations:
    print(approx)
    print(approx.subs(x, 1))
    print(approx.subs(x, 1.0))
    print("")
poly_errors = [abs(approx.subs(x, 1.0) - function.subs(x, 1.0)) for approx in poly_approximations]

# Graph errors
fig1 = plt.figure()
left, bottom, width, height = 0.1, 0.1, 0.8, 0.8
ax1 = fig1.add_axes([left, bottom, width, height])
x_vals = [3, 5, 7, 9, 11]  # number of free parameters
ax1.set_yscale("log")
ax1.plot(x_vals, pade_errors, label="padé")
ax1.plot(x_vals, poly_errors, label="poly")
ax1.set_xlabel("free parameters")
ax1.set_ylabel("absolute error")
ax1.legend()
ax1.set_title("Absolute errors of Padé and polynomial approximations")
fig1.savefig("../../../assets/img/10_errors.png", transparent=True)
```
_notes/backpropagation.md (new file, mode 100644)
---
title: Backpropagation
---

Backpropagation is the ubiquitous method for performing gradient descent in artificial neural
networks. Ultimately it allows us to compute the derivative of a loss function with respect to every
free parameter in the network (i.e. every weight and bias). It does so layer by layer. At the end of
the day, it's just the chain rule, but there's a decent amount of bookkeeping, so I want to write it
out explicitly.

Suppose we have an artificial neural network with $$N$$ layers. Let the number of neurons in layer
$$n$$ be $$N_n$$. Let layer $$n$$ have weights $$W^n$$ (an $$N_n$$ by $$N_{n - 1}$$ matrix) and biases
$$b^n$$ (a vector with $$N_n$$ components). Call its output $$a^n$$. Define the vector

$$
z^n = W^n a^{n - 1} + b^n
$$

This is just the output of layer $$n$$ before the activation function $$\theta$$ is applied. So

$$
a^n = \theta(z^n)
$$

(where the function application is performed on each element independently).

We'll treat layer zero as the inputs. So $$a^0$$ is defined, but not $$z^0$$, $$W^0$$, or $$b^0$$.
Then $$a^1$$ through $$a^N$$ are the outputs of our layers. Finally, we have some loss function
$$L$$, which I assume is a function only of the last layer's outputs ($$a^N$$).

At the end of the day, to update the parameters of the network we need to know the partial
derivative of $$L$$ with respect to all entries of all $$W^n$$ and $$b^n$$. To get there, it's
helpful to consider as a stepping stone the partial derivatives of $$L$$ with respect to the entries
of a particular $$z^n$$. I'll write this as $$\nabla_{z^n} L$$ (a vector with $$N_n$$ elements).

In component form,

$$
z^n_i = \sum_j W^n_{i, j} a^{n - 1}_j + b^n_i
$$

So $$\partial z_i^n / \partial b_i^n = 1$$ and by the chain rule,

$$
\begin{aligned}
\frac{\partial L}{\partial b_i^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial b_i^n} \\
&= \frac{\partial L}{\partial z_i^n}
\end{aligned}
$$

Thus

$$
\nabla_{b^n} L = \nabla_{z^n} L
$$

Similarly,

$$
\begin{aligned}
\frac{\partial L}{\partial W_{i,j}^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial W_{i,j}^n} \\
&= \frac{\partial L}{\partial z_i^n} a_j^{n - 1}
\end{aligned}
$$

So

$$
\nabla_{W^n} L = (\nabla_{z^n} L) (a^{n - 1})^T
$$

So it's easy to go from $$\nabla_{z^n} L$$ to $$\nabla_{W^n} L$$ and $$\nabla_{b^n} L$$.

It's also easy to get from one $$\nabla_{z^n} L$$ to the next. In particular,

$$
\begin{aligned}
\frac{\partial L}{\partial a_j^{n - 1}} &= \sum_{i} \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial a_j^{n - 1}} \\
&= \sum_{i} \frac{\partial L}{\partial z_i^n} W_{i, j}^n
\end{aligned}
$$

So

$$
\nabla_{a^{n - 1}} L = (W^n)^T (\nabla_{z^n} L)
$$

Finally, since $$a_i^{n - 1} = \theta(z_i^{n - 1})$$,

$$
\begin{aligned}
\frac{\partial L}{\partial z_i^{n - 1}} &= \frac{\partial L}{\partial a_i^{n - 1}} \frac{\partial a_i^{n - 1}}{\partial z_i^{n - 1}} \\
&= \frac{\partial L}{\partial a_i^{n - 1}} \theta'(z_i^{n - 1})
\end{aligned}
$$

and so

$$
\nabla_{z^{n - 1}} L = (W^n)^T (\nabla_{z^n} L) \odot (\nabla_{z^{n - 1}} \theta)
$$

Here $$\nabla_{z^{n - 1}} \theta$$ means the derivative of $$\theta$$ evaluated at each $$z_i^{n - 1}$$
(so strictly speaking it's more of a Jacobian than a gradient), and $$\odot$$ indicates the
[Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) (i.e. elementwise
product).

So starting from the output of the network,

$$
\nabla_{z^N} L = \nabla_{a^N} L \odot \nabla_{z^N} \theta
$$

And from here we just apply the equation above repeatedly to compute $$\nabla_{z^{N - 1}} L$$,
$$\nabla_{z^{N - 2}} L$$, etc. At each step we can easily compute $$\nabla_{b^n} L$$ and
$$\nabla_{W^n} L$$ as well. When we get to the first layer, note that $$\nabla_{W^1} L$$ depends on
the inputs of the network $$a^0$$, rather than the outputs of some other layer.

To implement this efficiently, note that we don't need to store all the gradients we've computed so
far. We just need to keep the most recent one, and have some memory in which to calculate the next.
So if you allocate two arrays with as many elements as the largest layer has nodes, you can keep
reusing them for the whole computation. For standard gradient descent, updates to the weights and
biases can be done in place, so computing those gradients requires no additional storage.
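The recursion above translates almost line for line into NumPy. The following is a minimal
illustrative sketch rather than code from this repository: the layer sizes, sigmoid activation, and
squared-error loss are arbitrary choices made for the example, and the result is checked against a
finite difference.

```python
import numpy as np

rng = np.random.default_rng(0)

def theta(z):
    # Sigmoid activation, applied elementwise.
    return 1.0 / (1.0 + np.exp(-z))

def theta_prime(z):
    s = theta(z)
    return s * (1.0 - s)

# A two-layer network: N_0 = 3 inputs, N_1 = 4 hidden neurons, N_2 = 2 outputs.
sizes = [3, 4, 2]
W = [None] + [rng.normal(size=(sizes[n], sizes[n - 1])) for n in (1, 2)]
b = [None] + [rng.normal(size=sizes[n]) for n in (1, 2)]

def forward(a0):
    a, z = [a0], [None]
    for n in (1, 2):
        z.append(W[n] @ a[n - 1] + b[n])   # z^n = W^n a^{n-1} + b^n
        a.append(theta(z[n]))              # a^n = theta(z^n)
    return a, z

a0 = rng.normal(size=sizes[0])
target = rng.normal(size=sizes[-1])
a, z = forward(a0)
loss = 0.5 * np.sum((a[2] - target) ** 2)

# Backward pass: start from nabla_{z^N} L = nabla_{a^N} L ⊙ theta'(z^N).
grad_z = (a[2] - target) * theta_prime(z[2])
grad_W, grad_b = [None] * 3, [None] * 3
for n in (2, 1):
    grad_b[n] = grad_z                     # nabla_{b^n} L = nabla_{z^n} L
    grad_W[n] = np.outer(grad_z, a[n - 1]) # nabla_{W^n} L = (nabla_{z^n} L)(a^{n-1})^T
    if n > 1:
        grad_z = (W[n].T @ grad_z) * theta_prime(z[n - 1])

# Sanity check one weight gradient against a forward finite difference.
eps = 1e-6
W[1][0, 0] += eps
a_plus, _ = forward(a0)
loss_plus = 0.5 * np.sum((a_plus[2] - target) ** 2)
W[1][0, 0] -= eps
numeric = (loss_plus - loss) / eps
print(abs(numeric - grad_W[1][0, 0]))  # should be tiny
```

Note that the loop only ever holds one `grad_z` at a time, which is exactly the two-buffer reuse
described above.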
_notes/optimization.md

```diff
@@ -146,7 +146,8 @@ the floor of a flat valley. Better yet, if your line search is smart enough to quit on linear
 functions, then you just need to ensure you can't go downhill forever — i.e. $$H$$ is positive
 semidefinite.
-Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ as before. For any $$1 \leq i \leq n$$,
+Moving along, define $$x_i = x_{i - 1} + \alpha_i u_i$$ via line minimizations as before. For any
+$$1 \leq i \leq n$$,
 $$
 \begin{aligned}
@@ -191,8 +192,8 @@ u_i^T H u_j = 0
 $$
 for all $$1 \leq j < i \leq n$$. Such vectors are called *conjugate vectors*, from which this
-algorithm derives its name. (Though the careful reader will notice that it's not the gradients that
-are conjugate — and the vectors that are conjugate aren't gradients.)
+algorithm derives its name. (Though perhaps it's applied sloppily, since it's not the gradients
+themselves that are conjugate.)
 #### Base Case
```
_psets/10.md (new file, mode 100644)
---
title: Problem Set 10 (Functions)
---

## 1

{:.question}
Find the first five diagonal Padé approximants [1/1], ..., [5/5] to $$e^x$$ around the
origin. Remember that the numerator and denominator can be multiplied by a constant to make the
numbers as convenient as possible. Evaluate the approximations at $$x = 1$$ and compare with the
correct value of $$e = 2.718281828459045$$. How is the error improving with the order? How does that
compare to the polynomial error?

For a fixed function $$f$$ and integers $$N \geq 0$$ and $$M \geq 0$$, the Padé approximant
is the function

$$
[N/M]_f(x) = \frac{\sum_{n = 0}^N a_n x^n}{1 + \sum_{m = 1}^M b_m x^m}
$$

that matches as many terms as possible of the Taylor series of $$f$$. (We can define the approximant
about any point, but without loss of generality we'll only consider the origin.) There are $$N + M +
1$$ parameters in the formula above ($$a_0$$, $$\ldots$$, $$a_N$$, and $$b_1$$, $$\ldots$$,
$$b_M$$), so in general this means we can match $$f$$ up to order $$N + M$$. (Though by
coincidence, or otherwise, we may end up matching some higher order terms as well.)

We can write the resulting system of equations explicitly by expanding the Taylor series of the
approximant to order $$L = N + M$$ (since this gives us $$N + M + 1$$ terms). To make the equations
simpler to write I will introduce a constant $$b_0 = 1$$.

$$
\frac{\sum_{n = 0}^N a_n x^n}{\sum_{m = 0}^M b_m x^m} = \sum_{l = 0}^L c_l x^l
$$

Then we multiply by the denominator.

$$
\begin{aligned}
\sum_{n = 0}^N a_n x^n &= \left( \sum_{m = 0}^M b_m x^m \right) \left( \sum_{l = 0}^L c_l x^l \right) \\
&= \sum_{m = 0}^M \sum_{l = 0}^L b_m c_l x^{m + l}
\end{aligned}
$$

By setting the different powers of $$x$$ equal, this gives us $$L + 1 = N + M + 1$$ equations.

$$
\begin{cases}
a_n = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } 0 \leq n \leq N \\
0 = \sum_{m = 0}^{\min(n, M)} b_m c_{n - m} & \text{for } N < n \leq L = N + M
\end{cases}
$$

We know what $$c_0$$, $$\ldots$$, $$c_L$$ are, since they must match the Taylor series of $$f$$. (In
particular, $$c_l = f^{(l)}(0) / l!$$.) So by solving this system of equations we can determine all of
the unknown parameters. In particular, the last $$M$$ equations (i.e. for $$N < n$$) can be solved
to determine $$b_1$$, $$\ldots$$, $$b_M$$. Then the first $$N + 1$$ equations immediately give
values for $$a_0$$, $$\ldots$$, $$a_N$$.

The first step can be written as a system of $$M + 1$$ equations.

$$
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 \\
c_{N + 1} & c_{N} & c_{N - 1} & \cdots & c_{N - M + 2} & c_{N - M + 1} \\
c_{N + 2} & c_{N + 1} & c_{N} & \cdots & c_{N - M + 3} & c_{N - M + 2} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
c_{N + M - 1} & c_{N + M - 2} & c_{N + M - 3} & \cdots & c_{N} & c_{N - 1} \\
c_{N + M} & c_{N + M - 1} & c_{N + M - 2} & \cdots & c_{N + 1} & c_{N} \\
\end{bmatrix}
\begin{bmatrix}
b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{M - 1} \\ b_M
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0
\end{bmatrix}
$$

Note that if $$N + 1 < M$$, then some upper diagonal chunk of the matrix will be zero, corresponding
to those entries where the subscript of $$c$$ would be negative.

Finally, there's no guarantee that this matrix will be nonsingular. (For example, consider the case
where $$N = M$$ and all derivatives of $$f$$ up to order $$N + M$$ are zero. Then all rows of the
matrix except for the first will be zero.) In such a situation, one can try to throw out degenerate
equations and pull new ones from higher order terms (i.e. consider $$L > N + M$$). However it's
possible that all additional rows will be degenerate; in this case, the approximant is legitimately
underdetermined. This means it can match the function exactly, and you can reduce $$M$$ until your
system is nonsingular.
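As a quick cross-check on this derivation, SciPy ships a Padé routine,
`scipy.interpolate.pade`, that builds the same kind of approximant directly from Taylor
coefficients. This is a minimal sketch, independent of the hand-rolled solver; since the [2/2]
approximant of $$e^x$$ is $$(1 + x/2 + x^2/12)/(1 - x/2 + x^2/12)$$, its value at $$x = 1$$ should
be $$19/7$$.

```python
from scipy.interpolate import pade

# Taylor coefficients of e^x about 0: c_l = 1 / l!
an = [1.0, 1.0, 1.0 / 2.0, 1.0 / 6.0, 1.0 / 24.0]

# Denominator order M = 2; the numerator order defaults to len(an) - 1 - M = 2,
# giving the diagonal [2/2] approximant. p and q are numpy poly1d objects.
p, q = pade(an, 2)

value = p(1.0) / q(1.0)  # evaluate the rational approximation at x = 1
print(value)             # ≈ 19/7 ≈ 2.7142857...
```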
Ok, so now to actually answer the question. I wrote a SymPy
[script](https://gitlab.cba.mit.edu/erik/nmm_2020_site//tree/master/_code/pset_10/py/pade_approximants.py)
that uses the strategy we just derived to build approximants for me. For $$f(x) = e^x$$, I get the
following approximations:

$$
\begin{aligned}
[1/1]_f(x) &= \frac{1 + x/2}{1 - x/2} \\
[2/2]_f(x) &= \frac{1 + x/2 + x^2/12}{1 - x/2 + x^2/12} \\
[3/3]_f(x) &= \frac{1 + x/2 + x^2/10 + x^3/120}{1 - x/2 + x^2/10 - x^3/120} \\
[4/4]_f(x) &= \frac{1 + x/2 + 3x^2/28 + x^3/84 + x^4/1,680}{1 - x/2 + 3x^2/28 - x^3/84 + x^4/1,680} \\
[5/5]_f(x) &= \frac{1 + x/2 + x^2/9 + x^3/72 + x^4/1,008 + x^5/30,240}{1 - x/2 + x^2/9 - x^3/72 + x^4/1,008 - x^5/30,240}
\end{aligned}
$$

These give the corresponding approximations for $$e$$:

$$
\begin{aligned}
[1/1]_f(1) &= 3 &= 3.00000000000000 \\
[2/2]_f(1) &= \frac{19}{7} &\approx 2.71428571428571 \\
[3/3]_f(1) &= \frac{193}{71} &\approx 2.71830985915493 \\
[4/4]_f(1) &= \frac{2,721}{1,001} &\approx 2.71828171828172 \\
[5/5]_f(1) &= \frac{49,171}{18,089} &\approx 2.71828182873569
\end{aligned}
$$

Polynomial approximations, on the other hand (to equivalent orders), give

$$
\begin{aligned}
\sum_{n = 0}^2 \frac{x^n}{n!} &= \frac{5}{2} &= 2.50000000000000 \\
\sum_{n = 0}^4 \frac{x^n}{n!} &= \frac{65}{24} &\approx 2.70833333333333 \\
\sum_{n = 0}^6 \frac{x^n}{n!} &= \frac{1,957}{720} &\approx 2.71805555555556 \\
\sum_{n = 0}^8 \frac{x^n}{n!} &= \frac{109,601}{40,320} &\approx 2.71827876984127 \\
\sum_{n = 0}^{10} \frac{x^n}{n!} &= \frac{9,864,101}{3,628,800} &\approx 2.71828180114638
\end{aligned}
$$

Here are the different errors.

![errors](../assets/img/10_errors.png)

## 3

{:.question}
Train a neural network on the output from an order 4 maximal LFSR and learn to reproduce it. How do
the results depend on the network depth and architecture?

I have previously implemented backpropagation from scratch, to train my own
[VAEs](https://gitlab.cba.mit.edu/erik/gears//tree/master/gears/custom_networks). So rather than
write this again, I wrote up a better
[derivation of backpropagation](../notes/backpropagation.html), which I can use as a reference for
future modifications.
assets/img/10_errors.png (new file, mode 100644, 29.1 KB)