Commit a5f0e860, authored Apr 30, 2020 by Erik Strand: Add notes on backpropagation (`_notes/backpropagation.md`, +124 −0)
---
title: Backpropagation
---
Backpropagation is the ubiquitous method for computing the gradients needed to train artificial
neural networks by gradient descent. Ultimately it allows us to compute the derivative of a loss
function with respect to every free parameter in the network (i.e. every weight and bias). It does
so layer by layer. At the end of the day, it's just the chain rule, but there's a decent amount of
bookkeeping so I want to write it out explicitly.
Suppose we have an artificial neural network with $$N$$ layers. Let the number of neurons in layer
$$n$$ be $$N_n$$. Let layer $$n$$ have weights $$W^n$$ (an $$N_n$$ by $$N_{n - 1}$$ matrix) and biases
$$b^n$$ (a vector with $$N_n$$ components). Call its output $$a^n$$. Define the vector
$$
z^n = W^n a^{n-1} + b^n
$$
This is just the output of layer $$n$$ before the activation function $$\theta$$ is applied. So

$$
a^n = \theta(z^n)
$$
(where the function application is performed on each element independently).
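As a concrete illustration, the forward pass for one layer might look like this in NumPy. The layer sizes and the sigmoid activation are assumptions for the example, not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: N_{n-1} = 3 inputs, N_n = 2 neurons.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # weights W^n, an N_n by N_{n-1} matrix
b = rng.normal(size=2)        # biases b^n, a vector with N_n components
a_prev = rng.normal(size=3)   # previous layer's output a^{n-1}

z = W @ a_prev + b            # pre-activation z^n = W^n a^{n-1} + b^n
a = sigmoid(z)                # output a^n = theta(z^n), applied element-wise
```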
We'll treat layer zero as the inputs. So $$a^0$$ is defined, but not $$z^0$$, $$W^0$$, or $$b^0$$.
Then $$a^1$$ through $$a^N$$ are the outputs of our layers. Finally, we have some loss function
$$L$$, which I assume is a function only of the last layer's outputs ($$a^N$$).
At the end of the day, to update the parameters of the network we need to know the partial
derivative of $$L$$ with respect to all entries of all $$W^n$$ and $$b^n$$. To get there, it's
helpful to consider as a stepping stone the partial derivatives of $$L$$ with respect to the entries
of a particular $$z^n$$. I'll write this as $$\nabla_{z^n} L$$ (a vector with $$N_n$$ elements).
In component form,
$$
z^n_i = \sum_j W^n_{i, j} a^{n-1}_j + b^n_i
$$
So $$\partial z_i^n / \partial b_i^n = 1$$ and by the chain rule,
$$
\begin{aligned}
\frac{\partial L}{\partial b_i^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial b_i^n} \\
&= \frac{\partial L}{\partial z_i^n}
\end{aligned}
$$
Thus
$$
\nabla_{b^n} L = \nabla_{z^n} L
$$
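This identity is easy to check numerically with a central finite difference. The sigmoid activation and squared-error loss below are assumptions made just for the check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, a_prev, target):
    # One layer followed by a squared-error loss (assumed for the example).
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - target) ** 2)

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
a_prev = rng.normal(size=3)
target = rng.normal(size=2)

# Analytic gradient: grad_b L = grad_z L = (a - target) * theta'(z).
z = W @ a_prev + b
a = sigmoid(z)
grad_b = (a - target) * a * (1 - a)

# Finite-difference gradient for comparison.
eps = 1e-6
fd = np.zeros_like(b)
for i in range(b.size):
    bp, bm = b.copy(), b.copy()
    bp[i] += eps
    bm[i] -= eps
    fd[i] = (loss(W, bp, a_prev, target) - loss(W, bm, a_prev, target)) / (2 * eps)
```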
Similarly,
$$
\begin{aligned}
\frac{\partial L}{\partial W_{i,j}^n} &= \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial W_{i,j}^n} \\
&= \frac{\partial L}{\partial z_i^n} a_j^{n-1}
\end{aligned}
$$
So
$$
\nabla_{W^n} L = (\nabla_{z^n} L) (a^{n-1})^T
$$
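This rank-one outer product is a one-liner in NumPy. The two vectors here are arbitrary placeholders for $$\nabla_{z^n} L$$ and $$a^{n-1}$$:

```python
import numpy as np

rng = np.random.default_rng(2)
grad_z = rng.normal(size=2)     # stand-in for grad_{z^n} L (N_n elements)
a_prev = rng.normal(size=3)     # stand-in for a^{n-1} (N_{n-1} elements)

# grad_{W^n} L = (grad_{z^n} L) (a^{n-1})^T, an N_n by N_{n-1} matrix,
# matching the shape of W^n itself.
grad_W = np.outer(grad_z, a_prev)
```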
So it's easy to go from $$\nabla_{z^n} L$$ to $$\nabla_{W^n} L$$ and $$\nabla_{b^n} L$$.
It's also easy to get from one $$\nabla_{z^n} L$$ to the next. In particular,
$$
\begin{aligned}
\frac{\partial L}{\partial a_j^{n-1}} &= \sum_{i} \frac{\partial L}{\partial z_i^n} \frac{\partial z_i^n}{\partial a_j^{n-1}} \\
&= \sum_{i} \frac{\partial L}{\partial z_i^n} W_{i, j}^n
\end{aligned}
$$
So
$$
\nabla_{a^{n-1}} L = (W^n)^T (\nabla_{z^n} L)
$$
Finally, since $$a_i^{n-1} = \theta(z_i^{n-1})$$,
$$
\begin{aligned}
\frac{\partial L}{\partial z_i^{n-1}} &= \frac{\partial L}{\partial a_i^{n - 1}} \frac{\partial a_i^{n-1}}{\partial z_i^{n-1}} \\
&= \frac{\partial L}{\partial a_i^{n - 1}} \theta'(z_i^{n-1})
\end{aligned}
$$
and so
$$
\nabla_{z^{n-1}} L = (W^n)^T (\nabla_{z^n} L) \odot (\nabla_{z^{n-1}} \theta)
$$
Here $$\nabla_{z^{n-1}} \theta$$ means the derivative of $$\theta$$ evaluated at each $$z_i^{n-1}$$
(so strictly speaking it's more of a Jacobian than a gradient), and $$\odot$$ indicates the
[Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) (i.e. element-wise
product).
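One backward step of this recursion is two lines of NumPy. A sigmoid activation is assumed here, so $$\theta'(z) = \theta(z)(1 - \theta(z))$$; the vectors are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
W = rng.normal(size=(2, 3))      # W^n
grad_z = rng.normal(size=2)      # grad_{z^n} L
z_prev = rng.normal(size=3)      # z^{n-1}

# grad_{z^{n-1}} L = (W^n)^T (grad_{z^n} L) ⊙ theta'(z^{n-1})
s = sigmoid(z_prev)
grad_z_prev = (W.T @ grad_z) * (s * (1 - s))
```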
So starting from the output of the network,
$$
\nabla_{z^N} L = \nabla_{a^N} L \odot \nabla_{z^N} \theta
$$
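For a concrete starting point, suppose the loss is a squared error $$L = \frac{1}{2} \sum_i (a_i^N - t_i)^2$$ with targets $$t$$ (an assumption for the example), so that $$\nabla_{a^N} L = a^N - t$$. With a sigmoid activation this gives:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
z_N = rng.normal(size=2)         # pre-activation of the last layer
target = rng.normal(size=2)      # targets t (assumed squared-error loss)

a_N = sigmoid(z_N)
# grad_{a^N} L = a^N - t, and theta'(z) = theta(z)(1 - theta(z)) for sigmoid,
# so grad_{z^N} L = (a^N - t) ⊙ theta'(z^N).
grad_z_N = (a_N - target) * a_N * (1 - a_N)
```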
And from here we just apply the equation above repeatedly to compute $$\nabla_{z^{N - 1}} L$$,
$$\nabla_{z^{N - 2}} L$$, etc. At each step we can easily compute $$\nabla_{b^n} L$$ and
$$\nabla_{W^n} L$$ as well. When we get to the first layer, note that $$\nabla_{W^1} L$$ depends on
the inputs of the network $$a^0$$, rather than the outputs of another layer.
To implement this efficiently, note that we don't need to store all the gradients we've computed so
far. We just need to keep the most recent one, and have some memory in which to calculate the next.
So if you allocate two arrays with as many elements as the largest layer has nodes, then you can
keep reusing these for the whole computation. For standard gradient descent, updates to the weights
and biases can be done in-place, so computing those gradients requires no additional storage.
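Putting the pieces together, here is a minimal sketch of one full forward and backward pass for a small network. The layer sizes, sigmoid activation, and squared-error loss are all assumptions made for the example (for clarity it stores all the z-gradients in lists rather than reusing two buffers as described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(6)
sizes = [4, 5, 3, 2]  # N_0 (inputs) through N_3
Ws = [rng.normal(size=(sizes[n], sizes[n - 1])) for n in range(1, len(sizes))]
bs = [rng.normal(size=sizes[n]) for n in range(1, len(sizes))]

a0 = rng.normal(size=sizes[0])
target = rng.normal(size=sizes[-1])

# Forward pass: store every z^n and a^n (a[0] is the input layer).
a = [a0]
zs = []
for W, b in zip(Ws, bs):
    zs.append(W @ a[-1] + b)
    a.append(sigmoid(zs[-1]))

# Backward pass: start from grad_{z^N} L for a squared-error loss,
# then walk the recursion down to layer 1.
grad_z = (a[-1] - target) * a[-1] * (1 - a[-1])
grad_Ws, grad_bs = [], []
for n in reversed(range(len(Ws))):
    grad_bs.append(grad_z)                  # grad_{b^n} L = grad_{z^n} L
    grad_Ws.append(np.outer(grad_z, a[n]))  # grad_{W^n} L = (grad_z)(a^{n-1})^T
    if n > 0:                               # propagate to the previous layer
        s = sigmoid(zs[n - 1])
        grad_z = (Ws[n].T @ grad_z) * (s * (1 - s))
grad_Ws.reverse()
grad_bs.reverse()
```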