Back‑Propagation in a Tiny 2‑2‑1 Neural Network

A step‑by‑step walk‑through


(These notes are generated by an LLM based on my worked example and presentation in class.)

1. What we are trying to achieve

A feed‑forward neural network maps an input vector x to an output y by applying a sequence of linear transformations (weights + biases) followed by non‑linear activation functions. During training we are given target values $y^{*}$ and we want the network’s prediction y to be as close as possible to the target.

The standard way to achieve this is to (i) define a loss that measures the mismatch, (ii) compute how the loss changes when each weight changes (the gradient), and (iii) move every weight a little bit in the direction that reduces the loss (gradient descent).

The back‑propagation algorithm is the systematic way of computing those gradients for all weights efficiently, by re‑using intermediate quantities that were already computed in the forward pass.


2. The concrete network we will use

| Layer | Nodes | Notation (for a single training example) |
|---|---|---|
| Input | 2 | $x_{1},x_{2}$ |
| Hidden | 2 | pre‑activation $u_{1},u_{2}$; post‑activation (output of the hidden nodes) $z_{1},z_{2}$ |
| Output | 1 | pre‑activation $u_{3}$; network output $y$ |

2.1 Equations (scalar form)

\[\begin{aligned} \text{Hidden node }N_{1}&: & u_{1}&=w_{11}x_{1}+w_{12}x_{2}+b_{1}, & z_{1}&=A(u_{1})\\[4pt] \text{Hidden node }N_{2}&: & u_{2}&=w_{21}x_{1}+w_{22}x_{2}+b_{2}, & z_{2}&=A(u_{2})\\[4pt] \text{Output node }N_{3}&: & u_{3}&=w_{31}z_{1}+w_{32}z_{2}+b_{3}, & y &=A(u_{3}) \end{aligned}\]

All three nodes use the same activation function

\[A(t)=\frac{t}{10}\qquad\Longrightarrow\qquad A'(t)=\frac{1}{10}\]

The derivative is a constant because $A$ is just a linear scaling. This choice makes the algebra of back‑prop very transparent while preserving the essential ideas.

2.2 Loss

We use the classic quadratic (mean‑squared‑error) loss

\[L=\frac12\,(y-y^{*})^{2},\]

where $y^{*}$ is the desired target value for this training example.
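
For readers who prefer code, here is a minimal Python sketch of the network and its loss. The function names (`A`, `A_prime`, `forward`, `loss`) and the dictionary keys are choices made for this walk‑through, not anything standard.

```python
# Minimal sketch of the 2-2-1 network defined above.

def A(t):
    """Activation used by all three nodes: A(t) = t / 10."""
    return t / 10.0

def A_prime(t):
    """Its derivative, a constant 1/10 because A is linear."""
    return 0.1

def forward(x1, x2, w, b):
    """Forward pass; w and b are dicts keyed by the subscripts used in the text."""
    u1 = w["11"] * x1 + w["12"] * x2 + b["1"]
    z1 = A(u1)
    u2 = w["21"] * x1 + w["22"] * x2 + b["2"]
    z2 = A(u2)
    u3 = w["31"] * z1 + w["32"] * z2 + b["3"]
    y = A(u3)
    # Cache every intermediate value -- the backward pass will reuse them.
    return {"u1": u1, "z1": z1, "u2": u2, "z2": z2, "u3": u3, "y": y}

def loss(y, y_star):
    """Quadratic loss L = 1/2 (y - y*)^2."""
    return 0.5 * (y - y_star) ** 2
```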


3. Forward Pass – computing the network’s prediction

The forward pass evaluates the equations above from left‑to‑right, i.e. from inputs to output, while caching every intermediate quantity (the $u$’s and the $z$’s). Those cached numbers are later needed for the backward sweep.

3.1 Numbers we will plug in

| Parameter | Value |
|---|---|
| Weights | $w_{11}=1$, $w_{12}=2$, $w_{21}=2$, $w_{22}=1$, $w_{31}=1$, $w_{32}=1$ |
| Biases | $b_{1}=1$, $b_{2}=1$, $b_{3}=1$ |
| Input | $x_{1}=2$, $x_{2}=-3$ |
| Target | $y^{*}=2$ |
| Learning rate | $\lambda =10$ (used later for weight updates) |

3.2 Compute hidden activations

\[\begin{aligned} u_{1}&=1\cdot2+2\cdot(-3)+1 = 2-6+1 = -3, & z_{1}&=A(-3)=\frac{-3}{10}= -0.3,\\[6pt] u_{2}&=2\cdot2+1\cdot(-3)+1 = 4-3+1 = 2, & z_{2}&=A(2)=\frac{2}{10}= 0.2. \end{aligned}\]

3.3 Compute output

\[\begin{aligned} u_{3}&=1\cdot(-0.3) + 1\cdot 0.2 + 1 = -0.3+0.2+1 = 0.9,\\ y &=A(0.9)=\frac{0.9}{10}=0.09. \end{aligned}\]

3.4 Loss

\[L = \frac12 (0.09-2)^2 = \frac12 (-1.91)^2 \approx 1.82405.\]

We have now quantified how far the network is from the desired output.
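
The same computation as a runnable check. This sketch is standalone (it repeats the activation) and the inline comments show the hand‑computed values it should reproduce.

```python
# Forward pass with the numbers from Section 3.1.

def A(t):
    return t / 10.0

w = {"11": 1.0, "12": 2.0, "21": 2.0, "22": 1.0, "31": 1.0, "32": 1.0}
b = {"1": 1.0, "2": 1.0, "3": 1.0}
x1, x2, y_star = 2.0, -3.0, 2.0

u1 = w["11"] * x1 + w["12"] * x2 + b["1"]   # -3.0
z1 = A(u1)                                  # -0.3
u2 = w["21"] * x1 + w["22"] * x2 + b["2"]   #  2.0
z2 = A(u2)                                  #  0.2
u3 = w["31"] * z1 + w["32"] * z2 + b["3"]   #  0.9
y  = A(u3)                                  #  0.09
L  = 0.5 * (y - y_star) ** 2                #  1.82405

print(u1, z1, u2, z2, u3, y, L)
```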


4. The “Why” of Back‑Propagation

When the loss is a function of many parameters, $\displaystyle \frac{\partial L}{\partial w_{ij}}$ tells us how much the loss would change if we nudged that weight a tiny amount.

If we could compute all these derivatives, gradient descent would simply be

\[w_{ij}^{\text{new}} = w_{ij}^{\text{old}} - \lambda \;\frac{\partial L}{\partial w_{ij}} .\]

The key challenge is that the loss is nested inside many functions (the activations). Computing each partial derivative naïvely would duplicate a lot of work. Back‑prop avoids the duplication by applying the chain rule systematically along the computational graph.

4.1 Chain rule reminder (scalar case)

If a quantity $L$ depends on an intermediate variable $u$ that itself depends on a parameter $w$,

\[\frac{\partial L}{\partial w}= \frac{\partial L}{\partial u}\;\frac{\partial u}{\partial w}.\]

When a node has multiple children (e.g. a hidden node feeds into several later nodes) we must sum the contributions, because the total change in the loss is the sum of the changes coming from each child.

Visualization: imagine the forward graph as a river flowing from left (input) to right (output). Back‑prop sends a signal in the opposite direction: at every junction the signal splits and travels upstream, telling each node how responsible it is for the current loss.


5. Deriving the gradients step‑by‑step

Below we walk through the mathematics that the algorithm carries out automatically. At each step we write (i) what we need to know and (ii) how we get it.

5.1 Output‑layer “delta”

The delta of a node is defined as the derivative of the loss with respect to that node’s pre‑activation value (the $u$ before the activation).

For the output node $N_{3}$:

\[\begin{aligned} \delta_{3} &= \frac{\partial L}{\partial u_{3}} = \frac{\partial L}{\partial y}\;\frac{\partial y}{\partial u_{3}} \quad\text{(chain rule)}\\[4pt] &= (y-y^{*})\; A'(u_{3}) . \end{aligned}\]

Why does $\partial L/\partial y = y-y^{*}$? Because $L=\tfrac12(y-y^{*})^{2}$; differentiating with respect to $y$ gives $y-y^{*}$.

Why is $\partial y/\partial u_{3}=A'(u_{3})$? Because $y=A(u_{3})$ by definition of the activation.

Plugging in the numbers:

\[\delta_{3}= (0.09-2)\times\frac1{10}= -1.91\times0.1 = -0.191 .\]

5.1.1 From $\delta_{3}$ to weight‑gradients at the output

Recall the output node equation

\[u_{3}=w_{31}z_{1}+w_{32}z_{2}+b_{3}.\]

Each weight appears linearly in $u_{3}$, therefore

\[\frac{\partial u_{3}}{\partial w_{31}}=z_{1},\qquad \frac{\partial u_{3}}{\partial w_{32}}=z_{2},\qquad \frac{\partial u_{3}}{\partial b_{3}}=1 .\]

Now combine with $\delta_{3}$ (again using the chain rule):

\[\boxed{ \begin{aligned} \frac{\partial L}{\partial w_{31}} &= \delta_{3}\,z_{1},\\ \frac{\partial L}{\partial w_{32}} &= \delta_{3}\,z_{2},\\ \frac{\partial L}{\partial b_{3}} &= \delta_{3}. \end{aligned}}\]

Numeric values:

\[\begin{aligned} \frac{\partial L}{\partial w_{31}} &= -0.191 \times (-0.3)= +0.0573,\\ \frac{\partial L}{\partial w_{32}} &= -0.191 \times 0.2 = -0.0382,\\ \frac{\partial L}{\partial b_{3}} &= -0.191 . \end{aligned}\]
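
As a quick numerical check, the same quantities in a few lines of Python (a standalone sketch that simply plugs in the cached forward‑pass values):

```python
# Output-layer delta and gradients from Section 5.1.

y, y_star = 0.09, 2.0
z1, z2 = -0.3, 0.2
A_prime = 0.1                      # A'(t) = 1/10 for every t

delta3  = (y - y_star) * A_prime   # -0.191
dL_dw31 = delta3 * z1              # +0.0573
dL_dw32 = delta3 * z2              # -0.0382
dL_db3  = delta3                   # -0.191

print(delta3, dL_dw31, dL_dw32, dL_db3)
```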

5.2 Hidden‑layer deltas

A hidden node receives error information from all downstream nodes that depend on it. In our small network there is only one downstream node (the output), but the same formula generalises to many.

5.2.1 Hidden node $N_{1}$

We want

\[\delta_{1}= \frac{\partial L}{\partial u_{1}} .\]

Because $u_{1}$ influences the loss only through the output node (through $z_{1}=A(u_{1})$ and then $u_{3}$), the chain rule gives

\[\begin{aligned} \delta_{1} &= \frac{\partial L}{\partial u_{1}} = \frac{\partial L}{\partial u_{3}} \;\frac{\partial u_{3}}{\partial z_{1}}\;\frac{\partial z_{1}}{\partial u_{1}}\\[4pt] &= \underbrace{\delta_{3}}_{\displaystyle\frac{\partial L}{\partial u_{3}}} \underbrace{w_{31}}_{\displaystyle\frac{\partial u_{3}}{\partial z_{1}}} \underbrace{A'(u_{1})}_{\displaystyle\frac{\partial z_{1}}{\partial u_{1}}}. \end{aligned}\]

Since the activation derivative is constant $1/10$,

\[\boxed{\delta_{1}= \delta_{3}\,w_{31}\,A'(u_{1}) = \delta_{3}\,w_{31}\,\frac1{10}} .\]

Plugging numbers:

\[\delta_{1}= -0.191 \times 1 \times 0.1 = -0.0191 .\]

5.2.2 Hidden node $N_{2}$ (identical reasoning)

\[\boxed{\delta_{2}= \delta_{3}\,w_{32}\,A'(u_{2}) = \delta_{3}\,w_{32}\,\frac1{10}} .\] \[\delta_{2}= -0.191 \times 1 \times 0.1 = -0.0191 .\]

Interpretation: The hidden‑layer error is proportional to the downstream error ($\delta_{3}$), but scaled down by the weight that links the two layers and by the slope of the activation at the hidden node. If a weight is large, the hidden node carries more of the output error; if the activation is flat (small derivative) the hidden node’s contribution is dampened.
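
The same two deltas as a sketch in code:

```python
# Hidden-layer deltas from Section 5.2: each hidden delta is the output
# delta, scaled by the connecting weight and the local activation slope.

delta3 = -0.191
w31, w32 = 1.0, 1.0
A_prime = 0.1                      # constant slope of A(t) = t/10

delta1 = delta3 * w31 * A_prime    # -0.0191
delta2 = delta3 * w32 * A_prime    # -0.0191

print(delta1, delta2)
```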


5.3 Gradients for the hidden‑layer weights

Hidden node $N_{1}$ computes

\[u_{1}=w_{11}x_{1}+w_{12}x_{2}+b_{1}.\]

Thus

\[\frac{\partial u_{1}}{\partial w_{11}}=x_{1},\qquad \frac{\partial u_{1}}{\partial w_{12}}=x_{2},\qquad \frac{\partial u_{1}}{\partial b_{1}}=1 .\]

Combine with $\delta_{1}$ :

\[\boxed{ \begin{aligned} \frac{\partial L}{\partial w_{11}} &= \delta_{1}\,x_{1},\\ \frac{\partial L}{\partial w_{12}} &= \delta_{1}\,x_{2},\\ \frac{\partial L}{\partial b_{1}} &= \delta_{1}. \end{aligned}}\]

Numeric values:

\[\begin{aligned} \frac{\partial L}{\partial w_{11}} &= -0.0191 \times 2 = -0.0382,\\ \frac{\partial L}{\partial w_{12}} &= -0.0191 \times (-3)= +0.0573,\\ \frac{\partial L}{\partial b_{1}} &= -0.0191 . \end{aligned}\]

The same pattern holds for node $N_{2}$:

\[\boxed{ \begin{aligned} \frac{\partial L}{\partial w_{21}} &= \delta_{2}\,x_{1},\\ \frac{\partial L}{\partial w_{22}} &= \delta_{2}\,x_{2},\\ \frac{\partial L}{\partial b_{2}} &= \delta_{2}. \end{aligned}}\] \[\begin{aligned} \frac{\partial L}{\partial w_{21}} &= -0.0191 \times 2 = -0.0382,\\ \frac{\partial L}{\partial w_{22}} &= -0.0191 \times (-3)= +0.0573,\\ \frac{\partial L}{\partial b_{2}} &= -0.0191 . \end{aligned}\]
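
And the corresponding numerical check (a sketch; it reuses the deltas from Section 5.2):

```python
# Hidden-layer gradients from Section 5.3: gradient = delta * the node's input.

x1, x2 = 2.0, -3.0
delta1 = delta2 = -0.0191

dL_dw11, dL_dw12, dL_db1 = delta1 * x1, delta1 * x2, delta1   # -0.0382, +0.0573, -0.0191
dL_dw21, dL_dw22, dL_db2 = delta2 * x1, delta2 * x2, delta2   # -0.0382, +0.0573, -0.0191

print(dL_dw11, dL_dw12, dL_db1, dL_dw21, dL_dw22, dL_db2)
```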

6. Collecting the results – a “back‑propagation checklist”

| Layer | Symbol | Gradient ($\partial L/\partial\text{parameter}$) | Numerical value |
|---|---|---|---|
| Output | $w_{31}$ | $\delta_{3}\,z_{1}$ | $+0.0573$ |
|  | $w_{32}$ | $\delta_{3}\,z_{2}$ | $-0.0382$ |
|  | $b_{3}$ | $\delta_{3}$ | $-0.191$ |
| Hidden‑1 | $w_{11}$ | $\delta_{1}\,x_{1}$ | $-0.0382$ |
|  | $w_{12}$ | $\delta_{1}\,x_{2}$ | $+0.0573$ |
|  | $b_{1}$ | $\delta_{1}$ | $-0.0191$ |
| Hidden‑2 | $w_{21}$ | $\delta_{2}\,x_{1}$ | $-0.0382$ |
|  | $w_{22}$ | $\delta_{2}\,x_{2}$ | $+0.0573$ |
|  | $b_{2}$ | $\delta_{2}$ | $-0.0191$ |

Notice the recurring pattern:

  • Delta = upstream‑delta × weight × activation‑derivative.
  • Weight gradient = delta × the neuron’s own input.
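
Taken together, those two rules are the entire backward pass. Here is one possible sketch for this network; the function name `backward`, the `cache` dictionary and its keys are my own choices, mirroring the `forward` sketch from Section 2.

```python
# Full backward pass for the 2-2-1 network, written as the two rules above:
#   delta           = upstream delta * connecting weight * activation slope
#   weight gradient = delta * the neuron's own input (1 for a bias)
# A'(t) = 1/10 everywhere here; for a general activation you would evaluate
# A' at the cached pre-activations u1, u2, u3 instead of using a constant.

def backward(x1, x2, y_star, w, cache, a_prime=0.1):
    z1, z2, y = cache["z1"], cache["z2"], cache["y"]

    delta3 = (y - y_star) * a_prime       # output-layer delta
    delta1 = delta3 * w["31"] * a_prime   # hidden-layer deltas
    delta2 = delta3 * w["32"] * a_prime

    return {
        "w31": delta3 * z1, "w32": delta3 * z2, "b3": delta3,
        "w11": delta1 * x1, "w12": delta1 * x2, "b1": delta1,
        "w21": delta2 * x1, "w22": delta2 * x2, "b2": delta2,
    }
```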

7. Updating the parameters (gradient descent)

The parameter‑update rule for each scalar weight/bias is

\[\theta^{\text{new}} = \theta^{\text{old}} - \lambda\;\frac{\partial L}{\partial \theta},\]

where $\lambda$ is the learning rate (step size). Substituting the numbers:

7.1 Output layer

\[\begin{aligned} w_{31}^{\text{new}} &= 1 - 10 \times (+0.0573) = 0.427,\\ w_{32}^{\text{new}} &= 1 - 10 \times (-0.0382) = 1 + 0.382 = 1.382,\\ b_{3}^{\text{new}} &= 1 - 10 \times (-0.191) = 1 + 1.91 = 2.91. \end{aligned}\]

7.2 Hidden layer – node 1

\[\begin{aligned} w_{11}^{\text{new}} &= 1 - 10 \times (-0.0382) = 1 + 0.382 = 1.382,\\ w_{12}^{\text{new}} &= 2 - 10 \times (+0.0573) = 2 - 0.573 = 1.427,\\ b_{1}^{\text{new}} &= 1 - 10 \times (-0.0191) = 1 + 0.191 = 1.191. \end{aligned}\]

7.3 Hidden layer – node 2

\[\begin{aligned} w_{21}^{\text{new}} &= 2 - 10 \times (-0.0382) = 2 + 0.382 = 2.382,\\ w_{22}^{\text{new}} &= 1 - 10 \times (+0.0573) = 1 - 0.573 = 0.427,\\ b_{2}^{\text{new}} &= 1 - 10 \times (-0.0191) = 1 + 0.191 = 1.191. \end{aligned}\]

All parameters have been nudged in the direction that lowers the loss (the negative gradient).
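
Written as code, the update is one line per parameter (a sketch; the dictionary keys are mine and the gradient values are the ones tabulated in Section 6):

```python
# Gradient-descent update with learning rate lambda = 10.

lam = 10.0
params = {"w11": 1.0, "w12": 2.0, "b1": 1.0,
          "w21": 2.0, "w22": 1.0, "b2": 1.0,
          "w31": 1.0, "w32": 1.0, "b3": 1.0}
grads  = {"w11": -0.0382, "w12": +0.0573, "b1": -0.0191,
          "w21": -0.0382, "w22": +0.0573, "b2": -0.0191,
          "w31": +0.0573, "w32": -0.0382, "b3": -0.191}

new_params = {k: params[k] - lam * grads[k] for k in params}
print(new_params)  # w31 -> 0.427, w32 -> 1.382, b3 -> 2.91, w11 -> 1.382, ...
```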

Why does a large learning rate $\lambda=10$ work here? Because the gradients are tiny (order $10^{-2}$). Multiplying by 10 yields updates of size $\approx 0.4$, which is reasonable for the scale of the weights. In networks where gradients can be much larger, a smaller $\lambda$ would be required to avoid “overshooting” the minimum.


8. Verify that learning actually happened

8.1 Forward pass with the updated weights

Hidden node 1

\[\begin{aligned} u_{1}^{\text{new}} &= 1.382\cdot2 + 1.427\cdot(-3) + 1.191 \\ &= 2.764 - 4.281 + 1.191 = -0.326,\\ z_{1}^{\text{new}} &= \frac{-0.326}{10}= -0.0326 . \end{aligned}\]

Hidden node 2

\[\begin{aligned} u_{2}^{\text{new}} &= 2.382\cdot2 + 0.427\cdot(-3) + 1.191\\ &= 4.764 - 1.281 + 1.191 = 4.674,\\ z_{2}^{\text{new}} &= \frac{4.674}{10}= 0.4674 . \end{aligned}\]

Output node 3

\[\begin{aligned} u_{3}^{\text{new}} &= 0.427\cdot(-0.0326) + 1.382\cdot 0.4674 + 2.91\\ &= -0.0139 + 0.6459 + 2.91 = 3.542,\\ y^{\text{new}} &= \frac{3.542}{10}=0.3542 . \end{aligned}\]

Loss after the update

\[L^{\text{new}} = \frac12\,(0.3542-2)^2 = \frac12(-1.6458)^2 \approx 1.3543 .\]

8.2 What changed?

| Quantity | Before | After |
|---|---|---|
| Network output $y$ | 0.09 | 0.3542 |
| Loss $L$ | 1.82405 | 1.3543 |
| Reduction in loss |  | 0.4698 |

The loss fell by roughly 26 % in a single step. The network’s prediction moved noticeably closer to the target value (though it is still far away because the activation scales the signal down by a factor of 10). Repeating the forward‑backward‑update cycle many times will drive the loss toward zero (or at least a local minimum).
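
To see that claim in action, here is a sketch of the complete forward‑backward‑update loop on this single example. All names are illustrative; the first iteration reproduces the numbers from Sections 3 to 8, and printing the loss at each step lets you watch it fall.

```python
# Tiny training loop: repeat forward pass, backward pass, gradient-descent update.

def A(t):
    return t / 10.0

A_PRIME = 0.1                      # constant derivative of A

x1, x2, y_star, lam = 2.0, -3.0, 2.0, 10.0
p = {"w11": 1.0, "w12": 2.0, "b1": 1.0,
     "w21": 2.0, "w22": 1.0, "b2": 1.0,
     "w31": 1.0, "w32": 1.0, "b3": 1.0}

for step in range(5):
    # forward pass (cache z1, z2, y)
    u1 = p["w11"] * x1 + p["w12"] * x2 + p["b1"]; z1 = A(u1)
    u2 = p["w21"] * x1 + p["w22"] * x2 + p["b2"]; z2 = A(u2)
    u3 = p["w31"] * z1 + p["w32"] * z2 + p["b3"]; y = A(u3)
    L = 0.5 * (y - y_star) ** 2
    print(f"step {step}: y = {y:.4f}, loss = {L:.5f}")

    # backward pass (deltas, then gradients)
    d3 = (y - y_star) * A_PRIME
    d1 = d3 * p["w31"] * A_PRIME
    d2 = d3 * p["w32"] * A_PRIME
    grads = {"w31": d3 * z1, "w32": d3 * z2, "b3": d3,
             "w11": d1 * x1, "w12": d1 * x2, "b1": d1,
             "w21": d2 * x1, "w22": d2 * x2, "b2": d2}

    # gradient-descent update
    for k in p:
        p[k] -= lam * grads[k]
```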


9. Deeper insight – why the algorithm works

  1. Local error signals (deltas). Each node compresses all downstream error information into a single scalar $\delta$. This is why we can propagate “backwards” without having to keep separate copies of the loss for each weight.

  2. Linearity of the weight‑gradient relationship. Because each weight appears linearly in its node’s pre‑activation, the gradient w.r.t. that weight is simply the product of the node’s delta and the node’s own input. This is the core computational cheapness of back‑prop.

  3. Chain rule as a “book‑keeping” device. The chain rule tells us how to “multiply” the local sensitivity (activation derivative) with the upstream sensitivity (delta) and the local connection strength (weight). In a deep network the same multiplication is repeated layer after layer, creating a product of many derivatives – the vanishing/exploding gradient phenomenon that appears in deeper architectures.

  4. Vector‑matrix formulation (optional). In practice we write the whole forward pass as

    \[\begin{aligned} \mathbf{z}^{(1)} &= A\big(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\big)\\ \mathbf{y} &= A\big(\mathbf{W}^{(2)}\mathbf{z}^{(1)} + \mathbf{b}^{(2)}\big), \end{aligned}\]

    and the backward pass as

    \[\begin{aligned} \boldsymbol{\delta}^{(2)} &= ( \mathbf{y} - \mathbf{y}^{*})\odot A'(\mathbf{u}^{(2)}) ,\\ \boldsymbol{\delta}^{(1)} &= \big(\mathbf{W}^{(2)^\top}\boldsymbol{\delta}^{(2)}\big)\odot A'(\mathbf{u}^{(1)}), \end{aligned}\]

    where $\odot$ denotes element‑wise multiplication. The scalar example we worked through is just the 1‑dimensional version of these matrix equations.
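
Here is one way to write those matrix equations with NumPy (a sketch; `W1`, `b1`, `W2`, `b2`, `u1`, `u2` stand for $\mathbf{W}^{(1)},\mathbf{b}^{(1)},\mathbf{W}^{(2)},\mathbf{b}^{(2)},\mathbf{u}^{(1)},\mathbf{u}^{(2)}$). The inline comments show that it reproduces the scalar numbers from the worked example.

```python
import numpy as np

def A(t):                       # activation, applied element-wise
    return t / 10.0

def A_prime(t):                 # its derivative (constant for this linear A)
    return np.full_like(t, 0.1)

x      = np.array([2.0, -3.0])
y_star = np.array([2.0])

W1 = np.array([[1.0, 2.0],      # row i holds the weights of hidden node N_i
               [2.0, 1.0]])
b1 = np.array([1.0, 1.0])
W2 = np.array([[1.0, 1.0]])     # output node N_3
b2 = np.array([1.0])

# forward pass
u1 = W1 @ x + b1                # hidden pre-activations: [-3.,  2.]
z1 = A(u1)                      # [-0.3,  0.2]
u2 = W2 @ z1 + b2               # output pre-activation: [0.9]
y  = A(u2)                      # [0.09]

# backward pass
delta2 = (y - y_star) * A_prime(u2)     # [-0.191]
delta1 = (W2.T @ delta2) * A_prime(u1)  # [-0.0191, -0.0191]

# gradients: outer product of each delta with the layer's input
grad_W2, grad_b2 = np.outer(delta2, z1), delta2  # [[ 0.0573, -0.0382]]
grad_W1, grad_b1 = np.outer(delta1, x),  delta1  # [[-0.0382, 0.0573], [-0.0382, 0.0573]]
```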


10. Practical considerations & extensions

| Issue | Why it matters | Typical remedy |
|---|---|---|
| Learning‑rate selection | Too large → divergent updates; too small → painfully slow learning. | Use a schedule (decrease over epochs) or adaptive methods (AdaGrad, Adam). |
| Activation choice | Linear activations keep gradients constant but severely limit expressive power. | Replace $A(t)=\frac{t}{10}$ with non‑linear functions (sigmoid, tanh, ReLU). |
| Scaling of inputs/outputs | Our output is divided by 10, which forces the network to produce tiny numbers; the target (2) is far larger, making learning harder. | Pre‑scale targets to the same range as the activation (e.g., train on 0–1 outputs). |
| Batch vs. single example | Updating after each example (stochastic gradient descent) injects noise that can help escape shallow minima. | Use mini‑batches (e.g., size 32) as a compromise between noise and computational efficiency. |
| Bias handling | Biases act like weights connected to a constant input of 1. | Treat them exactly like other weights; our derivation already does this. |
| Gradient checking | A good sanity check for implementation bugs. | Numerically approximate $\frac{\partial L}{\partial \theta}$ with a tiny perturbation and compare to the back‑prop result (see the sketch below). |
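
The gradient check in the last row can be made concrete in a few lines (a sketch; `net_loss` is a helper written just for this check, and $w_{11}$ is the parameter being tested):

```python
# Finite-difference gradient check for w11.  Back-prop gave dL/dw11 = -0.0382
# (Section 6); the central difference below should agree to high precision.

def A(t):
    return t / 10.0

def net_loss(w11, w12=2.0, w21=2.0, w22=1.0, w31=1.0, w32=1.0,
             b1=1.0, b2=1.0, b3=1.0, x1=2.0, x2=-3.0, y_star=2.0):
    z1 = A(w11 * x1 + w12 * x2 + b1)
    z2 = A(w21 * x1 + w22 * x2 + b2)
    y = A(w31 * z1 + w32 * z2 + b3)
    return 0.5 * (y - y_star) ** 2

eps = 1e-5
numeric = (net_loss(1.0 + eps) - net_loss(1.0 - eps)) / (2 * eps)
print(numeric)   # approximately -0.0382, matching the back-prop gradient
```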

11. Summary – the big picture

  1. Forward pass computes activations and caches every intermediate value.
  2. Output delta $\delta_{\text{out}} = (y-y^{*})A'(u_{\text{out}})$ captures how the loss changes with the output pre‑activation.
  3. Back‑propagate the delta through each layer: \(\delta_{\ell}= \big(\mathbf{W}^{(\ell+1)\!\top}\,\boldsymbol{\delta}^{(\ell+1)}\big)\odot A'(\mathbf{u}^{(\ell)}).\) In a scalar network this reduces to $\delta_{\ell}= \delta_{\ell+1}\,w_{\ell+1}\,A'(u_{\ell})$.
  4. Gradients for all weights and biases are simply the product of the node’s delta and its input (or 1 for a bias).
  5. Gradient‑descent update nudges each parameter opposite its gradient, scaled by a learning rate.

The example walked through every algebraic step, showing exactly how the chain rule turns the loss into a set of simple arithmetic operations that can be carried out efficiently on any hardware. Understanding each miniature step demystifies back‑propagation and equips you to reason about more complex networks, vectorised implementations, and the many tricks (momentum, regularisation, batch normalisation) that build on this core algorithm.


Further reading

  • “Neural Networks and Deep Learning” – Michael Nielsen (free online).
  • Goodfellow, Bengio & Courville, “Deep Learning”, Chapter 6 (back‑propagation).
  • Stanford CS231n Lecture Notes, Section 4 (from the perspective of computational graphs).

Happy back‑propagating!