Building a Neural Network From Scratch - Just Numpy
A two-layer network trained on MNIST digit recognition — forward propagation, backpropagation, and gradient descent, all from first principles.
The Problem
In this article, I build a simple two-layer neural network from scratch and train it on the MNIST digit recognition dataset. Rather than relying on high-level frameworks, the focus is on understanding what actually happens inside a neural network — from forward propagation to backpropagation and parameter updates.
The problem is simple digit classification using the famous MNIST dataset: given a handwritten digit image, predict which number (0–9) is written.
The Math — Representing Input Data
Each MNIST image has a resolution of 28×28 pixels, which means every image can be flattened into a vector of 784 pixel values. If we have m training images, we stack these vectors to form a matrix X.
Initially this gives shape (m × 784) — each row is one image. We transpose it to (784 × m) so each column is one image, enabling efficient vectorized matrix multiplication over all samples at once.
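A minimal sketch of this reshaping step, using random values as a stand-in for the real pixel data (loading the actual MNIST files is omitted here):

```python
import numpy as np

m = 1000                             # example number of training images
images = np.random.rand(m, 28, 28)   # stand-in for real MNIST pixel values

X = images.reshape(m, 784).T         # flatten each image, then transpose to (784, m)
print(X.shape)                       # (784, 1000): one column per image
```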
Network Architecture
With the input properly represented, we define the structure of the network. It has three layers (a sketch of how their weights and biases can be initialised follows the list):
1. Input layer — 784 units. One for each pixel in the flattened 28×28 image. Values feed directly into the network as the input vector per sample.
2. Hidden layer — 10 neurons. Each neuron is fully connected to all 784 input pixels, learning a weighted combination of the entire image. After computing a weighted sum plus bias, we apply ReLU — a non-linear activation that lets the network learn complex patterns. The hidden layer acts as a feature extractor, responding to meaningful patterns like strokes, edges, or curves.
3. Output layer — 10 neurons. One per digit class (0–9). Each neuron produces a raw score that is passed through softmax, converting the scores into probabilities that sum to 1. The highest probability wins.
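The training loop at the end of the article calls an init_params() helper whose definition is not shown in the snippets below, so here is one plausible sketch matching these layer sizes; the exact scheme (uniform random values in [−0.5, 0.5]) is an assumption:

```python
def init_params():
    # Assumed initialisation: small random values centred on zero.
    W1 = np.random.rand(10, 784) - 0.5   # hidden-layer weights: 10 neurons x 784 pixels
    b1 = np.random.rand(10, 1) - 0.5     # hidden-layer biases
    W2 = np.random.rand(10, 10) - 0.5    # output-layer weights: 10 classes x 10 hidden units
    b2 = np.random.rand(10, 1) - 0.5     # output-layer biases
    return W1, b1, W2, b2
```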
Activation Functions
Two activation functions do the heavy lifting: ReLU in the hidden layer and softmax in the output layer.
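Both follow their standard definitions, consistent with the implementation further below:

ReLU(z) = max(0, z), applied element-wise.

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ), applied per column, so each sample's 10 raw scores become a probability distribution.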
Forward Propagation
Forward propagation is the process by which the network takes an input image and produces a prediction. It's four sequential computations:
1. Z¹ = W¹X + b¹ — Each hidden neuron looks at all 784 pixels, multiplies them by learned weights, and adds a bias. This lets the network measure how strongly certain pixel patterns appear in the image.
2. A¹ = ReLU(Z¹) — ReLU keeps positive values and sets negative values to zero. This introduces non-linearity, so the network can learn complex patterns rather than behaving like a simple linear model.
3. Z² = W²A¹ + b² — The features learned by the hidden layer are combined to produce a raw score for each digit class.
4. A² = softmax(Z²) — Softmax converts raw scores into probabilities that sum to one. The digit with the highest probability is the final prediction.
```python
def ReLU(Z):
    return np.maximum(Z, 0)                              # element-wise max(0, z)

def softmax(Z):
    expZ = np.exp(Z)
    return expZ / np.sum(expZ, axis=0, keepdims=True)    # column-wise probabilities

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1        # (10, m) hidden-layer pre-activations
    A1 = ReLU(Z1)              # (10, m) hidden activations
    Z2 = W2.dot(A1) + b2       # (10, m) raw class scores
    A2 = softmax(Z2)           # (10, m) class probabilities per sample
    return Z1, A1, Z2, A2
```
Backward Propagation
Backward propagation is how the network learns from its mistakes. After forward propagation produces a prediction, we compare it with the true label and propagate the error backward to determine how each parameter should be adjusted.
We compute six gradients — two weight gradients, two bias gradients, and two pre-activation error terms:
1. dZ² = A² − Y — The error in the output layer. Since we use softmax with cross-entropy loss, the gradient simplifies neatly to the difference between predicted probabilities and true labels. This tells us how much each output neuron over- or under-estimated its prediction.
2. dW² = (1/m) dZ² A¹ᵀ — Measures how much each hidden neuron contributed to the output error, averaged over all training examples.
3. db² = (1/m) Σ dZ² — Biases affect all inputs equally, so their gradients are simply the average error across the batch.
4. dZ¹ = W²ᵀ dZ² ∗ g′(Z¹) — The transpose of W² redistributes the output error back to the hidden neurons. Multiplying element-wise by the ReLU derivative g′(Z¹) means only neurons that were active during the forward pass receive a gradient; in effect, the activation step is undone in reverse, and gradient flows only through the neurons that fired.
5. dW¹ = (1/m) dZ¹ Xᵀ — Measures how much each input pixel contributed to the hidden-layer error, letting the network learn which pixel patterns caused incorrect predictions.
6. db¹ = (1/m) Σ dZ¹ — Average error per hidden neuron, used to update the hidden layer's biases.
```python
def ReLU_deriv(Z):
    return Z > 0                              # 1 where Z > 0, else 0

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    return one_hot_Y.T                        # shape: (10, m)

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size                                # number of training examples
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y                                   # (10, m)
    dW2 = 1 / m * dZ2.dot(A1.T)                            # (10, 10)
    db2 = 1 / m * np.sum(dZ2, axis=1, keepdims=True)       # (10, 1): per-neuron bias gradient
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)                   # (10, m)
    dW1 = 1 / m * dZ1.dot(X.T)                             # (10, 784)
    db1 = 1 / m * np.sum(dZ1, axis=1, keepdims=True)       # (10, 1)
    return dW1, db1, dW2, db2
```
Parameter Updates
After computing gradients, the final step is gradient descent — adjusting each parameter a small step in the direction that reduces loss.
1. W² = W² − α dW² — dW² tells us how changing each weight affects the loss. The learning rate α controls the step size. Subtracting the gradient moves weights toward values that reduce prediction error.
2. b² = b² − α db² — Biases shift the output independently of the input. Updating them lets the network correct systematic overconfidence or underconfidence in certain digit classes.
3. W¹ = W¹ − α dW¹ — Updates the connections between input pixels and hidden neurons, strengthening those that helped correct predictions and weakening those that contributed to errors.
4. b¹ = b¹ − α db¹ — Adjusts how easily each hidden neuron activates. Together, these four updates shift the network's behaviour after every training step.
```python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2
```
The Training Loop
The full training loop ties everything together. For each iteration: forward propagation → backpropagation → parameter update. Accuracy is printed every 10 steps.
```python
def get_predictions(A2):
    return np.argmax(A2, 0)    # digit with highest probability per sample

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()             # random initialisation
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print(f"Iteration: {i}")
            predictions = get_predictions(A2)
            print(f"Accuracy: {get_accuracy(predictions, Y):.4f}")
    return W1, b1, W2, b2
```
Results
With an accuracy of around 85%, this model can definitely be improved — by introducing more layers, tuning activation functions, or adjusting hyperparameters. The goal here is a solid fundamental understanding, not peak performance.
Final accuracy: ~85% on MNIST
Architecture: 784 → 10 (ReLU) → 10 (Softmax) · Optimizer: SGD · Framework: NumPy only
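As a quick sanity check after training, the same forward pass can score a held-out split. The dev_X and dev_Y arrays below are assumed to be prepared the same way as the training data, and the hyperparameters are example values rather than the article's exact settings:

```python
# alpha and iterations are example values, not the article's exact settings
W1, b1, W2, b2 = gradient_descent(X, Y, alpha=0.1, iterations=500)

# dev_X: (784, m_dev) and dev_Y: (m_dev,) are a hypothetical held-out split
_, _, _, A2_dev = forward_prop(W1, b1, W2, b2, dev_X)
dev_predictions = get_predictions(A2_dev)
print(f"Dev accuracy: {get_accuracy(dev_predictions, dev_Y):.4f}")
```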
Full code on Kaggle. Credit to Samson Zhang for his video on the topic.