Building a Neural Network From Scratch — Just NumPy

Deep Learning · From Scratch

A two-layer network trained on MNIST digit recognition — forward propagation, backpropagation, and gradient descent, all from first principles.

By Sai Saketh  ·  Published in Towards AI  ·  December 2025

The Problem

In this article, I build a simple two-layer neural network from scratch and train it on the MNIST digit recognition dataset. Rather than relying on high-level frameworks, the focus is on understanding what actually happens inside a neural network — from forward propagation to backpropagation and parameter updates.

The problem is simple digit classification using the famous MNIST dataset: given a handwritten digit image, predict which number (0–9) is written.

Why from scratch? Libraries like PyTorch and TensorFlow abstract away the mechanics. Building it in pure NumPy forces you to confront exactly what a neural network computes at every step — there's nowhere to hide.

The Math — Representing Input Data

Each MNIST image has a resolution of 28×28 pixels, which means every image can be flattened into a vector of 784 pixel values. If we have m training images, we stack these vectors to form a matrix X.

Initially this gives shape (m × 784) — each row is one image. We transpose it to (784 × m) so each column is one image, enabling efficient vectorized matrix multiplication over all samples at once.

Fig 1 — Input Matrix: Each Column Represents One Sample. A 28 × 28 image flattens to a 784 × 1 vector; stacking m of them naively gives an (m × 784) matrix with rows as images, and transposing gives X with shape (784 × m) so each column is one image. Why transpose? W · X then processes all m samples in a single matrix multiplication, with no Python loop needed.
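
To make the shape handling concrete, here is a minimal sketch of building X. The `images` array is a random placeholder standing in for the real MNIST pixels, not the article's data-loading code.

Python
import numpy as np

# Illustrative only: `images` stands in for m raw MNIST images of shape (m, 28, 28)
m = 1000
images = np.random.rand(m, 28, 28)   # placeholder pixel values already in [0, 1]

X = images.reshape(m, 784).T         # flatten each image, then transpose
print(X.shape)                       # (784, m): each column is one sample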

Network Architecture

With the input properly represented, we define the structure of the network. It consists of an input layer plus two trainable layers (hence the "two-layer" network of the title):

  • 1
    Input layer — 784 units. One for each pixel in the flattened 28×28 image. Values feed directly into the network as the input vector per sample.
  • 2
    Hidden layer — 10 neurons. Each neuron is fully connected to all 784 input pixels, learning a weighted combination of the entire image. After computing a weighted sum plus bias, we apply ReLU — a non-linear activation that lets the network learn complex patterns. The hidden layer acts as a feature extractor, responding to meaningful patterns like strokes, edges, or curves.
  • 3
    Output layer — 10 neurons. One per digit class (0–9). Each neuron produces a raw score, passed through softmax, converting everything into probabilities that sum to 1. Highest probability wins.
Fig 2 — Network Architecture: 784 → 10 (ReLU) → 10 (Softmax). The input units x₁ … x₇₈₄ connect to 10 hidden neurons through W¹, b¹ (10 × 784) followed by ReLU; the hidden activations connect to 10 output neurons through W², b² (10 × 10), and softmax turns the output scores into probabilities that sum to 1, e.g. P("7") = 0.91 for an image of a seven.
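
The training loop later calls an init_params() function that the article does not show. A minimal version consistent with the shapes above might look like the sketch below; the uniform values in [−0.5, 0.5] are an assumption, not necessarily the article's exact initialisation scheme.

Python
import numpy as np

def init_params():
    # Assumed scheme: small uniform random values; shapes match 784 -> 10 -> 10
    W1 = np.random.rand(10, 784) - 0.5   # hidden-layer weights
    b1 = np.random.rand(10, 1) - 0.5     # hidden-layer biases
    W2 = np.random.rand(10, 10) - 0.5    # output-layer weights
    b2 = np.random.rand(10, 1) - 0.5     # output-layer biases
    return W1, b1, W2, b2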

Activation Functions

Two activation functions do the heavy lifting:

Fig 3 — ReLU (hidden layer) and Softmax (output layer). ReLU(z) = max(0, z): negatives map to 0 and positives pass through unchanged; it is non-saturating and fast. Softmax(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ: all outputs lie in [0, 1] and sum to 1, and argmax gives the predicted digit.
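
As a quick numeric check (the scores below are made up for illustration), applying the two functions to a small vector shows ReLU zeroing negatives and softmax producing a valid probability distribution:

Python
import numpy as np

z = np.array([2.0, -1.0, 0.5])           # illustrative raw scores

relu = np.maximum(z, 0)                  # [2.0, 0.0, 0.5]
probs = np.exp(z) / np.sum(np.exp(z))    # ~[0.786, 0.039, 0.175]
print(relu, probs, probs.sum())          # probabilities sum to 1.0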

Forward Propagation

Forward propagation is the process by which the network takes an input image and produces a prediction. It's four sequential computations:

Z¹ = W¹X + b¹ — weighted sum at hidden layer  (shape: 10 × m)
A¹ = ReLU(Z¹) — apply activation, kill negatives
Z² = W²A¹ + b² — weighted sum at output layer  (shape: 10 × m)
A² = softmax(Z²) — probability distribution over 10 classes
  • 1
    Z¹ = W¹X + b¹ — Each hidden neuron looks at all 784 pixels, multiplies them by learned weights, and adds a bias. This lets the network measure how strongly certain pixel patterns appear in the image.
  • 2
    A¹ = ReLU(Z¹) — ReLU keeps positive values and sets negative values to zero. This introduces non-linearity, so the network can learn complex patterns rather than behaving like a simple linear model.
  • 3
    Z² = W²A¹ + b² — The features learned by the hidden layer are combined to produce a raw score for each digit class.
  • 4
    A² = softmax(Z²) — Softmax converts raw scores into probabilities that sum to one. The digit with the highest probability is the final prediction.
Python
import numpy as np

def ReLU(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    # Shift by the column-wise max for numerical stability, then normalise each column
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1    # (10, m) weighted sums at the hidden layer
    A1 = ReLU(Z1)          # (10, m) hidden activations
    Z2 = W2.dot(A1) + b2   # (10, m) raw class scores
    A2 = softmax(Z2)       # (10, m) class probabilities, columns sum to 1
    return Z1, A1, Z2, A2
Fig 4 — Forward Propagation Data Flow. X (784 × m) → Z¹ = W¹X + b¹ (10 × m) → A¹ = ReLU(Z¹) (10 × m) → Z² = W²A¹ + b² (10 × m) → A² = softmax(Z²) (10 × m); argmax(A²) gives the predicted digit for each sample.
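
A quick sanity check is to run forward_prop on a small random batch and confirm the shapes and that each output column sums to 1. The batch size of 5 is arbitrary, and init_params() refers to the illustrative sketch given earlier, not code from the article.

Python
# Assumes init_params() (sketched above) and forward_prop() are defined
W1, b1, W2, b2 = init_params()
X_batch = np.random.rand(784, 5)   # 5 random "images"

Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X_batch)
print(A2.shape)                    # (10, 5)
print(A2.sum(axis=0))              # each column sums to ~1.0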

Backward Propagation

Backward propagation is how the network learns from its mistakes. After forward propagation produces a prediction, we compare it with the true label and propagate the error backward to determine how each parameter should be adjusted.

We compute six gradients — two weight gradients, two bias gradients, and two pre-activation error terms:

dZ² = A² − Y — output error (softmax + cross-entropy simplifies cleanly)
dW² = (1/m) dZ² · A¹ᵀ — how output weights contributed to error
db² = (1/m) Σ dZ² — average bias error, output layer
dZ¹ = W²ᵀ · dZ² ∗ g′(Z¹) — error back to hidden layer, gated by ReLU derivative
dW¹ = (1/m) dZ¹ · Xᵀ — how input weights contributed to hidden error
db¹ = (1/m) Σ dZ¹ — average bias error, hidden layer
  • 1
    dZ² = A² − Y — The error in the output layer. Since we use softmax with cross-entropy loss, the gradient simplifies neatly to the difference between predicted probabilities and true labels. This tells us how much each output neuron over- or under-estimated its prediction.
  • 2
    dW² = (1/m) dZ² A¹ᵀ — Measures how much each hidden neuron contributed to the output error, averaged over all training examples.
  • 3
    db² = (1/m) Σ dZ² — Biases affect all inputs equally, so their gradients are simply the average error across the batch.
  • 4
    dZ¹ = W²ᵀ dZ² ∗ g′(Z¹) — The transpose of W² redistributes the output error back to the hidden neurons. Multiplying element-wise by the ReLU derivative acts as a gate: only neurons that were active during the forward pass receive a gradient, while neurons that output zero pass nothing back.
  • 5
    dW¹ = (1/m) dZ¹ Xᵀ — Measures how much each input pixel contributed to the hidden-layer error, letting the network learn which pixel patterns caused incorrect predictions.
  • 6
    db¹ = (1/m) Σ dZ¹ — Average error per hidden neuron, used to update the hidden layer's biases.
Python
def ReLU_deriv(Z):
    return Z > 0    # boolean mask: 1 where Z > 0, else 0

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    return one_hot_Y.T     # shape: (10, m)

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size                                          # number of samples in the batch
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y                                # (10, m)
    dW2 = 1 / m * dZ2.dot(A1.T)                         # (10, 10)
    db2 = 1 / m * np.sum(dZ2, axis=1, keepdims=True)    # (10, 1) average error per output neuron
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)                # (10, m)
    dW1 = 1 / m * dZ1.dot(X.T)                          # (10, 784)
    db1 = 1 / m * np.sum(dZ1, axis=1, keepdims=True)    # (10, 1) average error per hidden neuron
    return dW1, db1, dW2, db2
Fig 5 — Backward Propagation: Error Flows Right → Left via Chain Rule. Starting from the cross-entropy loss −Σ Y · log(A²), the output error dZ² = A² − Y yields dW² and db²; then dZ¹ = W²ᵀ dZ² ∗ g′(Z¹) yields dW¹ and db¹. The chain rule is applied at each layer, and only ReLU neurons that were active receive a gradient signal.
Why does dZ² simplify so cleanly? Combining softmax with cross-entropy loss yields an unusually clean gradient: the derivative of the loss with respect to the pre-softmax scores reduces to exactly A² − Y. This well-known result makes the implementation much simpler than it would otherwise be.
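
The training code never needs the loss value itself, since only its gradient is used, but for completeness a cross-entropy function matching the figure's formula could look like the sketch below. The small epsilon guarding against log(0) is an implementation detail of this sketch, not something from the article.

Python
def cross_entropy_loss(A2, one_hot_Y, eps=1e-12):
    # Average negative log-likelihood over the m samples in the batch
    m = one_hot_Y.shape[1]
    return -np.sum(one_hot_Y * np.log(A2 + eps)) / m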

Parameter Updates

After computing gradients, the final step is gradient descent — adjusting each parameter a small step in the direction that reduces loss.

W² ← W² − α · dW²
b² ← b² − α · db²
W¹ ← W¹ − α · dW¹
b¹ ← b¹ − α · db¹
  • 1
    W² = W² − α dW² — dW² tells us how changing each weight affects the loss. The learning rate α controls the step size. Subtracting the gradient moves weights toward values that reduce prediction error.
  • 2
    b² = b² − α db² — Biases shift the output independently of the input. Updating them lets the network correct systematic overconfidence or underconfidence in certain digit classes.
  • 3
    W¹ = W¹ − α dW¹ — Updates the connections between input pixels and hidden neurons, strengthening those that helped correct predictions and weakening those that contributed to errors.
  • 4
    b¹ = b¹ − α db¹ — Adjusts how easily each hidden neuron activates. Together, these four updates shift the network's behaviour after every training step.
Python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # One gradient-descent step: move each parameter against its gradient
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

The Training Loop

The full training loop ties everything together. For each iteration: forward propagation → backpropagation → parameter update. Accuracy is printed every 10 steps.

Python
def get_predictions(A2):
    return np.argmax(A2, 0)    # digit with highest probability per sample

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()   # random initialisation
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print(f"Iteration: {i}")
            predictions = get_predictions(A2)
            print(f"Accuracy:  {get_accuracy(predictions, Y):.4f}")
    return W1, b1, W2, b2
Fig 6 — The Training Cycle. init_params() runs once to create random W and b; each iteration then runs forward_prop() (X → Z¹ → A¹ → Z² → A²), backward_prop() (compute dW, db), and update_params() (W −= α·dW, b −= α·db), logging accuracy every 10 iterations and repeating until converged.
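
Putting it together, training is a single call. The learning rate of 0.10, the 500 iterations, and the X_train / Y_train names below are illustrative choices, not necessarily the values used in the article.

Python
# X_train: (784, m) pixel matrix, Y_train: (m,) integer labels 0-9 (assumed names)
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, alpha=0.10, iterations=500)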

Results

With an accuracy of around 85%, this model can definitely be improved — by introducing more layers, tuning activation functions, or adjusting hyperparameters. The goal here is a solid fundamental understanding, not peak performance.

Final accuracy: ~85% on MNIST

Architecture: 784 → 10 (ReLU) → 10 (Softmax)  ·  Optimizer: SGD  ·  Framework: NumPy only
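
To evaluate on held-out data, the same forward pass can be reused. The make_predictions helper and the X_dev / Y_dev names below are illustrative, not variables defined in the article.

Python
def make_predictions(X, W1, b1, W2, b2):
    # Reuse the trained parameters: forward pass, then take the most likely digit
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    return get_predictions(A2)

dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
print(get_accuracy(dev_predictions, Y_dev))   # the article reports ~85% accuracy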

Full code on Kaggle. Credit to Samson Zhang for his video on the topic.

What to try next

To push accuracy further: add more hidden layers, try wider hidden layers (e.g. 128 neurons), experiment with tanh or sigmoid activations, add dropout for regularisation, or swap vanilla SGD for Adam. Each is a straightforward extension on this same NumPy foundation.