Building a Neural Network From Scratch — Just NumPy

Deep Learning · From Scratch

A two-layer network trained on MNIST digit recognition — forward propagation, backpropagation, and gradient descent, all from first principles.

By Sai Saketh  ·  Published in Towards AI  ·  December 2025

The Problem

In this article, I build a simple two-layer neural network from scratch and train it on the MNIST digit recognition dataset. Rather than relying on high-level frameworks, the focus is on understanding what actually happens inside a neural network — from forward propagation to backpropagation and parameter updates.

The problem is simple digit classification using the famous MNIST dataset: given a handwritten digit image, predict which number (0–9) is written.

Why from scratch? Libraries like PyTorch and TensorFlow abstract away the mechanics. Building it in pure NumPy forces you to confront exactly what a neural network computes at every step — there's nowhere to hide.

The Math — Representing Input Data

Each MNIST image has a resolution of 28×28 pixels, which means every image can be flattened into a vector of 784 pixel values. If we have m training images, we stack these vectors to form a matrix X.

Initially this gives shape (m × 784) — each row is one image. We transpose it to (784 × m) so each column is one image, enabling efficient vectorized matrix multiplication over all samples at once.

Fig 1 — Input Matrix: Each Column Represents One Sample. A 28 × 28 image flattens to a 784 × 1 vector; stacking m of them naively gives an (m × 784) matrix with rows as images, and transposing gives X with shape (784 × m) so each column is one image. Why transpose? W · X then processes all m samples in a single matrix multiplication, with no Python loop needed.
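
To make the shape handling concrete, here is a minimal sketch of building X. The `images` array is a random placeholder standing in for the real MNIST pixels, not the article's data-loading code.

Python
import numpy as np

# Illustrative only: `images` stands in for m raw MNIST images of shape (m, 28, 28)
m = 1000
images = np.random.rand(m, 28, 28)   # placeholder pixel values already in [0, 1]

X = images.reshape(m, 784).T         # flatten each image, then transpose
print(X.shape)                       # (784, m): each column is one sample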

Network Architecture

With the input properly represented, we define the structure of the network. It consists of an input layer plus two trainable layers (hence the "two-layer" network of the title):

  • 1
    Input layer — 784 units. One for each pixel in the flattened 28×28 image. Values feed directly into the network as the input vector per sample.
  • 2
    Hidden layer — 10 neurons. Each neuron is fully connected to all 784 input pixels, learning a weighted combination of the entire image. After computing a weighted sum plus bias, we apply ReLU — a non-linear activation that lets the network learn complex patterns. The hidden layer acts as a feature extractor, responding to meaningful patterns like strokes, edges, or curves.
  • 3
    Output layer — 10 neurons. One per digit class (0–9). Each neuron produces a raw score, passed through softmax, converting everything into probabilities that sum to 1. Highest probability wins.
Fig 2 — Network Architecture: 784 → 10 (ReLU) → 10 (Softmax). The input units x₁ … x₇₈₄ connect to 10 hidden neurons through W¹, b¹ (10 × 784) followed by ReLU; the hidden activations connect to 10 output neurons through W², b² (10 × 10), and softmax turns the output scores into probabilities that sum to 1, e.g. P("7") = 0.91 for an image of a seven.
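
The training loop later calls an init_params() function that the article does not show. A minimal version consistent with the shapes above might look like the sketch below; the uniform values in [−0.5, 0.5] are an assumption, not necessarily the article's exact initialisation scheme.

Python
import numpy as np

def init_params():
    # Assumed scheme: small uniform random values; shapes match 784 -> 10 -> 10
    W1 = np.random.rand(10, 784) - 0.5   # hidden-layer weights
    b1 = np.random.rand(10, 1) - 0.5     # hidden-layer biases
    W2 = np.random.rand(10, 10) - 0.5    # output-layer weights
    b2 = np.random.rand(10, 1) - 0.5     # output-layer biases
    return W1, b1, W2, b2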

Activation Functions

Two activation functions do the heavy lifting:

Fig 3 — ReLU (hidden layer) and Softmax (output layer). ReLU(z) = max(0, z): negatives map to 0 and positives pass through unchanged; it is non-saturating and fast. Softmax(zᵢ) = eᶻⁱ / Σⱼ eᶻʲ: all outputs lie in [0, 1] and sum to 1, and argmax gives the predicted digit.
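
As a quick numeric check (the scores below are made up for illustration), applying the two functions to a small vector shows ReLU zeroing negatives and softmax producing a valid probability distribution:

Python
import numpy as np

z = np.array([2.0, -1.0, 0.5])           # illustrative raw scores

relu = np.maximum(z, 0)                  # [2.0, 0.0, 0.5]
probs = np.exp(z) / np.sum(np.exp(z))    # ~[0.786, 0.039, 0.175]
print(relu, probs, probs.sum())          # probabilities sum to 1.0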

Forward Propagation

Forward propagation is the process by which the network takes an input image and produces a prediction. It's four sequential computations:

Z¹ = W¹X + b¹ — weighted sum at hidden layer  (shape: 10 × m)
A¹ = ReLU(Z¹) — apply activation, kill negatives
Z² = W²A¹ + b² — weighted sum at output layer  (shape: 10 × m)
A² = softmax(Z²) — probability distribution over 10 classes
  • 1
    Z¹ = W¹X + b¹ — Each hidden neuron looks at all 784 pixels, multiplies them by learned weights, and adds a bias. This lets the network measure how strongly certain pixel patterns appear in the image.
  • 2
    A¹ = ReLU(Z¹) — ReLU keeps positive values and sets negative values to zero. This introduces non-linearity, so the network can learn complex patterns rather than behaving like a simple linear model.
  • 3
    Z² = W²A¹ + b² — The features learned by the hidden layer are combined to produce a raw score for each digit class.
  • 4
    A² = softmax(Z²) — Softmax converts raw scores into probabilities that sum to one. The digit with the highest probability is the final prediction.
Python
import numpy as np

def ReLU(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    # Shift by the column-wise max for numerical stability, then normalise each column
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1    # (10, m) weighted sums at the hidden layer
    A1 = ReLU(Z1)          # (10, m) hidden activations
    Z2 = W2.dot(A1) + b2   # (10, m) raw class scores
    A2 = softmax(Z2)       # (10, m) class probabilities, columns sum to 1
    return Z1, A1, Z2, A2
Fig 4 — Forward Propagation Data Flow. X (784 × m) → Z¹ = W¹X + b¹ (10 × m) → A¹ = ReLU(Z¹) (10 × m) → Z² = W²A¹ + b² (10 × m) → A² = softmax(Z²) (10 × m); argmax(A²) gives the predicted digit for each sample.
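
A quick sanity check is to run forward_prop on a small random batch and confirm the shapes and that each output column sums to 1. The batch size of 5 is arbitrary, and init_params() refers to the illustrative sketch given earlier, not code from the article.

Python
# Assumes init_params() (sketched above) and forward_prop() are defined
W1, b1, W2, b2 = init_params()
X_batch = np.random.rand(784, 5)   # 5 random "images"

Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X_batch)
print(A2.shape)                    # (10, 5)
print(A2.sum(axis=0))              # each column sums to ~1.0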

Backward Propagation

Backward propagation is how the network learns from its mistakes. After forward propagation produces a prediction, we compare it with the true label and propagate the error backward to determine how each parameter should be adjusted.

We compute six gradients — two weight gradients, two bias gradients, and two pre-activation error terms:

dZ² = A² − Y — output error (softmax + cross-entropy simplifies cleanly)
dW² = (1/m) dZ² · A¹ᵀ — how output weights contributed to error
db² = (1/m) Σ dZ² — average bias error, output layer
dZ¹ = W²ᵀ · dZ² ∗ g′(Z¹) — error back to hidden layer, gated by ReLU derivative
dW¹ = (1/m) dZ¹ · Xᵀ — how input weights contributed to hidden error
db¹ = (1/m) Σ dZ¹ — average bias error, hidden layer
  • 1
    dZ² = A² − Y — The error in the output layer. Since we use softmax with cross-entropy loss, the gradient simplifies neatly to the difference between predicted probabilities and true labels. This tells us how much each output neuron over- or under-estimated its prediction.
  • 2
    dW² = (1/m) dZ² A¹ᵀ — Measures how much each hidden neuron contributed to the output error, averaged over all training examples.
  • 3
    db² = (1/m) Σ dZ² — Biases affect all inputs equally, so their gradients are simply the average error across the batch.
  • 4
    dZ¹ = W²ᵀ dZ² ∗ g′(Z¹) — The transpose of W² redistributes the output error back to the hidden neurons. Multiplying element-wise by the ReLU derivative acts as a gate: only neurons that were active during the forward pass receive a gradient, while neurons that output zero pass nothing back.
  • 5
    dW¹ = (1/m) dZ¹ Xᵀ — Measures how much each input pixel contributed to the hidden-layer error, letting the network learn which pixel patterns caused incorrect predictions.
  • 6
    db¹ = (1/m) Σ dZ¹ — Average error per hidden neuron, used to update the hidden layer's biases.
Python
def ReLU_deriv(Z):
    return Z > 0    # boolean mask: 1 where Z > 0, else 0

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    return one_hot_Y.T     # shape: (10, m)

def backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y):
    m = Y.size                                          # number of samples in the batch
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y                                # (10, m)
    dW2 = 1 / m * dZ2.dot(A1.T)                         # (10, 10)
    db2 = 1 / m * np.sum(dZ2, axis=1, keepdims=True)    # (10, 1) average error per output neuron
    dZ1 = W2.T.dot(dZ2) * ReLU_deriv(Z1)                # (10, m)
    dW1 = 1 / m * dZ1.dot(X.T)                          # (10, 784)
    db1 = 1 / m * np.sum(dZ1, axis=1, keepdims=True)    # (10, 1) average error per hidden neuron
    return dW1, db1, dW2, db2
Fig 5 — Backward Propagation: Error Flows Right → Left via Chain Rule. Starting from the cross-entropy loss −Σ Y · log(A²), the output error dZ² = A² − Y yields dW² and db²; then dZ¹ = W²ᵀ dZ² ∗ g′(Z¹) yields dW¹ and db¹. The chain rule is applied at each layer, and only ReLU neurons that were active receive a gradient signal.
Why does dZ² simplify so cleanly? Combining softmax with cross-entropy loss yields an unusually clean gradient: the derivative of the loss with respect to the pre-softmax scores reduces to exactly A² − Y. This well-known result makes the implementation much simpler than it would otherwise be.
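
The training code never needs the loss value itself, since only its gradient is used, but for completeness a cross-entropy function matching the figure's formula could look like the sketch below. The small epsilon guarding against log(0) is an implementation detail of this sketch, not something from the article.

Python
def cross_entropy_loss(A2, one_hot_Y, eps=1e-12):
    # Average negative log-likelihood over the m samples in the batch
    m = one_hot_Y.shape[1]
    return -np.sum(one_hot_Y * np.log(A2 + eps)) / m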

Parameter Updates

After computing gradients, the final step is gradient descent — adjusting each parameter a small step in the direction that reduces loss.

W² ← W² − α · dW²
b² ← b² − α · db²
W¹ ← W¹ − α · dW¹
b¹ ← b¹ − α · db¹
  • 1
    W² = W² − α dW² — dW² tells us how changing each weight affects the loss. The learning rate α controls the step size. Subtracting the gradient moves weights toward values that reduce prediction error.
  • 2
    b² = b² − α db² — Biases shift the output independently of the input. Updating them lets the network correct systematic overconfidence or underconfidence in certain digit classes.
  • 3
    W¹ = W¹ − α dW¹ — Updates the connections between input pixels and hidden neurons, strengthening those that helped correct predictions and weakening those that contributed to errors.
  • 4
    b¹ = b¹ − α db¹ — Adjusts how easily each hidden neuron activates. Together, these four updates shift the network's behaviour after every training step.
Python
def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # One gradient-descent step: move each parameter against its gradient
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

The Training Loop

The full training loop ties everything together. For each iteration: forward propagation → backpropagation → parameter update. Accuracy is printed every 10 steps.

Python
def get_predictions(A2):
    return np.argmax(A2, 0)    # digit with highest probability per sample

def get_accuracy(predictions, Y):
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, alpha, iterations):
    W1, b1, W2, b2 = init_params()   # random initialisation
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, Z2, A2, W1, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print(f"Iteration: {i}")
            predictions = get_predictions(A2)
            print(f"Accuracy:  {get_accuracy(predictions, Y):.4f}")
    return W1, b1, W2, b2
Fig 6 — The Training Cycle. init_params() runs once to create random W and b; each iteration then runs forward_prop() (X → Z¹ → A¹ → Z² → A²), backward_prop() (compute dW, db), and update_params() (W −= α·dW, b −= α·db), logging accuracy every 10 iterations and repeating until converged.
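
Putting it together, training is a single call. The learning rate of 0.10, the 500 iterations, and the X_train / Y_train names below are illustrative choices, not necessarily the values used in the article.

Python
# X_train: (784, m) pixel matrix, Y_train: (m,) integer labels 0-9 (assumed names)
W1, b1, W2, b2 = gradient_descent(X_train, Y_train, alpha=0.10, iterations=500)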

Results

With an accuracy of around 85%, this model can definitely be improved — by introducing more layers, tuning activation functions, or adjusting hyperparameters. The goal here is a solid fundamental understanding, not peak performance.

Final accuracy: ~85% on MNIST

Architecture: 784 → 10 (ReLU) → 10 (Softmax)  ·  Optimizer: SGD  ·  Framework: NumPy only
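
To evaluate on held-out data, the same forward pass can be reused. The make_predictions helper and the X_dev / Y_dev names below are illustrative, not variables defined in the article.

Python
def make_predictions(X, W1, b1, W2, b2):
    # Reuse the trained parameters: forward pass, then take the most likely digit
    _, _, _, A2 = forward_prop(W1, b1, W2, b2, X)
    return get_predictions(A2)

dev_predictions = make_predictions(X_dev, W1, b1, W2, b2)
print(get_accuracy(dev_predictions, Y_dev))   # the article reports ~85% accuracy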

Full code on Kaggle. Credit to Samson Zhang for his video on the topic.

What to try next

To push accuracy further: add more hidden layers, try wider hidden layers (e.g. 128 neurons), experiment with tanh or sigmoid activations, add dropout for regularisation, or swap vanilla SGD for Adam. Each is a straightforward extension on this same NumPy foundation.