2022-10-23

Backpropagation

What is Backpropagation

Backpropagation is essential to the training process of deep learning models, as it allows for the efficient calculation of gradients necessary for updating the weights and biases of neural networks. By minimizing the error between predicted outputs and actual outputs, backpropagation enables neural networks to learn and adapt to new data, ultimately improving their performance.

The Backpropagation Algorithm

In this chapter, I will delve into the backpropagation algorithm and its role in training deep neural networks. We will cover the essential mathematical concepts, such as the chain rule and gradient descent, before discussing the algorithm in detail and providing pseudocode for its implementation.

The Chain Rule in Calculus

The chain rule is a fundamental concept in calculus that allows us to compute the derivative of a composite function. It plays a crucial role in backpropagation, as it helps us compute the gradients of the loss function with respect to the weights and biases in the network.

\frac{dy}{dx} = \frac{dy}{dg(x)} \cdot \frac{dg(x)}{dx}

Given a composite function $y = f(g(x))$ , the chain rule states that the derivative of $y$ with respect to $x$ is the product of the derivative of $f$ with respect to $g(x)$ and the derivative of $g(x)$ with respect to $x$ :

The Loss Function

The loss function, also known as the cost function or objective function, quantifies the difference between the predicted outputs and the actual outputs (targets) of the neural network. Common loss functions include mean squared error (MSE) for regression tasks and cross-entropy for classification tasks.

For example, the mean squared error is defined as:

L = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2

where $N$ is the number of samples, $y_i$ is the actual output, and $\hat{y}_i$ is the predicted output.

Gradient Descent

Gradient descent is an optimization algorithm used to minimize the loss function by iteratively updating the weights and biases in the network. The core idea is to compute the gradient of the loss function with respect to the parameters and then update the parameters by taking a step proportional to the negative of the gradient.

The general update rule for gradient descent is:

w_{t+1} = w_t - \eta \frac{\partial L}{\partial w_t}

where $w_t$ is the current weight, $w_{t+1}$ is the updated weight, $\eta$ is the learning rate, and $\frac{\partial L}{\partial w_t}$ is the gradient of the loss function with respect to the weight.

The Algorithm in Detail

The backpropagation algorithm consists of two main steps: the forward pass and the backward pass.

Forward Pass
In the forward pass, the input data is passed through the network to compute the predicted outputs. This involves computing the weighted sum of inputs and biases for each neuron and applying the activation function to generate the neuron's output.
Backward Pass
In the backward pass, the gradients of the loss function with respect to the weights and biases are computed using the chain rule. The gradients are then used to update the weights and biases using gradient descent.

The backpropagation algorithm can be summarized as follows:

Perform a forward pass to compute the predicted outputs.
Compute the loss using the loss function.
Calculate the gradient of the loss function with respect to the output layer's activation.
Compute the gradients of the loss function with respect to the weights and biases in the network using the chain rule.
Update the weights and biases using gradient descent.

Derivation of Backpropagation

In this chapter, I will derive the backpropagation algorithm step-by-step, using the chain rule and the gradient descent optimization method. We will first define the necessary notation and then proceed to derive the update equations for weights and biases in the network.

Notation

Let's define the following notation for our neural network:

$L$ : The number of layers in the network.
$N_l$ : The number of neurons in layer $l$ .
$w^l_{jk}$ : The weight connecting neuron $k$ in layer $l-1$ to neuron $j$ in layer $l$ .
$b^l_j$ : The bias for neuron $j$ in layer $l$ .
$a^l_j$ : The activation of neuron $j$ in layer $l$ .
$z^l_j$ : The weighted sum of inputs and bias for neuron $j$ in layer $l$ , defined as $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ .
$f^l$ : The activation function for neurons in layer $l$ .
$L(\mathbf{y}, \mathbf{\hat{y}})$ : The loss function, where $\mathbf{y}$ represents the actual outputs and $\mathbf{\hat{y}}$ represents the predicted outputs.

Derivation of the Backpropagation Algorithm

To derive the backpropagation algorithm, we need to compute the gradients of the loss function with respect to the weights and biases. We will start by calculating the gradient with respect to the output layer's activations.

Step 1: Compute the gradient of the loss function with respect to the output layer's activations

\frac{\partial L}{\partial a^{L}_j}

Step 2: Calculate the error term for neurons in the output layer

We define the error term for neuron $j$ in layer $L$ as $\delta^L_j$ :

\delta^L_j = \frac{\partial L}{\partial z^{L}_j}

Using the chain rule, we can express $\delta^L_j$ as:

\delta^L_j = \frac{\partial L}{\partial a^{L}_j} \cdot f'^{L}(z^{L}_j)

Step 3: Calculate the error term for neurons in hidden layers

For hidden layers, we can compute the error term using the error term of the subsequent layer (layer $l+1$ ) and the chain rule:

\delta^l_j = \frac{\partial L}{\partial z^{l}_j} = \sum_k \frac{\partial L}{\partial z^{l+1}_k} \cdot \frac{\partial z^{l+1}_k}{\partial z^{l}_j} = \sum_k \delta^{l+1}_k \cdot \frac{\partial z^{l+1}_k}{\partial z^{l}_j}

Applying the chain rule once again:

\delta^l_j = \sum_k \delta^{l+1}_k \cdot w^{l+1}_{kj} \cdot f'^{l}(z^{l}_j)

Step 4: Compute the gradient of the loss function with respect to the weights

Now that we have the error terms for all neurons in the network, we can compute the gradient of the loss function with respect to the weights. Using the chain rule, we get:

\frac{\partial L}{\partial w^l_{jk}} = \delta^l_j \cdot a^{l-1}_k

Step 5: Compute the gradient of the loss function with respect to the biases

Similarly, we can compute the gradient of the loss function with respect to the biases:

\frac{\partial L}{\partial b^l_j} = \delta^l_j

Step 6: Update the weights and biases using gradient descent

With the gradients of the loss function with respect to the weights and biases, we can now update them using the gradient descent method:

w^l_{jk} \leftarrow w^l_{jk} - \eta \delta^l_j \cdot a^{l-1}_k

b^l_j \leftarrow b^l_j - \eta \delta^l_j

By iteratively applying these update equations during the training process, the backpropagation algorithm enables the neural network to learn and adapt to the input data, minimizing the loss function.

Simple Feedforward Neural Network with Backpropagation Using Python

Here is an example implementation of a simple feedforward neural network with backpropagation using Python and Numpy.

python

import numpy as np

# Activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Initialize network
def initialize_network(input_nodes, hidden_nodes, output_nodes):
    network = {
        "W1": np.random.randn(hidden_nodes, input_nodes) * 0.1,
        "b1": np.zeros((hidden_nodes, 1)),
        "W2": np.random.randn(output_nodes, hidden_nodes) * 0.1,
        "b2": np.zeros((output_nodes, 1))
    }
    return network

# Forward pass
def forward_pass(network, X):
    W1, b1, W2, b2 = network["W1"], network["b1"], network["W2"], network["b2"]
    Z1 = np.dot(W1, X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    return Z1, A1, Z2, A2

# Backward pass
def backward_pass(network, X, y, Z1, A1, Z2, A2, learning_rate):
    m = X.shape[1]
    dZ2 = A2 - y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(network["W2"].T, dZ2) * sigmoid_derivative(Z1)
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    network["W1"] -= learning_rate * dW1
    network["b1"] -= learning_rate * db1
    network["W2"] -= learning_rate * dW2
    network["b2"] -= learning_rate * db2

# Training the network
def train_network(network, X, y, epochs, learning_rate):
    for i in range(epochs):
        Z1, A1, Z2, A2 = forward_pass(network, X)
        backward_pass(network, X, y, Z1, A1, Z2, A2, learning_rate)

# Example
input_nodes = 2
hidden_nodes = 3
output_nodes = 1
network = initialize_network(input_nodes, hidden_nodes, output_nodes)

# Training data (XOR problem)
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[0, 1, 1, 0]])

epochs = 10000
learning_rate = 0.1

train_network(network, X, y, epochs, learning_rate)

# Testing
Z1, A1, Z2, A2 = forward_pass(network, X)
predictions = (A2 > 0.5).astype(int)
print("Predictions:", predictions)

This implementation demonstrates a simple feedforward neural network with one hidden layer. By adjusting the architecture and training parameters, this network can be adapted to solve more complex problems.

References

http://karpathy.github.io/neuralnets/

Perceptron

Type of activation function

Descriptive Statistics

Differential Equation

Dimensionality Reduction

Discrete Choice Model

Google Search Console

Hugging Face

Hypothesis Testing

Inferential Statistics

Probability Distribution

Ryusei Kakujo

Weave the future of cities through data

Transportation modeling/ Urban planning/ Machine learning/ Computer science/ GIS