Multi Layer Perceptron (MLP)


The goal of this notebook is to show how to build, train and test a neural network. Along the way, it discusses several terms that come up when working with neural networks. The terms covered in this notebook are:

  1. Loss Functions
  2. Optimizers
  3. Batch size vs Epochs
  4. Training loss, Validation loss and Test loss
  5. Dropout: a way to avoid overfitting

Multi-Layer Perceptron, MNIST

We will train an MLP to classify images from the MNIST database of hand-written digits.

The process will be broken down into the following steps:

  1. Load and visualize the data
  2. Define a neural network
  3. Train the model
  4. Evaluate the performance of our trained model on a test dataset!

Before we begin, we have to import the necessary libraries for working with data and PyTorch.

# import libraries
import torch
import numpy as np

Load and Visualize the Data

Downloading may take a few moments, and you should see your progress as the data is loading. You may also choose to change the batch_size if you want to load more data at a time.

This cell will create DataLoaders for each of our datasets.

from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler

# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# percentage of training set to use as validation
valid_size = 0.2

# convert data to torch.FloatTensor
transform = transforms.ToTensor()

# choose the training and test datasets
train_data = datasets.MNIST(root='data', train=True,
                                   download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                                  download=True, transform=transform)

# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders
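train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,
    num_workers=num_workers)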

Visualize a Batch of Training Data

The first step in a classification task is to take a look at the data, make sure it is loaded in correctly, then make any initial observations about patterns in that data.

import matplotlib.pyplot as plt
%matplotlib inline

# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.numpy()

# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    # print out the correct label for each image
    # .item() gets the value contained in a Tensor
    ax.set_title(str(labels[idx].item()))


View an Image in More Detail

img = np.squeeze(images[1])

fig = plt.figure(figsize = (12,12))
ax = fig.add_subplot(111)
ax.imshow(img, cmap='gray')
width, height = img.shape
thresh = img.max()/2.5
for x in range(width):
    for y in range(height):
        val = round(img[x][y],2) if img[x][y] !=0 else 0
        ax.annotate(str(val), xy=(y,x),
                    color='white' if img[x][y]<thresh else 'black')


Define the Network Architecture

The network takes as input a 784-dimensional Tensor of pixel values for each image and produces a Tensor of length 10 (our number of classes) that holds the class scores for the input image. This particular example uses two hidden layers and dropout to avoid overfitting.
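
To get a feel for the size of such a network, the number of trainable parameters can be worked out by hand. The short sketch below assumes 784 inputs, two hidden layers of 512 units each (the value used in this notebook's network definition) and 10 outputs; the helper name `linear_params` is only illustrative.

```python
# Count the trainable parameters of a fully connected network.
# Assumed layer sizes: 784 -> 512 -> 512 -> 10.

def linear_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [28 * 28, 512, 512, 10]
per_layer = [linear_params(a, b)
             for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [401920, 262656, 5130]
print(total_params)   # 669706
```

Dropout adds no parameters of its own, so it does not appear in this count.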

import torch.nn as nn
import torch.nn.functional as F

# define the NN architecture
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # number of hidden nodes in each layer (512)
        hidden_1 = 512
        hidden_2 = 512
        # linear layer (784 -> hidden_1)
        self.fc1 = nn.Linear(28 * 28, hidden_1)
        # linear layer (hidden_1 -> hidden_2)
        self.fc2 = nn.Linear(hidden_1, hidden_2)
        # linear layer (hidden_2 -> 10)
        self.fc3 = nn.Linear(hidden_2, 10)
        # dropout layer (p=0.2)
        # dropout prevents overfitting of data
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc2(x))
        # add dropout layer
        x = self.dropout(x)
        # add output layer
        x = self.fc3(x)
        return x

# initialize the NN
model = Net()
print(model)
Net(
  (fc1): Linear(in_features=784, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=512, bias=True)
  (fc3): Linear(in_features=512, out_features=10, bias=True)
  (dropout): Dropout(p=0.2)
)

Specify Loss Function and Optimizer

It’s recommended that you use cross-entropy loss for classification. If you look at the documentation, you can see that PyTorch’s cross-entropy function applies a softmax function to the output layer and then calculates the log loss. This is why the network’s forward pass returns raw class scores without applying a softmax itself.

# specify loss function (categorical cross-entropy)
criterion = nn.CrossEntropyLoss()

# specify optimizer (stochastic gradient descent) and learning rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

Train the Network

The steps for training/learning from a batch of data are described in the comments below:

  1. Clear the gradients of all optimized variables
  2. Forward pass: compute predicted outputs by passing inputs to the model
  3. Calculate the loss
  4. Backward pass: compute gradient of the loss with respect to model parameters
  5. Perform a single optimization step (parameter update)
  6. Update average training loss

The following loop trains for 50 epochs; take a look at how the values for the training loss decrease over time. We want it to decrease while also avoiding overfitting the training data.
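
The relationship between batch size and epochs can be made concrete with a little arithmetic. The sketch below assumes the 60,000-image MNIST training set together with the 20% validation split and batch size of 20 chosen earlier in this notebook; the variable names are illustrative.

```python
import math

num_train = 60000      # size of the MNIST training set
valid_size = 0.2       # fraction held out for validation
batch_size = 20
n_epochs = 50

n_valid = int(math.floor(valid_size * num_train))
n_train = num_train - n_valid

# one parameter update per batch, one pass over the training subset per epoch
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * n_epochs

print(n_train, n_valid)     # 48000 12000
print(updates_per_epoch)    # 2400
print(total_updates)        # 120000
```

A larger batch size means fewer, less noisy updates per epoch; the number of epochs controls how many times each training image is seen.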

# number of epochs to train the model
n_epochs = 50

# initialize tracker for minimum validation loss
valid_loss_min = np.inf # set initial "min" to infinity

for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    valid_loss = 0.0

    # train the model #
    model.train() # prep model for training
    for data, target in train_loader:
        # clear the gradients of all optimized variables
        optimizer.zero_grad()
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # backward pass: compute gradient of the loss with respect to model parameters
        loss.backward()
        # perform a single optimization step (parameter update)
        optimizer.step()
        # update running training loss
        train_loss += loss.item()*data.size(0)

    # validate the model #
    model.eval() # prep model for evaluation
    for data, target in valid_loader:
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the loss
        loss = criterion(output, target)
        # update running validation loss
        valid_loss += loss.item()*data.size(0)

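    # print training/validation statistics
    # calculate average loss over an epoch
    # note: .dataset is the full 60,000-image training set for both loaders,
    # not the sampled subset, so both averages are scaled down accordingly;
    # dividing by len(loader.sampler) instead would give per-sample means
    train_loss = train_loss/len(train_loader.dataset)
    valid_loss = valid_loss/len(valid_loader.dataset)

    print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
        epoch+1,
        train_loss,
        valid_loss
        ))

    # save model if validation loss has decreased
    if valid_loss <= valid_loss_min:
        print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
        valid_loss_min,
        valid_loss))
        # 'model.pt' is an assumed checkpoint filename
        torch.save(model.state_dict(), 'model.pt')
        valid_loss_min = valid_loss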
Epoch: 1 	Training Loss: 0.767620 	Validation Loss: 0.085004
Validation loss decreased (inf --> 0.085004).  Saving model ...
Epoch: 2 	Training Loss: 0.288777 	Validation Loss: 0.064357
Validation loss decreased (0.085004 --> 0.064357).  Saving model ...
Epoch: 3 	Training Loss: 0.231203 	Validation Loss: 0.052979
Validation loss decreased (0.064357 --> 0.052979).  Saving model ...
Epoch: 4 	Training Loss: 0.191698 	Validation Loss: 0.045988
Validation loss decreased (0.052979 --> 0.045988).  Saving model ...
Epoch: 5 	Training Loss: 0.161971 	Validation Loss: 0.039707
Validation loss decreased (0.045988 --> 0.039707).  Saving model ...
Epoch: 6 	Training Loss: 0.141026 	Validation Loss: 0.035701
Validation loss decreased (0.039707 --> 0.035701).  Saving model ...
Epoch: 7 	Training Loss: 0.123587 	Validation Loss: 0.031777
Validation loss decreased (0.035701 --> 0.031777).  Saving model ...
Epoch: 8 	Training Loss: 0.109185 	Validation Loss: 0.029636
Validation loss decreased (0.031777 --> 0.029636).  Saving model ...
Epoch: 9 	Training Loss: 0.098960 	Validation Loss: 0.027178
Validation loss decreased (0.029636 --> 0.027178).  Saving model ...
Epoch: 10 	Training Loss: 0.087829 	Validation Loss: 0.025448
Validation loss decreased (0.027178 --> 0.025448).  Saving model ...
Epoch: 11 	Training Loss: 0.081526 	Validation Loss: 0.024772
Validation loss decreased (0.025448 --> 0.024772).  Saving model ...
Epoch: 12 	Training Loss: 0.075072 	Validation Loss: 0.023862
Validation loss decreased (0.024772 --> 0.023862).  Saving model ...
Epoch: 13 	Training Loss: 0.069212 	Validation Loss: 0.021493
Validation loss decreased (0.023862 --> 0.021493).  Saving model ...
Epoch: 14 	Training Loss: 0.063953 	Validation Loss: 0.021073
Validation loss decreased (0.021493 --> 0.021073).  Saving model ...
Epoch: 15 	Training Loss: 0.059000 	Validation Loss: 0.020229
Validation loss decreased (0.021073 --> 0.020229).  Saving model ...
Epoch: 16 	Training Loss: 0.056524 	Validation Loss: 0.020079
Validation loss decreased (0.020229 --> 0.020079).  Saving model ...
Epoch: 17 	Training Loss: 0.052064 	Validation Loss: 0.019345
Validation loss decreased (0.020079 --> 0.019345).  Saving model ...
Epoch: 18 	Training Loss: 0.048936 	Validation Loss: 0.018713
Validation loss decreased (0.019345 --> 0.018713).  Saving model ...
Epoch: 19 	Training Loss: 0.045209 	Validation Loss: 0.018985
Epoch: 20 	Training Loss: 0.042700 	Validation Loss: 0.018208
Validation loss decreased (0.018713 --> 0.018208).  Saving model ...
Epoch: 21 	Training Loss: 0.039584 	Validation Loss: 0.017888
Validation loss decreased (0.018208 --> 0.017888).  Saving model ...
Epoch: 22 	Training Loss: 0.037877 	Validation Loss: 0.017454
Validation loss decreased (0.017888 --> 0.017454).  Saving model ...
Epoch: 23 	Training Loss: 0.035406 	Validation Loss: 0.017671
Epoch: 24 	Training Loss: 0.034247 	Validation Loss: 0.017521
Epoch: 25 	Training Loss: 0.031872 	Validation Loss: 0.016991
Validation loss decreased (0.017454 --> 0.016991).  Saving model ...
Epoch: 26 	Training Loss: 0.030163 	Validation Loss: 0.016554
Validation loss decreased (0.016991 --> 0.016554).  Saving model ...
Epoch: 27 	Training Loss: 0.028369 	Validation Loss: 0.017595
Epoch: 28 	Training Loss: 0.026245 	Validation Loss: 0.016682
Epoch: 29 	Training Loss: 0.025983 	Validation Loss: 0.017080
Epoch: 30 	Training Loss: 0.024357 	Validation Loss: 0.016169
Validation loss decreased (0.016554 --> 0.016169).  Saving model ...
Epoch: 31 	Training Loss: 0.022118 	Validation Loss: 0.016334
Epoch: 32 	Training Loss: 0.023228 	Validation Loss: 0.016612
Epoch: 33 	Training Loss: 0.020928 	Validation Loss: 0.016693
Epoch: 34 	Training Loss: 0.019909 	Validation Loss: 0.016322
Epoch: 35 	Training Loss: 0.018557 	Validation Loss: 0.016833
Epoch: 36 	Training Loss: 0.018037 	Validation Loss: 0.016070
Validation loss decreased (0.016169 --> 0.016070).  Saving model ...
Epoch: 37 	Training Loss: 0.017053 	Validation Loss: 0.015298
Validation loss decreased (0.016070 --> 0.015298).  Saving model ...
Epoch: 38 	Training Loss: 0.016680 	Validation Loss: 0.016685
Epoch: 39 	Training Loss: 0.015662 	Validation Loss: 0.016136
Epoch: 40 	Training Loss: 0.015871 	Validation Loss: 0.016163
Epoch: 41 	Training Loss: 0.014403 	Validation Loss: 0.015852
Epoch: 42 	Training Loss: 0.013686 	Validation Loss: 0.015913
Epoch: 43 	Training Loss: 0.013107 	Validation Loss: 0.016956
Epoch: 44 	Training Loss: 0.012698 	Validation Loss: 0.015649
Epoch: 45 	Training Loss: 0.012580 	Validation Loss: 0.015952
Epoch: 46 	Training Loss: 0.012093 	Validation Loss: 0.015912
Epoch: 47 	Training Loss: 0.011695 	Validation Loss: 0.015612
Epoch: 48 	Training Loss: 0.010706 	Validation Loss: 0.016048
Epoch: 49 	Training Loss: 0.010784 	Validation Loss: 0.015889
Epoch: 50 	Training Loss: 0.010427 	Validation Loss: 0.016143

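Load the Model with the Lowest Validation Loss

# load the saved weights; 'model.pt' is an assumed checkpoint filename
model.load_state_dict(torch.load('model.pt'))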

Test the Trained Network

Finally, we test our best model on previously unseen test data and evaluate its performance. Testing on unseen data is a good way to check that our model generalizes well. It may also be useful to be granular in this analysis and look at how the model performs on each class, as well as at its overall loss and accuracy.

# initialize lists to monitor test loss and accuracy
test_loss = 0.0
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))

model.eval() # prep model for evaluation

for data, target in test_loader:
    # forward pass: compute predicted outputs by passing inputs to the model
    output = model(data)
    # calculate the loss
    loss = criterion(output, target)
    # update test loss
    test_loss += loss.item()*data.size(0)
    # convert output probabilities to predicted class
    _, pred = torch.max(output, 1)
    # compare predictions to true label
    correct = pred.eq(target.view_as(pred))
    # calculate test accuracy for each object class
    for i in range(len(target)):
        label = target[i].item()
        class_correct[label] += correct[i].item()
        class_total[label] += 1

# calculate and print avg test loss
test_loss = test_loss/len(test_loader.dataset)
print('Test Loss: {:.6f}\n'.format(test_loss))

for i in range(10):
    if class_total[i] > 0:
        print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
            str(i), 100 * class_correct[i] / class_total[i],
            np.sum(class_correct[i]), np.sum(class_total[i])))
    else:
        print('Test Accuracy of %5s: N/A (no test examples)' % (str(i)))

print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))
Test Loss: 0.074413

Test Accuracy of     0: 99% (971/980)
Test Accuracy of     1: 98% (1123/1135)
Test Accuracy of     2: 96% (1001/1032)
Test Accuracy of     3: 97% (984/1010)
Test Accuracy of     4: 97% (960/982)
Test Accuracy of     5: 97% (870/892)
Test Accuracy of     6: 97% (935/958)
Test Accuracy of     7: 96% (993/1028)
Test Accuracy of     8: 97% (946/974)
Test Accuracy of     9: 97% (981/1009)

Test Accuracy (Overall): 97% (9764/10000)

Visualize Sample Test Results

This cell displays test images and their labels in this format: predicted (ground-truth). The text will be green for accurately classified examples and red for incorrect predictions.

# obtain one batch of test images
dataiter = iter(test_loader)
images, labels = next(dataiter)

# get sample outputs
output = model(images)
# convert output probabilities to predicted class
_, preds = torch.max(output, 1)
# prep images for display
images = images.numpy()

# plot the images in the batch, along with predicted and true labels
fig = plt.figure(figsize=(25, 4))
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])
    ax.imshow(np.squeeze(images[idx]), cmap='gray')
    ax.set_title("{} ({})".format(str(preds[idx].item()), str(labels[idx].item())),
                 color=("green" if preds[idx]==labels[idx] else "red"))




An artificial neuron, or perceptron, takes several inputs and performs a weighted summation to produce an output. The weights of the perceptron are determined during the training process, based on the training data. The following is a diagram of the perceptron: [Figure: perceptron]

The inputs are weighted and summed as shown in the preceding image. The sum is then passed through a unit step function, in this case for a binary classification problem. A single perceptron can only learn simple functions (with a step function, only linearly separable ones) by learning the weights from examples. The process of learning the weights is called training.
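
The weighted sum followed by a unit step can be sketched in a few lines. The weights and bias below are hand-picked for illustration rather than learned, and the function names are hypothetical.

```python
def step(z):
    """Unit step function: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a unit step."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(z)

# Hand-picked (not learned) weights that implement a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5

and_outputs = [perceptron([a, b], and_weights, and_bias)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # [0, 0, 0, 1]
```

Training would replace the hand-picked values with weights found from labelled examples.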

Activation Functions

Activation functions make neural nets nonlinear. An activation function decides whether a perceptron should fire or not. During training, activation functions play an important role in adjusting the gradients. An activation function such as sigmoid, shown in the next section, attenuates values of higher magnitude. This nonlinear behaviour of the activation function gives deep nets the ability to learn complex functions. Most activation functions are continuous and differentiable, except for the rectified linear unit, which is not differentiable at 0. A continuous function has small changes in output for every small change in input. A differentiable function has a derivative at every point in its domain.


Sigmoid

Sigmoid can be considered a smoothed step function and is hence differentiable. Sigmoid is useful for converting any value to a probability and can be used for binary classification. The sigmoid maps its input to a value in the range of 0 to 1, as shown in the following graph: [Figure: sigmoid]

For inputs of large magnitude the sigmoid saturates: the change in Y with respect to X becomes very small, and hence the gradients vanish. After some learning, the weight updates may therefore become very small. Another activation function called tanh, explained in the next section, is a scaled version of sigmoid and reduces the problem of vanishing gradients.
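
The vanishing gradient of the sigmoid can be seen numerically. The sketch below uses the standard sigmoid formula and its derivative, sigmoid(x) * (1 - sigmoid(x)); the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, written in terms of its output."""
    s = sigmoid(x)
    return s * (1.0 - s)

# the gradient peaks at 0.25 for x = 0 and shrinks quickly as |x| grows
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

When several such small gradients are multiplied together across layers, the resulting weight updates become tiny.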


Tanh

The hyperbolic tangent function, or tanh, is a scaled and shifted version of sigmoid. Like sigmoid, it is smooth and differentiable. The tanh maps its input to a value in the range of -1 to 1, as shown in the following graph: [Figure: tanh]

The gradients of tanh are larger than those of sigmoid near zero, so tanh suffers less from vanishing gradients. Both sigmoid and tanh produce a nonzero output for almost every input, so every neuron fires all the time, which makes the ANN computationally heavy. The Rectified Linear Unit (ReLU) activation function, explained in the next section, avoids this pitfall by not firing at times.
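
The statement that tanh is a scaled version of sigmoid can be checked numerically using the identity tanh(x) = 2 * sigmoid(2x) - 1. This is a small sketch with illustrative function names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh expressed as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_diff = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in xs)
print(max_diff < 1e-12)   # True

# gradient at zero: tanh'(0) = 1 - tanh(0)^2 = 1, versus 0.25 for sigmoid
tanh_grad_at_zero = 1.0 - math.tanh(0.0) ** 2
print(tanh_grad_at_zero)  # 1.0
```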


ReLU

ReLU lets large positive values pass through unchanged and outputs zero for negative inputs. As a result, some neurons do not fire for a given input. This increases the sparsity of the activations, which is desirable. The ReLU maps input x to max(0, x); that is, it maps negative inputs to 0 and passes positive inputs through without any change, as shown in the following graph: [Figure: relu]

Because ReLU doesn’t fire all the time, it can be trained faster. Since the function is simple, it is also computationally inexpensive. The choice of activation function depends heavily on the application. Nevertheless, ReLU works well for a large range of problems.
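
A minimal sketch of ReLU and the sparsity it produces, using made-up pre-activation values:

```python
def relu(x):
    """Map negative inputs to 0 and pass positive inputs through unchanged."""
    return max(0.0, x)

# made-up pre-activation values for six neurons
pre_activations = [-2.0, -0.5, 0.0, 0.3, 4.0, 100.0]
activations = [relu(z) for z in pre_activations]

# fraction of neurons that did not fire
sparsity = sum(1 for a in activations if a == 0.0) / len(activations)

print(activations)  # [0.0, 0.0, 0.0, 0.3, 4.0, 100.0]
print(sparsity)     # 0.5
```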

Artificial Neural Networks (ANN)

An ANN is a collection of perceptrons and activation functions. The perceptrons are connected to form hidden layers or units. The hidden units form a nonlinear basis that maps the input layer to the output layer, typically in a lower-dimensional space. An ANN is thus a map from input to output, computed by weighted addition of the inputs with biases, followed by activation functions. The values of the weights and biases, along with the architecture, are called the model.

The training process determines the values of these weights and biases. The model values are initialized with random values at the beginning of training. The error is computed using a loss function that compares the model’s output with the ground truth. Based on the loss computed, the weights are tuned at every step. Training is stopped when the error cannot be reduced further. During training, the network learns features that are a better representation than the raw images. The following is a diagram of an artificial neural network, or multi-layer perceptron: [Figure: mlp]

Several inputs x are passed through a hidden layer of perceptrons and summed to produce the output. The universal approximation theorem suggests that such a neural network, given enough hidden units, can approximate any continuous function arbitrarily well. The hidden layer can also be called a dense layer. Every layer can use one of the activation functions described in the previous section. The number of hidden layers and perceptrons can be chosen based on the problem. A few more things are needed to make this multilayer perceptron work for multi-class classification problems. A multi-class classification problem tries to discriminate between more than two categories. We will explore those terms in the following sections.
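
As a small illustration of what a hidden layer adds, the sketch below runs a forward pass through a network with one hidden layer of two ReLU units. The weights are hand-picked rather than learned, and they compute XOR, a function that a single perceptron cannot represent. All names are illustrative.

```python
def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases):
    """One dense layer: each row of `weights` feeds one unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

# Hand-picked (not learned) parameters: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def mlp(inputs):
    hidden = [relu(z) for z in dense(inputs, W1, b1)]
    return dense(hidden, W2, b2)[0]

xor_outputs = [mlp([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```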

one-hot encoding

One-hot encoding is a way to represent the target variables or classes in a classification problem. The target variables can be converted from string labels to one-hot encoded vectors. A one-hot vector has a 1 at the index of the target class and 0 everywhere else. For example, if the target classes are cat and dog, they can be represented by [1, 0] and [0, 1], respectively. For 1,000 classes, each one-hot vector has length 1,000, with all entries zero except for a single 1. One-hot encoding makes no assumptions about the similarity of target variables. Combined with softmax, explained in the following section, one-hot encoding makes multi-class classification possible in an ANN.
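
The cat and dog example can be written as a short helper. The function name `one_hot` is a hypothetical choice for this sketch.

```python
def one_hot(label, classes):
    """Return a list with 1 at the index of `label` and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

classes = ["cat", "dog"]
print(one_hot("cat", classes))  # [1, 0]
print(one_hot("dog", classes))  # [0, 1]

# digit labels, as in a 10-class problem
digit_vector = one_hot(3, list(range(10)))
print(digit_vector)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```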


Softmax

Softmax is a way of forcing a neural network to produce outputs that sum to 1. The output values of the softmax function can therefore be treated as a probability distribution, which is useful in multi-class classification problems. Softmax is a kind of activation function with the special property that its outputs sum to 1. It converts the outputs to probabilities by exponentiating each output and dividing by the sum of the exponentials of all the outputs. The Euclidean distance between the softmax probabilities and the one-hot encoding could be used for optimization, but the cross-entropy explained in the next section is a better cost function to optimize.
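
A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition of this sketch; it does not change the result.

```python
import math

def softmax(scores):
    """Exponentiate each score and divide by the sum of all exponentials."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(abs(sum(probs) - 1.0) < 1e-12)  # True
```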

Cross Entropy

Cross-entropy measures the distance between the outputs of softmax and the one-hot encoding. It is a loss function, so the error it reports has to be minimized. A neural network estimates the probability that the given data belongs to each class, and the probability assigned to the correct target label has to be maximized. Cross-entropy is the sum of the negative logarithmic probabilities weighted by the one-hot target, which reduces to the negative logarithm of the probability assigned to the correct class. The logarithm is used for numerical stability. Maximizing a function is equivalent to minimizing the negative of the same function.
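
The sketch below computes cross-entropy for two made-up probability vectors against the same one-hot target, to show how the loss grows when the correct class receives a low probability. The function and variable names are illustrative.

```python
import math

def cross_entropy(probs, one_hot_target):
    """Negative log-probability summed over classes, weighted by the target."""
    return -sum(t * math.log(p)
                for p, t in zip(probs, one_hot_target) if t > 0)

target = [0, 1, 0]                       # correct class is index 1
confident_right = [0.05, 0.90, 0.05]
confident_wrong = [0.90, 0.05, 0.05]

loss_right = cross_entropy(confident_right, target)
loss_wrong = cross_entropy(confident_wrong, target)
print(round(loss_right, 4))  # 0.1054
print(round(loss_wrong, 4))  # 2.9957
```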

In the next section, we will see the following regularization methods to avoid the overfitting of ANN:

  1. Dropout
  2. Batch Normalization
  3. L1 and L2 Regularization

Avoid Overfitting


Dropout

Dropout is an effective way of regularizing neural networks to avoid overfitting. During training, the dropout layer cripples the neural network by removing hidden units stochastically, as shown in the following image: [Figure: dropout]

Note how only a random subset of the neurons is trained at each step. Dropout is also an efficient way of combining several neural networks. For each training case, we randomly select a few hidden units so that we end up with a different architecture for each case. This is an extreme case of bagging and model averaging. The dropout layer should be disabled during inference, where all units are used.
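
A sketch of how a dropout layer might behave in training and at inference. It assumes "inverted dropout", in which surviving activations are scaled by 1 / (1 - p) so that their expected value is unchanged; the function name and the fixed seed are choices made for this illustration.

```python
import random

def dropout(activations, p=0.2, training=True, rng=random):
    """Zero each activation with probability p during training.

    Survivors are scaled by 1 / (1 - p) (inverted dropout, an assumption
    of this sketch) so the expected activation stays the same.
    """
    if not training:
        return list(activations)          # inference: pass through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                     # fixed seed for repeatability
hidden = [1.0] * 10
print(dropout(hidden, p=0.2, training=True, rng=rng))
print(dropout(hidden, p=0.2, training=False))
```

Each training call draws a fresh random mask, which is what produces a different thinned network for every training case.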

Batch Normalization

Batch normalization, or batch-norm, increases the stability and performance of neural network training. It normalizes the output from a layer to have zero mean and a standard deviation of 1, computed over each mini-batch. This can reduce overfitting and makes the network train faster. It is very useful in training complex neural networks.
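
The normalization step itself can be sketched as follows. This is a simplification: it omits the learnable scale and shift parameters and the running statistics that a full batch-norm layer typically keeps, and the small `eps` term is an assumed constant to avoid division by zero.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [2.0, 4.0, 6.0, 8.0]              # made-up activations of one unit
normalized = batch_norm(batch)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print([round(v, 3) for v in normalized])
print(abs(norm_mean) < 1e-9, abs(norm_var - 1.0) < 1e-4)  # True True
```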

L1 and L2 regularization

L1 regularization penalizes the absolute value of the weights and tends to drive weights to exactly zero. L2 regularization penalizes the squared value of the weights and tends to make the weights smaller during training. Both regularizers assume that models with smaller weights are better.
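
The two penalty terms can be written down directly. In the sketch below, `lam` is a hypothetical regularization strength and the weights are made up; in practice the penalty is added to the loss before the gradients are computed.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): the pull toward zero has the same size for every nonzero weight."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lam * sum(w^2): the pull toward zero shrinks as the weight shrinks."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 1.5]   # made-up weights
lam = 0.01                        # hypothetical regularization strength

print(round(l1_penalty(weights, lam), 4))  # 0.04
print(round(l2_penalty(weights, lam), 4))  # 0.065
```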

Training Neural Networks

Training an ANN is tricky because it contains many parameters to optimize. The procedure for working out how each weight should be updated is called backpropagation. The procedure for minimizing the error is called optimization.


Backpropagation

The backpropagation algorithm is commonly used for training artificial neural networks. The weights are updated from the output layer backward, based on the calculated error, as shown in the following image: [Figure: backpropagation]

After calculating the error, gradient descent can be used to calculate the weight updates, as explained in the next section.
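
To make the backward flow concrete, the sketch below applies the chain rule to a single sigmoid neuron with a squared-error loss and checks the result against a finite-difference estimate. The choice of neuron, loss and input values is an assumption made for illustration; a real network repeats the same chain-rule step layer by layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2

def backward(w, b, x, y):
    """Chain rule, applied from the loss back to the parameters."""
    a = sigmoid(w * x + b)            # forward pass
    dloss_da = 2.0 * (a - y)          # d(loss)/d(activation)
    da_dz = a * (1.0 - a)             # d(activation)/d(pre-activation)
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz     # d(loss)/dw, d(loss)/db

w, b, x, y = 0.5, -0.3, 2.0, 1.0      # made-up parameter, input and target
grad_w, grad_b = backward(w, b, x, y)

# finite-difference check of the analytic gradient
h = 1e-6
numeric_w = (loss_fn(w + h, b, x, y) - loss_fn(w - h, b, x, y)) / (2 * h)
numeric_b = (loss_fn(w, b + h, x, y) - loss_fn(w, b - h, x, y)) / (2 * h)
print(abs(grad_w - numeric_w) < 1e-6, abs(grad_b - numeric_b) < 1e-6)  # True True
```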


Gradient Descent

The gradient descent algorithm performs multidimensional optimization. The objective is to reach the global minimum of the error. Gradient descent is a popular optimization technique used in many machine-learning models to improve or optimize the model’s predictions. One implementation of gradient descent, called stochastic gradient descent (SGD) and explained in the next section, is becoming more popular in neural networks. Optimization involves calculating the error value and changing the weights to reach the minimal error. The direction in which to search for the minimum is the negative of the gradient of the loss function. The gradient descent procedure is shown qualitatively in the following figure: [Figure: gradientDescent]

The learning rate determines how big each step should be. Note that the loss surface of an ANN with nonlinear activations will have local minima. SGD works better in practice for optimizing non-convex cost functions.
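
A minimal sketch of gradient descent on a one-dimensional convex loss, (w - 3)^2, which has its minimum at w = 3. The loss function, starting point and learning rate are arbitrary choices for illustration.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)   # move against the gradient

print(round(w, 4))        # 3.0
print(round(loss(w), 8))  # 0.0
```

With this loss, a learning rate above 1.0 would make each step overshoot by more than it corrects, so the iterates would diverge instead of converging.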


Stochastic Gradient Descent

SGD is the same as gradient descent, except that it uses only part of the data for each training step. The number of examples used per step is a parameter called the mini-batch size. Theoretically, even a single example can be used for each step. In practice, it is better to experiment with various sizes. Visit to see a great visualization of gradient descent on convex and non-convex surfaces.
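
The effect of the mini-batch size can be tried out on a toy problem. The sketch below fits a single weight to synthetic data generated from y = 2x; the data, learning rate, epoch count and function name are all assumptions made for illustration.

```python
import random

rng = random.Random(0)   # fixed seed so the shuffling is repeatable

# synthetic data for y = 2x (an assumption made purely for illustration)
xs = [i / 100.0 for i in range(1, 101)]
data = [(x, 2.0 * x) for x in xs]

def sgd(data, batch_size, lr=0.1, epochs=20):
    """Fit y = w * x by mini-batch stochastic gradient descent."""
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # gradient of the mean squared error over this mini-batch only
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

w_batch_1 = sgd(data, batch_size=1)
w_batch_10 = sgd(data, batch_size=10)
print(round(w_batch_1, 3), round(w_batch_10, 3))
```

Both runs should end close to the true weight of 2; the run with a batch size of 1 makes ten times as many updates per epoch, each based on a single example.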