How to build a CNN in PyTorch?
Glossary of terms used in CNN:
- convolutional neural network
- convolutional layers
- pooling layers
To define a neural network in PyTorch, you define the layers of a model in the function __init__ and define the forward behavior of the network, which applies those initialized layers to an input (x), in the function forward. In PyTorch, all inputs are converted to the Tensor data type, a multi-dimensional array similar to a nested Python list.
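As a quick sketch of the Tensor conversion mentioned above (the values here are arbitrary, just a toy 1x2x2 "image"):

```python
import torch

# Convert a nested Python list into a PyTorch tensor
x = torch.tensor([[[1.0, 2.0],
                   [3.0, 4.0]]])
print(x.shape)  # torch.Size([1, 2, 2])
print(x.dtype)  # torch.float32
```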
To create a convolutional layer in PyTorch, you must first import the necessary module:
import torch.nn as nn
import torch.nn.functional as F
Then, there is a two-part process: defining a convolutional layer, and defining the feedforward behavior of the model (how an input moves through the layers of the network).
First, you must define a Model class and fill in two functions.
__init__: You can define a convolutional layer in the init function using the following format:
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
forward: Then, you refer to that layer in the forward function! Here, I am passing in an input image x and applying a ReLU function to the output of this layer.
x = F.relu(self.conv1(x))
- in_channels: The number of inputs (in depth), 3 for an RGB image, for example.
- out_channels: The number of output channels, i.e. the number of filtered “images” a convolutional layer is made of, or equivalently the number of unique convolutional kernels that will be applied to an input.
- kernel_size: Number specifying both the height and width of the (square) convolutional kernel.
- stride: The stride of the convolution. If you don’t specify anything, stride is set to 1.
- padding: The border of 0’s around an input array. If you don’t specify anything, padding is set to 0.
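Putting the two parts together, a minimal model might look like the sketch below. The class name Net and the specific channel counts and input size are illustrative choices, not requirements:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 input channels (RGB), 16 filters, 3x3 kernel;
        # padding=1 keeps the height and width unchanged
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16,
                               kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # apply the convolutional layer, then a ReLU activation
        x = F.relu(self.conv1(x))
        return x

net = Net()
out = net(torch.randn(1, 3, 32, 32))  # a batch of one 32x32 RGB image
print(out.shape)  # torch.Size([1, 16, 32, 32])
```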
There are many other tunable arguments that you can set to change the behavior of your convolutional layers. To read more about these, you can check the official PyTorch documentation for nn.Conv2d.
How to define Pooling Layers?
Pooling layers take in a kernel_size and a stride. Typically, these are set to the same value, which acts as the down-sampling factor. For example, the following code will down-sample an input’s x-y dimensions by a factor of 2:
self.pool = nn.MaxPool2d(2,2)
Then in the forward function, you need to have:
x = F.relu(self.conv1(x))
x = self.pool(x)
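To make the effect of these two steps concrete, here is a small shape check, assuming a 28x28 RGB input and 16 filters (both illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # padding=1 preserves H and W
pool = nn.MaxPool2d(2, 2)                           # halves H and W

x = torch.randn(1, 3, 28, 28)  # a batch of one 28x28 RGB image
x = F.relu(conv1(x))
print(x.shape)  # torch.Size([1, 16, 28, 28])
x = pool(x)
print(x.shape)  # torch.Size([1, 16, 14, 14])
```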
A Convolutional Neural Network is a special kind of neural network in that it can retain spatial information. Traditional neural networks like MLPs only look at individual inputs, but CNNs look at the image as a whole, or in patches, and analyse groups of pixels at a time. The key to preserving this spatial information is something called the Convolutional Layer. A Convolutional Layer applies a series of different filters (also known as convolutional kernels) to an input image. The resulting filtered images have different appearances: some detect edges, others detect the different colors that make up the different classes of an image.
When we talk about spatial patterns in an image, we are referring to color or shape.
CNNs are a kind of deep learning model that can learn to do things like image classification and object recognition. They keep track of spatial information and learn to extract features like the edges of objects in something called a convolutional layer. Below you’ll see a simple CNN structure made of multiple layers, including this “convolutional layer”.
The convolutional layer is produced by applying a series of many different image filters, also known as convolutional kernels, to an input image.
In the example shown, 4 different filters produce 4 differently filtered output images. When we stack these images, we form a complete convolutional layer with a depth of 4.
For RGB images, you will have a 3D filter (a stack of three 2D filters), as shown below:
So, initially you start with an RGB image (a stack of three 2D channels) and a 3D filter. The 3D filter is then convolved with the RGB image, which results in a feature map.
Now, if you extend this to multiple filters, then you will end up with multiple feature maps in the next layer. As shown here:
Now this is where it starts getting interesting:
You can think of each of the feature maps in a convolutional layer along the same lines as an image channel and stack them to get a 3D array.
Then, we can use this stacked 3D array of feature maps as input to another Convolutional Layer to discover patterns within the patterns discovered in the first convolutional layer.
Note: the number of filters determines the depth of the convolutional layer.
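This depth rule is easy to verify in PyTorch. In the sketch below, out_channels=4 means 4 filters, so the output has 4 feature maps (the 10x10 input size is an arbitrary choice):

```python
import torch
import torch.nn as nn

# 4 filters applied to a 3-channel input -> 4 feature maps
conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)
out = conv(torch.randn(1, 3, 10, 10))
print(out.shape)  # torch.Size([1, 4, 8, 8]) -- depth 4 = number of filters
```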
You can control the behaviour of a Convolutional Layer by specifying the number of filters and the size of the filters. For instance, to increase the number of nodes in a convolutional layer, you increase the number of filters; to increase the size of the detected patterns, you increase the size of your filter. But there are even more hyperparameters that you can tune. One of these is the stride of the convolution, which is just the amount by which the filter slides over the image; the default stride is 1. As the image below shows, a stride of 1 results in a convolutional layer with roughly the same width and height as the input image.
In the next image, we show the same, but with stacked feature maps.
If instead we set the stride to 2, the convolutional layer will shrink to roughly half the width and height of the image, as shown. (We say roughly because it depends on what you do at the edges of the image.)
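The stride-1 versus stride-2 behaviour described above can be checked directly (a 32x32 RGB input and 8 filters are arbitrary choices here):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

conv_s1 = nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1)
conv_s2 = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)

print(conv_s1(x).shape)  # torch.Size([1, 8, 32, 32]) -- roughly the same size
print(conv_s2(x).shape)  # torch.Size([1, 8, 16, 16]) -- roughly halved
```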
Need for padding
Consider this example, where we have a filter of size 2 and a stride of 2 pixels.
So, how do we deal with the nodes where the filter extends outside the image?
One approach is to discard them. If we choose this option, our convolutional layer will have no information about some regions of the image; in the example shown below, it loses the right and bottom pixels.
A second approach is to pad the right and bottom of the image with zeros. This gives the filter enough room to move so that it can make use of all the pixels, as shown below.
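A small sketch of these two options, using the filter size 2 and stride 2 from the example above on a hypothetical 5x5 single-channel image (note that nn.Conv2d's padding argument pads symmetrically, so F.pad is used here to pad only the right and bottom):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(1, 1, kernel_size=2, stride=2)  # filter size 2, stride 2
x = torch.randn(1, 1, 5, 5)                      # a 5x5 single-channel image

# Option 1: no padding -- the last row and column are never covered
out_nopad = conv(x)
print(out_nopad.shape)  # torch.Size([1, 1, 2, 2])

# Option 2: pad one column of zeros on the right and one row on the bottom,
# so the filter can make use of every pixel
x_padded = F.pad(x, (0, 1, 0, 1))  # (left, right, top, bottom)
out_pad = conv(x_padded)
print(out_pad.shape)  # torch.Size([1, 1, 3, 3])
```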
The second type of layer in a CNN is the Pooling Layer. Pooling layers often take convolutional layers as input. Recall that a convolutional layer is a stack of feature maps, with one feature map for each filter. A complicated dataset with many classes will require a large number of filters, each responsible for finding a pattern in the image. More filters means a bigger stack, which means the dimensionality of our convolutional layers can get large. Higher dimensionality means more parameters to learn, which can lead to overfitting. Thus, we need a method for reducing this dimensionality. This is the role of pooling layers within a Convolutional Neural Network. (Answer to the question: why do we need pooling layers in CNNs?)
Max Pooling layers
Max pooling layers take a stack of feature maps as input. As with convolutional layers, we need a window size and a stride. In the example shown below, a window size of 2x2 and a stride of 2 are used to illustrate the idea:
You move the window over each feature map and pick the maximum value in the window. The output of this operation is a stack with the same number of feature maps, but each feature map is reduced in width and height.
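The max operation itself can be seen on a tiny hand-written 4x4 feature map (the values are arbitrary, chosen so each 2x2 window's maximum is easy to spot):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# One 4x4 feature map (shape: batch=1, channels=1, 4, 4)
fmap = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                       [ 3.,  4.,  7.,  8.],
                       [ 9., 10., 13., 14.],
                       [11., 12., 15., 16.]]]])

# Each non-overlapping 2x2 window is replaced by its maximum value
pooled = pool(fmap)
print(pooled)
# tensor([[[[ 4.,  8.],
#           [12., 16.]]]])
```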
The job of a CNN is to discover patterns contained in an image. A sequence of layers is responsible for this discovery.