A CNN (convolutional neural network) is mainly used for image recognition. It first learns to recognize the components of an image (e.g. lines, corners, curves, shapes, textures) and then learns to combine these components to recognize larger structures (e.g. faces, objects).
Layers in CNN
1. Convolutional Layer
2. ReLU Layer
3. Pooling Layer
4. Normalization Layer
5. Fully connected Layer
Computers see an input image as an array of pixels. The numerical representation of the pixels is processed through the many layers of a CNN. Each input image passes through a series of hidden layers, such as convolutional layers with filters (kernels), ReLU layers, pooling layers and fully connected layers. These hidden layers perform feature extraction from the image.
The convolutional layer is the first layer to extract features from an input image. It uses a matrix filter and performs a convolution operation to detect patterns in the image. Convolving an image with different filters can perform operations such as edge detection, blurring and sharpening.
Convolution is a mathematical operation between two matrices (the image matrix and a filter, or kernel) that produces a third matrix as output (the convolved matrix). This output is also called the feature map matrix.
Filters (Kernels): Filters act as pattern detectors. They help find edges, curves, corners, textures, colors, and dark and light areas in the image, among many other details. Kernels slide over the entire image to extract its different components or patterns. In the initial convolutional layers the filters learn to extract simple features; in deeper layers they become more sophisticated and detect complex patterns.
We slide this filter over the input matrix and get an output with smaller dimensions.
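The sliding operation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `convolve2d`, the 4 X 4 image and the hand-picked vertical-edge filter are all assumptions made for the example (a real CNN learns its filter values). Like most deep learning libraries, the sketch slides the filter without flipping it, which is strictly speaking cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            # Multiply the kernel element-wise with the patch under it, then sum.
            total = 0
            for a in range(k_rows):
                for b in range(k_cols):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4 x 4 image: dark pixels (0) with one bright column (1) on the right.
image = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

# Hand-picked vertical-edge filter (an assumption for illustration only).
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 3], [0, 3]]
```

The filter responds (value 3) only at the positions where it covers the dark-to-bright transition, and stays at 0 over the uniform region.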
Formula: Suppose the input matrix is n X n and the filter is f X f. Then, with no padding and a stride of 1, the output matrix is (n - f + 1) X (n - f + 1). With n = 4 and f = 3, the output matrix comes out to be 2 X 2.
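As a quick sanity check of the formula, the sketch below compares it with a direct count of the positions where the filter fits along one side of the input. The helper names `output_size` and `count_positions` and the extra (n, f) pairs are hypothetical, chosen only for this example.

```python
def output_size(n, f):
    """Output side length for an n x n input and an f x f filter (no padding, stride 1)."""
    return n - f + 1


def count_positions(n, f):
    """Count the starting offsets at which an f-wide filter still fits in an n-wide input."""
    return sum(1 for start in range(n) if start + f <= n)


for n, f in [(4, 3), (5, 3), (28, 5)]:
    size = output_size(n, f)
    # The formula and the direct count should always agree.
    assert size == count_positions(n, f)
    print(f"{n} x {n} input, {f} x {f} filter -> {size} x {size} output")
```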
Padding: The input shrinks from 4 X 4 to 2 X 2 after one convolution with a 3 X 3 filter. This can be a problem, because we may lose information about the edges and corners of the image. To preserve this information, we should use padding.
Types of Padding: We have two types of padding: Zero Padding and Valid Padding (no padding).
1. Zero Padding: Pad the image with zeros so that we don't lose any information about edges and corners.
In the above image, we have padded the input with a one-pixel border of zeros, turning the 4 X 4 input into 6 X 6. Now, if we slide a 3 X 3 filter over it, we get a 4 X 4 output matrix (no reduction in dimensions) instead of 2 X 2.
2. Valid Padding: Drop the part of the image where the filter does not fit. This is called valid padding because it keeps only the valid part of the image. In this case, we accept losing some edge information. We get only a 2 X 2 matrix in the above example. We can take this approach if we know that the information at the edges is not very useful and can safely be ignored.
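The two padding options can be compared with a short sketch. The helpers `zero_pad` and `padded_output_size` and the sample pixel values are assumptions for illustration. With a border of p zeros on every side, the output side length becomes n + 2p - f + 1, so p = 0 corresponds to valid padding.

```python
def zero_pad(image, p):
    """Surround `image` with a border of zeros that is `p` pixels wide."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]          # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)  # left and right borders
    padded.extend([0] * width for _ in range(p))      # bottom border
    return padded


def padded_output_size(n, f, p):
    """Output side length for an n x n input, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1


# Hypothetical 4 x 4 input.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

padded = zero_pad(image, 1)
for row in padded:
    print(row)

print(padded_output_size(4, 3, 1))  # zero padding: 4
print(padded_output_size(4, 3, 0))  # valid padding: 2
```

For an odd filter size f, choosing p = (f - 1) / 2 keeps the output the same size as the input.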
ReLU stands for Rectified Linear Unit, a non-linear operation. Its output is ƒ(x) = max(0, x). ReLU’s main purpose is to introduce non-linearity into the ConvNet. It operates element-wise and sets negative values to zero.
The ReLU function is applied to the output matrix (feature map matrix) of the convolutional layer and converts it into a rectified feature map matrix.
We can also use other activation functions such as tanh and sigmoid, but ReLU generally performs better in many scenarios. So, by default, we choose ReLU over other activation functions.
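A minimal sketch of the element-wise ReLU step, with sigmoid and tanh printed alongside for comparison. The feature map values and the helper names `relu` and `sigmoid` are made up for this example.

```python
import math


def relu(feature_map):
    """Apply max(0, x) to every element of a 2-D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


def sigmoid(x):
    """Logistic sigmoid, squashing x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical feature map containing negative values.
feature_map = [
    [3, -2, 0],
    [-5, 4, -1],
    [1, -7, 6],
]

rectified = relu(feature_map)
print(rectified)  # [[3, 0, 0], [0, 4, 0], [1, 0, 6]]

# Compare the three activation functions on a few sample inputs.
for x in (-2.0, 0.0, 2.0):
    print(x, max(0.0, x), round(sigmoid(x), 3), round(math.tanh(x), 3))
```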
Stride: Stride is the number of pixels by which the filter shifts over the input matrix. In other words, it is how many pixels we want the filter to move at each step as it slides across the image.
When the stride is 1, we move the filter 1 pixel at a time. When the stride is 2, we move the filter 2 pixels at a time, and so on. We will use this concept in the pooling layer.
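To make the effect of stride concrete, the sketch below lists the offsets a filter visits along one side of the input and checks them against the general output-size formula floor((n + 2p - f) / s) + 1. The helper names and the 7-pixel example are assumptions for illustration.

```python
def filter_positions(n, f, stride):
    """Starting offsets visited by an f-wide filter along one side of an n-wide input."""
    return list(range(0, n - f + 1, stride))


def strided_output_size(n, f, stride, p=0):
    """Output side length for input n, filter f, stride `stride` and padding p."""
    return (n + 2 * p - f) // stride + 1


print(filter_positions(7, 3, 1))  # [0, 1, 2, 3, 4]
print(filter_positions(7, 3, 2))  # [0, 2, 4]

# The number of visited positions matches the formula.
for stride in (1, 2, 3):
    assert len(filter_positions(7, 3, stride)) == strided_output_size(7, 3, stride)
```

A larger stride visits fewer positions, so the output shrinks faster.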
A pooling layer is added after a convolutional layer. The output of the convolutional layer acts as the input to the pooling layer. The pooling layer down-samples the image, reducing dimensionality while retaining important information. In this way, memory requirements are also reduced.
It summarizes the features already detected (edges, corners etc.) in each region of the feature map, which makes them less sensitive to their exact position.
It converts the rectified feature map matrix into a pooled feature map matrix.
1. Max Pooling: It takes the maximum value from each pooling window of the rectified feature map.
2. Min Pooling: It takes the minimum value from each pooling window of the rectified feature map.
3. Average Pooling: It takes the average of all the elements in each pooling window of the rectified feature map.
4. Sum Pooling: It takes the sum of all the elements in each pooling window of the rectified feature map.
In the above image, we calculate the max value of each block (see orange blocks) to create a new output matrix with lower dimensions than the original input matrix. In this way, we retain the strongest responses and discard the rest.
We can also specify a padding parameter in the pooling layer, just as in the convolutional layer.
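The four pooling variants listed above differ only in how each window is reduced to a single number. A minimal sketch, assuming a 2 X 2 window with a stride of 2 and no padding; the function name `pool2d`, its parameters and the sample values are hypothetical.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window of `feature_map` to one value."""
    reducers = {
        "max": max,
        "min": min,
        "sum": sum,
        "average": lambda values: sum(values) / len(values),
    }
    reduce_window = reducers[mode]
    n_rows, n_cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, n_rows - size + 1, stride):
        row = []
        for j in range(0, n_cols - size + 1, stride):
            # Collect the values inside the current window.
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 rectified feature map.
rectified = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

print(pool2d(rectified, mode="max"))      # [[6, 2], [2, 9]]
print(pool2d(rectified, mode="min"))      # [[1, 0], [0, 3]]
print(pool2d(rectified, mode="average"))  # [[3.5, 1.0], [1.0, 6.0]]
print(pool2d(rectified, mode="sum"))      # [[14, 4], [4, 24]]
```

In every case the 4 X 4 input is reduced to 2 X 2, which is the down-sampling effect described above.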
Advantages of Pooling Layer
1. Reduces the resolution and dimensions and hence reduces computational complexity.
2. It also helps in reducing overfitting.
Normalization is a technique used to improve the performance and stability of neural networks. It rescales the inputs so that their mean is zero and their standard deviation is one.
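A minimal sketch of this rescaling on a flat list of values. The function name `standardize`, the sample activations and the small `eps` term (added here only to avoid division by zero when all values are equal) are assumptions for illustration.

```python
from statistics import fmean, pstdev


def standardize(values, eps=1e-8):
    """Rescale `values` to (approximately) zero mean and unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    # eps is an illustrative safeguard against dividing by zero.
    return [(value - mean) / (std + eps) for value in values]


# Hypothetical activations with mean 5 and standard deviation 2.
activations = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = standardize(activations)

print(round(fmean(normalized), 6))   # mean, close to 0
print(round(pstdev(normalized), 6))  # standard deviation, close to 1
```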
Fully Connected Layer
A fully connected layer connects every neuron in one layer to all the neurons in the next layer. We flatten the pooled feature map matrix into a vector and then feed that vector into a fully connected layer.
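The flatten-then-connect step can be sketched as follows. The helper names, the pooled values, and the weights and biases are all made up for this example; in a trained network the weights and biases are learned.

```python
def flatten(matrix):
    """Convert a 2-D matrix into a 1-D vector, row by row."""
    return [value for row in matrix for value in row]


def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of all inputs plus its bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical 2 x 2 pooled feature map.
pooled = [[6, 2], [2, 9]]
vector = flatten(pooled)
print(vector)  # [6, 2, 2, 9]

# Two output neurons, each with one weight per input (illustrative values).
weights = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, -1.0, 0.0],
]
biases = [0.0, 1.0]

outputs = fully_connected(vector, weights, biases)
print(outputs)  # [7.5, 1.0]
```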
Hyperparameters in CNN
1. Number of convolutional layers
2. Number of kernels / filters in a convolutional layer
3. Kernel / Filter size in a convolutional layer
4. Padding in a convolutional layer (zero or valid padding)
For a detailed list of hyperparameters in a neural network, please go through my post on this topic.
1. Provide the input image to the convolutional layer.
2. Choose parameters: apply filters with strides, and padding if required.
3. Perform convolution on the image. The output of this layer is called the feature map matrix.
4. Apply ReLU activation to the matrix. The output of this layer is called the rectified feature map matrix.
5. Perform pooling to reduce dimensionality. The output of this layer is called the pooled feature map matrix.
6. Add as many convolutional layers as needed.
7. Flatten the output (convert the pooled feature map matrix to a vector) and feed it into a fully connected layer.
8. Output the class using an activation function (logistic regression with a cost function) to classify the image.
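The steps above can be strung together in a small forward pass. This is only a sketch under stated assumptions: one convolutional layer, a hand-picked vertical-edge filter, 2 X 2 max pooling, a single output neuron with a sigmoid (logistic) activation, and made-up weights and bias. Nothing here is trained, and all function names are hypothetical.

```python
import math


def convolve2d(image, kernel):
    """Step 3: slide the kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(size)] for i in range(size)]


def relu(matrix):
    """Step 4: set negative values to zero."""
    return [[max(0, value) for value in row] for row in matrix]


def max_pool(matrix, size=2, stride=2):
    """Step 5: keep the maximum of each size x size window."""
    steps = range(0, len(matrix) - size + 1, stride)
    return [[max(matrix[i + a][j + b] for a in range(size) for b in range(size))
             for j in steps] for i in steps]


def flatten(matrix):
    """Step 7: convert the pooled matrix into a vector."""
    return [value for row in matrix for value in row]


def sigmoid(x):
    """Step 8: logistic activation producing a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(image, kernel, weights, bias):
    """Run one image through convolution, ReLU, pooling, flatten and one output neuron."""
    pooled = max_pool(relu(convolve2d(image, kernel)))
    vector = flatten(pooled)
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return sigmoid(score)


# Hypothetical 6 x 6 inputs: one with a vertical edge, one completely flat.
edge_image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
flat_image = [[0] * 6 for _ in range(6)]

# Illustrative (untrained) parameters.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
weights = [0.25, 0.25, 0.25, 0.25]
bias = -1.5

p_edge = forward(edge_image, vertical_edge, weights, bias)
p_flat = forward(flat_image, vertical_edge, weights, bias)
print(p_edge > 0.5, p_flat > 0.5)  # True False
```

With these illustrative parameters, the output is above 0.5 for the image that contains a vertical edge and below 0.5 for the flat image. Step 6 (stacking more convolutional layers) would repeat the convolution, ReLU and pooling calls before flattening.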