Convolutional layers
This section explains how convolutional layers work in greater depth. At a basic level, convolutional layers are nothing more than a set of filters. When you look at the world through glasses with a red tint, everything appears to have a red hue. Now, imagine glasses with different tinted patterns embedded in them, say a red tint combined with one or more horizontal green stripes. A pair of glasses like this would highlight certain aspects of the scene in front of you: any part of the scene that contained a horizontal green line would stand out.
Convolutional layers apply a selection of patches (or convolutions) over the previous layer's output. In a face recognition task, for example, the first layer's patches identify basic features in the image, such as an edge or a diagonal line. The patches are moved across the image to match different parts of it. Here is an example of a 3 x 3 convolutional block applied across a 6 x 6 image:
The values in the convolutional block are multiplied element by element with the values in the image patch (that is, this is not matrix multiplication), and the results are summed to give a single value. Here is an example:
In this example, our convolutional block is a diagonal pattern. The first block in the image (A1:C3) is also a diagonal pattern, so when we multiply the elements and sum them, we get a relatively large value of 6.3. In comparison, the second block in the image (D4:F6) is a horizontal line pattern, so we get a much smaller value.
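To make this concrete, here is a minimal sketch in base R of sliding a 3 x 3 diagonal block over a 6 x 6 image and computing the element-wise multiply-and-sum at each position. The image values are random here, so the numbers will not match the figure; only the mechanics are the same.

```r
# A 3 x 3 convolutional block with a diagonal pattern
conv_block <- matrix(c(1, 0, 0,
                       0, 1, 0,
                       0, 0, 1), nrow = 3, byrow = TRUE)

# An illustrative 6 x 6 "image" of random values
set.seed(42)
img <- matrix(runif(36), nrow = 6, ncol = 6)

# Apply the block to the first patch (rows 1-3, columns 1-3):
# element-wise multiplication followed by a sum gives a single output value
sum(conv_block * img[1:3, 1:3])

# Slide the block across the whole image to build the 4 x 4 output
out <- matrix(0, nrow = 4, ncol = 4)
for (i in 1:4) {
  for (j in 1:4) {
    out[i, j] <- sum(conv_block * img[i:(i + 2), j:(j + 2)])
  }
}
out
```

Patches of the image that resemble the diagonal pattern produce large values in the output; patches that do not resemble it produce small values.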
It can be difficult to visualize how convolutional layers work across an entire image, so the following R Shiny application shows it more clearly. The application is included in the code for this book in the Chapter5/server.R file. Open this file in RStudio and select Run App. Once the application is loaded, select Convolutional Layers from the menu bar on the left. The application loads the first 100 images from the MNIST dataset, which we will use later for our first deep learning image classification task. The images are 28 x 28 grayscale images of handwritten digits from 0 to 9. Here is a screenshot of the application with the fourth image selected, which is a four:
You can then use the slider to browse through the images. In the top-right corner, there are four convolutional filters that can be applied to the image. In the previous screenshot, a horizontal line filter is selected, and we can see its values in the text box in the top-right corner. When we apply this filter to the input image on the left, the resulting image on the right is almost entirely grey, except where the horizontal line was in the original image. Our convolutional filter has matched the parts of the image that contain a horizontal line. If we change the convolutional filter to a vertical line, we get the following result:
Now, we can see that, after the convolution is applied, the vertical lines in the original image are highlighted in the resulting image on the right. In effect, applying these filters is a type of feature extraction. I encourage you to use the application to browse through the images and see how the different convolutions respond to images from the different categories.
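If you want to reproduce this effect outside the Shiny application, the following sketch applies a horizontal-line filter to an MNIST digit. It assumes the keras package is installed so that we can load the data with dataset_mnist(); the filter values here are illustrative and are not necessarily the ones used in the application.

```r
library(keras)

mnist <- dataset_mnist()
img <- mnist$train$x[4, , ] / 255   # the fourth training image, scaled to [0, 1]

# An illustrative horizontal-line filter
horizontal <- matrix(c(-1, -1, -1,
                        2,  2,  2,
                       -1, -1, -1), nrow = 3, byrow = TRUE)

# Slide the filter over the image, multiplying element-wise and summing
convolve2d <- function(image, k) {
  n <- nrow(image) - nrow(k) + 1
  m <- ncol(image) - ncol(k) + 1
  out <- matrix(0, n, m)
  for (i in 1:n) {
    for (j in 1:m) {
      out[i, j] <- sum(k * image[i:(i + nrow(k) - 1), j:(j + ncol(k) - 1)])
    }
  }
  out
}

filtered <- convolve2d(img, horizontal)

# Display the result; the brightest pixels are where horizontal lines were matched
image(t(apply(filtered, 2, rev)), col = grey.colors(256))
```

Transposing the filter with t(horizontal) gives a vertical-line version, which highlights the vertical strokes of the digit instead.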
This is the basis of convolutional filters, and while it is a simple concept, it becomes powerful when you start doing two things:
- Combining many convolutional filters to create convolutional layers
- Applying another set of convolutional filters (that is, a convolutional layer) to the output of a previous convolutional layer
This may take some time to get your head around. If I apply a filter to an image and then apply another filter to that output, what do I get? And what if I apply a filter a third time, that is, to the output of the second filter? The answer is that each subsequent layer combines features identified by the previous layers to find even more complicated patterns, such as corners, arcs, and so on. Later layers find even richer features, such as a circle with an arc over it, which might indicate the eye of a person.
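As a rough sketch of what stacking looks like in code, here is a small model built with the keras R package (an assumption on our part; the layer sizes are arbitrary and chosen only for illustration). Each layer_conv_2d call applies its set of filters to the feature maps produced by the layer before it.

```r
library(keras)

model <- keras_model_sequential() %>%
  # layer 1: filters applied to the raw image find edges and lines
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  # layer 2: filters applied to layer 1's output find corners, arcs, and so on
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  # layer 3: filters applied to layer 2's output find richer, more complex shapes
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu")

summary(model)
```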
There are two parameters that control how the convolution is applied: padding and strides. In the earlier example, the original image is of size 6 x 6, but there are only 4 x 4 positions at which the 3 x 3 block can be placed, so we have reduced the data representation from a 6 x 6 matrix to a 4 x 4 matrix. In general, when we apply a convolution of size c1 x c2 to data of size n x m, the output will be of size (n - c1 + 1) x (m - c2 + 1). If we want the output to be the same size as the input, we can pad the input by adding zeros to the borders of the image. For the previous example, we add a 1-pixel border around the entire image. The following diagram shows how the first 3 x 3 convolution would be applied to the image with padding:
The second parameter we can apply to convolutions is strides, which controls how far the convolution moves at each step. The default is 1, which means the convolution moves by one pixel each time, first across and then down. In practice, this value is rarely changed, so we will not consider it further.
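The following sketch (again assuming the keras R package) shows how the two parameters affect the output size. With no padding ("valid"), a 3 x 3 convolution over a 6 x 6 input produces a 4 x 4 output (6 - 3 + 1); with "same" padding, the zero border keeps the output at 6 x 6; and a stride of 2 roughly halves each spatial dimension.

```r
library(keras)

input <- layer_input(shape = c(6, 6, 1))

# no padding: output is (6 - 3 + 1) x (6 - 3 + 1) = 4 x 4
no_pad  <- input %>% layer_conv_2d(filters = 1, kernel_size = c(3, 3), padding = "valid")

# zero padding ("same"): output stays at 6 x 6
padded  <- input %>% layer_conv_2d(filters = 1, kernel_size = c(3, 3), padding = "same")

# strides of 2: the convolution moves 2 pixels at a time, so the output is 3 x 3
strided <- input %>% layer_conv_2d(filters = 1, kernel_size = c(3, 3),
                                   padding = "same", strides = c(2, 2))

# Printing the tensors shows their shapes: (4, 4, 1), (6, 6, 1), and (3, 3, 1)
no_pad; padded; strided
```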
We now know that convolutions act like small feature generators, that they are applied across an input layer (which is the image data for the first layer), and that subsequent convolutional layers find even more complicated features. But how are they calculated? Do we need to carefully craft a set of convolutions by hand to apply to our model? The answer is no; these convolutions are calculated for us automatically through the magic of the gradient descent algorithm. The best patterns are found after many iterations through the training dataset.
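To see that the filters really are just weights that the network learns, you can inspect them directly. The following sketch (again assuming the keras R package) creates a single convolutional layer and looks at the shape of its kernel; these values are initialized randomly and updated by gradient descent during training.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), input_shape = c(28, 28, 1))

kernel <- get_weights(model)[[1]]
dim(kernel)   # 3 3 1 32: thirty-two 3 x 3 filters over one input channel

# 32 * (3 * 3 * 1) kernel weights + 32 biases = 320 trainable parameters,
# all of which are learned from the training data rather than hand-crafted
```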
This might surprise you at first: how can deep learning achieve human-level performance in image classification, and how can we build deep learning models, if we do not fully understand how they work? This question has divided the deep learning community, largely along the demarcation between industry and academia. Many (but not all) researchers believe that we should gain a more fundamental understanding of how deep learning models work. Some researchers also believe that we can only develop the next generation of artificial intelligence applications by getting a better understanding of how current architectures work. At a recent NIPS conference (one of the oldest and most notable conferences for deep learning), deep learning was unfavorably compared to alchemy. Meanwhile, practitioners in industry are generally less concerned with how deep learning works; they are more focused on building ever more complex deep learning architectures to maximize accuracy or performance.
Of course, this is a crude representation of the state of the field; not all academics are inward-looking, and not all practitioners are just tweaking models to get small improvements. Deep learning is still relatively new (although the building blocks of neural networks have been known about for decades), but this tension does exist and has been around for a while. For example, a popular deep learning architecture introduced Inception modules, which were named after the movie Inception. In the film, Leonardo DiCaprio leads a team that alters people's thoughts and opinions by embedding themselves within people's dreams. Initially, they go one layer deep, but then they go deeper, in effect entering dreams within dreams. As they go deeper, the worlds become more complicated and the outcomes less certain. We will not go into detail here about what Inception modules are, but they combine convolutional and max pooling layers in parallel. The authors acknowledged the memory and computational cost of the model in the paper, but by naming the key component an Inception module, they were subtly suggesting which side of the argument they were on.
After the breakthrough performance of the winner of the 2012 ImageNet competition, two researchers were dissatisfied that there was no insight into how the model worked. They decided to reverse-engineer the algorithm, attempting to show the input pattern that caused a given activation in the feature maps. This was a non-trivial task, as some layers used in the original model (for example, pooling layers) discarded information. Their paper showed the top 9 activations for each layer. Here is the feature visualization for the first layer:
The image is in two parts; on the left we can see the convolution (the paper only highlights 9 convolutions for each layer). On the right, we can see examples of patterns within images that match that convolution. For example, the convolution in the top-left corner is a diagonal edge detector. Here is the feature visualization for the second layer:
Again, the image on the left is an interpretation of the convolution, while the image on the right shows examples of image patches that activate for that convolution. Here, we are starting to see combinations of the simpler patterns from the first layer. For example, in the top-left, we can see patterns with stripes. Even more interesting is the example in the second row and second column: here, we see circular shapes, which could indicate the eye of a person or an animal. Now, let's move on to the feature visualization for the third layer:
In the third layer, we are seeing some really interesting patterns. In the second row and second column, we have identified parts of a car (wheels). In the third row and third column, we have begun to identify people's faces. In the second row and fourth column, we are identifying text within the images.
In the paper, the authors show examples for more layers. I encourage you to read the paper to get further insight into how convolutional layers work.
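The paper's visualizations are produced with a deconvolutional network, which is beyond what we can reproduce here, but for the first layer you can get a crude approximation by simply plotting the learned kernels of a trained model. Here is a sketch, assuming the keras R package and a hypothetical trained model object called model whose first layer is convolutional:

```r
library(keras)

# Extract the kernel of the first convolutional layer; its dimensions are
# (height, width, input channels, number of filters)
kernel <- get_weights(model)[[1]]

# Plot the first 9 filters for the first input channel
par(mfrow = c(3, 3), mar = c(1, 1, 1, 1))
for (f in 1:9) {
  filt <- kernel[, , 1, f]
  image(t(apply(filt, 2, rev)), col = grey.colors(256), axes = FALSE)
}
```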
In another image classification task, a model failed to work in practice. The task was to classify wolves versus dogs, but the model had been trained on data that showed wolves in their natural habitat, that is, snow. As a result, the model in effect learned to differentiate between snow and dogs rather than between wolves and dogs, and any image of a wolf in another setting was wrongly classified.