Is it the end of CNNs? What is Multi-Layer Perceptron?
For many years, Convolutional Neural Networks (CNNs) have been the cornerstone of computer vision architectures. Image analysis requires feature extraction, and the convolution layers of a CNN are the key component responsible for that task. For example, given an image, the convolution layers detect features such as two eyes, long ears, four legs, a short tail, and so on. CNNs power numerous computer vision applications, such as face recognition (knowing the identity of the person in the image), object detection (determining the location of an object), and video analysis (classifying the behavior of an individual).
Despite the importance of CNNs and their applications, the arrival of MLP-Mixer, an architecture built on the multi-layer perceptron (MLP), has opened the door to new architectures that show competitive performance without any convolution layers. This makes us wonder about the future of CNNs in deep learning and computer vision applications.
In this article, we give a brief overview of the newly released MLP-Mixer architecture and try to figure out whether it threatens the position of CNNs in the field of computer vision, especially since the proposed MLP-Mixer achieves results very close to SOTA models trained on large amounts of data, at almost 3x the speed, without using any convolutions or self-attention layers.
MLP for Computer Vision Applications
Last month, the Google AI team published a paper titled "MLP-Mixer: An all-MLP Architecture for Vision" (code is available on GitHub), in which they proposed a new computer vision architecture that depends only on multi-layer perceptrons. According to the team, "In this paper, we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs)."
Interestingly, the new model achieves results similar to state-of-the-art models trained on large datasets while running at almost 3x the speed. “When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models,” claimed Google AI.
The following table shows the inference throughput of MLP-Mixer compared to SOTA models. MLP-Mixer clearly outperforms many of them, achieving 105 images/sec/core, while a model such as the Vision Transformer reaches only 32 images/sec/core.
The interesting thing is that the MLP is the basic building block of deep neural networks, which makes the architecture of the MLP-Mixer model really simple.
How does MLP-Mixer work?
While a CNN processes the image with convolutions that slide over its pixels, MLP-Mixer follows a patch-based approach. The image is divided into patches, which are then passed as input to the network. The Mixer contains two types of layers. The first mixes the channels within each patch, while the second mixes information across patches to ensure communication between the spatial locations of the image.
Figure 1 - Source: MLP-Mixer on arXiv
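To make the patch step concrete, here is a minimal NumPy sketch of splitting an image into non-overlapping patches and projecting each patch to a row of channels. It is an illustration, not the authors' code: the function name image_to_patches, the 224x224 input, the 16x16 patch size, and the hidden width of 512 are all assumptions chosen for the example, and the projection weights are random rather than learned.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # assumed input size, random pixel values
patches = image_to_patches(image, 16)    # assumed patch size
print(patches.shape)                     # (196, 768)

# Per-patch linear projection to an assumed hidden width of 512 channels.
# Random weights stand in for learned ones.
hidden_dim = 512
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], hidden_dim))
tokens = patches @ w_embed               # rows are patches, columns are channels
print(tokens.shape)                      # (196, 512)
```

The resulting table of patches by channels is what the two kinds of mixing operate on: channel mixing works along each row, and patch mixing works along each column.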
What makes MLP-Mixer attract attention?
The previous image illustrates the architecture of the Mixer layers. Each MLP block contains two fully connected layers with a GELU nonlinearity between them, and the model also uses skip-connections, layer norm, and a final linear classifier. That shows the simplicity of the model, which depends essentially on matrix multiplication. Convolution, by contrast, is more complex than the plain matrix multiplication in MLPs, as it requires an additional costly reduction to matrix multiplication or a specialized implementation. Furthermore, every MLP-Mixer layer takes an input of the same size, which eliminates the need to progressively scale down the feature maps as CNNs do.
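The components listed above can be put together in a short NumPy sketch of a single Mixer layer. This is a simplified illustration under stated assumptions, not the official implementation: the helper names (gelu, layer_norm, mlp, init_mlp, mixer_layer) are made up for this example, the layer norm omits its learnable scale and shift, GELU uses the common tanh approximation, and the sizes (196 patches, 512 channels, hidden widths 256 and 2048) are illustrative.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of the GELU nonlinearity.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last (channel) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # Two fully connected layers with GELU in between, applied to the last axis.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Hypothetical initializer: small random weights, zero biases.
    return (rng.normal(scale=0.02, size=(dim, hidden)), np.zeros(hidden),
            rng.normal(scale=0.02, size=(hidden, dim)), np.zeros(dim))

def mixer_layer(x, token_params, channel_params):
    """Apply one Mixer layer to x of shape (num_patches, channels)."""
    # Patch (token) mixing: transpose so the MLP acts across patches per channel.
    y = layer_norm(x).T                       # (channels, num_patches)
    x = x + mlp(y, *token_params).T           # skip connection
    # Channel mixing: the MLP acts across channels for each patch.
    x = x + mlp(layer_norm(x), *channel_params)
    return x

rng = np.random.default_rng(0)
num_patches, channels = 196, 512              # illustrative sizes
x = rng.normal(size=(num_patches, channels))
token_params = init_mlp(rng, num_patches, 256)
channel_params = init_mlp(rng, channels, 2048)

out = mixer_layer(x, token_params, channel_params)
print(out.shape)                              # (196, 512)
```

The output has the same shape as the input, so layers like this can be stacked without any downscaling step, and every operation apart from the normalization and GELU is a plain matrix multiplication or addition.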
Conclusion
It is just the beginning. As the Google AI team says, "…we hope that our results spark further research, beyond the realms of established models based on convolutions and self-attention". So, we expect to see more research in this direction, resulting in improved and more advanced architectures for computer vision problems such as image classification and object detection, which may create a new stream that eventually displaces CNNs.
Author: Moaaz Abdelrahman ElMarakby, Computer Vision Engineer