Editor’s note: Rowel is a speaker for ODSC APAC 2021. Be sure to check out his talk, “Vision Transformer and its Applications,” there!

Since the idea of using attention in natural language processing (NLP) was introduced in 2017 [1], transformer-based models have dominated performance leaderboards in language tasks. Many attempts to transfer the success of attention to computer vision problems failed. In 2020, a group of researchers from Google figured out how to make transformers work on vision tasks [2]. Pre-training on a very large dataset and fine-tuning on a target dataset proved to be the solution. For example, pre-training on 303M high-resolution images from JFT-300M and fine-tuning on the ImageNet1k target dataset yields performance comparable to state-of-the-art vision models like EfficientNet [3]. This attention-based model is called a Vision Transformer. Except for the different training technique, the vision transformer is practically the same as the original transformer proposed by Vaswani et al. (2017) for NLP tasks [1]. Follow-up papers applying vision transformers to different downstream tasks [4, 5, 6, 7] demonstrated significant improvements in performance under a variety of metrics.


Proving that a transformer can also work effectively on vision problems got many people excited. Potentially, self-attention (or simply attention) is a good general-purpose architecture that can process different data formats such as text, images, audio, video, and point clouds. In NLP, attention is a measure of the relevance between any two tokens in a sentence. For example, in the sentence "the quick brown fox jumps over the lazy dog", the words "brown" and "fox" are strongly related, while "brown" has little to do with "dog". As shown in Figure 1, the attention is high between "brown" and "fox" and low between "brown" and "dog."
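To make the idea concrete, here is a minimal sketch of scaled dot-product attention as defined by Vaswani et al. (2017), written in NumPy. The token embeddings are random placeholders, not real word vectors; the point is only to show that attention produces a matrix of pairwise relevance weights between tokens.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention (Vaswani et al., 2017).

    q, k, v: arrays of shape (seq_len, d), one row per token embedding.
    Returns the attended values and the (seq_len, seq_len) weight matrix.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # pairwise token relevance scores
    # Softmax over each row so weights for a token sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: 9 tokens ("the quick brown fox jumps over the lazy dog")
# with random 8-dimensional embeddings as self-attention inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(9, 8))
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # (9, 8) (9, 9)
```

Entry `attn[i, j]` plays the role of the "brown"/"fox" relevance in Figure 1: a high value means token `j` is relevant to token `i`.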

Figure 1. Attention between two words in a sentence is a measure of relevance between them.

For almost 3 years after transformers made significant progress in NLP, computer vision researchers tried to figure out how to transfer the same concept of attention between two tokens to images. Using a pixel as the unit token is impractical: training on 224×224 RGB ImageNet1k images would mean sequences of 50,176 tokens, which is computationally prohibitive since the cost of attention grows quadratically with sequence length. In the example above, the sentence is made of just 9 tokens. Some sentences may be longer, but not in the range of thousands.

To reduce the sequence length, vision researchers proposed using a patch to represent a token. For example, using a patch size of 16×16 on ImageNet1k data, the sequence length drops to just (224/16)² = 196. All of a sudden, model training becomes computationally practical. Instead of sequences of words, sequences of patches are used to train the vision transformer in a supervised manner. The same concept of relevance between two tokens extends to attention models in vision. For example, Figure 2 shows that the attention between two patches in the bird area is high, while the attention between a wall patch and a bird patch is low. The wall has little to do with the visual concept of a bird.
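The sequence-length arithmetic above can be checked with a short sketch that splits a 224×224 image into 16×16 patches, the same patch size used in the ViT paper [2]. The zero-filled image is a stand-in for a real input; only the shapes matter here.

```python
import numpy as np

# Pixels as tokens: a 224x224 image would be a 50,176-token sequence.
image = np.zeros((224, 224, 3))
print(image.shape[0] * image.shape[1])  # 50176

# ViT instead tokenizes 16x16 patches: (224/16)**2 = 196 tokens.
patch = 16
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
print(patches.shape)  # (196, 768): 196 tokens, each a flattened 16x16x3 patch
```

Each 768-dimensional flattened patch is then linearly projected to the model's embedding size before being fed to the transformer, exactly like a word embedding in NLP.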


Figure 2. Attention between two bird patches is high while attention between any wall patch and any bird patch is low.

Perhaps the greatest impact of the vision transformer is the strong indication that we can build a universal model architecture that supports any type of input data: text, image, audio, and video. The excitement emanates from promising new results on training models with multi-modal input data using transformers, while avoiding the heavy engineering and inefficiencies of mixed architectures like RNNs for sequences and CNNs for visual data. The importance of multi-modal learning cannot be over-emphasized, since the world we live in is inherently multi-modal. For machines to be more useful to our society, our algorithms must learn how to reason with multi-sensory inputs.

With attention-based models, we can expect new breakthroughs in the near future. We will see attention-based models that can reliably answer language queries about an image, like "What is the time on the clock?" as shown in Figure 3, or "How much is a bottle of sparkling water?". We will be fascinated by robots that can understand handwritten instructions or voice commands like "Can you help me find my car keys?". For machine learning practitioners, now is a good time to learn and apply transformers to your projects.

Figure 3. Example of data with two modalities (text and image).

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

[2] Dosovitskiy, Alexey, et al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.” International Conference on Learning Representations. 2020.

[3] Tan, Mingxing, and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks.” International Conference on Machine Learning. PMLR, 2019.

[4] Zheng, Sixiao, et al. “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[5] Xie, Enze, et al. “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.” arXiv preprint arXiv:2105.15203. 2021.

[6] Atienza, Rowel. "Vision Transformer for Fast and Efficient Scene Text Recognition." International Conference on Document Analysis and Recognition (ICDAR). 2021.

[7] Chen, Jieneng, et al. “Transunet: Transformers make strong encoders for medical image segmentation.” arXiv preprint arXiv:2102.04306. 2021.

Editor’s note: More info on Rowel’s upcoming ODSC APAC 2021 session: While the transformer has dominated most state-of-the-art models in natural language processing, CNN is still the preferred backbone in the field of computer vision. This changed in 2020 when Vision Transformer (ViT) demonstrated strong results in different downstream tasks. Since then, ViT has experienced rapid adoption and progress. In this talk, we will cover the network architecture, training, and applications of ViT. 

Rowel Atienza is a Professor and Scientist at the Electrical and Electronics Engineering Institute of the University of the Philippines, Diliman. He holds the Dado and Maria Banatao Institute Professorial Chair in Artificial Intelligence. He finished his Ph.D. at The Australian National University for his contribution in the field of active gaze tracking for human-robot interaction. His current research work focuses on computer vision, robotics, and AI. Rowel is the author of Advanced Deep Learning with TensorFlow 2 and Keras (1st and 2nd ed).