Image as a Second Language: A Review of Transformers in Vision
Transformer-based architectures have emerged as the driving paradigm for innovation in natural language processing (NLP). Likewise, in the computer vision domain, convolutional neural networks (CNNs) and their many variants have long stood as the go-to architectures for tasks such as image classification, object detection, and semantic segmentation. However, the remarkable success of Transformers in NLP has encouraged researchers to investigate their potential in vision tasks. In recent years, Transformers have already demonstrated significant capabilities in addressing a variety of computer vision problems. This article aims to illuminate this intersection by presenting a survey of notable state-of-the-art Transformer architectures tailored for computer vision. It also offers an analysis of current challenges and a discussion of potential directions for future research.
The Transformer is the preferred standard architecture for a wide variety of NLP tasks, including sentiment analysis, machine translation, and text summarization. Following the Transformer's inception in 2017 [1], models such as BERT [2] and GPT [3] excelled in natural language understanding (NLU) and natural language generation (NLG) tasks. At the core of the Transformer architecture is the multi-head self-attention mechanism, which allows the network to capture long-range dependencies and the global context of words in a text sequence. This was a significant improvement over previous NLP models, which struggled to capture these relationships, particularly in longer text sequences. Another remarkable feature of the Transformer is the parallelization of the self-attention computation, which allows it to efficiently process vast amounts of data during training. Typically, Transformer models are pre-trained on huge corpora of text and subsequently fine-tuned on a smaller task-specific dataset.
On the other hand, convolutional neural networks (CNNs) are particularly well suited to processing images [4]. The convolution operation captures spatial representations of local features at each convolutional layer in a hierarchical manner. The output of a convolutional layer is a feature map, a compact representation of the input that is easier to interpret. Two key attributes of CNNs are translation invariance, which means that they are not affected by small shifts in the input data, and scale invariance, meaning that CNNs can recognize objects at different scales [4]. CNNs are typically trained in a supervised manner with human-annotated images. The ImageNet database, consisting of 14 million annotated images, is the standard for image recognition and object detection research; smaller subsets are normally used for training and benchmarking deep learning models.
The astounding success of Transformers inspired computer vision researchers to explore their use in vision problems. Some approaches combine CNNs with self-attention modules, while others have successfully implemented pure Transformers with few modifications to the original architecture [6].
Self-attention Vision Transformers
The vision Transformer (ViT) model, introduced in 2020 by Dosovitskiy et al. [6], was the first major attempt to adapt a Transformer to process images with successful results. The ViT architecture consists of an encoder-only Transformer, and it was originally devised for image classification. Its approach to handling images is to split the image into fixed-size patches, which are then linearly embedded to form a 1D sequence of tokens. To retain positional information, a positional embedding is added to each patch embedding. For image classification, ViT also prepends an extra learnable token (similar to the [CLS] token in BERT) to each sequence of embedded patches. The resulting sequence is then fed to a standard Transformer encoder. The model is pre-trained and subsequently fine-tuned to perform image classification. ViT demonstrated that it is possible to implement a pure Transformer for image classification, without CNN layers [6].
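To make the patch-embedding step concrete, below is a minimal PyTorch sketch of how an image can be turned into the token sequence described above. The module name, the 16x16 patch size, and the 768-dimensional embedding are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and embed each patch (ViT-style sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to slicing patches and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and positional embeddings (one extra position for [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, embed_dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the [CLS] token
        return x + self.pos_embed              # add positional information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 197, 768), ready for an encoder
```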
An important remark about training Transformers in general is that, compared to CNNs, Transformers are “data-hungry” models. The original BERT model, for example, was pre-trained on the BooksCorpus, consisting of 800 million words, and on English Wikipedia, with 2,500 million words [2]. In contrast, ResNet-like CNNs are typically trained with smaller datasets, such as ImageNet-1K with 1.2 million training samples [5]. ViT was pre-trained on the JFT dataset, with 300 million labeled images. Notably, when ViT is pre-trained only on the ImageNet-1K training set, its accuracy on the ImageNet test set drops by 13% [7]. ViTs therefore require much larger datasets to achieve performance comparable to state-of-the-art CNNs. On a technical level, the reason is that convolutional architectures encode prior knowledge about images, thus reducing the need for data, whereas Transformers must learn that same knowledge from large datasets [6][7].
With a focus on training efficiency, DeiT (Data-efficient image Transformer) introduces a Transformer-specific teacher-student training strategy that yields effective results while training on the smaller ImageNet-1K dataset [8]. DeiT's training strategy is a form of knowledge distillation: a smaller, more efficient ViT student learns the important features in the data from a large pre-trained teacher model [9]. Interestingly, the authors report that with their knowledge distillation scheme, the student ViT learns more from a CNN teacher than from another Transformer [8]. DeiT is a lightweight model well suited for image classification.
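As an illustration of the objective behind this strategy, the following is a minimal sketch of a DeiT-style hard-label distillation loss. It assumes the student has both a classification head and a distillation head and that the teacher's logits are pre-computed; the tensor names and the 50/50 weighting are assumptions, not the exact DeiT recipe.

```python
import torch
import torch.nn.functional as F

def deit_style_distillation_loss(student_cls_logits, student_dist_logits,
                                 teacher_logits, labels, alpha=0.5):
    """Hard-label distillation in the spirit of DeiT [8] (illustrative sketch)."""
    # The classification head is trained on the ground-truth labels.
    ce_loss = F.cross_entropy(student_cls_logits, labels)
    # The teacher's hard prediction acts as a pseudo-label for the distillation head.
    teacher_labels = teacher_logits.argmax(dim=-1)
    distill_loss = F.cross_entropy(student_dist_logits, teacher_labels)
    # Combine the two objectives (equal weighting assumed here).
    return alpha * ce_loss + (1 - alpha) * distill_loss
```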
To perform the more challenging task of object detection, YOLOS (You Only Look at One Sequence) [10] proposes a simple yet clever variation of the vanilla ViT. Essentially, YOLOS drops the [CLS] token used in ViT for classification and replaces it with 100 learnable [DET] tokens for object detection. During training, it computes an optimal bipartite matching between the predictions of the 100 [DET] tokens and the ground-truth objects (a bipartite matching loss). The main goal of YOLOS was not to be a high-performance object detector but to demonstrate that task-agnostic pre-trained ViTs can be adapted to specific downstream vision tasks through fine-tuning.
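The bipartite matching step can be sketched with the Hungarian algorithm from SciPy. The cost terms and the box-cost weight below are simplified assumptions (the papers also include a generalized IoU term), not the exact YOLOS loss.

```python
import torch
from scipy.optimize import linear_sum_assignment

def bipartite_match(pred_logits, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Match predicted [DET] tokens to ground-truth objects (DETR/YOLOS-style sketch).

    pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4)
    gt_labels: (num_objects,) long tensor, gt_boxes: (num_objects, 4)
    """
    prob = pred_logits.softmax(-1)                     # (Q, C)
    cls_cost = -prob[:, gt_labels]                     # (Q, N): higher class prob -> lower cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, N): L1 distance between boxes
    cost = cls_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return list(zip(pred_idx, gt_idx))                 # one prediction per ground-truth object
```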
Self-attention is known to be computationally demanding, as its cost grows quadratically with the number of tokens. In this regard, the Swin Transformer [11] implements a “shifted window” scheme that computes self-attention only within non-overlapping local windows. To enable cross-window connections, the Swin Transformer uses two successive self-attention blocks: the first performs self-attention within a regular window partitioning, and the next shifts those windows and performs window-based self-attention again. This scheme is efficient on vision problems, achieving competitive performance on the COCO object detection and ADE20K semantic segmentation benchmarks [11].
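A minimal sketch of the window partitioning and the cyclic shift between consecutive blocks is shown below. The shapes and the window size of 7 are assumptions for illustration; the attention computation itself and the masking used in the full model are omitted.

```python
import torch

def window_partition(x, window_size):
    """Split a feature map into non-overlapping windows (Swin-style sketch).

    x: (B, H, W, C); returns (num_windows * B, window_size, window_size, C).
    H and W are assumed divisible by window_size.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# Self-attention is computed independently inside each window. In the next block, the
# feature map is cyclically shifted so information can flow across window boundaries
# (the full model also applies an attention mask after the shift).
feat = torch.randn(2, 56, 56, 96)                           # (B, H, W, C)
windows = window_partition(feat, 7)                         # (2 * 8 * 8, 7, 7, 96)
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))    # shift by half a window
shifted_windows = window_partition(shifted, 7)
```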
Hybrid Vision Transformers
While the original ViT was a pure Transformer architecture [6], its authors also suggested the idea of a hybrid approach [9]. Instead of splitting the image into patch embeddings, the input sequence is obtained from a CNN feature map, whose spatial dimensions are flattened and projected to the Transformer's input dimension.
In this line of work, DETR (DEtection TRansformer) combines a CNN with a Transformer [12]. DETR's architecture is quite simple: it consists of three modules, a backbone CNN, an encoder-decoder Transformer, and a feed-forward network (FFN). The CNN used in the original model is a ResNet trained on ImageNet. Starting with an input image, the CNN module learns a 2D representation and collapses the output feature map into a 1D sequence. The encoder receives this sequence, to which a fixed positional encoding is added (analogous to ViT's positional embeddings). The decoder then transforms the embeddings using multi-head self-attention and encoder-decoder attention. The final prediction is made by the FFN, which consists of a 3-layer perceptron.
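The overall structure can be sketched in a few lines of PyTorch, in the spirit of the simplified implementation included in [12]. The sketch below omits the positional encodings and the bipartite matching loss, and the hyperparameters (hidden size, 100 object queries, ResNet-50 backbone) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class MinimalDETR(nn.Module):
    """Stripped-down DETR-like model: CNN backbone -> Transformer -> FFN heads (sketch)."""
    def __init__(self, num_classes=91, hidden_dim=256, num_queries=100):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.conv = nn.Conv2d(2048, hidden_dim, 1)           # project features to hidden_dim
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Sequential(                       # the 3-layer FFN mentioned above
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4))

    def forward(self, images):                                # (B, 3, H, W)
        feat = self.conv(self.backbone(images))               # (B, hidden_dim, h, w)
        src = feat.flatten(2).permute(2, 0, 1)                # (h*w, B, hidden_dim)
        tgt = self.query_embed.unsqueeze(1).repeat(1, images.size(0), 1)
        hs = self.transformer(src, tgt)                       # (num_queries, B, hidden_dim)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```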
Because of the computational complexity of the attention mechanism, the original ViT is limited to low-resolution images. Addressing this limitation, ViT-FRCNN proposes a ViT backbone that produces a feature map to feed a conventional, Faster R-CNN-style detection network. Essentially, ViT-FRCNN reinterprets the output sequences of the ViT module as a 2D feature map, which then becomes the input to the detection network. This approach demonstrates that a Transformer backbone can retain sufficient spatial information for object detection. The results further suggest that object detection tasks benefit from the large-scale pre-training paradigm commonly used for Transformers [13].
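The reinterpretation step amounts to reshaping the token sequence back onto the patch grid; a minimal sketch (shapes are assumptions) is shown below.

```python
import torch

def tokens_to_feature_map(tokens, grid_h, grid_w):
    """Reinterpret a ViT output sequence as a 2D feature map (ViT-FRCNN-style idea, sketch).

    tokens: (B, num_patches, embed_dim), with num_patches == grid_h * grid_w
    (any [CLS]-like token is assumed to have been removed already).
    """
    B, N, D = tokens.shape
    assert N == grid_h * grid_w, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(B, D, grid_h, grid_w)  # (B, D, H', W')

feat_map = tokens_to_feature_map(torch.randn(2, 196, 768), 14, 14)  # (2, 768, 14, 14)
```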
A novel hybrid architecture is the Conformer. Introduced in 2021, the Conformer implements a dual-branch architecture: a CNN branch and a Transformer branch [14]. This approach benefits from both the Transformer's ability to learn global representations and the CNN's ability to capture important local features. To align the feature maps of the CNN branch with the patch embeddings of the Transformer branch, the authors propose a Feature Coupling Unit (FCU). The FCU couples the local features with the global representations interactively, enhancing the model's ability to learn visual representations. The Conformer is a versatile model, able to perform image classification, object detection, and instance segmentation tasks.
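To give a sense of what this coupling involves, the sketch below shows one plausible way to align a CNN feature map with Transformer patch embeddings: 1x1 convolutions to match channel dimensions plus pooling and interpolation to match spatial resolution. It illustrates the general idea only; the layer choices are assumptions and do not reproduce the Conformer's actual FCU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCouplingSketch(nn.Module):
    """Illustrative alignment between a CNN branch and a Transformer branch (not the real FCU)."""
    def __init__(self, cnn_channels=256, embed_dim=384, patch_grid=14):
        super().__init__()
        self.patch_grid = patch_grid
        self.cnn_to_tr = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)  # align channels
        self.tr_to_cnn = nn.Conv2d(embed_dim, cnn_channels, kernel_size=1)

    def cnn_to_tokens(self, feat):                        # (B, C, H, W) -> (B, N, D)
        x = self.cnn_to_tr(feat)
        x = F.adaptive_avg_pool2d(x, self.patch_grid)     # match the patch grid resolution
        return x.flatten(2).transpose(1, 2)

    def tokens_to_cnn(self, tokens, out_hw):              # (B, N, D) -> (B, C, H, W)
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.patch_grid, self.patch_grid)
        x = self.tr_to_cnn(x)
        return F.interpolate(x, size=out_hw, mode="bilinear", align_corners=False)
```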
Challenges Ahead
Significant advancements in deep learning for computer vision have been made in recent years, but many open challenges remain. This section offers a review of relevant challenges and discusses promising future directions.
- General Purpose vs Task-specific models
NLP tasks are generally modeled with the inputs and outputs represented as sequences of tokens. In contrast, the output requirements for computer vision tasks are diverse. For example, the output in object detection is a bounding box with a class label for each detected object. In semantic segmentation, the model produces a mask of pixels corresponding to an object class. Because of these differences in output requirements, customized model architectures need to be developed. Adapting architectures for each specific task is a non-trivial endeavor.
In an attempt to unify all vision tasks under a single modeling framework, researchers have proposed to tokenize the task inputs (i.e., images) and outputs. The model takes the tokenized input and produces a tokenized output that can be decoded into the format required by the task [15]. Future research along this line of work aims to develop general-purpose intelligent systems.
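As a toy illustration of output tokenization, the snippet below turns a bounding box and class label into a short sequence of discrete tokens by quantizing the coordinates into bins, in the spirit of the unified sequence interface of [15]; the bin count, vocabulary layout, and function name are assumptions.

```python
def box_to_tokens(box, label, num_bins=1000):
    """Turn one bounding box into a short sequence of discrete tokens (illustrative sketch).

    Coordinates are assumed to be normalized to [0, 1]; each coordinate is mapped to one of
    num_bins integer bins, and the class id is offset so it lives after the coordinate bins.
    """
    coord_tokens = [min(int(c * num_bins), num_bins - 1) for c in box]  # x_min, y_min, x_max, y_max
    class_token = num_bins + label
    return coord_tokens + [class_token]

# Five integer tokens: four coordinate bins followed by one class token.
print(box_to_tokens([0.12, 0.30, 0.58, 0.91], label=17))
```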
On the other hand, designing deep learning models for specific vision applications is just as crucial. Here the challenge lies in efficiently using the visual representations extracted from large pre-trained backbone models to obtain task-specific features, for example, in small object detection tasks [17]. A possible solution proposed by researchers involves developing tailored efficient decoders and streamlining the multi-stage approach to visual tasks [16].
- Training Efficiency
Training large models on datasets at the scale used for Transformers is a computationally expensive task. Overall, the computer vision field can benefit from improvements in both CNNs and vision Transformers. Recent work suggests that CNNs, when given compute budgets and training data similar to those of Transformers, deliver comparable results. The authors in [18] used the JFT-4B dataset (with 4 billion samples) to pre-train an NFNet, a pure CNN architecture. After fine-tuning on ImageNet, the results show performance comparable to vision Transformers trained on similar compute budgets. This work suggests that, contrary to commonly held notions, CNNs can match the scaling properties of Transformers. However, it also highlights the fact that computing power and training data are the most important factors for performance; these remain limiting factors and an open challenge for deep learning in general.
The optimization algorithm, or optimizer, plays an important part in training deep neural networks efficiently. Adam stands out as the go-to optimizer for most vision models. In this line of research, a technique to discover novel optimizers is presented in [19]. The researchers used a program search technique to discover new optimization algorithms, and found one with better memory efficiency than Adam. To test this new algorithm, which they dubbed Lion, the authors pre-trained three differently sized ViTs on JFT-300M and fine-tuned them on ImageNet. Their best results report up to 2% higher accuracy on ImageNet with 5 times less pre-training compute. This work showcases the importance of continued research on efficient training schemes.
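The update rule reported in [19] is simple enough to sketch in a few lines. The function below is an illustrative, single-parameter version rather than the official implementation, with typical default values assumed for the hyperparameters.

```python
import torch

def lion_update(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One step of the Lion update rule described in [19] (sketch, not the official code).

    Lion keeps only a single momentum buffer and applies sign() to the interpolated update,
    which is where its memory savings relative to Adam come from.
    """
    update = torch.sign(beta1 * momentum + (1 - beta1) * grad)
    param = param - lr * (update + weight_decay * param)      # decoupled weight decay
    momentum = beta2 * momentum + (1 - beta2) * grad          # update the momentum buffer
    return param, momentum
```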
- Edge Computing Applications
With the growing adoption of edge computing devices in a wide range of markets, the demand for deploying deep learning models on these devices also increases [16]. Typically, the Internet of Things (IoT) devices used in edge computing are limited in power consumption, physical size, memory, and computing capability. Edge computing brings data storage and computation closer to the device itself, where these constraints make resource-intensive models impractical. Recent developments in this direction include model compression techniques. One such technique is quantization, which reduces memory requirements by lowering the number of bits used to represent each weight in the network (e.g., from 32-bit floating point to 8 bits or fewer). Researchers in [20] demonstrated a promising application of quantization by training a CNN on a device with only 256 KB of memory.
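As a simple numerical illustration of weight quantization (not the training method of [20]), the snippet below maps a 32-bit floating-point weight tensor to 8-bit integers plus a single scale factor, cutting the memory footprint of the weights by roughly a factor of four.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric post-training quantization of a weight tensor to 8-bit integers (sketch)."""
    scale = max(np.abs(weights).max() / 127.0, 1e-12)   # map the largest weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                 # approximate reconstruction

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
print("max reconstruction error:", np.abs(dequantize(q, s) - w).max())
```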
The development of deep learning algorithms adapted to the constraints imposed by edge computing is an avenue for potential research, especially challenging for vision Transformers.
Conclusion
In the dynamic landscape of deep learning, the distinctive strengths of both the Transformer and CNN architectures have played pivotal roles. Overall, the Transformer architecture is a versatile tool that is revolutionizing the deep learning field and is likely to remain a major area of research. In recent years, a considerable amount of research has been published on the intersection of Transformers and CNNs in computer vision. This article presents a high-level review of some notable works on the subject; however, the literature is vast and the field is rapidly evolving. This review is therefore necessarily non-exhaustive, and the reader is encouraged to explore the subject further.
References:
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762v5, 2017.
[2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v2, 2019.
[3] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever. Improving language understanding by generative pre-training. Preprint, 2018.
[4] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard and L. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation, 1:541–551, 1989.
[5] A. Krizhevsky, I. Sutskever, G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in NeurIPS, 25, 2012.
[6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929v2, 2021.
[7] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah. Transformers in vision: a survey. In ACM Comput. Surv., vol. 54, no. 10s, art. 200, 2022.
[8] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[9] A. Khan, Z. Rauf, A. Sohail, A. R. Khan, H. Asif, A. Asif, et al. A survey of the vision transformers and its cnn-transformer based variants. arXiv preprint arXiv:2305.09880, 2023.
[10] Y. Fang, B. Liao, X. Wang, J. Fang, J. Qi, R. Wu, J. Niu and W. Liu. You only look at one sequence: rethinking transformer in vision through object detection. In Advances in NeurIPS, vol 34, pp 26183–26197, 2021.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo. Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030v2, 2021.
[12] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872v3, 2020.
[13] J. Beal, E. Kim, E. Tzeng, D. H. Park, A. Zhai, D. Kislyuk. Toward transformer-based object detection. arXiv preprint arXiv:2012.09958, 2020.
[14] Z. Peng, W. Huang, S. Gu, L. Xie, Y. Wang, J. Jiao, Q. Ye. Conformer: local features coupling global representations for visual recognition. In IEEE Trans. Pattern Anal. Mach. Intell., 2023.
[15] T. Chen, S. Saxena, L. Li, T. Lin, D. J. Fleet, G. E. Hinton. A unified sequence interface for vision tasks. In Advances in NeurIPS, vol. 35, pp. 31333–31346, 2022.
[16] Y. Wang, Y. Han, C. Wang, S. Song, Q. Tian, G. Huang. Computation-efficient deep learning for computer vision: a survey. arXiv preprint arXiv:2308.13998v1, 2023.
[17] A. M. Rekavandi, S. Rashidi, F. Boussaid, S. Hoefs, E. Akbas, M. Bennamoun. Transformers in small object detection: a benchmark and survey of state-of-the-art. arXiv preprint arXiv:2309.04902v1, 2023.
[18] S. L. Smith, A. Brock, L. Berrada, S. De. Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764v1, 2023.
[19] X. Chen, C. Liang, D. Huang, E. Real, K. Wang, Y. Liu, et al. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675v4, 2023.
[20] J. Lin, L. Zhu, W. Chen, W. Wang, C. Gan, S. Han. On-device training under 256kb memory. In NeurIPS, 2022.
- I recommend checking Jay Alammar's The Illustrated Transformer for everything you need to know about Transformers.