Decoding the Characteristics and Strengths of Vision Transformers (ViTs): Uncovering Hidden Properties and Insights in Representation Robustness

It is well-established that Vision Transformers (ViTs) can outperform convolutional neural networks (CNNs), such as ResNets, in image recognition. But which factors cause ViTs’ superior performance? To answer this, we investigate the learned representations of pretrained models.

In this article, we will explore various topics based on high-impact computer vision papers:

  1. The texture-shape cue conflict and the issues that come with supervised training on ImageNet.
  2. Several ways to learn robust and meaningful visual representations, like self-supervision and natural language supervision.
  3. The robustness of ViTs vs CNNs, as well as the intriguing properties that emerge from trained ViTs.

Adversarial Attacks

Adversarial attacks are well-known experiments that help us gain insight into the workings of a classification network. They are designed to fool neural networks by leveraging their gradients (Goodfellow et al.). Instead of minimizing the loss by altering the weights, an adversarial perturbation changes the inputs so as to maximize the loss based on the computed gradients. Let’s look at the adversarial perturbations computed for a ViT and a ResNet model.

ViTs and ResNets process their inputs very differently. Source
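
As a concrete illustration of the attack described above, here is a minimal FGSM-style sketch in PyTorch. The pretrained ResNet-50, the epsilon value, and the placeholder inputs are illustrative assumptions, not the exact setup from Goodfellow et al.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hypothetical target model; any ImageNet classifier works the same way.
model = models.resnet50(weights="IMAGENET1K_V2").eval()

def fgsm_perturbation(image, label, epsilon=0.03):
    """One gradient-ascent step on the input (Goodfellow et al.):
    move every pixel in the direction that increases the loss."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    return (image + epsilon * image.grad.sign()).detach()

x = torch.rand(1, 3, 224, 224)   # placeholder for a preprocessed image
y = torch.tensor([207])          # placeholder ImageNet class index
x_adv = fgsm_perturbation(x, y)  # the adversarial example
```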


Robustness

Robustness: we apply a perturbation to the input images (e.g. masking or blurring) and track the performance drop of the trained model. The smaller the performance degradation, the more robust the classifier! Robustness is measured in supervised setups, so the performance metric is usually classification accuracy. Robustness can also be defined with respect to model perturbations, for example by removing a few layers, but this is less common. Note that our definition of robustness always involves a perturbation.

By design, the transformer can attend to all the tokens (16×16 image patches) at each block. The originally proposed ViT model from Dosovitskiy et al. already showed that some heads in the early layers attend to far-away pixels while others stay local, and that the mean attention distance grows with depth, so deeper layers attend mostly to global content.

How heads of different layers attend to their surrounding pixels. Source: Dosovitskiy et al.
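
To make the robustness protocol above concrete, here is a minimal PyTorch sketch that measures the accuracy drop under an input perturbation. The Gaussian-blur perturbation, the model, and the data loader are placeholders, not taken from any of the papers discussed.

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def accuracy(model, loader, perturb=None):
    """Top-1 accuracy, optionally after applying an input perturbation."""
    correct, total = 0, 0
    for images, labels in loader:
        if perturb is not None:
            images = perturb(images)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Example perturbation: Gaussian blur (masking would work analogously).
blur = lambda x: TF.gaussian_blur(x, kernel_size=9)

# clean_acc     = accuracy(model, val_loader)
# perturbed_acc = accuracy(model, val_loader, perturb=blur)
# The smaller the gap clean_acc - perturbed_acc, the more robust the model.
```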


ImageNet-pretrained CNNs are biased towards texture

In their paper “Are we done with ImageNet?”, Beyer et al. ask whether existing models simply overfit to the idiosyncrasies of ImageNet’s labeling procedure. To delve deeper into the learned representations of pretrained models, we will focus on the well-known ResNet-50 study by Geirhos et al. More specifically, Geirhos et al. demonstrated that CNNs trained on ImageNet are strongly biased towards recognizing textures rather than shapes. Below is an excellent example of such a case:

Classification of a standard ResNet-50 of (a) a texture image (elephant skin: only texture cues); (b) a normal image of a cat (with both shape and texture cues), and (c) an image with a texture-shape cue conflict, generated by style transfer between the first two images. Source: Geirhos et al.


What’s wrong with ImageNet?

Brendel et al. provided sufficient experimental evidence to state that ImageNet can be “solved” (with decently high accuracy) using only local information. In other words, it suffices to integrate evidence from many local texture features rather than integrating and classifying global shapes. The problem? ImageNet-learned features generalize poorly in the presence of strong perturbations. This severely limits the use of pretrained models in settings where shape features transfer well but texture features do not. One dataset that exposes this poor generalization is Stylized ImageNet (SIN).

The SIN dataset. Left: reference image. Right: example texture-free images that can be recognized only by shape. Source: Geirhos et al.


Hand-crafted tasks: rotation prediction

Various hand-crafted pretext tasks have been proposed to improve the learned representations. Such pretext tasks can be used either for self-supervised pretraining or as auxiliary objectives. Self-supervised pretraining requires more resources and usually a larger dataset, while an auxiliary objective introduces a new hyperparameter $\lambda$ to balance the contribution of the two losses:

$$L = L_{\operatorname{supervised}} + \lambda L_{\operatorname{pretext}}$$

For instance, Gidaris et al. used rotation prediction for self-supervised pretraining. The core intuition of rotation prediction (typically over the set [0°, 90°, 180°, 270°]) is that if one is not aware of the objects depicted in an image, one cannot recognize the rotation that was applied to it.

Applied rotations. Source: Gidaris et al. ICLR 2018
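
To make the pretext objective concrete, below is a hedged sketch of rotation prediction combined with the auxiliary-loss formulation above. The model’s `classify` and `predict_rotation` heads and the λ value are hypothetical placeholders, not part of the original works.

```python
import torch
import torch.nn.functional as F

def make_rotation_batch(images):
    """Create all four rotated copies (0, 90, 180, 270 degrees) of a batch
    together with their rotation labels in {0, 1, 2, 3}."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

def combined_loss(model, images, class_labels, lam=0.5):
    """L = L_supervised + lambda * L_pretext, with rotation prediction as the
    pretext task. `classify` and `predict_rotation` are assumed model heads."""
    sup_loss = F.cross_entropy(model.classify(images), class_labels)
    rot_images, rot_labels = make_rotation_batch(images)
    pretext_loss = F.cross_entropy(model.predict_rotation(rot_images), rot_labels)
    return sup_loss + lam * pretext_loss
```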


Self-supervised joint-embedding architectures

Over the years, a plethora of joint-embedding architectures have been developed. In this blog post, we will focus on the recent work of Caron et al., namely DINO, which combines self-distillation with Vision Transformers.

The DINO architecture. Source: Caron et al.
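
Below is a highly simplified sketch of the self-distillation idea behind DINO (not the official implementation): a student network is trained to match a momentum (EMA) teacher across different augmented views, with centering and temperature sharpening applied to the teacher outputs. The temperatures, momentum value, and training-loop comments are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student distribution. Centering the teacher output helps avoid collapse."""
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    """The teacher is an exponential moving average of the student weights."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)

# Per training step, with view1/view2 two augmentations of the same images:
# loss = dino_loss(student(view1), teacher(view2), center) \
#      + dino_loss(student(view2), teacher(view1), center)
# loss.backward(); optimizer.step(); ema_update(student, teacher)
# (center itself is updated as a running mean of the teacher outputs)
```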


Pixel-insensitive representations: natural language supervision

In CLIP, Radford et al. scraped a 400M image-text pair dataset from the web. Instead of a single label (e.g. “car”) encoded as a one-hot vector, each image now comes with a sentence. Given that the label names are available for the downstream dataset, one can perform zero-shot classification by leveraging the text transformer and picking the image-text pair with the maximum similarity. Notice how robust the model is, compared to a supervised ResNet, with respect to (w.r.t.) data perturbations such as sketches.

Source: Radford et al.
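
Below is a minimal zero-shot classification sketch using the openai/CLIP package; the model variant, prompts, and image path are placeholders.

```python
import torch
import clip                     # https://github.com/openai/CLIP
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
prompts = ["a photo of a cat", "a photo of a dog", "a sketch of a car"]

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path
text = clip.tokenize(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and every candidate text prompt.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(prompts[probs.argmax().item()])   # the best-matching prompt
```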


Robustness of ViTs versus ResNets under multiple perturbations

Google AI has conducted extensive experiments to study the behavior of supervised models under different perturbation setups. In the standard supervised arena, Bhojanapalli et al. explored how ViTs and ResNets behave in terms of their robustness against perturbations to the inputs as well as perturbations to the model itself.

Source: Bhojanapalli et al.


Intriguing Properties of Vision Transformers

In this excellent work, Naseer et al. investigated the learned representations of ViTs in greater depth. Below are the main takeaways:

  1. ViTs are highly robust to occlusions, permutations, and distribution shifts (see the patch-drop sketch after this list).
  2. The robustness w.r.t. occlusions is not due to texture bias. ViTs are significantly less biased towards local textures, compared to CNNs.
  3. The emerged background segmentation masks are quite similar to DINO. This fact indicates that both DINO and the shape-distilled ViT (DeiT) learn shape-based representations.
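
As referenced in the first takeaway, here is a hedged sketch of a patch-drop occlusion test: randomly zero out a fraction of the 16×16 input patches and re-evaluate accuracy (reusing the `accuracy` helper from the robustness sketch earlier). The patch size and drop ratio are illustrative, not the exact settings from Naseer et al.

```python
import torch

def drop_patches(images, patch_size=16, drop_ratio=0.5):
    """Zero out a random subset of non-overlapping patch_size x patch_size
    patches in every image of the batch."""
    b, _, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    keep = (torch.rand(b, 1, ph, pw, device=images.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return images * mask

# occluded_acc = accuracy(vit_model, val_loader, perturb=drop_patches)
```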

Vision Transformers are Robust Learners

Sayak Paul and Pin-Yu Chen investigated the robustness of ViTs against corruptions, perturbations, distribution shifts, and natural adversarial examples. More importantly, they used a stronger CNN-based baseline called BiT. The core results are the following:

  1. A longer pretraining schedule and larger pretraining dataset improve robustness.
  2. Attention is key to robustness, which is consistent with all the presented works.
  3. ViTs have better robustness to occlusions.
  4. ViTs have a smoother loss landscape with respect to input perturbations.



