I was fortunate to attend the ICCV 2023 conference in Paris, where I collected papers and notes, and I have decided to share them. Here are my favorite papers, along with their key ideas. If you like my notes below, share them on social media!
Towards understanding the connection between generative and discriminative learning
Key idea: A very new trend that I am extremely excited about is the connection between generative and discriminative modeling. Is there any shared representation between them?
Figure: Matching neurons (rosetta neurons) across different models express a shared concept, such as object contours, object parts, and colors. These concepts emerge without any supervision or manual annotations.
Yes! The paper “Rosetta Neurons: Mining the Common Units in a Model Zoo” showed that completely different models, pre-trained with different objectives, learn shared concepts, such as object contours, object parts, and colors, without supervision or manual annotations. Until now, I had only seen object-related concepts emerge in the self-attention maps of self-supervised vision transformers such as DINO. They further show that the activations look similar, even for StyleGAN2.
The process can be briefly described as follows:
1. Use the trained generative model to produce images.
2. Feed each image into the discriminative model and store all activation maps from all layers.
3. Compute the Pearson correlation, averaged over images and spatial dimensions.
4. Find mutual nearest neighbors between all activations of the two models.
5. Cluster them.
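Steps 3 and 4 boil down to an activation-correlation search. Here is a minimal sketch of that matching step, assuming the activation maps of both models have already been collected and resized to a common spatial resolution; the tensor shapes and function names are illustrative, not the paper’s code.

```python
import torch

def pairwise_pearson(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Correlation between every unit of model A and every unit of model B.

    acts_a: (num_images, units_a, H, W) activation maps from the generative model
    acts_b: (num_images, units_b, H, W) activation maps from the discriminative model
    Returns a (units_a, units_b) matrix of Pearson correlations, averaged over
    images and spatial positions.
    """
    n, ua, h, w = acts_a.shape
    ub = acts_b.shape[1]
    a = acts_a.reshape(n, ua, -1)  # flatten spatial dims
    b = acts_b.reshape(n, ub, -1)
    a = (a - a.mean(-1, keepdim=True)) / (a.std(-1, keepdim=True) + 1e-8)
    b = (b - b.mean(-1, keepdim=True)) / (b.std(-1, keepdim=True) + 1e-8)
    corr = torch.einsum('nap,nbp->nab', a, b) / a.shape[-1]  # per-image correlation
    return corr.mean(0)                                      # average over images

def mutual_nearest_neighbors(corr: torch.Tensor):
    """Keep only pairs (i, j) that are each other's best match."""
    best_b_for_a = corr.argmax(dim=1)  # for each unit of A, the best unit of B
    best_a_for_b = corr.argmax(dim=0)  # for each unit of B, the best unit of A
    return [(i, j.item()) for i, j in enumerate(best_b_for_a)
            if best_a_for_b[j].item() == i]
```

The mutual nearest-neighbor filter is what keeps only the units that genuinely agree in both directions, before clustering them into shared “rosetta” concepts.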
Pre-pretraining: Combining visual self-supervised training with natural language supervision
Motivation: The masked autoencoder (MAE) randomly masks 75% of an image and trains the model to reconstruct the masked input image by minimizing the pixel reconstruction error. MAE has only been shown to scale with model size on ImageNet.
On the other hand, weakly supervised learning (WSL), here meaning natural language supervision, relies on a text description for each image. WSL is a middle ground between supervised and self-supervised pretraining, where text annotations are used, as in CLIP.
Key idea: While MAE thrives in dense vision tasks like segmentation, WSL learns abstract features and has a remarkable zero-shot performance. Can we find a way to get the best of both worlds?
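The heading already hints at one answer: run the two objectives in sequence, self-supervised “pre-pretraining” first, weak supervision second. Below is a schematic sketch of such a two-stage schedule; the module interfaces (`mask_and_encode`, `mae_decoder`, `text_encoder`) are placeholders of mine, not the authors’ code.

```python
import torch
import torch.nn.functional as F

# Schematic two-stage schedule: MAE "pre-pretraining" of the vision encoder,
# followed by weakly supervised (CLIP-style) pretraining on image-text pairs.

def stage1_mae_step(vision_encoder, mae_decoder, images, mask_ratio=0.75):
    # Encode only the visible 25% of patches and reconstruct the masked ones,
    # minimizing the pixel reconstruction error on the masked patches (as in MAE).
    visible_tokens, mask, patch_targets = vision_encoder.mask_and_encode(
        images, mask_ratio=mask_ratio)
    reconstruction = mae_decoder(visible_tokens, mask)
    return F.mse_loss(reconstruction[mask], patch_targets[mask])

def stage2_wsl_step(vision_encoder, text_encoder, images, captions, temperature=0.07):
    # CLIP-style contrastive alignment between image and text embeddings.
    img_emb = F.normalize(vision_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(captions), dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(images), device=logits.device)
    # symmetric InfoNCE loss over the batch
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```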
Adapting a pre-trained model by refocusing its attention
Since foundation models are the way to go, finding clever ways to adapt them to various downstream tasks is a critical research avenue.
Researchers from UC Berkeley and Microsoft Research show that this can be achieved with a TOp-down Attention STeering (TOAST) approach in their paper “TOAST: Transfer Learning via Attention Steering”.
Key idea: Given a pretrained ViT backbone, they tune only the additional linear layers of their method, which act as feedback paths after the first forward pass. The model can thus redirect its attention to the task-relevant features and, as shown below, outperform standard fine-tuning (75.2% vs. 60.2% accuracy).
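To make the feedback idea more concrete, here is a heavily simplified sketch of the two-pass mechanism, assuming a timm-style ViT that exposes `patch_embed` and `blocks`; where exactly the feedback signal enters the attention computation differs in the actual TOAST design.

```python
import torch
import torch.nn as nn

class FeedbackSteeredViT(nn.Module):
    """Simplified sketch of top-down attention steering on a frozen ViT.

    Only the per-block feedback projections and the head are trained; this is
    the two-pass idea, not the authors' implementation.
    """

    def __init__(self, backbone, dim, num_classes):
        super().__init__()
        self.backbone = backbone  # frozen, pretrained ViT (timm-style)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # one tunable feedback projection per transformer block
        self.feedback = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in self.backbone.blocks])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # 1st (bottom-up) pass: plain frozen forward, no feedback.
        tokens = self.backbone.patch_embed(x)
        for blk in self.backbone.blocks:
            tokens = blk(tokens)
        top = tokens.mean(dim=1)  # task-relevant top-level summary

        # 2nd (top-down) pass: inject the projected top-level signal into every
        # block's input, redirecting attention toward task-relevant features.
        tokens = self.backbone.patch_embed(x)
        for blk, fb in zip(self.backbone.blocks, self.feedback):
            tokens = blk(tokens + fb(top).unsqueeze(1))
        return self.head(tokens.mean(dim=1))
```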
Image and video segmentation using discrete diffusion generative models
Google DeepMind presented an intriguing work called “A Generalist Framework for Panoptic Segmentation of Images and Videos”.
Key idea: A diffusion model is proposed to model panoptic segmentation masks, with a simple architecture and a generic loss function. For segmentation specifically, we want the class and the instance ID, which are discrete targets. For this reason, Bit Diffusion, which represents discrete values as analog bits, was used.
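As a reminder of the Bit Diffusion trick, here is a small self-contained sketch of how discrete IDs can be encoded as “analog bits” that a continuous diffusion model can operate on; the bit width and helper names are my own choices, not the paper’s code.

```python
import torch

def ints_to_analog_bits(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    """Encode integer labels (e.g. class or instance IDs) as 'analog bits'.

    Each integer becomes `num_bits` binary digits, shifted and scaled to
    {-1.0, +1.0}, so a continuous diffusion model can treat them as real values.
    x: integer tensor with values in [0, 2**num_bits).
    """
    bits = torch.arange(num_bits, device=x.device)
    analog = ((x.unsqueeze(-1) >> bits) & 1).float()  # (..., num_bits) in {0, 1}
    return analog * 2.0 - 1.0                         # map to {-1, +1}

def analog_bits_to_ints(analog: torch.Tensor) -> torch.Tensor:
    """Decode model outputs by thresholding at 0 and re-assembling the integer."""
    bits = (analog > 0).long()
    weights = 2 ** torch.arange(bits.shape[-1], device=analog.device)
    return (bits * weights).sum(dim=-1)

# e.g. a (H, W) map of panoptic IDs -> a (H, W, num_bits) analog-bit "image"
ids = torch.randint(0, 256, (64, 64))
analog = ints_to_analog_bits(ids, num_bits=8)
assert torch.equal(analog_bits_to_ints(analog), ids)
```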
The diffusion model is first pretrained unconditionally to produce segmentation masks, and then the pretrained image encoder and the diffusion model are jointly trained for conditional segmentation.
Crucially, by adding past predictions as a conditioning signal, the method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically.
Diffusion models for stochastic segmentation
In a related work, researchers from the University of Bern showed that categorical diffusion models can be used for stochastic image segmentation in their work titled “Stochastic Segmentation with Conditional Categorical Diffusion Models”.
Key idea: Segmentation is treated as conditional generation: a categorical diffusion model, conditioned on the input image, learns a distribution over label maps. Sampling it several times therefore yields diverse yet plausible segmentations, capturing the ambiguity that comes with multiple annotations.
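For intuition, here is a minimal sketch of the forward (noising) process of a categorical diffusion model on a label map, assuming a D3PM-style uniform transition kernel; the paper’s exact parameterization may differ.

```python
import torch
import torch.nn.functional as F

def q_sample_categorical(labels: torch.Tensor, alpha_bar_t: float, num_classes: int):
    """Forward (noising) step of a categorical diffusion process on a label map.

    With probability alpha_bar_t a pixel keeps its original class; otherwise it
    is resampled uniformly over the K classes (uniform transition kernel).
    labels: (B, H, W) integer segmentation map.
    Returns a noised (B, H, W) label map at timestep t.
    """
    onehot = F.one_hot(labels, num_classes).float()              # (B, H, W, K)
    probs = alpha_bar_t * onehot + (1.0 - alpha_bar_t) / num_classes
    return torch.distributions.Categorical(probs=probs).sample()  # (B, H, W)

# At inference time, the reverse process is conditioned on the image and run
# several times to expose the ambiguity, e.g. (hypothetical helper):
# masks = [reverse_diffusion(model, image) for _ in range(8)]
```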
Diffusion Models as (Soft) Masked Autoencoders
Key idea: In this direction, the paper “Diffusion Models as Masked Autoencoders” proposes conditioning diffusion models on patch-based masked input. In standard diffusion, noising takes place pixel-wise, which can be regarded as soft pixel-wise masking. The masked autoencoder, on the other hand, receives masked patches, a type of hard masking where the pixels are simply zeroed out. By combining the two, the authors formulate diffusion models as masked autoencoders (DiffMAE).
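A rough sketch of what such a training step could look like is given below: visible patches stay clean, masked patches get diffusion noise, and the model is trained to recover the clean masked pixels conditioned on the visible context. The masking logic, toy noise schedule, and model signature are illustrative, not the DiffMAE code.

```python
import torch
import torch.nn.functional as F

def diffmae_style_step(model, images, patchify, mask_ratio=0.75, T=1000):
    """One schematic DiffMAE-style training step (names are placeholders)."""
    patches = patchify(images)                       # (B, N, D) pixel patches
    B, N, D = patches.shape
    num_masked = int(mask_ratio * N)
    ids = torch.rand(B, N, device=patches.device).argsort(dim=1)
    masked_ids = ids[:, :num_masked]                 # indices of masked patches

    # Diffusion-style corruption of the *masked* patches only ("soft masking").
    t = torch.randint(1, T, (B,), device=patches.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2  # toy cosine schedule
    noise = torch.randn(B, num_masked, D, device=patches.device)
    target = torch.gather(patches, 1, masked_ids.unsqueeze(-1).expand(-1, -1, D))
    noised = (alpha_bar.view(B, 1, 1).sqrt() * target +
              (1 - alpha_bar).view(B, 1, 1).sqrt() * noise)

    # The model sees clean visible patches plus noised masked patches and
    # predicts the clean masked pixels (hypothetical signature).
    pred = model(patches, masked_ids, noised, t)
    return F.mse_loss(pred, target)
```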
Denoising Diffusion Autoencoders as Self-supervised Learners
Visual representation learning keeps improving from many different directions, such as supervised learning, natural-language weak supervision, and self-supervised learning. And from now on, with diffusion models as well!
Key idea: The paper “Denoising Diffusion Autoencoders are Unified Self-supervised Learners” found that even standard unconditional diffusion models can be leveraged for representation learning, similar to self-supervised models. More concretely, by pre-training on unconditional image generation, diffusion models already capture linearly separable representations within their intermediate layers, without modifications.
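In practice, this suggests a simple recipe: freeze the diffusion U-Net, hook an intermediate activation at a fixed noise level, and train only a linear classifier on top. The sketch below assumes a `unet(noisy, timesteps)` call signature and uses a crude stand-in for the proper noising step; both are my simplifications, not the paper’s setup.

```python
import torch
import torch.nn as nn

class DiffusionLinearProbe(nn.Module):
    """Linear probe on an intermediate feature map of a frozen diffusion U-Net."""

    def __init__(self, unet, probe_layer: str, feat_dim: int, num_classes: int, t: int = 100):
        super().__init__()
        self.unet = unet.eval()
        for p in self.unet.parameters():
            p.requires_grad = False
        self.t = t
        self.classifier = nn.Linear(feat_dim, num_classes)
        self._feat = None
        # Capture the chosen intermediate activation with a forward hook.
        dict(self.unet.named_modules())[probe_layer].register_forward_hook(
            lambda module, inp, out: setattr(self, '_feat', out))

    def forward(self, images):
        timesteps = torch.full((images.shape[0],), self.t,
                               device=images.device, dtype=torch.long)
        # Lightly noise the input so it resembles the diffusion training regime
        # (a stand-in for the proper q_sample at timestep t).
        noisy = images + 0.1 * torch.randn_like(images)
        with torch.no_grad():
            self.unet(noisy, timesteps)
        feat = self._feat.mean(dim=(2, 3))   # global average pool (B, C)
        return self.classifier(feat)          # only this linear layer is trained
```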
Leveraging DINO Attention Masks to the Maximum
Key idea: The authors propose a simple framework called Cut-and-LEaRn (CutLER). They leverage the ability of self-supervised models to ‘discover’ objects without supervision in their attention maps, and post-process the resulting masks to train a state-of-the-art localization model without any human labels.
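To illustrate where that free object evidence comes from, here is a rough sketch that thresholds DINO’s CLS-token self-attention into a coarse foreground mask. It assumes the official DINO torch.hub entry point and its `get_last_selfattention` helper, and it is far from CutLER’s full MaskCut-plus-self-training pipeline.

```python
import torch

# Self-supervised ViT whose attention maps "discover" objects for free.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').eval()

@torch.no_grad()
def coarse_object_mask(image: torch.Tensor, patch_size: int = 8, keep: float = 0.3):
    """image: (1, 3, H, W) normalized tensor with H, W divisible by patch_size.

    Returns a (H/patch_size, W/patch_size) binary mask covering the patches
    that receive the top `keep` fraction of CLS-token attention.
    """
    attn = model.get_last_selfattention(image)   # (1, heads, N+1, N+1)
    cls_attn = attn[0, :, 0, 1:].mean(0)         # CLS -> patch attention, averaged over heads
    h = image.shape[2] // patch_size
    w = image.shape[3] // patch_size
    cls_attn = cls_attn.reshape(h, w)
    threshold = cls_attn.flatten().quantile(1 - keep)
    return (cls_attn > threshold).float()
```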
Generative Learning on Images: Can’t We Do Better than FID?
In the direction of alternative evaluations of generative models, I really liked the approach of the paper “HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models”, among other existing ones that are mainly based on CLIP and only applicable to text-conditional image generation.
Key idea: Measure image quality (fidelity) through text-to-text alignment using CLIP (the Image Captioner model G(I) in the figure below).
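Here is a minimal sketch of that caption-then-compare idea, using an off-the-shelf captioner as G(I) and CLIP’s text encoder for the comparison; the specific Hugging Face checkpoints are my choice, not necessarily the ones used in HRS-Bench.

```python
import torch
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPTokenizer, CLIPTextModelWithProjection)

captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_to_text_fidelity(generated_image, prompt: str) -> float:
    # 1) Caption the generated image: G(I).
    inputs = cap_processor(images=generated_image, return_tensors="pt")
    caption = cap_processor.decode(
        captioner.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True)
    # 2) Embed the prompt and the caption with CLIP's text encoder and compare
    #    them by cosine similarity (text-to-text alignment).
    tokens = clip_tokenizer([prompt, caption], padding=True, return_tensors="pt")
    emb = clip_text(**tokens).text_embeds
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return (emb[0] @ emb[1]).item()
```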
There are plenty of other high-quality papers presented at ICCV 2023, and I encourage you to explore them as well! This was also my first time at a conference, and I had a great experience. I believe that conferences are valuable opportunities for professionals to stay up to speed with the latest trends in their field.
Email me if you like the paper names and the directions I am following. I can send you my formatted notes.