arXiv Preprint - Vision Transformers Need Registers

In this episode we discuss Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. The paper identifies artifacts in the feature maps of Vision Transformers (ViTs) that appear mainly in low-information background areas of images, and proposes a fix. Adding extra tokens called “registers” to the input sequence yields smoother feature maps and attention maps for downstream visual processing. The fix works for both supervised and self-supervised ViT models and sets a new state of the art for self-supervised visual models on dense visual prediction tasks. Registers also enable object discovery methods to work with larger models.
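
As a rough illustration of the idea, here is a minimal Python sketch of how register tokens could be added to a ViT input sequence and dropped again at the output. It is not the authors' implementation: the helper names (`add_registers`, `split_output`), the token ordering (class token, then registers, then patches), the sizes, and the identity stand-in for the transformer are all assumptions made for illustration. In a real model the register tokens would be learnable parameters.

```python
import random


def add_registers(patch_tokens, cls_token, register_tokens):
    """Build the input sequence: [CLS] + registers + patch tokens.

    The ordering is an assumption for this sketch.
    """
    return [cls_token] + list(register_tokens) + list(patch_tokens)


def split_output(output_tokens, num_registers):
    """Keep the class token and patch tokens; discard the register outputs."""
    cls_out = output_tokens[0]
    patch_out = output_tokens[1 + num_registers:]
    return cls_out, patch_out


# Hypothetical sizes, chosen only to keep the example small.
dim = 8
num_patches = 16
num_registers = 4
rng = random.Random(0)


def make_token():
    """Return a random vector standing in for an embedded token."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


patch_tokens = [make_token() for _ in range(num_patches)]
cls_token = make_token()
# In a real model these would be learned during training.
register_tokens = [make_token() for _ in range(num_registers)]

sequence = add_registers(patch_tokens, cls_token, register_tokens)

# Identity stand-in for the transformer layers (assumption: no real model here).
output = sequence

cls_out, patch_out = split_output(output, num_registers)
```

The point of the sketch is only the bookkeeping: the sequence grows by the number of registers on the way in, and the register outputs are thrown away on the way out, so downstream code still sees one class token and one token per image patch.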
