A few favorite recipes in computer vision & deep learning

This blog post lists a few of my favorite recipes in deep learning in the context of computer vision (as of August 2020).
visual-representation-learning
self-supervised-learning
computer-vision
Published

August 2, 2020

A few days before writing this blog post, I tweeted -

twitter: https://twitter.com/RisingSayak/status/1285866356592635904?s=20


In this blog post, I will expand on this tweet to explain, among other things, why these are my favorite recipes.

The training frameworks I mentioned can be classified into two broad categories: self-supervised learning frameworks (such as SimCLR and SimCLRv2) and supervised ones (such as Supervised Contrastive Learning and BigTransfer).


Note that SimCLR [3] and SimCLRv2 [4] are two separate works.

So, why self-supervised learning anyway?

The field of self-supervised visual representation learning is progressing pretty fast. With recent advancements, the deep learning community has started to consider it an alternative to fully supervised models for tasks like image classification, object detection, image segmentation, etc. If you are unfamiliar with self-supervised learning, check out this blog post by Jeremy Howard. So, why all the fuss around self-supervised visual representation learning?

This is because a self-supervised learning framework can benefit from unlabeled data. Essentially, you would frame a supervised learning task from a large unlabeled corpus and then train a model to learn that task. Notice that we are not using any explicit label information here. Instead, we are using the given data itself to form a supervised learning task - this is why it is self-supervised. You would then take the representations from the model (preferably from the encoder part of the model) and use them for downstream tasks. Representations learned using self-supervised learning frameworks like SimCLRv2 and SwAV [5] transfer quite well to downstream (vision) tasks even with very little labeled data.
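To make this "frame your own supervised task" idea a bit more concrete, here is a tiny TensorFlow sketch (TensorFlow is my assumption for all the code snippets in this post). It uses rotation prediction purely as an illustration of deriving labels from the data itself; it is not what SimCLR or SwAV actually do.

```python
import tensorflow as tf

def make_rotation_pretext(images):
    """Toy pretext task: predict how much a batch of unlabeled (square)
    images was rotated. The labels come for free from the data itself."""
    # One random rotation (0, 90, 180, or 270 degrees) for the whole batch,
    # purely to keep the sketch short; per-image rotations would be used in practice.
    k = tf.random.uniform((), minval=0, maxval=4, dtype=tf.int32)
    rotated = tf.image.rot90(images, k=k)
    labels = tf.fill(tf.shape(images)[:1], k)
    return rotated, labels

# Any classifier trained on (rotated, labels) is learning from unlabeled data.
```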

Colin Raffel beautifully summed up the recent progress in the field of self-supervised learning for computer vision -

twitter: https://twitter.com/colinraffel/status/1289315020199743488?s=20

BYOL [6] and SwAV have even beaten SimCLR -

One could argue that this comparison is with SimCLR and not SimCLRv2. Well, that is because SimCLRv2 is not just about self-supervised learning; it is more than that -

This blog post, however, will only focus on the SimCLR part, i.e., the left-most part of the figure above. If you are interested in knowing more about SimCLRv2, feel free to watch this video by Yannic Kilcher.

Why SimCLR(v2)?

Among all of the techniques mentioned in this post so far, SimCLR is by far my favorite. This is because of its simplicity and its promise to shine even more as one scales up the base architecture and introduces more unlabeled data. Moreover, framework-wise, SimCLR is very consistent with the recipes I would want to see in an effective visual representation learning framework. One of these primary recipes is to maximize the agreement between semantically similar images. SwAV is also capable of doing this; in fact, it is currently the state of the art (as of August 2020) in this domain. But simplicity-wise, SimCLR beats SwAV big time.

At a very high level, SimCLR takes two different views of the same image and tries to maximize the agreement between these two views while minimizing the agreement with views coming from other images. These different views are obtained by applying augmentation operations like random resized crops, horizontal flips, color distortions, etc. (a minimal sketch of this two-view idea follows the list below). Representations learned using this framework (and any self-supervised visual representation learning framework in general) can be used in different flavors -

  • You may have loads of unlabeled data and limited labeled data for your problem. You could use the unlabeled data with SimCLR to obtain effective representations and then use them for a downstream task that relies on the limited labeled data.
  • Representations obtained from datasets like ImageNet using SimCLR can be used in regular transfer learning settings.
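Here is the minimal sketch of the two-view idea mentioned above. The specific augmentation operations and magnitudes below are placeholders for illustration, not the exact policy from the SimCLR paper.

```python
import tensorflow as tf

def augment(image):
    """One stochastic view of `image` (assumed float32 in [0, 1], at least 96x96x3).
    Augmentation strengths here are illustrative, not SimCLR's exact settings."""
    image = tf.image.random_crop(image, size=(96, 96, 3))   # crop ...
    image = tf.image.resize(image, (224, 224))              # ... then resize back
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.4)
    image = tf.image.random_contrast(image, lower=0.6, upper=1.4)
    return tf.clip_by_value(image, 0.0, 1.0)

def two_views(image):
    """A SimCLR-style positive pair: two independent augmentations of the same image."""
    return augment(image), augment(image)

# Example with a tf.data pipeline of unlabeled images:
# views = unlabeled_ds.map(two_views, num_parallel_calls=tf.data.experimental.AUTOTUNE)
```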

In SimCLR, the loss function (the normalized temperature-scaled cross-entropy loss, NT-Xent) operates directly on the features computed by the projection head (the MLP part). This makes SimCLR a compute-intensive framework.
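For reference, a compact sketch of that loss is shown below, assuming the projection-head outputs of the two views arrive as two aligned batches z_a and z_b. The official SimCLR code handles additional details (such as gathering features across accelerators) that I leave out.

```python
import tensorflow as tf

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent over a batch of projection-head outputs.
    z_a, z_b: (N, D) projections of the two views; row i of each forms a
    positive pair. The temperature is a tunable hyperparameter."""
    z_a = tf.math.l2_normalize(z_a, axis=1)
    z_b = tf.math.l2_normalize(z_b, axis=1)
    z = tf.concat([z_a, z_b], axis=0)                      # (2N, D)
    sim = tf.matmul(z, z, transpose_b=True) / temperature  # scaled cosine similarities
    n = tf.shape(z_a)[0]
    # Push self-similarities to a large negative value so they are ignored.
    sim = sim - tf.eye(2 * n) * 1e9
    # For row i in the first half, the positive sits at index i + N, and vice versa.
    labels = tf.concat([tf.range(n) + n, tf.range(n)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, sim, from_logits=True
    )
    return tf.reduce_mean(loss)
```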

On the other hand, SwAV operates by assigning the encoded representations of different views of the same image to clusters. The clusters are assigned by maintaining a differentiable codebook of prototypes for the different types of images present in the given dataset. Training-wise, SwAV tries to maximize the agreement between the cluster assignments of semantically similar images. Operating on cluster assignments rather than directly on the encoded representations is less compute-intensive.

SwAV might appear to be a simpler framework than SimCLR, but there are a number of different pieces to look after here (a sketch of the Sinkhorn-Knopp step follows the list):

  • Maintaining a differentiable prototype bank
  • Optimal transportation of the representations to form soft codes using the Sinkhorn-Knopp algorithm
  • Multi-crop data augmentation policy
  • Swapped prediction problem
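Of these pieces, the Sinkhorn-Knopp step is probably the least familiar one, so here is a rough sketch of it assuming a (batch × prototypes) matrix of similarity scores between normalized features and normalized prototypes. The official SwAV repository [8] is the authoritative version and also handles distributed training and feature queues, which I skip here.

```python
import tensorflow as tf

def sinkhorn(scores, n_iters=3, epsilon=0.05):
    """Sketch of Sinkhorn-Knopp as used in SwAV: turn (batch x prototypes)
    similarity scores into soft assignments whose mass is balanced across
    prototypes. Scores are assumed to be normalized dot products (roughly in
    [-1, 1]) so the exponent stays bounded; n_iters/epsilon are typical values."""
    q = tf.exp(scores / epsilon)
    q = q / tf.reduce_sum(q)
    n_rows = tf.cast(tf.shape(q)[0], q.dtype)      # batch size
    n_cols = tf.cast(tf.shape(q)[1], q.dtype)      # number of prototypes
    for _ in range(n_iters):
        # Normalize columns: each prototype receives roughly equal total mass.
        q = q / (n_cols * tf.reduce_sum(q, axis=0, keepdims=True))
        # Normalize rows: each sample's assignment sums to 1 / n_rows.
        q = q / (n_rows * tf.reduce_sum(q, axis=1, keepdims=True))
    return q * n_rows   # rows now sum to 1: soft codes per sample
```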

When working in practical scenarios, we often need to maintain a trade-off between technical feasibility and performance. To me, SimCLR makes the cut in terms of technical feasibility.

Update: Ayush and I, with a generous amount of help from Mathilde Caron (the first author of SwAV), were finally able to minimally implement SwAV after realizing the improvements it brings to the table.

Returning to supervised learning

Given the almighty prowess of self-supervised learning frameworks, why even bother with supervised regimes?

Labels + contrastive loss = win-win

Supervised Contrastive Learning [1] addresses a very important limitation of self-supervised learning frameworks like SimCLR. In SimCLR, the positive pairs are generated by taking different views of the same image, and the negative pairs are then randomly sampled from the other images present in a batch.

Here are some examples of negative pairs -

As discussed in the Supervised Contrastive Learning paper, this method of contrasting two different views of the same image can result in false negatives, i.e., samples belonging to the same class might get pushed apart in the embedding space. There is no way for us to properly mitigate this issue without having access to the original labels. Hence my earlier point: if you have loads of labeled images, it's better to use Supervised Contrastive Learning to capture meaningful representations.

Supervised Contrastive Learning extends how we train supervised models by introducing a two-stage training framework. In the first stage, it uses the label information in the contrastive loss to learn to map the encoded representations effectively. In the second stage, it trains a linear model on top of these encoded representations for the given supervised training objective.
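Here is a minimal sketch of what that first-stage loss could look like. The only conceptual change from NT-Xent above is that positives are now defined by shared labels rather than by being views of the same image; treat this as an illustration, not the reference implementation from [1].

```python
import tensorflow as tf

def supcon_loss(features, labels, temperature=0.1):
    """Sketch of a supervised contrastive loss: every pair of samples that
    shares a label is treated as a positive pair.
    features: (N, D) projections, labels: (N,) integer class ids."""
    features = tf.math.l2_normalize(features, axis=1)
    logits = tf.matmul(features, features, transpose_b=True) / temperature
    n = tf.shape(features)[0]
    diag = tf.eye(n)                                   # mask for i == j
    labels = tf.reshape(labels, (-1, 1))
    same_label = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32)
    positives = same_label - diag                      # same class, different sample
    # Log-softmax over all other samples (the anchor itself is excluded).
    logits = logits - 1e9 * diag
    log_prob = logits - tf.reduce_logsumexp(logits, axis=1, keepdims=True)
    # Average log-probability of the positives for each anchor.
    n_pos = tf.maximum(tf.reduce_sum(positives, axis=1), 1.0)
    loss = -tf.reduce_sum(positives * log_prob, axis=1) / n_pos
    return tf.reduce_mean(loss)
```

In the second stage, the encoder trained with this loss is frozen and a linear classifier is fit on top of its representations with the usual cross-entropy objective.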

In practice this works quite well -

You might be feeling very tempted to try this out on your labeled dataset. Truth be told, Supervised Contrastive Learning is also compute-intensive, even for relatively small datasets. So, if you don't have the training budget, you might need to reconsider this option.

Maybe you have a relatively small labeled dataset and you just want to be able to fine-tune a good enough architecture, which is still very useful in many practical scenarios. This is where BigTransfer can really shine.

Returning to supervised transfer learning

BigTransfer [2] belongs to the classic family of supervised pre-training followed by transferring the learned representations to downstream tasks. ImageNet has been the dataset of choice for this kind of pre-training for a long time, but in BigTransfer the authors use larger datasets such as ImageNet-21k and JFT along with ImageNet. To make use of these larger datasets, they scale up the model architectures and pre-train for longer. Their results speak for themselves -

To eliminate the dependence on batch statistics, the authors make use of Group Normalization and Weight Standardization. Personally, I really liked this recipe: training larger models at scale typically means using large batch sizes, and relying on Batch Normalization there could have easily hurt the performance of the models on the downstream tasks.
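The combination is straightforward to sketch in Keras. The snippet below assumes TensorFlow Addons for GroupNormalization, and WSConv2D is my own minimal take on Weight Standardization (standardizing the kernel over its fan-in dimensions before the convolution), not the authors' code.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # assumed, for GroupNormalization

class WSConv2D(tf.keras.layers.Conv2D):
    """Conv2D with Weight Standardization: the kernel is standardized
    (zero mean, unit variance per output channel) on every call."""

    def call(self, inputs):
        mean, var = tf.nn.moments(self.kernel, axes=[0, 1, 2], keepdims=True)
        standardized = (self.kernel - mean) / tf.sqrt(var + 1e-5)
        outputs = tf.nn.conv2d(inputs, standardized,
                               strides=list(self.strides),
                               padding=self.padding.upper())
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        return self.activation(outputs) if self.activation is not None else outputs

def conv_gn_block(filters):
    """A BiT-style block: weight-standardized conv + GroupNorm + ReLU."""
    return tf.keras.Sequential([
        WSConv2D(filters, kernel_size=3, padding="same", use_bias=False),
        tfa.layers.GroupNormalization(groups=32),
        tf.keras.layers.ReLU(),
    ])
```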

For fine-tuning, the authors propose a heuristics-based BiT-HyperRule, which provides instructions on what augmentation policy to use, how many steps to train for, what learning rate schedule to use, etc.

![BiT-HyperRule (Source: BigTransfer blog post [10])](https://github.com/sayakpaul/portfolio/raw/master/posts/bit_hyperrule.png)
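As a rough paraphrase of the schedule portion of the rule shown above (this is my reading of the paper [2] and the blog post [10]; double-check the exact thresholds and values there):

```python
def bit_hyperrule(num_examples):
    """My paraphrase of the schedule part of BiT-HyperRule; verify the exact
    thresholds and step counts against [2] and [10].
    Returns (number of fine-tuning steps, whether to apply MixUp)."""
    if num_examples < 20_000:        # "small" tasks
        return 500, False
    elif num_examples < 500_000:     # "medium" tasks
        return 10_000, True
    else:                            # "large" tasks
        return 20_000, True

# The learning rate is then decayed in steps (at roughly 30%, 60%, and 90%
# of training, per my reading of the references above).
```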

I found this strategy simple enough to implement and put into practice.

Resources

If you are interested in applying these techniques in your own work, you may find the following resources helpful -

  • A Colab Notebook by the authors of SimCLRv2 that shows how to fine-tune with SimCLRv2.
  • A report by Sweta Shaw and myself that walks through Supervised Contrastive Learning along with Colab Notebooks.
  • A tutorial on BigTransfer by the authors of BigTransfer.

Here is a list of some other interesting works on transfer learning for computer vision -

twitter: https://twitter.com/RisingSayak/status/1285119290895548417

Acknowledgments

Amit’s visual guide on SimCLR and Thalles’s exploratory analysis on SimCLR helped me in developing a thorough understanding of SimCLR.

Thanks to Yannic Kilcher for his explanation videos on SimCLRv2 and Supervised Contrastive Learning. Those made the learning process smoother.

Thanks to Ting Chen (SimCLR author) for providing me with additional pointers on self-supervised learning in general.

Thanks to Jeremy Howard for his fast.ai lectures that continue to help me in approaching deep learning recipes with more common sense and practicality.

References

  1. Khosla, Prannay, et al. “Supervised Contrastive Learning.” ArXiv:2004.11362 [Cs, Stat], Apr. 2020. arXiv.org, http://arxiv.org/abs/2004.11362.
  2. Kolesnikov, Alexander, et al. “Big Transfer (BiT): General Visual Representation Learning.” ArXiv:1912.11370 [Cs], May 2020. arXiv.org, http://arxiv.org/abs/1912.11370.
  3. Chen, Ting, Simon Kornblith, Mohammad Norouzi, et al. “A Simple Framework for Contrastive Learning of Visual Representations.” ArXiv:2002.05709 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2002.05709.
  4. Chen, Ting, Simon Kornblith, Kevin Swersky, et al. “Big Self-Supervised Models Are Strong Semi-Supervised Learners.” ArXiv:2006.10029 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2006.10029.
  5. Caron, Mathilde, et al. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments.” ArXiv:2006.09882 [Cs], July 2020. arXiv.org, http://arxiv.org/abs/2006.09882.
  6. Grill, Jean-Bastien, et al. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” ArXiv:2006.07733 [Cs, Stat], June 2020. arXiv.org, http://arxiv.org/abs/2006.07733.
  7. “Advancing Self-Supervised and Semi-Supervised Learning with SimCLR.” Google AI Blog, http://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html.
  8. Facebook Research. SwAV (facebookresearch/swav). GitHub, 2020, https://github.com/facebookresearch/swav.
  9. Exploring SimCLR: A Simple Framework for Contrastive Learning of Visual Representations - Thalles’ Blog. https://sthalles.github.io/simple-self-supervised-learning/.
  10. BigTransfer (BiT): State-of-the-Art Transfer Learning for Computer Vision. https://blog.tensorflow.org/2020/05/bigtransfer-bit-state-of-art-transfer-learning-computer-vision.html.