Computer Vision

Computer Vision covers everything related to the Camera. In recent years, however, LiDAR technology has advanced, so the more general problem is now referred to as Perception.

See Camera for historical context.

How can humans actually see?

I really enjoyed Cyrill Stachniss's course, which taught me this material really well. The first lecture on cameras was particularly helpful.



Reading Group

Vision Transformer (ViT)

Masked Autoencoders

In 2D, we almost always use a CNN, and sometimes a Transformer.



  • COCO (over 200,000 labeled images)

I heard somewhere that you usually don’t train an entire CNN from scratch, since that requires millions of labeled examples that you don’t have. Instead, you build on a pretrained backbone network and adapt it to your own task. YES! I found it: this is the idea of Transfer Learning <- See page for actual results
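To make the "build on a backbone" idea concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: `frozen_backbone` is a hand-written stand-in for a pretrained feature extractor (in practice this would be something like a CNN pretrained on a large dataset), and `train_head` is a hypothetical helper that fits a small logistic-regression head on top. The point it tries to show is only that the backbone is never updated, while the small head is trained on your own labels.

```python
import math
import random


def frozen_backbone(x):
    """Stand-in for a pretrained feature extractor; it is never updated."""
    # Hypothetical fixed features; the last entry acts as a bias feature.
    return [x[0] + x[1], x[0] - x[1], 1.0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, labels, epochs=200, lr=0.5):
    """Train only a logistic-regression head on top of the frozen features."""
    # The backbone is frozen, so its features can be computed once up front.
    features = [frozen_backbone(x) for x in data]
    weights = [0.0] * len(features[0])
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, f)))
            # Gradient of the log loss w.r.t. the head weights is (p - y) * f.
            weights = [w - lr * (p - y) * v for w, v in zip(weights, f)]
    return weights


def predict(weights, x):
    f = frozen_backbone(x)
    score = sum(w * v for w, v in zip(weights, f))
    return 1 if sigmoid(score) >= 0.5 else 0


random.seed(0)
# Tiny made-up task: the label is 1 when the first coordinate is larger.
data = [(random.random(), random.random()) for _ in range(40)]
labels = [1 if a > b else 0 for a, b in data]

head = train_head(data, labels)
accuracy = sum(predict(head, x) == y for x, y in zip(data, labels)) / len(data)
```

Only the three head weights are learned here, which is why a small labeled set can be enough. With a real pretrained network the same split applies: keep the backbone weights fixed (or fine-tune them gently) and train a new final layer for your classes.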

Tips for doing well on benchmarks/winning competitions

Taken from CS231n course.

  • Ensembling (rarely used in production because it is too computationally expensive)
    • Train several networks (3-15 networks) independently and average their outputs
  • Multi-crop at test time
    • Run classifier on multiple versions of test images and average results (ensemble)
  • Use network architectures published in the literature
  • Use open source implementations if possible (because they have figured out the finicky details, like learning rate settings)
  • Use pretrained models and fine-tune on your dataset
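
The ensembling and multi-crop tips both come down to averaging class probabilities, which can be sketched in plain Python. This is a minimal illustration, not how CS231n implements it: `make_model` builds hypothetical "classifiers" that only threshold mean brightness, images are plain lists of rows, and the five-crop scheme (four corners plus center) is one common choice that I am assuming here.

```python
def average_probs(prob_lists):
    """Element-wise mean of several class-probability vectors."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


def ensemble_predict(models, image):
    """Ensembling: average the outputs of independently trained models."""
    return average_probs([model(image) for model in models])


def five_crops(image, size):
    """Four corner crops plus the center crop of a 2D image (list of rows)."""
    h, w = len(image), len(image[0])
    corners = [
        (0, 0),
        (0, w - size),
        (h - size, 0),
        (h - size, w - size),
        ((h - size) // 2, (w - size) // 2),
    ]
    return [[row[left:left + size] for row in image[top:top + size]]
            for top, left in corners]


def multi_crop_predict(model, image, size):
    """Multi-crop: run one model on several crops and average the results."""
    return average_probs([model(crop) for crop in five_crops(image, size)])


def make_model(threshold):
    """Hypothetical classifier: outputs [p_dark, p_bright] from mean brightness."""
    def model(image):
        pixels = [p for row in image for p in row]
        p_bright = 1.0 if sum(pixels) / len(pixels) > threshold else 0.0
        return [1.0 - p_bright, p_bright]
    return model


# Made-up 4x4 image: bright left half, dark right half.
image = [[0.9, 0.9, 0.1, 0.1] for _ in range(4)]

models = [make_model(t) for t in (0.2, 0.4, 0.8)]
ensemble_probs = ensemble_predict(models, image)
crop_probs = multi_crop_predict(make_model(0.4), image, size=2)
```

The cost is visible in the structure: ensembling runs every model once per image, and multi-crop runs one model once per crop, so inference time scales with the number of models or crops.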