Computer Vision
Computer Vision covers everything related to the Camera. In recent years, however, advances in LiDAR technology mean the more general problem is often referred to as Perception.
See Camera for historical context.
How can humans actually see?
I really enjoyed Cyrill Stachniss's course, which taught me this stuff really well. The first lecture on cameras was particularly helpful.
Resources
- Courses
- CS231N by Stanford
- Deep Learning for Computer Vision by Justin Johnson
- Deep Learning Specialization by Andrew Ng
- Slides by F1TENTH are a good overview of Classical Methods and Deep Learning Methods (a video version is also available online)
- These slides from CMU's 16-385 Computer Vision course: https://www.cs.cmu.edu/~16385/s17/
Concepts
- Scale-Invariant Feature Transform (SIFT)
- Image Processing
- Feature Detection
- Semantic Segmentation
- Object Detection
- Jaccard Index
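As a quick illustration of the Jaccard Index (Intersection over Union) from the list above, here is a minimal sketch for two axis-aligned bounding boxes; the box format `(x1, y1, x2, y2)` is an assumption for illustration:

```python
# Jaccard Index (IoU) for two axis-aligned boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes don't overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # partial overlap: 25 / 175
```

This is the metric behind object detection benchmarks: a predicted box usually counts as correct only if its IoU with the ground-truth box exceeds a threshold (often 0.5).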
Reading Group
Vision Transformer (ViT)
- An image is worth 16x16 words paper
- Image Super-Resolution (SRCNN) https://arxiv.org/abs/1501.00029
Masked Autoencoders
For 2D images, CNNs are the standard architecture, with Transformers (e.g. ViT) increasingly common.
PV-RCNN
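The ViT idea above ("an image is worth 16x16 words") boils down to cutting the image into fixed-size patches and flattening each patch into a token vector, which the Transformer then treats like words. A toy pure-Python sketch, using a hypothetical 8x8 grayscale image and 4x4 patches instead of ViT's 224x224 / 16x16:

```python
# Split an image (list of rows) into non-overlapping patch x patch blocks,
# flattening each block into one vector -- ViT's tokenization step.
def image_to_patches(img, patch):
    h, w = len(img), len(img[0])
    patches = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            # Flatten one patch x patch block into a single vector
            flat = [img[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(flat)
    return patches

img = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 image
patches = image_to_patches(img, 4)
print(len(patches), len(patches[0]))  # prints "4 16": 4 patches of 16 pixels
```

In the real model each flattened patch is then linearly projected to the embedding dimension and a position embedding is added before the Transformer encoder.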
Datasets
- COCO (~200,000 labeled images)
I heard somewhere that you usually don't train an entire CNN from scratch, since that requires millions of labeled examples that you don't have. Instead, you build on a pretrained backbone network and adapt it to your own task. YES! I found it: this is the idea of Transfer Learning <- See page for actual results
Tips for doing well on benchmarks/winning competitions
Taken from CS231n course.
- Ensembling (this is rarely used in production because it is too computationally expensive)
- Train several networks (3-15 networks) independently and average their outputs
- Multi-crop at test time
- Run the classifier on multiple versions of the test images and average the results (ensemble)
- Use architectures of networks published in the literature
- Use open source implementations if possible (because they have figured out the finicky details, like learning rate parameters)
- Use pretrained models and fine-tune on your dataset
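The ensembling and multi-crop tips above share one mechanism: run several predictors (or several views of the input) and average the class probabilities. A toy sketch; the three stand-in "models" are hypothetical lambdas returning fixed probabilities:

```python
# Test-time ensembling: average the softmax outputs of several models.
def ensemble_predict(models, x):
    n_classes = len(models[0](x))
    avg = [0.0] * n_classes
    for m in models:
        probs = m(x)  # each model returns class probabilities for input x
        for i, p in enumerate(probs):
            avg[i] += p / len(models)
    return avg

# Three stand-in classifiers that disagree slightly on a 3-class problem
models = [
    lambda x: [0.7, 0.2, 0.1],
    lambda x: [0.6, 0.3, 0.1],
    lambda x: [0.5, 0.2, 0.3],
]
print(ensemble_predict(models, None))  # averaged probabilities, class 0 wins
```

Multi-crop testing is the same averaging, but the "models" are one network applied to different crops/flips of the test image.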