Bird's-Eye View

Cameras are pointed outward in different directions around the car.

This is super cool. Just projecting out the images does not work very well (it works over short distances, but definitely not over long distances).
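To make the "just project it" idea concrete, here's a tiny inverse-perspective-mapping sketch (my own toy example, with made-up camera intrinsics/extrinsics, not anything from a paper): it back-projects pixels onto a flat ground plane via a homography, which is exactly the assumption that breaks down at long range.

```python
import numpy as np

# Minimal sketch of "just project the image" (inverse perspective mapping).
# All camera parameters below are hypothetical. The mapping assumes a
# perfectly flat ground plane, which is why it degrades badly at range:
# far away, a one-pixel error corresponds to many meters on the ground.
K = np.array([[1000.0,    0.0, 640.0],      # intrinsics (fx, fy, cx, cy)
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
pitch = np.deg2rad(10.0)                    # camera tilted slightly down
R = np.array([[1.0, 0.0,            0.0],
              [0.0, np.cos(pitch), -np.sin(pitch)],
              [0.0, np.sin(pitch),  np.cos(pitch)]])
t = np.array([0.0, 1.5, 0.0])               # ~1.5 m above the ground plane

# Homography from ground-plane coords (x, y, 1) to homogeneous pixel coords.
H = K @ np.column_stack((R[:, 0], R[:, 1], t))

def pixel_to_ground(u, v):
    """Back-project a pixel onto the assumed flat ground plane (z = 0)."""
    g = np.linalg.inv(H) @ np.array([u, v, 1.0])
    return g[:2] / g[2]

# Ground-plane coordinates of an image pixel (axis conventions are arbitrary here).
print(pixel_to_ground(640.0, 500.0))
```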

Stitching together the predictions from the different cameras is non-trivial.

Tesla: occupancy tracker

Camera to BEV Transform

Taking notes: you should first understand how camera transforms work.

They basically generate a depth distribution over each pixel, guessing how far away the car is. Combined with map segmentation, they are able to predict the depth of a particular car. BEVFusion is a very promising paper, though I don't know how well it will run in real time.
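Here's a minimal sketch of the per-pixel depth-distribution ("lift") idea as I understand it from LSS/BEVFusion. The module name, channel counts, and number of depth bins are all made up for illustration; this is not the actual BEVFusion code.

```python
import torch
import torch.nn as nn

class DepthLift(nn.Module):
    """Sketch of the LSS-style lift step (hypothetical sizes).

    For each image feature pixel, predict a categorical distribution over D
    discrete depth bins, then place that pixel's feature at every depth bin,
    weighted by the predicted probability.
    """
    def __init__(self, in_channels=256, feat_channels=64, depth_bins=60):
        super().__init__()
        self.depth_bins = depth_bins
        self.feat_channels = feat_channels
        # One 1x1 conv predicts depth logits and context features jointly.
        self.head = nn.Conv2d(in_channels, depth_bins + feat_channels, 1)

    def forward(self, x):                                 # x: (B, C_in, H, W)
        out = self.head(x)
        depth_logits = out[:, : self.depth_bins]          # (B, D, H, W)
        feats = out[:, self.depth_bins :]                 # (B, C, H, W)
        depth_prob = depth_logits.softmax(dim=1)
        # Outer product: a (B, C, D, H, W) frustum of depth-weighted features.
        frustum = depth_prob.unsqueeze(1) * feats.unsqueeze(2)
        return frustum

lift = DepthLift()
frustum = lift(torch.randn(1, 256, 32, 88))               # -> (1, 64, 60, 32, 88)
```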

The BEVFusion paper takes inspiration from two other models:

I think M^2BEV is really similar to BEVFusion and quite new; it's by people at UofT, and it builds on top of their LSS project.

What makes BEV Segmentation hard? https://www.youtube.com/watch?v=oL5ISk6BnDE&ab_channel=JonahPhilion

  • So what he is saying is that you can approach it just like a traditional semantic segmentation problem, where the input is multiple camera images stacked as channels and the output is the BEV map (see the sketch after this list)
  • That works, but if your camera moves a little bit, it doesn’t work anymore. You want these models to generalize
  • LSS is robust to calibration error
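For reference, here's what that naive formulation looks like as code (toy architecture, all layer sizes made up): stack the cameras as input channels and regress the BEV map directly. Nothing in it sees the calibration, which is why it breaks when a camera moves.

```python
import torch
import torch.nn as nn

class NaiveBEVSeg(nn.Module):
    """Sketch of the naive approach: BEV segmentation as plain semantic segmentation."""
    def __init__(self, num_cams=6, num_classes=4, bev_size=200):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * num_cams, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(25),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(size=bev_size, mode="bilinear", align_corners=False),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, imgs):                    # imgs: (B, N, 3, H, W)
        b, n, c, h, w = imgs.shape
        x = imgs.reshape(b, n * c, h, w)        # stack cameras as channels
        return self.decoder(self.encoder(x))    # (B, num_classes, 200, 200)

# The problem: nothing here knows the camera calibration, so if a camera is
# re-mounted or the rig changes, the learned mapping no longer applies.
model = NaiveBEVSeg()
bev_logits = model(torch.randn(2, 6, 3, 224, 400))
```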

BEVFusion Paper

Wow, actually a lot of the things in this paper seem to really generalize; it's the same approach as Tesla's.

It’s actually a really powerful architecture, though there is no radar. They extract LiDAR and camera features separately, convert each to BEV, and then merge them into a common BEV feature representation (I don’t know exactly how the merging works).

  • For LiDAR, this is straightforward since you know the absolute positioning of each point
    • To convert to BEV, you just squash the Z-axis (see the sketch after this list)
  • For Camera, they have a Camera-To-BEV transformation that is super efficient
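Rough sketch of the LiDAR-to-BEV step (my own toy version, with made-up grid extents and resolution): bin each point into an (x, y) cell and squash the z-axis, here by keeping the max height per cell (one simple choice).

```python
import numpy as np

def lidar_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), res=0.5):
    """points: (N, 3) array of x, y, z in the ego frame (toy example)."""
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((ny, nx), dtype=np.float32)   # empty cells stay 0

    # Keep only points that fall inside the grid.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    xi = ((pts[:, 0] - x_range[0]) / res).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / res).astype(int)
    # Squash the z-axis: keep the maximum point height in each (x, y) cell.
    np.maximum.at(bev, (yi, xi), pts[:, 2])
    return bev

bev = lidar_to_bev(np.random.uniform(-60, 60, size=(10000, 3)))
```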

Model Backbones:

  • Swin Transformer for image backbone
    • Uses an FPN so you can use multi-scale camera features (see the sketch after this list)
  • VoxelNet for LiDAR backbone
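Quick sketch of the backbone + FPN idea (stand-in random feature maps instead of an actual Swin; the channel counts and resolutions are illustrative): the FPN takes the backbone's multi-scale features and gives every scale the same channel width with merged context.

```python
import torch
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

# Stand-in multi-scale backbone features (what a Swin-like backbone would emit).
backbone_feats = OrderedDict([
    ("stage1", torch.randn(1, 96, 64, 176)),    # high-res, low-level
    ("stage2", torch.randn(1, 192, 32, 88)),
    ("stage3", torch.randn(1, 384, 16, 44)),
    ("stage4", torch.randn(1, 768, 8, 22)),     # low-res, high-level
])

fpn = FeaturePyramidNetwork(in_channels_list=[96, 192, 384, 768], out_channels=256)
multi_scale = fpn(backbone_feats)               # every level now has 256 channels
for name, feat in multi_scale.items():
    print(name, tuple(feat.shape))
```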

Then, you attach different heads for the different tasks.
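Something like this, I think: a shared BEV feature map with separate lightweight heads per task. These toy heads are stand-ins, not the actual BEVFusion heads.

```python
import torch
import torch.nn as nn

class BEVHeads(nn.Module):
    """Sketch of task-specific heads on shared BEV features (hypothetical sizes)."""
    def __init__(self, bev_channels=256, num_det_classes=10, num_seg_classes=6):
        super().__init__()
        # Detection head: per-cell class heatmap plus box regression values.
        self.det_cls = nn.Conv2d(bev_channels, num_det_classes, 1)
        self.det_box = nn.Conv2d(bev_channels, 7, 1)   # x, y, z, w, l, h, yaw
        # Map-segmentation head: per-cell class logits.
        self.seg = nn.Conv2d(bev_channels, num_seg_classes, 1)

    def forward(self, bev):                            # bev: (B, C, H, W)
        return {"heatmap": self.det_cls(bev),
                "boxes": self.det_box(bev),
                "seg": self.seg(bev)}

heads = BEVHeads()
outputs = heads(torch.randn(1, 256, 180, 180))
```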