Visual SLAM

Visual SLAM uses cameras to simultaneously localize a robot and construct a map of its environment in real time.

Tesla doesn’t seem to use visual SLAM. Instead, it does planning using an Occupancy Network.

The core idea: you try to figure out how features from different frames align, and that alignment tells you how the camera moved.
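A minimal numpy sketch of that alignment idea: given already-matched 2D keypoints from two frames, the closed-form Kabsch/Umeyama solution recovers the rigid transform between them. The point cloud, rotation, and translation below are made up for illustration.

```python
import numpy as np

def align_features(pts_a, pts_b):
    """Estimate the rigid transform (R, t) mapping pts_a onto pts_b.

    pts_a, pts_b: (N, 2) arrays of matched keypoint coordinates.
    Uses the closed-form Kabsch/Umeyama least-squares solution.
    """
    mu_a, mu_b = pts_a.mean(axis=0), pts_b.mean(axis=0)
    A, B = pts_a - mu_a, pts_b - mu_b          # center both point sets
    U, _, Vt = np.linalg.svd(A.T @ B)          # SVD of the cross-covariance
    # Correct for reflections so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_b - R @ mu_a
    return R, t

# Synthetic example: rotate a made-up point set by 30 degrees and shift it.
rng = np.random.default_rng(0)
pts_a = rng.uniform(0, 100, size=(50, 2))
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
pts_b = pts_a @ R_true.T + np.array([5.0, -2.0])

R, t = align_features(pts_a, pts_b)
print(np.round(R, 3))  # recovers the 30-degree rotation
print(np.round(t, 3))  # recovers the translation [5, -2]
```

In a real frontend the matches come from a feature detector/descriptor (e.g. ORB) plus a matcher, and the transform is 3D with outlier rejection (RANSAC), but the "where do the features align" computation has this shape.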

Visual SLAM Concepts

Visual SLAM Implementations

Classical Visual SLAM Stack

Typical Visual SLAM Workflow

A typical visual SLAM workflow includes the following steps:

  1. Sensor data acquisition (cameras, and optionally motor encoders, IMUs, etc.).
  2. Visual odometry (VO, the frontend): estimates the camera motion between adjacent frames (ego-motion) and generates a rough local map.
  3. Backend filtering/optimization: receives poses from VO and loop closing, then optimizes them, e.g. via bundle adjustment, to generate a fully optimized trajectory and map.
  4. Loop closing: determines whether the robot has returned to a previously visited position, in order to reduce accumulated drift. If a loop is detected, that constraint is passed to the backend for further optimization.
  5. Reconstruction (optional): constructs a task-specific map based on the estimated camera trajectory.
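The workflow above can be sketched as a per-frame processing loop. All names here are hypothetical placeholders with stubbed logic, not the API of any real SLAM library:

```python
from dataclasses import dataclass, field

@dataclass
class SlamState:
    poses: list = field(default_factory=list)      # estimated camera poses
    local_map: list = field(default_factory=list)  # rough landmarks from VO
    loops: list = field(default_factory=list)      # detected loop-closure pairs

def visual_odometry(state, frame):
    # Frontend: estimate ego-motion between adjacent frames.
    # Stub: pretend each frame moves the camera one unit forward.
    prev = state.poses[-1] if state.poses else 0.0
    state.poses.append(prev + 1.0)

def detect_loop(state):
    # Loop closing: has the robot returned near a previous pose?
    # Stub: declare a loop whenever the trajectory reaches pose 5.
    if state.poses and state.poses[-1] == 5.0:
        state.loops.append((0, len(state.poses) - 1))

def backend_optimize(state):
    # Backend: bundle adjustment / pose-graph optimization would go here.
    pass

# 1. Sensor data acquisition (stubbed as a sequence of frame ids).
state = SlamState()
for frame in range(10):
    visual_odometry(state, frame)  # 2. frontend
    detect_loop(state)             # 4. loop closing
    backend_optimize(state)        # 3. backend optimization
# 5. (optional) reconstruction from state.poses + state.local_map

print(len(state.poses), state.loops)
```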

The frontend is closer to computer vision research (image feature extraction and matching), while the backend is closer to the state estimation research area.

Backend vs. frontend?

Like I think the frontend also does local mapping, because estimating over just two keyframes is too noisy. So we need to window the estimate over multiple keyframes.
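A minimal sketch of that windowing idea: keep a fixed-size sliding window of recent keyframes and run local optimization only over those. The window length and keyframe-insertion policy here are made-up values, not from any particular system:

```python
from collections import deque

WINDOW_SIZE = 7    # made-up window length
KEYFRAME_GAP = 3   # insert a keyframe every N frames (made-up policy)

# deque(maxlen=...) drops the oldest keyframe automatically when full.
window = deque(maxlen=WINDOW_SIZE)

for frame_id in range(30):
    if frame_id % KEYFRAME_GAP == 0:
        window.append(frame_id)
        # In a real frontend, local bundle adjustment would now refine
        # the poses and landmarks of just the keyframes in `window`.

print(list(window))  # the 7 most recent keyframes
```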

SLAM Formalization

See SLAM Formalization.


I think Sachin showed me this image instead