Sensor Fusion

Sensor fusion combines two or more data sources to build a better understanding of the system than any single source gives on its own.

Factors	Camera	LiDAR	Radar
Range	~	~	$✓$
Resolution	$✓$	~	$\times$
Distance Accuracy	~	$✓$	$✓$
Velocity	~	$\times$	$✓$
Color Perception (e.g. traffic lights)	$✓$	$\times$	$\times$
Object Detection	~	$✓$	$✓$
Object Classification	$✓$	~	$\times$
Lane Detection	$✓$	$\times$	$\times$
Obstacle Edge Detection	$✓$	$✓$	$\times$
Illumination Conditions	$\times$	$✓$	$✓$
Weather Conditions	$\times$	~	$✓$
Source: Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review

Personal Thoughts about challenges:

Creating ground-truth labels for multiple modalities is super expensive. I guess we can get these in simulation. Whatever modality works the best is good.

For low-level fusion,

Geometric Fusion

Input Fusion
- Map input to 2D or 3D
- Feed into a NN
Late Fusion
- Extract features from both modalities
- Map features in 2D or 3D
- Feed into a NN
- Feature dimension is lower than input dimension
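
The "map input to 2D" step can be sketched as projecting LiDAR points into the camera image and rasterizing them as a sparse depth channel that could sit next to the RGB channels. This is only a minimal sketch, assuming an ideal pinhole camera and points already expressed in the camera frame (z forward). The function names `project_points` and `depth_channel` and all intrinsics below are made up for illustration.

```python
def project_points(points_cam, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame, z forward) to pixel coordinates.

    Returns a list of (u, v, depth) for points that land inside the image.
    Assumes an ideal pinhole camera with no lens distortion.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera, cannot be projected
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((int(u), int(v), z))
    return pixels


def depth_channel(pixels, width, height):
    """Rasterize projected points into a sparse depth image (0.0 = no return)."""
    depth = [[0.0] * width for _ in range(height)]
    for u, v, z in pixels:
        # keep the nearest return when several points hit the same pixel
        if depth[v][u] == 0.0 or z < depth[v][u]:
            depth[v][u] = z
    return depth


# Hypothetical points and intrinsics, chosen only to exercise each branch.
points = [(0.0, 0.0, 10.0), (1.0, 0.5, 5.0), (0.0, 0.0, -2.0), (50.0, 0.0, 1.0)]
pixels = project_points(points, fx=100.0, fy=100.0, cx=32.0, cy=24.0,
                        width=64, height=48)
depth = depth_channel(pixels, width=64, height=48)
print(pixels)
```

Only two of the four points survive: one is behind the camera and one projects outside the image. The resulting depth image is almost entirely zeros, which is typical when LiDAR is mapped into a dense image grid.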

Neural Fusion

The problems:

Early Fusion/Low-Level Fusion (LLF) (fusing raw data)
- Combine the raw signals, then learn a single predictor. However, it is harder for this approach to exploit the specific properties of each sensor
Late Fusion/High-Level Fusion (HLF) (fusing objects)
- Each sensor is processed independently, and the resulting feature maps are combined into one
- A classifier produces a prediction from this hybrid map
Feature-Level Fusion/Mid-Level Fusion (MLF)
- Build some intermediate representations and learn a predictor
- Intermediate feature maps are generated from each sensor and then a new CNN branch generates prediction
- Difficult to train, since there are lots of parameters, and back-propagation is done in two directions. Uses the idea of Bird's-Eye View (BEV)
- MLF appears to be insufficient to achieve an SAE Level 4 or Level 5 autonomous driving (AD) system due to its limited sense of the environment and loss of contextual information
Sequential (Progressive) Fusion
- Use signals one after the other to obtain a prediction
- Ex: Frustum PointNets, using frustums
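
To make the sequential idea concrete, here is a rough sketch of the frustum step: a 2D box from a camera detector selects the LiDAR points whose projection falls inside it, and only those points would go to the next stage. It assumes an ideal pinhole camera and points already in the camera frame. The function name `points_in_frustum`, the box, and the intrinsics are illustrative, not taken from any particular implementation.

```python
def points_in_frustum(points_cam, box, fx, fy, cx, cy):
    """Keep 3D points whose image projection falls inside a 2D box.

    box is (u_min, v_min, u_max, v_max) in pixels; points are (x, y, z)
    in the camera frame with z pointing forward.
    """
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points_cam:
        if z <= 0:  # behind the camera
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


# Hypothetical 2D detection and point cloud.
box = (40.0, 20.0, 60.0, 40.0)
points = [
    (1.0, 0.5, 5.0),    # projects inside the box
    (2.0, 1.0, 10.0),   # same viewing ray, twice as far: also inside
    (0.0, 0.0, 10.0),   # projects to the image center, outside the box
    (1.0, 0.5, -5.0),   # behind the camera
]
frustum_points = points_in_frustum(points, box, fx=100.0, fy=100.0,
                                   cx=32.0, cy=24.0)
print(frustum_points)
```

The first two points lie on the same viewing ray at different depths, so both are kept. A 2D box constrains direction but not distance, which is why a second stage still has to pick out the object within the frustum.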

Imagine 2D vision: 100x100 = 10,000 dims

3D data is extremely sparse by comparison: you have 100x100x100 = 1,000,000 dims
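
A quick back-of-the-envelope check of the numbers above, as a sketch. The scene is a made-up assumption: a single flat wall with one return per (x, y) column, standing in for the fact that range sensors mostly see surfaces rather than filled volumes.

```python
side = 100
pixels_2d = side ** 2   # dense 2D grid
voxels_3d = side ** 3   # dense 3D grid

# Hypothetical scene: one flat wall at z = 50, one occupied voxel per column.
occupied = {(x, y, 50) for x in range(side) for y in range(side)}
occupancy = len(occupied) / voxels_3d

print(pixels_2d, voxels_3d, f"{occupancy:.2%}")
```

Under this assumption only about 1% of the voxels hold any data, so a dense 3D grid spends almost all of its capacity on empty space.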

🛠️ Steven Gong