Multi-Task Learning

I did not even know this was a thing… holy crap.

Repos to look into:

import torch
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
# load model
model = torch.hub.load('datvuthanh/hybridnets', 'hybridnets', pretrained=True)
box = (0, 0, 640, 384)
img ="image.png").crop(box)
if img.mode != 'RGB':
	img = img.convert('RGB')
img = torch.tensor(np.expand_dims(np.transpose(np.asarray(img), axes=[2,0,1]), 0)) / torch.tensor(255.0)
features, regression, classification, anchors, segmentation = model(img)

Below is a rant/explanation of some of my thoughts about direction with this BEVFusion:

I’ve been looking at the paper and the architecture a bit more and I think this model has a lot of potential, because it’s built for support various tasks, whether its semantic segmentation or object detection (we call that multi-task learning), so I think this model goes beyond just doing “object detection”.

Tesla also takes a multi-task learning approach, except they are camera-only, if you have time take a look at this talk by Andrej Karpathy about Multi-Task Learning at Tesla:

In essence, to build an actual urban self-driving car, not only do you need object detection, but you also want for example, semantic segmentation, to identify which parts of the road are drivable. Within object detection, there are also many subtasks, meta-information that might be useful. So you have:

  1. Object Detection
    1. Predicting Traffic Signs
    2. Prediction Traffic Lights
      1. What colour is the traffic light?
    3. Predicting Cars
      1. Is the car’s door open or closed?
      2. Left blinker or right blinker on?
      3. What velocity is it moving at? Is it static or moving?
      4. What kind of car? Ambulance, Truck, etc.
    4. Detecting traffic cones
    5. Predicting Pedestrians
      1. Are they looking at their phone? Are they paying attention to the road?
      2. Are they walking or standing still? (action classification)
    6. Detecting Road Markings
  2. Semantic Segmentation
    1. Which parts of the road is drivable?
      1. Where are the lane lines?
      2. Where are the road curbs?
      3. Where are the crosswalks?

You end up with all these things you need to predict. At WATonomous, you have one module for object detection (where you draw bounding boxes), and one module for semantic segmentation (we don’t even do this anymore).

  • But as we want to develop a more advanced self-driving car, to help the car truly understand, there are these things it needs to understand and perceive.

Training separate networks for each task is computationally expensive (just imagine running 30 different neural networks on our car in real-time).

  • Say you detect a car, and then you need to run another neural network that takes in that bounding box to detect whether a blinker is on..?

However, the insight is that these networks usually share features (for example knowing how to detect edges is useful both for detecting pedestrians and cars), so the idea is that you have some sort of shared backbone, and train multiple heads individually.

This is what we currently do at WATonomous: we have a single predictor (YOLO), which predicts multiple classes. However, there are a few limitations:

  • It’s hard to add a new class
  • Not all tasks is about drawing bounding boxes. You might want to do semantic segmentation (especially for the curbs), or regression

It’s unclear how the network weighs everything. If you have 100 classes, and weigh each of them equally, that’s not fair right?

Also, some tasks are very easy, while other tasks are very hard. How does the Loss Function work then?

There is label imbalance. So then that’s why you want just want to train a network separately.

So you change the oversampling ratio.

And their network is training in realtime.

Next up, we need to actually work in BEV representation. We want to have multi-task learning, but we also want to do this in some sort of 3D space. Because just projecting camera bounding boxes is not good enough.

So if you have 8 cameras, you don’t want to just do object detection on each of these images, and fuse them later. You want to fuse them a little bit earlier.