WATonomous Perception
Also see WATonomous Cheatsheet.
TODO: get in touch with Aryan
Perception will do all of the following (taken from Multi-Task Learning); see the sketch after this list for what one combined detection output could look like:
- Object Detection
- Traffic Signs (usually from HD Map)
- Traffic Lights (usually from HD Map)
- Cars
- What velocity is it moving at? Is it static or moving?
- Left or right blinker on? This helps predict other vehicles’ trajectories
- What kind of car? Most important is an emergency vehicle (ambulance), since we need to yield to them; the CARLA AD Challenge applies a 0.7 penalty for not yielding
- Other lower priority: Is the car’s door open or closed?
- Traffic Cones
- Pedestrians, see what Zoox can do
- Are they looking at their phone? Are they paying attention to the road? Are they walking or standing still (action classification)? What kind of human (child, adult, senior)?
- Road Markings (usually from HD Map)
- Semantic Segmentation
- Which parts of the road are drivable? (usually from HD Map)
- Where are the lane lines?
- Where are the road curbs?
- Where are the crosswalks?
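As a rough illustration of how these multi-task outputs could be bundled per detection, here is a minimal Python sketch; every field name and enum here is hypothetical, not an agreed-upon interface:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Tuple

class BlinkerState(Enum):
    OFF = auto()
    LEFT = auto()
    RIGHT = auto()
    HAZARD = auto()

@dataclass
class Detection:
    """One perception output. Hypothetical schema for illustration only."""
    class_name: str                                  # e.g. "car", "pedestrian", "traffic_cone"
    bbox_3d: Tuple[float, ...]                       # (x, y, z, l, w, h, yaw) in some agreed frame
    velocity: Optional[Tuple[float, float]] = None   # (vx, vy) in m/s; None if not estimated
    is_static: Optional[bool] = None
    # Car-specific attributes
    blinker: Optional[BlinkerState] = None           # helps predict the other vehicle's trajectory
    is_emergency_vehicle: bool = False               # e.g. ambulance -> we must yield
    door_open: Optional[bool] = None
    # Pedestrian-specific attributes
    action: Optional[str] = None                     # e.g. "walking", "standing", "on_phone"
    age_group: Optional[str] = None                  # e.g. "child", "adult", "senior"
```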
Implementation
This is what we want to work towards by the end of April 2024.
Obstacle vs. detection: I don’t think it makes sense to categorize things as obstacle vs. non-obstacle. Semantically, for the model that we train, everything is just a “detection”.
Other things to think about
- Predictions in what coordinate frame? Is perception supposed to take care of this, or something later in the stack? What reference frames?
- You are assuming that you have a single sensor coming in. What happens when you have multiple cameras? Look at Self-Driving Car Companies
- Robustness of predictions?
- How are you gonna do Multi-Task Learning with all these sensors?
- More channeled ROS2 topics? Right now, I’m thinking of putting all the relevant info into a string label (sketched below)
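A minimal sketch of the “everything in a string label” idea: publish detections as JSON inside a plain std_msgs String. The topic name and payload schema are placeholders, not the agreed interface:

```python
import json
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

class DetectionPublisher(Node):
    """Publishes detections as JSON-in-a-string. Schema is hypothetical."""

    def __init__(self):
        super().__init__("perception_detections")
        # Placeholder topic name.
        self.pub = self.create_publisher(String, "/perception/detections", 10)

    def publish_detection(self, class_name, bbox_3d, extra=None):
        msg = String()
        msg.data = json.dumps({
            "class": class_name,   # e.g. "car"
            "bbox_3d": bbox_3d,    # (x, y, z, l, w, h, yaw)
            "extra": extra or {},  # blinker state, pedestrian action, etc.
        })
        self.pub.publish(msg)

def main():
    rclpy.init()
    node = DetectionPublisher()
    # In a real node this would be driven by sensor callbacks.
    node.publish_detection("car", [1.0, 2.0, 0.0, 4.5, 2.0, 1.6, 0.1], {"blinker": "left"})
    rclpy.spin_once(node, timeout_sec=0.1)
    node.destroy_node()
    rclpy.shutdown()

if __name__ == "__main__":
    main()
```

The obvious trade-off: a free-form string label pushes schema validation onto every downstream consumer, whereas a typed message would catch mismatches at the interface.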
Future directions / for people who want to read papers and not code as much
This is interesting work, but arguably not as high a priority.
- ONNX
- Bird-eye view things
- 2D / 3D Occupancy Grid (see the sketch after this list)
- Using LSTM so the car has a concept of time
- (Research-focused): Domain Adaptation / Transfer Learning / Sim2Real “generalize to the real-world”
- Object Tracking (to be completed once Object Detection is done. or not.)
- Multi-Task Learning (more refined classifications, type of car. Door is open or not? Pedestrian action?)
- Better benchmarking of our models, not just using an existing model
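For the occupancy grid item above, a minimal numpy sketch of the 2D data structure; the grid size, resolution, and ego-centered frame are assumptions for illustration:

```python
import numpy as np

def build_occupancy_grid(points_xy, grid_size=100, resolution=0.5):
    """Mark grid cells containing at least one obstacle point as occupied.

    points_xy:  (N, 2) obstacle points in an ego-centered frame, in meters.
    grid_size:  number of cells per side (square grid assumed).
    resolution: meters per cell.
    Returns a (grid_size, grid_size) uint8 grid: 1 = occupied, 0 = free/unknown.
    """
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    half_extent = grid_size * resolution / 2.0
    # Shift so the ego vehicle sits at the grid center, then discretize.
    cells = np.floor((points_xy + half_extent) / resolution).astype(int)
    in_bounds = np.all((cells >= 0) & (cells < grid_size), axis=1)
    cells = cells[in_bounds]
    grid[cells[:, 1], cells[:, 0]] = 1  # row = y index, col = x index
    return grid

# Example: two obstacle points ahead of the car occupy two cells.
print(build_occupancy_grid(np.array([[2.0, 0.0], [10.0, -3.5]])).sum())  # -> 2
```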
Current Perception Stack (on monorepo_v1)
https://drive.google.com/file/d/1XAOEZ1mQ4vm3iRDFr7V6nC529kjzT3nR/view?usp=sharing
- Object Detection
- 2D: YOLOv5 (pretrained) → Use for Traffic Sign Detection as backup
- 3D: PointPillars, SECOND (too slow), Frustum PointNet (didn’t work)
- Before: we did 2D object detection, then projected the generated bounding boxes into frustums and applied Euclidean Clustering to select the best cluster (see the sketch after this list)
- Traffic Light Detection
- YOLO to find the traffic light
- OpenCV color filtering (Finding Contour + Finding Direction)
- Traffic Sign Detection
- Lane Detection
- Fisheye camera
- Old method: 4-step process with Semantic Segmentation
- New Method: End to end with Ultra Fast Lane Detection: https://arxiv.org/pdf/2004.11757.pdf
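A rough sketch of the old frustum + Euclidean Clustering step: keep the LiDAR points whose projection lands inside the 2D bbox, cluster them, and take the cluster closest to the sensor. The projection matrix, DBSCAN parameters, and cluster-selection rule are assumptions, not the exact old implementation:

```python
import numpy as np
from sklearn.cluster import DBSCAN  # stand-in for PCL-style Euclidean clustering

def frustum_cluster(points, P, bbox):
    """Select the object's 3D points given a 2D detection.

    points: (N, 3) LiDAR points already in the camera frame (z forward, meters).
    P:      (3, 4) camera projection matrix.
    bbox:   (x_min, y_min, x_max, y_max) 2D detection in pixels.
    Returns the (M, 3) points of the chosen cluster (possibly empty).
    """
    # Project points into the image plane.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous coords
    uvw = (P @ pts_h.T).T
    in_front = uvw[:, 2] > 0.1
    uv = uvw[:, :2] / uvw[:, 2:3]

    # Frustum = points in front of the camera that project into the bbox.
    x_min, y_min, x_max, y_max = bbox
    in_box = ((uv[:, 0] >= x_min) & (uv[:, 0] <= x_max) &
              (uv[:, 1] >= y_min) & (uv[:, 1] <= y_max))
    frustum_pts = points[in_front & in_box]
    if len(frustum_pts) == 0:
        return frustum_pts

    # Euclidean clustering; eps/min_samples are guesses, tune per sensor.
    labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(frustum_pts)
    if not (labels >= 0).any():
        return frustum_pts[:0]

    # Pick the cluster whose centroid is closest to the sensor.
    best = min(set(labels[labels >= 0]),
               key=lambda c: np.linalg.norm(frustum_pts[labels == c].mean(axis=0)))
    return frustum_pts[labels == best]
```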
Where we’re going: https://drive.google.com/file/d/1VMyHuNRETZ5gWRdH9H7obLK6-alnZBPx/view?usp=sharing
- Update 2023-01-27: I think complexifying all of this by introducing all these modalities is stupid. You should be problem-oriented: what predictions are you trying to make? Refer to the hand-drawn chart above.
Old Notes (DEPRECATED)
Immediate Focus:
- Implement the Bird-Eye View stuff with BEVFusion
- Idea is to then generate this with CARLA, get the ground truth of the bird-eye view
- Publish paper on this? End to end with bird eye view labels simulation from CARLA
- Learn how EfficientDet works (BiFPN), which is what Tesla does
- https://github.com/google/automl/blob/master/efficientdet/tutorial.ipynb
- There’s also this Swin Transformer used in BEVFusion
- CARLA Synthetic Data Generation
- Generate Curb Dataset
- Generate BEV dataset
- Curb Detection (after CARLA is done), see https://arxiv.org/pdf/2110.03968v1.pdf
- So I found this repo, HybridNet, which does Multi-Task Learning, and I managed to get it running with ONNX thanks to this guy’s repository. He has a bunch of these, so I think I will follow this template (ONNX Runtime usage sketched below)
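For reference, running an exported ONNX model (like the multi-task model mentioned above) with ONNX Runtime looks roughly like this; the model path, input name, and input shape are placeholders and depend on the actual export:

```python
import numpy as np
import onnxruntime as ort

# Placeholder path; substitute the actual exported multi-task model.
session = ort.InferenceSession("hybridnet.onnx", providers=["CPUExecutionProvider"])

# Inspect the graph to find the expected input name and shape.
inp = session.get_inputs()[0]
print(inp.name, inp.shape)  # e.g. "input", [1, 3, 384, 640] (model-dependent)

# Dummy image tensor; a real pipeline would resize/normalize a camera frame.
dummy = np.random.rand(1, 3, 384, 640).astype(np.float32)

# A multi-task model returns several heads (detection, segmentation, ...).
outputs = session.run(None, {inp.name: dummy})
for meta, out in zip(session.get_outputs(), outputs):
    print(meta.name, getattr(out, "shape", type(out)))
```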
Personal Notes
To be an expert in perception, I need to:
- Be able to write YOLO from scratch
- Write PointPillars and all these detection algorithms from scratch
- Write Transformers from scratch, GANs as well
- Understand Sensor Fusion and how the different sensors are combined
- Convert to Bird-Eye View
Future Research directions:
- Generating data for our models to train on (Sim2Real)
- Lane Detection, look into this paper WATonomous wrote: https://arxiv.org/pdf/2202.07133.pdf
- Camera Calibration
- Better classifications, like Action Classification for pedestrians and cars (toggling lights, etc.)
- Monocular Depth (implement from scratch)
NO, I think the main thing is to get really good at engineering.
From F1TENTH:
- Depth from Monocular Camera (Monodepth2)
- Dehaze (Cameron Hodges et al.) → Allows better object detection outputs
- Night to Day (ForkGAN): https://github.com/zhengziqiang/ForkGAN, or cycleGAN?
Papers with Code, interesting topics:
- Lane Detection (53 papers)
- 3D Object Detection
- Multimodal Association
- Open Vocabulary Object Detection
- Self-Supervised Image Classification
- Object Tracking
Literature Reviews:
Concepts
Papers
- LiDAR
- Camera
- Other
- Ground Segmentation
- We use Cut, Paste, Learn to generate data: “A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets” (see the sketch after this list)
- Projection Matrix for Camera Calibration
- Essential Papers
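A minimal sketch of the Cut, Paste, Learn idea quoted above: blend a cropped, masked object onto a background scene at a random location to synthesize a labeled training image. The hard paste (no Gaussian/Poisson blending, which the paper uses) and the dummy data are simplifications:

```python
import numpy as np

def cut_paste(background, obj_rgb, obj_mask, rng=np.random.default_rng()):
    """Paste one masked object crop onto a background image.

    background: (H, W, 3) uint8 scene image.
    obj_rgb:    (h, w, 3) uint8 object crop.
    obj_mask:   (h, w) boolean mask of the object within the crop.
    Returns (augmented image, pasted bbox as (x, y, w, h)) for the new label.
    """
    H, W, _ = background.shape
    h, w, _ = obj_rgb.shape
    # Random top-left corner such that the crop fits inside the background.
    x = int(rng.integers(0, W - w + 1))
    y = int(rng.integers(0, H - h + 1))

    out = background.copy()
    region = out[y:y + h, x:x + w]   # view into the output image
    mask = obj_mask.astype(bool)
    region[mask] = obj_rgb[mask]     # hard paste; real pipeline blends the edges
    return out, (x, y, w, h)

# Example with dummy arrays standing in for real crops.
bg = np.zeros((480, 640, 3), dtype=np.uint8)
obj = np.full((50, 80, 3), 255, dtype=np.uint8)
mask = np.ones((50, 80), dtype=bool)
img, bbox = cut_paste(bg, obj, mask)
print(bbox)
```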
Blog for object detection:
Resources:
Camera is useful for:
- Knowing the type of traffic sign (to see it)
- Action Classification