WATonomous
WATonomous is the University of Waterloo design team for Autonomous Vehicles. I joined WATonomous in May 2022, and I now lead the Perception Team, integrating and optimizing Deep Learning architectures in ROS.
Learn more at https://watonomous.ca/
Some of the Interesting Challenges We Work On
- At what stage should we do Sensor Fusion? Early fusion, late fusion, mid-level fusion, or sequential fusion?
- How do we gather enough data, since we are not a large company like Cruise or Tesla? -> In-house Synthetic Data Generation
  - Simulation was mainly used for planning and control; we were running rosbags for testing Perception. However, we are looking into Sim2Real and synthetic data generation.
- How to efficiently train multiple networks that share similar features? -> Multi-Task Learning
- The engineering challenge of Integration Hell when using ROS
Important Software Development Concepts
- Docker for all software development
- Port Forwarding to forward displays such as Jupyter notebooks and VNC viewers, e.g.:
ssh -NfL 8886:localhost:8886 s36gong@trpro-ubuntu1.watocluster.local
(-L forwards local port 8886 to port 8886 on the cluster machine, -N skips running a remote command, and -f backgrounds ssh)
Perception
About the CARLA HD Map:
- “The data that is stored in an ASAM OpenDRIVE file describes the geometry of roads, lanes and objects, such as roadmarks on the road, as well as features along the roads, like signals”.
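For reference, a minimal sketch of pulling that OpenDRIVE description out of a running CARLA server (host/port and output path are placeholders):

```python
# Read the ASAM OpenDRIVE map description from CARLA.
# Assumes a CARLA server is listening on localhost:2000.
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
carla_map = client.get_world().get_map()

# The HD map is stored as an OpenDRIVE XML string (roads, lanes, signals, road marks).
opendrive_xml = carla_map.to_opendrive()
with open("town.xodr", "w") as f:
    f.write(opendrive_xml)
```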
This means Perception will do all of the following (taken from Multi-Task Learning):
- Object Detection
- Traffic Signs (usually from HD Map)
- Traffic Lights (usually from HD Map)
- Cars
- What velocity is it moving at? Is it static or moving?
- Left blinker or right blinker on? to help predict other vehicle’s trajectories
- What kind of car? most important is emergency Vehicle (Ambulance) since we need to yield to them, CARLA AD Challenge has a punishment of
0.7
for not yielding - Other lower priority: Is the car’s door open or closed?
- Traffic Cones
- Pedestrians, see what Zoox can do
- Are they looking at their phone? Are they paying attention to the road? Are they walking or standing still? (action classification) What kind of human (children, adult, senior)?
- Road Markings (usually from HD Map)
- Semantic Segmentation
- Which parts of the road is drivable? (usually from HD Map)
- Where are the lane lines?
- Where are the road curbs?
- Where are the crosswalks?
- Which parts of the road is drivable? (usually from HD Map)
Implementation
This is what we want to work towards by the end of April 2023.
Input: CARLA simulated sensors OR real car sensors
Output: /detection (maybe /detection_list) based on Obstacle.msg + /segmented_map from RoadLiensClassMask.msg:
Obstacle vs. detection: I think it doesn’t make sense to distinguish between obstacle and non-obstacle. Semantically, for the model that we train, it is just a “detection”.
Obstacle.msg
Header header
# The label of the obstacle predicted, very expandable
string label
# Detection confidence
float32 confidence
# Position and its uncertainty
# When the message is sent to Processing it's in vehicle POV coordinate system, but then Processing transforms it to novatel coordinate system
# Coordinate system documentation: TBD
# For 3d bounding boxes, the (x, y, z) is the center point of the 3d bounding box
# For 2d bounding boxes, the (x, y) is the top left point of the 2d bounding box
geometry_msgs/PoseWithCovariance pose
# Velocity and its uncertainty
geometry_msgs/TwistWithCovariance twist
float64 height_along_y_axis
float64 depth_along_z_axis
# Unique ID number
uint32 object_id
Also see other common_msgs that we wrote ourselves here.
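For concreteness, here is a minimal sketch of a node publishing one of these messages on /detection. It assumes a ROS 2 (rclpy) setup and that Obstacle.msg lives in a package named common_msgs; the field values are made up for illustration:

```python
import rclpy
from rclpy.node import Node
from common_msgs.msg import Obstacle  # assumed package/message location


class DetectionPublisher(Node):
    def __init__(self):
        super().__init__("detection_publisher")
        self.pub = self.create_publisher(Obstacle, "/detection", 10)

    def publish_example(self):
        msg = Obstacle()
        msg.header.stamp = self.get_clock().now().to_msg()
        msg.header.frame_id = "base_link"  # vehicle POV frame before Processing transforms it
        msg.label = "car"
        msg.confidence = 0.92
        msg.pose.pose.position.x = 12.0    # center of the 3D bounding box
        msg.pose.pose.position.y = -1.5
        msg.pose.pose.position.z = 0.8
        msg.height_along_y_axis = 1.6
        msg.depth_along_z_axis = 4.2
        msg.object_id = 1
        self.pub.publish(msg)


def main():
    rclpy.init()
    node = DetectionPublisher()
    node.publish_example()
    rclpy.shutdown()
```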
The main goal of these simple tasks is to get more familiar with ROS and practice integrating code. Right now, I am dividing tasks by modality (camera vs. lidar vs. camera+lidar), but in the future this division will likely be by task (pedestrian vs. car, etc.). Different modalities might be better for different tasks:
- Synthetic Data Generation for Traffic Signs using CARLA Python API “building our own data warehouse” (~8 hours)
- Code here. No need for ROS. Generate datasets offline. Used to refine deployed models (with Transfer Learning). A rough sketch of the CARLA capture loop appears after this list.
- Can be expanded to ambulance, action classification for pedestrians, roadline detection, etc.
- Object Detection “Integration hell” (~8 hours)
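A rough sketch of the offline capture loop for the synthetic data task, using the CARLA Python API (server address, sensor placement, and output path are placeholder values; ground-truth labelling, traffic-sign filtering, and actor cleanup are omitted):

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprint_library = world.get_blueprint_library()

# Spawn an ego vehicle and let the simulator drive it around.
vehicle_bp = blueprint_library.filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)

# Attach an RGB camera to the vehicle.
camera_bp = blueprint_library.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "1280")
camera_bp.set_attribute("image_size_y", "720")
camera_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)

# Dump every frame to disk; a fuller pipeline would also pull the simulator's
# ground truth (e.g. traffic sign actors) to produce labels alongside the images.
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```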
Other things to think about
- Predictions in what coordinate frame? Is Perception, or something later in the stack, supposed to take care of this? What reference frames? (See the toy frame-transform sketch after this list.)
- You are assuming that you have a single sensor coming in. What happens when you have multiple cameras? Look at Tesla.
- Robustness of predictions?
- How are you gonna do Multi-Task Learning with all these sensors?
- More channeled ROS2 topics? Right now, I’m thinking of putting all the relevant info into the string label field
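To make the coordinate-frame question concrete, here is a plain numpy sketch of moving a detection centre from the vehicle frame into a map frame (the 4x4 transform is made up; in the real stack it would come from localization, not be hard-coded):

```python
import numpy as np

# Assumed homogeneous transform from the vehicle (ego) frame to the map frame.
T_map_vehicle = np.array([
    [0.0, -1.0, 0.0, 100.0],
    [1.0,  0.0, 0.0,  50.0],
    [0.0,  0.0, 1.0,   0.0],
    [0.0,  0.0, 0.0,   1.0],
])

# A detection centre predicted in the vehicle frame (x forward, y left, z up).
p_vehicle = np.array([12.0, -1.5, 0.8, 1.0])

# The same point expressed in the map frame.
p_map = T_map_vehicle @ p_vehicle
print(p_map[:3])
```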
Future directions / for people who want to read papers and not code as much
This is interesting work, but arguably not as high priority.
- ONNX
- Bird's-eye-view things
- 2D / 3D Occupancy Grid (see the toy grid-building sketch after this list)
- Using LSTM so the car has a concept of time
- (Research-focused): Domain Adaptation / Transfer Learning / Sim2Real “generalize to the real-world”
- Object Tracking (to be completed once Object Detection is done. or not.)
- Multi-Task Learning (more refined classifications, type of car. Door is open or not? Pedestrian action?)
- Better benchmarking of our models. Not just using an existing model
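A toy sketch of building a 2D occupancy grid from a LiDAR point cloud (extents and resolution are made-up values, and this ignores ground removal and ray casting for free space):

```python
import numpy as np

def occupancy_grid(points, x_range=(0.0, 50.0), y_range=(-25.0, 25.0), resolution=0.5):
    """points: (N, 3) array of x, y, z in metres in the vehicle frame."""
    grid_w = int((x_range[1] - x_range[0]) / resolution)
    grid_h = int((y_range[1] - y_range[0]) / resolution)
    grid = np.zeros((grid_h, grid_w), dtype=np.uint8)

    # Keep points inside the grid extents, then bin them into cells.
    inside = (
        (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1])
        & (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    )
    xs = ((points[inside, 0] - x_range[0]) / resolution).astype(int)
    ys = ((points[inside, 1] - y_range[0]) / resolution).astype(int)
    grid[ys, xs] = 1  # a cell is occupied if any point falls inside it
    return grid
```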
Current Perception Stack (on monorepo_v1)
https://drive.google.com/file/d/1XAOEZ1mQ4vm3iRDFr7V6nC529kjzT3nR/view?usp=sharing
- Object Detection
- 2D: YOLOv5 (pretrained) -> Use for Traffic Sign Detection as backup
- 3D: PointPillars, SECOND (too slow), Frustum PointNet (didn’t work)
- Before: we did 2D object detection, then projected the generated bboxes into the point cloud (frustum) and applied Euclidean Clustering to select the best cluster
- Traffic Light Detection
- YOLO to find the traffic light
- OpenCV color filtering (Finding Contour + Finding Direction); a rough sketch of the color-filtering step appears after this list
- Traffic Sign Detection
- Lane Detection
- Fish-eye camera
- Old method: 4-step process with Semantic Segmentation
- New Method: End to end with Ultra Fast Lane Detection: https://arxiv.org/pdf/2004.11757.pdf
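A rough sketch of the colour-filtering step for traffic light state, assuming the YOLO stage has already cropped the light region (the HSV thresholds are illustrative, not the tuned values used on the car, and the direction/arrow check is left out):

```python
import cv2

def classify_light(crop_bgr):
    hsv = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so it needs two ranges.
    masks = {
        "red": cv2.inRange(hsv, (0, 100, 100), (10, 255, 255))
               | cv2.inRange(hsv, (170, 100, 100), (180, 255, 255)),
        "yellow": cv2.inRange(hsv, (20, 100, 100), (35, 255, 255)),
        "green": cv2.inRange(hsv, (45, 100, 100), (90, 255, 255)),
    }
    best_label, best_area = "unknown", 0.0
    for label, mask in masks.items():
        # The colour with the largest contour area decides the state.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        area = max((cv2.contourArea(c) for c in contours), default=0.0)
        if area > best_area:
            best_label, best_area = label, area
    return best_label
```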
Where we’re going: https://drive.google.com/file/d/1VMyHuNRETZ5gWRdH9H7obLK6-alnZBPx/view?usp=sharing
- Update 2023-01-27: I think complexifying all of this by introducing all these modalities is stupid. You should be problem oriented, what predictions are you trying to make? Refer to hand-drawn chart above.
Old Notes (DEPRECATED)
Immediate Focus:
- Implement the Bird-Eye View stuff with BEVFusion
- Idea is to then generate this with CARLA, get the ground truth of the bird-eye view
- Publish paper on this? End to end with bird eye view labels simulation from CARLA
- Learn how EfficientDet works (BiFPN), which is what Tesla does
- https://github.com/google/automl/blob/master/efficientdet/tutorial.ipynb
- There’s also this Swin Transformer used in BEVFusion
- CARLA Synthetic Data Generation
- Generate Curb Dataset
- Generate BEV dataset
- Curb Detection (after CARLA is done), see https://arxiv.org/pdf/2110.03968v1.pdf
- I found this repo, HybridNet, which does Multi-Task Learning, and I managed to implement it combined with ONNX thanks to this person’s repository. They have a bunch of similar examples, so I think I will follow this template. (A minimal ONNX Runtime sketch follows these notes.)
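A minimal sketch of running an exported model with ONNX Runtime (model path, input shape, and provider are placeholders, not our actual export):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("hybridnet.onnx", providers=["CPUExecutionProvider"])

# Query the input metadata instead of hard-coding names.
input_meta = session.get_inputs()[0]
dummy = np.random.rand(1, 3, 384, 640).astype(np.float32)  # example NCHW image tensor

outputs = session.run(None, {input_meta.name: dummy})
print([o.shape for o in outputs])
```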
Personal Notes
To be an expert in perception, I need to:
- Be able to write YOLO from scratch
- Write PointPillars and all these detection algorithms from scratch
- Write Transformers from scratch, GANs as well
- Understand Sensor Fusion and how those are combined
- Convert to Bird-Eye View
Future Research directions:
- Generating data for our models to train on (Sim2Real)
- Lane Detection, look into this paper WATonomous wrote: https://arxiv.org/pdf/2202.07133.pdf
- Camera Calibration
- Better classifications, like Action Classification for pedestrians and cars (toggling lights, etc.)
- Monocular Depth (implement from scratch)
From F1TENTH:
- Depth from Monocular Camera (Monodepth2)
- Dehaze (Cameron Hodges et al.) -> Allows better object detection outputs
- Night to Day (ForkGAN): https://github.com/zhengziqiang/ForkGAN, or cycleGAN?
Papers with Code, interesting topics:
- Lane Detection (53 papers)
- 3D Object Detection
- Multimodal Association
- Open Vocabulary Object Detection
- Self-Supervised Image Classification
- Object Tracking
Literature Reviews:
Concepts
Papers
- LiDAR
- Camera
- Other
- Ground Segmentation
- We use Cut, Paste, Learn to generate data: “A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets”
- Projection Matrix for Camera Calibration (see the toy projection example after this list)
- Essential Papers
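A toy example of the pinhole projection behind that calibration work (intrinsics and extrinsics are made-up values):

```python
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])            # camera intrinsics
R = np.eye(3)                              # world-to-camera rotation
t = np.array([0.0, 0.0, 0.0])              # world-to-camera translation

X_world = np.array([2.0, 1.0, 10.0])       # a 3D point in front of the camera
x_cam = R @ X_world + t                    # point in camera coordinates
uvw = K @ x_cam                            # homogeneous pixel coordinates
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]    # perspective divide to get pixel coordinates
print(u, v)
```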
Blog for object detection:
Resources:
Camera is useful for:
- Knowing the type of traffic sign (to see it)
- Action Classification