WATonomous is the University of Waterloo design team for autonomous vehicles. I joined WATonomous in May 2022, and I now lead the Perception Team, integrating and optimizing deep learning architectures in ROS.

Learn more at https://watonomous.ca/

Some Interesting Challenges We Work On

  • At what stage should we do sensor fusion? Early fusion, mid-level fusion, late fusion, or sequential fusion?
  • How do we gather enough data, since we are not a large company like Cruise or Tesla? -> In-house Synthetic Data Generation. Simulation was mainly used for planning and control; we were running rosbags for testing Perception. However, we are looking into Sim2Real and synthetic data generation
  • How to efficiently train multiple networks that share similar features? -> Multi-Task Learning
  • The engineering challenge of Integration Hell when using ROS
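
The Multi-Task Learning idea above can be sketched as one shared backbone feeding several small per-task heads, so shared features are computed once. This is a numpy sketch with made-up shapes and weights; the real networks, tasks, and dimensions are not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone: one feature extractor reused by every task head.
W_backbone = rng.normal(size=(64, 128))   # raw features -> shared features

# Per-task heads: small and cheap compared to the backbone.
W_detect = rng.normal(size=(128, 10))     # e.g. 10 object classes (assumed)
W_segment = rng.normal(size=(128, 5))     # e.g. 5 segmentation classes (assumed)

def forward(x):
    """Run the shared backbone once, then every task head on the result."""
    shared = np.tanh(x @ W_backbone)      # computed once, shared by all heads
    return {
        "detection_logits": shared @ W_detect,
        "segmentation_logits": shared @ W_segment,
    }

out = forward(rng.normal(size=(1, 64)))
print(out["detection_logits"].shape)      # (1, 10)
print(out["segmentation_logits"].shape)   # (1, 5)
```

The point of the sketch: N tasks cost roughly one backbone pass plus N cheap heads, instead of N full networks.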

Important Software Development Concepts

For example, SSH local port forwarding, which forwards local port 8886 to port 8886 on the cluster machine (useful for reaching a service such as a notebook server running there):

ssh -NfL 8886:localhost:8886 s36gong@trpro-ubuntu1.watocluster.local


About the CARLA HD Map:

  • “The data that is stored in an ASAM OpenDRIVE file describes the geometry of roads, lanes and objects, such as roadmarks on the road, as well as features along the roads, like signals”.

This means Perception will do all of the following (taken from Multi-Task Learning):

  1. Object Detection
    1. Traffic Signs (usually from HD Map)
    2. Traffic Lights (usually from HD Map)
    3. Cars
      • What velocity is it moving at? Is it static or moving?
      • Left blinker or right blinker on? Helps predict other vehicles’ trajectories
      • What kind of car? The most important is an emergency vehicle (ambulance), since we need to yield to them; the CARLA AD Challenge has a penalty of 0.7 for not yielding
      • Other, lower priority: is the car’s door open or closed?
    4. Traffic Cones
    5. Pedestrians, see what Zoox can do
      • Are they looking at their phone? Are they paying attention to the road? Are they walking or standing still? (action classification) What kind of human (child, adult, senior)?
    6. Road Markings (usually from HD Map)
  2. Semantic Segmentation
    1. Which parts of the road are drivable? (usually from HD Map)
      1. Where are the lane lines?
      2. Where are the road curbs?
      3. Where are the crosswalks?


This is what we want to work towards by the end of April 2023.

Input: CARLA simulated sensors OR real car sensors. Output: /detection (maybe /detection_list) based on Obstacle.msg + /segmented_map from RoadLiensClassMask.msg:

Obstacle vs. detection: I don’t think it makes sense to distinguish between obstacle and non-obstacle. Semantically, for the model that we train, everything is just a “detection”.


Header header
# The label of the obstacle predicted, very expandable
string label
# Detection confidence
float32 confidence
# Position and its uncertainty
# When the message is sent to Processing it's in vehicle POV coordinate system, but then Processing transforms it to novatel coordinate system
# Coordinate system documentation: TBD
# For 3d bounding boxes, the (x, y, z) is the center point of the 3d bounding box
# For 2d bounding boxes, the (x, y) is the top left point of the 2d bounding box
geometry_msgs/PoseWithCovariance pose
# Velocity and its uncertainty
geometry_msgs/TwistWithCovariance twist
# Bounding box extents
float64 height_along_y_axis
float64 depth_along_z_axis
# Unique ID number
uint32 object_id
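
The pose comment above says Processing transforms detections from the vehicle POV frame into the novatel frame. Since that coordinate system documentation is still TBD, here is only a generic 2D rigid-transform sketch of what such a step looks like; the frame conventions, axes, and function name are all assumptions:

```python
import numpy as np

def vehicle_to_global(xy_vehicle, ego_xy, ego_yaw):
    """Transform a point from the vehicle frame into a global frame.

    xy_vehicle: (x, y) of the detection in the vehicle frame
    ego_xy:     (x, y) of the ego vehicle in the global frame
    ego_yaw:    ego heading in radians (0 = global +x axis, assumed convention)
    """
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    R = np.array([[c, -s], [s, c]])       # rotation: vehicle -> global
    return R @ np.asarray(xy_vehicle) + np.asarray(ego_xy)

# A detection 10 m ahead of an ego vehicle facing global +y, ego at (5, 5):
print(vehicle_to_global((10.0, 0.0), (5.0, 5.0), np.pi / 2))  # ~[5. 15.]
```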

Also see other common_msgs that we wrote ourselves here.

The main goal of these simple tasks is to get more familiar with ROS and practice integrating code. Right now, I am dividing tasks by modality (camera vs. lidar vs. camera+lidar), but in the future this division will likely be by task (pedestrian vs. car, etc.). Different modalities might be better for different tasks:

  1. Synthetic Data Generation for Traffic Signs using CARLA Python API “building our own data warehouse” (~8 hours)
    • Code here. No need for ROS. Generate datasets offline. Used to refine deployed models (with Transfer Learning).
    • Can be expanded to ambulance, action classification for pedestrians, roadline detection, etc.
  2. Object Detection “Integration hell” (~8 hours)
    • High-level Task: Take an off-the-shelf model for your modality, take input from the CARLA ROS bridge sensors, make predictions, and output to the /detection topic, which uses the Object.msg
    • Code here. ROS + Docker.
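
Whatever off-the-shelf model is used, its raw boxes typically need post-processing before publishing to /detection; the standard step is non-maximum suppression. A pure-Python sketch, where the (x1, y1, x2, y2) corner box format and the 0.5 threshold are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box of each cluster of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the second box overlaps the first
```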

Other things to think about

  • Predictions in what coordinate frame? Is Perception, or something later in the stack, supposed to take care of this? What reference frames?
  • We are assuming a single sensor stream coming in. What happens when you have multiple cameras? Look at Tesla
  • Robustness of predictions?
  • How are you going to do Multi-Task Learning with all these sensors?
  • More channeled ROS2 topics? Right now, I’m thinking of putting all the relevant info into the string label

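
If all the relevant info goes into the string label for now, a simple delimited encoding at least keeps it parseable downstream. This is only a sketch of that stopgap; the key names and the `key=value;` format are assumptions, not an agreed message contract:

```python
def pack_label(**fields):
    """Encode detection attributes into a single label string."""
    return ";".join(f"{k}={v}" for k, v in sorted(fields.items()))

def unpack_label(label):
    """Decode the label string back into a dict of attributes."""
    return dict(kv.split("=", 1) for kv in label.split(";") if kv)

label = pack_label(cls="car", blinker="left", moving="true")
print(label)                       # blinker=left;cls=car;moving=true
print(unpack_label(label)["cls"])  # car
```

Sorting the keys makes the encoding deterministic, which keeps rosbag diffs and tests stable.
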
Future directions / for people who want to read papers and not code as much

This is interesting work. But arguably not as high priority.

Current Perception Stack (on monorepo_v1)


Where we’re going: https://drive.google.com/file/d/1VMyHuNRETZ5gWRdH9H7obLK6-alnZBPx/view?usp=sharing

  • Update 2023-01-27: I think complicating all of this by introducing all these modalities is a mistake. You should be problem-oriented: what predictions are you trying to make? Refer to the hand-drawn chart above.


Immediate Focus:

  1. Download the ONNX version of the model weights
  2. Get the template from IbaiGorordo here
  3. Run inference!
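
Most ONNX detection models expect a fixed-size, normalized, channel-first input, so step 3 usually needs a preprocessing step like the following numpy sketch. The 640x640 size, 0-1 normalization, and NCHW layout are assumptions that depend on the specific model downloaded:

```python
import numpy as np

def preprocess(image, size=640):
    """Resize (nearest-neighbor), normalize to [0, 1], reorder HWC -> NCHW."""
    h, w, _ = image.shape
    # Nearest-neighbor resize via index sampling (avoids an OpenCV dependency).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols[None, :]]
    x = resized.astype(np.float32) / 255.0     # normalize to [0, 1]
    x = np.transpose(x, (2, 0, 1))[None]       # HWC -> NCHW with batch dim
    return x

dummy = np.zeros((480, 640, 3), dtype=np.uint8)  # a fake camera frame
print(preprocess(dummy).shape)  # (1, 3, 640, 640)
```

The resulting array is what would be fed to the ONNX runtime session in step 3.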

Personal Notes

To be an expert in perception, I need to:

Future Research directions:


Papers with Code, interesting topics:

Literature Reviews:



Blog for object detection:


Camera is useful for: