Object Detection

YOLO

YOLO is an object detection architecture whose name stands for “You Only Look Once”. It is a single-stage detector: one neural network trained end to end that takes a photograph as input and directly predicts bounding boxes and a class label for each box.

  • YOLO's final fully connected layers predict the bounding boxes and the class probabilities together, from one shared feature map — it is a single network head, not separate box and class networks.

Links

Got it running at ~60 Hz with the medium-size model; the small model runs faster.

Transfer Learning with YOLO

Refer to my repo where I did transfer learning for the Musashi-AI challenge.

Updated way, very easy; use it in combination with Roboflow:

from ultralytics import YOLO
 
yolo = YOLO('yolov8m.pt')
yolo.train(data='/perception_datasets/roboflow/traffic_light_roboflow_v3/data.yaml', epochs=300, freeze=10)
valid_results = yolo.val()
print(valid_results)

YOLO-OBB

Run Inference

python detect.py --weights 'runs/train/yolov5m_csl_dotav1.5/weights/best.pt' \
  --source 'dataset/dataset_demo/images/' \
  --img 2048 --device 0 --conf-thres 0.25 --iou-thres 0.2 --hide-labels --hide-conf

NVIDIA Isaac ROS YOLOv5

https://github.com/NVIDIA-AI-IOT/YOLOv5-with-Isaac-ROS

This is nice.

How much VRAM does YOLO use up?

Depends on the model size. I still need to measure this myself, but the small version uses < 800 MB.

Original YOLO formulation (CS231n 2025 Lec 9)

Single-stage detector — no region proposals, no second-stage classifier. One forward pass produces all detections.

Output structure

Divide the input image into an S × S grid (Redmon’s original: 7 × 7). Each cell predicts:

  • B bounding boxes, each parameterized as (x, y, w, h, confidence) — the center, size, and “objectness” score
  • C class probabilities — shared across the B boxes in this cell

Output tensor shape: S × S × (5B + C). For Pascal VOC (S = 7, B = 2, C = 20): 7 × 7 × 30.
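The tensor-shape bookkeeping can be sketched in a few lines (a toy helper, not part of any YOLO library; values from the original paper):

```python
def yolo_output_shape(S, B, C):
    # Each of the S*S grid cells predicts B boxes * (x, y, w, h, confidence)
    # plus C class probabilities shared across the cell's boxes.
    return (S, S, B * 5 + C)

# Pascal VOC setup from Redmon et al.: S = 7, B = 2, C = 20
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```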

How it “looks once”

The full tensor comes out of one CNN forward pass — that’s the “you only look once”. Compare to R-CNN’s ~2000 forward passes per image.

Inference

Take all predicted boxes, multiply confidence by class probability to get per-class scores, threshold, non-max suppress. Many redundant boxes per object before NMS — that’s fine, NMS collapses them.
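The inference step above can be sketched in plain Python (a toy greedy NMS, assuming boxes are already decoded to (x1, y1, x2, y2) corners; real pipelines typically use something like torchvision's batched NMS):

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def postprocess(boxes, objectness, class_probs, score_thres=0.25, iou_thres=0.5):
    # Score = objectness * best class probability, then threshold + greedy NMS.
    scores = [o * max(p) for o, p in zip(objectness, class_probs)]
    order = sorted((i for i, s in enumerate(scores) if s >= score_thres),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)           # highest-scoring remaining box survives
        keep.append(i)
        # drop everything that overlaps it too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thres]
    return keep
```

For example, two heavily overlapping boxes on one object collapse to the higher-scoring one, while a distant box survives: `postprocess([(0,0,10,10), (1,1,11,11), (20,20,30,30)], [0.9, 0.8, 0.7], [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]])` returns `[0, 2]`.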

Tradeoffs vs Faster R-CNN

Faster R-CNN: more accurate (anchor-based two-stage). YOLO: much faster (single CNN pass). Real-time at deploy. Original YOLO struggled with small objects and crowded scenes; later versions (v2/v3 with anchors, multi-scale features) closed the accuracy gap.

Source

CS231n 2025 Lec 9 slides 99–106 (single-stage detectors, YOLO grid output, inference). 2026 PDF not published; using 2025 fallback.