YOLO
YOLO ("You Only Look Once") is a single-stage object detection architecture: one neural network trained end to end that takes an image as input and directly predicts bounding boxes and a class label for each box.
- In the original YOLO there is a single network: the final fully connected layer predicts both the bounding boxes and the class probabilities in one output tensor. There is no separate network for classification.
Links
Got it running at ~60 Hz with the medium-size model; the small model runs faster.
Transfer Learning with YOLO
Refer to my repo where I did transfer learning for the Musashi-AI challenge.
Updated way (very easy), use in combination with Roboflow:
from ultralytics import YOLO
yolo = YOLO('yolov8m.pt')
yolo.train(data='/perception_datasets/roboflow/traffic_light_roboflow_v3/data.yaml', epochs=300, freeze=10)
valid_results = yolo.val()
print(valid_results)
YOLO-OBB
Run Inference
python detect.py --weights 'runs/train/yolov5m_csl_dotav1.5/weights/best.pt' \
--source 'dataset/dataset_demo/images/' \
--img 2048 --device 0 --conf-thres 0.25 --iou-thres 0.2 --hide-labels --hide-conf
NVIDIA Isaac ROS YOLOv5
https://github.com/NVIDIA-AI-IOT/YOLOv5-with-Isaac-ROS
This is nice.
How much VRAM does YOLO use up?
Depends on the model size. I still need to measure this myself; the small version uses < 800 MB.
Original YOLO formulation (CS231n 2025 Lec 9)
Single-stage detector — no region proposals, no second-stage classifier. One forward pass produces all detections.
Output structure
Divide the input image into an S × S grid (Redmon's original: S = 7). Each cell predicts:
- B bounding boxes, each parameterized as (x, y, w, h, confidence) — the center, size, and "objectness" score
- C class probabilities — shared across the B boxes in this cell
Output tensor shape: S × S × (5B + C). For Pascal VOC (S = 7, B = 2, C = 20): 7 × 7 × 30.
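The output layout above can be sketched with a few lines of arithmetic (illustrative only, no real model involved):

```python
# Sketch of the original YOLO output layout for Pascal VOC.
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

# Each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities.
per_cell = B * 5 + C
output_shape = (S, S, per_cell)
print(output_shape)      # (7, 7, 30)
print(S * S * per_cell)  # 1470 numbers per image, all from one forward pass
```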
How it “looks once”
The full tensor comes out of one CNN forward pass — that’s the “you only look once”. Compare to R-CNN’s ~2000 forward passes per image.
Inference
Take all predicted boxes, multiply confidence by class probability to get per-class scores, threshold, non-max suppress. Many redundant boxes per object before NMS — that’s fine, NMS collapses them.
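The threshold-then-NMS step can be sketched in plain Python. This is a minimal illustration, not the actual YOLO implementation: boxes are assumed to be in (x1, y1, x2, y2) format and scores are already confidence × class probability.

```python
# Minimal sketch of YOLO-style post-processing: threshold scores,
# then greedy non-max suppression to collapse redundant boxes.

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    # Keep boxes above the score threshold, highest score first,
    # dropping any box that overlaps an already-kept box too much.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes on one object plus a low-score box:
# the threshold drops the low-score box, NMS collapses the duplicates.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.1]
print(nms(boxes, scores))  # [0]
```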
Tradeoffs vs Faster R-CNN
Faster R-CNN: more accurate (anchor-based two-stage). YOLO: much faster (single CNN pass). Real-time at deploy. Original YOLO struggled with small objects and crowded scenes; later versions (v2/v3 with anchors, multi-scale features) closed the accuracy gap.
Source
CS231n 2025 Lec 9 slides 99–106 (single-stage detectors, YOLO grid output S × S × (5B + C), inference). 2026 PDF not published; using 2025 fallback.