Mask R-CNN
This is built upon Faster R-CNN.
https://www.kaggle.com/code/abhishek/mask-rcnn-using-torchvision-0-17/notebook
https://www.kaggle.com/code/julian3833/sartorius-starter-torch-mask-r-cnn-lb-0-273
Architecture (CS231n 2025 Lec 9)
Faster R-CNN solved object detection (boxes + classes); Mask R-CNN (He et al. ICCV 2017) extends it to instance segmentation by adding a third head that predicts a per-instance binary mask.
Pipeline
- Backbone CNN + RPN → region proposals (same as Faster R-CNN)
- RoI Align crops a feature for each proposal → bilinear interpolation, no snap-to-grid (see Object Detection for details). The sub-pixel accuracy matters because masks are pixel-aligned predictions.
- Three heads operate on each cropped feature:
- Classification: class scores
- Box regression: box offsets (per class)
- Mask network: small conv head predicts a tensor with one binary mask per class. At inference, take the mask from the predicted class.
Why per-class masks?
The mask head produces one mask per class, not a single mask. This decouples mask prediction from classification: each mask just answers "is this pixel part of that object?". Empirically this is simpler and works better than a single class-agnostic mask plus a class label.
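The "take the mask from the predicted class" step is just a per-row index into the mask head's output. A toy sketch (shapes assume the paper's 28×28 mask resolution; the class IDs are made up):

```python
import torch

# Toy mask-head output for N proposals over C classes:
# one 28x28 logit map per class.
N, C = 4, 80
mask_logits = torch.randn(N, C, 28, 28)
pred_classes = torch.tensor([3, 17, 3, 59])  # classifier argmax per proposal

# At inference, keep only each proposal's predicted-class mask,
# then threshold the per-pixel sigmoid.
selected = mask_logits[torch.arange(N), pred_classes]  # (N, 28, 28)
binary_masks = selected.sigmoid() > 0.5
```

The C masks per proposal are cheap because they live at low resolution; only the selected one gets resampled into the box at full image resolution.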
Training targets
Project the GT instance mask into the proposal box, resize to a fixed resolution (28×28 in the paper), supervise with per-pixel sigmoid + binary cross-entropy on the predicted mask of the GT class only.
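A sketch of that loss, assuming the GT masks have already been cropped to the positive proposals and resized to 28×28; note only the GT class's channel receives gradient:

```python
import torch
import torch.nn.functional as F

N, C = 4, 80
mask_logits = torch.randn(N, C, 28, 28, requires_grad=True)
gt_classes = torch.tensor([3, 17, 3, 59])
gt_masks = (torch.rand(N, 28, 28) > 0.5).float()  # binary per-pixel targets

# Supervise only each proposal's GT-class channel.
picked = mask_logits[torch.arange(N), gt_classes]  # (N, 28, 28)
mask_loss = F.binary_cross_entropy_with_logits(picked, gt_masks)
mask_loss.backward()  # all other class channels get zero gradient
```

This per-channel sigmoid + BCE (rather than a softmax across classes per pixel) is exactly what lets classification and mask prediction decouple.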
Bonus: pose estimation
Replace the mask head with a keypoint head that predicts heatmaps (one per body keypoint, e.g. nose/shoulder/elbow). Same architecture, different last layer: gets human pose estimation essentially for free.
Source
CS231n 2025 Lec 9 slides 115–123, 160–164 (Mask R-CNN architecture, mask network, RoI Align, training mask targets, pose head). 2026 PDF not published; using 2025 fallback.