Mask R-CNN

This is built upon Faster R-CNN.

https://www.kaggle.com/code/abhishek/mask-rcnn-using-torchvision-0-17/notebook

https://www.kaggle.com/code/julian3833/sartorius-starter-torch-mask-r-cnn-lb-0-273

Architecture (CS231n 2025 Lec 9)

Faster R-CNN solved object detection (boxes + classes); Mask R-CNN (He et al. ICCV 2017) extends it to instance segmentation by adding a third head that predicts a per-instance binary mask.

Pipeline

  1. Backbone CNN + RPN → region proposals (same as Faster R-CNN)
  2. RoI Align crops a fixed-size feature for each proposal using bilinear interpolation, with no snap-to-grid quantization (see Object Detection for details). The sub-pixel accuracy matters because masks are pixel-aligned predictions.
  3. Three heads operate on each cropped feature:
    • Classification: class scores
    • Box regression: box offsets (per class)
    • Mask network: a small conv head predicts a stack of masks, one binary mask per class. At inference, keep only the mask for the predicted class.
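The no-snap-to-grid sampling in step 2 can be sketched in NumPy. This is a minimal single-channel version with one bilinear sample per output bin; real RoI Align averages several sample points per bin, but the key property is the same: box coordinates are never rounded.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H x W) at fractional coords (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def roi_align(feat, box, out_size):
    """Crop box = (y1, x1, y2, x2), given in float feature-map coords,
    to an out_size x out_size grid. One bilinear sample per bin centre;
    no coordinate is ever snapped to the integer grid."""
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feat, y1 + (i + 0.5) * bh, x1 + (j + 0.5) * bw)
    return out

# On a linear ramp feat[i, j] = i + j, bilinear interpolation is exact,
# so each output bin equals the ramp value at its fractional bin centre.
feat = np.add.outer(np.arange(8.0), np.arange(8.0))
crop = roi_align(feat, (1.3, 2.1, 5.3, 6.1), 4)   # first bin centre: y=1.8, x=2.6
```

Note how the fractional box (1.3, 2.1, ...) is sampled directly; RoI Pool would first round it to integer coordinates, losing the sub-pixel alignment the mask head needs.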

Why per-class masks?

The mask head produces one mask per class rather than a single mask. This decouples mask prediction from classification: each mask only answers "is this pixel part of that object?", while the classification head decides which class's mask is used. Empirically this is simpler and works better than predicting a single class-agnostic mask alongside a class label.
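At inference this decoupling is just an indexing step: from the (N, C, m, m) mask logits, keep each instance's mask at its predicted class and apply a sigmoid. A NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def select_masks(mask_logits, labels):
    """mask_logits: (N, C, m, m) per-class mask logits from the mask head.
    labels: (N,) predicted class per instance, from the classification head.
    Returns (N, m, m) per-instance mask probabilities."""
    n = np.arange(mask_logits.shape[0])
    chosen = mask_logits[n, labels]           # each instance's own class channel
    return 1.0 / (1.0 + np.exp(-chosen))      # per-pixel sigmoid, no softmax across classes

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 28, 28))      # 3 instances, 5 classes, 28x28 masks
labels = np.array([2, 0, 4])
probs = select_masks(logits, labels)          # shape (3, 28, 28)
```

The sigmoid (rather than a softmax over classes) is what makes the masks independent per class: competition between classes happens only in the classification head.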

Training targets

Project the GT instance mask into the proposal box, resize it to the mask head's fixed resolution (28×28 in the paper), and supervise with a per-pixel sigmoid + binary cross-entropy on the predicted mask of the GT class only.
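A minimal sketch of that loss, assuming the GT masks have already been projected and resized: numerically stable BCE-with-logits, applied only to the GT-class channel (NumPy; names are illustrative).

```python
import numpy as np

def mask_loss(mask_logits, gt_masks, gt_labels):
    """mask_logits: (N, C, m, m) from the mask head.
    gt_masks: (N, m, m) binary targets, already projected into the
    proposal box and resized to m x m. gt_labels: (N,) GT class ids.
    Only the GT-class channel is supervised; other channels get no gradient."""
    n = np.arange(mask_logits.shape[0])
    z = mask_logits[n, gt_labels]                        # (N, m, m) GT-class logits
    # stable BCE with logits: max(z, 0) - z*t + log(1 + exp(-|z|))
    bce = np.maximum(z, 0) - z * gt_masks + np.log1p(np.exp(-np.abs(z)))
    return bce.mean()

logits = np.zeros((2, 4, 28, 28))                        # uninformative head: p = 0.5
targets = np.zeros((2, 28, 28))
loss = mask_loss(logits, targets, np.array([1, 3]))      # log 2 when p = 0.5 everywhere
```

Because the loss indexes only the GT class's mask, each class channel is trained purely as "foreground vs background for this class", which is the decoupling described above.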

Bonus: pose estimation

Replace the mask head with a keypoint head that predicts heatmaps (one per body keypoint, e.g. nose/shoulder/elbow). Same architecture with a different last layer, so human pose estimation comes essentially for free.
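Decoding the keypoint head's output is a per-heatmap argmax: each of the K heatmaps votes for one (x, y) location. A NumPy sketch (real systems also rescale from heatmap to image coordinates, omitted here):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """heatmaps: (K, H, W), one heatmap per keypoint (nose, shoulder, ...).
    Returns (K, 2) array of (x, y) peak locations and (K,) peak scores."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)        # peak index per heatmap
    ys, xs = np.unravel_index(flat, (H, W))
    scores = heatmaps.max(axis=(1, 2))
    return np.stack([xs, ys], axis=1), scores

hm = np.zeros((2, 56, 56))
hm[0, 10, 20] = 1.0   # keypoint 0 peaks at x=20, y=10
hm[1, 30, 5] = 0.7    # keypoint 1 peaks at x=5,  y=30
coords, scores = decode_keypoints(hm)
```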

Source

CS231n 2025 Lec 9 slides 115–123, 160–164 (Mask R-CNN architecture, mask network, RoI Align, training mask targets, pose head). 2026 PDF not published; using 2025 fallback.