Mask R-CNN

This is built upon Faster R-CNN.

https://www.kaggle.com/code/abhishek/mask-rcnn-using-torchvision-0-17/notebook

https://www.kaggle.com/code/julian3833/sartorius-starter-torch-mask-r-cnn-lb-0-273

Architecture (CS231n 2025 Lec 9)

Faster R-CNN solved object detection (boxes + classes); Mask R-CNN (He et al. ICCV 2017) extends it to instance segmentation by adding a third head that predicts a per-instance binary mask.

Pipeline

  1. Backbone CNN + RPN → region proposals (same as Faster R-CNN)
  2. RoI Align crops a fixed-size feature for each proposal using bilinear interpolation, with no snap-to-grid quantization (see Object Detection for details). The sub-pixel accuracy matters because masks are pixel-aligned predictions.
  3. Three heads operate on each cropped feature:
    • Classification: class scores
    • Box regression: box offsets (per class)
    • Mask network: a small conv head predicts a stack of masks, one binary mask per class. At inference, keep only the mask for the predicted class.
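The no-snap-to-grid sampling in step 2 can be sketched in NumPy. This is a minimal single-channel version with one bilinear sample per output bin; real RoI Align averages several sample points per bin, but the key property is the same: box coordinates are never rounded.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H x W) at fractional coords (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def roi_align(feat, box, out_size):
    """Crop box = (y1, x1, y2, x2), given in float feature-map coords,
    to an out_size x out_size grid. One bilinear sample per bin centre;
    no coordinate is ever snapped to the integer grid."""
    y1, x1, y2, x2 = box
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feat, y1 + (i + 0.5) * bh, x1 + (j + 0.5) * bw)
    return out

# On a linear ramp feat[i, j] = i + j, bilinear interpolation is exact,
# so each output bin equals the ramp value at its fractional bin centre.
feat = np.add.outer(np.arange(8.0), np.arange(8.0))
crop = roi_align(feat, (1.3, 2.1, 5.3, 6.1), 4)   # first bin centre: y=1.8, x=2.6
```

Note how the fractional box (1.3, 2.1, ...) is sampled directly; RoI Pool would first round it to integer coordinates, losing the sub-pixel alignment the mask head needs.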

Why per-class masks?

The mask head produces one mask per class rather than a single mask. This decouples mask prediction from classification: each mask only answers "is this pixel part of that object?", while the classification head decides which class's mask is used. Empirically this is simpler and works better than predicting a single class-agnostic mask alongside a class label.
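At inference this decoupling is just an indexing step: from the (N, C, m, m) mask logits, keep each instance's mask at its predicted class and apply a sigmoid. A NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def select_masks(mask_logits, labels):
    """mask_logits: (N, C, m, m) per-class mask logits from the mask head.
    labels: (N,) predicted class per instance, from the classification head.
    Returns (N, m, m) per-instance mask probabilities."""
    n = np.arange(mask_logits.shape[0])
    chosen = mask_logits[n, labels]           # each instance's own class channel
    return 1.0 / (1.0 + np.exp(-chosen))      # per-pixel sigmoid, no softmax across classes

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5, 28, 28))      # 3 instances, 5 classes, 28x28 masks
labels = np.array([2, 0, 4])
probs = select_masks(logits, labels)          # shape (3, 28, 28)
```

The sigmoid (rather than a softmax over classes) is what makes the masks independent per class: competition between classes happens only in the classification head.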

Training targets

Project the GT instance mask into the proposal box, resize it to the mask head's fixed resolution (28×28 in the paper), and supervise with a per-pixel sigmoid + binary cross-entropy on the predicted mask of the GT class only.
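A minimal sketch of that loss, assuming the GT masks have already been projected and resized: numerically stable BCE-with-logits, applied only to the GT-class channel (NumPy; names are illustrative).

```python
import numpy as np

def mask_loss(mask_logits, gt_masks, gt_labels):
    """mask_logits: (N, C, m, m) from the mask head.
    gt_masks: (N, m, m) binary targets, already projected into the
    proposal box and resized to m x m. gt_labels: (N,) GT class ids.
    Only the GT-class channel is supervised; other channels get no gradient."""
    n = np.arange(mask_logits.shape[0])
    z = mask_logits[n, gt_labels]                        # (N, m, m) GT-class logits
    # stable BCE with logits: max(z, 0) - z*t + log(1 + exp(-|z|))
    bce = np.maximum(z, 0) - z * gt_masks + np.log1p(np.exp(-np.abs(z)))
    return bce.mean()

logits = np.zeros((2, 4, 28, 28))                        # uninformative head: p = 0.5
targets = np.zeros((2, 28, 28))
loss = mask_loss(logits, targets, np.array([1, 3]))      # log 2 when p = 0.5 everywhere
```

Because the loss indexes only the GT class's mask, each class channel is trained purely as "foreground vs background for this class", which is the decoupling described above.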

Bonus: pose estimation

Replace the mask head with a keypoint head that predicts heatmaps (one per body keypoint, e.g. nose/shoulder/elbow). Same architecture with a different last layer, so human pose estimation comes essentially for free.
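Decoding the keypoint head's output is a per-heatmap argmax: each of the K heatmaps votes for one (x, y) location. A NumPy sketch (real systems also rescale from heatmap to image coordinates, omitted here):

```python
import numpy as np

def decode_keypoints(heatmaps):
    """heatmaps: (K, H, W), one heatmap per keypoint (nose, shoulder, ...).
    Returns (K, 2) array of (x, y) peak locations and (K,) peak scores."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1).argmax(axis=1)        # peak index per heatmap
    ys, xs = np.unravel_index(flat, (H, W))
    scores = heatmaps.max(axis=(1, 2))
    return np.stack([xs, ys], axis=1), scores

hm = np.zeros((2, 56, 56))
hm[0, 10, 20] = 1.0   # keypoint 0 peaks at x=20, y=10
hm[1, 30, 5] = 0.7    # keypoint 1 peaks at x=5,  y=30
coords, scores = decode_keypoints(hm)
```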

Source

CS231n 2025 Lec 9 slides 115–123, 160–164 (Mask R-CNN architecture, mask network, RoI Align, training mask targets, pose head). 2026 PDF not published; using 2025 fallback.