
Mask R-CNN

Based on
Mask R-CNN
He, Gkioxari, Dollár, Girshick · ICCV 2017

Motivation

Instance segmentation assigns, for each detected object in an RGB image, a class label, a confidence score, a bounding box, and a per-instance binary pixel mask. Input: RGB image, shorter edge resized to 800 px (§3.1). Output: per detection, a class label + score + bounding box + binary mask at $m \times m$ RoI resolution ($m = 14$ for the ResNet-C4 head, $m = 28$ for the FPN head), resized to the box extent at inference. The task is distinct from semantic segmentation (per-pixel class labels, no instance identity) and from detection-only pipelines (bounding boxes, no pixel masks). The defining contribution extends Faster R-CNN's two-stage pipeline (RPN + classification/box-regression RoI head) with a third, parallel mask branch: a small FCN that produces $K$ independent per-class binary masks at each RoI. The branch is combined with RoIAlign, which replaces RoIPool's two coarse coordinate quantizations (rounding $x/16$ to $[x/16]$) with continuous bilinear sampling at $x/16$ (no rounding). This recovers the pixel-accurate spatial alignment that detection's box-level losses do not require but pixel-level mask prediction does.

Architecture

Family & shape. Two-stage CNN detection model (Faster R-CNN substrate) extended with a parallel mask head. Backbone: ResNet-50, ResNet-101, or ResNeXt-101, paired with FPN (the recommended default). Input: RGB image, shorter edge 800 px (§3.1). Outputs per detected instance: class label + score + bounding box + binary mask at $m \times m$ RoI resolution, with $m = 14$ for the ResNet-C4 head and $m = 28$ for the FPN head (Figure 4).

Blocks.

(a) Region Proposal Network (RPN). The RPN proposes candidate RoIs on the backbone feature map. This component is inherited directly from Faster R-CNN (Ren et al. 2015); Mask R-CNN "adopts the same two-stage procedure" (§3).

(b) RoIAlign. RoIPool quantizes the continuous floating-point coordinate $x$ to $[x/16]$ (the stride-16 feature-map cell, where $[\cdot]$ denotes rounding) and then quantizes a second time when partitioning the RoI into spatial bins. Both steps introduce spatial misalignment that is negligible for bounding-box regression but catastrophic for pixel-accurate masks. RoIAlign removes both quantizations: it keeps the exact floating-point position $x/16$ and computes the feature value by bilinear interpolation at four regularly spaced sampling points per bin (§3, RoIAlign; Figure 3).
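The difference between the two operators can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the coordinate convention (feature value `feat[i][j]` sits at continuous position `(i, j)`), the choice of averaging the four samples, and the image coordinate `x_image` are all assumptions made for the example.

```python
import math

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at continuous (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp so the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_bin(feat, y_lo, y_hi, x_lo, x_hi, samples=2):
    """Average samples x samples regularly spaced bilinear samples inside one bin.

    Bin bounds are continuous feature-map coordinates (image coordinate / stride);
    nothing is rounded. samples=2 gives the four points per bin described above.
    """
    total = 0.0
    for i in range(samples):
        for j in range(samples):
            y = y_lo + (i + 0.5) * (y_hi - y_lo) / samples
            x = x_lo + (j + 0.5) * (x_hi - x_lo) / samples
            total += bilinear_sample(feat, y, x)
    return total / (samples * samples)

STRIDE = 16
x_image = 205.0                      # hypothetical image-space coordinate
x_roipool = round(x_image / STRIDE)  # RoIPool-style quantization: 12.8125 -> 13
x_roialign = x_image / STRIDE        # RoIAlign keeps 12.8125

# Toy 2x2 feature map, one bin covering the unit square between the four cells.
feat = [[0.0, 1.0], [2.0, 3.0]]
bin_value = roi_align_bin(feat, 0.0, 1.0, 0.0, 1.0)
```

In this toy case the quantized coordinate is off by 0.1875 feature cells, which is 3 px at stride 16; an error of that size is small relative to a box but large relative to a mask boundary.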

(c) Three parallel heads. After RoIAlign, each RoI is routed simultaneously to: (i) the classification head, trained with $L_\text{cls}$; (ii) the bounding-box regression head, trained with $L_\text{box}$; and (iii) the mask branch, a small FCN producing a $Km^2$-dimensional output encoding $K$ binary masks of resolution $m \times m$ per RoI, followed by a per-pixel sigmoid and trained with the mask loss $L_\text{mask}$ (§3, Mask R-CNN).

(d) Decoupled mask-class prediction. The mask branch outputs $K$ independent per-class binary masks per RoI; a per-pixel sigmoid (not a per-pixel softmax across $K$ classes) is applied to each. $L_\text{mask}$ is the average binary cross-entropy computed only on the $k$-th mask channel, where $k$ is the ground-truth class of the RoI. Masks for the remaining $K-1$ classes contribute nothing to the loss. The classification head selects which of the $K$ masks to use at inference. This decoupling avoids inter-class competition during mask training and is the paper's key departure from FCN-style per-pixel softmax prediction (§3, Mask R-CNN, paragraph 3).

The mask-head loss in PyTorch:

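import torch
import torch.nn.functional as F

def mask_head_loss(
    logits: torch.Tensor,      # (N, K, m, m)  K per-class mask logits
    targets: torch.Tensor,     # (N, m, m)     binary ground-truth masks
    gt_classes: torch.Tensor,  # (N,)          ground-truth class index per RoI
) -> torch.Tensor:
    """Per-class sigmoid BCE on the ground-truth class channel only.
    N is the count of matched foreground RoIs (positive RoIs only).
    """
    N, K, m, _ = logits.shape
    if N == 0:
        # No positive RoIs: return a zero that stays attached to the graph
        # instead of the NaN that a mean over an empty tensor would produce.
        return logits.sum() * 0.0

    # Build an (N, 1, m, m) index selecting channel gt_classes[n] for RoI n.
    # gather requires an int64 index, hence the explicit .long().
    idx = gt_classes.long().view(N, 1, 1, 1).expand(N, 1, m, m)
    chosen = logits.gather(dim=1, index=idx).squeeze(1)  # (N, m, m)

    # Sigmoid is applied inside the loss; the mean runs over all N * m * m pixels.
    return F.binary_cross_entropy_with_logits(chosen, targets.float())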
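At inference the same channel selection is driven by the predicted class rather than the ground truth, and the $m \times m$ output is resized to the box and binarized. The following stdlib-only sketch illustrates that step under stated assumptions: the function name `paste_mask`, the pixel-centre resize convention, and the use of bilinear interpolation are illustrative choices, and the 0.5 threshold follows the paper's inference description.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def paste_mask(logits, pred_class, box_h, box_w, threshold=0.5):
    """Turn one RoI's K x m x m logits into a box_h x box_w binary mask.

    Only the channel of the predicted class is used; the other K-1 are ignored.
    """
    chan = logits[pred_class]
    m = len(chan)
    probs = [[sigmoid(v) for v in row] for row in chan]

    def src(i, size):
        # Map an output pixel centre back into the m x m grid (assumed convention).
        return min(max((i + 0.5) * m / size - 0.5, 0.0), m - 1.0)

    out = []
    for r in range(box_h):
        y = src(r, box_h)
        y0 = int(y)
        y1 = min(y0 + 1, m - 1)
        dy = y - y0
        row = []
        for c in range(box_w):
            x = src(c, box_w)
            x0 = int(x)
            x1 = min(x0 + 1, m - 1)
            dx = x - x0
            top = probs[y0][x0] * (1 - dx) + probs[y0][x1] * dx
            bottom = probs[y1][x0] * (1 - dx) + probs[y1][x1] * dx
            p = top * (1 - dy) + bottom * dy
            row.append(1 if p >= threshold else 0)
        out.append(row)
    return out

# Toy RoI with K = 2 classes and m = 2: class 1 is "on" in the top half only.
toy_logits = [
    [[-5.0, -5.0], [-5.0, -5.0]],
    [[5.0, 5.0], [-5.0, -5.0]],
]
mask = paste_mask(toy_logits, pred_class=1, box_h=4, box_w=4)
```

Because each channel has its own sigmoid, the mask for class 1 is unaffected by whatever the class 0 channel predicts, which is the decoupling described in (d).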

Training. Dataset: COCO train2017 (80 classes). Loss: $L = L_\text{cls} + L_\text{box} + L_\text{mask}$ per RoI, where $L_\text{cls}$ and $L_\text{box}$ are identical to Faster R-CNN and $L_\text{mask}$ is the per-class sigmoid BCE on the ground-truth class channel only. Schedule: SGD with momentum 0.9, weight decay $10^{-4}$, initial LR 0.02 reduced $10\times$ at 120k iterations of 160k total, 8 GPUs at 2 images/GPU (effective mini-batch 16); ResNeXt variants use 1 image/GPU and starting LR 0.01 (§3.1, Training). Augmentation: standard horizontal flipping. Headline metrics (Table 1, COCO test-dev): ResNet-101-FPN mask AP 35.7; ResNeXt-101-FPN mask AP 37.1; both surpass FCIS+++ (AP 33.6) without test-time augmentation.
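The schedule above amounts to a single step decay. A minimal sketch follows, assuming no warmup and using hypothetical names (`learning_rate`, `BASE_LR`, and so on) that do not come from any particular codebase:

```python
BASE_LR = 0.02         # initial learning rate (0.01 for the ResNeXt variants)
DECAY_AT = 120_000     # iteration at which the rate drops by 10x
TOTAL_ITERS = 160_000  # total training iterations
GPUS = 8
IMAGES_PER_GPU = 2     # 1 for the ResNeXt variants

def learning_rate(iteration, base_lr=BASE_LR):
    """Step schedule: base_lr before DECAY_AT, base_lr / 10 from DECAY_AT onward."""
    if not 0 <= iteration < TOTAL_ITERS:
        raise ValueError("iteration outside the training run")
    return base_lr if iteration < DECAY_AT else base_lr / 10.0

effective_batch = GPUS * IMAGES_PER_GPU  # 16 images per SGD step
```

For the ResNeXt setting, `learning_rate(it, base_lr=0.01)` gives the corresponding schedule with the halved starting rate.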

Definition
Mask R-CNN multi-task loss

$L = L_\text{cls} + L_\text{box} + L_\text{mask}$

$L_\text{cls}$ and $L_\text{box}$ inherit from Faster R-CNN; $L_\text{mask}$ is the average binary cross-entropy over the $k$-th mask channel only (where $k$ is the ground-truth class of the RoI), preventing inter-class competition during mask training (§3, Mask R-CNN).
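Written out for a single positive RoI, the mask term can be expressed as below. The symbols $z^{k}_{ij}$ (mask logit of channel $k$ at pixel $(i, j)$), $y_{ij} \in \{0, 1\}$ (ground-truth mask) and $\sigma$ (logistic sigmoid) are notation introduced here for illustration, not the paper's.

```latex
L_\text{mask}
  = -\frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m}
    \Big[\, y_{ij} \log \sigma\!\big(z^{k}_{ij}\big)
        + (1 - y_{ij}) \log\!\big(1 - \sigma(z^{k}_{ij})\big) \Big]
```

Only channel $k$ appears in the sum, so the logits of the other $K-1$ channels receive no gradient from this RoI.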

Complexity. The mask branch adds approximately 20% inference overhead over the Faster R-CNN counterpart (§3.1, Inference). ResNet-101-FPN runs at approximately 195 ms per image (about 5 fps) on a Tesla M40 GPU; the ResNet-101-C4 variant takes approximately 400 ms per image because its box head incorporates the heavy res5 stage (§4.4).
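As a quick check on how the per-image latencies relate to frame rate (simple unit conversion, using only the two timings quoted above):

```latex
\frac{1000\ \text{ms/s}}{195\ \text{ms/image}} \approx 5.1\ \text{images/s},
\qquad
\frac{1000\ \text{ms/s}}{400\ \text{ms/image}} = 2.5\ \text{images/s}
```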

Implementations

Official PyTorch implementation in facebookresearch/detectron2 (FAIR's successor to the original Caffe2 Detectron release); the most widely used third-party Keras/TensorFlow port is matterport/Mask_RCNN.

Assessment

Novelty.

  • Adds a parallel mask branch (small FCN per RoI) to Faster R-CNN's two-head detector (class + box) — extends the architecture rather than replacing the detection backbone with a segmentation network. Contrast: FCIS, which couples detection and segmentation in a single position-sensitive output, and MNC, which uses a multi-stage cascade.
  • Decoupled per-class binary masks with per-pixel sigmoid + binary cross-entropy on the ground-truth class channel — contrast to FCN-style per-pixel softmax across classes, which forces inter-class competition during mask training.
  • RoIAlign replaces RoIPool's two coarse quantizations (rounding to $[x/16]$) with continuous-coordinate bilinear sampling, a direct response to the localization-sensitivity gap that detection's box-only losses hide.
  • Same framework extends naturally to person keypoint detection by treating each joint type as a one-hot binary mask channel (§5).
  • Functions as the deep-learning replacement for the classical part-based detection paradigm (Felzenszwalb et al. 2010): CNN features + region proposals supplant HOG + root/part filters, and per-RoI mask prediction extends the output beyond DPM's bounding boxes.

Strengths.

  • ResNet-101-FPN COCO test-dev mask AP 35.7; ResNeXt-101-FPN test-dev mask AP 37.1 (Table 1) — surpassed the official COCO 2016 instance-segmentation winner FCIS+++ (AP 33.6) without test-time augmentation.
  • RoIAlign alone contributes approximately +3 AP and +5 $\text{AP}_{75}$ over RoIPool on ResNet-50-C4 (Table 2c), with much of the gain at high IoU.
  • Per-class sigmoid (decoupled) vs per-pixel softmax (FCN-style): +5.5 AP at ResNet-50-C4 (Table 2b, 30.3 vs 24.8), which supports decoupling mask prediction from class prediction.
  • Framework generality demonstrated across instance segmentation, bounding-box object detection, and person keypoint detection with only minor modifications (§5).

Limitations.

  • Two-stage pipeline (RPN → RoI head) is slow for real-time deployment — approximately 5 fps on a Tesla M40 GPU (Abstract); not suitable for mobile or edge inference at <10 ms budgets.
  • Fixed mask resolution per RoI ($14 \times 14$ for C4, $28 \times 28$ for FPN; Figure 4): small or thin instances are resampled from low-resolution masks and lose fine boundary detail.
  • NMS during inference suppresses heavily overlapping detections of the same class — two adjacent same-class instances that overlap above the NMS threshold lose one mask silently.
  • Closed-vocabulary mask head: $K$ output channels for the $K$ training classes; novel classes require retraining or adaptation.
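
The NMS limitation listed above can be reproduced with a small greedy NMS sketch. This is a generic per-class greedy NMS written for illustration; the 0.5 IoU threshold and the toy boxes are assumptions, not the settings of any specific Mask R-CNN release.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes (same class), highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two genuinely distinct but adjacent same-class objects, plus one far away.
boxes = [(0, 0, 100, 100), (20, 0, 120, 100), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Here the first two boxes have IoU 8000 / 12000, roughly 0.67, so the second detection is dropped before the mask branch's output for it is ever used.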

References

  1. K. He, G. Gkioxari, P. Dollár, R. Girshick. Mask R-CNN. ICCV, 2017. arXiv
    .06870
  2. S. Ren, K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS, 2015. arXiv
    .01497
  3. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie. Feature Pyramid Networks for Object Detection. CVPR, 2017. arXiv
    .03144
  4. J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015. arXiv
    .4038
  5. K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. arXiv
    .03385

Fed by

  • FCN: Fully Convolutional Networks

    Mask R-CNN adopts FCN's fully convolutional per-pixel prediction for the mask branch inside an instance-segmentation pipeline, but replaces FCN's per-pixel softmax with per-class sigmoid masks that are decoupled from class prediction.

  • ResNet

    Mask R-CNN's headline backbones are ResNet-50/101 and ResNeXt-101 paired with FPN.

Learned alternative of

  • medium
    Deformable Part Models

    Mask R-CNN's CNN backbone, region proposals, and RoIAlign replace DPM's HOG features, root + part filters, and latent-SVM scoring; Mask R-CNN also outputs per-instance masks beyond DPM's bounding boxes.