Motivation
Faster R-CNN takes an RGB image of arbitrary size and produces a set of axis-aligned bounding boxes, each paired with a class label and a softmax confidence score. Its distinguishing property is a Region Proposal Network (RPN) that shares the full-image convolutional feature map with a Fast R-CNN detection head. Candidate box generation becomes a roughly 10 ms learned forward pass on the same GPU-resident features, replacing external proposal methods such as Selective Search (~1.5 s on CPU) and removing any test-time dependence on image or filter pyramids.
Architecture
Family & shape. Two-stage CNN object detector. Input: RGB image, shorter side resized to $s = 600$ px (§3.3). Output: per detected instance, a class label drawn from a fixed vocabulary, a softmax score in $[0, 1]$, and an axis-aligned bounding box. Backbone options: ZF (5 shareable conv layers, 256-d feature map) or VGG-16 (13 shareable conv layers, 512-d feature map); competition results use ResNet-101 (§4, ILSVRC/COCO 2015).
Blocks. The RPN head slides a $3 \times 3$ convolutional window over the last shared conv feature map, producing a 256-d (ZF) or 512-d (VGG-16) intermediate feature at each spatial location (§3.1). Two parallel $1 \times 1$ convolutions branch from this intermediate feature: a classification branch producing $2k$ objectness logits (foreground / background per anchor) and a regression branch producing $4k$ box-delta outputs. At each location $k = 9$ anchors are placed (3 scales, $128^2$, $256^2$, $512^2$ px, × 3 aspect ratios, 1:1, 1:2, 2:1), yielding approximately 20 000 anchors on a $1000 \times 600$ image (§3.1.1, §3.3).
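The anchor layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `make_anchors`, the choice to centre each anchor on its stride cell, and the use of floor division for the feature-map size are all assumptions made for the example.

```python
import math

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors, len(scales) * len(ratios) per feature-map cell."""
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Anchor centre in image coordinates (cell centre: an assumption of this sketch).
            cx = (ix + 0.5) * stride
            cy = (iy + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r = h / w, with the box area held at s * s pixels.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A 1000 x 600 image at stride 16 gives a 62 x 37 feature map under floor division.
anchors = make_anchors(feat_h=600 // 16, feat_w=1000 // 16)
print(len(anchors))  # 20646
```

With these assumptions the count is 62 × 37 × 9 = 20 646, in line with the "approximately 20 000" figure quoted above.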
The RPN head forward pass in PyTorch pseudocode:
```python
import torch.nn as nn


class RPNHead(nn.Module):
    def __init__(self, in_channels: int, num_anchors: int = 9):
        super().__init__()
        # 3x3 sliding window over the shared feature map (256-d for ZF, 512-d for VGG-16)
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Sibling 1x1 convolutions: 2k objectness logits and 4k box deltas per location
        self.cls_logits = nn.Conv2d(in_channels, num_anchors * 2, 1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        # feature_map: (B, C, H, W), output of the last shared conv layer
        t = self.relu(self.conv(feature_map))
        cls = self.cls_logits(t)    # (B, 2k, H, W)
        bbox = self.bbox_deltas(t)  # (B, 4k, H, W)
        return cls, bbox

# Box deltas are parameterised relative to anchors:
# t_x = (x - x_a) / w_a, t_w = log(w / w_a), etc. (Eq. 2)
```
RPN proposals are ranked by objectness score, deduplicated with NMS at an IoU threshold of 0.7, and the top 300 are forwarded to the Fast R-CNN head, which applies RoI pooling to the same shared conv feature map and outputs per-class softmax scores and refined boxes.
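The rank, suppress, and truncate step can be illustrated with a greedy NMS in plain Python. This is a sketch under stated assumptions: `iou` and `select_proposals` are hypothetical helper names, and the quadratic loop is for clarity only. In a PyTorch pipeline, `torchvision.ops.nms(boxes, scores, iou_threshold)` would normally do the suppression.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_proposals(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy NMS over score-ranked boxes; returns indices of at most top_n survivors."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = select_proposals(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU ≈ 0.82, above the 0.7 threshold, so it is suppressed; the third box is disjoint and survives.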
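The RPN loss combines a binary log-loss on objectness with a smooth-L1 regression loss on box deltas; the regression term is activated only for positive anchors. The two terms are normalised by the mini-batch size ($N_{cls}$) and the number of anchor locations ($N_{reg}$) respectively.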
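As a rough illustration of how the two terms combine, the sketch below evaluates the loss for a handful of sampled anchors in plain Python. The function names and argument layout are assumptions made for this example; the defaults (256 for the classification normaliser, roughly 2400 for the regression normaliser, balance weight 10) are the values the paper reports in §3.1.2.

```python
import math

def smooth_l1(x):
    """Smooth-L1 penalty on a single residual: quadratic near zero, linear beyond 1."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def rpn_loss(probs, labels, deltas, targets, n_cls=256, n_reg=2400, lam=10.0):
    """
    probs:   predicted foreground probability per sampled anchor
    labels:  1 for positive anchors, 0 for negative anchors
    deltas:  predicted (t_x, t_y, t_w, t_h) per anchor
    targets: ground-truth (t_x, t_y, t_w, t_h) per anchor (unused for negatives)
    """
    cls_sum = 0.0
    reg_sum = 0.0
    for p, y, t, t_star in zip(probs, labels, deltas, targets):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        cls_sum += -math.log(p) if y == 1 else -math.log(1.0 - p)
        if y == 1:  # regression term is active only for positive anchors
            reg_sum += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls_sum / n_cls + lam * reg_sum / n_reg
```

With these defaults the regression weight is 10 / 2400 ≈ 1/240, close to the classification weight 1/256, which is why the two terms end up roughly balanced.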
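Box regression targets are offsets relative to the anchor centre $(x_a, y_a)$, width $w_a$, and height $h_a$: centre offsets are normalised by the anchor size, and width and height use log-space ratios, which keeps the regression loss well conditioned across large scale variation (§3.1.2, Eq. 2).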
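The parameterisation of Eq. 2 and its inverse can be written out directly. This is a minimal sketch with boxes given as (centre x, centre y, width, height); the names `encode` and `decode` are chosen for this example and are not taken from the paper or any library.

```python
import math

def encode(box, anchor):
    """Regression targets (t_x, t_y, t_w, t_h) for a box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(t, anchor):
    """Invert encode(): recover (x, y, w, h) from deltas and the anchor."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (300.0, 200.0, 256.0, 256.0)
gt_box = (320.0, 190.0, 512.0, 128.0)
t = encode(gt_box, anchor)
print(t)                  # centre shifts as fractions of anchor size, log size ratios
print(decode(t, anchor))  # recovers gt_box up to floating-point error
```

Because width and height enter as log ratios, doubling and halving a dimension give targets of equal magnitude (±log 2), regardless of the anchor's absolute size.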
Training. Four-step alternating optimisation (§3.2): (1) train RPN from ImageNet initialisation; (2) train Fast R-CNN detector on step-1 proposals; (3) fine-tune RPN with shared conv layers frozen; (4) fine-tune Fast R-CNN unique layers only. Approximate joint training reaches comparable accuracy with 25–50% less training time (§3.2). Mini-batch: 256 anchors per image, up to 128 positive, remainder negative; an anchor is positive if its IoU with any ground-truth box is $> 0.7$, and negative if its IoU is $< 0.3$ for all ground-truth boxes (§3.1.2, §3.1.3). Loss balance $\lambda = 10$ by default, making the cls and reg terms roughly equally weighted given $N_{cls} = 256$ and $N_{reg} \approx 2400$ anchor locations (§3.1.2). Learning rate schedule on PASCAL VOC: 0.001 for 60k mini-batches, 0.0001 for the next 20k; momentum 0.9; weight decay 0.0005 (§3.1.3).
Headline results: VGG-16 trained on VOC 07+12 trainval achieves 73.2% mAP on PASCAL VOC 2007 test at 300 proposals (Table III); VGG-16 trained on the COCO train set achieves 42.1% mAP@0.5 and 21.5% mAP@[.5,.95] on COCO test-dev (Table XI). A ResNet-101 backbone raises COCO val results to 48.4% mAP@0.5 / 27.2% mAP@[.5,.95] and was the basis of the 1st-place entries in the ILSVRC and COCO 2015 detection challenges (§4).
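The anchor labelling and mini-batch sampling rule described under Training can be sketched as follows. The helper names are hypothetical, the thresholds 0.7 and 0.3 are the paper's values, and the sketch deliberately omits the paper's additional rule that the highest-IoU anchor for each ground-truth box is also labelled positive.

```python
import random

def label_anchor(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Label one anchor from its highest IoU with any ground-truth box.

    Returns 1 (positive), 0 (negative), or -1 (ignored during training).
    """
    if max_iou > pos_thresh:
        return 1
    if max_iou < neg_thresh:
        return 0
    return -1

def sample_minibatch(labels, batch_size=256, max_pos=128, seed=0):
    """Sample up to max_pos positives, then fill the rest of the batch with negatives."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    pos = rng.sample(pos, min(len(pos), max_pos))
    neg = rng.sample(neg, min(len(neg), batch_size - len(pos)))
    return pos, neg

labels = [label_anchor(v) for v in (0.8, 0.5, 0.1, 0.75, 0.2)]
print(labels)  # [1, -1, 0, 1, 0]
```

When an image has fewer than 128 positive anchors, the remaining slots are filled with negatives, so a batch with 10 positives would contain 246 negatives.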
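Complexity. Inference on a K40 GPU with VGG-16 takes 198 ms/image (5 fps) in total: 141 ms for the shared conv layers, 10 ms for the RPN proposal layers, and 47 ms for the region-wise Fast R-CNN layers (Table V). With the ZF backbone the total is 59 ms/image (17 fps) (Table V). The RPN head adds approximately 2.4M parameters on top of VGG-16 (a 3 × 3, 512 → 512 convolution plus two 1 × 1 sibling layers). The dominant parameter cost is the backbone (VGG-16 ≈ 138M; ZF ≈ 62M).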
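The size of the RPN head can be checked by counting the parameters of its three convolutions. This is a back-of-the-envelope sketch that assumes the layer shapes of the RPNHead pseudocode above (a channel-preserving 3 × 3 convolution followed by two 1 × 1 sibling convolutions, all with biases); the helper names are made up for the example.

```python
def conv2d_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def rpn_head_params(channels, num_anchors=9):
    """Parameter count of the 3x3 conv plus the 2k-way and 4k-way 1x1 sibling convs."""
    return (conv2d_params(channels, channels, 3)
            + conv2d_params(channels, 2 * num_anchors, 1)
            + conv2d_params(channels, 4 * num_anchors, 1))

print(rpn_head_params(512))  # VGG-16 feature width: 2387510
print(rpn_head_params(256))  # ZF feature width: 603958
```

Under these assumptions the head is about 2.4M parameters for VGG-16 and about 0.6M for ZF, small next to either backbone.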
Implementations
Original MATLAB + Caffe release by the paper authors; the Python + Caffe port by co-author Girshick was the dominant reference for years; Detectron2 is the modern PyTorch reference used in current research.
Assessment
Novelty.
- Replaces external proposal methods such as Selective Search (~1.5 s CPU per image) and EdgeBoxes with a learned RPN sharing conv features with the Fast R-CNN detector; proposal cost drops to 10 ms on GPU (§1; Table V).
- Introduces anchor boxes (3 scale × 3 aspect-ratio templates) as translation-invariant proposal references, eliminating the need for image or filter pyramids (§3.1.1).
- Establishes the empirical case for two-stage detection: same ZF backbone, one-stage dense sliding-window variant trails by 4.8% mAP on PASCAL VOC 2007 (Table X; 58.7% two-stage vs 53.9% one-stage).
Strengths.
- 73.2% mAP on PASCAL VOC 2007 with VGG-16 + 07+12 trainval at 5 fps (Tables III, V), Pareto-dominant on speed–accuracy among 2015-era detectors.
- Robust to anchor hyperparameter choices: varying $\lambda$ changes mAP by only ~1% across $\lambda \in [1, 100]$ (Table IX); reducing anchor aspect ratios from 3 to 1 while keeping 3 scales costs only ~0.1% mAP (Table VIII).
- Backbone-agnostic: substituting ResNet-101 for VGG-16 raises COCO val mAP@0.5 from 41.5% to 48.4% and mAP@[.5,.95] from 21.2% to 27.2% without architectural change (§4).
Limitations.
- Small objects below backbone stride (16 px for ZF/VGG-16) produce weak RPN activations; the COCO experiments add a $64^2$ anchor scale specifically to address this (§4.2).
- Anchor templates are fixed at design time, so objects with aspect ratios outside $\{1{:}2, 1{:}1, 2{:}1\}$ are systematically under-covered; the regressor degrades gracefully (3–4% mAP loss with a single anchor, one scale and one aspect ratio, per location, Table VIII) but cannot compensate for shapes entirely outside the template set.
- Real-time figures (5–17 fps) require GPU; CPU inference is impractical at reported speeds. One-stage detectors (SSD, YOLO, RetinaNet) are preferable when latency budget is below ~30 ms or GPU is unavailable (Table V; research-note §Applicability).
- Alternating 4-step training runs ~60k + 20k iterations on PASCAL VOC (§3.1.3); approximate joint training reduces wall-clock time by 25–50% but ignores the derivative with respect to the proposal box coordinates, a gradient approximation that can subtly degrade very deep backbones (§3.2).
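The anchor-coverage limitation listed above can be made concrete with a small calculation: the best IoU any of the nine default templates achieves against an elongated object placed exactly on an anchor centre. This is an illustrative sketch with hypothetical helper names. It considers only the IoU threshold rule, not the paper's additional rule that the highest-IoU anchor per ground-truth box is labelled positive, and it says nothing about what the regressor can recover afterwards.

```python
import math

def centred_iou(w1, h1, w2, h2):
    """IoU of two axis-aligned boxes that share the same centre."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def best_anchor_iou(area_side, ratio, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Best IoU between a centred object (h / w = ratio, area = area_side ** 2) and any template."""
    w, h = area_side / math.sqrt(ratio), area_side * math.sqrt(ratio)
    best = 0.0
    for s in scales:
        for r in ratios:
            best = max(best, centred_iou(w, h, s / math.sqrt(r), s * math.sqrt(r)))
    return best

print(round(best_anchor_iou(256, 2.0), 3))  # object that matches a template
print(round(best_anchor_iou(256, 5.0), 3))  # elongated object, five times taller than wide
```

For the elongated object the best template overlap is roughly 0.46, well below the 0.7 positive threshold, even with perfect centre alignment.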
References
- Ren, He, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS, 2015. arXiv:1506.01497.
- Long, Shelhamer, Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015. (The FCN formulation the RPN adopts.)
- Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS, 2012. (Backbone pre-training paradigm.)
- He, Zhang, Ren, Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. (ResNet-101 backbone for the 2015 competition wins.)
- He, Gkioxari, Dollár, Girshick. Mask R-CNN. ICCV, 2017. (Extends Faster R-CNN with a mask branch.)