Motivation
Direct set prediction for object detection: the model outputs a fixed-size set of $N$ (class, box) pairs in one forward pass, supervised by a bipartite-matching loss. This eliminates the hand-designed components (anchor boxes, region proposal network, non-maximum suppression) that prior detectors required. Input: an RGB image (variable size; batches are zero-padded to a shared spatial extent). Output: a fixed-size set of $N$ predictions, each a tuple of a class-probability vector over $K+1$ classes ($K$ object categories plus the special "no-object" class $\varnothing$) and a normalized box $b \in [0,1]^4$; predictions in unfilled slots are matched to $\varnothing$.
Architecture
Family & shape. Hybrid (CNN + transformer encoder-decoder). For an input image of size $H_0 \times W_0$, a ResNet-50 or ResNet-101 backbone extracts a feature map $f \in \mathbb{R}^{C \times H \times W}$ where $C = 2048$, $H = H_0/32$, $W = W_0/32$ (§3.2.1). A 1×1 convolution projects channels from $C$ to model dimension $d$, producing $z_0 \in \mathbb{R}^{d \times H \times W}$. The spatial map is flattened to a sequence of $HW$ tokens, augmented with fixed 2D sinusoidal positional encodings, and passed through a 6-layer transformer encoder. A 6-layer transformer decoder then takes $N$ learned object queries ($N = 100$ by default) as input and produces one (class, box) prediction per query in parallel.
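The shape bookkeeping in this paragraph can be traced with a small, dependency-free sketch. The function name `detr_shapes` is hypothetical, the defaults (2048 backbone channels, model dimension 256, 100 queries) are assumed from the paper's ResNet-50 setup, and ceiling division is an assumption standing in for the backbone's exact convolution arithmetic.

```python
import math


def detr_shapes(h0, w0, stride=32, c=2048, d=256, n_queries=100):
    """Trace per-image tensor shapes through the pipeline.

    Assumption: the backbone output size is ceil(input / stride).
    """
    h, w = math.ceil(h0 / stride), math.ceil(w0 / stride)
    return {
        "backbone_features": (c, h, w),     # f, output of the CNN backbone
        "projected_features": (d, h, w),    # z0, after the 1x1 convolution
        "encoder_tokens": (h * w, d),       # flattened spatial sequence
        "decoder_queries": (n_queries, d),  # learned object queries
    }


shapes = detr_shapes(800, 1216)
for name, shape in shapes.items():
    print(f"{name:20s} {shape}")
```

For an 800×1216 input this gives a 25×38 feature map, so the encoder processes 950 tokens while the decoder always processes the same number of queries regardless of image size.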
Blocks. Five named components constitute the pipeline: backbone, 1×1 conv projection, transformer encoder, transformer decoder, and prediction heads (§3.2). The encoder applies standard multi-head self-attention + FFN over all spatial tokens; fixed sinusoidal positional encodings are added before every self-attention layer (§3.2.2). The novel architectural element is the transformer decoder with object queries: the queries are learnable positional embeddings that initialize the decoder input. Each query attends to all encoder tokens via cross-attention and to all other queries via self-attention; decoding is non-autoregressive — all queries are transformed in parallel across 6 decoder layers (§3.2.3). Auxiliary decoding losses are applied at every decoder layer with shared FFN weights (§3.2.5).
A simplified sketch of one DETR decoder layer in PyTorch (post-norm; dropout and attention masks omitted):
```python
import torch.nn as nn


class DETRDecoderLayer(nn.Module):
    """One decoder layer: object queries self-attend, then cross-attend to
    encoder features. Sec. 3.2.3 of DETR (Carion et al. 2020).
    """

    def __init__(self, d: int = 256, n_heads: int = 8, mlp: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, mlp), nn.ReLU(), nn.Linear(mlp, d))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d) for _ in range(3))

    def forward(self, queries, query_pos, memory, memory_pos):
        # queries: [B, N, d]; memory: [B, HW, d]
        # Self-attention among queries: positions are added to Q and K, not V.
        q = queries + query_pos
        queries = queries + self.self_attn(q, q, queries)[0]
        queries = self.n1(queries)
        # Cross-attention: each query attends to all encoder tokens.
        q = queries + query_pos
        k = memory + memory_pos
        queries = queries + self.cross_attn(q, k, memory)[0]
        queries = self.n2(queries)
        # Position-wise feed-forward block with a residual connection.
        queries = queries + self.mlp(queries)
        return self.n3(queries)
```
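The `memory_pos` argument above stands for the fixed sinusoidal positional encodings added to the encoder tokens. A minimal pure-Python sketch of a 2D sine/cosine encoding follows. The function names, the temperature of 10000, and the use of raw integer grid coordinates are assumptions for illustration; a reference implementation may normalize coordinates and order channels differently.

```python
import math


def sine_encoding_2d(row, col, d=256, temperature=10000.0):
    """Return a length-d encoding for one spatial position (row, col).

    Half of the channels encode the row and half the column, each with
    interleaved sine/cosine at geometrically spaced frequencies.
    Assumes d is divisible by 4.
    """
    half = d // 2  # channels devoted to each axis

    def encode_axis(pos):
        out = []
        for i in range(half):
            # Channels 2k and 2k+1 share one frequency (sin, then cos).
            scale = temperature ** (2 * (i // 2) / half)
            angle = pos / scale
            out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return out

    return encode_axis(row) + encode_axis(col)


def sine_encoding_grid(h, w, d=256):
    """Encodings for an h x w feature map, flattened row-major to h*w tokens."""
    return [sine_encoding_2d(r, c, d) for r in range(h) for c in range(w)]


grid = sine_encoding_grid(25, 38)
print(len(grid), len(grid[0]))
```

Because the encoding depends only on position, it can be computed once per feature-map size and reused; nothing in it is learned.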
Each decoder output embedding is passed independently to two prediction heads: a shared 3-layer MLP (hidden dimension $d$, ReLU activations) predicts a normalized bounding box $b \in [0,1]^4$ via a sigmoid output, and a linear layer with softmax predicts the class label over $K+1$ classes, including the special "no-object" class $\varnothing$ for unmatched slots (§3.2.4).
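At inference time the head outputs can be turned into detections without NMS by discarding slots that predict "no-object". The sketch below is an illustration under stated assumptions, not the reference post-processing: the helper names and the 0.7 confidence threshold are hypothetical, the box layout is assumed to be normalized (cx, cy, w, h), and the "no-object" class is assumed to sit at the last index.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy_pixels(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def postprocess(class_logits, boxes, img_w, img_h, threshold=0.7):
    """Keep slots whose most likely class is a real object above threshold.

    class_logits: one list of K+1 logits per query (last index = no-object).
    boxes: one normalized (cx, cy, w, h) tuple per query.
    """
    detections = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)
        no_object = len(probs) - 1
        label = max(range(len(probs)), key=probs.__getitem__)
        if label != no_object and probs[label] >= threshold:
            pixel_box = cxcywh_to_xyxy_pixels(box, img_w, img_h)
            detections.append((label, probs[label], pixel_box))
    return detections


dets = postprocess(
    class_logits=[[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]],
    boxes=[(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)],
    img_w=100,
    img_h=200,
)
print(dets)
```

Only the first slot survives in this example; the second is dropped because its most likely class is "no-object".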
Training. Dataset: COCO 2017 object detection. The loss has two parts — first, the Hungarian algorithm finds the optimal permutation matching predictions to ground-truth objects; then a per-matched-pair loss is applied:
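Given predictions $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ and the ground-truth set $y$ padded to size $N$ with $\varnothing$, find the permutation $\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ minimising a pairwise matching cost; then apply a per-matched-pair Hungarian loss combining negative-log class probability with L1 and GIoU box terms.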
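The two-stage loss can be illustrated on a toy example. The sketch below swaps the Hungarian algorithm for a brute-force search over assignments, which finds the same optimum but is only feasible for a handful of slots; a practical implementation would use a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment`. The weights (5 for L1, 2 for GIoU, 0.1 for the no-object term), the corner-format boxes, and all function names are assumptions for illustration.

```python
import math
from itertools import permutations


def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou(a, b):
    """Generalized IoU of two non-degenerate (x0, y0, x1, y1) boxes."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target, w_l1=5.0, w_giou=2.0):
    """Pairwise cost: -p(target class) + weighted L1 - weighted GIoU."""
    probs, pbox = pred
    label, tbox = target
    return -probs[label] + w_l1 * l1(pbox, tbox) - w_giou * giou(pbox, tbox)


def match(preds, targets):
    """Brute-force optimal assignment; result[j] = prediction index for target j."""
    return min(
        permutations(range(len(preds)), len(targets)),
        key=lambda perm: sum(match_cost(preds[i], targets[j])
                             for j, i in enumerate(perm)),
    )


def hungarian_loss(preds, targets, assign, w_l1=5.0, w_giou=2.0, no_obj_w=0.1):
    """Sum of per-slot losses; unmatched slots are pushed toward no-object."""
    no_object = len(preds[0][0]) - 1  # assumed: last class index
    matched = {i: j for j, i in enumerate(assign)}
    total = 0.0
    for i, (probs, pbox) in enumerate(preds):
        if i in matched:
            label, tbox = targets[matched[i]]
            total += (-math.log(probs[label]) + w_l1 * l1(pbox, tbox)
                      + w_giou * (1.0 - giou(pbox, tbox)))
        else:
            total += -no_obj_w * math.log(probs[no_object])
    return total


# Three prediction slots, two ground-truth objects, K = 2 classes + no-object.
preds = [
    ([0.1, 0.8, 0.1], (0.5, 0.5, 0.9, 0.9)),
    ([0.7, 0.1, 0.2], (0.1, 0.1, 0.4, 0.4)),
    ([0.2, 0.2, 0.6], (0.0, 0.0, 1.0, 1.0)),
]
targets = [(0, (0.1, 0.1, 0.4, 0.5)), (1, (0.5, 0.5, 1.0, 0.9))]
assign = match(preds, targets)
print(assign, hungarian_loss(preds, targets, assign))
```

Here the first target is assigned to the second prediction and the second target to the first; the third slot is left unmatched and contributes only the down-weighted no-object term.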
Optimiser: AdamW with transformer LR $10^{-4}$, backbone LR $10^{-5}$, weight decay $10^{-4}$. Batch size 64 across 16 V100 GPUs. The default schedule trains for 300 epochs with a ×0.1 LR drop at epoch 200 (~3 days on 16 V100s); an extended 500-epoch schedule drops at epoch 400 and adds ~1.5 AP. Augmentation: random horizontal flip, random scale (shortest side 480–800 px, longest 1333 px), random crop (+1 AP). Dropout 0.1 in all transformer components (§4.0.2). Headline metric on COCO val: DETR-R50 reaches 42.0 AP at 500 epochs, matching Faster R-CNN-R50-FPN+ at 42.0 AP (Table 1, §4.2).
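The step schedule with separate transformer and backbone rates can be written out as a small sketch. The base rates used here ($10^{-4}$ for the transformer, $10^{-5}$ for the backbone) are assumed from the paper's reported setup, epochs are assumed to be zero-indexed, and the names `lr_at_epoch`, `BASE_LRS`, and `schedule` are hypothetical.

```python
def lr_at_epoch(epoch, base_lr, drop_epoch=200, gamma=0.1):
    """Step schedule: scale the base rate by gamma from drop_epoch onward."""
    return base_lr * gamma if epoch >= drop_epoch else base_lr


# Two parameter groups with different base learning rates (assumed values).
BASE_LRS = {"transformer": 1e-4, "backbone": 1e-5}


def schedule(epoch, drop_epoch=200):
    """Learning rate of every parameter group at a given epoch."""
    return {name: lr_at_epoch(epoch, lr, drop_epoch)
            for name, lr in BASE_LRS.items()}


for epoch in (0, 199, 200, 299):
    print(epoch, schedule(epoch))

# The extended 500-epoch schedule only moves the drop point.
print(399, schedule(399, drop_epoch=400))
print(400, schedule(400, drop_epoch=400))
```

Keeping the backbone rate an order of magnitude below the transformer rate reflects that the backbone starts from pretrained weights while the transformer is trained from scratch.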
Complexity. DETR-R50: 41 M params, 86 GFLOPs, 28 FPS. DETR-R101: 60 M params, 152 GFLOPs, 20 FPS. DETR-DC5 (dilated C5, stride 16 instead of 32): 41 M params, 187 GFLOPs, 12 FPS. DC5 doubles the feature-map resolution along each axis, which quadruples the spatial token count, incurs ~16× encoder self-attention cost and ~2× total FLOPs, and raises small-object AP by ~2 points (Table 1, §4.0.2).
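The relationship between backbone stride, token count, and encoder attention cost is simple arithmetic, sketched below. The function names are hypothetical, the ceiling division is an assumption about the backbone's output size, and the cost model counts only pairwise token interactions, ignoring the terms linear in the token count.

```python
import math


def encoder_tokens(h0, w0, stride):
    """Number of spatial tokens for an H0 x W0 image at a given stride."""
    return math.ceil(h0 / stride) * math.ceil(w0 / stride)


def attention_pairs(tokens):
    """Token pairs scored by one full self-attention layer (quadratic term)."""
    return tokens * tokens


h0, w0 = 800, 1216
base = encoder_tokens(h0, w0, stride=32)  # standard backbone
dc5 = encoder_tokens(h0, w0, stride=16)   # dilated C5 variant

print("tokens:", base, "->", dc5, f"({dc5 / base:.0f}x)")
print("attention pairs:",
      f"{attention_pairs(dc5) / attention_pairs(base):.0f}x")
```

Halving the stride doubles both spatial dimensions, so tokens grow 4× and the quadratic attention term grows 16×, which matches the figures quoted for DETR-DC5.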
Implementations
Official PyTorch release from Facebook AI; the reference repository ships pretrained checkpoints for DETR-R50, DETR-R101, DETR-DC5, and the COCO panoptic head.
Assessment
Novelty.
- Bipartite-matching set-prediction loss eliminates NMS at inference: each prediction is supervised only by its uniquely assigned ground-truth instance via the Hungarian algorithm, so the model learns to produce diverse, non-duplicate predictions without any post-processing. This replaces the anchor-matching and NMS pipeline of Faster R-CNN, Mask R-CNN, and the one-stage detector family (YOLO, SSD, RetinaNet).
- Transformer encoder-decoder applied to detection as a complete end-to-end pipeline — not merely as an attention module bolted onto CNN features. The decoder uses learnable object queries decoded non-autoregressively (all queries in parallel), a structural departure from autoregressive transformers used in language modelling.
- Zero spatial hyperparameters: no anchors, no region proposals, no IoU thresholds, no scale or aspect-ratio priors — the object queries learn implicit spatial distributions from data (§3.1).
- Natural extension to panoptic segmentation via a small segmentation head trained on top of frozen DETR (§5) — the same set-prediction interface accommodates instance + stuff segmentation without architectural redesign.
Strengths.
- COCO val AP: DETR-R50 42.0 (500 ep, 86 GFLOPs) vs Faster R-CNN-R50-FPN+ 42.0 (180 GFLOPs) — comparable accuracy at 2.1× fewer FLOPs (Table 1).
- Large-object AP: DETR-R50 $\mathrm{AP}_L$ 61.1 vs Faster R-CNN-R50-FPN+ 53.4, a 7.7-point absolute lead, attributed to global encoder self-attention where every spatial token attends to every other (§4.1).
- Conceptual simplicity: the detector is fully end-to-end differentiable from image pixels to (class, box) set, with no anchor boxes, no NMS, no RPN, and no IoU-threshold hyperparameters.
- Direct extensibility to other set-prediction tasks (panoptic segmentation, instance segmentation, 3D detection) by replacing the per-query prediction head, with no changes to the encoder or decoder.
Limitations.
- Slow training convergence: DETR requires 300–500 epochs vs 12–36 for Faster R-CNN (§4.0.2, §4.1), roughly an order of magnitude more training epochs to reach comparable accuracy. Deformable DETR (2020) was motivated specifically by this deficit and achieves competitive AP in ~50 epochs via sparse attention.
- Small-object AP underperforms Faster R-CNN: DETR-R50 $\mathrm{AP}_S$ 20.5 vs Faster R-CNN-FPN+ 26.6, a 6.1-point absolute deficit (Table 1). The stride-32 feature map gives small objects very few spatial tokens in the encoder; DETR-DC5 partially closes the gap ($\mathrm{AP}_S$ 22.5) at the cost of doubled compute.
- Quadratic encoder self-attention cost $\mathcal{O}((HW)^2)$: DETR-DC5's stride-16 backbone quadruples the token count, producing 16× higher encoder attention cost and 2× total FLOPs, making very-high-resolution inputs impractical without windowed-attention variants (§4.0.2).
- Fixed cap of $N = 100$ predictions per image: images with more than $N$ objects will silently miss detections. COCO has up to 63 instances per image (§4.0.1), leaving the current cap adequate for that dataset but not for dense-scene applications.