DETR

7 min read · Intermediate · hybrid · Params: 41M (DETR R50), 60M (DETR R101) · Compute: 86 GFLOPs (DETR R50), 152 GFLOPs (DETR R101)
Based on
End-to-End Object Detection with Transformers
Carion, Massa, Synnaeve, Usunier et al. · ECCV 2020
arXiv ↗

Motivation

Direct set prediction for object detection: the model outputs a fixed-size set of $N$ (class, box) pairs in one forward pass and is supervised by a bipartite-matching loss. This eliminates the hand-designed components (anchor boxes, region proposal network, non-maximum suppression) that prior detectors required. Input: an RGB image $x_{\rm img} \in \mathbb{R}^{3 \times H_0 \times W_0}$ (variable size; batches are zero-padded to a shared spatial extent). Output: a fixed-size set of $N = 100$ predictions, each a tuple of a class-probability vector over $K+1$ classes (the $K$ object classes plus the special "no-object" class $\varnothing$) and a normalized box $(c_x, c_y, w, h) \in [0,1]^4$. Slots that do not correspond to any object predict $\varnothing$.

Architecture

Family & shape. Hybrid (CNN + transformer encoder-decoder). A ResNet-50 or ResNet-101 backbone extracts a feature map $f \in \mathbb{R}^{C \times H \times W}$ where $C = 2048$, $H = H_0/32$, $W = W_0/32$ (§3.2.1). A 1×1 convolution projects channels from $C$ to model dimension $d = 256$, producing $z_0 \in \mathbb{R}^{d \times H \times W}$. The spatial map is flattened to a sequence of $HW$ tokens, augmented with fixed 2D sinusoidal positional encodings, and passed through a 6-layer transformer encoder. A 6-layer transformer decoder then takes $N = 100$ learned object queries as input and produces one (class, box) prediction per query in parallel.
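
The fixed 2D sinusoidal encoding mentioned above can be sketched in plain Python. This is one common construction, not necessarily the reference implementation's: the function name `sine_position_2d`, the even split of channels between the row and column coordinate, and the temperature of 10000 are assumptions made for illustration, and the official code may normalise coordinates differently.

```python
import math


def sine_position_2d(h, w, d=256, temperature=10000.0):
    """Return h*w encodings of length d, in the row-major order of the flattened map.

    Assumed layout: the first d/2 channels encode the row index y,
    the last d/2 channels encode the column index x.
    """
    half = d // 2  # channels devoted to each axis

    def axis_code(pos):
        code = []
        for k in range(half):
            # Channel pairs (2m, 2m+1) share one frequency: sin on even, cos on odd.
            freq = temperature ** (2 * (k // 2) / half)
            angle = pos / freq
            code.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        return code

    return [axis_code(y) + axis_code(x) for y in range(h) for x in range(w)]


# Tiny example: a 3x4 feature map with d = 8 gives 12 tokens of 8 channels each.
pe = sine_position_2d(3, 4, d=8)
print(len(pe), len(pe[0]))
```

Because the encoding depends only on the grid position, it is computed once per feature-map size and needs no learned parameters.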

Blocks. Five named components constitute the pipeline: backbone, 1×1 conv projection, transformer encoder, transformer decoder, and prediction heads (§3.2). The encoder applies standard multi-head self-attention + FFN over all $HW$ spatial tokens; fixed sinusoidal positional encodings are added to the input of every self-attention layer (§3.2.2). The novel architectural element is the transformer decoder with object queries: the queries are $N = 100$ learned positional embeddings, added to the attention inputs of every decoder layer, that distinguish the otherwise identical output slots. Each query attends to all encoder tokens via cross-attention and to all other queries via self-attention. Decoding is non-autoregressive: all $N$ queries are transformed in parallel across 6 decoder layers (§3.2.3). Auxiliary decoding losses are applied after every decoder layer, with the prediction FFN weights shared across layers (§3.2.5).

The DETR decoder layer in PyTorch:

import torch.nn as nn


class DETRDecoderLayer(nn.Module):
    """One decoder layer: object queries self-attend, then cross-attend to
    encoder features. Sec. 3.2.3 of DETR (Carion et al. 2020).
    """

    def __init__(self, d: int = 256, n_heads: int = 8, mlp: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, mlp), nn.ReLU(), nn.Linear(mlp, d))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d) for _ in range(3))

    def forward(self, queries, query_pos, memory, memory_pos):
        # queries: [B, N, d]; memory: [B, HW, d]
        q = queries + query_pos
        queries = queries + self.self_attn(q, q, queries)[0]
        queries = self.n1(queries)
        q = queries + query_pos
        k = memory + memory_pos
        queries = queries + self.cross_attn(q, k, memory)[0]
        queries = self.n2(queries)
        queries = queries + self.mlp(queries)
        return self.n3(queries)

Each decoder output embedding is passed independently to two prediction heads: a shared 3-layer MLP (hidden dimension $d = 256$, ReLU activations) predicts a normalized bounding box $(c_x, c_y, w, h)$ via a sigmoid output, and a linear layer with softmax predicts the class label over $K + 1$ classes including the special "no-object" class $\varnothing$ for unmatched slots (§3.2.4).
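
To make the output format concrete, here is a minimal sketch of how the raw head outputs could be turned into detections at inference time. It is an illustration under stated assumptions, not the reference post-processing: the function name `postprocess`, the convention that the last class index is the no-object class, and the `min_score` threshold of 0.7 are all hypothetical choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(class_logits, boxes, img_w, img_h, min_score=0.7):
    """Convert N slot outputs into (label, score, x0, y0, x1, y1) tuples in pixels.

    class_logits: N lists of K+1 logits (assumed: last index = no-object).
    boxes: N normalized (cx, cy, w, h) tuples in [0, 1].
    """
    detections = []
    for logits, (cx, cy, w, h) in zip(class_logits, boxes):
        probs = softmax(logits)
        no_object = len(probs) - 1
        label = max(range(len(probs)), key=probs.__getitem__)
        if label == no_object or probs[label] < min_score:
            continue  # empty slot or low-confidence prediction
        detections.append((
            label, probs[label],
            (cx - w / 2) * img_w, (cy - h / 2) * img_h,   # top-left corner
            (cx + w / 2) * img_w, (cy + h / 2) * img_h,   # bottom-right corner
        ))
    return detections


# Two slots, K = 2 object classes: slot 0 is a confident class-0 box, slot 1 is empty.
dets = postprocess([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]],
                   [(0.5, 0.5, 0.5, 0.5), (0.1, 0.1, 0.1, 0.1)],
                   img_w=640, img_h=480)
print(dets)
```

No non-maximum suppression step appears here: each slot is kept or dropped on its own score.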

Training. Dataset: COCO 2017 object detection. The loss has two parts — first, the Hungarian algorithm finds the optimal permutation matching predictions to ground-truth objects; then a per-matched-pair loss is applied:

Definition
Bipartite-matching set-prediction loss

Given $N$ predictions and the ground-truth set padded to size $N$ with $\varnothing$, find the permutation $\hat{\sigma}$ minimising a pairwise matching cost; then apply a per-matched-pair Hungarian loss combining negative-log class probability with L1 and GIoU box terms.

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\operatorname{arg\,min}} \sum_{i=1}^{N} \mathcal{L}_{\rm match}\!\left(y_i,\, \hat{y}_{\sigma(i)}\right)$$

$$\mathcal{L}_{\rm Hungarian} = \sum_{i=1}^{N} \Bigl[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{c_i \neq \varnothing}\left( \lambda_{\rm L1} \|b_i - \hat{b}_{\hat{\sigma}(i)}\|_1 + \lambda_{\rm GIoU}\, \mathcal{L}_{\rm GIoU}(b_i,\, \hat{b}_{\hat{\sigma}(i)}) \right) \Bigr]$$
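
The two steps can be sketched in plain Python for a toy case. Several things here are assumptions rather than content from this page: the matching cost is taken to be the negative class probability plus the box terms for real objects and zero for padded slots, $\mathcal{L}_{\rm GIoU}$ is taken as $1 - \mathrm{GIoU}$, the default weights of 5 and 2 are illustrative, and the names `giou`, `box_loss` and `hungarian_loss` are hypothetical. The search is brute force over all permutations, which is only viable for tiny $N$; a practical implementation would use a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def to_corners(box):
    """(cx, cy, w, h) -> (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(a, b):
    """Generalized IoU of two non-degenerate (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = to_corners(a)
    bx0, by0, bx1, by1 = to_corners(b)
    inter = (max(0.0, min(ax1, bx1) - max(ax0, bx0))
             * max(0.0, min(ay1, by1) - max(ay0, by0)))
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def box_loss(gt_box, pred_box, l1_weight, giou_weight):
    """Weighted L1 + (1 - GIoU) box term."""
    l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    return l1_weight * l1 + giou_weight * (1.0 - giou(gt_box, pred_box))


def hungarian_loss(targets, preds, no_object, l1_weight=5.0, giou_weight=2.0):
    """targets: list of (class_index, box); preds: list of (class_probs, box).

    Returns (sigma, loss), where sigma[i] is the prediction matched to target i.
    """
    n = len(preds)
    padded = targets + [(no_object, None)] * (n - len(targets))

    def match_cost(i, j):
        cls, box = padded[i]
        if cls == no_object:          # padded slots cost nothing to match
            return 0.0
        probs, pred_box = preds[j]
        return -probs[cls] + box_loss(box, pred_box, l1_weight, giou_weight)

    # Step 1: optimal permutation (brute force over N! candidates).
    sigma = min(permutations(range(n)),
                key=lambda s: sum(match_cost(i, s[i]) for i in range(n)))

    # Step 2: per-pair loss under that permutation.
    loss = 0.0
    for i, j in enumerate(sigma):
        cls, box = padded[i]
        probs, pred_box = preds[j]
        loss += -math.log(probs[cls])
        if cls != no_object:
            loss += box_loss(box, pred_box, l1_weight, giou_weight)
    return sigma, loss


# One ground-truth object of class 0, two prediction slots, class 1 = no-object.
preds = [([0.2, 0.8], (0.1, 0.1, 0.1, 0.1)), ([0.9, 0.1], (0.5, 0.5, 0.2, 0.2))]
sigma, loss = hungarian_loss([(0, (0.5, 0.5, 0.2, 0.2))], preds, no_object=1)
print(sigma, loss)
```

In the toy case the ground-truth object is assigned to the second slot, whose box coincides with it, and the first slot is matched to the padded no-object target, so only its class term contributes.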

Optimiser: AdamW with transformer LR $10^{-4}$, backbone LR $10^{-5}$, weight decay $10^{-4}$. Batch size 64 across 16 V100 GPUs. The default schedule trains for 300 epochs with a ×0.1 LR drop at epoch 200 (~3 days on 16 V100s); an extended 500-epoch schedule drops at epoch 400 and adds ~1.5 AP. Augmentation: random horizontal flip, random scale (shortest side 480–800 px, longest $\leq$ 1333 px), random crop (+1 AP). Dropout 0.1 in all transformer components (§4.0.2). Headline metric on COCO val: DETR-R50 reaches 42.0 AP at 500 epochs, matching Faster R-CNN-R50-FPN+ at 42.0 AP (Table 1, §4.1).
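
The two-group learning-rate schedule can be written down as a small helper. This is only a sketch of the step schedule described above: the function name `learning_rates` and the 0-indexed epoch convention (the drop takes effect from epoch index `drop_epoch` onward) are assumptions.

```python
def learning_rates(epoch, total_epochs=300, drop_epoch=200,
                   transformer_lr=1e-4, backbone_lr=1e-5, gamma=0.1):
    """Learning rate of each parameter group at a given 0-indexed epoch."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch outside the training schedule")
    # Single step decay: multiply both groups by gamma after drop_epoch epochs.
    scale = gamma if epoch >= drop_epoch else 1.0
    return {"transformer": transformer_lr * scale,
            "backbone": backbone_lr * scale}


# Default 300-epoch schedule, before and after the drop.
print(learning_rates(0))
print(learning_rates(250))
# Extended 500-epoch schedule with the drop at epoch 400.
print(learning_rates(450, total_epochs=500, drop_epoch=400))
```

Keeping the backbone at a tenth of the transformer rate throughout means the drop preserves the ratio between the two groups.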

Complexity. DETR-R50: 41 M params, 86 GFLOPs, 28 FPS. DETR-R101: 60 M params, 152 GFLOPs, 20 FPS. DETR-DC5 (dilated C5, stride 16 instead of 32): 41 M params, 187 GFLOPs, 12 FPS. Halving the stride doubles the feature-map resolution along each axis and so quadruples the spatial token count; this incurs ~16× encoder self-attention cost and ~2× total FLOPs, and raises small-object AP by ~2 points (Table 1, §4.0.2).
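
The scaling behind the DC5 numbers is simple arithmetic, sketched below. The 800×1216 input size is a hypothetical example chosen to divide evenly by both strides, and `encoder_attention_stats` is an illustrative helper that counts token pairs only; it is not a FLOP counter.

```python
def encoder_attention_stats(h0, w0, stride):
    """Token count HW and the (HW)^2 pairwise-attention count for one image."""
    h, w = h0 // stride, w0 // stride   # assumes h0 and w0 are divisible by stride
    tokens = h * w
    return tokens, tokens ** 2


base_tokens, base_pairs = encoder_attention_stats(800, 1216, stride=32)
dc5_tokens, dc5_pairs = encoder_attention_stats(800, 1216, stride=16)

print(base_tokens, dc5_tokens)          # 950 3800
print(dc5_tokens / base_tokens)         # 4.0
print(dc5_pairs / base_pairs)           # 16.0
```

Total FLOPs grow by only about 2× rather than 16× because self-attention is just one part of the model's overall compute.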

Implementations

Official PyTorch release from Facebook AI; the reference repository ships pretrained checkpoints for DETR-R50, DETR-R101, DETR-DC5, and the COCO panoptic head.

Assessment

Novelty.

  • Bipartite-matching set-prediction loss eliminates NMS at inference: each prediction is supervised only by its uniquely assigned ground-truth instance via the Hungarian algorithm, so the model learns to produce diverse, non-duplicate predictions without any post-processing. This replaces the anchor-matching and NMS pipeline of Faster R-CNN, Mask R-CNN, and the one-stage detector family (YOLO, SSD, RetinaNet).
  • Transformer encoder-decoder applied to detection as a complete end-to-end pipeline, not merely as an attention module bolted onto CNN features. The decoder uses $N = 100$ learnable object queries decoded non-autoregressively (all queries in parallel), a structural departure from autoregressive transformers used in language modelling.
  • Zero spatial hyperparameters: no anchors, no region proposals, no IoU thresholds, no scale or aspect-ratio priors — the object queries learn implicit spatial distributions from data (§3.1).
  • Natural extension to panoptic segmentation via a small segmentation head trained on top of frozen DETR (§4.4): the same set-prediction interface accommodates instance + stuff segmentation without architectural redesign.

Strengths.

  • COCO val AP: DETR-R50 42.0 (500 ep, 86 GFLOPs) vs Faster R-CNN-R50-FPN+ 42.0 (180 GFLOPs) — comparable accuracy at 2.1× fewer FLOPs (Table 1).
  • Large-object AP: DETR-R50 $\mathrm{AP}_{\rm L}$ 61.1 vs Faster R-CNN-R50-FPN+ 53.4, a 7.7-point absolute lead, attributed to global encoder self-attention where every spatial token attends to every other (§4.1).
  • Conceptual simplicity: the detector is fully end-to-end differentiable from image pixels to (class, box) set, with no anchor boxes, no NMS, no RPN, and no IoU-threshold hyperparameters.
  • Direct extensibility to other set-prediction tasks (panoptic segmentation, instance segmentation, 3D detection) by replacing the per-query prediction head, with no changes to the encoder or decoder.

Limitations.

  • Slow training convergence: DETR requires 300–500 epochs vs 12–36 for Faster R-CNN (§4.0.2, §4.1), roughly an order of magnitude more training epochs to reach comparable accuracy. Deformable DETR (2020) was motivated specifically by this deficit and achieves competitive AP in ~50 epochs via sparse attention.
  • Small-object AP underperforms Faster R-CNN: DETR-R50 $\mathrm{AP}_{\rm S}$ 20.5 vs Faster R-CNN-FPN+ 26.6, a 6.1-point absolute deficit (Table 1). The stride-32 feature map gives small objects very few spatial tokens in the encoder; DETR-DC5 partially closes the gap ($\mathrm{AP}_{\rm S}$ 22.5) at the cost of doubled compute.
  • Quadratic encoder self-attention cost $O((HW)^2)$: DETR-DC5's stride-16 backbone quadruples the token count, producing 16× higher encoder attention cost and 2× total FLOPs, making very-high-resolution inputs impractical without windowed-attention variants (§4.0.2).
  • Fixed cap of $N = 100$ predictions per image: images with more than $N$ objects will silently miss detections. COCO has up to 63 instances per image (§4.0.1), leaving the current cap adequate for that dataset but not for dense-scene applications.

References

  1. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. End-to-End Object Detection with Transformers. ECCV, 2020. arxiv
  2. He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. CVPR, 2016. arxiv

Compared with

  • Faster R-CNN

    DETR vs Faster R-CNN is the headline detection comparison; DETR removes hand-designed RPN + anchor boxes + NMS in favour of bipartite matching + transformer decoder at comparable COCO AP.

Feeds into

  • SAM

    SAM's mask decoder two-way cross-attention is inspired by DETR's transformer decoder; SAM 3's concept detector is explicitly DETR-based.

  • RF-DETR

    RF-DETR is a DETR-family set-prediction detector; built on the DETR paradigm via its parents LW-DETR/Deformable-DETR.