RF-DETR | VitaVision
RF-DETR

5 min read · Advanced · hybrid · 30.5M (nano), 126.9M (2x-large)
Based on
RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
Robinson, Robicheaux, Popov, Ramanan et al. · arXiv (2025)

Motivation

RF-DETR takes RGB images at variable resolution and outputs axis-aligned bounding boxes with class scores, and optionally binary instance masks, with no non-maximum suppression. A single training run on a labeled target dataset produces a weight-sharing supernet from which any of 6,468 sub-network configurations can be evaluated at inference, tracing a continuous accuracy-latency Pareto frontier without per-configuration retraining. The operating point (image resolution, patch size, decoder depth, query count, window count) is chosen at inference time, and each choice yields a different latency and accuracy.
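To make the accuracy-latency trade-off concrete, the following is a minimal sketch of how a Pareto frontier could be extracted from measured (latency, AP) pairs and how an operating point could then be picked under a latency budget. The `Point` type, the function names, and the configuration labels are hypothetical, and the numbers are placeholders chosen for illustration, not measurements from the paper.

```python
from typing import NamedTuple, Optional


class Point(NamedTuple):
    name: str          # hypothetical configuration label
    latency_ms: float  # lower is better
    ap: float          # higher is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep the configurations that no other configuration beats on both axes."""
    frontier: list[Point] = []
    best_ap = float("-inf")
    # Sweep in order of increasing latency (ties: higher AP first) and keep
    # a point only if it improves on the best AP seen so far.
    for p in sorted(points, key=lambda p: (p.latency_ms, -p.ap)):
        if p.ap > best_ap:
            frontier.append(p)
            best_ap = p.ap
    return frontier


def pick_operating_point(frontier: list[Point], budget_ms: float) -> Optional[Point]:
    """Return the most accurate frontier point that fits the latency budget."""
    feasible = [p for p in frontier if p.latency_ms <= budget_ms]
    return max(feasible, key=lambda p: p.ap) if feasible else None


# Placeholder measurements, NOT results from the paper.
candidates = [
    Point("cfg-a", 2.0, 40.0),
    Point("cfg-b", 3.0, 45.0),
    Point("cfg-c", 3.5, 44.0),  # slower and less accurate than cfg-b
    Point("cfg-d", 6.0, 50.0),
]
frontier = pareto_frontier(candidates)
choice = pick_operating_point(frontier, budget_ms=4.0)
print([p.name for p in frontier], choice)
```

Because the sweep only needs one sort, the same routine stays cheap even when run over thousands of evaluated configurations.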

Architecture

Family & shape. Hybrid: a DINOv2 self-supervised ViT backbone (ViT-S or ViT-B) feeding an LW-DETR-derived transformer encoder-decoder with learned object queries. Input: RGB image at variable resolution. Output: up to $Q$ (class, box) pairs, where $Q \in \{50, 100, 200, 300\}$ is selected at inference without retraining; end-to-end set prediction with no NMS. Released as a size family from nano to 2x-large.
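As an illustration of choosing the query count at inference, here is a minimal sketch that keeps only the highest-confidence entries from a set of scored predictions. The `Detection` type and `keep_top_queries` function are hypothetical, the scores are synthetic, and where in the real model this ranking takes place is not specified in this entry; the sketch simply ranks a flat list.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    score: float                            # confidence of the best class
    label: int                              # class index
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2)


# Query counts listed for RF-DETR in this entry.
ALLOWED_QUERY_COUNTS = (50, 100, 200, 300)


def keep_top_queries(detections: list[Detection], num_queries: int) -> list[Detection]:
    """Keep the num_queries highest-scoring entries, dropping the rest."""
    if num_queries not in ALLOWED_QUERY_COUNTS:
        raise ValueError(f"num_queries must be one of {ALLOWED_QUERY_COUNTS}")
    ranked = sorted(detections, key=lambda d: d.score, reverse=True)
    return ranked[:num_queries]


# Synthetic predictions: 300 entries with evenly spaced scores.
detections = [Detection(i / 300, 0, (0.0, 0.0, 1.0, 1.0)) for i in range(300)]
kept = keep_top_queries(detections, num_queries=50)
print(len(kept), kept[0].score, kept[-1].score)
```

No overlap-based suppression step appears anywhere in this path, which is the practical meaning of NMS-free set prediction: the output set is trimmed by confidence alone.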

Blocks. The central contribution is end-to-end weight-sharing NAS. One base network is fully trained once on the target dataset. At each training step a random sub-network configuration is sampled uniformly from the search space and updated, so thousands of sub-networks are trained jointly without separate retraining per configuration. The approach is inspired by OFA (Cai et al., 2019) and is the first application of it to detection and segmentation. After training, 6,468 configurations are grid-searched on the validation set to trace the accuracy-latency Pareto frontier with no additional training. Five knobs are tunable:

  1. Image resolution: 11 values from 320 to 960 px; positional embeddings are pre-allocated and interpolated.
  2. ViT patch size: 7 values in $\{8, 10, 12, 16, 20, 24, 32\}$; FlexiViT-style bilinear interpolation of patch embeddings handles unseen sizes at inference.
  3. Number of decoder layers: up to 6; each layer is independently supervised during training, so the decoder can be truncated at inference.
  4. Number of query tokens: $\{50, 100, 200, 300\}$; lowest-confidence queries are dropped without retraining.
  5. Number of windowed attention blocks per encoder layer: $\{1, 2, 4\}$; correlates with the spatial density of the target dataset.
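The five knobs can be written down as a small search space from which one configuration is drawn per training step. The sketch below rests on two assumptions that are not stated in this entry: the 11 resolutions are spaced 64 px apart, and decoder depth ranges over 0 to 6 inclusive. Under those assumptions the grid has 11 · 7 · 7 · 4 · 3 = 6,468 points, which matches the stated count. The names `SEARCH_SPACE`, `sample_config`, and `grid_size` are hypothetical, and any divisibility constraints between resolution, patch size, and window count are ignored.

```python
import math
import random

SEARCH_SPACE = {
    "resolution": list(range(320, 961, 64)),  # 11 values; the 64 px step is an assumption
    "patch_size": [8, 10, 12, 16, 20, 24, 32],
    "decoder_layers": list(range(0, 7)),      # assumed 0..6; the text only says "up to 6"
    "num_queries": [50, 100, 200, 300],
    "num_windows": [1, 2, 4],
}


def sample_config(rng: random.Random) -> dict[str, int]:
    """Draw one sub-network configuration uniformly from the grid.

    Sampling each knob independently and uniformly is equivalent to
    sampling uniformly over the full Cartesian product.
    """
    return {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}


def grid_size() -> int:
    """Number of distinct configurations in the Cartesian product."""
    return math.prod(len(values) for values in SEARCH_SPACE.values())


rng = random.Random(0)
for step in range(3):
    config = sample_config(rng)
    # A real training step would run forward and backward passes through the
    # shared weights, using only the sub-network selected by `config`.
    print(step, config)

print(grid_size())
```

After training, iterating over the same dictionary with `itertools.product` would enumerate every configuration for the validation-set grid search.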

Architecture augmentation — the diversity of configurations sampled during training — acts as a regularizer, improving generalisation to out-of-distribution datasets. RF-DETR replaces LW-DETR's CAEv2 backbone with DINOv2 and uses a layer-norm (not batch-norm) multi-scale projector for consumer-GPU training compatibility.

Training. Backbone: DINOv2 ViT-S or ViT-B, fine-tuned on COCO or RF100-VL. Learning rate $1 \times 10^{-4}$ (vs. $4 \times 10^{-4}$ in LW-DETR); per-layer multiplicative backbone decay 0.8; gradient clipping at 0.1; cosine schedule dropped (scheduler-free); minimal augmentation (horizontal flip + random crop). Headline result: RF-DETR (nano) achieves 48.0 AP on COCO at 2.3 ms (T4, TensorRT, FP-16), exceeding D-FINE (nano) by 5.3 AP at comparable latency (2.1 ms).
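One common reading of "per-layer multiplicative backbone decay" is layer-wise learning-rate decay, in which each backbone block closer to the input trains with a smaller learning rate. The sketch below assumes that reading, a 12-block backbone, and the convention that the last block receives the full base rate; all three are assumptions rather than details confirmed by this entry, and `layerwise_learning_rates` is a hypothetical helper.

```python
def layerwise_learning_rates(base_lr: float, decay: float, num_layers: int) -> list[float]:
    """Learning rate per backbone block, index 0 being closest to the input.

    The last block gets base_lr; each earlier block is scaled by one more
    factor of `decay`.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# base_lr and decay come from the text; num_layers=12 is an assumption.
rates = layerwise_learning_rates(base_lr=1e-4, decay=0.8, num_layers=12)
for index, lr in enumerate(rates):
    print(f"block {index:2d}: lr = {lr:.2e}")
```

With a decay of 0.8 over 12 blocks, the earliest block trains at roughly 8.6% of the base rate (0.8^11), so the pretrained low-level features change far more slowly than the layers nearest the detection head.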

Complexity. RF-DETR (nano): 30.5M parameters, 2.3 ms on T4 with TensorRT FP-16; RF-DETR (2x-large): 126.9M parameters, 17.2 ms, 60.1 AP on COCO — the first real-time detector to exceed 60 AP on COCO.

Implementations

Official Roboflow PyTorch release; N/S/M/L weights under Apache-2.0, XL/2XL weights under the proprietary PML 1.0 license.

Assessment

Novelty.

  • First end-to-end weight-sharing NAS applied to object detection and instance segmentation, adapting OFA-style joint sub-network training to the DETR domain.
  • Reframes DETR inference as a choice over five independently tunable knobs (resolution, patch size, decoder depth, query count, window count), tracing the full Pareto frontier from a single training run.
  • Replaces LW-DETR's CAEv2 backbone with DINOv2, inheriting internet-scale self-supervised features.

Strengths.

  • RF-DETR (nano) achieves 48.0 AP on COCO at 2.3 ms, a 5.3 AP margin over D-FINE (nano) at matched latency.
  • RF-DETR (2x-large) reaches 60.1 AP on COCO — the first real-time detector to surpass 60 AP.
  • RF-DETR (2XL, fine-tuned) achieves 63.5 AP on RF100-VL, a 1.2 AP improvement over GroundingDINO (tiny) at approximately 20× lower latency (15.6 ms vs. 309.9 ms).
  • Robust to FP-16 quantisation across model sizes, whereas D-FINE degrades from 55.1 AP to 0.5 AP under naive FP-16 conversion.

Limitations.

  • RF-DETR-XL and RF-DETR-2XL weights (the models exceeding 60 AP) are released under PML 1.0, a proprietary license; only the N/S/M/L weights are Apache-2.0.
  • Latency and Pareto results are measured on NVIDIA T4 with TensorRT FP-16; figures are not transferable to CPU or mobile targets.
  • Closed-vocabulary specialist detector: open-vocabulary or zero-shot detection requires GroundingDINO or comparable vision-language models at roughly 20× higher latency.
  • Requires a pre-trained DINOv2 checkpoint (ViT-S or ViT-B) as initialisation; performance degrades substantially when trained from scratch on small datasets.

References

  1. I. Robinson, P. Robicheaux, M. Popov, D. Ramanan, N. Peri. RF-DETR: Neural Architecture Search for Real-Time Detection Transformers. arXiv, 2025. arXiv
    .09554
  2. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-End Object Detection with Transformers. ECCV, 2020. arXiv
    .12872
  3. A. Dosovitskiy, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021. arXiv
    .11929
flowchart LR
    A["Train supernet once<br/>random sub-net per step<br/>(OFA-style)"] --> B["Grid-search 6,468 configs<br/>on val (no retraining)"]
    B --> C["Accuracy-latency<br/>Pareto frontier"]
    C --> D["Pick operating point<br/>at inference"]

Fed by

  • DETR

    RF-DETR is a DETR-family set-prediction detector; built on the DETR paradigm via its parents LW-DETR/Deformable-DETR.

  • ViT (medium)

    RF-DETR's backbone is a DINOv2 self-supervised ViT.