Motivation
Input is an RGB image at variable resolution; output is a set of axis-aligned bounding boxes with class scores, and optionally binary instance masks, produced without non-maximum suppression. A single training run on a labeled target dataset produces a weight-sharing supernet from which any of 6,468 sub-network configurations can be evaluated at inference, tracing a continuous accuracy-latency Pareto frontier without per-configuration retraining. The operating point (image resolution, patch size, decoder depth, query count, window count) is chosen at inference time, and each choice trades latency against accuracy.
Architecture
Family & shape. Hybrid: a DINOv2 self-supervised ViT backbone (ViT-S or ViT-B) feeding an LW-DETR-derived transformer encoder-decoder with learned object queries. Input: RGB image at variable resolution. Output: up to N (class, box) pairs, where N is the query count selected at inference without retraining; end-to-end set prediction with no NMS. Released as a size family from nano to 2x-large.
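To make the output contract above concrete, here is a minimal sketch of keeping only the most confident query predictions and returning them directly, with no NMS pass. The `Detection` type, the `select_top_queries` function, and the example scores are hypothetical names and values chosen for illustration; the released model's post-processing may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """One query's prediction: class id, confidence, and an axis-aligned box."""
    class_id: int
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def select_top_queries(predictions: List[Detection], num_queries: int) -> List[Detection]:
    """Keep the `num_queries` most confident predictions, highest score first.

    No overlap-based suppression is applied: each query is treated as one
    member of the predicted set.
    """
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    ranked = sorted(predictions, key=lambda d: d.score, reverse=True)
    return ranked[:num_queries]


# Hypothetical per-query predictions (not real model output).
predictions = [
    Detection(0, 0.91, (10.0, 20.0, 50.0, 80.0)),
    Detection(2, 0.15, (0.0, 0.0, 5.0, 5.0)),
    Detection(1, 0.67, (30.0, 40.0, 90.0, 120.0)),
]
kept = select_top_queries(predictions, num_queries=2)
```

Asking for more queries than exist simply returns every prediction, which is why the output is described as "up to" a given count.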
Blocks. The central contribution is end-to-end weight-sharing NAS. One base network is trained once on the target dataset. At each training step a sub-network configuration is sampled uniformly at random from the search space and updated, so thousands of sub-networks are trained jointly without separate retraining per configuration. The approach is inspired by OFA (Cai et al., 2019) but is the first application of the idea to detection and segmentation. After training, 6,468 configurations are grid-searched on the validation set to trace the accuracy-latency Pareto frontier with no additional training. Five knobs are tunable:
- Image resolution: 11 values from 320 to 960 px; positional embeddings are pre-allocated and interpolated.
- ViT patch size: 7 values; FlexiViT-style bilinear interpolation of patch embeddings handles unseen sizes at inference.
- Number of decoder layers: up to 6; each layer is independently supervised during training, so the decoder can be truncated at inference.
- Number of query tokens: lowest-confidence queries are dropped at inference without retraining.
- Number of windowed attention blocks per encoder layer: the choice correlates with the spatial density of the target dataset.
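The sampling and grid-search steps over these five knobs can be sketched as follows. This is a toy illustration, not the released search space: only "11 resolutions from 320 to 960 px" and "up to 6 decoder layers" come from the text. The even 64 px spacing, the patch sizes, the query counts, the window counts, and all identifiers (`SubnetConfig`, `SEARCH_SPACE`, `sample_config`, `enumerate_configs`) are assumptions, so the grid here does not contain 6,468 entries.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class SubnetConfig:
    """One point in the five-knob search space."""
    resolution: int
    patch_size: int
    decoder_layers: int
    num_queries: int
    num_windows: int


# Placeholder search space; see the lead-in for which values are assumed.
SEARCH_SPACE: Dict[str, List[int]] = {
    "resolution": list(range(320, 961, 64)),   # 11 values; even spacing is assumed
    "patch_size": [12, 16, 20],                # hypothetical values
    "decoder_layers": [1, 2, 3, 4, 5, 6],      # "up to 6"; lower bound is assumed
    "num_queries": [100, 200, 300],            # hypothetical values
    "num_windows": [1, 2, 4],                  # hypothetical values
}


def sample_config(rng: random.Random) -> SubnetConfig:
    """Draw each knob independently and uniformly (one draw per training step)."""
    return SubnetConfig(**{name: rng.choice(values) for name, values in SEARCH_SPACE.items()})


def enumerate_configs() -> Iterator[SubnetConfig]:
    """Yield every configuration in the grid, for the post-training search."""
    names = list(SEARCH_SPACE)
    for combo in itertools.product(*(SEARCH_SPACE[name] for name in names)):
        yield SubnetConfig(**dict(zip(names, combo)))


rng = random.Random(0)
step_config = sample_config(rng)   # configuration to train at this step
grid = list(enumerate_configs())   # configurations to evaluate after training
```

Sampling each knob independently and uniformly is equivalent to sampling uniformly over the full product grid, which is why the same `SEARCH_SPACE` table can drive both the training-time draw and the exhaustive evaluation.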
Architecture augmentation — the diversity of configurations sampled during training — acts as a regularizer, improving generalisation to out-of-distribution datasets. RF-DETR replaces LW-DETR's CAEv2 backbone with DINOv2 and uses a layer-norm (not batch-norm) multi-scale projector for consumer-GPU training compatibility.
Training. Backbone: DINOv2 ViT-S or ViT-B, fine-tuned on COCO or RF100-VL. Learning rate changed relative to LW-DETR; per-layer multiplicative backbone decay 0.8; gradient clipping at 0.1; cosine schedule dropped (scheduler-free); minimal augmentation (horizontal flip + random crop). Headline result: RF-DETR (nano) achieves 48.0 AP on COCO at 2.3 ms (T4, TensorRT, FP-16), exceeding D-FINE (nano) by 5.3 AP at comparable latency (2.1 ms).
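A minimal sketch of two of the hyperparameters above, under stated assumptions: the 0.8 per-layer decay is interpreted as a layer-wise learning-rate multiplier (one common convention, in which the last backbone layer keeps the base rate), and the 0.1 clipping value is interpreted as a maximum gradient L2 norm. Both interpretations, the function names, and the example base rate are assumptions rather than details confirmed by the text.

```python
import math
from typing import List


def layerwise_learning_rates(base_lr: float, num_layers: int, decay: float = 0.8) -> List[float]:
    """Return one learning rate per backbone layer; index 0 is closest to the input.

    The last layer gets `base_lr`; each earlier layer is scaled by one more
    factor of `decay`, so early (more generic) layers move more slowly.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def clip_gradient_norm(grads: List[float], max_norm: float = 0.1) -> List[float]:
    """Rescale a flat gradient vector so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Example with a hypothetical base rate of 1.0 and a 3-layer backbone.
rates = layerwise_learning_rates(base_lr=1.0, num_layers=3)
clipped = clip_gradient_norm([3.0, 4.0])
```

With a decay of 0.8, a layer that sits k blocks below the top of the backbone trains at 0.8^k times the base rate, so the earliest layers of a deep backbone change only slightly during fine-tuning.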
Complexity. RF-DETR (nano): 30.5M parameters, 2.3 ms on T4 with TensorRT FP-16; RF-DETR (2x-large): 126.9M parameters, 17.2 ms, 60.1 AP on COCO — the first real-time detector to exceed 60 AP on COCO.
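Given a (latency, AP) pair for each evaluated configuration, the accuracy-latency Pareto frontier can be extracted with a single sorted pass. The sketch below is one straightforward way to do this; the `Measurement` type, the `pareto_frontier` function, and the example numbers are hypothetical and are not results from the paper.

```python
from typing import List, NamedTuple


class Measurement(NamedTuple):
    """Latency and accuracy of one evaluated configuration."""
    name: str
    latency_ms: float
    ap: float


def pareto_frontier(points: List[Measurement]) -> List[Measurement]:
    """Return the non-dominated points, ordered by increasing latency.

    A point is dropped if another point is at least as fast and at least as
    accurate; of exact duplicates, only the first is kept.
    """
    frontier: List[Measurement] = []
    best_ap = float("-inf")
    # Fastest first; at equal latency, the most accurate point comes first.
    for point in sorted(points, key=lambda m: (m.latency_ms, -m.ap)):
        if point.ap > best_ap:
            frontier.append(point)
            best_ap = point.ap
    return frontier


# Hypothetical validation measurements for five configurations.
measurements = [
    Measurement("cfg-a", 2.0, 40.0),
    Measurement("cfg-b", 3.0, 38.0),   # slower and less accurate than cfg-a
    Measurement("cfg-c", 3.5, 47.0),
    Measurement("cfg-d", 6.0, 46.0),   # slower and less accurate than cfg-c
    Measurement("cfg-e", 9.0, 52.0),
]
frontier = pareto_frontier(measurements)
```

Choosing an operating point then amounts to picking the frontier entry with the highest AP that fits a latency budget on the target hardware.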
Implementations
Official Roboflow PyTorch release; N/S/M/L weights under Apache-2.0, XL/2XL weights under the proprietary PML 1.0 license.
Assessment
Novelty.
- First end-to-end weight-sharing NAS applied to object detection and instance segmentation, adapting OFA-style joint sub-network training to the DETR domain.
- Reframes DETR inference as a choice over five independently tunable knobs (resolution, patch size, decoder depth, query count, window count), tracing the full Pareto frontier from a single training run.
- Replaces LW-DETR's CAEv2 backbone with DINOv2, inheriting internet-scale self-supervised features.
Strengths.
- RF-DETR (nano) achieves 48.0 AP on COCO at 2.3 ms, a 5.3 AP margin over D-FINE (nano) at matched latency.
- RF-DETR (2x-large) reaches 60.1 AP on COCO — the first real-time detector to surpass 60 AP.
- RF-DETR (2XL, fine-tuned) achieves 63.5 AP on RF100-VL, a 1.2 AP improvement over GroundingDINO (tiny) at approximately 20× lower latency (15.6 ms vs. 309.9 ms).
- Robust to FP-16 quantisation across model sizes, where D-FINE degrades from 55.1 AP to 0.5 AP under naive FP-16 conversion.
Limitations.
- RF-DETR-XL and RF-DETR-2XL weights (the models exceeding 60 AP) are released under PML 1.0, a proprietary license; only the N/S/M/L weights are Apache-2.0.
- Latency and Pareto results are measured on NVIDIA T4 with TensorRT FP-16; figures are not transferable to CPU or mobile targets.
- Closed-vocabulary specialist detector: open-vocabulary or zero-shot detection requires GroundingDINO or comparable vision-language models at roughly 20× higher latency.
- Requires a pre-trained DINOv2 checkpoint (ViT-S or ViT-B) as initialisation; performance degrades substantially when trained from scratch on small datasets.
References
- I. Robinson, P. Robicheaux, M. Popov, D. Ramanan, N. Peri. RF-DETR: Neural Architecture Search for Real-Time Detection Transformers. arXiv, 2025. arXiv.09554
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-End Object Detection with Transformers. ECCV, 2020. arXiv.12872
- A. Dosovitskiy, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021. arXiv.11929
flowchart LR
A["Train supernet once<br/>random sub-net per step<br/>(OFA-style)"] --> B["Grid-search 6,468 configs<br/>on val (no retraining)"]
B --> C["Accuracy-latency<br/>Pareto frontier"]
C --> D["Pick operating point<br/>at inference"]