Motivation
YOLO takes a full RGB image and produces, in a single CNN forward pass, a set of bounding boxes each paired with class scores, with no region-proposal stage and no per-window classifier. Input: a 448×448 RGB image. Output: a 7×7×30 tensor encoding, for each cell of a 7×7 spatial grid, two bounding boxes (each as center offset, width, height, and a confidence score) and 20 conditional class probabilities shared across both boxes. The defining property is the regression framing: detection is not a cascade of proposal and classification steps but a direct mapping from pixels to a per-cell prediction tensor, which enables real-time throughput without test-time proposal generation.
Architecture
Family & shape. Single-stage CNN. Input: 448×448 RGB for detection, 224×224 for ImageNet pretraining. Output: 7×7×30 tensor for VOC (S=7, B=2, C=20, giving 7×7×(2·5+20)=7×7×30). Backbone is GoogLeNet-inspired (§2.1, Figure 3); it does not use Inception modules but follows the same idea, alternating 1×1 channel-reduction layers with 3×3 spatial convolutions.
Blocks. 24 convolutional layers followed by 2 fully connected layers; alternating 1×1 reduction layers precede 3×3 conv layers throughout the backbone (§2.1). All layers except the final output use leaky ReLU:
\phi(x) = \begin{cases} x & \text{if } x > 0 \\ 0.1x & \text{otherwise} \end{cases} \tag{eq. 2}
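As a minimal NumPy sketch of eq. 2 (the function name `leaky_relu` and the `slope` parameter are illustrative and not taken from the paper or from Darknet):

```python
import numpy as np

def leaky_relu(x: np.ndarray, slope: float = 0.1) -> np.ndarray:
    """Leaky ReLU as in eq. 2: identity for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)
```

Negative inputs are scaled by 0.1 rather than zeroed, so units with negative pre-activations still pass a small gradient.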
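The final layer uses a linear activation. At test time, the class-specific confidence score of each box is the product of the cell's conditional class probabilities and the box confidence (§2, eq. 1):
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} \tag{eq. 1}
The YOLO head decode for a single cell in Python: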
import numpy as np

def decode_yolo_cell(raw: np.ndarray, cell_row: int, cell_col: int,
                     S: int = 7, B: int = 2, C: int = 20):
    """Decode one cell from the 7×7×30 YOLO output tensor.

    raw: (S, S, B*5 + C) network output; training targets are normalised to [0, 1].
    Returns a list of (x_img, y_img, w_img, h_img, class_conf[C]), one entry per box.
    """
    results = []
    # Conditional class probabilities, shared by all B boxes of the cell: shape (C,)
    class_probs = raw[cell_row, cell_col, B * 5:]
    for b in range(B):
        tx, ty, tw, th, conf = raw[cell_row, cell_col, b * 5: b * 5 + 5]
        # (tx, ty) are offsets from the cell's top-left corner, in units of one cell
        x_img = (cell_col + tx) / S
        y_img = (cell_row + ty) / S
        # (tw, th) are already relative to the full image
        w_img = tw
        h_img = th
        # Class-specific confidence: Pr(Class_i) × IOU (paper §2, eq. 1), shape (C,)
        class_conf = class_probs * conf
        results.append((x_img, y_img, w_img, h_img, class_conf))
    return results
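Training. The first 20 conv layers are pretrained on ImageNet at 224×224 (top-5 accuracy 88%, comparable to GoogLeNet, §2.2); for detection, four convolutional and two fully connected layers with randomly initialized weights are added and the whole network is trained at 448×448. The loss is a multi-part sum-squared error over all S×S grid cells and their box predictors (§2.2, eq. 3):
\begin{aligned}
\mathcal{L} = {} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned} \tag{eq. 3}
\mathbb{1}_{ij}^{\text{obj}} is 1 when cell i's j-th box is responsible for a ground-truth object, \mathbb{1}_{ij}^{\text{noobj}} marks box predictors with no object assigned, and \mathbb{1}_{i}^{\text{obj}} is 1 when any object center falls in cell i.
The weights are \lambda_{\text{coord}} = 5 and \lambda_{\text{noobj}} = 0.5. Width and height are parametrized as \sqrt{w} and \sqrt{h} to reduce the imbalance between large and small boxes: the same absolute deviation then costs more on a small box than on a large one (§2.2, eq. 3).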
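The five-part loss of eq. 3 (coordinates, sizes, object confidence, no-object confidence, class probabilities) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the Darknet implementation: the function name `yolo_loss`, the `resp` mask argument, and the channel layout (B blocks of x, y, w, h, confidence followed by C class probabilities) are hypothetical; the no-object indicator is taken to be the complement of the responsibility mask; and predicted widths and heights are clipped at zero before the square root because the linear output layer can produce negative values. The defaults 5 and 0.5 are the paper's λ_coord and λ_noobj.

```python
import numpy as np

def yolo_loss(pred: np.ndarray, target: np.ndarray, resp: np.ndarray,
              lambda_coord: float = 5.0, lambda_noobj: float = 0.5,
              B: int = 2) -> float:
    """Sum-squared-error loss in the shape of eq. 3 (illustrative sketch).

    pred, target: (S, S, B*5 + C); box slots hold (x, y, w, h, confidence).
    resp: (S, S, B) boolean, True where box j of cell i is responsible.
    Target w, h are assumed non-negative.
    """
    S = pred.shape[0]
    p_box = pred[..., :B * 5].reshape(S, S, B, 5)
    t_box = target[..., :B * 5].reshape(S, S, B, 5)

    obj = resp.astype(pred.dtype)                    # indicator 1^obj_ij
    noobj = 1.0 - obj                                # assumed complement, 1^noobj_ij
    cell_obj = resp.any(axis=-1).astype(pred.dtype)  # indicator 1^obj_i

    # Squared error on box centres
    xy = ((p_box[..., 0:2] - t_box[..., 0:2]) ** 2).sum(axis=-1)
    # Squared error on sqrt(w), sqrt(h); predictions clipped at 0 (assumption)
    p_wh = np.sqrt(np.clip(p_box[..., 2:4], 0.0, None))
    wh = ((p_wh - np.sqrt(t_box[..., 2:4])) ** 2).sum(axis=-1)
    # Squared error on confidence, used by both the obj and noobj terms
    conf = (p_box[..., 4] - t_box[..., 4]) ** 2
    # Squared error on class probabilities, one value per cell
    cls = ((pred[..., B * 5:] - target[..., B * 5:]) ** 2).sum(axis=-1)

    return float(lambda_coord * (obj * (xy + wh)).sum()
                 + (obj * conf).sum()
                 + lambda_noobj * (noobj * conf).sum()
                 + (cell_obj * cls).sum())
```

Choosing which box is "responsible" (the predictor with the highest IOU against the ground truth) happens before this function is called and is not shown here.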
Learning rate schedule: warmup 10⁻³→10⁻² over the first epochs, then 10⁻² for 75 epochs, 10⁻³ for 30 epochs, 10⁻⁴ for 30 epochs (§2.2). Batch 64, momentum 0.9, weight decay 5×10⁻⁴; dropout 0.5 after the first FC layer; data augmentation: random scaling and translations up to ±20%, exposure and saturation jitter ×1.5 in HSV. Headline: YOLO 63.4% mAP at 45 fps, Fast YOLO 52.7% mAP at 155 fps on VOC 2007 test (Table 1).
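The learning-rate schedule can be written as a small piecewise function. The sketch below makes several assumptions that the source does not fix: the warmup length (`warmup_epochs`, a hypothetical parameter, since the source only says "the first epochs"), a linear ramp shape, and counting the warmup inside the 75 epochs at 10⁻² so that the phases sum to 135 epochs.

```python
def yolo_learning_rate(epoch: float, warmup_epochs: float = 5.0) -> float:
    """Piecewise learning-rate schedule (illustrative sketch).

    warmup_epochs is a hypothetical value; the linear ramp is an assumption.
    """
    if epoch < warmup_epochs:
        # Ramp from 1e-3 up to 1e-2 during warmup
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2   # main phase
    if epoch < 105:
        return 1e-3   # next 30 epochs
    return 1e-4       # final 30 epochs
```

Starting directly at 10⁻² would correspond to `warmup_epochs` near zero; the gradual ramp is what the schedule above is meant to illustrate.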
Complexity. The network has 24 conv layers + 2 FC layers; the output is a 7×7×30 tensor per image. At test time the model produces 98 candidate bounding boxes per image (S×S×B = 7×7×2, §2.3). No parameter count or FLOPs figure is reported in the paper.
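The 98-box count can be checked by decoding the whole output tensor at once. The following vectorised sketch assumes the same channel layout as the per-cell decoder above (B blocks of x, y, w, h, confidence, then C class probabilities); the function name `decode_all` is illustrative.

```python
import numpy as np

def decode_all(raw: np.ndarray, S: int = 7, B: int = 2, C: int = 20):
    """Decode every box of an (S, S, B*5 + C) output tensor.

    Returns:
      xywh:   (S*S*B, 4) box centre and size, relative to the full image
      scores: (S*S*B, C) class-specific confidences (class prob × box confidence)
    """
    boxes = raw[..., :B * 5].reshape(S, S, B, 5)
    class_probs = raw[..., B * 5:]                       # (S, S, C)
    rows, cols = np.meshgrid(np.arange(S), np.arange(S), indexing="ij")
    # Cell offsets -> image-relative centre coordinates, shape (S, S, B)
    x = (cols[..., None] + boxes[..., 0]) / S
    y = (rows[..., None] + boxes[..., 1]) / S
    w, h = boxes[..., 2], boxes[..., 3]
    xywh = np.stack([x, y, w, h], axis=-1).reshape(-1, 4)
    # Broadcast (S, S, B, 1) × (S, S, 1, C) -> (S, S, B, C)
    scores = (boxes[..., 4:5] * class_probs[:, :, None, :]).reshape(-1, C)
    return xywh, scores
```

With the default S=7 and B=2, both returned arrays have 98 rows, one per candidate box; row index `(row * S + col) * B + b` corresponds to box `b` of cell `(row, col)`.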
Implementations
Official Darknet (C/CUDA) release by the paper authors; a widely-used PyTorch community port exists for research use.
Assessment
Novelty.
- Replaces the DPM sliding-window pipeline — hand-crafted deformable templates scored independently at each location — with a single CNN regression over the full image, eliminating the multi-step proposal-and-classify structure (§3).
- Replaces the R-CNN / Fast R-CNN multi-stage pipeline — external proposal generator, per-region warped feature extraction, SVM classifier — with one forward pass that shares features across all boxes and all classes simultaneously (§3).
- Global image reasoning: each prediction is computed from features of the entire image rather than a cropped proposal window, reducing background false positives relative to Fast R-CNN (§4.2).
Strengths.
- Real-time throughput: 45 fps (base YOLO) and 155 fps (Fast YOLO) on Titan X, versus 7 fps for Faster R-CNN VGG-16 and 0.5 fps for Fast R-CNN (Table 1).
- Fewer background false positives than Fast R-CNN: localization is YOLO's dominant error type, while Fast R-CNN makes almost 3× more background errors (§4.2).
- Generalizes across domains: YOLO degrades less than R-CNN on artwork benchmarks (Picasso dataset, People-Art dataset), suggesting that global regression is less reliant on photographic-image statistics (§4.5).
- Ensemble synergy: using YOLO to rescore Fast R-CNN detections raises Fast R-CNN's VOC 2007 mAP from 71.8% to 75.0% (§4.3), demonstrating complementary error modes.
Limitations.
- Localization is the dominant error source (§4.2); in contrast to proposal-based detectors, YOLO sacrifices per-box coordinate precision for throughput.
- Small objects appearing in groups are poorly handled: the 7×7 grid imposes a hard constraint of at most B=2 detections per cell, so dense clusters (e.g. flocks of birds) exceed the grid's representational capacity (§2.4).
- The coarse 7×7 spatial grid (effective stride 64 px on 448×448 input) limits localization precision for small-scale objects and cannot assign different classes to two objects whose centers land in the same cell (§2.4).
- VOC 2012 mAP is 57.9%, notably below the 63.4% on VOC 2007; small-object categories such as bottle, sheep, and tv/monitor are 8–10% below R-CNN or Feature Edit (§4.4).
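The grid-capacity limitation can be made concrete by computing which cell owns an object center. This is a minimal sketch; the function name `responsible_cell` and the pixel-coordinate inputs are illustrative choices, not part of the paper.

```python
def responsible_cell(cx_px: float, cy_px: float,
                     img_size: int = 448, S: int = 7):
    """Return (row, col) of the grid cell containing an object centre in pixels."""
    stride = img_size / S  # 64 px per cell for a 448×448 input and S = 7
    # Clamp so a centre on the right or bottom image edge stays in the last cell
    col = min(int(cx_px // stride), S - 1)
    row = min(int(cy_px // stride), S - 1)
    return row, col
```

For example, two small objects centered at (100, 100) and (120, 110) pixels both map to cell (1, 1), so they compete for the same two box predictors and the same single set of class probabilities.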
References
- Redmon, Divvala, Girshick, Farhadi. You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016. arXiv.02640
- Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich. Going deeper with convolutions. CVPR 2015. arXiv.4842
- Simonyan, Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv.1556
- Ren, He, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS 2015. arXiv.01497
- Felzenszwalb, Girshick, McAllester, Ramanan. Object Detection with Discriminatively Trained Part-Based Models. IEEE TPAMI, 2010. paper