Motivation
ResNet takes a RGB image as input and produces a softmax probability vector. The defining property is the residual reformulation: rather than fitting a stacked-layer mapping directly, each building block approximates the residual , so the full mapping is realised as . This reformulation resolves the degradation problem that prevents plain networks from improving with additional depth — a problem distinct from vanishing gradients, as plain nets trained with BatchNorm exhibit healthy gradient norms yet still reach worse optima as depth increases beyond approximately 20 layers.
Architecture
Family & shape. CNN. Input RGB; output softmax probabilities. The family covers five depth variants — ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152 — organised in a five-stage layout. ResNet-18 and ResNet-34 use the basic block (two stacked convolutions); ResNet-50, ResNet-101, and ResNet-152 use the bottleneck block ().
Blocks. Each residual unit wraps its layer stack in a shortcut connection. When input and output dimensions match, the identity shortcut is parameter-free (§3.2, Eq. 1):
When channel count or spatial resolution changes between stages, a linear projection is applied via a convolution with stride 2 (§3.2, Eq. 2):
Three shortcut options are evaluated for dimension-changing connections: (A) zero-padding (no extra parameters), (B) projection shortcut only when dimensions change, (C) all shortcuts as projections. Option B is selected as the default — option A leaves zero-padded dimensions without residual learning, and option C approximately doubles complexity for bottleneck architectures (§3.3, Table 3).
For ResNet-50/101/152, the residual function uses a bottleneck design: a convolution reducing channels, then a convolution, then a convolution restoring channels (§3.3, Fig. 5). BatchNorm is applied after each convolution and before the ReLU activation throughout.
The five-stage layer structure, following Table 1 of the paper:
| Layer | Output size | ResNet-18 | ResNet-34 | ResNet-50 | ResNet-101 | ResNet-152 |
|---|---|---|---|---|---|---|
| conv1 | 112×112 | 7×7/2, 64 | 7×7/2, 64 | 7×7/2, 64 | 7×7/2, 64 | 7×7/2, 64 |
| conv2_x | 56×56 | [3×3, 64]×2 | [3×3, 64]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 | [1×1, 64; 3×3, 64; 1×1, 256]×3 |
| conv3_x | 28×28 | [3×3, 128]×2 | [3×3, 128]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×4 | [1×1, 128; 3×3, 128; 1×1, 512]×8 |
| conv4_x | 14×14 | [3×3, 256]×2 | [3×3, 256]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×6 | [1×1, 256; 3×3, 256; 1×1, 1024]×23 | [1×1, 256; 3×3, 256; 1×1, 1024]×36 |
| conv5_x | 7×7 | [3×3, 512]×2 | [3×3, 512]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 | [1×1, 512; 3×3, 512; 1×1, 2048]×3 |
| pool + FC | 1×1, 1000 | avg pool, FC-1000 | avg pool, FC-1000 | avg pool, FC-1000 | avg pool, FC-1000 | avg pool, FC-1000 |
| FLOPs | — | 1.8B | 3.6B | 3.8B | 7.6B | 11.3B |
The bottleneck residual unit in PyTorch (torchvision @ afc54f7):
# torchvision/models/resnet.py @ afc54f7 (BSD-3-Clause)
# torchvision places the stride at conv2 (3×3), not conv1 (1×1) as in the paper.
# This variant is known as ResNet V1.5 and improves accuracy slightly.
class Bottleneck(nn.Module):
expansion: int = 4
def __init__(self, inplanes, planes, stride=1, downsample=None,
groups=1, base_width=64, dilation=1, norm_layer=None):
super().__init__()
if norm_layer is None:
norm_layer = nn.BatchNorm2d
width = int(planes * (base_width / 64.0)) * groups
self.conv1 = conv1x1(inplanes, width)
self.bn1 = norm_layer(width)
self.conv2 = conv3x3(width, width, stride, groups, dilation)
self.bn2 = norm_layer(width)
self.conv3 = conv1x1(width, planes * self.expansion)
self.bn3 = norm_layer(planes * self.expansion)
self.relu = nn.ReLU(inplace=True)
self.downsample = downsample
def forward(self, x):
identity = x
out = self.relu(self.bn1(self.conv1(x)))
out = self.relu(self.bn2(self.conv2(out)))
out = self.bn3(self.conv3(out))
if self.downsample is not None:
identity = self.downsample(x)
out += identity
return self.relu(out)
Training. Trained on ILSVRC ImageNet: 1.28 million training images, 50k validation images, 100k test images, 1000 classes (§3.4). Objective: standard softmax cross-entropy. Optimiser: SGD with momentum , weight decay , mini-batch 256, initial learning rate divided by when error plateaus, up to iterations. Augmentation: random crops from images rescaled with shorter side in , per-pixel mean subtraction, and standard colour augmentation. Weights initialised from with (He 2015 initialisation); no dropout. Headline results:
- Plain-18 top-1 val: 27.94%; plain-34 top-1 val: 28.54% — the 34-layer plain net is worse than the 18-layer plain net, confirming the degradation problem (Table 2).
- ResNet-152 single model: 19.38% top-1 / 4.49% top-5 validation (Table 4).
- 6-model ResNet ensemble: 3.57% top-5 on the ImageNet test set, winning ILSVRC-2015 (Table 5).
- CIFAR-10: ResNet-110 6.43% test error; ResNet-1202 7.93% (Table 6, §4.2).
- Transfer to detection — PASCAL VOC 2007: ResNet-101 76.4% mAP vs VGG-16 73.2% (Table 7); COCO mAP@[.5,.95]: ResNet-101 27.2% vs VGG-16 21.2% (Table 8).
Complexity. ResNet-18: 11.7M parameters, 1.8B FLOPs; ResNet-50: 25.6M parameters, 3.8B FLOPs; ResNet-101: 44.5M parameters, 7.6B FLOPs; ResNet-152: 60.2M parameters, 11.3B FLOPs — all below VGG-16 (15.3B) and VGG-19 (19.6B) (Table 1, torchvision).
Implementations
The original Caffe release accompanies the paper; the PyTorch torchvision port (torchvision.models.resnet{18,34,50,101,152}) is the de-facto practical reference and ships pretrained ImageNet weights.
Assessment
Novelty.
- Identified the degradation problem as an optimization difficulty — not vanishing gradients — by demonstrating that plain nets trained with BatchNorm still degrade with depth (§1, Fig. 1, Fig. 4); AlexNet and VGG share the same class of depth-scaling failure.
- Introduced residual shortcuts as parameter-free identity connections that bypass the optimization barrier, enabling systematic depth scaling from 8 layers (AlexNet) and 19 layers (VGG) to 152 layers.
- Demonstrated that the bottleneck block keeps FLOPs comparable to the shallower basic block (ResNet-50 at 3.8B FLOPs is below VGG-19 at 19.6B), decoupling depth from compute growth — a strictly more efficient path than GoogLeNet's Inception module for equal-depth networks.
- Showed that the residual reformulation generalises: the same shortcut mechanism, without architecture-specific tuning, yields improvements from CIFAR-10 through ImageNet detection and segmentation benchmarks.
Strengths.
- ILSVRC-2015 classification winner: 6-model ensemble 3.57% top-5 on the test set (Table 5) — versus GoogLeNet (ILSVRC-2014) ensemble 6.66% top-5 (Table 5), a 46% relative reduction with a single architectural change (residual shortcuts).
- Depth scaling yields monotone improvement up to 152 layers: ResNet-152 single model 4.49% top-5 val beats BN-Inception 5.81%, PReLU-net 5.71%, and GoogLeNet 7.89% all single-model (Table 4).
- Transfer to detection is directly measurable: replacing VGG-16 with ResNet-101 in Faster R-CNN raises PASCAL VOC 2007 mAP from 73.2% to 76.4% and COCO mAP@[.5,.95] from 21.2% to 27.2% (Tables 7, 8).
- Pretrained torchvision weights are BSD-3-Clause licensed and cover five depth variants with ImageNet-pretrained checkpoints; the architecture supports
replace_stride_with_dilationfor dilated-convolution dense-prediction variants without retraining.
Limitations.
- Depth without dataset scale overfits: the 1202-layer CIFAR-10 model achieves 7.93% test error — worse than the 110-layer model at 6.43% — with training error below 0.1%, indicating that 19.4M parameters exceed the capacity of a 50k-image training set without strong regularisation (Table 6, §4.2).
- Spatial downsampling leaves a final feature map at the deepest stage; fine-grained localisation requires FPN or dilated convolutions to recover spatial resolution for dense prediction tasks.
- ResNet-152 carries 60.2M parameters, comparable in memory footprint to VGG-19 (144M weights but dominated by FC layers); at float32 the backbone alone requires ~230 MB of storage.
- The deepest CIFAR-10 variant (110 layers) requires a learning-rate warm-up: initial LR 0.1 is too large, necessitating a warm-up to LR 0.01 until training error drops below 80% (~400 iterations) before reverting to 0.1 (§4.2, footnote 5).
References
- He, Zhang, Ren, Sun. Deep Residual Learning for Image Recognition. CVPR 2016. arXiv.03385
- Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.
- Simonyan, Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015. arXiv.1556
- Szegedy et al. Going Deeper with Convolutions. CVPR 2015. arXiv.4842
- Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML 2015.