Motivation
Assign a semantic class label to every pixel of an arbitrary-resolution RGB image in a single forward pass, without region proposals, superpixels, or post-processing. Input: RGB image. Output: per-pixel probability distribution over semantic classes at the full input resolution (K = 21 for PASCAL VOC, 40 for NYUDv2, 33+3 for SIFT Flow). The defining contribution is the fully-convolutional reinterpretation of pretrained ImageNet classifiers — replacing all fully-connected layers with equivalent 1×1 convolutions — combined with a skip-fusion deconvolution architecture (FCN-8s) that recovers spatial detail lost by pooling. This supersedes patch-CNN sliding-window segmenters (Farabet et al. 2013; Pinheiro & Collobert 2014) and superpixel-plus-classifier pipelines, replacing them with a single end-to-end trained network that operates on whole images.
Architecture
Family & shape. Encoder-decoder CNN derived from VGG-16 (16 parameter layers, 134 M parameters) by replacing its three fully-connected layers with 1×1 convolutions, yielding a fully convolutional network that accepts inputs of arbitrary spatial size. Input shape: (RGB, arbitrary resolution). Output shape: (per-pixel class scores at input resolution, for PASCAL VOC). Three skip-fusion variants exist — FCN-32s, FCN-16s, FCN-8s — differing in which intermediate pooling layers are tapped before the final deconvolution to full resolution.
Blocks. The architecture has three structural components.
(1) Fully-convolutional reinterpretation. VGG-16's fc6 and fc7 layers (originally 4096-unit fully-connected) are replaced by 1×1 convolutions of output widths 4096 and 4096 respectively, operating over the full spatial feature map. The 1000-way ImageNet classifier is replaced by a 1×1 convolution with 21 output channels (PASCAL VOC). This substitution is exact: a fully-connected layer over a fixed-size input is mathematically equivalent to a 1×1 convolution covering the full feature map (Section 3.1, Figure 2). After VGG-16's five max-pooling stages (each with stride 2), the total stride is 32 — a 640×480 input yields a 20×15 score map before upsampling.
(2) Deconvolution upsampling. Upsampling by factor is implemented as a backwards strided convolution ("deconvolution") with output stride . All deconvolution filters are initialised to bilinear interpolation weights; intermediate deconvolutions are fine-tuned, and the final deconvolution layer may be frozen or learned (Section 4.3, Dense Prediction paragraph).
(3) Skip fusion (FCN-8s). Rather than upsampling 32× from the final conv7 score map alone, intermediate pool layers are tapped via 1×1 convolutions that produce class score maps at their native strides. These score maps are element-wise summed (not concatenated — max fusion was found to impede learning) before the final deconvolution. Pool4 (stride 16) and pool3 (stride 8) 1×1 prediction layers are zero-initialised; the learning rate is decreased 100× when each skip branch is added (Section 4.2).
The FCN-8s skip-fusion forward pass in PyTorch pseudocode, corresponding line-by-line to Figure 3 of the paper:
def fcn8s_forward(x, vgg, score_fr, score_pool4, score_pool3,
upscore2, upscore_pool4, upscore8):
# Encoder: VGG-16 convolutionalised, total stride 32
pool3 = vgg.pool3(x) # H/8 x W/8 x 256
pool4 = vgg.pool4(pool3) # H/16 x W/16 x 512
conv7 = vgg.conv7(pool4) # H/32 x W/32 x 4096
# Score maps at each scale (21-channel, 1x1 conv)
s7 = score_fr(conv7) # H/32 x W/32 x 21
u2 = upscore2(s7) # H/16 x W/16 x 21 (2x deconv)
s4 = score_pool4(pool4) # H/16 x W/16 x 21 (zero-init)
f4 = u2 + s4 # skip fusion 1 (element-wise sum)
u4 = upscore_pool4(f4) # H/8 x W/8 x 21 (2x deconv)
s3 = score_pool3(pool3) # H/8 x W/8 x 21 (zero-init)
f3 = u4 + s3 # skip fusion 2
out = upscore8(f3) # H x W x 21 (8x deconv)
return out
flowchart LR
X["RGB input<br/>H x W x 3"] --> P3["pool3<br/>H/8 x W/8 x 256"]
P3 --> P4["pool4<br/>H/16 x W/16 x 512"]
P4 --> C7["conv7<br/>H/32 x W/32 x 4096"]
C7 --> S7["score_fr<br/>H/32 x W/32 x 21"]
S7 --> U2["2x deconv"]
P4 --> SP4["score_pool4<br/>(zero-init)"]
U2 --> F4["+ sum"]
SP4 --> F4
F4 --> U4["2x deconv"]
P3 --> SP3["score_pool3<br/>(zero-init)"]
U4 --> F3["+ sum"]
SP3 --> F3
F3 --> OUT["8x deconv<br/>H x W x 21"]
Per-pixel multinomial logistic loss, summed over all unmasked spatial output positions. For a prediction at pixel and ground-truth label :
where is the set of unmasked (labelled) spatial positions. Whole-image training computes this loss over the full image in a single forward pass; Section 3.4 and Figure 5 of the paper show it is equivalent in expectation to patchwise training but faster in wall clock.
Training. Dataset: PASCAL VOC 2011 segmentation training set — 1112 images native, extended to 8498 training images with Hariharan et al. extra annotations (Section 4.3). Loss: per-pixel multinomial logistic (softmax cross-entropy), summed over unmasked positions. Whole-image training with no patch sampling. SGD, momentum 0.9, weight decay (or ), learning rate for FCN-VGG16 (Section 4.3). Minibatch: 20 images. Staged fine-tuning: FCN-32s first (~3 days on a single Tesla K40c), then FCN-16s (~1 day), then FCN-8s (~1 day). Bilinear initialisation for all deconvolution layers; skip-stream 1×1 prediction layers zero-initialised; learning rate decreased 100× when each skip stage is added.
Headline metrics: PASCAL VOC 2011 test mean IU = 62.7 (FCN-8s, Table 3), a 20% relative gain over the prior SOTA SDS at 52.6 (Table 3). PASCAL VOC 2012 test mean IU = 62.2 (FCN-8s, Table 3). NYUDv2 RGB-HHA late fusion mean IU = 34.0 (FCN-16s, Table 4). SIFT Flow geometric pixel accuracy = 94.3, mean IU = 39.5, pixel accuracy = 85.2 (FCN-16s, Table 5). Inference time approximately 175 ms per image on Tesla K40c (Table 3, abstract).
Complexity. 134 M parameters for FCN-VGG16 (Table 1). Receptive field 404 pixels (Table 1). Forward pass ~210 ms for a 500×500 input on Tesla K40c (Table 1). Five 2×-stride max-pool stages yield total stride 32.
Implementations
Official Caffe release (no LICENSE file at pinned commit; BSD-2-Clause declared in the repository README, inherited from Caffe); torchvision ships a community FCN-ResNet50/101 port that swaps VGG-16 for ResNet — a substantive backbone deviation from the paper's reported design.
Assessment
Novelty.
- Fully-convolutional reinterpretation of fc layers as 1×1 convolutions, enabling transfer of any pretrained ImageNet classifier (AlexNet, VGG-16, GoogLeNet) to dense prediction without architectural retraining (contrast: Pinheiro & Collobert 2014 and Farabet et al. 2013 operate on fixed-size patches with sliding-window inference).
- Skip architecture (FCN-8s) that fuses deep coarse-semantic features (conv7, stride 32) with shallow fine-spatial features (pool3, stride 8) via element-wise sum on per-class score maps — framed as a "deep jet" by analogy to Koenderink–van Doorn's feature jet (Section 4.2).
- End-to-end per-pixel training on whole images, shown equivalent in expectation to patchwise training but faster in wall clock (Section 3.4, Figure 5).
- Learnable deconvolution layers initialised to bilinear upsampling rather than fixed bilinear or hand-crafted unpooling, enabling the network to refine spatial layout during fine-tuning.
Strengths.
- 20% relative improvement over the prior SOTA on PASCAL VOC 2011 test (FCN-8s 62.7 vs SDS 52.6, Table 3).
- ~286× faster than SDS overall at inference (Table 3 footnote / Section 5); ~114× faster on the convnet-only component.
- Single forward pass — no region proposals, no superpixels, no post-processing CRF.
- General recipe: benchmarked on AlexNet (mean IU 39.8 on VOC 2011 val, Table 1), VGG-16 (59.4 with extra data, Section 4.1), and GoogLeNet (mean IU 42.5 on VOC 2011 val, Table 1); foundational for U-Net, DeepLab, SegNet, and the dense-prediction literature broadly.
Limitations.
- Coarse boundary resolution: even FCN-8s upsamples from 1/8-resolution score maps; fine object boundaries are blurred. DeepLab (Chen et al. 2015, dilated convolutions + dense CRF) and U-Net (Ronneberger et al. 2015, symmetric encoder-decoder) were required to recover this detail.
- Very thin or small structures whose spatial support falls below the stride-8 effective resolution tend to be classified as background.
- Class confusion in homogeneous regions (sky vs. water, road vs. sidewalk) where discriminative cues require context that the receptive field spans but the network conflates — explicitly noted as a failure case in Section 5, Figure 6.
- Backbone-sensitivity: FCN-AlexNet achieves only 39.8 mean IU on VOC 2011 val vs FCN-VGG16's 56.0 (Table 1); headline numbers are not portable across backbones. The torchvision community port swaps VGG-16 for ResNet-50/101, making published accuracy figures from the paper non-reproducible from that codebase.
- Reference implementation carries no
LICENSEfile at the pinned commit (shelhamer/fcn.berkeleyvision.orgat1305c73); BSD-2-Clause is declared in the README only, inherited from Caffe. Practitioners should retrieve the README at the pinned SHA before deploying.
References
- J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation. IEEE CVPR, 2015. arXiv.4038
- J. Long, E. Shelhamer, T. Darrell. Fully Convolutional Networks for Semantic Segmentation (extended journal version). IEEE TPAMI, 2016. arXiv.06211
- K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015. arXiv.1556 (VGG-16 backbone.)
- O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI, 2015. arXiv.04597 (Immediate extension: symmetric encoder-decoder with dense skip connections.)
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR, 2015. arXiv.7062 (DeepLab: dilated convolutions + CRF post-processing to recover boundary precision.)