Motivation
GoogLeNet takes a 224×224 RGB image and produces a 1000-class softmax probability vector. Its defining component is the Inception module: four parallel branches (1×1 convolution, 1×1 reduce followed by 3×3 convolution, 1×1 reduce followed by 5×5 convolution, and 3×3 max-pool followed by 1×1 projection) concatenated on the channel axis, with 1×1 bottleneck convolutions performing cross-channel dimensionality reduction before the larger spatial convolutions. GoogLeNet stacks 22 weight layers and requires approximately 7M parameters, which the paper puts at 12× fewer than AlexNet, while winning ILSVRC-2014 classification at 6.67% top-5 error.
Architecture
Family & shape. CNN. Input 224×224×3 (RGB). Output 1000-way softmax. The stem is a traditional (non-Inception) stack, kept for memory efficiency during training: 7×7/2 conv → 3×3/2 max-pool → 3×3/1 conv → 3×3/2 max-pool. Nine Inception modules follow in three groups (3a–3b, 4a–4e, 5a–5b), with a stride-2 max-pool separating groups 3→4 and 4→5. Global average pooling precedes the single linear + softmax head (§4, §5, Table 1).
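The downsampling path described above can be traced with a few lines of arithmetic. This is a minimal sketch, not the authors' code: it assumes "same"-style padding, so that only the stride changes the spatial size, and the stage names are illustrative labels.

```python
import math

# (stage label, stride) in network order. Kernel sizes follow the stem
# description above; "same"-style padding is an assumption of this sketch,
# so only the stride affects the output resolution.
STAGES = [
    ("conv 7x7/2", 2),
    ("max-pool 3x3/2", 2),
    ("conv 3x3/1", 1),
    ("max-pool 3x3/2", 2),
    ("inception 3a-3b", 1),
    ("max-pool 3x3/2", 2),
    ("inception 4a-4e", 1),
    ("max-pool 3x3/2", 2),
    ("inception 5a-5b", 1),
]


def spatial_sizes(input_size=224):
    """Return (stage label, output side length) for every stage."""
    sizes = []
    size = input_size
    for name, stride in STAGES:
        size = math.ceil(size / stride)
        sizes.append((name, size))
    return sizes


for name, size in spatial_sizes(224):
    print(f"{name:18s} -> {size} x {size}")
# Side lengths in order: 112, 56, 56, 28, 28, 14, 14, 7, 7
```

The stem alone reduces 224×224 to 28×28 before the first Inception module, and global average pooling then collapses the final 7×7 map.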
Blocks. The Inception module's dimension-reduction variant (Figure 2(b)) runs four parallel branches on the same input: (1) direct 1×1 convolution; (2) 1×1 bottleneck → 3×3 convolution; (3) 1×1 bottleneck → 5×5 convolution; (4) 3×3 max-pool → 1×1 projection. All branches are concatenated on the channel axis. The 1×1 bottleneck idea originates from Lin et al. Network-in-Network (2013) and serves a dual purpose: cross-channel dimensionality compression and an additional ReLU non-linearity (§2, §4).
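To make the effect of the bottlenecks concrete, the sketch below counts convolution weights for one module with and without the 1×1 reductions. The branch widths are those of Inception (3a) in Table 1 of the paper (192 input channels); biases are ignored, and the "naive" variant (no reductions, pooling branch passed through unprojected) is a comparison constructed for this sketch.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring biases."""
    return c_in * c_out * k * k


def reduced_module_weights(c_in, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
    """Weights of the dimension-reduction Inception variant."""
    return (
        conv_weights(c_in, ch1x1, 1)
        + conv_weights(c_in, ch3x3red, 1) + conv_weights(ch3x3red, ch3x3, 3)
        + conv_weights(c_in, ch5x5red, 1) + conv_weights(ch5x5red, ch5x5, 5)
        + conv_weights(c_in, pool_proj, 1)
    )


def naive_module_weights(c_in, ch1x1, ch3x3, ch5x5):
    """Weights when 3x3 and 5x5 convolutions see the full input width."""
    return (
        conv_weights(c_in, ch1x1, 1)
        + conv_weights(c_in, ch3x3, 3)
        + conv_weights(c_in, ch5x5, 5)
    )


# Inception (3a): in=192, 1x1=64, 3x3red=96, 3x3=128, 5x5red=16, 5x5=32, proj=32
reduced = reduced_module_weights(192, 64, 96, 128, 16, 32, 32)
naive = naive_module_weights(192, 64, 128, 32)

print(reduced, naive)            # 163328 387072
print(64 + 128 + 32 + 32)        # 256 output channels with pool projection
print(64 + 128 + 32 + 192)       # 416 output channels if pooling is not projected
```

Under these assumptions the reduced module uses well under half the weights, and the pool projection also stops the output width from growing by the full input width at every module.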
Two auxiliary classifiers are branched off at Inception (4a) and (4d) during training only. Each auxiliary classifier applies a 5×5 average-pool with stride 3, then a 1×1 convolution with 128 filters, then FC-1024 with ReLU, then dropout 70%, then FC-1000 softmax. Auxiliary losses are weighted 0.3 during training and discarded at inference (§5).
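The shapes inside an auxiliary head and the weighted training loss can be sketched as follows. This is an illustrative calculation rather than the authors' implementation; it assumes unpadded pooling on a 14×14 feature map and uses the (4a) and (4d) output widths from Table 1 of the paper (512 and 528). The function names are hypothetical.

```python
def pool_out(size, kernel, stride):
    """Output side length of an unpadded pooling window (floor mode)."""
    return (size - kernel) // stride + 1


def aux_head_shapes(channels, size=14):
    """Trace tensor shapes through one auxiliary classifier."""
    pooled = pool_out(size, kernel=5, stride=3)   # 5x5 average-pool, stride 3
    flattened = 128 * pooled * pooled             # after the 1x1 conv with 128 filters
    return {
        "after_pool": (channels, pooled, pooled),
        "after_1x1": (128, pooled, pooled),
        "flattened": flattened,
        "fc": 1024,
        "logits": 1000,
    }


def total_loss(main, aux1, aux2, aux_weight=0.3):
    """Training loss: main loss plus down-weighted auxiliary losses."""
    return main + aux_weight * (aux1 + aux2)


print(aux_head_shapes(512))   # head attached to Inception (4a)
print(aux_head_shapes(528))   # head attached to Inception (4d)
print(total_loss(1.0, 2.0, 3.0))
```

At inference only the main loss path is evaluated, so the two heads add no cost to the deployed model.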
The Inception module's dimension-reduction variant (torchvision port at commit 336d36e):
```python
# torchvision/models/googlenet.py @ 336d36e
import torch
from torch import nn

# BasicConv2d (the default conv_block) is defined elsewhere in the same file.


class Inception(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3,
                 ch5x5red, ch5x5, pool_proj, conv_block=None):
        super().__init__()
        if conv_block is None:
            conv_block = BasicConv2d
        # Branch 1: direct 1x1 convolution.
        self.branch1 = conv_block(in_channels, ch1x1, kernel_size=1)
        # Branch 2: 1x1 reduction followed by 3x3 convolution.
        self.branch2 = nn.Sequential(
            conv_block(in_channels, ch3x3red, kernel_size=1),
            conv_block(ch3x3red, ch3x3, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 reduction followed by the "5x5" convolution.
        self.branch3 = nn.Sequential(
            conv_block(in_channels, ch5x5red, kernel_size=1),
            # NOTE: kernel_size=3 is a known torchvision bug; paper specifies 5×5.
            # See pytorch/vision#906.
            conv_block(ch5x5red, ch5x5, kernel_size=3, padding=1),
        )
        # Branch 4: 3x3 max-pool followed by 1x1 projection.
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True),
            conv_block(in_channels, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        outs = [self.branch1(x), self.branch2(x),
                self.branch3(x), self.branch4(x)]
        # Concatenate the four branches on the channel axis.
        return torch.cat(outs, 1)
```
Global average pooling before the final softmax improves top-1 accuracy by approximately 0.6% over FC-only (§5). The network comprises 22 weight layers (27 including pooling) and approximately 100 building blocks total (§5).
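Because the branches are concatenated on the channel axis, each module's output width is the sum of its four branch widths, and that sum must equal the next module's input width. The sketch below checks this for the nine modules using the widths from Table 1 of the paper, and reproduces the 22-layer count from the table's depth column (1 for the first convolution, 2 for the second, 2 per Inception module, 1 for the linear layer). The helper names are illustrative.

```python
# (name, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)
INCEPTION_CONFIGS = [
    ("3a", 192, 64, 96, 128, 16, 32, 32),
    ("3b", 256, 128, 128, 192, 32, 96, 64),
    ("4a", 480, 192, 96, 208, 16, 48, 64),
    ("4b", 512, 160, 112, 224, 24, 64, 64),
    ("4c", 512, 128, 128, 256, 24, 64, 64),
    ("4d", 512, 112, 144, 288, 32, 64, 64),
    ("4e", 528, 256, 160, 320, 32, 128, 128),
    ("5a", 832, 256, 160, 320, 32, 128, 128),
    ("5b", 832, 384, 192, 384, 48, 128, 128),
]


def out_channels(cfg):
    """Width after concatenating the 1x1, 3x3, 5x5 and pool-projection branches."""
    _, _, ch1x1, _, ch3x3, _, ch5x5, pool_proj = cfg
    return ch1x1 + ch3x3 + ch5x5 + pool_proj


def chain_is_consistent(configs):
    """Each module's input width must equal the previous module's output width."""
    return all(
        out_channels(prev) == nxt[1]
        for prev, nxt in zip(configs, configs[1:])
    )


widths = [out_channels(cfg) for cfg in INCEPTION_CONFIGS]
depth = 1 + 2 + 2 * len(INCEPTION_CONFIGS) + 1

print(widths)                                  # [256, 480, 512, 512, 512, 528, 832, 832, 1024]
print(chain_is_consistent(INCEPTION_CONFIGS))  # True
print(depth)                                   # 22
```

Max-pooling between groups changes the spatial size but not the channel count, which is why the chain check can ignore it.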
Training. Trained on ILSVRC (approximately 1.2 million images, 1000 classes) using the DistBelief distributed system on CPUs. Optimiser: asynchronous SGD with momentum 0.9 and a fixed learning-rate schedule that decreases the rate by 4% every 8 epochs. Polyak averaging of iterates produces the final inference model (§6). Data augmentation: crops sampled with area 8%–100% of the image and aspect ratio in [3/4, 4/3]; photometric distortions; random interpolation method (§6). Test-time evaluation uses 144 crops per image with softmax probabilities averaged across crops (§7). Auxiliary classifier branches contribute loss weight 0.3 at training and are discarded at inference (§5).
Results:
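- ILSVRC 2014 classification: ensemble top-5 error 6.67% (7 models, 144 crops, §7, Table 2/3, first place). Single model top-5 7.9% versus single VGG-16 at 7.0% (VGG paper Table 7). The paper reports a 56.5% relative reduction versus ILSVRC 2012 SuperVision (AlexNet) (§7); that figure is consistent with SuperVision's 15.3% entry trained with external data, whereas against the 16.4% entry without external data the reduction works out to about 59.3%.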
- ILSVRC 2014 detection: ensemble mAP 43.9% (Table 4, first place); single model mAP 38.02% (Table 5).
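Returning to the training recipe, the step-wise learning-rate decay (4% every 8 epochs) and the area/aspect-ratio crop sampling can be sketched in a few lines. This is a hedged illustration, not the DistBelief implementation: the base learning rate is a hypothetical placeholder (the paper does not state one here), uniform sampling of the aspect ratio is an assumption, and the function names are invented for this example.

```python
import math
import random


def learning_rate(epoch, base_lr=0.01, decay=0.96, step=8):
    """Decrease the rate by 4% every `step` epochs (base_lr is hypothetical)."""
    return base_lr * decay ** (epoch // step)


def sample_crop(width, height, rng, max_tries=10):
    """Sample (left, top, w, h) with area in 8%-100% and aspect ratio in [3/4, 4/3]."""
    area = width * height
    for _ in range(max_tries):
        target_area = rng.uniform(0.08, 1.0) * area
        ratio = rng.uniform(3 / 4, 4 / 3)      # width / height, assumed uniform
        w = int(round(math.sqrt(target_area * ratio)))
        h = int(round(math.sqrt(target_area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    # Fall back to the whole image if no valid crop was found.
    return 0, 0, width, height


rng = random.Random(0)
print([round(learning_rate(e), 6) for e in (0, 7, 8, 16)])
print(sample_crop(500, 375, rng))
```

The fallback to the full image is a design choice of this sketch: it keeps the sampler total for very small or very elongated images where a draw can exceed the image bounds.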
Complexity. Approximately 7M parameters; approximately 1.5 billion multiply-adds at inference (§1 budget target).
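The single-pass budget above does not describe the cost of the competition protocol. The sketch below enumerates the 144 test-time views from §7 of the paper (4 scales × 3 squares × 6 crops × 2 mirrors) and multiplies out a rough cost for the 7-model ensemble. It assumes every view costs one full 1.5 billion multiply-add forward pass; the view labels are illustrative names.

```python
from itertools import product

SCALES = (256, 288, 320, 352)                    # shorter side after resizing
SQUARES = ("left", "center", "right")            # top/center/bottom for portraits
CROPS = ("top-left", "top-right", "bottom-left",
         "bottom-right", "center", "resized-square")
MIRRORS = (False, True)

# Every (scale, square, crop, mirror) combination is one 224x224 view.
VIEWS = list(product(SCALES, SQUARES, CROPS, MIRRORS))

MULT_ADDS_PER_PASS = 1_500_000_000               # approximate single-pass budget


def ensemble_mult_adds(n_models=7, n_views=len(VIEWS)):
    """Rough multiply-add count for classifying one image with the ensemble."""
    return n_models * n_views * MULT_ADDS_PER_PASS


print(len(VIEWS))              # 144
print(ensemble_mult_adds())    # 1512000000000
```

Under that assumption the headline ensemble result costs on the order of a thousand single-crop forward passes per image.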
Implementations
No official authors' repository is maintained; the BVLC Caffe Model Zoo replication and the PyTorch torchvision port are the canonical community implementations.
Assessment
Novelty.
- Introduced the Inception module (parallel 1×1, 3×3, and 5×5 convolutions and a 3×3 max-pool, concatenated on the channel axis), replacing the homogeneous block stacking of AlexNet and (concurrently) VGG.
- Established 1×1 convolutions (from Lin et al. Network-in-Network 2013) as cross-channel dimensionality-reduction bottlenecks before expensive 3×3 and 5×5 spatial convolutions, decoupling depth and width growth from quadratic compute growth (§3, §4).
- Replaced AlexNet's and VGG's fully-connected classification head with global average pooling, reducing parameter count and improving top-1 by approximately 0.6% (§5).
- Demonstrated auxiliary classifiers at intermediate Inception layers (4a, 4d) injecting gradient and regularising training of 22-layer networks before batch normalisation existed (§5).
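The parameter saving from global average pooling can be illustrated with a short calculation. This sketch assumes the final feature map is 1024 channels at 7×7 (Table 1 of the paper) and compares the actual head, pooling followed by one linear layer, against a hypothetical head that flattens the map and feeds it straight into the classifier. The hypothetical head is a comparison device only, not AlexNet's or VGG's real classifier.

```python
def global_average_pool(feature_map):
    """Average each channel of a [C][H][W] nested list down to one value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def linear_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


FINAL_CHANNELS, FINAL_SIZE, NUM_CLASSES = 1024, 7, 1000

# Head used by GoogLeNet: pool to 1024 values, then one linear layer.
gap_head_params = linear_params(FINAL_CHANNELS, NUM_CLASSES)

# Hypothetical alternative: flatten the 7x7x1024 map before the linear layer.
flatten_head_params = linear_params(
    FINAL_CHANNELS * FINAL_SIZE * FINAL_SIZE, NUM_CLASSES
)

print(global_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]]))  # [2.5, 2.0]
print(gap_head_params)       # 1025000
print(flatten_head_params)   # 50177000
```

Pooling itself has no parameters, so under these assumptions the head shrinks by a factor of about 49, the number of spatial positions that are averaged away.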
Strengths.
- ILSVRC 2014 classification winner at 6.67% ensemble top-5, which the paper reports as a 56.5% relative reduction over the 2012 SuperVision entry (16.4% top-5 without external data, 15.3% with), achieved with approximately 12× fewer parameters (7M versus ~60M for AlexNet) (§1, §7, Table 2/3).
- ILSVRC 2014 detection winner at 43.9% mAP ensemble / 38.02% mAP single model (§8, Table 4/5).
- Architecturally efficient: approximately 7M parameters fit a 1.5 billion multiply-adds inference budget targeting mobile and embedded deployment (§1), in contrast to VGG-16's 138M parameters.
Limitations.
- Detection submission did not use bounding-box regression "due to lack of time" (§8); R-CNN with regression produces better localisation.
- Poor backbone for dense pixel-level prediction: FCN-GoogLeNet achieves mean IU 42.5 on PASCAL VOC 2011 val versus FCN-VGG16 at 56.0 (FCN Table 1). Aggressive early downsampling (three stride-2 stem operations, an 8× reduction before the first Inception module) and heterogeneous branch widths make the topology hard to repurpose for FCN-style upsampling.
- Single-model classification accuracy trails VGG-16: 7.9% top-5 (single GoogLeNet) versus 7.0% (single VGG-16) on ILSVRC 2014 test (VGG Table 7); the ensemble headline depends on 7 models × 144 crops.
- Training instability before batch normalisation: the auxiliary classifiers exist explicitly to mitigate gradient flow concerns through 22 layers (§5); BN-Inception (2015) and ResNet (2015) superseded this workaround within months.
- Implementation caveats. Torchvision's `Inception` block uses `kernel_size=3` in the 5×5 branch (documented bug, see `pytorch/vision#906`), so it diverges from the paper's Figure 2(b) specification, and the torchvision pretrained weights `googlenet-1378be20.pth` (BSD-3-Clause) are trained independently by maintainers rather than loaded from BVLC. The BVLC Caffe replication's weights (license `unrestricted`) reach 68.7% top-1 / 88.9% top-5 single centre-crop and were trained for 60 epochs with `quick_solver.prototxt` rather than the paper's longer training schedule: a faithful but not identical reproduction.
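The kernel-size divergence in the torchvision branch is easy to miss because it does not change any tensor shape. The sketch below applies the standard convolution output-size formula to both variants and compares their weight counts for the Inception (3a) branch widths (16 reduced channels, 32 output channels, from Table 1 of the paper). Padding 2 for the 5×5 case is the size-preserving padding assumed here, and the function names are illustrative.

```python
def conv_out(size, kernel, padding, stride=1):
    """Standard convolution output side length."""
    return (size + 2 * padding - kernel) // stride + 1


def branch_weights(c_in, c_out, kernel):
    """Weights of the spatial convolution in the branch, ignoring biases."""
    return c_in * c_out * kernel * kernel


# Both variants preserve a 28x28 map, so concatenation works either way.
print(conv_out(28, kernel=3, padding=1))   # 28 (torchvision port)
print(conv_out(28, kernel=5, padding=2))   # 28 (paper's 5x5 branch)

# The weight counts differ, so checkpoints are not interchangeable.
print(branch_weights(16, 32, kernel=3))    # 4608
print(branch_weights(16, 32, kernel=5))    # 12800
```

Since the shapes agree, the difference only surfaces when comparing parameter counts or loading weights trained with the other kernel size.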
References
- Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich. Going deeper with convolutions. CVPR 2015. arXiv:1409.4842
- Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS 2012.
- Simonyan, Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015 (arXiv 2014). arXiv:1409.1556