Motivation
SegFormer takes an RGB image of any resolution as input and produces a per-pixel semantic mask, assigning one of $N_{cls}$ category labels to each pixel. The defining property is a hierarchical Mix Transformer (MiT) encoder that generates multi-scale features at $\{1/4, 1/8, 1/16, 1/32\}$ of the input resolution without positional encodings, paired with a lightweight all-MLP decoder that requires no ASPP, OCR, or other context modules. Prior Transformer-based segmentation (SETR) used a plain ViT backbone, yielding only single-scale features at low resolution and requiring ImageNet-22K pretraining; SegFormer replaces both with a four-stage hierarchy pretrained on ImageNet-1K. Models range from MiT-B0 through MiT-B5 and are evaluated on ADE20K, Cityscapes, and COCO-Stuff.
Architecture
Family & shape. ViT-family hierarchical encoder (Mix Transformer, MiT) combined with an all-MLP decoder. Input: RGB image of size $H \times W \times 3$; no resolution constraint at inference. Output: per-pixel category logits at $\frac{H}{4} \times \frac{W}{4} \times N_{cls}$, upsampled to full resolution. Backbone: one of MiT-B0 through MiT-B5.
Blocks. Three named blocks distinguish MiT from plain ViT.
Overlapping patch merging (Sec. 3.1.1) uses a strided convolution with kernel $K$, stride $S$, and padding $P$: at stage 1, $K = 7$, $S = 4$, $P = 3$; at stages 2–4, $K = 3$, $S = 2$, $P = 1$. The overlap preserves local continuity that ViT's non-overlapping patchification discards.
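As a quick check on these settings, the standard convolution output-size formula $\lfloor (n + 2P - K)/S \rfloor + 1$ can be applied stage by stage. The sketch below assumes the paper's values ($K = 7$, $S = 4$, $P = 3$ at stage 1 and $K = 3$, $S = 2$, $P = 1$ at stages 2–4) and a hypothetical 512-pixel input side; the helper names are illustrative and not taken from the official code.

```python
from typing import List

# (kernel K, stride S, padding P) per stage; values assumed from the paper.
PATCH_MERGE_CFG = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]


def conv_out_size(n: int, k: int, s: int, p: int) -> int:
    """Output size of a strided convolution along one spatial axis."""
    return (n + 2 * p - k) // s + 1


def stage_sizes(side: int) -> List[int]:
    """Feature-map side length after each of the four patch-merging layers."""
    sizes = []
    for k, s, p in PATCH_MERGE_CFG:
        side = conv_out_size(side, k, s, p)
        sizes.append(side)
    return sizes


print(stage_sizes(512))  # [128, 64, 32, 16]
```

With these settings the overlapping kernels still yield exactly 1/4, 1/8, 1/16, and 1/32 of the input side, the same sizes a non-overlapping patchification with the same strides would give.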
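Efficient self-attention (Eq. 2, Sec. 3.1) reduces the key sequence from shape $N \times C$ to $\frac{N}{R} \times C$ before computing attention:

$$\hat{K} = \mathrm{Reshape}\left(\tfrac{N}{R},\, C \cdot R\right)(K), \qquad K = \mathrm{Linear}(C \cdot R,\, C)(\hat{K})$$

This cuts self-attention complexity from $O(N^2)$ to $O(N^2/R)$. Per-stage reduction ratios are $[64, 16, 4, 1]$ from stage 1 to stage 4.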
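The effect of the reduction ratio is easiest to see as plain arithmetic. The following sketch counts query tokens $N$, reduced key/value tokens $N/R$, and attention-score entries per stage. It assumes stage strides of 4, 8, 16, 32 and reduction ratios of 64, 16, 4, 1 as reported in the paper, plus a hypothetical $512 \times 512$ input; `attention_sizes` is an illustrative helper, not part of any library.

```python
STRIDES = [4, 8, 16, 32]  # cumulative downsampling per stage (assumed)
RATIOS = [64, 16, 4, 1]   # per-stage reduction ratio R (assumed from the paper)


def attention_sizes(height: int, width: int):
    """Return (N, N // R, score-matrix entries) for each of the four stages."""
    rows = []
    for stride, r in zip(STRIDES, RATIOS):
        n = (height // stride) * (width // stride)  # query tokens
        n_kv = n // r                               # key/value tokens after reduction
        rows.append((n, n_kv, n * n_kv))            # N * (N / R) instead of N * N
    return rows


for stage, (n, n_kv, entries) in enumerate(attention_sizes(512, 512), start=1):
    print(f"stage {stage}: N={n}, N/R={n_kv}, scores={entries}, full={n * n}")
```

Under these assumptions every stage attends over 256 key/value tokens, so the large stage-1 sequence no longer dominates the attention cost.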
Mix-FFN (Eq. 3, Sec. 3.1) replaces fixed positional encodings with a $3 \times 3$ depthwise convolution inside the feed-forward network:

$$x_{out} = \mathrm{MLP}(\mathrm{GELU}(\mathrm{Conv}_{3 \times 3}(\mathrm{MLP}(x_{in})))) + x_{in}$$

The convolution's zero-padding supplies positional information implicitly, enabling inference at any resolution without interpolating a positional code.
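A toy example can make the zero-padding argument concrete. The sketch below is not the paper's experiment: it applies a fixed all-ones $3 \times 3$ kernel (a stand-in for a learned depthwise filter) to a constant input in pure Python and shows that the output differs between corners, edges, and interior.

```python
def box_filter_zero_pad(grid):
    """3x3 all-ones convolution with zero padding, written out explicitly."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    # Taps that fall outside the map read zeros (the padding).
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += grid[ii][jj]
    return out


constant = [[1.0] * 4 for _ in range(4)]  # input carries no positional signal
for row in box_filter_zero_pad(constant):
    print(row)
```

Corners sum 4 inputs, edges 6, and interior positions 9, so a position-free input produces a position-dependent output. That border cue is the kind of signal later layers can use in place of an explicit positional encoding.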
The Mix-FFN block in PyTorch:
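```python
import torch
from torch import nn


class MixFFN(nn.Module):
    """Mix-FFN (Eq. 3): MLP -> 3x3 depthwise conv -> GELU -> MLP, plus residual."""

    def __init__(self, c: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(c, hidden)
        # groups=hidden makes the 3x3 convolution depthwise; padding=1 keeps H x W.
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, c)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        h = self.fc1(x)                             # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)  # tokens -> (B, hidden, H, W)
        h = self.dwconv(h)
        h = h.flatten(2).transpose(1, 2)            # back to (B, N, hidden)
        h = self.act(h)
        h = self.fc2(h)
        return x + h                                # residual term of Eq. 3
```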
The all-MLP decoder (Sec. 3.2, Eq. 4a–4d) operates in four steps. First, per-stage linear projections unify each feature map $F_i$ from its native channels $C_i$ to a shared width $C$: $\hat{F}_i = \mathrm{Linear}(C_i, C)(F_i)$. Second, all maps are bilinearly upsampled to $\frac{H}{4} \times \frac{W}{4}$. Third, a fusion linear layer projects the concatenated $4C$-channel map back to $C$: $F = \mathrm{Linear}(4C, C)(\mathrm{Concat}(\hat{F}_i))$. Fourth, a classifier linear layer produces the final mask: $M = \mathrm{Linear}(C, N_{cls})(F)$. The decoder channel width $C$ is 256 for B0 and B1, and 768 for B2–B5. The MLP decoder suffices because SegFormer's stage-4 Transformer blocks already produce non-local attention covering the full image (Fig. 3), rendering ASPP and similar context modules redundant.
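The four decoder steps can be traced as shape bookkeeping. The sketch below is a minimal illustration under stated assumptions: MiT-B0 stage channels of 32, 64, 160, 256, decoder width $C = 256$, 150 classes (ADE20K), and a hypothetical $512 \times 512$ input. It counts only the linear layers named in Eq. 4, each with a bias; a real implementation may add normalization or drop biases, so the parameter count is indicative only. Function names are hypothetical.

```python
from typing import List

# Assumed MiT-B0 settings: stage channels C_1..C_4 and cumulative strides.
B0_CHANNELS = [32, 64, 160, 256]
STRIDES = [4, 8, 16, 32]


def decoder_trace(h: int, w: int, channels: List[int], c: int, n_cls: int):
    """Shapes (channels, height, width) after each of the four decoder steps."""
    inputs = [(ci, h // s, w // s) for ci, s in zip(channels, STRIDES)]
    projected = [(c, hi, wi) for _, hi, wi in inputs]    # step 1: Linear(C_i, C)
    upsampled = [(c, h // 4, w // 4) for _ in projected]  # step 2: to H/4 x W/4
    fused = (c, h // 4, w // 4)                           # step 3: Linear(4C, C)
    mask = (n_cls, h // 4, w // 4)                        # step 4: Linear(C, N_cls)
    return inputs, projected, upsampled, fused, mask


def decoder_linear_params(channels: List[int], c: int, n_cls: int) -> int:
    """Weights plus biases of the linear layers in Eq. 4 only."""
    proj = sum(ci * c + c for ci in channels)
    fuse = len(channels) * c * c + c
    cls = c * n_cls + n_cls
    return proj + fuse + cls


inputs, projected, upsampled, fused, mask = decoder_trace(512, 512, B0_CHANNELS, 256, 150)
print(inputs)  # per-stage encoder outputs
print(mask)    # logits at H/4 x W/4
print(decoder_linear_params(B0_CHANNELS, 256, 150))
```

Under these assumptions the decoder's linear layers total roughly 0.43M parameters, which illustrates why the head is described as lightweight.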
Training. Datasets: ADE20K (150 categories), Cityscapes (19 categories), COCO-Stuff (172 labels). Loss: standard cross-entropy on per-pixel logits. Optimizer: AdamW with initial learning rate $6 \times 10^{-5}$, polynomial decay (power 1.0), weight decay 0.01. Schedule: 160K iterations on ADE20K and Cityscapes, batch size 16 (ADE20K, COCO-Stuff) or 8 (Cityscapes). MiT encoder pretrained on ImageNet-1K. Augmentation: random resize ratio 0.5–2.0, horizontal flip, random crop $512 \times 512$ (ADE20K, COCO-Stuff) or $1024 \times 1024$ (Cityscapes). Headline results: ADE20K val mIoU 51.8% (B5, multi-scale; Table 2); Cityscapes val mIoU 84.0% (B5, multi-scale; Table 2); on Cityscapes-C, B5 outperforms DeepLabV3+ variants by up to 588% relative improvement on Gaussian noise (Table 5).
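The polynomial schedule is simple enough to write out. The sketch below assumes the paper's initial learning rate of $6 \times 10^{-5}$ and the 160K-iteration schedule. It omits warmup and any minimum learning rate, which training frameworks commonly add but which are not specified here. `poly_lr` is a hypothetical helper, not a library function.

```python
def poly_lr(step: int, base_lr: float = 6e-5, max_steps: int = 160_000,
            power: float = 1.0) -> float:
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power."""
    step = min(max(step, 0), max_steps)  # clamp so the rate never goes negative
    return base_lr * (1.0 - step / max_steps) ** power


for step in (0, 40_000, 80_000, 160_000):
    print(step, poly_lr(step))
```

With power 1.0 the schedule is a straight line from the initial rate down to zero at the final iteration, so the rate at the halfway point is half the initial value.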
Complexity. Variants span B0 (3.8M params, 8.4G FLOPs at $512 \times 512$) through B5 (84.7M params, 183.3G FLOPs at $640 \times 640$), per Table 2.
Implementations
Official NVlabs PyTorch release, with widely-used community ports in HuggingFace Transformers and MMSegmentation.
Assessment
Novelty.
- Replaces ViT's plain single-scale encoder with a four-stage MiT that produces multi-scale features at $\{1/4, 1/8, 1/16, 1/32\}$ of the input resolution; SETR, built on ViT, retained a single scale and required a correspondingly heavy decoder.
- Eliminates positional encodings entirely: Mix-FFN's depthwise convolution supplies positional information via zero-padding, so the model accepts arbitrary input resolutions at inference with far less accuracy degradation. With PE, switching the Cityscapes inference resolution from $768 \times 768$ to $1024 \times 2048$ drops mIoU by 3.3 points (Table 1c); Mix-FFN reduces this to 0.7 points.
- All-MLP decoder replaces ASPP, OCR, and UPerNet-style context modules, relying on the encoder's already-global stage-4 effective receptive field (Fig. 3).
- Efficient self-attention with per-stage reduction ratio $R$ (Eq. 2) cuts dense-prediction attention cost from $O(N^2)$ to $O(N^2/R)$.
Strengths.
- Strong accuracy across scales: B0 reaches 37.4% mIoU on ADE20K with 3.8M parameters (Table 2); B5 reaches 51.8% mIoU (Table 2).
- Zero-shot robustness: on Cityscapes-C, B5 shows up to 588% relative improvement over DeepLabV3+ variants on Gaussian noise (Table 5), plausibly because the encoder's attention-pooled features smooth over local corruption patterns.
- Resolution flexibility: with no positional encoding, inference runs at any resolution without interpolating a positional code and with only a small accuracy penalty (0.7 mIoU in Table 1c).
- Adopted as a backbone in downstream tasks — SegFormer-B0 and B3 are explicit Segmentor backbone variants in FocalClick (2022).
Limitations.
- The official NVlabs/SegFormer code and NVIDIA-hosted pretrained weights are released under the NVIDIA Source Code License — research and evaluation use only. Commercial deployment requires reimplementing and retraining from scratch; the HuggingFace Transformers and MMSegmentation Apache-2.0 ports do not change the weights' license.
- Efficient self-attention with $R = 64$ at stage 1 collapses 64 tokens into one for keys and values, a lossy approximation that can hurt very fine-grained boundary prediction at high input resolution.
- Per-pixel classification (versus the mask-classification paradigm introduced by MaskFormer and Mask2Former) limits the architecture to semantic segmentation; instance and panoptic segmentation require architectural changes.
References
- Xie, Wang, Yu, Anandkumar, Alvarez, Luo. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS, 2021. arXiv:2105.15203
- Dosovitskiy et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR, 2021. arXiv:2010.11929
- Long, Shelhamer, Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015. arXiv:1411.4038
- Chen, Papandreou, Schroff, Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint, 2017. arXiv:1706.05587