SegFormer

7 min read · Intermediate · vit · 3.8M (B0) to 84.7M (B5) parameters · 8.4 GFLOPs (B0 @ 512×512) to 183.3 GFLOPs (B5 @ 640×640)
Based on
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Xie, Wang, Yu, Anandkumar et al. · NeurIPS 2021 (arXiv 2021)

Motivation

SegFormer takes an RGB image of any resolution as input and produces a per-pixel semantic mask, assigning one of $N_\text{cls}$ category labels to each pixel. The defining property is a hierarchical Mix Transformer (MiT) encoder that generates multi-scale features at $\{1/4, 1/8, 1/16, 1/32\}$ of the input resolution without positional encodings, paired with a lightweight all-MLP decoder that requires no ASPP, OCR, or other context modules. Prior Transformer-based segmentation (SETR) used a plain ViT backbone, yielding only single-scale features at low resolution and requiring ImageNet-22K pretraining; SegFormer replaces both with a four-stage hierarchy pretrained on ImageNet-1K. Models range from MiT-B0 through MiT-B5 and are evaluated on ADE20K, Cityscapes, and COCO-Stuff.

Architecture

Family & shape. ViT-family hierarchical encoder (Mix Transformer, MiT) combined with an all-MLP decoder. Input: $H \times W \times 3$ RGB image; no resolution constraint at inference. Output: per-pixel category logits at $\frac{H}{4} \times \frac{W}{4} \times N_\text{cls}$, upsampled to full resolution. Backbone: one of MiT-B0 through MiT-B5.

Blocks. Three named blocks distinguish MiT from plain ViT.

Overlapping patch merging (Sec. 3.1.1) uses a strided convolution with kernel $K$, stride $S$, padding $P$: at stage 1, $K=7$, $S=4$, $P=3$; at stages 2–4, $K=3$, $S=2$, $P=1$. The overlap preserves local continuity that ViT's non-overlapping patchification discards.
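As a quick check of the feature-map sizes these settings produce, the standard convolution output-size formula can be applied stage by stage. This is a minimal sketch; the helper names `conv_out` and `stage_sizes` are illustrative and do not come from the paper or any library.

```python
def conv_out(size: int, k: int, s: int, p: int) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * p - k) // s + 1


# (K, S, P) for stages 1 to 4, as listed in the text above
STAGES = [(7, 4, 3), (3, 2, 1), (3, 2, 1), (3, 2, 1)]


def stage_sizes(size: int) -> list[int]:
    """Side length of the feature map after each of the four stages."""
    sizes = []
    for k, s, p in STAGES:
        size = conv_out(size, k, s, p)
        sizes.append(size)
    return sizes


print(stage_sizes(512))  # [128, 64, 32, 16]
```

For a 512-pixel side this gives 1/4, 1/8, 1/16, and 1/32 of the input, matching the stated resolutions.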

Efficient self-attention (Eq. 2, Sec. 3.1) reduces the key sequence from shape $N \times C$ to $\frac{N}{R} \times C$ before computing attention:

$$\hat{K} = \text{Reshape}\!\left(\frac{N}{R},\, C \cdot R\right)(K), \quad K = \text{Linear}(C \cdot R,\, C)(\hat{K})$$

This cuts self-attention complexity from $O(N^2)$ to $O(N^2/R)$. Per-stage reduction ratios are $R = [64, 16, 4, 1]$.
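The reduction can be sketched as a single-head PyTorch module that follows the reshape-then-linear equation literally. This is a sketch under stated assumptions, not the official implementation: the class name `EfficientSelfAttention` is hypothetical, only one head is used, the same reduced sequence feeds both keys and values, and the reshape groups $R$ consecutive tokens in flattened order rather than spatial 2D blocks.

```python
import torch
from torch import nn


class EfficientSelfAttention(nn.Module):
    """Single-head sketch of sequence-reduced attention (hypothetical name)."""

    def __init__(self, c: int, r: int):
        super().__init__()
        self.r = r
        self.q = nn.Linear(c, c)
        self.reduce = nn.Linear(c * r, c)  # Linear(C*R, C) from the equation
        self.kv = nn.Linear(c, 2 * c)      # assumption: keys and values share the reduced input
        self.proj = nn.Linear(c, c)
        self.scale = c ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        assert N % self.r == 0, "sequence length must be divisible by R"
        q = self.q(x)                                   # (B, N, C)
        red = x.reshape(B, N // self.r, C * self.r)     # Reshape(N/R, C*R)
        red = self.reduce(red)                          # (B, N/R, C)
        k, v = self.kv(red).chunk(2, dim=-1)            # each (B, N/R, C)
        attn = (q @ k.transpose(1, 2)) * self.scale     # (B, N, N/R) score matrix
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)                      # (B, N, C)


layer = EfficientSelfAttention(c=32, r=64)
tokens = torch.randn(2, 256, 32)   # B=2, N=256, C=32
out = layer(tokens)
print(out.shape)                   # torch.Size([2, 256, 32])
```

The score matrix has $N \times N/R$ entries instead of $N \times N$, which is where the stated complexity reduction comes from.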

Mix-FFN (Eq. 3, Sec. 3.1) replaces fixed positional encodings with a $3 \times 3$ depthwise convolution inside the feed-forward network:

$$x_\text{out} = \text{MLP}(\text{GELU}(\text{Conv}_{3\times3}(\text{MLP}(x_\text{in})))) + x_\text{in}$$

The convolution's zero-padding supplies positional information implicitly, enabling inference at any resolution without re-interpolation artifacts.
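A toy illustration of why zero-padding carries positional information: convolving a constant image with an all-ones $3 \times 3$ kernel gives different responses at corners, edges, and interior, so the output alone reveals how close a pixel is to the border. The function name `conv3x3_ones` is illustrative, and the all-ones kernel is chosen only to make the effect easy to read; a learned kernel behaves analogously.

```python
def conv3x3_ones(img: list[list[int]]) -> list[list[int]]:
    """3x3 all-ones convolution with zero padding (same output size)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    # positions outside the image contribute zero (the padding)
                    if 0 <= y < h and 0 <= x < w:
                        total += img[y][x]
            out[i][j] = total
    return out


image = [[1] * 5 for _ in range(5)]   # constant input: no positional cue in the data
result = conv3x3_ones(image)
for row in result:
    print(row)
# [4, 6, 6, 6, 4]
# [6, 9, 9, 9, 6]
# [6, 9, 9, 9, 6]
# [6, 9, 9, 9, 6]
# [4, 6, 6, 6, 4]
```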

The Mix-FFN block in PyTorch:

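import torch
from torch import nn


class MixFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise convolution between two linear layers."""

    def __init__(self, c: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(c, hidden)
        # groups=hidden makes the convolution depthwise (one 3x3 filter per channel)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, c)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        h = self.fc1(x)                              # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)   # tokens to (B, hidden, H, W)
        h = self.dwconv(h)                           # zero-padded 3x3 depthwise conv
        h = h.flatten(2).transpose(1, 2)             # back to (B, N, hidden)
        h = self.act(h)
        h = self.fc2(h)                              # (B, N, C)
        return x + h                                 # residual connection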

The all-MLP decoder (Sec. 3.2, Eq. 4a–4d) operates in four steps. First, per-stage linear projections unify each feature map $F_i$ from its native $C_i$ channels to a shared $C$: $\hat{F}_i = \text{Linear}(C_i, C)(F_i)$. Second, all maps are bilinearly upsampled to $\frac{H}{4} \times \frac{W}{4}$. Third, a fusion linear layer projects the concatenated $4C$-channel map back to $C$: $F = \text{Linear}(4C, C)(\text{Concat}(\hat{F}_i))$. Fourth, a classifier linear layer produces the final mask: $M = \text{Linear}(C, N_\text{cls})(F)$. The decoder channel width is $C = 256$ for B0 and B1, and $C = 768$ for B2–B5. The MLP decoder suffices because SegFormer's stage-4 Transformer blocks already produce non-local attention covering the full image (Fig. 3), rendering ASPP and similar context modules redundant.
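The four decoder steps can be sketched in PyTorch as follows. This is a minimal sketch, not the official code: the class name `AllMLPDecoder` is hypothetical, the per-stage channel widths `(32, 64, 160, 256)` are assumed values for a small encoder that the text above does not state, and any normalization, activation, or dropout that real implementations may add is omitted.

```python
import torch
import torch.nn.functional as F
from torch import nn


class AllMLPDecoder(nn.Module):
    """Sketch of the four-step all-MLP decoder (hypothetical class name)."""

    def __init__(self, in_channels: tuple[int, ...], c: int, n_cls: int):
        super().__init__()
        # Step 1: one Linear(C_i, C) per encoder stage
        self.proj = nn.ModuleList([nn.Linear(ci, c) for ci in in_channels])
        # Step 3: Linear(4C, C) over the concatenated maps
        self.fuse = nn.Linear(len(in_channels) * c, c)
        # Step 4: Linear(C, N_cls) classifier
        self.cls = nn.Linear(c, n_cls)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        target = feats[0].shape[2:]  # spatial size of the 1/4-resolution map
        ups = []
        for f, proj in zip(feats, self.proj):
            # nn.Linear acts on the last dim, so move channels last and back
            f = proj(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
            # Step 2: bilinear upsampling to H/4 x W/4
            f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            ups.append(f)
        x = torch.cat(ups, dim=1).permute(0, 2, 3, 1)  # (B, H/4, W/4, 4C)
        x = self.fuse(x)
        return self.cls(x).permute(0, 3, 1, 2)         # (B, N_cls, H/4, W/4)


# Usage with dummy features for a 128x128 input (strides 4, 8, 16, 32)
chans = (32, 64, 160, 256)  # assumed channel widths, for illustration only
feats = [torch.randn(1, ch, 128 // s, 128 // s) for ch, s in zip(chans, (4, 8, 16, 32))]
decoder = AllMLPDecoder(chans, c=256, n_cls=150)
logits = decoder(feats)
print(logits.shape)  # torch.Size([1, 150, 32, 32])
```

The logits come out at 1/4 resolution; a final bilinear upsampling step would bring them to the full input size.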

Training. Datasets: ADE20K (150 categories), Cityscapes (19 categories), COCO-Stuff (172 labels). Loss: standard cross-entropy on per-pixel logits. Optimizer: AdamW with initial learning rate $6\times10^{-5}$, polynomial decay (power 1.0), weight decay 0.01. Schedule: 160K iterations on ADE20K and Cityscapes, batch size 16 (ADE20K, COCO-Stuff) or 8 (Cityscapes). MiT encoder pretrained on ImageNet-1K. Augmentation: random resize ratio 0.5–2.0, horizontal flip, random crop $512\times512$ (ADE20K, COCO-Stuff) or $1024\times1024$ (Cityscapes). Headline results: ADE20K val mIoU 51.8% (B5, multi-scale; Table 2); Cityscapes val mIoU 84.0% (B5, multi-scale; Table 2); on Cityscapes-C, B5 outperforms DeepLabV3+ variants by up to 588% relative improvement on Gaussian noise (Table 5).
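The polynomial decay schedule can be written out directly. This is a sketch assuming no warmup phase and a final learning rate of zero, neither of which is specified above; the function name `poly_lr` is illustrative.

```python
def poly_lr(step: int, total_steps: int = 160_000,
            base_lr: float = 6e-5, power: float = 1.0) -> float:
    """Polynomial learning-rate decay: base_lr * (1 - step / total_steps) ** power."""
    remaining = 1.0 - step / total_steps
    return base_lr * remaining ** power


for step in (0, 80_000, 160_000):
    print(step, poly_lr(step))
```

With power 1.0 the schedule is a straight line from the initial learning rate down to zero at the last iteration, so the rate is halved at the midpoint.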

Complexity. Variants span B0 (3.8M params, 8.4G FLOPs at $512\times512$) through B5 (84.7M params, 183.3G FLOPs at $640\times640$); see Table 2.

Implementations

Official NVlabs PyTorch release, with widely used community ports in HuggingFace Transformers and MMSegmentation.

Assessment

Novelty.

  • Replaces ViT's plain single-scale encoder with a four-stage MiT that produces multi-scale features at $\{1/4, 1/8, 1/16, 1/32\}$; SETR built on ViT retained a single scale and required a correspondingly heavy decoder.
  • Eliminates positional encodings entirely: Mix-FFN's $3 \times 3$ depthwise convolution supplies positional information via zero-padding, so the model accepts arbitrary input resolutions at inference with little accuracy degradation. With PE, switching from $768\times768$ to $1024\times2048$ on Cityscapes drops mIoU by 3.3 points (Table 1c); Mix-FFN reduces this drop to 0.7 points.
  • All-MLP decoder replaces ASPP, OCR, and UPerNet-style context modules, relying on the encoder's already-global stage-4 effective receptive field (Fig. 3).
  • Efficient self-attention with per-stage reduction ratio $R = [64, 16, 4, 1]$ (Eq. 2) cuts dense-prediction attention cost from $O(N^2)$ to $O(N^2/R)$.
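To make the attention saving concrete, the number of entries in the attention score matrix can be counted per stage for a $512\times512$ input. This is simple arithmetic on the stated strides and reduction ratios; the function name `attention_scores` is illustrative, and the count ignores heads, channels, and constant factors.

```python
def attention_scores(h: int, w: int, stride: int, r: int) -> tuple[int, int]:
    """Score-matrix entries without and with sequence reduction."""
    n = (h // stride) * (w // stride)   # number of tokens N at this stage
    return n * n, n * (n // r)          # N^2 versus N * (N / R)


STRIDES = [4, 8, 16, 32]
RATIOS = [64, 16, 4, 1]

for stage, (s, r) in enumerate(zip(STRIDES, RATIOS), start=1):
    full, reduced = attention_scores(512, 512, s, r)
    print(f"stage {stage}: full={full:,} reduced={reduced:,}")
# stage 1: full=268,435,456 reduced=4,194,304
# stage 2: full=16,777,216 reduced=1,048,576
# stage 3: full=1,048,576 reduced=262,144
# stage 4: full=65,536 reduced=65,536
```

At this input size the reduced key sequence has 256 tokens at every stage, so the ratios equalize the key length across the hierarchy.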

Strengths.

  • Strong accuracy across scales: B0 reaches 37.4% mIoU on ADE20K with 3.8M parameters (Table 2); B5 reaches 51.8% mIoU (Table 2).
  • Zero-shot robustness: on Cityscapes-C, B5 shows up to 588% relative improvement over DeepLabV3+ variants on Gaussian noise (Table 5) — attributable to the encoder's attention-pooled features smoothing over local corruption patterns.
  • Resolution flexibility: no positional encoding means inference at any resolution without re-interpolating position embeddings, and with only a small accuracy penalty (0.7 mIoU in the Table 1c resolution-change test).
  • Adopted as a backbone in downstream tasks — SegFormer-B0 and B3 are explicit Segmentor backbone variants in FocalClick (2022).

Limitations.

  • The official NVlabs/SegFormer code and NVIDIA-hosted pretrained weights are released under the NVIDIA Source Code License — research and evaluation use only. Commercial deployment requires reimplementing and retraining from scratch; the HuggingFace Transformers and MMSegmentation Apache-2.0 ports do not change the weights' license.
  • Efficient self-attention with $R = 64$ at stage 1 collapses 64 tokens into one for keys and values, a lossy approximation that can hurt very fine-grained boundary prediction at high input resolution.
  • Per-pixel classification (versus the mask-classification paradigm introduced by MaskFormer and Mask2Former) limits the architecture to semantic segmentation; instance and panoptic segmentation require architectural changes.

References

  1. Xie, Wang, Yu, Anandkumar, Alvarez, Luo. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS, 2021. arXiv .15203
  2. Dosovitskiy et al. An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. ICLR, 2021. arXiv .11929
  3. Long, Shelhamer, Darrell. Fully Convolutional Networks for Semantic Segmentation. CVPR, 2015. arXiv .4038
  4. Chen, Papandreou, Schroff, Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint, 2017. arXiv .05587

Feeds into

  • FocalClick

    SegFormer-B0 and SegFormer-B3 are explicit Segmentor backbones in FocalClick Table 3; the MiT encoder + all-MLP decoder is reused intact and the decoder logits feed FocalClick's Refiner.