Motivation
Goal: produce a refined binary foreground mask for a user-selected object from a sequence of positive and negative clicks on an RGB image, with the ability to correct a preexisting mask supplied by an upstream model or manual tool. Input: an RGB image together with a positive-click binary-disk map, a negative-click binary-disk map, and an optional prior binary mask, concatenated into a 6-channel tensor (Sec. 3.1). Output: a per-pixel foreground probability map of shape $H \times W$. The defining property is that each click triggers inference on two small local crops only (a Target Crop fed to a lightweight Segmentor and a Focus Crop fed to a Refiner) rather than a full-image forward pass. This is combined with Progressive Merge compositing, which writes only the largest connected update region containing the new click and so preserves the correctly segmented parts of any preexisting mask.
Architecture
Family & shape. Encoder-decoder, two-stage. Input: 6-channel tensor (3-channel RGB + positive-click disk map + negative-click disk map + previous-mask channel). Output: foreground probability map. Stage 1 is a Segmentor (HRNet-W18s+OCR, HRNet-W32+OCR, SegFormer-B0, or SegFormer-B3) that operates on the Target Crop resized to 128×128 (S1) or 256×256 (S2). Stage 2 is a lightweight Refiner (Xception depthwise convolutions, 0.011–0.025M parameters) that operates on a smaller Focus Crop. The canonical configuration is hrnet18s-S2.
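The 6-channel input described above can be sketched without any framework. This is a minimal illustration, not the released implementation: the helper names `disk_map` and `build_input`, the default disk radius of 5 pixels, and the all-zero channel used when no prior mask exists are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

Grid = List[List[float]]   # one channel, indexed as grid[row][col]
Click = Tuple[int, int]    # (row, col)


def disk_map(height: int, width: int, clicks: Sequence[Click], radius: int) -> Grid:
    """Binary channel with a filled disk of `radius` pixels around each click."""
    grid = [[0.0] * width for _ in range(height)]
    for cy, cx in clicks:
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                    grid[y][x] = 1.0
    return grid


def build_input(rgb: Sequence[Grid], pos_clicks: Sequence[Click],
                neg_clicks: Sequence[Click], prev_mask: Optional[Grid] = None,
                radius: int = 5) -> List[Grid]:
    """Stack 3 RGB channels + positive disks + negative disks + previous mask."""
    height, width = len(rgb[0]), len(rgb[0][0])
    if prev_mask is None:
        # Assumption: an absent prior mask is encoded as an all-zero channel.
        prev_mask = [[0.0] * width for _ in range(height)]
    return [*rgb,
            disk_map(height, width, pos_clicks, radius),
            disk_map(height, width, neg_clicks, radius),
            prev_mask]
```

In a real pipeline the six channels would be held in a single tensor rather than nested lists; the list form is used here only to keep the sketch dependency-free.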
Blocks. Four sub-operations execute per click (Sec. 3.1).
(a) Target Crop. A bounding box is formed around the union of the previous mask and the new click, then expanded by a fixed ratio. The crop is downsampled to the Segmentor resolution (128×128 or 256×256) and processed by the Segmentor, which fuses click maps after its stem layers via two convolutional layers following the RITM Conv1S scheme. Output: a coarse probability map.
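The box computation in step (a) could look roughly like the sketch below. The function names and the `(top, left, bottom, right)` convention are hypothetical, and the expansion ratio is left as a required argument because its value is not stated in this summary.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def expand_box(box: Box, ratio: float, height: int, width: int) -> Box:
    """Scale a box about its centre by `ratio`, then clamp it to the image."""
    top, left, bottom, right = box
    mid_y, mid_x = (top + bottom) / 2.0, (left + right) / 2.0
    half_h = (bottom - top) * ratio / 2.0
    half_w = (right - left) * ratio / 2.0
    return (max(0, math.floor(mid_y - half_h)),
            max(0, math.floor(mid_x - half_w)),
            min(height, math.ceil(mid_y + half_h)),
            min(width, math.ceil(mid_x + half_w)))


def target_crop_box(prev_mask: Sequence[Sequence[int]],
                    click: Tuple[int, int], ratio: float) -> Box:
    """Bounding box of (previous mask ∪ new click), expanded by `ratio`."""
    height, width = len(prev_mask), len(prev_mask[0])
    # Rows and columns touched by the mask, plus the click itself.
    ys = [y for y, row in enumerate(prev_mask) if any(row)] + [click[0]]
    xs = [x for row in prev_mask for x, v in enumerate(row) if v] + [click[1]]
    box = (min(ys), min(xs), max(ys) + 1, max(xs) + 1)
    return expand_box(box, ratio, height, width)
```

When the previous mask is empty, the box collapses to the single click pixel before expansion, so a practical tool would likely enforce a minimum crop size; that policy is outside this sketch.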
(b) Focus Crop. A Difference Mask is formed by XOR-ing the binarised coarse prediction against the previous mask. The largest connected component of the Difference Mask that contains the new click is found, its bounding box is expanded by a fixed ratio, and the resulting patch is the Focus Crop. This locates the region that the current click most directly affects.
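Step (b) can be approximated with a flood fill, since the connected component that contains a given pixel is unique. The sketch below is illustrative only: the names `changed_region` and `focus_crop_box`, the choice of 4-connectivity, and the empty result when the clicked pixel itself did not change are assumptions, and the expansion ratio is again a caller-supplied value.

```python
import math
from collections import deque
from typing import Optional, Sequence, Set, Tuple

Pixel = Tuple[int, int]          # (row, col)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def changed_region(coarse: Sequence[Sequence[int]], prev: Sequence[Sequence[int]],
                   click: Pixel) -> Set[Pixel]:
    """Pixels where coarse != prev that are 4-connected to the click."""
    height, width = len(prev), len(prev[0])

    def differs(y: int, x: int) -> bool:
        return bool(coarse[y][x]) != bool(prev[y][x])  # per-pixel XOR

    if not differs(*click):
        return set()  # assumption: no focus region if the click pixel is unchanged
    region, queue = {click}, deque([click])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and (ny, nx) not in region and differs(ny, nx)):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region


def focus_crop_box(region: Set[Pixel], ratio: float,
                   height: int, width: int) -> Optional[Box]:
    """Bounding box of the region, scaled about its centre and clamped."""
    if not region:
        return None
    ys = [y for y, _ in region]
    xs = [x for _, x in region]
    top, left, bottom, right = min(ys), min(xs), max(ys) + 1, max(xs) + 1
    mid_y, mid_x = (top + bottom) / 2.0, (left + right) / 2.0
    half_h, half_w = (bottom - top) * ratio / 2.0, (right - left) * ratio / 2.0
    return (max(0, math.floor(mid_y - half_h)), max(0, math.floor(mid_x - half_w)),
            min(height, math.ceil(mid_y + half_h)), min(width, math.ceil(mid_x + half_w)))
```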
(c) Refiner. The Refiner receives the Focus Crop pixels through Xception depthwise convolutional layers and also receives Segmentor features RoiAligned into the Focus Crop coordinate frame (the coarse logit $M_l$). Two prediction heads produce a detail map $M_d$ and a boundary map $M_b$. The refined logit $M_r$ is blended per Eq. 1:

$$M_r = \sigma(M_b) \odot M_d + \bigl(1 - \sigma(M_b)\bigr) \odot M_l \tag{1}$$

where $\sigma$ is the sigmoid function. $\sigma(M_b)$ acts as a spatial gate: at boundary pixels the Refiner detail head dominates; at interior pixels the RoiAligned coarse logit dominates.
(d) Progressive Merge. The binarised new prediction (threshold 0.5) is XOR-ed against the prior mask to form a candidate change set. The largest connected component of that set which contains the new click is written to the global mask; all other pixels inherit the prior mask. This is a parameter-free morphological compositing step.
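At each click, let $P$ be the binarised refined prediction and $M$ the current global mask. Let $D = P \oplus M$ be the per-pixel difference. The update region $U$ is the largest connected component of $D$ that contains the new click location. The updated mask is

$$M'(x) = \begin{cases} P(x), & x \in U \\ M(x), & \text{otherwise.} \end{cases}$$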
Progressive Merge is inactive for the first 10 clicks when annotating from scratch, during which predictions are written globally (Sec. 3.1).
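A compact way to picture Progressive Merge, including the 10-click warm-up for from-scratch annotation, is the sketch below. It is a hedged illustration rather than the official code: the function name, the 1-based `click_index`, the 4-connectivity, and the choice to return the prior mask unchanged when the clicked pixel did not flip are assumptions.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, indexed as mask[row][col]


def progressive_merge(pred: Mask, prior: Mask, click: Tuple[int, int],
                      click_index: int, from_scratch: bool,
                      warmup_clicks: int = 10) -> Mask:
    """Write only the changed region connected to the click; keep the prior elsewhere."""
    if from_scratch and click_index <= warmup_clicks:
        return [row[:] for row in pred]   # merge inactive: prediction written globally
    height, width = len(prior), len(prior[0])
    merged = [row[:] for row in prior]
    cy, cx = click
    if pred[cy][cx] == prior[cy][cx]:
        return merged                     # assumption: click changed nothing locally
    stack, seen = [click], {click}
    while stack:
        y, x = stack.pop()
        merged[y][x] = pred[y][x]         # copy the new prediction inside the region
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width and (ny, nx) not in seen
                    and pred[ny][nx] != prior[ny][nx]):
                seen.add((ny, nx))
                stack.append((ny, nx))
    return merged
```

Changed pixels that are not connected to the click are simply ignored, which is how the rule protects regions of the prior mask that were already correct.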
The Refiner fusion block in PyTorch:
```python
import torch
import torch.nn as nn


class FocalClickRefinerFuser(nn.Module):
    """Combine boundary, detail, and RoiAligned coarse logits per Eq. 1.

    m_l: RoiAlign-cropped coarse logit from the Segmentor, shape (B, 1, H, W)
    m_d: detail map from Refiner depthwise head, shape (B, 1, H, W)
    m_b: boundary map from Refiner boundary head, shape (B, 1, H, W)
    """

    def forward(
        self,
        m_l: torch.Tensor,
        m_d: torch.Tensor,
        m_b: torch.Tensor,
    ) -> torch.Tensor:
        # Boundary map gates between Refiner detail (at edges)
        # and coarse logit (at interiors).
        gate = torch.sigmoid(m_b)
        return gate * m_d + (1.0 - gate) * m_l
```
Training. Primary dataset: COCO+LVIS; secondary: SBD. Click simulation follows the RITM iterative protocol: up to 24 positive and 24 negative clicks per sample, probability decay 0.8. Loss (Eq. 2):
where is binary cross-entropy on the boundary head, is Normalized Focal Loss on the coarse Segmentor output, and is boundary-weighted NFL on the Refiner output with boundary weight 1.5. Optimizer: Adam (, ), initial learning rate , decayed at epochs 190 and 220, 230 epochs total (1 epoch = 30 000 images), batch size 32, 2×V100, approximately 24 h. Headline results from Table 4 (DAVIS-585, from initial mask): hrnet18s-S1 NoC85=2.72, NoC90=3.82 vs RITM-hrnet18s NoC85=3.71, NoC90=5.96; segformerB3-S2 NoC85=2.00, NoC90=2.76. From Table 2 (standard DAVIS, from scratch, COCO+LVIS): hrnet18s-S2 NoC85=3.90, NoC90=5.25; segformerB3-S2 NoC85=3.61, NoC90=4.90.
Complexity. Six variants span a wide compute range (Table 3). hrnet18s-S2: 4.22M Segmentor parameters + 0.011M Refiner parameters, 3.66 G Segmentor FLOPs + 0.16 G Refiner FLOPs, 213 ms Segmentor + 51 ms Refiner on a 2.4 GHz 4-core Intel Core i5 CPU. B0-S1: 0.43 G + 0.17 G FLOPs — 15× lower than the lightest RITM variant (hrnet18s-400, 8.96 G FLOPs). B3-S2 is the heaviest at 12.72 G + 0.20 G FLOPs (634 ms + 72 ms). The Refiner contributes 0.011–0.025M parameters and 0.15–0.20 G FLOPs regardless of Segmentor size.
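The totals implied by these Table 3 figures can be checked with a few lines of arithmetic. The snippet only re-adds numbers already quoted in the paragraph above; the dictionary layout and variable names are illustrative.

```python
# (Segmentor, Refiner) figures as quoted from Table 3.
FLOPS_G = {
    "hrnet18s-S2": (3.66, 0.16),
    "B0-S1": (0.43, 0.17),
    "B3-S2": (12.72, 0.20),
}
LATENCY_MS = {
    "hrnet18s-S2": (213, 51),
    "B3-S2": (634, 72),
}
RITM_HRNET18S_400_FLOPS_G = 8.96

# Per-click cost is the sum of both stages.
total_flops = {name: round(sum(pair), 2) for name, pair in FLOPS_G.items()}
total_latency = {name: sum(pair) for name, pair in LATENCY_MS.items()}

# How many times cheaper B0-S1 is than the lightest RITM variant.
reduction = RITM_HRNET18S_400_FLOPS_G / total_flops["B0-S1"]

print(total_flops)           # hrnet18s-S2: 3.82, B0-S1: 0.6, B3-S2: 12.92
print(total_latency)         # hrnet18s-S2: 264 ms, B3-S2: 706 ms
print(round(reduction, 1))   # 14.9, i.e. the "15x" quoted in the text
```

The 15× figure therefore compares RITM against the combined Segmentor and Refiner cost of B0-S1 (0.60 G), not against the Segmentor alone.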
Implementations
Official Apache-2.0 release XavierCHEN34/ClickSEG bundles training, evaluation, and pretrained weights for all six variants (HRNet18s-S1/S2, HRNet32-S2, B0-S1/S2, B3-S2).
Assessment
Novelty.
- Two-stage local inference — Target Crop through Segmentor followed by Focus Crop through Refiner — replaces the whole-image forward pass used on every click by RITM, f-BRS, and CDNet, enabling CPU-feasible per-click latency without accuracy regression.
- Progressive Merge is the first algorithmic treatment of preexisting-mask preservation in click-based interactive segmentation: a morphological compositing rule that restricts each update to the largest connected change region containing the new click, leaving correctly-segmented pixels unmodified.
- DAVIS-585 benchmark contribution: 300 base masks expanded to 585 corrupted samples with controlled IoU range (75%–85%) and explicit error-type probabilities (boundary 0.65, external false-positive 0.25, internal true-negative 0.10), filling a gap left by prior from-scratch-only evaluation protocols (Sec. 3.2).
Strengths.
- 15× lower FLOPs than the lightest RITM variant at competitive NoC numbers: B0-S1 uses 0.43 + 0.17 G FLOPs vs RITM-hrnet18s-400's 8.96 G (Table 3).
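- Native preexisting-mask correction: starting from an initial mask on DAVIS-585, hrnet18s-S1 reaches NoC90=3.82 against 5.96 for RITM-hrnet18s (Table 4). For reference, hrnet18s-S2 annotating from scratch on standard DAVIS needs NoC90=5.25 (Table 2), although that figure comes from a different variant and benchmark and is not a like-for-like comparison.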
- Sub-300 ms per click on a 2.4 GHz 4-core i5 CPU for hrnet18s-S2 — 213 ms Segmentor + 51 ms Refiner (Table 3) — practical for desktop annotation tools without GPU.
- Drop-in backbone swap: the same two-stage pipeline accommodates HRNet+OCR and SegFormer variants, spanning a 6-point accuracy–latency design space from 0.43 + 0.17 G FLOPs to 12.72 + 0.20 G FLOPs (Table 3).
Limitations.
- Single-component assumption: Focus Crop selection and Progressive Merge both isolate the largest connected XOR-region containing the new click; multi-region edits per click are silently truncated to the dominant component.
- Tiny and filamentary structures (parachute ropes, hair, cables): B3-S2 reaches only 23.7% IoU at 20 clicks on a thin-structure example (Fig. 5, row 5).
- Progressive Merge is inactive for the first 10 clicks when annotating from scratch; during that phase, predictions are applied globally and can overwrite well-segmented detail.
- Iterative training carries RITM's instability: causes training collapse; FocalClick inherits the ceiling without an explicit mitigation strategy.
References
- Chen, X., Zhao, Z., Zhang, Y., Duan, M., Qi, D., & Zhao, H. FocalClick: Towards Practical Interactive Image Segmentation. CVPR, 2022.
- Sofiiuk, K., Petrov, I. A., & Konushin, A. Reviving Iterative Training with Mask Guidance for Interactive Segmentation. arXiv:2102.06583, 2021.
- Sun, K., Xiao, B., Liu, D., & Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation (HRNet). CVPR, 2019.