GrabCut Iterative Segmentation | VitaVision
Back to atlas

GrabCut Iterative Segmentation

6 min readIntermediateView in graph
Based on
GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts
Rother, Kolmogorov, Blake · ACM Transactions on Graphics (SIGGRAPH) 2004
DOI ↗

Goal

A colour image z=(z1,,zN)z = (z_1, \ldots, z_N) with znR3z_n \in \mathbb{R}^3 and a user-drawn bounding rectangle are the inputs. The output is a binary opacity array α{0,1}N\alpha \in \{0,1\}^N partitioning the image into foreground and background, plus a continuous αn[0,1]\alpha_n \in [0,1] in a narrow border ribbon of width ±w\pm w pixels for smooth edge transitions. A single rectangle suffices as the sole required interaction; the algorithm needs no inner-boundary trace. The Gibbs energy E(α,k,θ,z)E(\alpha, k, \theta, z) decreases monotonically across the iterative coordinate-descent loop, guaranteeing convergence to a local minimum.

Algorithm

Let z=(z1,,zN)z = (z_1, \ldots, z_N) denote the RGB image, znR3z_n \in \mathbb{R}^3. Let α{0,1}N\alpha \in \{0,1\}^N denote the binary opacity matte; αn=0\alpha_n = 0 is background, αn=1\alpha_n = 1 is foreground. Let k=(k1,,kN)k = (k_1, \ldots, k_N), kn{1,,K}k_n \in \{1, \ldots, K\}, denote the per-pixel GMM component assignment. Let K=5K = 5 be the number of Gaussian components per side. Let θ={π(α,k),μ(α,k),Σ(α,k)}\theta = \{\pi(\alpha, k),\, \mu(\alpha, k),\, \Sigma(\alpha, k)\} collect the GMM mixture weights, means, and full 3×33{\times}3 covariance matrices. Let TBT_B, TUT_U, TFT_F denote the hard-background, unknown, and hard-foreground pixel sets forming the trimap. Let CC denote the set of 8-connected neighbour pairs. Let β\beta be the contrast normalisation derived from image statistics. Let γ=50\gamma = 50 be the smoothness weight. Let w=6w = 6 be the border-matting ribbon half-width in pixels. Let λ1=50\lambda_1 = 50 and λ2=103\lambda_2 = 10^3 be the DP regulariser parameters. Let L=41L = 41 be the neighbourhood patch size for GMM parameter estimation in border matting. Let Δt\Delta_t and σt\sigma_t parameterise the α\alpha-profile step function along contour point tt.

Definition
Gibbs energy

The total energy over (α,k,θ,z)(\alpha, k, \theta, z) decomposes into a data term and a smoothness term.

E(α,k,θ,z)=U(α,k,θ,z)+V(α,z)E(\alpha, k, \theta, z) = U(\alpha, k, \theta, z) + V(\alpha, z)

UU is minimised over α\alpha and kk by the min-cut and GMM-assignment steps; VV penalises label discontinuities across high-contrast edges.

Definition
Data term

The data term sums per-pixel negative log-likelihoods under the GMM.

U(α,k,θ,z)=nD(αn,kn,θ,zn)U(\alpha, k, \theta, z) = \sum_{n} D(\alpha_n, k_n, \theta, z_n)

D(αn,kn,θ,zn)=logπ(αn,kn)+12logdetΣ(αn,kn)+12[znμ(αn,kn)]Σ(αn,kn)1[znμ(αn,kn)]D(\alpha_n, k_n, \theta, z_n) = -\log \pi(\alpha_n, k_n) + \tfrac{1}{2}\log\det\Sigma(\alpha_n, k_n) + \tfrac{1}{2}\bigl[z_n - \mu(\alpha_n, k_n)\bigr]^\top \Sigma(\alpha_n, k_n)^{-1}\bigl[z_n - \mu(\alpha_n, k_n)\bigr]

Each GMM component contributes a mixture-weight penalty and a Mahalanobis distance from the component mean.

Definition
Smoothness term

The smoothness term applies a contrast-weighted penalty at every label boundary in the 8-connected graph.

V(α,z)=γ(m,n)C[αnαm]exp ⁣(βzmzn2)V(\alpha, z) = \gamma \sum_{(m,n)\in C} [\alpha_n \neq \alpha_m]\, \exp\!\bigl(-\beta \|z_m - z_n\|^2\bigr)

where β=(2(zmzn)2)1\beta = \bigl(2\langle(z_m - z_n)^2\rangle\bigr)^{-1} is computed from all neighbouring pixel pairs in the image, and γ=50\gamma = 50.

Definition
Border-matting \alpha-profile

Along contour CC, a soft step function is fitted per contour point tt by dynamic programming.

g(rn;Δt(n),σt(n))g(r_n;\, \Delta_{t(n)},\, \sigma_{t(n)})

rnr_n is the signed distance of pixel nn from the boundary. Parameters Δt\Delta_{t} (position) and σt\sigma_{t} (width) are optimised with a DP regulariser controlled by λ1=50\lambda_1 = 50 and λ2=103\lambda_2 = 10^3, over a L=41L = 41 pixel neighbourhood patch centred on each contour point, with ribbon half-width w=6w = 6.

Procedure

Algorithm
GrabCut iterative segmentation
Input: Colour image zz; bounding rectangle RR; parameters K=5K = 5, γ=50\gamma = 50, w=6w = 6
Output: Hard segmentation α{0,1}N\alpha \in \{0,1\}^N; continuous α\alpha in ±w\pm w-pixel border ribbon
  1. Set TBT_B to all pixels outside RR; set TUT_U to all pixels inside RR; set TF=T_F = \emptyset. Assign αn=0\alpha_n = 0 for nTBn \in T_B and αn=1\alpha_n = 1 for nTUn \in T_U. Compute β=(2(zmzn)2)1\beta = \bigl(2\langle(z_m - z_n)^2\rangle\bigr)^{-1} from all 8-connected neighbour pairs. Initialise foreground and background GMMs (K=5K = 5 components each) from pixels with αn=1\alpha_n = 1 and αn=0\alpha_n = 0 respectively.
  2. Assign: for each pixel nTUn \in T_U, set kn=argminkD(αn,k,θ,zn)k_n = \arg\min_{k} D(\alpha_n, k, \theta, z_n) (hard assignment to the nearest GMM component).
  3. Learn: for each component kk in each GMM, recompute the sample mean μ\mu, covariance Σ\Sigma, and weight π=F(k)/kF(k)\pi = |F(k)| / \sum_k |F(k)| from the pixels currently assigned to that component.
  4. Min-cut: construct the contrast-weighted pixel graph with TT-links from UU and NN-links from VV; compute the global s-t min-cut to update α\alpha over TUT_U pixels.
  5. Repeat steps 2–4 until E(α,k,θ,z)E(\alpha, k, \theta, z) ceases to decrease.
  6. Optional user editing: hard-constrain misclassified pixels as foreground or background via brush strokes, then re-run step 4 (single min-cut) or the full loop from step 2.
  7. Border matting: detect the boundary contour CC of the hard segmentation. For each contour point tt, fit g(rn;Δt,σt)g(r_n; \Delta_t, \sigma_t) by DP with regulariser parameters λ1=50\lambda_1 = 50, λ2=103\lambda_2 = 10^3, over the L=41L = 41 pixel patch, within ribbon half-width w=6w = 6. Steal foreground colours from TFT_F to eliminate background bleeding at mixed pixels.
flowchart TB
    A["Assign<br/>k_n ← argmin D(α_n, k, θ, z_n)"] --> B["Learn<br/>μ, Σ, π ← sample stats"]
    B --> C["Min-cut<br/>α ← argmin E(α, k, θ, z)"]
    C --> D{"E decreased?"}
    D -- yes --> A
    D -- no --> E["Converged"]

Implementation

The iterative coordinate-descent loop in Rust:

trait MinCut {
    fn solve(&mut self, data: &[f32], smoothness: &[f32]) -> Vec<u8>;
}

struct Gmm {
    means: Vec<[f32; 3]>,
    covs: Vec<[[f32; 3]; 3]>,
    weights: Vec<f32>,
}

impl Gmm {
    fn assign(&self, z: [f32; 3]) -> usize {
        (0..self.means.len())
            .min_by(|&a, &b| self.mahal(z, a).partial_cmp(&self.mahal(z, b)).unwrap())
            .unwrap()
    }
    fn mahal(&self, z: [f32; 3], k: usize) -> f32 { /* (z-μ)ᵀ Σ⁻¹ (z-μ) */ 0.0 }
    fn fit(pixels: &[[f32; 3]], component_ids: &[usize], k: usize) -> Gmm { todo!() }
}

fn grabcut_iterate(
    z: &[[f32; 3]],
    alpha: &mut Vec<u8>,
    fg_gmm: &mut Gmm,
    bg_gmm: &mut Gmm,
    tu_mask: &[bool],
    solver: &mut impl MinCut,
) {
    let n = z.len();
    let mut comp = vec![0usize; n];
    loop {
        for i in 0..n {
            if tu_mask[i] {
                comp[i] = if alpha[i] == 1 { fg_gmm.assign(z[i]) }
                           else            { bg_gmm.assign(z[i]) };
            }
        }
        *fg_gmm = Gmm::fit(z, &comp, fg_gmm.means.len());
        *bg_gmm = Gmm::fit(z, &comp, bg_gmm.means.len());
        let new_alpha = solver.solve(&[], &[]);
        for i in 0..n { if tu_mask[i] { alpha[i] = new_alpha[i]; } }
        break; // caller checks ΔE and re-enters; graph build follows graph-cut-segmentation.
    }
}

Remarks

  • Each full pass (assign → learn → min-cut) minimises EE over one variable group; the energy decreases monotonically per pass. Representative images converge in approximately 12 iterations.
  • Each min-cut step runs on a Θ(N)\Theta(N)-node graph with Θ(N)\Theta(N) edges for an 8-connected pixel lattice; per-iteration cost is dominated by one max-flow computation.
  • γ=50\gamma = 50 was calibrated on 15 training images; changing it shifts the balance between the colour-model unary and the boundary smoothness term and can cause over-smooth or fragmented results.
  • K=5K = 5 GMM components per side is a fixed design choice: too few components underfit multi-modal colour distributions; too many starve individual components of training pixels.
  • Three failure modes: (i) low-contrast boundaries where the exponential smoothness term cannot localise the edge; (ii) foreground–background camouflage where the GMM unary term provides no gradient; (iii) background material inside the rectangle that is absent from the rectangle exterior, biasing the background GMM at initialisation.
  • Border matting covers only the ±w=6\pm w = 6 pixel boundary ribbon; full-image soft transparency (hair, smoke) is out of scope. A lasso input instead of a rectangle provides tighter background coverage and partially mitigates failure mode (iii).

References

  1. C. Rother, V. Kolmogorov, A. Blake. GrabCut: Interactive Foreground Extraction using Iterated Graph Cuts. ACM TOG (SIGGRAPH), 2004. DOI
  2. Y. Boykov, M.-P. Jolly. Interactive Graph Cuts for Optimal Boundary & Region Segmentation of Objects in N-D Images. ICCV, 2001. PDF