SURF: Speeded Up Robust Features | VitaVision
Back to atlas

SURF: Speeded Up Robust Features

7 min readAdvancedView in graph
Based on
SURF: Speeded Up Robust Features
Bay, Tuytelaars, Gool · Lecture notes in computer science 2006
DOI ↗

Goal

Detect scale- and rotation-invariant blob keypoints in a greyscale image and emit a 64-D descriptor robust to moderate illumination and geometric change. Input: a single-channel image I:ΩRI: \Omega \to \mathbb{R} on pixel domain Ω\Omega. Output: a set of keypoints {(xi,yi,si,θi)}\{(x_i, y_i, s_i, \theta_i)\}, each paired with a unit-norm descriptor fiR64\mathbf{f}_i \in \mathbb{R}^{64} and a Laplacian sign for matching pre-filter. The algorithm is specific to the use of an integral image to evaluate box-filter approximations of the Hessian determinant in O(1)O(1) per pixel regardless of scale, making detection cost independent of the filter size.

Algorithm

Let I:ΩRI: \Omega \to \mathbb{R} denote the greyscale image and let IΣI_\Sigma denote its integral image. Let H(x,σ)\mathcal{H}(\mathbf{x}, \sigma) denote the Hessian matrix of the Gaussian-smoothed image L(x,σ)L(\mathbf{x}, \sigma) at point x=(x,y)\mathbf{x} = (x, y) and scale σ\sigma. Let Dxx,Dyy,DxyD_{xx}, D_{yy}, D_{xy} denote the box-filter approximations of Lxx,Lyy,LxyL_{xx}, L_{yy}, L_{xy}. Let ss denote the keypoint scale parameter. Let dx,dyd_x, d_y denote Haar-wavelet responses in xx and yy.

Definition
Integral image

Cumulative pixel sum that lets any rectangular sum be computed with four lookups regardless of size.

IΣ(x,y)=ix,jyI(i,j).I_\Sigma(x, y) = \sum_{i \le x,\, j \le y} I(i, j).
Definition
Approximated Hessian determinant

Detector response combining box-filtered second-derivative approximations with an empirical balancing factor that compensates for the difference in lobe weights between true Gaussian derivatives and the box-filter approximation.

det(Happrox)=DxxDyy(0.9Dxy)2.\det(\mathcal{H}_\text{approx}) = D_{xx}\, D_{yy} - (0.9\, D_{xy})^2.
Definition
Sub-region descriptor vector

Per-sub-region Haar-wavelet response statistics.

v=(dx,  dx,  dy,  dy).\mathbf{v} = \left(\sum d_x,\; \sum |d_x|,\; \sum d_y,\; \sum |d_y|\right).

The first-octave box-filter sizes are 9,15,21,279, 15, 21, 27 pixels with the 9×99 \times 9 filter corresponding to σ=1.2\sigma = 1.2. Subsequent octaves double the inter-filter step (octave step sizes 6,12,24,6, 12, 24, \ldots). Filter responses are area-normalized so that the Frobenius norm is constant across scales.

Algorithm
SURF detect-and-describe
Input: Greyscale image II; number of octaves and intervals; Hessian threshold tt.
Output: Keypoints {(xi,yi,si,θi)}\{(x_i, y_i, s_i, \theta_i)\} with 64-D descriptors fi\mathbf{f}_i and Laplacian signs.
  1. Build the integral image IΣI_\Sigma in a single pass.
  2. For each scale layer in each octave, evaluate det(Happrox)\det(\mathcal{H}_\text{approx}) at every pixel using box filters Dxx,Dyy,DxyD_{xx}, D_{yy}, D_{xy} via IΣI_\Sigma.
  3. Threshold the response at tt and apply non-maximum suppression in the 3×3×33 \times 3 \times 3 image-and-scale neighbourhood.
  4. Refine each surviving extremum to sub-pixel and sub-scale accuracy by fitting a 3D quadratic to det(Happrox)\det(\mathcal{H}_\text{approx}) around the discrete maximum.
  5. Compute dx,dyd_x, d_y Haar responses (wavelet side 4s4s) at sample step ss over a circular neighbourhood of radius 6s6s; weight with Gaussian σ=2.5s\sigma = 2.5s.
  6. Slide a π/3\pi/3 angular window over the response vectors; assign the dominant orientation θ\theta as the direction of the longest summed response. (Skip this step for U-SURF.)
  7. Build a 20s×20s20s \times 20s window aligned to θ\theta around the keypoint; split into a 4×44 \times 4 grid of sub-regions with 5×55 \times 5 sample points each.
  8. At each sample point, compute Haar responses dx,dyd_x, d_y with wavelet side 2s2s, Gaussian-weighted with σ=3.3s\sigma = 3.3s, expressed in the keypoint's rotated frame.
  9. For each sub-region, accumulate v=(dx,dx,dy,dy)\mathbf{v} = (\sum d_x, \sum |d_x|, \sum d_y, \sum |d_y|) and concatenate the sixteen 4-vectors into a 64-D descriptor.
  10. L2-normalize the descriptor to a unit vector; record the sign of the Hessian trace (Laplacian sign) for matching pre-filter.

Implementation

The Hessian-determinant evaluation is the per-pixel kernel; the integral image makes each box-filter response a constant-time sum. The per-pixel detector response in Rust:

fn integral_at(ii: &[u32], w: usize, x: i32, y: i32) -> u32 {
    if x < 0 || y < 0 { return 0; }
    ii[(y as usize) * w + (x as usize)]
}

fn box_sum(ii: &[u32], w: usize, x0: i32, y0: i32, x1: i32, y1: i32) -> i32 {
    let a = integral_at(ii, w, x1, y1) as i32;
    let b = integral_at(ii, w, x0 - 1, y1) as i32;
    let c = integral_at(ii, w, x1, y0 - 1) as i32;
    let d = integral_at(ii, w, x0 - 1, y0 - 1) as i32;
    a - b - c + d
}

// Box-filter approximations D_xx, D_yy, D_xy at scale `l` (filter side in pixels),
// evaluated at pixel (x, y) on the integral image `ii` of width `w`.
// Returns det(H_approx) = D_xx * D_yy - (0.9 * D_xy)^2, area-normalised.
fn fast_hessian(ii: &[u32], w: usize, x: i32, y: i32, l: i32) -> f32 {
    let lobe = l / 3;
    let half = l / 2;
    let dxx = box_sum(ii, w, x - l + 1, y - lobe, x + l - 1, y + lobe)
            - 3 * box_sum(ii, w, x - lobe + 1, y - lobe, x + lobe - 1, y + lobe);
    let dyy = box_sum(ii, w, x - lobe, y - l + 1, x + lobe, y + l - 1)
            - 3 * box_sum(ii, w, x - lobe, y - lobe + 1, x + lobe, y + lobe - 1);
    let dxy = box_sum(ii, w, x - lobe, y - lobe, x - 1, y - 1)
            + box_sum(ii, w, x + 1, y + 1, x + lobe, y + lobe)
            - box_sum(ii, w, x + 1, y - lobe, x + lobe, y - 1)
            - box_sum(ii, w, x - lobe, y + 1, x - 1, y + lobe);
    let inv_area = 1.0 / ((l * l) as f32);
    let dxx = dxx as f32 * inv_area;
    let dyy = dyy as f32 * inv_area;
    let dxy = dxy as f32 * inv_area;
    dxx * dyy - (0.9 * dxy).powi(2)
}

Remarks

  • Detector cost is O(N)O(N) pixels per scale layer, independent of filter size, because each box-filter response is four integral-image lookups regardless of ll.
  • The descriptor is half the dimension of SIFT's 128-D vector; the Laplacian-sign index halves the candidate set for nearest-neighbour matching by polarity, both of which compound at match time.
  • Box-filter quantisation in scale σ\sigma is roughly ±0.1\pm 0.1 at the smallest scales; sub-scale quadratic refinement partially absorbs this, but the smallest detectable structures are bounded by the 9×99 \times 9 filter (σ=1.2\sigma = 1.2).
  • Integral-image precision must be 64-bit for image dimensions beyond approximately 4100×41004100 \times 4100 pixels; 32-bit unsigned overflows on the cumulative sum at 8-bit input.
  • Affine and large out-of-plane viewpoint changes are out of scope — the detector is scale- and rotation-invariant only.
  • U-SURF drops the orientation-assignment step for additional speed at the cost of in-plane rotation invariance; valid only when the camera stays approximately upright.
  • ORB (2011) replaces SURF's box-filter Hessian detector and 64-D Haar-wavelet descriptor with a FAST-9 + Harris-ranked detector and a 256-bit rBRIEF binary descriptor, yielding an order-of-magnitude speedup at comparable inlier rates on textured outdoor scenes — at the cost of SURF's continuous scale estimate and float descriptor capacity. See §When to choose SURF over ORB.

When to choose SURF over ORB

SURF (2006) is the older method; this page hosts the comparison.

Subpixel scale. SURF carries a continuous scale σ\sigma per keypoint via quadratic interpolation in (x,y,σ)(x, y, \sigma) across the box-filter response stack. ORB's scale is the discrete pyramid level on which the keypoint survived; there is no sub-level interpolation. Use SURF when downstream geometry — wide-baseline reconstruction, scale-precise stitching — needs an explicit per-keypoint scale.

Blob versus corner detection. SURF's det(Hσ)\det(\mathcal{H}^\sigma) response detects blob-like structures: spots, junctions, and rounded features. ORB detects FAST-9 corners. On graffiti-style monochromatic regions with sparse corners but visible blobs (illumination gradients, painted shapes), SURF localises keypoints where ORB does not. Use SURF when the scene's salient structure is blob-like rather than corner-like.

Descriptor capacity. SURF emits a 64-D float descriptor; ORB emits a 256-bit binary descriptor. SURF's float-valued Haar-wavelet responses retain more discriminative information per keypoint, helping in dense or repetitive scenes where binary tests collide. Use SURF when descriptor capacity dominates the decision — fine-grained recognition, retrieval over highly similar imagery, scene re-identification.

Compute and storage. ORB is roughly an order of magnitude faster than SURF on the same CPU at 640×480640 \times 480, with 32-byte binary descriptors against SURF's 256-byte float descriptors and Hamming-distance matching that reduces to bitwise XOR + popcount. Use ORB when frame budget, descriptor storage, or matching throughput dominates the decision.

References

  1. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Lecture Notes in Computer Science, 2006. link.springer.com
  2. D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004. cs.ubc.ca
  3. C. Harris, M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988.

Alternative formulation of

Compared with

Has learned alternative

  • SuperPoint

    SuperPoint replaces SURF's box-filter Hessian detector + 64-D Haar-wavelet descriptor with a learned VGG encoder + dual decoder heads.

  • XFeat

    XFeat replaces SURF's integral-image Hessian detector + Haar-wavelet descriptor with a featherweight learned model targeting CPU inference.