Goal

Detect scale- and rotation-invariant blob keypoints in a greyscale image and emit a 64-D descriptor robust to moderate illumination and geometric change. Input: a single-channel image $I: \Omega \to \mathbb{R}$ on pixel domain $\Omega$ . Output: a set of keypoints $\{(x_i, y_i, s_i, \theta_i)\}$ , each paired with a unit-norm descriptor $\mathbf{f}_i \in \mathbb{R}^{64}$ and a Laplacian sign for matching pre-filter. The algorithm is specific to the use of an integral image to evaluate box-filter approximations of the Hessian determinant in $O(1)$ per pixel regardless of scale, making detection cost independent of the filter size.

Algorithm

Let $I: \Omega \to \mathbb{R}$ denote the greyscale image and let $I_\Sigma$ denote its integral image. Let $\mathcal{H}(\mathbf{x}, \sigma)$ denote the Hessian matrix of the Gaussian-smoothed image $L(\mathbf{x}, \sigma)$ at point $\mathbf{x} = (x, y)$ and scale $\sigma$ . Let $D_{xx}, D_{yy}, D_{xy}$ denote the box-filter approximations of $L_{xx}, L_{yy}, L_{xy}$ . Let $s$ denote the keypoint scale parameter. Let $d_x, d_y$ denote Haar-wavelet responses in $x$ and $y$ .

Definition

Integral image

Cumulative pixel sum that lets any rectangular sum be computed with four lookups regardless of size.

I_\Sigma(x, y) = \sum_{i \le x,\, j \le y} I(i, j).

Definition

Approximated Hessian determinant

Detector response combining box-filtered second-derivative approximations with an empirical balancing factor that compensates for the difference in lobe weights between true Gaussian derivatives and the box-filter approximation.

\det(\mathcal{H}_\text{approx}) = D_{xx}\, D_{yy} - (0.9\, D_{xy})^2.

Definition

Sub-region descriptor vector

Per-sub-region Haar-wavelet response statistics.

\mathbf{v} = \left(\sum d_x,\; \sum |d_x|,\; \sum d_y,\; \sum |d_y|\right).

The first-octave box-filter sizes are $9, 15, 21, 27$ pixels with the $9 \times 9$ filter corresponding to $\sigma = 1.2$ . Subsequent octaves double the inter-filter step (octave step sizes $6, 12, 24, \ldots$ ). Filter responses are area-normalized so that the Frobenius norm is constant across scales.

Algorithm

SURF detect-and-describe

Input: Greyscale image

I

; number of octaves and intervals; Hessian threshold

t

Output: Keypoints

\{(x_i, y_i, s_i, \theta_i)\}

with 64-D descriptors

\mathbf{f}_i

and Laplacian signs.

Build the integral image $I_\Sigma$ in a single pass.
For each scale layer in each octave, evaluate $\det(\mathcal{H}_\text{approx})$ at every pixel using box filters $D_{xx}, D_{yy}, D_{xy}$ via $I_\Sigma$ .
Threshold the response at $t$ and apply non-maximum suppression in the $3 \times 3 \times 3$ image-and-scale neighbourhood.
Refine each surviving extremum to sub-pixel and sub-scale accuracy by fitting a 3D quadratic to $\det(\mathcal{H}_\text{approx})$ around the discrete maximum.
Compute $d_x, d_y$ Haar responses (wavelet side $4s$ ) at sample step $s$ over a circular neighbourhood of radius $6s$ ; weight with Gaussian $\sigma = 2.5s$ .
Slide a $\pi/3$ angular window over the response vectors; assign the dominant orientation $\theta$ as the direction of the longest summed response. (Skip this step for U-SURF.)
Build a $20s \times 20s$ window aligned to $\theta$ around the keypoint; split into a $4 \times 4$ grid of sub-regions with $5 \times 5$ sample points each.
At each sample point, compute Haar responses $d_x, d_y$ with wavelet side $2s$ , Gaussian-weighted with $\sigma = 3.3s$ , expressed in the keypoint's rotated frame.
For each sub-region, accumulate $\mathbf{v} = (\sum d_x, \sum |d_x|, \sum d_y, \sum |d_y|)$ and concatenate the sixteen 4-vectors into a 64-D descriptor.
L2-normalize the descriptor to a unit vector; record the sign of the Hessian trace (Laplacian sign) for matching pre-filter.

Implementation

The Hessian-determinant evaluation is the per-pixel kernel; the integral image makes each box-filter response a constant-time sum. The per-pixel detector response in Rust:

fn integral_at(ii: &[u32], w: usize, x: i32, y: i32) -> u32 {
    if x < 0 || y < 0 { return 0; }
    ii[(y as usize) * w + (x as usize)]
}

fn box_sum(ii: &[u32], w: usize, x0: i32, y0: i32, x1: i32, y1: i32) -> i32 {
    let a = integral_at(ii, w, x1, y1) as i32;
    let b = integral_at(ii, w, x0 - 1, y1) as i32;
    let c = integral_at(ii, w, x1, y0 - 1) as i32;
    let d = integral_at(ii, w, x0 - 1, y0 - 1) as i32;
    a - b - c + d
}

// Box-filter approximations D_xx, D_yy, D_xy at scale `l` (filter side in pixels),
// evaluated at pixel (x, y) on the integral image `ii` of width `w`.
// Returns det(H_approx) = D_xx * D_yy - (0.9 * D_xy)^2, area-normalised.
fn fast_hessian(ii: &[u32], w: usize, x: i32, y: i32, l: i32) -> f32 {
    let lobe = l / 3;
    let half = l / 2;
    let dxx = box_sum(ii, w, x - l + 1, y - lobe, x + l - 1, y + lobe)
            - 3 * box_sum(ii, w, x - lobe + 1, y - lobe, x + lobe - 1, y + lobe);
    let dyy = box_sum(ii, w, x - lobe, y - l + 1, x + lobe, y + l - 1)
            - 3 * box_sum(ii, w, x - lobe, y - lobe + 1, x + lobe, y + lobe - 1);
    let dxy = box_sum(ii, w, x - lobe, y - lobe, x - 1, y - 1)
            + box_sum(ii, w, x + 1, y + 1, x + lobe, y + lobe)
            - box_sum(ii, w, x + 1, y - lobe, x + lobe, y - 1)
            - box_sum(ii, w, x - lobe, y + 1, x - 1, y + lobe);
    let inv_area = 1.0 / ((l * l) as f32);
    let dxx = dxx as f32 * inv_area;
    let dyy = dyy as f32 * inv_area;
    let dxy = dxy as f32 * inv_area;
    dxx * dyy - (0.9 * dxy).powi(2)
}

Remarks

Detector cost is $O(N)$ pixels per scale layer, independent of filter size, because each box-filter response is four integral-image lookups regardless of $l$ .
The descriptor is half the dimension of SIFT's 128-D vector; the Laplacian-sign index halves the candidate set for nearest-neighbour matching by polarity, both of which compound at match time.
Box-filter quantisation in scale $\sigma$ is roughly $\pm 0.1$ at the smallest scales; sub-scale quadratic refinement partially absorbs this, but the smallest detectable structures are bounded by the $9 \times 9$ filter ( $\sigma = 1.2$ ).
Integral-image precision must be 64-bit for image dimensions beyond approximately $4100 \times 4100$ pixels; 32-bit unsigned overflows on the cumulative sum at 8-bit input.
Affine and large out-of-plane viewpoint changes are out of scope — the detector is scale- and rotation-invariant only.
U-SURF drops the orientation-assignment step for additional speed at the cost of in-plane rotation invariance; valid only when the camera stays approximately upright.
ORB (2011) replaces SURF's box-filter Hessian detector and 64-D Haar-wavelet descriptor with a FAST-9 + Harris-ranked detector and a 256-bit rBRIEF binary descriptor, yielding an order-of-magnitude speedup at comparable inlier rates on textured outdoor scenes — at the cost of SURF's continuous scale estimate and float descriptor capacity. See §When to choose SURF over ORB.

When to choose SURF over ORB

SURF (2006) is the older method; this page hosts the comparison.

Subpixel scale. SURF carries a continuous scale $\sigma$ per keypoint via quadratic interpolation in $(x, y, \sigma)$ across the box-filter response stack. ORB's scale is the discrete pyramid level on which the keypoint survived; there is no sub-level interpolation. Use SURF when downstream geometry — wide-baseline reconstruction, scale-precise stitching — needs an explicit per-keypoint scale.

Blob versus corner detection. SURF's $\det(\mathcal{H}^\sigma)$ response detects blob-like structures: spots, junctions, and rounded features. ORB detects FAST-9 corners. On graffiti-style monochromatic regions with sparse corners but visible blobs (illumination gradients, painted shapes), SURF localises keypoints where ORB does not. Use SURF when the scene's salient structure is blob-like rather than corner-like.

Descriptor capacity. SURF emits a 64-D float descriptor; ORB emits a 256-bit binary descriptor. SURF's float-valued Haar-wavelet responses retain more discriminative information per keypoint, helping in dense or repetitive scenes where binary tests collide. Use SURF when descriptor capacity dominates the decision — fine-grained recognition, retrieval over highly similar imagery, scene re-identification.

Compute and storage. ORB is roughly an order of magnitude faster than SURF on the same CPU at $640 \times 480$ , with 32-byte binary descriptors against SURF's 256-byte float descriptors and Hamming-distance matching that reduces to bitwise XOR + popcount. Use ORB when frame budget, descriptor storage, or matching throughput dominates the decision.

References

H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Lecture Notes in Computer Science, 2006. link.springer.com
D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004. cs.ubc.ca
C. Harris, M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988.

SURF: Speeded Up Robust Features

Goal

Algorithm

Implementation

Remarks

When to choose SURF over ORB

References

Prerequisites

Alternative formulation of

Compared with

Has learned alternative