Goal
Detect scale- and rotation-invariant blob keypoints in a greyscale image and emit a 64-D descriptor robust to moderate illumination and geometric change. Input: a single-channel image on pixel domain . Output: a set of keypoints , each paired with a unit-norm descriptor and a Laplacian sign for matching pre-filter. The algorithm is specific to the use of an integral image to evaluate box-filter approximations of the Hessian determinant in per pixel regardless of scale, making detection cost independent of the filter size.
Algorithm
Let denote the greyscale image and let denote its integral image. Let denote the Hessian matrix of the Gaussian-smoothed image at point and scale . Let denote the box-filter approximations of . Let denote the keypoint scale parameter. Let denote Haar-wavelet responses in and .
Cumulative pixel sum that lets any rectangular sum be computed with four lookups regardless of size.
Detector response combining box-filtered second-derivative approximations with an empirical balancing factor that compensates for the difference in lobe weights between true Gaussian derivatives and the box-filter approximation.
Per-sub-region Haar-wavelet response statistics.
The first-octave box-filter sizes are pixels with the filter corresponding to . Subsequent octaves double the inter-filter step (octave step sizes ). Filter responses are area-normalized so that the Frobenius norm is constant across scales.
- Build the integral image in a single pass.
- For each scale layer in each octave, evaluate at every pixel using box filters via .
- Threshold the response at and apply non-maximum suppression in the image-and-scale neighbourhood.
- Refine each surviving extremum to sub-pixel and sub-scale accuracy by fitting a 3D quadratic to around the discrete maximum.
- Compute Haar responses (wavelet side ) at sample step over a circular neighbourhood of radius ; weight with Gaussian .
- Slide a angular window over the response vectors; assign the dominant orientation as the direction of the longest summed response. (Skip this step for U-SURF.)
- Build a window aligned to around the keypoint; split into a grid of sub-regions with sample points each.
- At each sample point, compute Haar responses with wavelet side , Gaussian-weighted with , expressed in the keypoint's rotated frame.
- For each sub-region, accumulate and concatenate the sixteen 4-vectors into a 64-D descriptor.
- L2-normalize the descriptor to a unit vector; record the sign of the Hessian trace (Laplacian sign) for matching pre-filter.
Implementation
The Hessian-determinant evaluation is the per-pixel kernel; the integral image makes each box-filter response a constant-time sum. The per-pixel detector response in Rust:
fn integral_at(ii: &[u32], w: usize, x: i32, y: i32) -> u32 {
if x < 0 || y < 0 { return 0; }
ii[(y as usize) * w + (x as usize)]
}
fn box_sum(ii: &[u32], w: usize, x0: i32, y0: i32, x1: i32, y1: i32) -> i32 {
let a = integral_at(ii, w, x1, y1) as i32;
let b = integral_at(ii, w, x0 - 1, y1) as i32;
let c = integral_at(ii, w, x1, y0 - 1) as i32;
let d = integral_at(ii, w, x0 - 1, y0 - 1) as i32;
a - b - c + d
}
// Box-filter approximations D_xx, D_yy, D_xy at scale `l` (filter side in pixels),
// evaluated at pixel (x, y) on the integral image `ii` of width `w`.
// Returns det(H_approx) = D_xx * D_yy - (0.9 * D_xy)^2, area-normalised.
fn fast_hessian(ii: &[u32], w: usize, x: i32, y: i32, l: i32) -> f32 {
let lobe = l / 3;
let half = l / 2;
let dxx = box_sum(ii, w, x - l + 1, y - lobe, x + l - 1, y + lobe)
- 3 * box_sum(ii, w, x - lobe + 1, y - lobe, x + lobe - 1, y + lobe);
let dyy = box_sum(ii, w, x - lobe, y - l + 1, x + lobe, y + l - 1)
- 3 * box_sum(ii, w, x - lobe, y - lobe + 1, x + lobe, y + lobe - 1);
let dxy = box_sum(ii, w, x - lobe, y - lobe, x - 1, y - 1)
+ box_sum(ii, w, x + 1, y + 1, x + lobe, y + lobe)
- box_sum(ii, w, x + 1, y - lobe, x + lobe, y - 1)
- box_sum(ii, w, x - lobe, y + 1, x - 1, y + lobe);
let inv_area = 1.0 / ((l * l) as f32);
let dxx = dxx as f32 * inv_area;
let dyy = dyy as f32 * inv_area;
let dxy = dxy as f32 * inv_area;
dxx * dyy - (0.9 * dxy).powi(2)
}
Remarks
- Detector cost is pixels per scale layer, independent of filter size, because each box-filter response is four integral-image lookups regardless of .
- The descriptor is half the dimension of SIFT's 128-D vector; the Laplacian-sign index halves the candidate set for nearest-neighbour matching by polarity, both of which compound at match time.
- Box-filter quantisation in scale is roughly at the smallest scales; sub-scale quadratic refinement partially absorbs this, but the smallest detectable structures are bounded by the filter ().
- Integral-image precision must be 64-bit for image dimensions beyond approximately pixels; 32-bit unsigned overflows on the cumulative sum at 8-bit input.
- Affine and large out-of-plane viewpoint changes are out of scope — the detector is scale- and rotation-invariant only.
- U-SURF drops the orientation-assignment step for additional speed at the cost of in-plane rotation invariance; valid only when the camera stays approximately upright.
- ORB (2011) replaces SURF's box-filter Hessian detector and 64-D Haar-wavelet descriptor with a FAST-9 + Harris-ranked detector and a 256-bit rBRIEF binary descriptor, yielding an order-of-magnitude speedup at comparable inlier rates on textured outdoor scenes — at the cost of SURF's continuous scale estimate and float descriptor capacity. See §When to choose SURF over ORB.
When to choose SURF over ORB
SURF (2006) is the older method; this page hosts the comparison.
Subpixel scale. SURF carries a continuous scale per keypoint via quadratic interpolation in across the box-filter response stack. ORB's scale is the discrete pyramid level on which the keypoint survived; there is no sub-level interpolation. Use SURF when downstream geometry — wide-baseline reconstruction, scale-precise stitching — needs an explicit per-keypoint scale.
Blob versus corner detection. SURF's response detects blob-like structures: spots, junctions, and rounded features. ORB detects FAST-9 corners. On graffiti-style monochromatic regions with sparse corners but visible blobs (illumination gradients, painted shapes), SURF localises keypoints where ORB does not. Use SURF when the scene's salient structure is blob-like rather than corner-like.
Descriptor capacity. SURF emits a 64-D float descriptor; ORB emits a 256-bit binary descriptor. SURF's float-valued Haar-wavelet responses retain more discriminative information per keypoint, helping in dense or repetitive scenes where binary tests collide. Use SURF when descriptor capacity dominates the decision — fine-grained recognition, retrieval over highly similar imagery, scene re-identification.
Compute and storage. ORB is roughly an order of magnitude faster than SURF on the same CPU at , with 32-byte binary descriptors against SURF's 256-byte float descriptors and Hamming-distance matching that reduces to bitwise XOR + popcount. Use ORB when frame budget, descriptor storage, or matching throughput dominates the decision.
References
- H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Lecture Notes in Computer Science, 2006. link.springer.com
- D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 2004. cs.ubc.ca
- C. Harris, M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988.