Goal
Given a grayscale image with pixel values in , detect a sparse set of locally distinctive keypoints and compute, for each, a 2D location, scale , dominant orientation , and a 128-dimensional L2-normalized descriptor vector. The output is suitable for matching across images related by scale change, in-plane rotation, moderate affine distortion, illumination change, and noise. Invariance to scale and rotation is achieved through a Difference-of-Gaussian scale-space pyramid, a canonical orientation assignment from a gradient-histogram peak, and a gradient-histogram descriptor sampled in the keypoint's own coordinate frame.
Algorithm
Let denote the grayscale image on pixel domain . Let denote the Gaussian kernel at scale . Let denote the scale-space representation. Let denote the constant multiplicative scale step, with intervals per octave. Let denote the spatial Hessian of the DoG function evaluated at a candidate keypoint. Let denote the edge-response rejection threshold. Let denote the Gaussian window scale used for orientation assignment.
The Difference-of-Gaussian function approximates the scale-normalized Laplacian :
Extrema of across are used as candidate keypoint locations.
A 3D quadratic is fit to the DoG volume by Taylor expansion through second order:
where is the offset from the discrete sample. Setting the derivative to zero gives the sub-pixel offset:
If in any dimension, the candidate is relocated to the neighbouring sample and the fit is repeated.
To reject keypoints localised on edges rather than corners, the ratio of principal curvatures of the spatial Hessian is bounded:
With the threshold is . Candidates with (saddle points) are always rejected.
Image gradients are sampled in a neighbourhood rotated to the keypoint orientation. Samples are accumulated into a array of 8-bin orientation histograms with trilinear interpolation and Gaussian weighting (window half the descriptor window width). The resulting 128-element vector is L2-normalized, each element clamped at , then renormalized:
Procedure
- Double the image resolution (bilinear or similar); the assumed input blur becomes in the doubled image. Apply an additional Gaussian with to reach the target initial blur .
- For each octave, convolve the base image with Gaussians at scales to produce blurred images. Downsample by a factor of 2 to advance to the next octave.
- Subtract adjacent Gaussian-blurred images within each octave to obtain . This yields DoG images per octave.
- For each interior DoG sample, compare it with its 26 neighbours (8 in the same scale layer, 9 in the layer above, 9 in the layer below). Retain the sample if it is strictly greater than or less than all 26 neighbours.
- Fit a 3D quadratic to the DoG volume to compute . If , relocate and refit. Discard the keypoint if (low contrast).
- Compute the spatial Hessian at the refined location. Discard the keypoint if or .
- Compute gradient magnitudes and orientations in a Gaussian-weighted region (). Accumulate into a 36-bin histogram covering . Fit a parabola over each local peak and its two neighbours. Assign the dominant peak orientation; create additional keypoints for any secondary peaks within of the dominant peak height.
- Rotate the gradient sample grid to the assigned orientation. Accumulate a grid of 8-bin histograms with trilinear interpolation and Gaussian weighting. L2-normalize, clamp each element at , and renormalize to produce a 128-dimensional descriptor .
Implementation
The per-keypoint sub-pixel refinement and edge-response rejection in Rust, corresponding line-by-line to the Taylor-expansion localization and Hessian ratio test:
use nalgebra::{Matrix3, Vector3};
/// Returns the refined offset `x_hat` (in [x, y, sigma] sample units) and
/// the interpolated DoG value at the refined location, or `None` if the
/// keypoint fails the contrast or edge-response tests.
fn refine_keypoint(dog: &[[[f32; 3]; 3]; 3]) -> Option<(Vector3<f32>, f32)> {
let dx = 0.5 * (dog[1][1][2] - dog[1][1][0]);
let dy = 0.5 * (dog[1][2][1] - dog[1][0][1]);
let ds = 0.5 * (dog[2][1][1] - dog[0][1][1]);
let grad = Vector3::new(dx, dy, ds);
let d = dog[1][1][1];
let dxx = dog[1][1][2] - 2.0 * d + dog[1][1][0];
let dyy = dog[1][2][1] - 2.0 * d + dog[1][0][1];
let dss = dog[2][1][1] - 2.0 * d + dog[0][1][1];
let dxy = 0.25 * ((dog[1][2][2] - dog[1][2][0]) - (dog[1][0][2] - dog[1][0][0]));
let dxs = 0.25 * ((dog[2][1][2] - dog[2][1][0]) - (dog[0][1][2] - dog[0][1][0]));
let dys = 0.25 * ((dog[2][2][1] - dog[2][0][1]) - (dog[0][2][1] - dog[0][0][1]));
let hessian = Matrix3::new(dxx, dxy, dxs, dxy, dyy, dys, dxs, dys, dss);
let x_hat = hessian.try_inverse()? * (-grad);
let d_interp = d + 0.5 * grad.dot(&x_hat);
if d_interp.abs() < 0.03 {
return None;
}
let tr = dxx + dyy;
let det = dxx * dyy - dxy * dxy;
if det <= 0.0 || (tr * tr / det) >= 12.1 {
return None;
}
Some((x_hat, d_interp))
}
Remarks
- Gaussian pyramid construction is per pixel (fixed number of convolutions per octave); descriptor computation is per keypoint. A 500×500 image typically yields approximately 2000 stable keypoints.
- intervals per octave () is empirically optimal: fewer intervals miss extrema; more intervals detect increasingly unstable ones without improving correct-match count. Changing materially affects keypoint count and stability.
- The Hessian ratio threshold (giving the limit 12.1) is a geometric parameter that does not depend on image contrast. It is rarely tuned; values smaller than 10 produce tighter edge rejection and fewer keypoints on elongated structures.
- Low-texture or uniform regions produce no DoG extrema and thus no keypoints. Repetitive periodic patterns (brickwork, fabric, gratings) produce many ambiguous matches that the ratio test fails to disambiguate. Viewpoint changes beyond approximately 50° tilt for planar surfaces cause matching reliability to fall below 50%.
- The descriptor encodes gradient magnitude and orientation from a grayscale image only. Color extensions (RootSIFT, ColorSIFT) exist but are not part of this algorithm. For dense per-pixel correspondence, SIFT is not applicable — it is a sparse detector by design.
- The practitioner comparison of SIFT against the Harris corner detector is hosted on the Harris page (Harris 1988 is the older paper). See harris-corner-detector §When to choose SIFT over Harris.
- ORB (2011) couples FAST-9 with a steered, learned binary descriptor and matches via Hamming distance, eliminating the floating-point gradient histograms that dominate SIFT's per-keypoint cost. The two methods overlap on textured rigid scenes; SIFT's DoG blob detector localises better on graffiti-style monochromatic regions where FAST corners are sparse. See §When to choose SIFT over ORB for the practitioner-choice breakdown.
When to choose SIFT over FAST
SIFT (2004) is the older method; this page hosts the comparison.
Scale invariance. SIFT is scale-invariant by construction: the DoG pyramid spans multiple octaves and each keypoint carries an explicit scale . FAST operates at a single fixed scale; a scene captured at different resolutions or zoom levels will not produce matching FAST keypoints. Use SIFT whenever scale change between images is expected.
Descriptor bundled. SIFT produces a 128-dimensional descriptor alongside each detected point. FAST is a detector only and must be paired with a separate descriptor (BRIEF, BRISK, ORB) to produce matchable feature vectors. Use SIFT when a self-contained detect-and-describe pipeline is required.
Computational cost. FAST is an order of magnitude faster than SIFT on CPU: its ring-pixel brightness comparison requires no convolution and no floating-point gradient computation. FAST was designed for real-time video tracking under bounded scale change. Use FAST for high-frame-rate tracking where scale variation is controlled and descriptor cost matters.
Wide-baseline accuracy on textured rigid scenes. When images are taken from significantly different viewpoints (different zoom, different distance) and the scene is textured and approximately rigid, SIFT keypoints repeat more reliably and descriptors disambiguate matches more accurately than FAST + lightweight descriptors. Use SIFT for wide-baseline matching, panorama assembly, and structure-from-motion pipelines.
When to choose SIFT over ORB
SIFT (2004) is the older method; this page hosts the comparison.
Detector regime. SIFT detects scale-space extrema in a Difference-of-Gaussian pyramid — blob-like structures localised at a continuous scale . ORB detects FAST-9 corners at five discrete pyramid levels and ranks them by Harris cornerness. On graffiti-style monochromatic regions where corner responses are sparse, SIFT's blob detector finds keypoints that ORB does not. Use SIFT when the scene contains few corner-like structures and the image-matching budget allows continuous scale-space construction.
Subpixel scale. SIFT carries an explicit subpixel scale per keypoint via 3D quadratic interpolation in . ORB's scale is the discrete pyramid level the keypoint survived at — there is no sub-level interpolation. Use SIFT when downstream geometry (homography, fundamental-matrix estimation under wide-baseline) benefits from precise scale matching.
Descriptor capacity. SIFT emits a 128-D float descriptor with gradient-histogram cells; ORB emits a 256-bit binary descriptor with rotated pixel-pair tests. SIFT's higher descriptor capacity disambiguates more aggressively in dense scenes; ORB's Hamming-distance matching is integer-only and an order of magnitude faster but trades absolute discriminability for compute. Use SIFT when descriptor capacity matters more than match cost — bag-of-words retrieval over million-image corpora, re-identification, fine-grained classification.
Compute and storage. ORB is roughly two orders of magnitude faster than SIFT on the same CPU at — its integer-only pipeline (FAST + integral-image box-sums + bitwise tests + popcount) suits real-time tracking, embedded vision, and mobile devices. Storage is 32 bytes per ORB descriptor against 512 bytes per SIFT descriptor (128 floats). Use ORB when frame budget, descriptor storage, or matching throughput dominates the decision.
References
- D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), 2004. PDF
- C. Harris and M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988.