Goal
Detect corners in a grayscale image on pixel domain . Output: a set of integer pixel locations at which intensity varies strongly in two independent directions. The detector responds to local image structure where both principal curvatures of the local autocorrelation function are large — flat regions and edges are suppressed.
Algorithm
Let denote the image derivatives in the horizontal and vertical directions. The original kernel is the central difference and ; Sobel or Scharr kernels are common alternatives with improved high-frequency response. Let denote a Gaussian window with standard deviation :
Let denote the eigenvalues of at a given pixel.
The symmetric matrix summarising gradient covariance in a local neighbourhood.
with entries , , , so that and .
A rotation-invariant corner score expressed in terms of and to avoid explicit eigendecomposition.
where is an empirical sensitivity constant, typically in the range –. in corner regions, on edges, and small in flat regions.
Procedure
- Compute and across the image via central differences or Sobel operators.
- Form the per-pixel products , , and .
- Convolve each product with the Gaussian window to obtain the entries , , of at every pixel.
- Compute at every pixel.
- Discard pixels with .
- Apply non-maximum suppression: retain only pixels whose value is an 8-way local maximum.
Implementation
The per-pixel response computation in Rust:
fn harris_response(img: &[f32], w: usize, h: usize, sigma: f32, k: f32) -> Vec<f32> {
// Step 1: gradients via central differences
let (ix, iy) = gradients(img, w, h);
// Step 2: per-pixel products
let ixx: Vec<f32> = ix.iter().zip(&ix).map(|(a, b)| a * b).collect();
let iyy: Vec<f32> = iy.iter().zip(&iy).map(|(a, b)| a * b).collect();
let ixy: Vec<f32> = ix.iter().zip(&iy).map(|(a, b)| a * b).collect();
// Step 3: Gaussian-blur each component to obtain M's entries A, B, C
let a = gaussian_blur(&ixx, w, h, sigma);
let b = gaussian_blur(&iyy, w, h, sigma);
let c = gaussian_blur(&ixy, w, h, sigma);
// Step 4: R = det(M) - k * tr(M)^2 per pixel
a.iter().zip(&b).zip(&c)
.map(|((&aa, &bb), &cc)| (aa * bb - cc * cc) - k * (aa + bb).powi(2))
.collect()
}
gradients and gaussian_blur are standard helpers; gradients returns central-difference derivatives and gaussian_blur applies a separable Gaussian kernel. Each line of the kernel corresponds directly to a step in the Procedure above.
Remarks
- Complexity: per scale — gradient computation, three convolutions (one per product , , ), and a fixed scalar arithmetic pass per pixel. Separable Gaussian convolutions vectorize over rows.
- The sensitivity constant is typically in 0.04–0.06. The response is a continuous function of ; smaller values suppress edge responses less aggressively.
- The response is rotation-invariant by construction: and are both rotation-invariant quantities.
- Contrast scaling: under intensity scaling , the eigenvalues scale as and scales as . The threshold must be adapted per-image; a fixed across exposures yields inconsistent detection rates.
- The detector is not scale-invariant. Response peaks at the scale of the integration window . Multi-scale detection requires running the detector on a Gaussian image pyramid.
- The detector responds to general 2D intensity variation; it does not encode X-junction geometry. For chessboard calibration targets, a domain-specific detector yields higher selectivity.
- Scope: this page covers the corner-detection path only. The original paper additionally defines edge pixels as local minima in the dominant gradient direction, producing a combined edge-vertex map; that path is omitted here.
- ORB uses the Harris response to rank FAST keypoints rather than as a stand-alone detector: at each pyramid level, FAST-9 generates candidates, then orders them and the top- per level survive. The two-stage arrangement avoids the per-pixel cost of a full Harris pass while keeping the edge-rejection property — Harris's term penalises the edge-aligned eigenvalue ratio that FAST is known to produce on linear features.
When to choose Harris over Shi-Tomasi
Shi-Tomasi computes the same structure tensor but uses instead of . The two operators differ in three concrete ways:
| Harris | Shi-Tomasi | |
|---|---|---|
| Response form | ||
| Eigendecomposition | not required (uses and ) | required at every pixel |
| Hyperparameter | sensitivity constant | none beyond the threshold |
| Threshold meaning | algebraic units of gradient-squared-squared | directly the smaller principal curvature |
Choose Harris when (1) you want to skip the per-pixel for eigenvalue extraction — this matters in tight per-pixel budgets, e.g. embedded vision; (2) you are happy tuning once for your imagery and want a smooth response surface (Shi-Tomasi's has a non-differentiable ridge at ); (3) you need exactly the operator from the literature, since most reference implementations and many follow-on detectors (FAST, ChESS) use Harris as a comparison baseline.
Choose Shi-Tomasi when the threshold's meaning matters — Shi-Tomasi's -eigenvalue threshold transfers more cleanly across exposures and image sizes since it is the smaller principal curvature in physical units, whereas Harris's scales as under intensity scaling and requires per-image rescaling.
When to choose Harris over FAST
FAST replaces the gradient-based operator with a binary segment-test on a 16-pixel Bresenham ring around each candidate. The operators detect overlapping but distinct corner sets:
| Harris | FAST | |
|---|---|---|
| Per-pixel cost | 3 convolutions + arithmetic | 4–9 integer comparisons (cardinal-test early reject) |
| Output | continuous response, ranks corners | binary classification, no ranking |
| Noise robustness | strong (gradient averaging integrates noise) | weak (point intensity comparisons amplify noise) |
| Rotation invariance | yes by construction | yes, but the segment-test threshold shifts the operating point |
| Repeatability under blur | high | degrades faster than Harris |
Choose Harris when (1) you need a continuous ranking — Shi-Tomasi-style NMS works on Harris responses, while FAST returns a binary mask that downstream code typically re-ranks with Harris anyway; (2) you have a noise budget — Rosten 2006 reports FAST's repeatability collapses faster than Harris under additive noise (Figure 6D); (3) the speed gap doesn't pay for itself in your pipeline, since pairing FAST with Harris for ranking eliminates FAST's main advantage.
Choose FAST when sub-millisecond per-pixel detection is the bottleneck and downstream stages tolerate the binary output — typical in real-time visual SLAM front-ends where the detector's recall, not its rank quality, is what matters.
When to choose Harris over ChESS
ChESS is X-junction-specific by construction: it computes a sign-alternation count on a 16-pixel ring tuned to detect 4-quadrant alternating intensity patterns. Harris responds to any 2D intensity variation — corners, blobs, T-junctions, generic interest points — without target-specific tuning.
| Harris | ChESS | |
|---|---|---|
| Detects | general 2D corners | X-junctions only |
| On a chessboard | fires at X-corners, edge intersections, and texture | fires only at X-corners (rejects edges and blobs) |
| Per-pixel cost | comparable | comparable (16 ring samples + small arithmetic) |
| Sub-pixel refinement | not built in | not built in (commonly paired with Hessian saddle) |
| Contrast sensitivity | scales as | adaptive threshold via mean-of-ring (more contrast-portable) |
Choose Harris when the input is not a chessboard — Harris is the right tool for SfM front-ends, image matching, and tracking where any salient interest point is useful. Choose ChESS when the input is a chessboard calibration target and you need to suppress every false candidate (edge crossings, board borders, background texture) — Harris on a chessboard typically requires a downstream topology filter (Shu, Laureano) to remove the non-X-corner candidates that ChESS rejects at source.
When to choose Harris over SIFT
SIFT is a four-stage cascade — DoG scale-space extrema, sub-pixel refinement, canonical orientation, 128-D descriptor — engineered for scale-invariant matching across images. Harris is a single-scale corner-response operator. The two solve different problems and overlap only at "where to detect" in a fixed-scale image.
| Harris | SIFT | |
|---|---|---|
| Scale handling | single scale of integration window ; multi-scale by external pyramid | scale-invariant by construction (DoG pyramid, automatic scale selection) |
| Output | corner-response score per pixel | keypoint + 128-D descriptor |
| Per-detection cost | gradient + 3 convolutions + arithmetic | full pyramid, refinement, orientation histogram, descriptor — orders of magnitude more |
| Descriptor | none — pair with external descriptor | bundled 128-D L2-normalised vector |
| Repeatability under viewpoint change | falls off with any zoom or scale change | tolerates wide-baseline scale and rotation |
Choose Harris when (1) input scale is controlled — calibration-target detection, real-time tracking with bounded zoom, video stabilisation; (2) you need a per-pixel response surface for downstream filtering or saddle refinement, not a sparse keypoint set; (3) per-frame compute budget is sub-millisecond and the downstream stage does not need a descriptor.
Choose SIFT when (1) the matching task spans different image resolutions, distances, or focal lengths and the detector must repeat across scale; (2) a self-contained detect-and-describe pipeline is required (matching, recognition, panorama assembly, structure-from-motion); (3) wide-baseline accuracy on textured rigid scenes is the success criterion and Harris+ad-hoc descriptor pairings have proven inadequate.
References
- C. Harris, M. J. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988. DOI: 10.5244/c.2.23
- J. Shi, C. Tomasi. Good Features to Track. IEEE CVPR, 1994. DOI: 10.1109/CVPR.1994.323794