Harris Corner Detector | VitaVision
Back to atlas

Harris Corner Detector

11 min readIntermediateView in graph
Based on
A Combined Corner and Edge Detector
Harris, Stephens · Alvey Vision Conference 1988
DOI ↗

Goal

Detect corners in a grayscale image I:ΩRI: \Omega \to \mathbb{R} on pixel domain ΩZ2\Omega \subset \mathbb{Z}^2. Output: a set of integer pixel locations {(xi,yi)}Ω\{(x_i, y_i)\} \subset \Omega at which intensity varies strongly in two independent directions. The detector responds to local image structure where both principal curvatures of the local autocorrelation function are large — flat regions and edges are suppressed.

Algorithm

Let Ix,Iy:ΩRI_x, I_y: \Omega \to \mathbb{R} denote the image derivatives in the horizontal and vertical directions. The original kernel is the central difference Ix=I(1,0,1)I_x = I \otimes (-1, 0, 1) and Iy=I(1,0,1)I_y = I \otimes (-1, 0, 1)^\top; Sobel or Scharr kernels are common alternatives with improved high-frequency response. Let w(u,v)w(u, v) denote a Gaussian window with standard deviation σ\sigma:

w(u,v)=exp ⁣(u2+v22σ2).w(u, v) = \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right).

Let α,β\alpha, \beta denote the eigenvalues of MM at a given pixel.

Definition
Structure tensor (M)

The 2×22 \times 2 symmetric matrix summarising gradient covariance in a local neighbourhood.

M=(u,v)w(u,v)[Ix2IxIyIxIyIy2],M = \sum_{(u,v)} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},

with entries A=Ix2wA = I_x^2 \otimes w, B=Iy2wB = I_y^2 \otimes w, C=(IxIy)wC = (I_x I_y) \otimes w, so that tr(M)=A+B=α+β\operatorname{tr}(M) = A + B = \alpha + \beta and det(M)=ABC2=αβ\det(M) = AB - C^2 = \alpha\beta.

Definition
Harris response (R)

A rotation-invariant corner score expressed in terms of tr(M)\operatorname{tr}(M) and det(M)\det(M) to avoid explicit eigendecomposition.

R=det(M)ktr(M)2,R = \det(M) - k \cdot \operatorname{tr}(M)^2,

where kk is an empirical sensitivity constant, typically in the range 0.040.040.060.06. R>0R > 0 in corner regions, R<0R < 0 on edges, and R|R| small in flat regions.

Procedure

Algorithm
Harris corner detection
Input: Grayscale image II on domain Ω\Omega; Gaussian window parameter σ\sigma; sensitivity constant kk; threshold τ\tau.
Output: Set of integer pixel locations {(xi,yi)}\{(x_i, y_i)\} marking detected corners.
  1. Compute IxI_x and IyI_y across the image via central differences or Sobel operators.
  2. Form the per-pixel products Ix2I_x^2, Iy2I_y^2, and IxIyI_x I_y.
  3. Convolve each product with the Gaussian window ww to obtain the entries AA, BB, CC of MM at every pixel.
  4. Compute R(x,y)=(ABC2)k(A+B)2R(x, y) = (A \cdot B - C^2) - k \cdot (A + B)^2 at every pixel.
  5. Discard pixels with R(x,y)τR(x, y) \leq \tau.
  6. Apply non-maximum suppression: retain only pixels whose RR value is an 8-way local maximum.

Implementation

The per-pixel response computation in Rust:

fn harris_response(img: &[f32], w: usize, h: usize, sigma: f32, k: f32) -> Vec<f32> {
    // Step 1: gradients via central differences
    let (ix, iy) = gradients(img, w, h);
    // Step 2: per-pixel products
    let ixx: Vec<f32> = ix.iter().zip(&ix).map(|(a, b)| a * b).collect();
    let iyy: Vec<f32> = iy.iter().zip(&iy).map(|(a, b)| a * b).collect();
    let ixy: Vec<f32> = ix.iter().zip(&iy).map(|(a, b)| a * b).collect();
    // Step 3: Gaussian-blur each component to obtain M's entries A, B, C
    let a = gaussian_blur(&ixx, w, h, sigma);
    let b = gaussian_blur(&iyy, w, h, sigma);
    let c = gaussian_blur(&ixy, w, h, sigma);
    // Step 4: R = det(M) - k * tr(M)^2 per pixel
    a.iter().zip(&b).zip(&c)
        .map(|((&aa, &bb), &cc)| (aa * bb - cc * cc) - k * (aa + bb).powi(2))
        .collect()
}

gradients and gaussian_blur are standard helpers; gradients returns central-difference derivatives and gaussian_blur applies a separable Gaussian kernel. Each line of the kernel corresponds directly to a step in the Procedure above.

Remarks

  • Complexity: O(Ω)O(|\Omega|) per scale — gradient computation, three convolutions (one per product Ix2I_x^2, Iy2I_y^2, IxIyI_x I_y), and a fixed scalar arithmetic pass per pixel. Separable Gaussian convolutions vectorize over rows.
  • The sensitivity constant kk is typically in 0.04–0.06. The response is a continuous function of kk; smaller values suppress edge responses less aggressively.
  • The response is rotation-invariant by construction: tr(M)\operatorname{tr}(M) and det(M)\det(M) are both rotation-invariant quantities.
  • Contrast scaling: under intensity scaling IρII \to \rho I, the eigenvalues α,β\alpha, \beta scale as ρ2\rho^2 and RR scales as ρ4\rho^4. The threshold τ\tau must be adapted per-image; a fixed τ\tau across exposures yields inconsistent detection rates.
  • The detector is not scale-invariant. Response peaks at the scale of the integration window σ\sigma. Multi-scale detection requires running the detector on a Gaussian image pyramid.
  • The detector responds to general 2D intensity variation; it does not encode X-junction geometry. For chessboard calibration targets, a domain-specific detector yields higher selectivity.
  • Scope: this page covers the corner-detection path only. The original paper additionally defines edge pixels as R<0R < 0 local minima in the dominant gradient direction, producing a combined edge-vertex map; that path is omitted here.
  • ORB uses the Harris response to rank FAST keypoints rather than as a stand-alone detector: at each pyramid level, FAST-9 generates candidates, then RR orders them and the top-NN per level survive. The two-stage arrangement avoids the per-pixel cost of a full Harris pass while keeping the edge-rejection property — Harris's det(M)\det(M) term penalises the edge-aligned eigenvalue ratio that FAST is known to produce on linear features.

When to choose Harris over Shi-Tomasi

Shi-Tomasi computes the same structure tensor MM but uses R=min(λ1,λ2)R = \min(\lambda_1, \lambda_2) instead of det(M)ktr(M)2\det(M) - k\,\mathrm{tr}(M)^2. The two operators differ in three concrete ways:

Harris Shi-Tomasi
Response form λ1λ2k(λ1+λ2)2\lambda_1\lambda_2 - k(\lambda_1+\lambda_2)^2 min(λ1,λ2)\min(\lambda_1, \lambda_2)
Eigendecomposition not required (uses det\det and tr\mathrm{tr}) required at every pixel
Hyperparameter sensitivity constant k[0.04,0.06]k \in [0.04, 0.06] none beyond the threshold
Threshold meaning algebraic units of gradient-squared-squared directly the smaller principal curvature

Choose Harris when (1) you want to skip the per-pixel \sqrt{\,} for eigenvalue extraction — this matters in tight per-pixel budgets, e.g. embedded vision; (2) you are happy tuning kk once for your imagery and want a smooth response surface (Shi-Tomasi's min\min has a non-differentiable ridge at λ1=λ2\lambda_1 = \lambda_2); (3) you need exactly the operator from the literature, since most reference implementations and many follow-on detectors (FAST, ChESS) use Harris as a comparison baseline.

Choose Shi-Tomasi when the threshold's meaning matters — Shi-Tomasi's min\min-eigenvalue threshold transfers more cleanly across exposures and image sizes since it is the smaller principal curvature in physical units, whereas Harris's RR scales as ρ4\rho^4 under intensity scaling IρII \to \rho I and requires per-image rescaling.

When to choose Harris over FAST

FAST replaces the gradient-based operator with a binary segment-test on a 16-pixel Bresenham ring around each candidate. The operators detect overlapping but distinct corner sets:

Harris FAST
Per-pixel cost 3 convolutions + arithmetic 4–9 integer comparisons (cardinal-test early reject)
Output continuous response, ranks corners binary classification, no ranking
Noise robustness strong (gradient averaging integrates noise) weak (point intensity comparisons amplify noise)
Rotation invariance yes by construction yes, but the segment-test threshold tt shifts the operating point
Repeatability under blur high degrades faster than Harris

Choose Harris when (1) you need a continuous ranking — Shi-Tomasi-style NMS works on Harris responses, while FAST returns a binary mask that downstream code typically re-ranks with Harris anyway; (2) you have a noise budget — Rosten 2006 reports FAST's repeatability collapses faster than Harris under additive noise (Figure 6D); (3) the speed gap doesn't pay for itself in your pipeline, since pairing FAST with Harris for ranking eliminates FAST's main advantage.

Choose FAST when sub-millisecond per-pixel detection is the bottleneck and downstream stages tolerate the binary output — typical in real-time visual SLAM front-ends where the detector's recall, not its rank quality, is what matters.

When to choose Harris over ChESS

ChESS is X-junction-specific by construction: it computes a sign-alternation count on a 16-pixel ring tuned to detect 4-quadrant alternating intensity patterns. Harris responds to any 2D intensity variation — corners, blobs, T-junctions, generic interest points — without target-specific tuning.

Harris ChESS
Detects general 2D corners X-junctions only
On a chessboard fires at X-corners, edge intersections, and texture fires only at X-corners (rejects edges and blobs)
Per-pixel cost comparable comparable (16 ring samples + small arithmetic)
Sub-pixel refinement not built in not built in (commonly paired with Hessian saddle)
Contrast sensitivity RR scales as ρ4\rho^4 adaptive threshold via mean-of-ring (more contrast-portable)

Choose Harris when the input is not a chessboard — Harris is the right tool for SfM front-ends, image matching, and tracking where any salient interest point is useful. Choose ChESS when the input is a chessboard calibration target and you need to suppress every false candidate (edge crossings, board borders, background texture) — Harris on a chessboard typically requires a downstream topology filter (Shu, Laureano) to remove the non-X-corner candidates that ChESS rejects at source.

When to choose Harris over SIFT

SIFT is a four-stage cascade — DoG scale-space extrema, sub-pixel refinement, canonical orientation, 128-D descriptor — engineered for scale-invariant matching across images. Harris is a single-scale corner-response operator. The two solve different problems and overlap only at "where to detect" in a fixed-scale image.

Harris SIFT
Scale handling single scale of integration window σ\sigma; multi-scale by external pyramid scale-invariant by construction (DoG pyramid, automatic scale selection)
Output corner-response score per pixel keypoint (x,y,σ,θ)(x, y, \sigma, \theta) + 128-D descriptor
Per-detection cost gradient + 3 convolutions + arithmetic full pyramid, refinement, orientation histogram, descriptor — orders of magnitude more
Descriptor none — pair with external descriptor bundled 128-D L2-normalised vector
Repeatability under viewpoint change falls off with any zoom or scale change tolerates wide-baseline scale and rotation

Choose Harris when (1) input scale is controlled — calibration-target detection, real-time tracking with bounded zoom, video stabilisation; (2) you need a per-pixel response surface for downstream filtering or saddle refinement, not a sparse keypoint set; (3) per-frame compute budget is sub-millisecond and the downstream stage does not need a descriptor.

Choose SIFT when (1) the matching task spans different image resolutions, distances, or focal lengths and the detector must repeat across scale; (2) a self-contained detect-and-describe pipeline is required (matching, recognition, panorama assembly, structure-from-motion); (3) wide-baseline accuracy on textured rigid scenes is the success criterion and Harris+ad-hoc descriptor pairings have proven inadequate.

References

  1. C. Harris, M. J. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988. DOI: 10.5244/c.2.23
  2. J. Shi, C. Tomasi. Good Features to Track. IEEE CVPR, 1994. DOI: 10.1109/CVPR.1994.323794

Feeds into

Has learned alternative

  • SuperPoint

    SuperPoint replaces classical sparse keypoint+descriptor pipelines (Harris/Shi-Tomasi + SIFT/ORB) with a single learned model; it does not literally re-implement the Harris response.