Goal

Detect corners in a grayscale image $I: \Omega \to \mathbb{R}$ on pixel domain $\Omega \subset \mathbb{Z}^2$ . Output: a set of integer pixel locations $\{(x_i, y_i)\} \subset \Omega$ at which intensity varies strongly in two independent directions. The response is the smaller eigenvalue of the gradient structure tensor — the quantity that directly bounds the worst-conditioned direction of local intensity variation. The acceptance threshold $\tau$ is not a heuristic: it is derived in §3 of the paper from the condition that the linear system governing feature displacement must be both above the noise level and well-conditioned for reliable tracking.

Algorithm

Let $I_x, I_y: \Omega \to \mathbb{R}$ denote the image derivatives in the horizontal and vertical directions, approximated by central differences or Sobel operators. Let $w(u, v)$ denote a Gaussian window with standard deviation $\sigma$ :

w(u, v) = \exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right).

Let $\lambda_1 \geq \lambda_2$ denote the eigenvalues of $M$ at a given pixel.

Definition

Structure tensor (M)

The $2 \times 2$ symmetric matrix summarising gradient covariance in a local neighbourhood, identical to the structure tensor of Harris.

M = \sum_{(u,v)} w(u,v) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix},

with entries $A = I_x^2 \otimes w$ , $B = I_y^2 \otimes w$ , $C = (I_x I_y) \otimes w$ , so that $\operatorname{tr}(M) = A + B = \lambda_1 + \lambda_2$ and $\det(M) = AB - C^2 = \lambda_1 \lambda_2$ .

Definition

Shi-Tomasi response (R)

The smaller eigenvalue of $M$ :

R = \lambda_2 = \min(\lambda_1, \lambda_2).

Computed in closed form from $\operatorname{tr}(M)$ and $\det(M)$ as

\lambda_{1,2} = \tfrac{1}{2}\left(\operatorname{tr}(M) \pm \sqrt{\operatorname{tr}(M)^2 - 4\det(M)}\right),

R = \tfrac{1}{2}\left(\operatorname{tr}(M) - \sqrt{\operatorname{tr}(M)^2 - 4\det(M)}\right).

Procedure

Algorithm

Shi-Tomasi corner detection

Input: Grayscale image

I

on domain

\Omega

; Gaussian window parameter

\sigma

; threshold

\tau

Output: Set of integer pixel locations

\{(x_i, y_i)\}

marking detected corners.

Compute $I_x$ and $I_y$ across the image via central differences or Sobel operators.
Form the per-pixel products $I_x^2$ , $I_y^2$ , and $I_x I_y$ .
Convolve each product with the Gaussian window $w$ to obtain the entries $A$ , $B$ , $C$ of $M$ at every pixel.
Compute $R(x, y) = \tfrac{1}{2}\!\left((A+B) - \sqrt{(A+B)^2 - 4(AB - C^2)}\right)$ at every pixel.
Discard pixels with $R(x, y) \leq \tau$ . The threshold $\tau$ encodes a feature-tracking quality bound: a small $\lambda_2$ implies an aperture problem in at least one direction, preventing reliable displacement estimation (§3 of the paper).
Apply non-maximum suppression: retain only pixels whose $R$ value is an 8-way local maximum.

Implementation

The per-pixel response computation in Rust:

fn shi_tomasi_response(img: &[f32], w: usize, h: usize, sigma: f32) -> Vec<f32> {
    let (ix, iy) = gradients(img, w, h);
    let ixx: Vec<f32> = ix.iter().zip(&ix).map(|(a, b)| a * b).collect();
    let iyy: Vec<f32> = iy.iter().zip(&iy).map(|(a, b)| a * b).collect();
    let ixy: Vec<f32> = ix.iter().zip(&iy).map(|(a, b)| a * b).collect();

    let a = gaussian_blur(&ixx, w, h, sigma);
    let b = gaussian_blur(&iyy, w, h, sigma);
    let c = gaussian_blur(&ixy, w, h, sigma);

    a.iter().zip(&b).zip(&c)
        .map(|((&aa, &bb), &cc)| {
            let trace = aa + bb;
            let det = aa * bb - cc * cc;
            let disc = (trace * trace - 4.0 * det).max(0.0).sqrt();
            0.5 * (trace - disc)
        })
        .collect()
}

gradients and gaussian_blur are standard helpers; gradients returns central-difference derivatives and gaussian_blur applies a separable Gaussian kernel. The .max(0.0) clamp before .sqrt() guards against floating-point round-off: the discriminant $\operatorname{tr}(M)^2 - 4\det(M) = (\lambda_1 - \lambda_2)^2$ is non-negative by construction, but finite-precision arithmetic can produce a small negative value.

Remarks

Complexity: $O(|\Omega|)$ per scale — gradient computation, three convolutions (one per product $I_x^2$ , $I_y^2$ , $I_x I_y$ ), and a fixed scalar arithmetic pass per pixel. Identical to Harris.
The closed form for $\min(\lambda_1, \lambda_2)$ avoids explicit eigendecomposition; the only extra cost over Harris is one square root per pixel.
The response is rotation-invariant: $R$ is a function of $\operatorname{tr}(M)$ and $\det(M)$ , both of which are rotation-invariant quantities.
The detector is not scale-invariant. Response peaks at the scale of the integration window $\sigma$ . Multi-scale detection requires running the detector on a Gaussian image pyramid.
No free sensitivity parameter analogous to Harris's $k$ : the threshold $\tau$ has a direct interpretation as a tracking-quality bound derived from the linear displacement system (equation (8) of the paper).
Compared with Harris: see When to choose Harris over Shi-Tomasi on the Harris page, which hosts the comparison per the older-paper-hosts rule.

When to choose Shi-Tomasi over SIFT

SIFT detects scale-space extrema in a DoG pyramid and bundles a 128-D descriptor; Shi-Tomasi computes a single-scale corner-response score from the structure-tensor minimum eigenvalue. They solve different problems — Shi-Tomasi locates corners at a fixed integration scale, SIFT produces matchable keypoints across scale.

	Shi-Tomasi	SIFT
Scale handling	single scale; multi-scale only by external pyramid	scale-invariant by construction
Output	corner-response score per pixel	keypoint $(x, y, \sigma, \theta)$ + 128-D descriptor
Per-detection cost	gradient + 3 convolutions + 1 sqrt	full DoG pyramid, sub-pixel refinement, orientation histogram, descriptor
Descriptor	none	bundled 128-D L2-normalised vector
Threshold meaning	minimum-eigenvalue value, transfers across exposures	DoG contrast threshold + Hessian-ratio edge test

Choose Shi-Tomasi when (1) scale is controlled by the application — KLT tracking is the canonical use case (Shi-Tomasi corners are the highest-quality features for the linear displacement model the paper derives); (2) the exposure-portable threshold matters more than scale invariance; (3) downstream code does its own descriptor or operates on raw patches.

Choose SIFT when (1) images differ in scale, distance, or focal length and corners must repeat across scale; (2) a single-step detect-and-describe pipeline is required; (3) the matching task is wide-baseline rather than frame-to-frame tracking.

References

J. Shi, C. Tomasi. Good Features to Track. IEEE CVPR, 1994. DOI: 10.1109/CVPR.1994.323794
C. Harris, M. J. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988. DOI: 10.5244/c.2.23

Shi-Tomasi Corner Detector

Goal

Algorithm

Procedure

Implementation

Remarks

When to choose Shi-Tomasi over SIFT

References

Prerequisites

Extends

Compared with

Has learned alternative