Image Gradient | VitaVision
Back to atlas

Image Gradient

9 min readIntermediateView in graph

Definition

The image gradient at a pixel (x,y)(x, y) is the 2-vector of partial derivatives of the image intensity function I(x,y)I(x, y):

I(x,y)=[IxIy]=[I/xI/y].\nabla I(x, y) = \begin{bmatrix} I_x \\ I_y \end{bmatrix} = \begin{bmatrix} \partial I / \partial x \\ \partial I / \partial y \end{bmatrix}.

It points in the direction of steepest ascent of II and has magnitude equal to the rate of change in that direction. Because II is defined on a discrete pixel grid, I\nabla I is computed by convolving II with a discrete derivative kernel rather than by analytic differentiation.

The gradient is the foundational quantity in image analysis: edges, corners, texture descriptors, optical flow, and calibration target detectors are all built on IxI_x and IyI_y. It is not itself a feature; it is the raw material from which features are constructed.

Mathematical Description

Discrete derivative kernels

On a discrete image I:Z2RI: \mathbb{Z}^2 \to \mathbb{R}, differentiation in the xx-direction is approximated by a finite-difference kernel dxd_x convolved row-wise. The three standard choices trade isotropy against noise sensitivity:

Definition
Forward difference

The simplest approximation; one-pixel support.

dx=[011],dy=[011].d_x = \begin{bmatrix} 0 & -1 & 1 \end{bmatrix}, \quad d_y = \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}.
Definition
Central difference

Symmetric; better approximation of the derivative at the sample point. Not separable as a 2-D kernel but used column- or row-wise.

dx=[12012].d_x = \begin{bmatrix} -\tfrac{1}{2} & 0 & \tfrac{1}{2} \end{bmatrix}.
Definition
Sobel kernel

Combines differentiation in one axis with smoothing in the perpendicular axis. Separable: kx=ksdxk_x = k_s \otimes d_x where ks=[1,2,1]Tk_s = [1, 2, 1]^T is a binomial smoother.

kx=[101202101],ky=[121000121].k_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \quad k_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.

The Scharr kernel is a 3×3 optimised variant of Sobel with improved rotational isotropy. The Prewitt kernel is the same structure with uniform row weights [1,1,1][1, 1, 1] instead of [1,2,1][1, 2, 1].

Gradient of a Gaussian

Smoothing before differentiation is the standard practice for noisy images. Let GσG_\sigma denote the Gaussian with standard deviation σ\sigma. By the commutativity of convolution and differentiation:

Ix(σ)=IxGσ=IGσdx,I_x^{(\sigma)} = I * \partial_x G_\sigma = I * G_\sigma * d_x,

so computing the smoothed gradient is equivalent to convolving with the derivative of a Gaussian. The parameter σ\sigma sets the spatial scale at which structure is detected: small σ\sigma resolves fine structure but amplifies noise; large σ\sigma suppresses noise but smears thin edges.

Gradient magnitude and direction

Definition
Gradient magnitude

The Euclidean norm of the gradient vector; measures edge strength.

I(x,y)=Ix2+Iy2.|\nabla I|(x, y) = \sqrt{I_x^2 + I_y^2}.
Definition
Gradient direction

The angle of steepest ascent, measured from the positive xx-axis.

θ(x,y)=atan2(Iy,Ix)(π,π].\theta(x, y) = \operatorname{atan2}(I_y, I_x) \in (-\pi, \pi].

For edge detection and oriented descriptor computation the gradient direction is taken modulo π\pi (unsigned orientation), since an edge gradient points perpendicular to the edge and the sign depends on which side is brighter. For optical flow and other vector-field applications the full 2π2\pi range is retained.

Separability and the structure tensor

The outer product I(I)T\nabla I \, (\nabla I)^T is a 2×22\times 2 rank-1 matrix whose entries are Ix2I_x^2, IxIyI_x I_y, and Iy2I_y^2. Summing this outer product over a local window produces the structure tensor, which encodes the dominant gradient orientations in the neighbourhood. Every corner detector based on the structure tensor — Harris, Shi-Tomasi — is directly built from IxI_x and IyI_y.

Numerical Concerns

Discretization error. No 3×3 kernel exactly recovers the continuous derivative of the underlying scene radiance. The Sobel kernel has a flat frequency response over low frequencies but rolls off at high frequencies. The Scharr kernel minimises the angular error in the frequency domain. The central-difference kernel has better theoretical accuracy but no built-in smoothing.

Pre-smoothing is mandatory before differentiation. Differentiating a raw image amplifies high-frequency noise quadratically in the derivative power spectrum. In practice, computing (GσI)\nabla(G_\sigma * I) with σ0.5\sigma \geq 0.5 pixels is the minimum viable preprocessing. Skipping pre-smoothing produces gradient maps dominated by sensor noise rather than scene structure.

Scale selection. The choice of σ\sigma is a free parameter. Gradient-based detectors are only consistent across images if the same σ\sigma is used. Multi-scale detectors (scale-space methods) compute I\nabla I at several values of σ\sigma and select the response scale that maximizes a scale-normalized measure.

Border handling. Convolution is undefined at pixel locations within k/2\lfloor k/2 \rfloor of the image boundary (for a kernel of width kk). Common strategies are: replicate the border pixel (constant extrapolation), reflect the image (symmetric extension), or zero-pad. Each choice affects the gradient values in a border strip of width equal to the kernel radius.

Sub-pixel interpretation. Even though IxI_x and IyI_y are computed at integer pixel positions, they are used in downstream algorithms (e.g., Lucas-Kanade optical flow, subpixel corner refinement) as if they represent the derivative of a continuous function interpolated from the discrete values. The interpolation model is implicit and depends on the kernel used.

Dynamic range and normalization. The Sobel kernel output for an 8-bit image spans roughly [4×255,+4×255][-4 \times 255, +4 \times 255] before any normalization. Implementations that accumulate Ix2+Iy2I_x^2 + I_y^2 in integer arithmetic must use 32-bit accumulators to avoid overflow when forming the structure tensor.

Numerical accuracy of atan2. Computing the gradient direction via atan2(Iy, Ix) is well-defined except at (0,0)(0, 0), where the gradient is undefined. Downstream algorithms that bin gradient orientations (e.g., SIFT descriptor histograms) must guard against the zero-magnitude case.

Where it appears

The image gradient is the lowest-level quantity on which feature detection and image analysis are built. Nearly every algorithm on this site that operates on raw pixel data computes I\nabla I as its first step.

  • harris-corner-detector — builds the structure tensor from Ix2I_x^2, IxIyI_x I_y, Iy2I_y^2; the Harris response R=det(M)ktr(M)2R = \det(M) - k\,\mathrm{tr}(M)^2 is a function of these gradient products.
  • shi-tomasi-corner-detector — uses the identical gradient-based structure tensor; replaces the Harris response with the minimum eigenvalue min(λ1,λ2)\min(\lambda_1, \lambda_2).
  • chess-corners — the ChESS detector samples gradient-derived intensity contrasts on a ring pattern; gradient orientation is used to compute the dominant direction.
  • fast-corner-detector — does not use gradients directly; pixel-intensity comparisons on a circle substitute for gradient computation, which is one reason FAST is faster than Harris.
  • pyramidal-blur-aware-xcorner — operates on an image pyramid, computing gradients at each pyramid level; scale selection is driven by gradient-based saddle-point measures.
  • loy-fast-radial-symmetry — votes along the gradient orientation g^(p)=g(p)/g(p)\hat{\mathbf{g}}(p) = \mathbf{g}(p)/\|\mathbf{g}(p)\| at each pixel; positively- and negatively-affected pixels at distance nn accumulate magnitude and orientation contributions, yielding a symmetry-contribution map per radius.
  • sift — gradient magnitude m(x,y)=(Lx+1Lx1)2+(Ly+1Ly1)2m(x,y) = \sqrt{(L_{x+1}-L_{x-1})^2 + (L_{y+1}-L_{y-1})^2} and orientation θ(x,y)=arctan((Ly+1Ly1)/(Lx+1Lx1))\theta(x,y) = \arctan\bigl((L_{y+1}-L_{y-1})/(L_{x+1}-L_{x-1})\bigr) are the fundamental inputs to both orientation assignment (36-bin histogram, σw=1.5×σkeypoint\sigma_w = 1.5 \times \sigma_\text{keypoint}) and descriptor construction (4×4 array of 8-bin histograms; 128-D total). One of the most cited downstream consumers of image-gradient computation.
  • lucas-kanade — enters the registration update in two roles: as the per-pixel rate of intensity change with respect to displacement (linearising the photometric residual) and as the outer-product sum that forms the n×nn \times n coefficient matrix of the per-iteration normal equation.
  • horn-schunck — extends the gradient to the temporal axis: (Ex,Ey,Et)(E_x, E_y, E_t) is estimated as the average of four first differences taken along parallel edges of a 2×2×22 \times 2 \times 2 spatiotemporal cube, ensuring all three partial derivatives refer to the same cube-centre in (x,y,t)(x, y, t). The brightness-constancy equation Exu+Eyv+Et=0E_x u + E_y v + E_t = 0 links the spatiotemporal gradient to the per-pixel flow velocity.

References

  • C. Harris, M. Stephens. A Combined Corner and Edge Detector. Alvey Vision Conference, 1988. Defines the structure tensor directly from image partial derivatives.
  • D. Forsyth, J. Ponce. Computer Vision: A Modern Approach. 2nd ed. Prentice Hall, 2011. §5 covers linear filtering and discrete differentiation.
  • R. Szeliski. Computer Vision: Algorithms and Applications. 2nd ed. Springer, 2022. §3.2 covers image gradients; §3.3 covers Gaussian blur and scale-space.
  • J. Prewitt. "Object Enhancement and Extraction." Picture Processing and Psychopictorics, 1970. Original Prewitt kernel.
  • H. Scharr. Optimale Operatoren in der Digitalen Bildverarbeitung. Dissertation, Universität Heidelberg, 2000. Derivation of the Scharr kernel as the isotropy-optimal 3×3 derivative filter.