Image Pyramid

Based on
Distinctive Image Features from Scale-Invariant Keypoints
Lowe · International Journal of Computer Vision 2004

Definition

An image pyramid is a discrete multi-resolution representation of a single image: a finite sequence of images at progressively coarser spatial resolution, each derived from its predecessor by smoothing and downsampling.

Definition
Gaussian image pyramid

Given an input image $L_0 = I$, each successive pyramid level is formed by convolving with a Gaussian kernel $G_\sigma$ and then halving the spatial resolution,

$$L_{k+1} = \mathrm{down}\!\left(G_\sigma * L_k\right),$$

where $\mathrm{down}(\cdot)$ discards every other row and column. Input: a 2-D image $I$ of size $W \times H$. Output: a stack of $K+1$ images at resolutions $W/2^k \times H/2^k$ for $k = 0, \dots, K$.
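The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`gaussian_kernel`, `blur`, `build_pyramid`), the 3-sigma kernel truncation, the edge-replicating border, and the default `sigma=1.0` are all assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel, truncated at about 3 sigma (assumed)."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-(x * x) / (2.0 * sigma * sigma))
    return kernel / kernel.sum()


def blur(image, sigma):
    """Separable Gaussian blur with edge replication at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(image, dtype=np.float64), radius, mode="edge")
    # Filter along rows, then along columns; "valid" trims the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def build_pyramid(image, num_levels, sigma=1.0):
    """Return [L_0, ..., L_K] with L_{k+1} = down(G_sigma * L_k)."""
    levels = [np.asarray(image, dtype=np.float64)]
    for _ in range(num_levels):
        smoothed = blur(levels[-1], sigma)
        levels.append(smoothed[::2, ::2])  # down(): keep every other row and column
    return levels
```

Calling `build_pyramid(image, 3)` on a 64 x 48 image returns four arrays, each half the width and height of the one before it.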

The pyramid groups levels into octaves: within an octave the resolution is fixed while the effective smoothing scale doubles; between octaves the image is downsampled by two. Each octave is divided into $s$ intra-octave sub-levels, giving a geometric progression of effective scales $\sigma_{o,j} = \sigma_0 \cdot 2^{o + j/s}$ for octave $o$ and sub-level $j$. The continuous object the pyramid samples is the scale space, the one-parameter family $L(x, y; \sigma) = G_\sigma * I$; the pyramid is its sampled, downsampled realisation.

Mathematical Description

Gaussian pyramid

Within a single octave the images are obtained by successive blurring at scale ratios $k^j$ with

$$k = 2^{1/s},$$

and the blur at sub-level $j$ of octave $o$ is $\sigma_{o,j} = \sigma_0 \cdot 2^{o + j/s}$. The image starting octave $o+1$ is obtained by downsampling the sub-level of octave $o$ whose blur is twice that octave's initial blur, that is sub-level $j = s$. SIFT uses $s = 3$ sub-levels per octave, so $k = 2^{1/3} \approx 1.26$, and holds $s + 3 = 6$ Gaussian-blurred images per octave. These yield $s + 2 = 5$ Difference-of-Gaussian images, so extrema can be bracketed by a scale above and a scale below over a full octave without boundary gaps. The base scale is $\sigma_0 = 1.6$, and the input is upsampled by two before the first octave to recover fine structure.
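As a worked instance of the scale schedule, the sketch below lists the blur of each Gaussian image in one octave. The function name `octave_sigmas` is hypothetical; the defaults are the SIFT values quoted above ($\sigma_0 = 1.6$, $s = 3$, $s + 3$ images per octave).

```python
def octave_sigmas(octave, sigma0=1.6, s=3):
    """Blur sigma_{o,j} = sigma0 * 2**(o + j/s) for the s + 3 images of one octave."""
    return [sigma0 * 2.0 ** (octave + j / s) for j in range(s + 3)]


first = octave_sigmas(0)   # six values starting at 1.6
second = octave_sigmas(1)  # starts where sub-level j = s of octave 0 sits
```

With these defaults the fourth entry of `first` (sub-level $j = s$) equals the first entry of `second`, which is why that image can seed the next octave.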

Difference-of-Gaussian pyramid

The Difference-of-Gaussian (DoG) pyramid subtracts adjacent Gaussian levels within an octave,

$$D(x, y; \sigma) = L(x, y;\, k\sigma) - L(x, y;\, \sigma),$$

an efficient approximation to the scale-normalised Laplacian: by the heat equation, $L(\cdot; k\sigma) - L(\cdot; \sigma) \approx (k-1)\,\sigma^2\,\nabla^2 G * I$. The DoG achieves Laplacian-based extrema detection by image subtraction rather than explicit second-derivative convolution. SIFT detects keypoints at 3-D local extrema of $D$ in a $3 \times 3 \times 3$ neighbourhood spanning the scale above and below.
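The subtraction and the 26-neighbour test can be illustrated as follows. This is a deliberately slow, loop-based sketch for clarity; `dog_stack` and `local_extrema` are hypothetical names, the extremum test is strict, and the contrast and edge-response filtering that a full detector would apply are left out.

```python
import numpy as np


def dog_stack(gaussians):
    """Subtract adjacent Gaussian images of one octave: D_j = L_{j+1} - L_j."""
    g = np.asarray(gaussians, dtype=np.float64)
    return g[1:] - g[:-1]


def local_extrema(dog):
    """Return (scale, row, col) of strict extrema over the 26 neighbours."""
    found = []
    n, h, w = dog.shape
    for j in range(1, n - 1):          # needs a scale above and below
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                cube = dog[j - 1:j + 2, y - 1:y + 2, x - 1:x + 2]
                centre = dog[j, y, x]
                others = np.delete(cube.ravel(), 13)  # index 13 is the centre sample
                if centre > others.max() or centre < others.min():
                    found.append((j, y, x))
    return found
```

Six Gaussian images give five DoG images, of which the three interior ones are searched.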

Fixed-image, growing-filter alternative

SURF inverts the construction: instead of downsampling the image, it keeps the image at full resolution and enlarges the filter. The Hessian entries are approximated by box filters evaluated on an integral image, so each filter response costs $O(1)$ per pixel regardless of filter size. The first octave uses filter sizes $9 \times 9$, $15 \times 15$, $21 \times 21$, $27 \times 27$ (steps of 6 pixels, with the step doubling at each octave). All scale levels keep the input resolution, so detected extrema have identical spatial precision at every scale and no resampling error is incurred, at the cost of an integral image precomputed over the full input.
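The constant-time claim rests on the integral image: any box sum needs four lookups, whatever the box size. The sketch below shows only that mechanism, under the assumption of a zero-padded summed-area table; `integral_image` and `box_sum` are hypothetical names, and the actual SURF box-filter layouts for the Hessian entries are not reproduced.

```python
import numpy as np


def integral_image(image):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(image, axis=0, dtype=np.float64), axis=1)
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of image[top:top+height, left:left+width] from four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


# First-octave filter sizes quoted in the text: 9, 15, 21, 27 (steps of 6).
first_octave_sizes = [9 + 6 * i for i in range(4)]
```

A $27 \times 27$ box therefore costs the same four reads as a $9 \times 9$ one, which is what makes growing the filter cheaper than it first appears.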

Numerical Concerns

Aliasing from insufficient pre-smoothing. Downsampling by two without prior blurring aliases frequencies above the coarser grid's Nyquist limit. A Gaussian is not strictly band-limited, so pre-smoothing attenuates that content rather than removing it; blurring to at least $\sigma = 1.0$ pixel before halving keeps the residual aliasing small in practice.
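A one-dimensional toy case makes the failure visible. The sketch below decimates an alternating signal with and without a low-pass step. Note the assumption: the binomial $[1, 2, 1]/4$ kernel is used only as a simple stand-in for a Gaussian, it is not the $\sigma = 1.0$ filter discussed above, and the helper names are hypothetical.

```python
def decimate(signal):
    """Keep every other sample, as down() does along one axis."""
    return signal[::2]


def smooth_121(signal):
    """Binomial [1, 2, 1] / 4 low-pass with edge replication (stand-in filter)."""
    padded = [signal[0]] + list(signal) + [signal[-1]]
    return [(padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
            for i in range(len(signal))]


checker = [1.0, -1.0] * 8                   # highest frequency the grid can hold
aliased = decimate(checker)                 # every kept sample is +1: a false constant
filtered = decimate(smooth_121(checker))    # interior samples are 0 after smoothing
```

Without smoothing, the oscillation folds down to a constant that was never in the signal; with smoothing, the interior of the decimated signal is zero and only the replicated border leaves a residue.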

Assumed input blur. SIFT assumes the input already carries $\sigma = 0.5$ blur from camera optics. Because Gaussian variances add under convolution, reaching the base scale $\sigma_0$ requires a corrective Gaussian $\sigma_\Delta = \sqrt{\sigma_0^2 - \sigma_{\text{assumed}}^2}$. If the assumed blur is wrong, under-smoothing produces spurious DoG extrema and over-smoothing destroys fine features; the corrective computation is undefined when the assumed blur already exceeds $\sigma_0$.
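The corrective blur is a one-line computation, sketched here with an explicit guard for the undefined case. The function name `corrective_sigma` and the choice to raise `ValueError` are assumptions for illustration.

```python
import math


def corrective_sigma(target_sigma, assumed_sigma):
    """Extra Gaussian blur needed to go from assumed_sigma to target_sigma."""
    if assumed_sigma > target_sigma:
        # The square root would be of a negative number: blur cannot be removed.
        raise ValueError("assumed blur already exceeds the target scale")
    return math.sqrt(target_sigma ** 2 - assumed_sigma ** 2)


delta = corrective_sigma(1.6, 0.5)  # roughly 1.52 for the values quoted above
```

The guard matters in practice when the assumed blur is configurable, since a too-large value would otherwise surface as a math domain error deep inside the pyramid construction.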

Interpolation in upsampling. Resampling a coarse-level result back to full resolution by bilinear interpolation introduces a smoothing bias proportional to the local image Hessian — negligible for display, significant for sub-pixel localisation on upsampled level data.
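A one-dimensional analogue shows where the bias comes from. For a quadratic, linear interpolation at the midpoint of a sample interval of width $h$ overshoots by exactly $f'' h^2 / 8$; the sketch below checks this numerically. It is only an analogue of the bilinear case, and `midpoint_bias` and `quadratic` are hypothetical names.

```python
def midpoint_bias(f, x0, h):
    """Linear-interpolation error at the midpoint of [x0, x0 + h]."""
    interpolated = 0.5 * (f(x0) + f(x0 + h))
    return interpolated - f(x0 + 0.5 * h)


def quadratic(x):
    return x * x  # second derivative is 2 everywhere


bias_coarse = midpoint_bias(quadratic, 0.0, 1.0)  # 2 * 1.0**2 / 8
bias_fine = midpoint_bias(quadratic, 0.0, 0.5)    # 2 * 0.5**2 / 8
```

Halving the sample spacing cuts the bias by a factor of four, which is why the effect is worst when coarse pyramid levels are interpolated back to full resolution.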

Level count. The pyramid terminates when the image is too small to support meaningful detection — typically when the shorter side falls below roughly 8–16 pixels; carrying further levels yields responses dominated by Gaussian boundary effects rather than image structure.
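The stopping rule can be written as a small helper. This sketch assumes that halving keeps every other sample (so an odd side rounds up) and that a level is kept while its shorter side is at least `min_side`; the name `num_levels` and the default of 16 are illustrative choices within the range given above.

```python
def num_levels(width, height, min_side=16):
    """Count levels, including level 0, whose shorter side is >= min_side."""
    levels = 0
    side = min(width, height)
    while side >= min_side:
        levels += 1
        side = (side + 1) // 2  # size after keeping every other sample
    return levels


vga_levels = num_levels(640, 480)          # shorter side: 480, 240, 120, 60, 30
vga_levels_small = num_levels(640, 480, 8) # additionally keeps sides 15 and 8
```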

Fixed-image vs fixed-filter trade-off. The classical pyramid keeps only a quarter of the pixels at each new octave, since both dimensions are halved: less memory and computation at coarse scales, but cross-octave coordinate mapping is required. The integral-image approach keeps full resolution at every scale at the cost of $O(WH)$ integral-image storage and box-filter-size quantisation to odd integers.
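Both sides of the trade-off can be made concrete with two small helpers. The sketch assumes that downsampling keeps the even-indexed samples, so a pixel at column `x` of octave `o` sits at column `x * 2**o` of the full-resolution image; other sampling conventions add a half-pixel offset. The function names are hypothetical.

```python
def to_full_resolution(x, y, octave):
    """Map pixel coordinates in a given octave back to level-0 coordinates."""
    scale = 2 ** octave
    return x * scale, y * scale


def pyramid_pixels(width, height, octaves):
    """Total pixels stored when keeping one image per octave."""
    total = 0
    for _ in range(octaves):
        total += width * height
        width, height = (width + 1) // 2, (height + 1) // 2
    return total


pyramid_total = pyramid_pixels(640, 480, 4)  # one image per octave, shrinking
fixed_total = 4 * 640 * 480                  # four scales kept at input resolution
```

For a 640 x 480 input and four octaves, the shrinking pyramid stores about a third of the pixels that four full-resolution levels would.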

Where it appears

The image pyramid is the shared multi-resolution data structure behind many algorithms that must detect or describe features across scale change.

  • sift: the canonical Gaussian/DoG pyramid, with $s + 3 = 6$ Gaussian images per octave for $s = 3$, differenced to a DoG pyramid whose 26-neighbour extrema are the keypoints; the scale of the extremal response is the keypoint's assigned scale.
  • surf: the fixed-image, growing-filter alternative, with Hessian-determinant responses from integral-image box filters at sizes $9 \times 9$ to $27 \times 27$ in the first octave, every level at input resolution.
  • orb: a scale pyramid for multi-scale FAST detection, applying the FAST detector independently at each level for scale-invariant keypoints without a Gaussian scale-space.
  • pyramidal-blur-aware-xcorner: a Gaussian pyramid for chessboard X-corner detection under blur, selecting per corner the level that maximises the response-to-resolution ratio.

The continuous object the pyramid discretises is the scale-space concept.

References

  1. P. J. Burt, E. H. Adelson. The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications, 31(4) –540, 1983.
  2. D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2) –110, 2004.
  3. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. ECCV, 2006.
  4. P. Abeles. Pyramidal Blur Aware X-Corner Chessboard Detector. arXiv .13793, 2021.
  5. T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, 1994.