Definition
An image pyramid is a discrete multi-resolution representation of a single image: a finite sequence of images at progressively coarser spatial resolution, each derived from its predecessor by smoothing and downsampling.
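Given an input image $I_0$, each successive pyramid level $I_{l+1}$ is formed by convolving $I_l$ with a Gaussian kernel $G_\sigma$ and then halving the spatial resolution,
$$I_{l+1} = \downarrow_2 \left( G_\sigma * I_l \right), \qquad l = 0, 1, 2, \dots$$
where $\downarrow_2$ discards every other row and column. Input: a 2-D image of size $W \times H$. Output: a stack of images at resolutions $W/2^l \times H/2^l$.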
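The smooth-then-halve recursion can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the 3-sigma kernel truncation, the reflected borders, and the defaults `sigma=1.0` and `min_size=4` are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalised 1-D Gaussian, truncated at about 3 sigma (assumed cutoff)."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with reflected borders (assumed border rule)."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(image, dtype=np.float64), radius, mode="reflect")
    # Filter every row, then every column; "valid" removes the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def build_pyramid(image, sigma=1.0, min_size=4):
    """Return [I_0, I_1, ...], each level a blurred, halved copy of the last."""
    levels = [np.asarray(image, dtype=np.float64)]
    while min(levels[-1].shape) // 2 >= min_size:
        smoothed = gaussian_blur(levels[-1], sigma)
        levels.append(smoothed[::2, ::2])  # keep every other row and column
    return levels

levels = build_pyramid(np.ones((32, 32)))
print([level.shape for level in levels])
```

A library routine would normally replace the hand-written blur; it is spelled out here only so that the two steps of the definition stay visible.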
The pyramid groups levels into octaves: within an octave the resolution is fixed while the effective smoothing scale doubles; between octaves the image is downsampled by two. Each octave is divided into $s$ intra-octave sub-levels, giving a geometric progression of effective scales $\sigma_0\, 2^{o + i/s}$ for octave $o$ and sub-level $i$, where $\sigma_0$ is the base scale. The continuous object the pyramid samples is the scale space, the one-parameter family $L(x, y; \sigma) = G_\sigma * I$; the pyramid is its sampled, downsampled realisation.
Mathematical Description
Gaussian pyramid
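Within a single octave the images are obtained by successive blurring at scale ratios $k$ with
$$k = 2^{1/s},$$
where $s$ is the number of sub-levels per octave, and the blur at sub-level $i$ of octave $o$ is $\sigma(o, i) = \sigma_0\, 2^{o + i/s} = 2^o \sigma_0 k^i$, measured in pixels of the octave-0 grid. The image starting octave $o + 1$ is the downsampled sub-level $i = s$ of octave $o$, the last of the $s$ scale steps, whose blur is exactly $2^{o+1} \sigma_0$. SIFT uses $s = 3$ sub-levels per octave, so $k = 2^{1/3} \approx 1.26$, and holds $s + 3 = 6$ Gaussian-blurred images per octave so that Difference-of-Gaussian extrema can be bracketed over a full octave without boundary gaps. The base scale is $\sigma_0 = 1.6$, and the input is upsampled by two before the first octave to recover fine structure.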
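As a worked example of the scale schedule, the sketch below lists the absolute blur of each of the six images in one octave for the SIFT settings quoted above, together with the extra blur needed to step from one image to the next. The function names are hypothetical. The incremental step relies on the standard fact that Gaussian variances add under convolution.

```python
import math

def sigma_schedule(sigma0=1.6, s=3):
    """Absolute blur of each of the s + 3 images in one octave."""
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 3)]

def incremental_sigmas(sigmas):
    """Extra blur that takes image i to image i + 1.

    Blurring by a then by b equals blurring by sqrt(a^2 + b^2), so the
    step is the square root of the difference of the squared sigmas.
    """
    return [math.sqrt(b * b - a * a) for a, b in zip(sigmas, sigmas[1:])]

sigmas = sigma_schedule()
steps = incremental_sigmas(sigmas)
for i, sigma in enumerate(sigmas):
    print(f"sub-level {i}: sigma = {sigma:.4f}")
```

Blurring each image from its predecessor with the incremental value, rather than from the octave's first image with the absolute value, keeps the kernels small.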
Difference-of-Gaussian pyramid
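The Difference-of-Gaussian (DoG) pyramid subtracts adjacent Gaussian levels within an octave,
$$D(x, y; \sigma) = L(x, y; k\sigma) - L(x, y; \sigma),$$
where $L(x, y; \sigma)$ is the image blurred to scale $\sigma$ and $k$ is the ratio between adjacent scales. This is an efficient approximation to the scale-normalised Laplacian: by the heat equation $\partial L / \partial \sigma = \sigma \nabla^2 L$, a finite difference in $\sigma$ gives $D \approx (k - 1)\, \sigma^2 \nabla^2 L$. The DoG achieves Laplacian-based extrema detection by image subtraction rather than explicit second-derivative convolution. SIFT detects keypoints at 3-D local extrema of $D$ in a $3 \times 3 \times 3$ neighbourhood spanning the scale above and below.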
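The subtraction and the 3-D extremum test can be sketched as below. The function names are hypothetical and the brute-force triple loop is chosen for readability, not speed. The sketch covers only the neighbourhood comparison described above, with no filtering of the candidates it returns.

```python
import numpy as np

def dog_stack(gaussians):
    """Subtract adjacent Gaussian images of one octave: D_i = L_{i+1} - L_i."""
    g = np.asarray(gaussians, dtype=np.float64)
    return g[1:] - g[:-1]

def local_extrema_3d(dog):
    """Return (scale, row, col) of samples strictly above or below all 26 neighbours."""
    found = []
    n, h, w = dog.shape
    for i in range(1, n - 1):            # needs one scale above and one below
        for r in range(1, h - 1):
            for c in range(1, w - 1):
                cube = dog[i - 1:i + 2, r - 1:r + 2, c - 1:c + 2]
                centre = dog[i, r, c]
                # Index 13 is the centre of the flattened 3x3x3 cube.
                others = np.delete(cube.ravel(), 13)
                if centre > others.max() or centre < others.min():
                    found.append((i, r, c))
    return found

# Synthetic check: a single isolated peak in the middle scale.
dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 1.0
print(local_extrema_3d(dog))
```

Because the test needs a scale on either side, a stack of $n$ DoG images yields candidates only on its $n - 2$ interior scales.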
Fixed-image, growing-filter alternative
SURF inverts the construction: instead of downsampling the image, it keeps the image at full resolution and enlarges the filter. The Hessian entries are approximated by box filters evaluated on an integral image in $O(1)$ per pixel; the first octave uses filter sizes $9 \times 9$, $15 \times 15$, $21 \times 21$, $27 \times 27$ (steps of 6 pixels, with the step doubling per octave). All scale levels keep the input resolution, so detected extrema have identical spatial precision at every scale and no resampling error is incurred, at the cost of an integral image precomputed over the full input.
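The constant-time box sum that makes growing filters affordable can be illustrated with a summed-area table. The sketch below is an assumption-laden example, not SURF's actual filter layout: the function names are hypothetical, and it shows only a plain rectangle sum plus the first-octave sizes listed above.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with an extra zero row and column at the top-left."""
    ii = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.cumsum(np.cumsum(image, axis=0, dtype=np.float64), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of image[r0:r1, c0:c1] from four lookups, whatever the box size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

def first_octave_sizes(start=9, step=6, count=4):
    """Filter side lengths of the first octave: 9, 15, 21, 27."""
    return [start + step * i for i in range(count)]

image = np.arange(20).reshape(4, 5)
ii = integral_image(image)
print(box_sum(ii, 1, 1, 3, 4), image[1:3, 1:4].sum())
print(first_octave_sizes())
```

The cost of `box_sum` does not depend on the rectangle's area, which is why enlarging the filter is no more expensive per pixel than keeping it small.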
Numerical Concerns
Aliasing from insufficient pre-smoothing. Downsampling by two without prior blurring folds frequencies above the coarser grid's Nyquist limit onto lower ones; pre-smoothing with a sufficiently wide Gaussian before halving attenuates those frequencies so that the coarser grid approximately satisfies the Nyquist condition.
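A one-dimensional toy case makes the effect concrete. In the sketch below, a signal alternating between +1 and -1 sits exactly at the fine grid's Nyquist frequency. The binomial kernel [1, 2, 1] / 4 stands in for the Gaussian and the borders wrap around; both are assumptions chosen to keep the arithmetic exact.

```python
def smooth_121(signal):
    """Blur with the binomial kernel [1, 2, 1] / 4 using wrap-around borders."""
    padded = [signal[-1]] + list(signal) + [signal[0]]
    return [(padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4.0
            for i in range(len(signal))]

def decimate(signal):
    """Keep every other sample."""
    return list(signal[::2])

finest = [1.0, -1.0] * 8                  # alternates on every sample
aliased = decimate(finest)                # no pre-smoothing
filtered = decimate(smooth_121(finest))   # smooth first, then halve

print(aliased)   # the oscillation reappears as a constant signal
print(filtered)  # the oscillation has been removed before sampling
```

Without smoothing, the finest oscillation masquerades as a constant offset on the coarse grid; with smoothing it is suppressed before it can alias.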
Assumed input blur. SIFT assumes the input already carries some blur $\sigma_{\text{in}}$ from camera optics; reaching the base scale $\sigma_0$ requires a corrective Gaussian of width $\sigma_c = \sqrt{\sigma_0^2 - \sigma_{\text{in}}^2}$. If the assumed blur is wrong, under-smoothing produces spurious DoG extrema and over-smoothing destroys fine features; the corrective computation is undefined when the assumed blur already exceeds $\sigma_0$.
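The corrective blur follows from the rule that Gaussian variances add under convolution. A minimal sketch, with a hypothetical function name and illustrative input values that are not taken from the text:

```python
import math

def corrective_sigma(sigma_target, sigma_assumed):
    """Blur to apply so that an image assumed to carry `sigma_assumed`
    ends up at `sigma_target`."""
    if sigma_assumed > sigma_target:
        # The square root would be of a negative number: no Gaussian
        # can reduce the blur already present.
        raise ValueError("assumed blur exceeds the target scale")
    return math.sqrt(sigma_target ** 2 - sigma_assumed ** 2)

# Illustrative values only: target 1.6, assumed input blur 0.5 or 1.0.
print(round(corrective_sigma(1.6, 0.5), 4))
print(round(corrective_sigma(1.6, 1.0), 4))
```

The larger the assumed input blur, the less additional smoothing is applied, which is how an overestimate leads to under-smoothing.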
Interpolation in upsampling. Resampling a coarse-level result back to full resolution by bilinear interpolation introduces a smoothing bias proportional to the local image Hessian — negligible for display, significant for sub-pixel localisation on upsampled level data.
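A one-dimensional analogue shows where the bias comes from. Linear interpolation between two samples of a parabola overshoots the true midpoint value by an amount proportional to the second derivative. The function names and the quadratic test signal are assumptions made for this example.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b at fraction t."""
    return (1.0 - t) * a + t * b

def midpoint_error(curvature, spacing=1.0):
    """Interpolated minus true value halfway between two samples of
    f(x) = 0.5 * curvature * x**2, whose second derivative is `curvature`."""
    def f(x):
        return 0.5 * curvature * x * x
    interpolated = lerp(f(0.0), f(spacing), 0.5)
    return interpolated - f(0.5 * spacing)

# The error equals curvature * spacing**2 / 8: doubling the curvature doubles it.
print(midpoint_error(2.0), midpoint_error(4.0))
```

The error vanishes on locally linear signals and grows with the sample spacing squared, so it matters most when coarse-level data are resampled near sharply curved structure.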
Level count. The pyramid terminates when the image is too small to support meaningful detection — typically when the shorter side falls below roughly 8–16 pixels; carrying further levels yields responses dominated by Gaussian boundary effects rather than image structure.
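The stopping rule amounts to counting how often the shorter side can be halved before it drops below the chosen minimum. A small sketch, with a hypothetical function name and the minimum side left as a parameter:

```python
def count_octaves(width, height, min_side=16):
    """Number of octaves whose shorter side is still at least `min_side`."""
    octaves = 0
    side = min(width, height)
    while side >= min_side:
        octaves += 1
        side //= 2
    return octaves

# For a 640 x 480 input the shorter side runs 480, 240, 120, 60, 30, 15, ...
print(count_octaves(640, 480, min_side=16))
print(count_octaves(640, 480, min_side=8))
```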
Fixed-image vs fixed-filter trade-off. The classical pyramid discards three quarters of the pixels per octave (half along each axis), which means less memory and computation at coarse scales, but cross-octave coordinate mapping is required. The integral-image approach keeps full resolution at every scale at the cost of integral-image storage and box-filter-size quantisation to odd integers.
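The storage side of this trade-off can be put in numbers. The sketch below counts pixels for a pyramid that halves both dimensions per octave and for a stack kept at full resolution. It assumes one image per octave for simplicity, and the function names are hypothetical.

```python
def pyramid_pixels(width, height, octaves):
    """Total pixels when both dimensions are halved after every octave."""
    total = 0
    for _ in range(octaves):
        total += width * height
        width, height = width // 2, height // 2
    return total

def full_resolution_pixels(width, height, octaves):
    """Total pixels when every octave keeps the input resolution."""
    return width * height * octaves

print(pyramid_pixels(640, 480, 3), full_resolution_pixels(640, 480, 3))
```

The pyramid total stays below 4/3 of the base image however many octaves are added, while the full-resolution total grows linearly with the octave count.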
Where it appears
The image pyramid is the shared multi-resolution data structure behind every algorithm that must detect or describe features across scale change.
- sift: the canonical Gaussian/DoG pyramid, with $s + 3 = 6$ Gaussian images per octave at scale ratio $k = 2^{1/3}$, differenced to a DoG pyramid whose 26-neighbour extrema are the keypoints; the scale of the extremal response is the keypoint's assigned scale.
- surf: the fixed-image, growing-filter alternative, with Hessian-determinant responses from integral-image box filters at sizes from $9 \times 9$ upward, every level at input resolution.
- orb: a scale pyramid for multi-scale FAST detection, applying the FAST detector independently at each level for scale-invariant keypoints without a Gaussian scale-space.
- pyramidal-blur-aware-xcorner: a Gaussian pyramid for chessboard X-corner detection under blur, selecting per corner the level that maximises the response-to-resolution ratio.
The continuous object the pyramid discretises is the scale-space concept.
References
- P. J. Burt, E. H. Adelson. The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications, 31(4):532–540, 1983.
- D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. ECCV, 2006.
- P. Abeles. Pyramidal Blur Aware X-Corner Chessboard Detector. arXiv.13793, 2021.
- T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, 1994.