Definition
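The (linear) scale space of an image $I(x, y)$ is the family of images

$$L(x, y; \sigma) = (G_\sigma * I)(x, y)$$

indexed by the scale parameter $\sigma \ge 0$, where $G_\sigma$ is the isotropic Gaussian kernel with standard deviation $\sigma$:

$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

At $\sigma = 0$, $L(x, y; 0) = I(x, y)$ (the original image). As $\sigma$ increases, fine structure is progressively suppressed and only coarser structure survives. The scale parameter sets the spatial resolution at which the image is examined: structure at angular spatial frequency $\omega$ is attenuated by the factor $\exp(-\sigma^2 \omega^2 / 2)$.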
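The attenuation factor can be checked numerically. The following is a minimal sketch, assuming a sampled Gaussian kernel truncated at four standard deviations and a 1-D cosine test signal; the helper names `gaussian_kernel_1d` and `measured_attenuation` are illustrative and not part of any library.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Sampled, normalized 1-D Gaussian truncated at `truncate` standard deviations."""
    radius = int(np.ceil(truncate * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def measured_attenuation(sigma, omega, n=2001):
    """Gain applied to cos(omega * x) by Gaussian smoothing, read at the signal centre."""
    x = np.arange(n) - n // 2
    signal = np.cos(omega * x)
    smoothed = np.convolve(signal, gaussian_kernel_1d(sigma), mode="same")
    return float(smoothed[n // 2])  # cos(0) = 1, so this sample is the gain itself

sigma, omega = 2.0, 0.5  # omega in radians per pixel
measured = measured_attenuation(sigma, omega)
predicted = float(np.exp(-sigma**2 * omega**2 / 2.0))
print(f"measured gain {measured:.4f}, predicted gain {predicted:.4f}")
```

The two numbers should agree closely for $\sigma$ of about one pixel or more; for much smaller $\sigma$ the sampled kernel is a poor approximation of the continuous Gaussian.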
Scale space is not an algorithm; it is a representation. Algorithms that detect or describe image features at multiple scales operate by applying their feature operators to $L(x, y; \sigma)$ for a discrete set of $\sigma$ values, then selecting the scale at which each feature has the strongest or most stable response.
Mathematical Description
Axiomatic characterization
Lindeberg (1994) showed that under four modest axioms, namely linearity (superposition holds), spatial shift invariance (no preferred position), isotropy (no preferred orientation), and causality (no spurious structure is created as $\sigma$ increases), the Gaussian family is the unique one-parameter semigroup of smoothing operators that can generate a scale space. Any other rotationally symmetric kernel that satisfies these axioms is equivalent to a reparameterization of the Gaussian.
Heat equation connection
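The scale-space family satisfies the linear diffusion (heat) equation, written here in terms of the variance $t = \sigma^2$:

$$\frac{\partial L}{\partial t} = \frac{1}{2} \nabla^2 L, \qquad L(x, y; 0) = I(x, y)$$

This identifies scale-space generation with isotropic heat diffusion, and makes the causality property transparent: by the maximum principle for the heat equation, the value of $L$ at a local maximum cannot increase and the value at a local minimum cannot decrease as $\sigma$ increases. In one dimension this strengthens to the statement that no new local extrema are created; in two dimensions only this non-enhancement of extrema is guaranteed.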
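The link between diffusion and Gaussian smoothing can be illustrated in one dimension. The sketch below assumes the diffusion equation is written with variance $t = \sigma^2$ as $\partial L / \partial t = \tfrac{1}{2} \partial^2 L / \partial x^2$, uses explicit Euler steps on a periodic grid with unit pixel spacing, and starts from a unit impulse; `diffuse` is a hypothetical helper, and the step size `dt = 0.25` is an arbitrary choice inside the stability limit `dt <= 1`.

```python
import numpy as np

def diffuse(u, t_total, dt=0.25):
    """Explicit Euler steps of u_t = 0.5 * u_xx on a periodic 1-D grid (pixel spacing 1)."""
    steps = int(round(t_total / dt))
    u = u.astype(float).copy()
    peaks = [float(u.max())]
    for _ in range(steps):
        laplacian = np.roll(u, 1) - 2.0 * u + np.roll(u, -1)
        u = u + 0.5 * dt * laplacian
        peaks.append(float(u.max()))  # track the global maximum after every step
    return u, peaks

n = 201
impulse = np.zeros(n)
impulse[n // 2] = 1.0

t = 4.0  # diffusion time; the matching Gaussian has sigma = sqrt(t) = 2
u, peaks = diffuse(impulse, t)

x = np.arange(n) - n // 2
variance = float(np.sum(u * x**2) / np.sum(u))
sigma = np.sqrt(t)
gaussian = np.exp(-x**2 / (2.0 * sigma**2)) / np.sqrt(2.0 * np.pi * sigma**2)
max_abs_diff = float(np.max(np.abs(u - gaussian)))
print(f"variance {variance:.4f} (target {t}), max deviation from Gaussian {max_abs_diff:.4f}")
```

The diffused impulse has variance equal to the diffusion time, and its peak value never grows from one step to the next, which is the discrete counterpart of the non-enhancement property.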
Scale-normalized derivatives
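Comparing derivative magnitudes across scales requires normalization to compensate for the decay, of order $\sigma^{-n}$, that Gaussian smoothing imposes on the amplitude of $n$-th order derivatives. The $\gamma$-normalized $n$-th order derivative is:

$$\partial^n_{x,\gamma\text{-norm}} L(x, y; \sigma) = \sigma^{n\gamma}\, \partial^n_x L(x, y; \sigma)$$

For $\gamma = 1$ (full scale normalization), the normalized response is scale covariant: if the image is magnified by a factor $c$, the response curve over scale keeps its peak value and shifts from $\sigma$ to $c\sigma$. A blob-like feature of characteristic size $s_0$ therefore produces its peak response at a scale proportional to $s_0$, enabling automatic scale selection by finding extrema of the normalized response over $\sigma$.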
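As an illustration of scale selection with normalized derivatives, the sketch below uses a unit-amplitude Gaussian blob, for which the Laplacian at the blob centre has a closed form after smoothing. The blob model, the helper names, and the sampled range of scales are assumptions made for the example.

```python
import numpy as np

def normalized_laplacian_at_centre(sigma, blob_std, gamma=1.0):
    """sigma^(2*gamma) times the Laplacian of L at the centre of a Gaussian blob.

    The blob exp(-r^2 / (2 * blob_std^2)) smoothed by a Gaussian of standard
    deviation sigma is again a Gaussian with variance v = blob_std^2 + sigma^2,
    so the Laplacian at its centre is -2 * blob_std^2 / v^2.
    """
    v = blob_std**2 + sigma**2
    laplacian = -2.0 * blob_std**2 / v**2
    return sigma ** (2.0 * gamma) * laplacian

def select_scale(blob_std, sigmas, gamma=1.0):
    """Return (scale of strongest response, magnitude of that response)."""
    response = np.abs(normalized_laplacian_at_centre(sigmas, blob_std, gamma))
    best = int(np.argmax(response))
    return float(sigmas[best]), float(response[best])

sigmas = np.geomspace(0.5, 64.0, 2000)
for blob_std in (2.0, 4.0, 8.0):
    sigma_star, peak = select_scale(blob_std, sigmas, gamma=1.0)
    print(f"blob std {blob_std}: selected sigma {sigma_star:.2f}, peak response {peak:.3f}")

# Without normalization (gamma = 0) the response only decays with sigma,
# so the "selected" scale is just the smallest scale that was sampled.
unnormalized_scale, _ = select_scale(4.0, sigmas, gamma=0.0)
```

With $\gamma = 1$ the selected scale tracks the blob size and the peak magnitude is the same for all three blobs, which is the property that makes responses comparable across scales.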
Difference of Gaussians (DoG)
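The scale-normalized Laplacian of Gaussian is approximated efficiently, up to the constant factor $k - 1$, by the difference of two Gaussians at adjacent scales $\sigma$ and $k\sigma$:

$$D(x, y; \sigma) = L(x, y; k\sigma) - L(x, y; \sigma) \approx (k - 1)\, \sigma^2\, \nabla^2 L(x, y; \sigma)$$

Lowe's SIFT keypoint detector finds 3-D extrema (over $x$, $y$, and $\sigma$) in the DoG pyramid. The DoG is preferred over the Laplacian because the Gaussian-smoothed images are needed anyway, so each DoG level costs one image subtraction rather than a second-derivative convolution.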
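The 3-D extremum search can be sketched on a synthetic image. This is a simplified illustration, not Lowe's implementation: it builds a single octave by blurring the input directly at each scale, assumes $\sigma_0 = 1.6$ and three intervals per octave, omits sub-pixel refinement and edge rejection, and uses an arbitrary contrast threshold of 0.03. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    radius = int(np.ceil(truncate * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur; pixels outside the image are treated as zero."""
    g = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

def dog_extrema(dog, threshold):
    """Strict extrema over the 26 neighbours in (scale, y, x), above a contrast threshold."""
    n_s, n_y, n_x = dog.shape
    core = dog[1:-1, 1:-1, 1:-1]
    is_max = np.ones(core.shape, dtype=bool)
    is_min = np.ones(core.shape, dtype=bool)
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue
                nb = dog[1 + ds:n_s - 1 + ds, 1 + dy:n_y - 1 + dy, 1 + dx:n_x - 1 + dx]
                is_max &= core > nb
                is_min &= core < nb
    hits = (is_max | is_min) & (np.abs(core) > threshold)
    # +1 converts indices in the cropped core back to indices in `dog`
    return [(int(s) + 1, int(y) + 1, int(x) + 1) for s, y, x in np.argwhere(hits)]

# Synthetic test image: one bright Gaussian blob of standard deviation 3 pixels.
size, centre, blob_std = 65, 32, 3.0
yy, xx = np.mgrid[0:size, 0:size]
image = np.exp(-((xx - centre) ** 2 + (yy - centre) ** 2) / (2.0 * blob_std**2))

sigma0, intervals = 1.6, 3
k = 2.0 ** (1.0 / intervals)
sigmas = [sigma0 * k**i for i in range(intervals + 3)]
gaussians = np.stack([gaussian_blur(image, s) for s in sigmas])
dog = gaussians[1:] - gaussians[:-1]  # dog[i] = L(k * sigma_i) - L(sigma_i)

keypoints = dog_extrema(dog, threshold=0.03)
for level, y, x in keypoints:
    print(f"extremum at x={x}, y={y}, DoG level {level} (sigma = {sigmas[level]:.2f})")
```

A bright blob produces a negative DoG response, so the keypoint here is a minimum; the weak positive ring around the blob falls below the contrast threshold and is discarded.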
Discrete scale-space pyramids
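The continuous scale space is discretized by sampling $\sigma$ in a geometric progression $\sigma_i = \sigma_0 k^i$ for integer $i \ge 0$. An octave groups the scales spanning a factor of 2 in $\sigma$: within an octave, all images share the same resolution; between octaves, the image is downsampled by 2 and the blur schedule restarts at $\sigma_0$, measured in pixels of the new grid.
An octave at resolution level $o$ is divided into $s$ intervals, so $k = 2^{1/s}$. Covering the octave takes $s + 1$ Gaussian images; two further overlapping images are added so that DoG extrema can be detected at each of the $s$ scales, giving $s + 3$ Gaussian images (and $s + 2$ DoG images) per octave.
Anti-aliasing requires that before downsampling from octave $o$ to $o + 1$, the image be blurred to about $\sigma \ge 0.5$ pixel at the new resolution (Nyquist condition; 0.5 pixel is the minimum Lowe cites for preventing significant aliasing). In practice, SIFT takes the image of octave $o$ at $\sigma = 2\sigma_0$, which sits two images below the top of the stack, keeps every second pixel, and uses the result as the first image of octave $o + 1$.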
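The blur schedule of a SIFT-style pyramid can be tabulated in a few lines. The sketch below assumes Lowe's published defaults ($\sigma_0 = 1.6$, three intervals per octave) and $s + 3$ Gaussian images per octave; the function names are illustrative.

```python
import math

def octave_sigmas(sigma0=1.6, intervals=3):
    """Blur levels, in pixels of the octave's own grid, for one SIFT-style octave."""
    k = 2.0 ** (1.0 / intervals)
    return [sigma0 * k**i for i in range(intervals + 3)]

def pyramid_schedule(num_octaves, sigma0=1.6, intervals=3):
    """Blur of every pyramid image, expressed in pixels of the base image."""
    schedule = []
    for octave in range(num_octaves):
        pixel_size = 2**octave  # each octave halves the resolution
        schedule.append([pixel_size * s for s in octave_sigmas(sigma0, intervals)])
    return schedule

intervals = 3
schedule = pyramid_schedule(num_octaves=3, intervals=intervals)
for octave, levels in enumerate(schedule):
    print(f"octave {octave}: " + ", ".join(f"{s:.2f}" for s in levels))

# The image at index `intervals` has twice the base blur of its octave, so after
# 2x downsampling it matches the first image of the next octave.
seam_matches = math.isclose(schedule[0][intervals], schedule[1][0])
```

Expressed in base-image pixels, the schedule is continuous across the octave seam, which is why the image at $2\sigma_0$ is the natural one to downsample.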
The Laplacian pyramid of Burt and Adelson (1983) is an alternative discretization: each level stores the difference between a Gaussian-pyramid level and the upsampled next-coarser level, giving a compact multi-scale bandpass decomposition. SIFT's DoG pyramid is the scale-space counterpart of the Laplacian pyramid.
Characteristic scale
The characteristic scale $\sigma^*$ of a feature is the scale at which the magnitude of the scale-normalized Laplacian, $\sigma^2 |\nabla^2 L|$, achieves a local maximum over $\sigma$. For a circular blob of radius $r$, $\sigma^* = r / \sqrt{2}$. Selecting features at their characteristic scale makes descriptors invariant to scale change between images.
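The relation between blob radius and characteristic scale can be checked numerically for a uniform disc. The sketch below rasterizes the disc on a pixel grid and evaluates the scale-normalized Laplacian of Gaussian at the disc centre by direct summation; the function names and the sampled scale range are assumptions for the example.

```python
import numpy as np

def normalized_log_at_disc_centre(radius, sigma):
    """sigma^2 times the Laplacian-of-Gaussian kernel summed over a rasterized unit disc.

    By symmetry of the kernel, this equals the scale-normalized Laplacian of the
    smoothed disc image evaluated at the disc centre.
    """
    half = int(np.ceil(radius)) + 1
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    rho2 = xx**2 + yy**2
    inside = rho2 <= radius**2
    gauss = np.exp(-rho2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    log_kernel = (rho2 - 2.0 * sigma**2) / sigma**4 * gauss  # 2-D Laplacian of Gaussian
    return sigma**2 * float(np.sum(log_kernel[inside]))

def characteristic_scale(radius, sigmas):
    """Scale with the largest magnitude of the normalized response."""
    responses = np.array([normalized_log_at_disc_centre(radius, s) for s in sigmas])
    return float(sigmas[int(np.argmax(np.abs(responses)))])

sigmas = np.arange(2.0, 16.0, 0.05)
results = {}
for radius in (6.0, 10.0, 14.0):
    results[radius] = characteristic_scale(radius, sigmas)
    print(f"radius {radius}: selected sigma {results[radius]:.2f}, "
          f"r / sqrt(2) = {radius / np.sqrt(2.0):.2f}")
```

The selected scale should land close to the disc radius divided by $\sqrt{2}$, with small deviations caused by rasterizing the disc and by the finite step of the scale grid.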
Numerical Concerns
Octave structure and anti-aliasing. Downsampling by 2 without prior blurring introduces aliasing. The image must be blurred to at least $\sigma = 1$ pixel on its current grid (0.5 pixel on the halved grid) before the resolution is halved. SIFT assumes that the original image already carries a blur of $\sigma = 0.5$ pixel from camera optics and sampling, and begins the first octave at $\sigma_0 = 1.6$ pixels.
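Separable implementation. The 2-D Gaussian is separable into $G_\sigma(x, y) = g_\sigma(x)\, g_\sigma(y)$, where $g_\sigma$ is the 1-D Gaussian. Convolving with a 2-D kernel of size $w \times w$ has complexity $O(w^2)$ per pixel; the separable implementation runs two 1-D passes each of length $w$, giving $O(w)$ per pixel. For a kernel of width $w$, this is a factor of $w^2 / (2w) = w/2$ speedup in multiply-adds.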
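The equivalence of the two implementations can be verified directly. The sketch below compares a single pass with the full 2-D kernel against two 1-D passes, assuming zero padding at the borders and a kernel truncated at three standard deviations; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    radius = int(np.ceil(truncate * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def blur_direct_2d(image, g):
    """One pass with the full w x w kernel: w * w multiply-adds per pixel."""
    radius = len(g) // 2
    kernel = np.outer(g, g)
    padded = np.pad(image, radius)  # zero padding on all sides
    height, width = image.shape
    out = np.zeros((height, width))
    for dy in range(len(g)):
        for dx in range(len(g)):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out

def blur_separable(image, g):
    """Two 1-D passes (rows, then columns): 2 * w multiply-adds per pixel."""
    rows = np.apply_along_axis(np.convolve, 1, image, g, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, g, mode="same")

rng = np.random.default_rng(0)
image = rng.random((48, 40))
g = gaussian_kernel_1d(2.0)
w = len(g)

direct = blur_direct_2d(image, g)
separable = blur_separable(image, g)
max_diff = float(np.max(np.abs(direct - separable)))
speedup = (w * w) / (2 * w)
print(f"kernel width {w}, max difference {max_diff:.2e}, multiply-add ratio {speedup:.1f}")
```

The results agree to floating-point precision, so the choice between the two is purely a matter of cost.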
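Incremental Gaussian generation. At each scale step within an octave, the next blurred image is obtained by blurring the previous one (not blurring the original each time). If the current level has $\sigma_i$ and the target has $\sigma_{i+1}$, the additional blur is $\sigma_\Delta = \sqrt{\sigma_{i+1}^2 - \sigma_i^2}$, by the semigroup property $G_{\sigma_a} * G_{\sigma_b} = G_{\sqrt{\sigma_a^2 + \sigma_b^2}}$. This avoids re-blurring from the original and keeps kernels small, at the cost of accumulating discretization and rounding errors.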
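The incremental blur amounts follow directly from the semigroup property. The sketch below tabulates them for one octave, assuming $\sigma_0 = 1.6$ and three intervals per octave as in SIFT; `incremental_sigmas` is a hypothetical helper.

```python
import math

def incremental_sigmas(sigma0=1.6, intervals=3):
    """Target blur per level and the extra blur needed to get there from the previous level."""
    k = 2.0 ** (1.0 / intervals)
    targets = [sigma0 * k**i for i in range(intervals + 3)]
    # Variances add under convolution, so the increment is a difference of squares.
    deltas = [math.sqrt(b**2 - a**2) for a, b in zip(targets, targets[1:])]
    return targets, deltas

targets, deltas = incremental_sigmas()
for i, delta in enumerate(deltas):
    print(f"level {i} -> {i + 1}: sigma {targets[i]:.3f} -> {targets[i + 1]:.3f}, "
          f"extra blur {delta:.3f}")

# Chaining all increments from the first level reproduces the blur of the last level.
accumulated = math.sqrt(targets[0] ** 2 + sum(d**2 for d in deltas))
```

Each increment is smaller than the target blur it produces, which is why the incremental scheme can use shorter kernels than blurring the original image at every level.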
Integer vs floating-point pixels. Computing DoG on integer-quantized images introduces quantization noise in the difference. For calibration-target corner detection, images are typically 8-bit; the DoG response between adjacent levels is small compared with the 8-bit range, so rounding the blurred images to integers before differencing can produce false extrema. Floating-point intermediate representations are preferred.
Scale ratio and detection coverage. The scale ratio $k$ determines how densely $\sigma$ is sampled. For SIFT, $s = 3$ gives $k = 2^{1/3} \approx 1.26$; between two adjacent DoG levels, the scale changes by 26%. Features whose characteristic scale falls between two sample levels respond more weakly at both neighboring levels and can fall below the detection threshold, causing scale-sampling gaps. A smaller $k$ (more samples per octave) improves coverage at the cost of additional convolutions.
Boundary effects in pyramids. At coarse scales (large $\sigma$), the Gaussian kernel radius approaches or exceeds the image size. Border handling (replicate, reflect, or zero-pad) produces artifacts in a band about one kernel radius wide along each edge, roughly $3\sigma$ pixels for a kernel truncated at $3\sigma$. Feature detection near borders at coarse scales is unreliable and should be masked.
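The width of the affected border band is easy to see in one dimension. The sketch below blurs a constant signal with zero padding and with reflection padding, assuming a kernel truncated at three standard deviations; the helper names are illustrative.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    radius = int(np.ceil(truncate * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

def blur_1d(signal, sigma, pad_mode):
    """Pad by one kernel radius with the given numpy.pad mode, then convolve."""
    g = gaussian_kernel_1d(sigma)
    radius = len(g) // 2
    padded = np.pad(signal, radius, mode=pad_mode)  # "constant" pads with zeros
    return np.convolve(padded, g, mode="valid")

signal = np.ones(64)  # a constant signal should stay constant under smoothing
sigma = 3.0
radius = len(gaussian_kernel_1d(sigma)) // 2

zero_padded = blur_1d(signal, sigma, "constant")
reflected = blur_1d(signal, sigma, "reflect")

affected = np.flatnonzero(np.abs(zero_padded - 1.0) > 1e-12)
print(f"kernel radius {radius}, pixels changed by zero padding: {len(affected)}")
print(f"zero-padded value at the first pixel: {zero_padded[0]:.3f}")
```

Zero padding darkens exactly one kernel radius of pixels at each end, while reflection leaves this particular signal untouched. On real images reflection still invents content beyond the border, so masking the band remains the safer choice for detection.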
Where it appears
Scale space underlies every algorithm that must detect or describe features consistently across image resolutions or under scale change. Calibration-target corner detectors, in particular, use scale space to handle targets that appear at varying distances from the camera.
- chess-corners — ChESS computes its ring-pattern response on the image at multiple scales; RING5 corresponds to a ring radius of 5 pixels, which maps to a specific scale in the scale-space sense. Applying the detector across scales and selecting the peak response makes detection robust to target scale variation.
- pyramidal-blur-aware-xcorner — explicitly constructs a Gaussian image pyramid and runs its X-corner detector at each pyramid level; the "pyramidal" in the name refers to this multi-scale search; blur-aware scale selection picks the pyramid level whose blur matches the detector's response model.
- sift: the canonical worked example of DoG scale-space extrema detection. SIFT uses $s = 3$ intervals per octave ($k = 2^{1/3}$), a $\sigma_0 = 1.6$ initial blur, and constructs a complete Gaussian pyramid before differencing adjacent levels, a direct practical instantiation of Lindeberg's scale-normalized Laplacian theory.
References
- T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, 1994. The axiomatic foundation of scale space; introduces $\gamma$-normalized derivatives and characteristic scale selection.
- P. J. Burt, E. H. Adelson. "The Laplacian Pyramid as a Compact Image Code." IEEE Transactions on Communications 31(4), 1983. Introduces the image pyramid; the Laplacian pyramid is the discrete-scale-space precursor to SIFT's DoG pyramid.
- D. G. Lowe. "Distinctive Image Features from Scale-Invariant Keypoints." International Journal of Computer Vision 60(2), 2004. Uses the DoG pyramid for keypoint detection with automatic scale selection; SIFT descriptors computed at the characteristic scale.
- R. Szeliski. Computer Vision: Algorithms and Applications. 2nd ed. Springer, 2022. §3.5 covers Gaussian pyramids and scale space; §7.1 covers SIFT and multi-scale feature detection.
- J. Koenderink. "The Structure of Images." Biological Cybernetics 50(5), 1984. Early scale-space paper showing that the Gaussian is the only kernel consistent with local image measurements.