Definition
Convolution is the linear, shift-invariant operation that produces each output pixel as a weighted sum of the input image values in a local neighbourhood, the weights given by a kernel.
Given a discrete image $I$ and a kernel $K$ of finite support, the convolution of $I$ with $K$ is
$$(I * K)(x, y) = \sum_{i} \sum_{j} K(i, j)\, I(x - i,\, y - j).$$
Input: an image $I$ and a kernel $K$. Output: a filtered image $I * K$ of the same domain, each pixel a linear combination of input pixels weighted by $K$. Every linear, shift-invariant filter is exactly characterised by its kernel via this operation.
The index reversal, $I(x - i, y - j)$ rather than $I(x + i, y + j)$, is the defining difference between convolution and cross-correlation. For a symmetric kernel the two operations coincide; for an asymmetric kernel they do not, and conflating them is a common source of implementation error.
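The definition can be sketched directly in plain Python. The function names `convolve2d` and `correlate2d` below are illustrative (not taken from any library), and the sketch assumes odd kernel sizes, a kernel origin at the centre, and zero padding outside the image.

```python
def convolve2d(image, kernel):
    """Direct 2-D convolution with zero padding; kernel origin at its centre."""
    h, w = len(image), len(image[0])
    ry, rx = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(-ry, ry + 1):
                for j in range(-rx, rx + 1):
                    yy, xx = y - i, x - j  # index reversal: minus signs
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i + ry][j + rx] * image[yy][xx]
            out[y][x] = acc
    return out


def correlate2d(image, kernel):
    """Cross-correlation, computed as convolution with the kernel flipped in both axes."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return convolve2d(image, flipped)


impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
asym = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
conv_out = convolve2d(impulse, asym)   # reproduces the kernel
corr_out = correlate2d(impulse, asym)  # reproduces the kernel rotated by 180 degrees
```

Convolving a unit impulse returns the kernel itself, while cross-correlating it returns the flipped kernel, which is a quick way to check which convention a routine follows.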
Mathematical Description
Linearity and shift-invariance
A filter is linear if it commutes with scalar multiplication and addition, and shift-invariant if translating the input translates the output by the same amount. Every filter satisfying both properties is representable as convolution with some kernel — there are no other linear shift-invariant filters. The kernel is therefore a complete characterisation of any such operation.
Separability
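A 2-D kernel is separable if it factors into a product of two 1-D kernels, $K(i, j) = k_1(i)\, k_2(j)$. The 2-D convolution then decomposes into two successive 1-D passes, reducing the per-pixel cost from $k^2$ multiplications to $2k$ for a $k \times k$ kernel. The Gaussian
$$G_\sigma(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$
is separable, $G_\sigma(x, y) = g_\sigma(x)\, g_\sigma(y)$ with $g_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$. The Canny edge detector exploits this: its 2-D Gaussian convolution decomposes into two 1-D passes.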
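The equivalence of the two-pass and full 2-D evaluations can be checked numerically. The following is a minimal sketch in plain Python, assuming zero padding and a sampled, renormalised Gaussian; the helper names `gaussian_1d`, `conv2d`, and `conv_separable` are hypothetical.

```python
import math


def gaussian_1d(sigma, radius):
    """Sampled 1-D Gaussian of length 2*radius + 1, renormalised to sum to 1."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]


def conv2d(image, kernel):
    """Direct 2-D convolution, zero padding, kernel origin at its centre."""
    h, w = len(image), len(image[0])
    ry, rx = len(kernel) // 2, len(kernel[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(-ry, ry + 1):
                for j in range(-rx, rx + 1):
                    if 0 <= y - i < h and 0 <= x - j < w:
                        out[y][x] += kernel[i + ry][j + rx] * image[y - i][x - j]
    return out


def conv_separable(image, k_col, k_row):
    """Horizontal pass with k_row, then vertical pass with k_col."""
    tmp = conv2d(image, [k_row])               # 1 x k kernel: k multiplies per pixel
    return conv2d(tmp, [[v] for v in k_col])   # k x 1 kernel: another k multiplies


g = gaussian_1d(1.0, 2)
kernel_2d = [[a * b for b in g] for a in g]    # outer product: the 5 x 5 Gaussian
img = [[float((3 * y + 5 * x) % 7) for x in range(6)] for y in range(5)]
full = conv2d(img, kernel_2d)                  # 25 multiplies per pixel
sep = conv_separable(img, g, g)                # 10 multiplies per pixel
max_diff = max(abs(full[y][x] - sep[y][x]) for y in range(5) for x in range(6))
```

Here `max_diff` is at the level of floating-point rounding, so the two-pass result matches the full 2-D convolution while using $2k$ rather than $k^2$ multiplications per pixel.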
Gaussian and derivative-of-Gaussian kernels
Convolving an image $I$ with a Gaussian $G_\sigma$ produces a smoothed image that attenuates high-frequency noise while preserving low-frequency structure; the scale $\sigma$ controls the smoothing extent. Because convolution commutes with differentiation, the smoothed gradient satisfies $\nabla (G_\sigma * I) = (\nabla G_\sigma) * I$: differentiating a smoothed image equals convolving with the derivative of a Gaussian. Canny's variational analysis establishes the first derivative of a Gaussian,
$$g_\sigma'(x) = -\frac{x}{\sqrt{2\pi}\,\sigma^3} \exp\!\left(-\frac{x^2}{2\sigma^2}\right),$$
as a close approximation to the optimal 1-D step-edge filter under simultaneous signal-to-noise, localisation, and single-response criteria. The discrete derivative kernels used in practice are treated in the image-gradient concept.
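As an illustration, a sampled derivative-of-Gaussian kernel can be applied to a 1-D step edge. This is a sketch only: the names `dog_kernel` and `convolve1d`, the choice of $\sigma = 1.5$, the radius of 5 samples, and the use of replicate padding are all assumptions made for the example.

```python
import math


def dog_kernel(sigma, radius):
    """Sampled first derivative of a 1-D Gaussian: g'(x) = -x / sigma^2 * g(x)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return [-x / sigma**2 * norm * math.exp(-x * x / (2.0 * sigma**2))
            for x in range(-radius, radius + 1)]


def convolve1d(signal, kernel):
    """1-D convolution with replicate padding; kernel origin at its centre."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for x in range(n):
        acc = 0.0
        for j in range(-r, r + 1):
            idx = min(max(x - j, 0), n - 1)  # clamp to the valid range
            acc += kernel[j + r] * signal[idx]
        out.append(acc)
    return out


kernel = dog_kernel(1.5, 5)
step = [0.0] * 10 + [1.0] * 10            # rising edge between samples 9 and 10
response = convolve1d(step, kernel)
peak = max(range(len(response)), key=lambda i: abs(response[i]))
```

The kernel is antisymmetric and sums to approximately zero, so flat regions give no response, and the response to the rising edge is positive with its maximum at the samples adjacent to the step.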
Boundary handling
Convolution is undefined within $\lfloor k/2 \rfloor$ pixels of the image border for a width-$k$ kernel, because the kernel support extends past the image there. Zero padding sets exterior pixels to zero, introducing a dark-border artefact; replicate padding repeats border pixels, biasing gradient estimates near edges; reflect padding mirrors the image at the boundary, which preserves local gradient structure and makes it the usual choice for derivative kernels.
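The three padding rules amount to different ways of mapping an out-of-range index back into the image. The sketch below uses hypothetical helpers `pad_index` and `pad_row`, and assumes the reflect convention that does not repeat the border sample (the convention NumPy calls `reflect`; the variant that repeats it is usually called `symmetric`).

```python
def pad_index(i, n, mode):
    """Map index i to a valid index in [0, n), or None where zero padding applies."""
    if 0 <= i < n:
        return i
    if mode == "zero":
        return None
    if mode == "replicate":
        return min(max(i, 0), n - 1)
    if mode == "reflect":
        # Mirror about the border sample without repeating it: -1 -> 1, n -> n - 2.
        period = 2 * (n - 1)  # requires n >= 2
        i = abs(i) % period
        return i if i < n else period - i
    raise ValueError(f"unknown padding mode: {mode}")


def pad_row(row, radius, mode):
    """Extend a 1-D row by `radius` samples on each side using the given mode."""
    n = len(row)
    out = []
    for i in range(-radius, n + radius):
        j = pad_index(i, n, mode)
        out.append(0 if j is None else row[j])
    return out


row = [10, 20, 30, 40]
zero = pad_row(row, 2, "zero")            # [0, 0, 10, 20, 30, 40, 0, 0]
replicate = pad_row(row, 2, "replicate")  # [10, 10, 10, 20, 30, 40, 40, 40]
reflect = pad_row(row, 2, "reflect")      # [30, 20, 10, 20, 30, 40, 30, 20]
```

On this ramp, reflect padding keeps a slope of constant magnitude across the border, whereas replicate padding flattens it and zero padding creates an abrupt drop.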
Convolution theorem and FFT evaluation
By the convolution theorem, spatial-domain convolution equals pointwise multiplication in the frequency domain, $\mathcal{F}\{I * K\} = \mathcal{F}\{I\} \cdot \mathcal{F}\{K\}$. Direct spatial convolution costs $O(N k^2)$ for an $N$-pixel image and a $k \times k$ kernel; FFT-based convolution costs $O(N \log N)$ regardless of kernel size, making it preferable for large kernels such as wide Gaussians.
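The theorem can be verified on a small 1-D signal. The sketch below uses a naive $O(N^2)$ discrete Fourier transform written with the standard library so that it stays dependency-free; a real implementation would call an FFT routine to obtain the $O(N \log N)$ cost. The function names are illustrative. Note that the discrete theorem applies to circular convolution, so signal and kernel are assumed to be zero-padded to a common length of at least $N + k - 1$ when a linear convolution is wanted.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(N^2)); stands in for an FFT here."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def circular_convolve(a, b):
    """Direct circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]


def spectral_convolve(a, b):
    """Convolution via pointwise multiplication of the two spectra."""
    fa, fb = dft(a), dft(b)
    product = [p * q for p, q in zip(fa, fb)]
    return [v.real for v in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]    # zero-padded to length 8
kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]  # zero-padded to length 8
direct = circular_convolve(signal, kernel)
spectral = spectral_convolve(signal, kernel)
max_err = max(abs(d - s) for d, s in zip(direct, spectral))
```

The two results agree to rounding error, and because the padded length exceeds the combined support, the circular result coincides with the linear convolution.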
Learned convolution in CNNs
In a convolutional neural network the kernel is not fixed analytically but learned from data by gradient descent. Each layer applies a bank of $C_{\text{out}}$ kernels of size $k \times k \times C_{\text{in}}$ to a $C_{\text{in}}$-channel input, producing $C_{\text{out}}$ output maps. Weight sharing (the same kernel at every spatial position) cuts the parameter count to $k^2 C_{\text{in}} C_{\text{out}}$ and enforces translation equivariance. AlexNet uses $11 \times 11$ first-layer kernels with stride 4, then $5 \times 5$ and $3 \times 3$; VGG replaced large first-layer kernels with stacks of $3 \times 3$ convolutions. With $C$ input and output channels, two stacked $3 \times 3$ layers cover a $5 \times 5$ receptive field and three cover $7 \times 7$, at parameter cost $27C^2$ versus $49C^2$ for one $7 \times 7$ layer. The stacked $3 \times 3$ block became the standard convolutional primitive.
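The parameter and receptive-field arithmetic for stacked layers can be worked through in a few lines. The helper names and the channel width of 64 are illustrative assumptions; biases are ignored, and all layers are assumed to have stride 1 with equal input and output channel counts.

```python
def conv_params(k, c_in, c_out):
    """Weights in one convolutional layer with k x k kernels (biases ignored)."""
    return k * k * c_in * c_out


def stacked_receptive_field(k, layers):
    """Receptive field side length of `layers` stacked stride-1 k x k convolutions."""
    return layers * (k - 1) + 1


C = 64  # illustrative channel width, not taken from the text
three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2 = 110592
one_7x7 = conv_params(7, C, C)        # 49 * C^2 = 200704
rf_two = stacked_receptive_field(3, 2)    # 5
rf_three = stacked_receptive_field(3, 3)  # 7
```

Three stacked $3 \times 3$ layers therefore see the same $7 \times 7$ window as one large kernel while using 27/49, roughly 55%, of the weights, and they interleave additional non-linearities.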
Numerical Concerns
Kernel normalisation. A kernel whose coefficients do not sum to 1 scales the output's mean brightness; smoothing kernels should sum to 1, derivative kernels to 0 (antisymmetric). A non-zero-sum derivative kernel introduces a DC offset in the gradient map.
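Gaussian truncation. The Gaussian has infinite support and is truncated in practice; the standard rule retains $\pm 3\sigma$, giving a width of $2\lceil 3\sigma \rceil + 1$. Truncating at a substantially smaller radius produces visible ringing; the truncated kernel must be renormalised so that its coefficients again sum to 1.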
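A short sketch of truncation and renormalisation follows. It assumes a radius of $\lceil n\sigma \rceil$ samples with $n = 3$ by default; the function name `truncated_gaussian` and its `n_sigmas` parameter are hypothetical.

```python
import math


def truncated_gaussian(sigma, n_sigmas=3.0):
    """Return (renormalised kernel, mass retained before renormalisation)."""
    radius = int(math.ceil(n_sigmas * sigma))
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    raw = [norm * math.exp(-x * x / (2.0 * sigma * sigma))
           for x in range(-radius, radius + 1)]
    mass = sum(raw)  # below 1 because the tails were cut off
    return [v / mass for v in raw], mass


kernel_3s, mass_3s = truncated_gaussian(2.0)       # radius 6, width 13
kernel_1s, mass_1s = truncated_gaussian(2.0, 1.0)  # radius 2, width 5
```

For $\sigma = 2$ the $\pm 3\sigma$ kernel retains more than 99% of the mass before renormalisation, whereas cutting at $\pm 1\sigma$ retains only about 0.79, which would darken the output noticeably if the kernel were not rescaled.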
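Accumulator precision. A $k \times k$ kernel with unit weights over an 8-bit image accumulates up to $255 k^2$; an unnormalised 8-bit accumulator therefore overflows for any kernel larger than $1 \times 1$, and a 16-bit unsigned accumulator for kernels larger than $16 \times 16$. Floating-point accumulators avoid overflow but require casting input pixels to float.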
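The overflow bound is simple arithmetic and can be tabulated. The sketch assumes an unsigned accumulator and a worst case of unit weights over saturated pixels; both function names are illustrative.

```python
def max_accumulated(k, pixel_max=255):
    """Worst-case sum for a k x k kernel of unit weights over saturated pixels."""
    return k * k * pixel_max


def largest_safe_kernel(acc_bits, pixel_max=255):
    """Largest k whose worst-case sum still fits an unsigned acc_bits accumulator."""
    limit = 2**acc_bits - 1
    k = 1
    while max_accumulated(k + 1, pixel_max) <= limit:
        k += 1
    return k


safe_8 = largest_safe_kernel(8)    # 1: even a 2 x 2 sum (1020) exceeds 255
safe_16 = largest_safe_kernel(16)  # 16: 16 * 16 * 255 = 65280 <= 65535
worst_5x5 = max_accumulated(5)     # 6375
```

Kernels with weights larger than 1, or signed accumulators, tighten these limits further.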
Separable-pass intermediate precision. When a separable convolution is split into a horizontal then a vertical pass, the intermediate result must be stored in sufficient precision — 16-bit integer intermediates introduce quantisation error that the second pass amplifies; 32-bit float intermediates are the safe choice.
Boundary bias. Padding introduces artificial signal near the border: zero padding depresses gradient magnitudes, replicate padding inflates corner gradients. Algorithms detecting features near the image boundary should either exclude a border strip of width $\lfloor k/2 \rfloor$ or use reflect padding throughout.
Cross-correlation convention. Most deep-learning frameworks implement cross-correlation and call it convolution; this equals convolution with the flipped kernel. For symmetric kernels the distinction is immaterial, but applying cross-correlation where convolution is intended effectively flips an asymmetric kernel. For an antisymmetric kernel such as a derivative-of-Gaussian, the flip negates the response, a silent directional error.
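The directional error is easy to reproduce in 1-D with a central-difference kernel. In this sketch the function names are illustrative and only the valid (unpadded) region is evaluated.

```python
def convolve1d_valid(signal, kernel):
    """True convolution (kernel index reversed), valid region only."""
    r = len(kernel) // 2
    return [sum(kernel[j + r] * signal[x - j] for j in range(-r, r + 1))
            for x in range(r, len(signal) - r)]


def correlate1d_valid(signal, kernel):
    """Cross-correlation (no index reversal), valid region only."""
    r = len(kernel) // 2
    return [sum(kernel[j + r] * signal[x + j] for j in range(-r, r + 1))
            for x in range(r, len(signal) - r)]


ramp = [0.0, 1.0, 2.0, 3.0, 4.0]           # slope +1 everywhere
central_diff = [0.5, 0.0, -0.5]            # antisymmetric derivative kernel
as_conv = convolve1d_valid(ramp, central_diff)    # [1.0, 1.0, 1.0]
as_corr = correlate1d_valid(ramp, central_diff)   # [-1.0, -1.0, -1.0]
fixed = correlate1d_valid(ramp, central_diff[::-1])  # flip first: [1.0, 1.0, 1.0]
```

When a routine is known to compute cross-correlation, flipping the kernel before the call recovers the convolution result.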
Where it appears
Convolution is the shared computational primitive of every spatial filtering operation in the atlas.
- canny-edge-detector — smoothing by $G_\sigma$ and gradient computation by $\nabla G_\sigma$ are both convolutions; the first derivative of a Gaussian is established by variational analysis as a close approximation to the optimal step-edge kernel.
- image-gradient — every discrete derivative kernel (forward difference, central difference, Sobel, Scharr) is a convolution kernel; the derivative-of-Gaussian identity follows from convolution's commutativity with differentiation.
- scale-space — the Gaussian scale-space is a family of convolutions with Gaussians of increasing width; SIFT approximates the Laplacian of Gaussian by differences of Gaussian convolutions, while SURF approximates Gaussian second derivatives with box filters.
- convolutional-neural-network — the learned kernel is the central abstraction; AlexNet established the multi-layer learned-convolution pipeline and VGG showed that stacked $3 \times 3$ kernels dominate larger single kernels.
References
- J. Canny. A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
- A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS, 2012.
- K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
- R. C. Gonzalez, R. E. Woods. Digital Image Processing, 4th ed. Pearson, 2018.
- R. Szeliski. Computer Vision: Algorithms and Applications, 2nd ed. Springer, 2022.