Goal
Compute a dense, locally normalised gradient-orientation histogram descriptor from a fixed-size image window and produce a single real-valued feature vector for linear SVM classification. Input: an RGB, grayscale, or LAB image window, default 64×128 pixels for pedestrian detection. Output: a 3780-dimensional descriptor vector formed by concatenating normalised per-block histograms over a regular grid (7×15 block positions × 36 values per block). The defining property is the combination of magnitude-weighted unsigned-orientation histograms over 8×8 px cells, grouped into overlapping 2×2 cell blocks with L2-Hys normalisation, without any prior image smoothing. Ablation shows these choices to be jointly necessary for high detection accuracy at $10^{-4}$ false positives per window (FPPW).
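The 3780 figure follows from the window, cell, block, and stride sizes stated above. A minimal sketch of that arithmetic, assuming the block fits inside the window and the stride divides the remaining span evenly (the function names `blocks_along` and `descriptor_len` are illustrative, not part of any library):

```rust
/// Number of block positions along one axis of the window.
/// All arguments are in pixels. Assumes `window >= block` and `stride > 0`.
fn blocks_along(window: usize, block: usize, stride: usize) -> usize {
    (window - block) / stride + 1
}

/// Descriptor length for a `w`×`h` px window.
/// `cell` is the cell side in px, `cells_per_side` the block side in cells,
/// `stride` the block stride in px, `bins` the number of orientation bins.
fn descriptor_len(
    w: usize,
    h: usize,
    cell: usize,
    cells_per_side: usize,
    stride: usize,
    bins: usize,
) -> usize {
    let block = cell * cells_per_side; // block side in px
    let n_blocks = blocks_along(w, block, stride) * blocks_along(h, block, stride);
    n_blocks * cells_per_side * cells_per_side * bins
}
```

With the defaults (64×128 window, 8 px cells, 2×2-cell blocks, 8 px stride, 9 bins) this gives 7 × 15 = 105 blocks of 36 values each, i.e. 3780.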
Algorithm
Let $I$ denote the input colour image on pixel domain $\Omega$. Let $G_x, G_y$ denote the horizontal and vertical gradient components at a pixel. Let $m$ denote gradient magnitude. Let $\theta \in [0, \pi)$ denote unsigned gradient orientation (sign discarded). Let $c$ index a cell; default cell size is $8 \times 8$ px. Let $b$ index a block; default block is $2 \times 2$ cells ($16 \times 16$ px), stride $8$ px. Let $h_{b,c} \in \mathbb{R}^{9}$ denote the raw histogram for cell $c$ within block $b$. Let $v_b \in \mathbb{R}^{36}$ denote the concatenated raw histograms for block $b$ ($4$ cells $\times$ $9$ bins). Let $\hat{v}_b$ denote the L2-Hys-normalised block vector. Let $f \in \mathbb{R}^{3780}$ denote the final descriptor (concatenation of all $\hat{v}_b$). Let $\sigma_w = 8$ px denote the Gaussian spatial window applied within each block (half block width). Let $\tau = 0.2$ denote the L2-Hys clip threshold (adopted from SIFT [2]). Let $\epsilon$ denote a small regulariser preventing division by zero.
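The gradient is computed by the centred derivative filter $[-1, 0, 1]$, applied in $x$ and $y$ with no prior smoothing ($\sigma = 0$):

$$G_x(x, y) = I(x+1, y) - I(x-1, y), \qquad G_y(x, y) = I(x, y+1) - I(x, y-1)$$

$$m(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \qquad \theta(x, y) = \operatorname{atan2}\bigl(G_y(x, y),\, G_x(x, y)\bigr) \bmod \pi$$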
For colour images, the gradient vector is taken from the colour channel with the largest magnitude at each pixel.
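The per-pixel channel selection can be sketched as follows. This is an illustration only: it assumes each colour channel is stored as a separate row-major `f32` plane and that `(x, y)` is an interior pixel; `channel_gradient` and `dominant_gradient` are hypothetical helper names.

```rust
/// Centred [-1, 0, 1] gradient of one channel at an interior pixel (x, y).
/// Assumes 1 <= x < width - 1 and 1 <= y < height - 1.
fn channel_gradient(plane: &[f32], width: usize, x: usize, y: usize) -> (f32, f32) {
    let gx = plane[y * width + x + 1] - plane[y * width + x - 1];
    let gy = plane[(y + 1) * width + x] - plane[(y - 1) * width + x];
    (gx, gy)
}

/// Gradient vector of the channel whose gradient magnitude is largest at (x, y).
/// Squared magnitudes are compared, which avoids a square root per channel.
fn dominant_gradient(channels: &[&[f32]], width: usize, x: usize, y: usize) -> (f32, f32) {
    let mut best = (0.0_f32, 0.0_f32);
    let mut best_sq = -1.0_f32;
    for &plane in channels {
        let (gx, gy) = channel_gradient(plane, width, x, y);
        let sq = gx * gx + gy * gy;
        if sq > best_sq {
            best_sq = sq;
            best = (gx, gy);
        }
    }
    best
}
```

Note that the whole gradient vector is taken from the winning channel; the components are not mixed across channels.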
Each pixel $p$ in cell $c$ casts a magnitude-weighted vote into the 9-bin unsigned-orientation histogram over $[0, \pi)$. Votes are bilinearly interpolated across the two adjacent bin centres in orientation and across adjacent cell boundaries in position:

$$h_{b,c}[k] = \sum_{p \in c} w_k\bigl(\theta(p)\bigr)\, g_{\sigma_w}(p - p_b)\, m(p)$$

where $w_k(\theta)$ is the bilinear weight for bin $k$ from the orientation $\theta$, and $g_{\sigma_w}(p - p_b)$ is the Gaussian weight with $\sigma_w = 8$ px centred on the block centre $p_b$.
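The two per-pixel quantities that enter each vote, the unsigned orientation and the Gaussian block-window weight, can be sketched as below. The helper names are hypothetical, and the Gaussian is left unnormalised on the assumption that a constant factor is absorbed by the later block normalisation.

```rust
use std::f32::consts::PI;

/// Gradient magnitude and unsigned orientation in [0, π).
fn magnitude_and_orientation(gx: f32, gy: f32) -> (f32, f32) {
    let mag = (gx * gx + gy * gy).sqrt();
    let mut theta = gy.atan2(gx); // signed angle in [-π, π]
    if theta < 0.0 {
        theta += PI; // discard the sign: fold the lower half-plane upwards
    }
    if theta >= PI {
        theta -= PI; // map the endpoint π back to 0
    }
    (mag, theta)
}

/// Unnormalised Gaussian weight for a pixel offset (dx, dy) px from the block centre.
fn block_window_weight(dx: f32, dy: f32, sigma: f32) -> f32 {
    (-(dx * dx + dy * dy) / (2.0 * sigma * sigma)).exp()
}

/// Orientation and window-weighted vote strength for one pixel.
/// The returned vote is what would then be split between the two nearest bins.
fn pixel_vote(gx: f32, gy: f32, dx: f32, dy: f32, sigma: f32) -> (f32, f32) {
    let (mag, theta) = magnitude_and_orientation(gx, gy);
    (theta, mag * block_window_weight(dx, dy, sigma))
}
```

With a window scale of 8 px, a pixel one cell width away from the block centre contributes with weight exp(-0.5), roughly 0.61 of its gradient magnitude.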
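Given the raw block vector $v_b$, L2-Hys proceeds in two passes:

$$v_b' = \min\!\left(\frac{v_b}{\lVert v_b \rVert_2 + \epsilon},\ \tau\right), \qquad \hat{v}_b = \frac{v_b'}{\lVert v_b' \rVert_2 + \epsilon}$$

where the $\min$ is taken element-wise. The complete pipeline, step by step: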
- Compute $G_x$, $G_y$, $m$ and $\theta$ at every pixel with the centred $[-1, 0, 1]$ filter and no prior smoothing; for colour input, select the channel with largest $m$.
- For each cell $c$, accumulate a 9-bin magnitude-weighted histogram over unsigned orientations $\theta \in [0, \pi)$ with bilinear interpolation in orientation and position, weighted by the Gaussian spatial window $\sigma_w = 8$ px centred on the enclosing block.
- Group every $2 \times 2$ adjacent cells into a block $b$; concatenate their histograms to form $v_b \in \mathbb{R}^{36}$; advance the block window by the stride of $8$ px to produce $7 \times 15 = 105$ overlapping block positions for the default $64 \times 128$ window.
- Normalise each $v_b$ with L2-Hys (L2-normalise → clip at $\tau = 0.2$ → renormalise) to obtain $\hat{v}_b$.
- Concatenate all 105 vectors $\hat{v}_b$ into the final descriptor $f \in \mathbb{R}^{3780}$; feed $f$ to a linear SVM for classification.
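The last three steps can be sketched as a block-assembly routine over a precomputed grid of cell histograms. This is a simplified illustration, not the reference implementation: it assumes one shared histogram per cell (so the per-block Gaussian window, which makes cell histograms block-dependent, is ignored), a block stride of one cell, and hypothetical names `l2_hys_slice` and `assemble_descriptor`.

```rust
const BINS: usize = 9;

/// L2-Hys on a slice: L2-normalise, clip at `clip`, renormalise.
fn l2_hys_slice(v: &mut [f32], clip: f32, eps: f32) {
    let n1 = v.iter().map(|x| x * x).sum::<f32>().sqrt() + eps;
    for x in v.iter_mut() {
        *x = (*x / n1).min(clip);
    }
    let n2 = v.iter().map(|x| x * x).sum::<f32>().sqrt() + eps;
    for x in v.iter_mut() {
        *x /= n2;
    }
}

/// Build the descriptor from a row-major grid of `cells_x` × `cells_y` cell histograms.
/// Blocks are 2×2 cells and advance by one cell, so neighbouring blocks overlap.
fn assemble_descriptor(cells: &[[f32; BINS]], cells_x: usize, cells_y: usize) -> Vec<f32> {
    assert!(cells_x >= 2 && cells_y >= 2, "need at least one 2×2 block");
    assert_eq!(cells.len(), cells_x * cells_y);
    let mut out = Vec::with_capacity((cells_x - 1) * (cells_y - 1) * 4 * BINS);
    for by in 0..cells_y - 1 {
        for bx in 0..cells_x - 1 {
            // Concatenate the four cell histograms of this block (36 values).
            let mut block = Vec::with_capacity(4 * BINS);
            for (dy, dx) in [(0, 0), (0, 1), (1, 0), (1, 1)] {
                block.extend_from_slice(&cells[(by + dy) * cells_x + bx + dx]);
            }
            l2_hys_slice(&mut block, 0.2, 1e-5);
            out.extend_from_slice(&block);
        }
    }
    out
}
```

For the default window the grid is 8 × 16 cells, which yields 7 × 15 blocks and a 3780-element vector.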
Implementation
The per-cell histogram accumulation and L2-Hys normalisation in Rust:
```rust
const BINS: usize = 9;
const CELLS_PER_BLOCK: usize = 4; // 2×2 cells
const BLOCK_DIM: usize = BINS * CELLS_PER_BLOCK; // 36
const CLIP: f32 = 0.2; // L2-Hys clip threshold τ
const EPS: f32 = 1e-5; // regulariser ε

/// Centred [-1, 0, 1] gradient at pixel (x, y) in a row-major f32 plane of given width.
/// Neighbour indices are clamped to the image border (an assumed border policy),
/// so border pixels neither underflow the index nor read out of bounds.
fn gradient(plane: &[f32], width: usize, x: usize, y: usize) -> (f32, f32) {
    let height = plane.len() / width;
    let (x0, x1) = (x.saturating_sub(1), (x + 1).min(width - 1));
    let (y0, y1) = (y.saturating_sub(1), (y + 1).min(height - 1));
    let gx = plane[y * width + x1] - plane[y * width + x0];
    let gy = plane[y1 * width + x] - plane[y0 * width + x];
    (gx, gy)
}

/// Accumulate one pixel's vote into a 9-bin unsigned-orientation histogram.
/// `theta` is in [0, π); `mag` is the gradient magnitude; `hist` has BINS slots.
/// Bin k is treated as centred at k·π/BINS, and the last bin wraps around to bin 0.
fn accumulate_bin(hist: &mut [f32; BINS], theta: f32, mag: f32) {
    let bin_width = std::f32::consts::PI / BINS as f32;
    let bin_f = theta / bin_width;
    let bin0 = bin_f.floor() as usize % BINS;
    let bin1 = (bin0 + 1) % BINS;
    let w1 = bin_f - bin_f.floor(); // fractional part: share given to the upper bin
    hist[bin0] += mag * (1.0 - w1);
    hist[bin1] += mag * w1;
}

/// L2-Hys normalisation of a 36-element raw block vector in-place.
fn l2_hys(v: &mut [f32; BLOCK_DIM]) {
    // First pass: L2-normalise.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt() + EPS;
    for x in v.iter_mut() {
        *x /= norm;
    }
    // Clip every component at τ = 0.2.
    for x in v.iter_mut() {
        *x = x.min(CLIP);
    }
    // Second pass: renormalise the clipped vector.
    let norm2 = v.iter().map(|x| x * x).sum::<f32>().sqrt() + EPS;
    for x in v.iter_mut() {
        *x /= norm2;
    }
}
```
Remarks
- Descriptor extraction is $O(N)$ per window in image area $N$; a full sliding-window scale pyramid adds a factor of $L$ pyramid levels.
- Prior Gaussian smoothing before gradient computation damages performance: moving from $\sigma = 0$ to $\sigma = 2$ reduces recall from 89% to 80% at $10^{-4}$ FPPW. Of the design choices listed here, the no-smoothing rule is the most damaging one to violate.
- HOG is not rotation-invariant; orientation is computed in image coordinates, so the descriptor encodes absolute image-plane direction rather than object-relative direction.
- Hard-example mining (retraining the linear SVM on its own false positives from the negative set) adds approximately 5% detection rate at $10^{-4}$ FPPW.
- Unsigned orientation ($[0, \pi)$, 9 bins) suits pedestrian detection because clothing contrast polarity is variable; signed orientation ($[0, 2\pi)$) is preferable for object classes with consistent contrast polarity.
- Deformable Part Models (Felzenszwalb et al., 2010) build a deformable mixture of HOG filters on top of this descriptor.
- The HOG + linear-SVM and HOG + DPM sliding-window pipelines were superseded for general object detection by Faster R-CNN, which replaces handcrafted gradient histograms with shared convolutional features and a learned Region Proposal Network.
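As a rough illustration of the sliding-window cost mentioned in the first remark, the sketch below counts pyramid levels and window positions per level. The scale ratio and window stride are hypothetical parameters chosen by the caller, not values taken from this document, and the function names are illustrative.

```rust
/// Number of pyramid levels at which a `win_w`×`win_h` window still fits,
/// shrinking the image by `ratio` (> 1) per level.
fn pyramid_levels(img_w: usize, img_h: usize, win_w: usize, win_h: usize, ratio: f32) -> usize {
    assert!(ratio > 1.0, "ratio must shrink the image");
    let (mut w, mut h) = (img_w as f32, img_h as f32);
    let mut levels = 0;
    while w >= win_w as f32 && h >= win_h as f32 {
        levels += 1;
        w /= ratio;
        h /= ratio;
    }
    levels
}

/// Number of window positions in a single image at the given stride (px).
fn windows_in(img_w: usize, img_h: usize, win_w: usize, win_h: usize, stride: usize) -> usize {
    if img_w < win_w || img_h < win_h {
        return 0;
    }
    ((img_w - win_w) / stride + 1) * ((img_h - win_h) / stride + 1)
}
```

The total number of descriptor evaluations is the sum of `windows_in` over all levels, which is why cell histograms are usually computed once per level and shared between overlapping windows.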