HOG: Histograms of Oriented Gradients | VitaVision
Back to atlas

HOG: Histograms of Oriented Gradients

5 min readIntermediateView in graph
Based on
Histograms of Oriented Gradients for Human Detection
Dalal, Triggs · CVPR 2005
DOI ↗

Goal

Compute a dense, locally-normalised gradient-orientation histogram descriptor from a fixed-size image window and produce a single real-valued feature vector for linear SVM classification. Input: an RGB (or grayscale or LAB) image window, default 64×128 pixels for pedestrian detection. Output: a 3780-dimensional descriptor vector formed by concatenating normalised per-block histograms over a regular grid. The defining property is the combination of magnitude-weighted unsigned-orientation histograms over 8×8 px cells, grouped into overlapping 2×2 cell blocks with L2-Hys normalisation, without any prior image smoothing — a design shown by ablation to be jointly necessary for high detection accuracy at 10410^{-4} false positives per window (FPPW).

Algorithm

Let I:ΩR3I : \Omega \to \mathbb{R}^3 denote the input colour image on pixel domain Ω\Omega. Let (gx,gy)(g_x, g_y) denote the horizontal and vertical gradient components at a pixel. Let g=gx2+gy2\|g\| = \sqrt{g_x^2 + g_y^2} denote gradient magnitude. Let θ[0,π)\theta \in [0, \pi) denote unsigned gradient orientation (sign discarded). Let cc index a cell; default cell size is 8×88 \times 8 px. Let bb index a block; default block is 2×22 \times 2 cells (16×1616 \times 16 px), stride 88 px. Let hc,bR9h_{c,b} \in \mathbb{R}^9 denote the raw histogram for cell cc within block bb. Let vBR36v_B \in \mathbb{R}^{36} denote the concatenated raw histograms for block BB (44 cells ×\times 99 bins). Let vBR36v_B' \in \mathbb{R}^{36} denote the L2-Hys-normalised block vector. Let fR3780f \in \mathbb{R}^{3780} denote the final descriptor (concatenation of all vBv_B'). Let σw=8\sigma_w = 8 px denote the Gaussian spatial window applied within each block (half block width). Let τ=0.2\tau = 0.2 denote the L2-Hys clip threshold (adopted from SIFT [2]). Let ε>0\varepsilon > 0 denote a small regulariser preventing division by zero.

Definition
Gradient

The gradient is computed by the centred derivative filter [1,  0,  1][-1,\; 0,\; 1], applied in xx and yy with no prior smoothing (σ=0\sigma = 0):

gx(p)=I(p+x^)I(px^),gy(p)=I(p+y^)I(py^).g_x(p) = I(p + \hat{x}) - I(p - \hat{x}), \quad g_y(p) = I(p + \hat{y}) - I(p - \hat{y}).

For colour images, the gradient vector is taken from the colour channel with the largest magnitude g\|g\| at each pixel.

Definition
Cell histogram

Each pixel pp in cell cc casts a magnitude-weighted vote into the 9-bin unsigned-orientation histogram over [0,π)[0, \pi). Votes are bilinearly interpolated across the two adjacent bin centres in orientation and across adjacent cell boundaries in position:

hc,b[k]=pcg(p)worient(p,k)wspatial(p),h_{c,b}[k] = \sum_{p \in c} \|g(p)\| \cdot w_{\text{orient}}(p, k) \cdot w_{\text{spatial}}(p),

where worientw_{\text{orient}} is the bilinear weight for bin kk from the orientation θ(p)\theta(p), and wspatialw_{\text{spatial}} is the Gaussian weight exp ⁣(ppc2/(2σw2))\exp\!\bigl(-\|p - p_c\|^2 / (2\sigma_w^2)\bigr) with σw=8\sigma_w = 8 px centred on the block centre pcp_c.

Definition
L2-Hys block normalisation

Given the raw block vector vBv_B, L2-Hys proceeds in two passes:

vB(1)=vBvB22+ε2,vB(2)=min ⁣(vB(1),  τ)(component-wise clip to τ=0.2),vB=vB(2)vB(2)22+ε2.\begin{aligned} v_B^{(1)} &= \frac{v_B}{\sqrt{\|v_B\|_2^2 + \varepsilon^2}}, \\ v_B^{(2)} &= \min\!\bigl(v_B^{(1)},\; \tau\bigr) \quad (\text{component-wise clip to } \tau = 0.2), \\ v_B' &= \frac{v_B^{(2)}}{\sqrt{\|v_B^{(2)}\|_2^2 + \varepsilon^2}}. \end{aligned}
Algorithm
HOG descriptor
Input: Colour or grayscale image window; default 64×12864 \times 128 px
Output: 3780-dimensional descriptor vector ff
  1. Compute (gx,gy)(g_x, g_y) at every pixel with the centred filter [1,0,1][-1, 0, 1] and no prior smoothing; for colour input, select the channel with largest g\|g\|.
  2. For each cell cc, accumulate a 9-bin magnitude-weighted histogram hc,bh_{c,b} over unsigned orientations [0,π)[0, \pi) with bilinear interpolation in orientation and position, weighted by the Gaussian spatial window σw=8\sigma_w = 8 px centred on the enclosing block.
  3. Group every 2×22 \times 2 adjacent cells into a block BB; concatenate their histograms to form vBR36v_B \in \mathbb{R}^{36}; advance the block window by the stride of 88 px to produce 7×15=1057 \times 15 = 105 overlapping block positions for the default window.
  4. Normalise each vBv_B with L2-Hys (L2-normalise → clip at τ=0.2\tau = 0.2 → renormalise) to obtain vBv_B'.
  5. Concatenate all 105 vectors vBv_B' into the final descriptor fR3780f \in \mathbb{R}^{3780}; feed to a linear SVM for classification.

Implementation

The per-cell histogram accumulation and L2-Hys normalisation in Rust:

const BINS: usize = 9;
const CELLS_PER_BLOCK: usize = 4; // 2×2 cells
const BLOCK_DIM: usize = BINS * CELLS_PER_BLOCK; // 36
const CLIP: f32 = 0.2;
const EPS: f32 = 1e-5;

/// Centred gradient at pixel (x, y) in a row-major f32 plane of given width.
fn gradient(plane: &[f32], width: usize, x: usize, y: usize) -> (f32, f32) {
    let gx = plane[y * width + (x + 1)] - plane[y * width + (x - 1)];
    let gy = plane[(y + 1) * width + x] - plane[(y - 1) * width + x];
    (gx, gy)
}

/// Accumulate one pixel's vote into a 9-bin unsigned-orientation histogram.
/// `theta` is in [0, π); `mag` is the gradient magnitude; `hist` has BINS slots.
fn accumulate_bin(hist: &mut [f32; BINS], theta: f32, mag: f32) {
    let bin_width = std::f32::consts::PI / BINS as f32;
    let bin_f = theta / bin_width;
    let bin0 = bin_f.floor() as usize % BINS;
    let bin1 = (bin0 + 1) % BINS;
    let w1 = bin_f - bin_f.floor(); // fractional part
    hist[bin0] += mag * (1.0 - w1);
    hist[bin1] += mag * w1;
}

/// L2-Hys normalisation of a 36-element raw block vector in-place.
fn l2_hys(v: &mut [f32; BLOCK_DIM]) {
    // First L2 normalise
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt() + EPS;
    for x in v.iter_mut() { *x /= norm; }
    // Clip to τ = 0.2
    for x in v.iter_mut() { *x = x.min(CLIP); }
    // Renormalise
    let norm2 = v.iter().map(|x| x * x).sum::<f32>().sqrt() + EPS;
    for x in v.iter_mut() { *x /= norm2; }
}

Remarks

  • Descriptor extraction is O(WH)O(W \cdot H) per window in image area; a full sliding-window scale pyramid adds a factor of O(Npyr)O(N_{\text{pyr}}) pyramid levels.
  • Prior Gaussian smoothing before gradient computation damages performance: moving from σ=0\sigma = 0 to σ=2\sigma = 2 reduces recall from 89% to 80% at 10410^{-4} FPPW. The no-smoothing rule is the single most damaging parameter to violate.
  • HOG is not rotation-invariant; orientation is computed in image coordinates, so the descriptor encodes absolute image-plane direction rather than object-relative direction.
  • Hard-example mining — retraining the linear SVM on its own false positives from the negative set — adds approximately 5% detection rate at 10410^{-4} FPPW.
  • Unsigned orientation ([0,π)[0, \pi), 9 bins) suits pedestrian detection because clothing contrast polarity is variable; signed orientation ([0,2π)[0, 2\pi)) is preferable for object classes with consistent contrast polarity.
  • Deformable Part Models (Felzenszwalb et al., 2010) build a deformable mixture of HOG filters on top of this descriptor.
  • The HOG + linear-SVM and HOG + DPM sliding-window pipelines were superseded for general object detection by Faster R-CNN, which replaces handcrafted gradient histograms with shared convolutional features and a learned Region Proposal Network.

References

  1. N. Dalal, B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005. HAL
  2. D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2)
    –110, 2004. — source of the L2-Hys clip threshold 0.2. Springer

Compared with

  • medium
    Viola–Jones Object Detector

    Both classical sliding-window detectors. Viola-Jones: Haar + AdaBoost cascade on faces. HOG: gradient histograms + linear SVM on pedestrians.

Feeds into

  • Deformable Part Models

    DPM uses HOG cells (k=8 px, α=0.2) with an analytic 31-dim projection of the 36-dim HOG vector as its base feature pyramid.