Goal
Detect and localise instances of a target object category in a still image by scoring every position and scale with a mixture of multiscale deformable part models. Input: an RGB or greyscale image; axis-aligned bounding-box annotations for training only. Output: a ranked list of axis-aligned bounding boxes. Each model is a star-structured configuration of one coarse root filter covering the full object at HOG pyramid level and finer part filters at twice the spatial resolution ( levels deeper), scored jointly by filter responses minus a quadratic deformation penalty; the best configuration at each root location is recovered by dynamic programming over pre-computed distance-transform arrays. A mixture of components handles intraclass variation in viewpoint and aspect ratio; all parameters — filter weights, deformation costs, and bias — are trained end-to-end with latent SVM and hard-negative mining from bounding-box supervision alone.
Algorithm
Let denote the HOG feature pyramid built from the input image. Let denote the HOG sub-window feature vector at position in the pyramid ( in the final system). Let denote the root filter covering the full object at pyramid level . Let , , denote part filters placed at pyramid level . Let denote the anchor position of part relative to root position . Let denote the learned deformation-cost coefficient vector for part . Let denote the model bias scalar. Let denote a complete object hypothesis (root position plus part positions). Let denote the displacement of part from its anchor (Eq. 3). Let denote the filter response map of part filter at pyramid level . Let denote the concatenated model parameter vector (all filter weights, deformation costs, and bias).
Feature pyramid
Each level of the pyramid provides a -dimensional feature map with cell size px and truncation . The final feature vector is 31-dimensional: 9 contrast-insensitive orientation channels + 18 contrast-sensitive orientation channels + 4 normalisation-energy channels.
The pyramid is sampled at levels per octave ( during training, at test time). Parts are placed levels below the root level, giving twice the spatial resolution.
Hypothesis score
The displacement of a part from its anchor is encoded as a four-dimensional quadratic feature:
The score of a complete hypothesis is:
The first sum is the data term (root and part filter responses); the second sum penalises each part's quadratic displacement from its anchor.
Efficient inference via distance transform
For each part and pyramid level , the best-scoring placement over all displacements from a given root location is captured by the generalised distance transform:
gives the maximum score achievable by part when the root is placed at the corresponding pyramid location, taking the best possible displacement.
The root score aggregation adds all part distance-transform values evaluated at the expected anchor positions (Eq. 9). Because is a separable quadratic, the maximisation in Eq. 8 is computed in time by a 1-D pass in followed by a 1-D pass in .
Mixture model
A model with components trains independent star models; detection reports the component and configuration with the highest score. All PASCAL VOC experiments use components, covering frontal and side aspects. The latent variable encodes both the component label and the full configuration .
Latent SVM training
Let be the detector score for example (Eq. 13). The training objective is
The hinge loss is convex in for negative examples () — the semi-convex property — enabling a coordinate-descent procedure that alternates between relabelling positive latent values and solving a convex QP over negatives.
Hard-negative mining iteratively removes easy negatives () from the cache and adds new hard examples; convergence to is guaranteed when the cache eventually contains all hard examples (Theorem 1).
Procedure
- Build the HOG feature pyramid from at levels per octave; cell size px; dimensions per cell.
- For each pyramid level , compute the root filter response map at every root position.
- For each part , compute the part response map at the corresponding finer pyramid level.
- For each part , compute the distance-transform array by two 1-D passes ( then ) over using the quadratic cost .
- At each root position , aggregate the root response with all part distance-transform values sampled at anchor positions to obtain (Eq. 9).
- Collect all positions with as candidate detections; apply non-maximum suppression (suppress any box overlapping a higher-scoring box by ).
- Optionally apply bounding-box regression: fit output boxes via least-squares regression on the -dimensional configuration feature .
Implementation
The per-root-position score in Rust — combining root filter response with pre-computed part distance-transform values (Eq. 2 + Eq. 8):
/// Quadratic deformation cost: d · φ_d(dx, dy).
/// d_coeffs = [a, b, c, e] corresponding to φ_d = (dx, dy, dx², dy²).
#[inline]
fn deformation_cost(d_coeffs: [f32; 4], dx: i32, dy: i32) -> f32 {
let (fx, fy) = (dx as f32, dy as f32);
d_coeffs[0] * fx + d_coeffs[1] * fy
+ d_coeffs[2] * fx * fx + d_coeffs[3] * fy * fy
}
/// Eq. 2: aggregate root response with part distance-transform values at the
/// anchors. Deformation costs are already folded into `dt_values` by Eq. 8.
#[inline]
fn root_score(root_resp: f32, dt_values: &[f32], bias: f32) -> f32 {
root_resp + dt_values.iter().sum::<f32>() + bias
}
/// One-axis quadratic distance transform.
/// Fills out[x] = max_dx [ resp[x + dx] − (c · dx + e · dx²) ] for each x.
/// Apply twice (x then y) to realise the full 2-D maximisation in Eq. 8.
fn dt_1d(resp: &[f32], c: f32, e: f32, out: &mut [f32]) {
let n = resp.len();
let mut best = f32::NEG_INFINITY;
for x in 0..n as i32 {
let cand = resp[x as usize] - c * (x as f32) - e * (x * x) as f32;
if cand > best { best = cand; }
out[x as usize] = best + c * (x as f32) + e * (x * x) as f32;
}
}
Remarks
- Inference is per pyramid level after filter response maps are computed, where is the number of parts and is the number of positions at each pyramid level; the dominant cost is convolution of root and part filters over the pyramid.
- Training is offline and expensive: PASCAL experiments require approximately 4 hours to train and 3 hours to evaluate on an 8-core 2.8 GHz Xeon; test-time average is approximately 2 seconds per image. The method is unsuitable for real-time use.
- A mixture of components handles aspect-ratio and viewpoint variation; categories with three or more distinct aspects are poorly served — bird AP is 0.006 in PASCAL VOC 2007 Table 2.
- The deformable assumption is a star topology: each part connects only to the root, not to other parts. Within-part articulation and joint-chain kinematics (e.g. extreme limb foreshortening) violate the quadratic deformation model.
- Hard-negative mining requires multiple passes over the training set; convergence to is guaranteed only when the cache contains all hard examples, but memory limits in practice impose a fixed iteration count rather than full convergence.
- Dominant on PASCAL VOC 2007/2008 — first or second place in 17 of 20 categories (Table 3) — until the R-CNN family (Girshick et al., 2014) replaced HOG + latent SVM with CNN features. The end-to-end learned replacement in general object detection is Faster R-CNN, which folds proposal generation into a shared-conv RPN head and supersedes the sliding-window + deformable-template pipeline on PASCAL VOC 2007 (DPM ≈33% mAP vs Faster R-CNN VGG-16 73.2%, ResNet-101 76.4%).
References
- Felzenszwalb, P. F., Girshick, R. B., McAllester, D., Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE TPAMI, 32(9)–1645, 2010. PDF
- Dalal, N., Triggs, B. Histograms of Oriented Gradients for Human Detection. CVPR, 2005. HAL
- Viola, P., Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR, 2001. PDF