Deformable Part Models | VitaVision
Back to atlas

Deformable Part Models

8 min readAdvancedView in graph
Based on
Object Detection with Discriminatively Trained Part-Based Models
Felzenszwalb, Girshick, McAllester, Ramanan · IEEE TPAMI 2010
DOI ↗

Goal

Detect and localise instances of a target object category in a still image by scoring every position and scale with a mixture of multiscale deformable part models. Input: an RGB or greyscale image; axis-aligned bounding-box annotations for training only. Output: a ranked list of axis-aligned bounding boxes. Each model is a star-structured configuration of one coarse root filter covering the full object at HOG pyramid level l0l_0 and nn finer part filters at twice the spatial resolution (λ\lambda levels deeper), scored jointly by filter responses minus a quadratic deformation penalty; the best configuration at each root location is recovered by dynamic programming over pre-computed distance-transform arrays. A mixture of mm components handles intraclass variation in viewpoint and aspect ratio; all parameters — filter weights, deformation costs, and bias — are trained end-to-end with latent SVM and hard-negative mining from bounding-box supervision alone.

Algorithm

Let HH denote the HOG feature pyramid built from the input image. Let ϕ(H,p)Rd\phi(H, p) \in \mathbb{R}^d denote the HOG sub-window feature vector at position pp in the pyramid (d=31d = 31 in the final system). Let F0F_0 denote the root filter covering the full object at pyramid level l0l_0. Let FiF_i, i=1,,ni = 1, \ldots, n, denote part filters placed at pyramid level li=l0λl_i = l_0 - \lambda. Let viv_i denote the anchor position of part ii relative to root position p0p_0. Let diR4d_i \in \mathbb{R}^4 denote the learned deformation-cost coefficient vector for part ii. Let bb denote the model bias scalar. Let z=(p0,p1,,pn)z = (p_0, p_1, \ldots, p_n) denote a complete object hypothesis (root position plus nn part positions). Let (dxi,dyi)=(xi,yi)(2(x0,y0)+vi)(dx_i, dy_i) = (x_i, y_i) - (2(x_0, y_0) + v_i) denote the displacement of part ii from its anchor (Eq. 3). Let Ri,l(x,y)=Fiϕ(Hl,(x,y))R_{i,l}(x,y) = F_i' \cdot \phi(H_l, (x,y)) denote the filter response map of part filter ii at pyramid level ll. Let β\beta denote the concatenated model parameter vector (all filter weights, deformation costs, and bias).

Feature pyramid

Definition
HOG feature map

Each level of the pyramid provides a dd-dimensional feature map with cell size k=8k = 8 px and truncation α=0.2\alpha = 0.2. The final feature vector is 31-dimensional: 9 contrast-insensitive orientation channels + 18 contrast-sensitive orientation channels + 4 normalisation-energy channels.

The pyramid is sampled at λ\lambda levels per octave (λ=5\lambda = 5 during training, λ=10\lambda = 10 at test time). Parts are placed λ\lambda levels below the root level, giving twice the spatial resolution.

Hypothesis score

Definition
Deformation feature vector \phi_d

The displacement (dx,dy)(dx, dy) of a part from its anchor is encoded as a four-dimensional quadratic feature:

ϕd(dx,dy)=(dx,  dy,  dx2,  dy2).(Eq. 4)\phi_d(dx, dy) = (dx,\; dy,\; dx^2,\; dy^2). \tag{Eq. 4}
Definition
Hypothesis score

The score of a complete hypothesis z=(p0,,pn)z = (p_0, \ldots, p_n) is:

score(p0,,pn)=i=0nFiϕ(H,pi)i=1ndiϕd(dxi,dyi)+b.(Eq. 2)\operatorname{score}(p_0, \ldots, p_n) = \sum_{i=0}^{n} F_i' \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b. \tag{Eq. 2}

The first sum is the data term (root and part filter responses); the second sum penalises each part's quadratic displacement from its anchor.

Efficient inference via distance transform

For each part ii and pyramid level ll, the best-scoring placement over all displacements from a given root location is captured by the generalised distance transform:

Definition
Part distance-transform array D_{i,l}
Di,l(x,y)=maxdx,dy[Ri,l(x+dx,  y+dy)diϕd(dx,dy)].(Eq. 8)D_{i,l}(x, y) = \max_{dx,\, dy}\bigl[R_{i,l}(x+dx,\; y+dy) - d_i \cdot \phi_d(dx, dy)\bigr]. \tag{Eq. 8}

Di,l(x,y)D_{i,l}(x,y) gives the maximum score achievable by part ii when the root is placed at the corresponding pyramid location, taking the best possible displacement.

The root score aggregation adds all part distance-transform values evaluated at the expected anchor positions (Eq. 9). Because diϕdd_i \cdot \phi_d is a separable quadratic, the maximisation in Eq. 8 is computed in O(Ri,l)O(|R_{i,l}|) time by a 1-D pass in xx followed by a 1-D pass in yy.

Mixture model

A model with mm components trains mm independent star models; detection reports the component and configuration with the highest score. All PASCAL VOC experiments use m=2m = 2 components, covering frontal and side aspects. The latent variable zz encodes both the component label and the full configuration (p0,,pn)(p_0, \ldots, p_n).

Latent SVM training

Definition
Latent SVM objective

Let fβ(x)=maxzZ(x)βΦ(x,z)f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z) be the detector score for example xx (Eq. 13). The training objective is

LD(β)=12β2+Cimax ⁣(0,  1yifβ(xi)).(Eq. 14)L_D(\beta) = \tfrac{1}{2}\|\beta\|^2 + C \sum_i \max\!\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr). \tag{Eq. 14}

The hinge loss is convex in β\beta for negative examples (yi=1y_i = -1) — the semi-convex property — enabling a coordinate-descent procedure that alternates between relabelling positive latent values and solving a convex QP over negatives.

Hard-negative mining iteratively removes easy negatives (yifβ(xi)>1y_i f_\beta(x_i) > 1) from the cache and adds new hard examples; convergence to β(D)\beta^*(D) is guaranteed when the cache eventually contains all hard examples (Theorem 1).

Procedure

Algorithm
DPM inference (single component)
Input: Image II; trained model (F0,{Pi}i=1n,b)(F_0, \{P_i\}_{i=1}^n, b) with Pi=(Fi,vi,di)P_i = (F_i, v_i, d_i); detection threshold tt.
Output: List of (bounding box, score) pairs for all positions exceeding tt, after non-maximum suppression.
  1. Build the HOG feature pyramid HH from II at λ=10\lambda = 10 levels per octave; cell size k=8k = 8 px; d=31d = 31 dimensions per cell.
  2. For each pyramid level l0l_0, compute the root filter response map R0,l0(x,y)=F0ϕ(Hl0,(x,y))R_{0,l_0}(x, y) = F_0' \cdot \phi(H_{l_0}, (x,y)) at every root position.
  3. For each part i=1,,ni = 1, \ldots, n, compute the part response map Ri,l0λR_{i, l_0 - \lambda} at the corresponding finer pyramid level.
  4. For each part ii, compute the distance-transform array Di,l0λD_{i,l_0-\lambda} by two 1-D passes (xx then yy) over Ri,l0λR_{i,l_0-\lambda} using the quadratic cost diϕdd_i \cdot \phi_d.
  5. At each root position (x0,y0)(x_0, y_0), aggregate the root response with all part distance-transform values sampled at anchor positions 2(x0,y0)+vi2(x_0, y_0) + v_i to obtain score(z)\operatorname{score}(z^*) (Eq. 9).
  6. Collect all positions with score(z)>t\operatorname{score}(z^*) > t as candidate detections; apply non-maximum suppression (suppress any box overlapping a higher-scoring box by 50%\geq 50\%).
  7. Optionally apply bounding-box regression: fit output boxes via least-squares regression on the 2n+32n+3-dimensional configuration feature g(z)g(z^*).

Implementation

The per-root-position score in Rust — combining root filter response with pre-computed part distance-transform values (Eq. 2 + Eq. 8):

/// Quadratic deformation cost: d · φ_d(dx, dy).
/// d_coeffs = [a, b, c, e] corresponding to φ_d = (dx, dy, dx², dy²).
#[inline]
fn deformation_cost(d_coeffs: [f32; 4], dx: i32, dy: i32) -> f32 {
    let (fx, fy) = (dx as f32, dy as f32);
    d_coeffs[0] * fx + d_coeffs[1] * fy
        + d_coeffs[2] * fx * fx + d_coeffs[3] * fy * fy
}

/// Eq. 2: aggregate root response with part distance-transform values at the
/// anchors. Deformation costs are already folded into `dt_values` by Eq. 8.
#[inline]
fn root_score(root_resp: f32, dt_values: &[f32], bias: f32) -> f32 {
    root_resp + dt_values.iter().sum::<f32>() + bias
}

/// One-axis quadratic distance transform.
/// Fills out[x] = max_dx [ resp[x + dx] − (c · dx + e · dx²) ] for each x.
/// Apply twice (x then y) to realise the full 2-D maximisation in Eq. 8.
fn dt_1d(resp: &[f32], c: f32, e: f32, out: &mut [f32]) {
    let n = resp.len();
    let mut best = f32::NEG_INFINITY;
    for x in 0..n as i32 {
        let cand = resp[x as usize] - c * (x as f32) - e * (x * x) as f32;
        if cand > best { best = cand; }
        out[x as usize] = best + c * (x as f32) + e * (x * x) as f32;
    }
}

Remarks

  • Inference is O(nk)O(nk) per pyramid level after filter response maps are computed, where nn is the number of parts and kk is the number of positions at each pyramid level; the dominant cost is convolution of root and part filters over the pyramid.
  • Training is offline and expensive: PASCAL experiments require approximately 4 hours to train and 3 hours to evaluate on an 8-core 2.8 GHz Xeon; test-time average is approximately 2 seconds per image. The method is unsuitable for real-time use.
  • A mixture of m=2m = 2 components handles aspect-ratio and viewpoint variation; categories with three or more distinct aspects are poorly served — bird AP is 0.006 in PASCAL VOC 2007 Table 2.
  • The deformable assumption is a star topology: each part connects only to the root, not to other parts. Within-part articulation and joint-chain kinematics (e.g. extreme limb foreshortening) violate the quadratic deformation model.
  • Hard-negative mining requires multiple passes over the training set; convergence to β(D)\beta^*(D) is guaranteed only when the cache contains all hard examples, but memory limits in practice impose a fixed iteration count rather than full convergence.
  • Dominant on PASCAL VOC 2007/2008 — first or second place in 17 of 20 categories (Table 3) — until the R-CNN family (Girshick et al., 2014) replaced HOG + latent SVM with CNN features. The end-to-end learned replacement in general object detection is Faster R-CNN, which folds proposal generation into a shared-conv RPN head and supersedes the sliding-window + deformable-template pipeline on PASCAL VOC 2007 (DPM ≈33% mAP vs Faster R-CNN VGG-16 73.2%, ResNet-101 76.4%).

References

  1. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE TPAMI, 32(9)
    –1645, 2010. PDF
  2. Dalal, N., Triggs, B. Histograms of Oriented Gradients for Human Detection. CVPR, 2005. HAL
  3. Viola, P., Jones, M. Rapid Object Detection Using a Boosted Cascade of Simple Features. CVPR, 2001. PDF

Prerequisites

Compared with

  • medium
    Viola–Jones Object Detector

    Different operational regimes — VJ is real-time cascade for rigid faces; DPM is offline part-based for general deformable objects.

Fed by

Has learned alternative

  • Faster R-CNN
  • medium
    Mask R-CNN

    Mask R-CNN's CNN backbone, region proposals, and RoIAlign replace DPM's HOG features, root + part filters, and latent-SVM scoring; Mask R-CNN also outputs per-instance masks beyond DPM's bounding boxes.

  • YOLOv1

    Replaces sliding-window deformable templates with single-pass CNN regression; reframes detection as regression rather than classification of proposals.