Goal
Detect all instances of a trained single object class — demonstrated on frontal, roughly upright human faces — in a grayscale image of arbitrary resolution. Input: a grayscale image . Output: a set of axis-aligned bounding rectangles localising each detected instance, computed by sliding a pixel sub-window across the image at multiple scales. The method distinguishes itself from prior sliding-window detectors by coupling three mechanisms — an rectangular-sum primitive, a boosting-based feature-selection step that selects a small discriminative subset from over 180,000 candidate features, and a multi-stage rejection cascade — so that the vast majority of sub-windows are discarded after examining fewer than ten features.
Algorithm
Let denote the grayscale input image on pixel domain . Let denote the pixel value at column , row . Let denote the integral image at . Let denote the column prefix sum at . Let denote the -th rectangle feature evaluated on a sub-window. Let denote the -th AdaBoost weak classifier. Let denote the strong classifier formed by a weighted combination of weak classifiers. Let denote the number of boosting rounds (weak classifiers selected) for a given stage. Let denote the log-odds weight assigned to weak classifier . Let denote the weighted training error of . Let and denote the per-stage detection rate and false-positive rate at cascade stage . Let and denote the overall cascade detection rate and false-positive rate.
Integral image
The integral image is a two-dimensional prefix sum of :
It is computed in a single pass over using the recurrences
with boundary conditions and . The sum of pixel values within any axis-aligned rectangle is recovered in exactly four array reads: for a rectangle with corners and ,
Variance normalisation requires a second integral image computed identically over . The sub-window standard deviation is
where is the sub-window mean and is the mean of squared pixel values, both obtained from and in .
Rectangle features
Three families of rectangle feature are defined over the sub-window:
- Two-rectangle: difference of pixel sums in two horizontally or vertically adjacent same-size rectangles.
- Three-rectangle: sum of the two outer rectangles minus the centre rectangle, arranged horizontally or vertically.
- Four-rectangle: difference of pixel sums in two diagonal rectangle pairs.
The exhaustive set of all valid placements and sizes of these three families over a sub-window exceeds 180,000 features.
Each feature value is computed from in using the four-read rectangular-sum formula.
AdaBoost feature selection
AdaBoost iterates over rounds. At each round , all candidate features are evaluated on the training set; the feature with minimum weighted classification error is selected.
Given feature , threshold , and polarity ,
After selecting with weighted error , its coefficient is set to
Sample weights are updated to down-weight correctly classified examples by a factor and renormalised.
The strong classifier combines weak classifiers by a weighted sign threshold:
A sub-window is classified as a positive instance when .
Attentional cascade
The cascade is a sequence of strong classifiers, each trained to meet a per-stage minimum detection rate and maximum false-positive rate . A sub-window advances to stage only if stage accepts it. The overall rates satisfy
Stages are constructed greedily: additional weak classifiers are added to stage until and targets are met. False positives from each partial cascade pass over 9,544 non-face images (approximately 350 million sub-windows) are collected as hard negatives for subsequent stages. The trained cascade has stages and 6,061 features in total; the first five stages contain 1, 10, 25, 25, and 50 features respectively. The first stage (a single two-feature strong classifier) achieves approximately 100% detection rate at approximately 40% false-positive rate.
Procedure
- Compute the integral image from using the two-pass recurrence over and .
- Compute the squared integral image from by the same recurrence.
- For each scale in the sequence until the scaled sub-window exceeds the image dimensions:
- For each sub-window position with stride :
- Compute from and ; skip the sub-window if .
- Evaluate on the normalised sub-window; reject and advance to the next position if .
- Evaluate in sequence; reject on the first negative response.
- If all stages accept, add the scaled bounding rectangle to a candidate set.
- For each sub-window position with stride :
- Partition overlapping candidates into groups; replace each group with the mean of its bounding-box corners.
- Return the merged set .
Implementation
The two-pass integral-image build and the four-read rectangular-sum query in Rust:
fn build_integral_image(pixels: &[u32], width: usize, height: usize) -> Vec<u64> {
let mut ii = vec![0u64; width * height];
for x in 0..width {
let col_sum = pixels[x] as u64;
ii[x] = if x == 0 { col_sum } else { ii[x - 1] + col_sum };
}
for y in 1..height {
let mut col_sum = 0u64;
for x in 0..width {
col_sum += pixels[y * width + x] as u64;
ii[y * width + x] = ii[(y - 1) * width + x] + col_sum;
}
}
ii
}
fn rect_sum(ii: &[u64], width: usize, x1: usize, y1: usize, x2: usize, y2: usize) -> u64 {
let a = ii[y2 * width + x2];
let b = if x1 == 0 { 0 } else { ii[y2 * width + (x1 - 1)] };
let c = if y1 == 0 { 0 } else { ii[(y1 - 1) * width + x2] };
let d = if x1 == 0 || y1 == 0 { 0 } else { ii[(y1 - 1) * width + (x1 - 1)] };
a - b - c + d
}
The weak-classifier evaluation in Python, operating directly on a pre-built integral image:
import numpy as np
def eval_weak_classifier(ii, x_off, y_off, feature, alpha):
def rsum(r):
x1, y1, rw, rh = r
x1 += x_off; y1 += y_off
x2, y2 = x1 + rw - 1, y1 + rh - 1
s = ii[y2, x2]
s -= ii[y1 - 1, x2] if y1 > 0 else 0
s -= ii[y2, x1 - 1] if x1 > 0 else 0
s += ii[y1 - 1, x1 - 1] if (y1 > 0 and x1 > 0) else 0
return s
f_val = rsum(feature["rect_A"]) - rsum(feature["rect_B"])
h = 1 if feature["p"] * f_val < feature["p"] * feature["theta"] else 0
return alpha * h
Remarks
- The integral image is built in a single pass; each entry requires two additions. Rectangular-sum queries cost exactly four array reads regardless of rectangle dimensions.
- The cascade amortises evaluation cost over the asymmetry between positives and negatives: on the MIT+CMU test set (507 labelled faces, 75,081,800 sub-windows), an average of 10 features out of 6,061 are evaluated per sub-window.
- Sub-windows are normalised by their standard deviation before feature evaluation; this requires a second integral image over squared pixel values. Sub-windows with (near-uniform regions) are safe to reject without evaluation.
- The method is single-class and pose-restricted: the classifier is trained on frontal, roughly upright instances of one target class. Profile or strongly rotated instances are not detected.
- Training requires hard-negative bootstrapping: false positives produced by each partial cascade pass over the non-face set are collected as training negatives for subsequent stages. The training set consists of 4,916 labelled positive examples; the non-face set spans approximately 350 million sub-windows across 9,544 images.
- Overlapping detections across scale and position are merged by partitioning into groups of mutually overlapping rectangles and replacing each group with the mean of its members' bounding-box corners.
- For full-body pedestrian detection the peer classical sliding-window detector is HOG + linear SVM (Dalal & Triggs, CVPR 2005), which outperforms Haar-wavelet-based detectors — including an extended version of this feature set — by more than an order of magnitude in false-positives-per-window on INRIA. See
hog-descriptor. - The deep-learning paradigm-level replacement is Faster R-CNN: a learned Region Proposal Network sharing conv features with a Fast R-CNN head detects 20–80 general categories at 5–17 fps on GPU. Note the change of regime — Viola-Jones is single-class real-time CPU detection on grayscale images; Faster R-CNN is multi-class GPU detection on RGB, so the replacement is paradigm-level, not drop-in.
References
- P. Viola, M. Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. CVPR, 2001. PDF