Goal
Rectify a single monocular perspective image to a geometrically correct bird's-eye (overhead) view. Input: one RGB image of a scene containing a dominant flat ground plane; no calibration target and no prior camera calibration. Output: a homography that warps the image so that ground-plane geometry is metrically correct up to an overall scale. The homography is constructed in closed form from two projective entities — the vertical vanishing point and the horizon line — regressed by a CNN. No correspondence solve, no DLT, no SVD, and no RANSAC is performed; the entire geometric construction is algebraic once the two entities are known.
Algorithm
Notation:
- $I$: the input perspective image.
- $H$: the rectifying homography mapping $I$ to the bird's-eye view.
- $K$: the camera calibration matrix.
- $\mathbf{v}_z$: the vertical vanishing point (image of the world vertical direction).
- $\mathbf{h} = (a, b, c)^\top$: the horizon line with homogeneous line coefficients; image points $(x, y)$ on the line satisfy $ax + by + c = 0$.
- $\omega = K^{-\top} K^{-1}$: the image of the absolute conic.
- $f$: the camera focal length in pixels.
- $w$, $h$: the image width and height in pixels (the scalar $h$ is distinct from the bold horizon line $\mathbf{h}$).
- $\theta_z$: the camera roll angle.
- $\theta_x$: the camera tilt angle.
- $R_z$: the rotation matrix removing camera roll.
- $R_x$: the rotation matrix removing camera tilt.
- $R_{\text{align}}$: the optional rotation aligning the principal horizontal direction to a coordinate axis.
- $T_{\text{scene}}$: the translation matrix mapping the warped corners to the output canvas.
- $H_{\text{rot}}$: the intermediate rotational homography.
- $N$: the number of regression bins per scalar.
- $k$: the number of top bins used in the weighted decode.
- $r$: the stereographic sphere radius.
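Under the simplified pinhole model (square pixels, principal point at the image centre), the calibration matrix is
$$K = \begin{pmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{pmatrix} \qquad \text{(eq. 3)}$$
The focal length $f$ is the single unknown; it is recovered from the horizon and the vertical vanishing point.
The image of the absolute conic satisfies the linear constraint
$$\mathbf{h} = \omega\,\mathbf{v}_z \qquad \text{(eq. 4)}$$
where equality holds up to scale. Given $\mathbf{h}$ and $\mathbf{v}_z$, this equation determines $f$ in closed form under the calibration model above.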
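Expanding eq. 4 under the calibration model of eq. 3 makes the closed form explicit. Write the principal point as $(c_x, c_y) = (w/2, h/2)$, the vanishing point as $(v_x, v_y)$, and let $\lambda$ be the scale for which $\lambda\,(a, b) = (v_x - c_x,\; v_y - c_y)$. The third row of the constraint then gives $f^2 = \lambda\,(a c_x + b c_y + c)$. The following is a minimal Rust sketch of that step; the function name `focal_from_horizon_and_vz` and the `Option` return for degenerate input are illustrative choices, not taken from the source.

```rust
/// Recover the focal length (pixels) from the horizon line and the vertical
/// vanishing point, assuming square pixels and a centred principal point.
/// Returns None when the inputs are degenerate or mutually inconsistent.
pub fn focal_from_horizon_and_vz(
    vz: (f64, f64),           // vertical vanishing point, pixels
    horizon: (f64, f64, f64), // horizon line (a, b, c): a*x + b*y + c = 0
    w: f64,                   // image width, pixels
    h: f64,                   // image height, pixels
) -> Option<f64> {
    let (a, b, c) = horizon;
    let (cx, cy) = (w / 2.0, h / 2.0);
    let norm2 = a * a + b * b;
    if norm2 == 0.0 {
        return None; // horizon is the line at infinity: no finite solution
    }
    // Scale lambda that aligns (a, b) with the offset of v_z from the
    // principal point (least-squares projection, exact for consistent input).
    let lambda = ((vz.0 - cx) * a + (vz.1 - cy) * b) / norm2;
    // Third row of h = omega * v_z: f^2 = lambda * (a*cx + b*cy + c).
    let f2 = lambda * (a * cx + b * cy + c);
    if f2.is_finite() && f2 > 0.0 {
        Some(f2.sqrt())
    } else {
        None
    }
}
```

Geometrically the expression is $f^2 = d_v\, d_h$, the product of the distances from the principal point to $\mathbf{v}_z$ and to the horizon, which lie on opposite sides of the principal point. The result does not depend on the arbitrary scale or sign of $(a, b, c)$.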
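The combined roll-and-tilt correction is the homography
$$H_{\text{rot}} = K R_x K^{-1} R_z \qquad \text{(eq. 5)}$$
It maps image pixels to a roll- and tilt-corrected overhead coordinate frame.
The full bird's-eye-view homography is
$$H = R_{\text{align}}\, T_{\text{scene}}\, H_{\text{rot}} \qquad \text{(eq. 6)}$$
$T_{\text{scene}}$ translates the warped image so all corners lie in the positive canvas region; $R_{\text{align}}$ is optional.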
Procedure
- Regress $\mathbf{v}_z$ and $\mathbf{h}$ from $I$ using the CNN (four scalars total, decoded from the stereographic-sphere representation).
- Recover $f$ and form $K$ (eq. 3) by solving $\mathbf{h} = \omega\,\mathbf{v}_z$ with $\omega = K^{-\top} K^{-1}$ (eq. 4).
- Compute roll $\theta_z = \operatorname{atan2}(-a, b)$ from the horizon line $\mathbf{h} = (a, b, c)^\top$; form $R_z$.
- Compute tilt $\theta_x = \pi/2 - \arctan(d/f)$, where $d$ is the distance from the vertical vanishing point to the principal point; form $R_x$.
- Form the intermediate rotational homography $H_{\text{rot}}$ (eq. 5).
- Map the four image corners through $H_{\text{rot}}$ to determine the bounding box of the warped canvas; build $T_{\text{scene}}$ to shift all corners into the positive quadrant.
- Compose the rectifying homography $H$ (eq. 6).
- Warp $I$ by $H$ (bilinear interpolation) to produce the bird's-eye-view image.
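The corner-mapping step of the procedure can be sketched as follows. This is a minimal illustration using plain 3×3 arrays; the type alias `Mat3` and the names `apply_h` and `scene_translation` are hypothetical. It assumes every corner maps to a finite point, which in practice means cropping the image below the horizon first, since pixels on the horizon map to infinity.

```rust
type Mat3 = [[f64; 3]; 3];

/// Apply a homography to a pixel and dehomogenise.
fn apply_h(m: &Mat3, p: (f64, f64)) -> (f64, f64) {
    let x = m[0][0] * p.0 + m[0][1] * p.1 + m[0][2];
    let y = m[1][0] * p.0 + m[1][1] * p.1 + m[1][2];
    let s = m[2][0] * p.0 + m[2][1] * p.1 + m[2][2];
    (x / s, y / s)
}

/// Translation that shifts the warped image corners into the positive
/// quadrant, together with the resulting canvas size (width, height).
pub fn scene_translation(h_rot: &Mat3, w: f64, h: f64) -> (Mat3, (f64, f64)) {
    let corners = [(0.0, 0.0), (w, 0.0), (w, h), (0.0, h)];
    let (mut min_x, mut min_y) = (f64::INFINITY, f64::INFINITY);
    let (mut max_x, mut max_y) = (f64::NEG_INFINITY, f64::NEG_INFINITY);
    for &c in &corners {
        let (x, y) = apply_h(h_rot, c);
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    // Pure translation by (-min_x, -min_y).
    let t = [[1.0, 0.0, -min_x], [0.0, 1.0, -min_y], [0.0, 0.0, 1.0]];
    (t, (max_x - min_x, max_y - min_y))
}
```

The returned canvas size is in warped pixels and can be large when the crop extends close to the horizon, so a practical pipeline would likely also rescale.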
CNN regression target
The vertical vanishing point and the horizon line may lie at or near the image boundary or entirely outside the image frame. Representing them as raw pixel coordinates is numerically unsafe when values are large or infinite.
Stereographic-sphere encoding. A projective point or line that may be at infinity is mapped to a finite scalar pair via a stereographic construction: the point is first lifted onto a sphere of radius $r$, and then projected orthogonally back to the plane. A line is encoded via the normal to the plane it defines with the sphere centre. The resulting four scalars (two for $\mathbf{v}_z$, two for $\mathbf{h}$) are bounded regardless of the projective position of the geometric entity. This is the regression target the CNN is trained to predict.
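To illustrate why such an encoding stays bounded, here is one possible reading of the point construction, written as a hedged Rust sketch. It assumes the sphere of radius $r$ is centred at height $r$ above the origin of the image plane and that the lift follows the ray from the sphere centre to the point; the centre, the coordinate normalisation, and the function name `encode_point` are assumptions made for this example and may differ from the paper's exact construction.

```rust
/// Encode a homogeneous image point (x, y, w) as a pair bounded by r.
/// ASSUMPTION: sphere of radius r centred at (0, 0, r) above the plane
/// origin; the input must not be (0, 0, 0).
pub fn encode_point(p: (f64, f64, f64), r: f64) -> (f64, f64) {
    // Homogeneous points are defined up to sign; use the one with w >= 0.
    let s = if p.2 < 0.0 { -1.0 } else { 1.0 };
    let (x, y, w) = (s * p.0, s * p.1, s * p.2);
    // Ray from the centre towards the plane point (x/w, y/w, 0) has
    // direction (x, y, -r*w); scaling by w keeps w = 0 well defined.
    let n = (x * x + y * y + r * r * w * w).sqrt();
    // The ray meets the sphere at centre + r * dir / |dir|; dropping the
    // z component is the orthogonal projection back onto the plane.
    (r * x / n, r * y / n)
}
```

Under these assumptions finite points land strictly inside a disc of radius $r$, and points at infinity ($w = 0$) land on its rim, so the regression target never grows without bound.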
Regression-by-classification head. Each of the four scalars is discretised into $N$ equal-width bins spanning the encoded range. The CNN head predicts a softmax distribution over these bins. The decoded scalar is the probability-weighted average of the top $k$ bins by softmax probability. This decoding strategy reduces effective quantisation error relative to hard argmax and smooths the output without requiring a separate regression branch.
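The decode step can be sketched in a few lines of Rust. This is an illustration under stated assumptions rather than the paper's implementation: each bin is represented by its centre, the top-$k$ probabilities are renormalised to sum to one, and the encoded range is passed in as hypothetical parameters `lo` and `hi`.

```rust
/// Decode a scalar from per-bin softmax probabilities using the k most
/// probable bins. Bins are equal-width over [lo, hi] and represented by
/// their centres. Returns None for empty input, k == 0, or zero mass.
pub fn decode_top_k(probs: &[f64], k: usize, lo: f64, hi: f64) -> Option<f64> {
    if probs.is_empty() || k == 0 {
        return None;
    }
    let width = (hi - lo) / probs.len() as f64;
    // Bin indices sorted by descending probability.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&i, &j| probs[j].total_cmp(&probs[i]));
    let top = &idx[..k.min(idx.len())];
    // Renormalise over the selected bins (an assumption of this sketch).
    let mass: f64 = top.iter().map(|&i| probs[i]).sum();
    if mass <= 0.0 {
        return None;
    }
    let weighted: f64 = top
        .iter()
        .map(|&i| probs[i] * (lo + (i as f64 + 0.5) * width))
        .sum();
    Some(weighted / mass)
}
```

With `k = 1` this reduces to hard argmax at the bin centre; larger `k` interpolates between neighbouring bins when the distribution is concentrated around one value.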
If the focal length $f$ is known from prior calibration, only the horizon line $\mathbf{h}$ (two scalars) is needed; the tilt and roll are then fully determined from $\mathbf{h}$ alone.
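One way to see this: inverting eq. 4 gives $\mathbf{v}_z \propto \omega^{-1}\mathbf{h} = K K^\top \mathbf{h}$, so with $f$ known the vertical vanishing point follows from the horizon and the rest of the construction is unchanged. A small Rust sketch under the same calibration model (square pixels, centred principal point); the function name `vz_from_horizon` is illustrative.

```rust
/// Vertical vanishing point implied by the horizon line when the focal
/// length is known. Returns None when the horizon passes through the
/// principal point, in which case v_z lies at infinity.
pub fn vz_from_horizon(
    horizon: (f64, f64, f64), // horizon line (a, b, c): a*x + b*y + c = 0
    f: f64,                   // focal length, pixels
    w: f64,                   // image width, pixels
    h: f64,                   // image height, pixels
) -> Option<(f64, f64)> {
    let (a, b, c) = horizon;
    let (cx, cy) = (w / 2.0, h / 2.0);
    // K^T * h
    let (ka, kb, kc) = (f * a, f * b, cx * a + cy * b + c);
    // K * (K^T * h), homogeneous coordinates of v_z
    let (x, y, s) = (f * ka + cx * kc, f * kb + cy * kc, kc);
    if s == 0.0 {
        None
    } else {
        Some((x / s, y / s))
    }
}
```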
Implementation
Given the focal length $f$ recovered from $\mathbf{h} = \omega\,\mathbf{v}_z$ (eq. 4), the closed-form homography construction in Rust:
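use nalgebra::Matrix3;

/// Closed-form bird's-eye-view homography (eqs. 3, 5 and 6).
/// Assumes the camera looks down at the ground and |roll| < 90 degrees.
pub fn bev_homography(
    vz: (f64, f64),           // vertical vanishing point, pixels
    horizon: (f64, f64, f64), // horizon line (a, b, c): a·x + b·y + c = 0
    f: f64,                   // focal length, recovered from h = ω·v_z (eq. 4)
    w: f64,                   // image width, pixels
    h: f64,                   // image height, pixels
) -> Matrix3<f64> {
    // A homogeneous line is defined only up to sign; normalise to b >= 0 so
    // that the roll angle below lies in [-π/2, π/2].
    let sign = if horizon.1 < 0.0 { -1.0 } else { 1.0 };
    let (a, b) = (sign * horizon.0, sign * horizon.1);

    // eq. 3: calibration matrix K (square pixels, centred principal point)
    #[rustfmt::skip]
    let k = Matrix3::new(f,   0.0, w / 2.0,
                         0.0, f,   h / 2.0,
                         0.0, 0.0, 1.0);
    let k_inv = k.try_inverse().expect("K is invertible for f > 0");

    // Step A: roll θ_z is the slope angle of the horizon. Removing it means
    // rotating by -θ_z about the principal point, hence the conjugation by K.
    let tz = (-a).atan2(b);
    #[rustfmt::skip]
    let r_z = Matrix3::new( tz.cos(), tz.sin(), 0.0,
                           -tz.sin(), tz.cos(), 0.0,
                            0.0,      0.0,      1.0);
    let r_roll = k * r_z * k_inv;

    // Step B: tilt θ_x = π/2 - atan(d / f) is the angle of the optical axis
    // below the horizon. Pointing the axis straight down takes the remaining
    // rotation phi = π/2 - θ_x = atan(d / f) about the x axis.
    let d = (vz.0 - w / 2.0).hypot(vz.1 - h / 2.0);
    let phi = d.atan2(f);
    #[rustfmt::skip]
    let r_tilt = Matrix3::new(1.0, 0.0,        0.0,
                              0.0, phi.cos(), -phi.sin(),
                              0.0, phi.sin(),  phi.cos());

    // eq. 5: intermediate rotational homography. It sends v_z to the
    // principal point and the horizon to the line at infinity.
    let h_rot = k * r_tilt * k_inv * r_roll;

    // eq. 6: H = R_align · T_scene · H_rot; T_scene shifts the warped corners
    // into the canvas, R_align is optional. Identity placeholders shown here.
    let (t_scene, r_align) = (Matrix3::<f64>::identity(), Matrix3::<f64>::identity());
    r_align * t_scene * h_rot
}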
The CNN front-end — a standard image-classification trunk followed by the regression-by-classification head — is a conventional network and is not shown; the geometric construction above is the method's specific content.
Remarks
- The closed-form construction is constant-time per image given the CNN output; the dominant cost is the single CNN forward pass.
- The method parameterises the BEV homography with 4 scalars for uncalibrated cameras, or 2 scalars when $f$ is known, versus 8 degrees of freedom for a general homography.
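- Focal-length recovery from $\mathbf{h} = \omega\,\mathbf{v}_z$ (eq. 4) is ill-conditioned when the principal point approaches the horizon (so that $\mathbf{v}_z$ recedes to infinity), i.e. when the camera tilt $\theta_x \to 0$. Wide fields of view also amplify focal-length error due to the steep slope of $\tan$ near $\pi/2$.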
- A non-planar ground surface breaks the bird's-eye-view assumption; objects above the ground plane (vehicles, people) are not correctly rectified even when the ground plane is correctly handled.
- The stereographic-sphere regression target is not image-observable: the encoded scalars do not correspond to any directly detectable feature in the image, a limitation noted in the paper.
- The output is metric only up to an overall scale; one known reference distance is required for absolute measurements.
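The last remark can be made concrete with a short Rust sketch: given the rectifying homography and two ground-plane image points whose real separation is known, the ratio of the known distance to the rectified pixel distance gives the scale. The names `Mat3`, `apply_h` and `scale_per_pixel` are hypothetical, and both points are assumed to lie on the ground plane below the horizon.

```rust
type Mat3 = [[f64; 3]; 3];

/// Apply a homography to a pixel and dehomogenise.
fn apply_h(m: &Mat3, p: (f64, f64)) -> (f64, f64) {
    let x = m[0][0] * p.0 + m[0][1] * p.1 + m[0][2];
    let y = m[1][0] * p.0 + m[1][1] * p.1 + m[1][2];
    let s = m[2][0] * p.0 + m[2][1] * p.1 + m[2][2];
    (x / s, y / s)
}

/// World units per bird's-eye-view pixel, from one reference distance
/// between two ground-plane image points p and q.
pub fn scale_per_pixel(
    h: &Mat3,
    p: (f64, f64),
    q: (f64, f64),
    known_dist: f64,
) -> Option<f64> {
    let (px, py) = apply_h(h, p);
    let (qx, qy) = apply_h(h, q);
    let d = (px - qx).hypot(py - qy);
    if d.is_finite() && d > 0.0 {
        Some(known_dist / d)
    } else {
        None
    }
}
```

Multiplying any distance measured in the rectified image by this factor yields an absolute ground-plane distance, under the flat-ground assumption stated above.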
References
- S. A. Abbas, A. Zisserman. A Geometric Approach to Obtain a Bird's Eye View From an Image. ICCV Workshops, 2019. arXiv.02231
- K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. arXiv.03385
- R. Hartley, A. Zisserman. Multiple View Geometry in Computer Vision. 2nd ed., Cambridge University Press, 2003.
flowchart TB
I["Perspective image"] --> C["CNN regresses vertical vanishing point and horizon line"]
C --> K["Closed-form rectifying homography H — eq. 6"]
K --> O["Bird's-eye view"]