Geometric Bird's-Eye-View Rectification

Based on
A Geometric Approach to Obtain a Bird's Eye View From an Image
Abbas, Zisserman · ICCVW 2019

Goal

Rectify a single monocular perspective image to a geometrically correct bird's-eye (overhead) view. Input: one RGB image of a scene containing a dominant flat ground plane; no calibration target and no prior camera calibration. Output: a $3 \times 3$ homography $H$ that warps the image so that ground-plane geometry is metrically correct up to an overall scale. The homography is constructed in closed form from two projective entities regressed by a CNN: the vertical vanishing point $v_z$ and the horizon line $h$. No correspondence solve, DLT, SVD, or RANSAC is performed; the entire geometric construction is algebraic once the two entities are known.

Algorithm

Notation used throughout:

  • $I$: the input perspective image, of width $w$ and height $h_\mathrm{im}$ in pixels.
  • $H$: the $3 \times 3$ rectifying homography mapping $I$ to the bird's-eye view.
  • $K$: the $3 \times 3$ camera calibration matrix; $f$ is the camera focal length in pixels.
  • $v_z$: the vertical vanishing point (image of the world vertical direction).
  • $h$: the horizon line with homogeneous line coefficients $(a, b, c)$, satisfying $ax + by + c = 0$.
  • $\omega$: the image of the absolute conic, $\omega = (KK^T)^{-1}$.
  • $\theta_z$, $\theta_x$: the camera roll and tilt angles.
  • $R_\mathrm{roll}$, $R_\mathrm{tilt}$: the rotation matrices removing camera roll and camera tilt.
  • $R_\mathrm{align}$: the optional rotation aligning the principal horizontal direction to a coordinate axis.
  • $T_\mathrm{scene}$: the translation matrix mapping the warped corners to the output canvas.
  • $H_\mathrm{rot}$: the intermediate rotational homography.
  • $b = 500$: the number of regression bins per scalar; $c = 11$: the number of top bins used in the weighted decode. In the CNN regression context these symbols are distinct from the line coefficients $b$ and $c$ of $h$.
  • $r$: the stereographic sphere radius.

Definition
Calibration matrix

Under the simplified pinhole model (square pixels, principal point at the image centre), the calibration matrix is:

$$K = \begin{pmatrix} f & 0 & w/2 \\ 0 & f & h_\mathrm{im}/2 \\ 0 & 0 & 1 \end{pmatrix}$$

(eq. 3). The focal length $f$ is the single unknown; it is recovered from the horizon and vanishing point.

Definition
Focal-length recovery via the absolute conic

The image of the absolute conic $\omega = (KK^T)^{-1}$ satisfies the linear constraint

$$h = \omega\, v_z$$

(eq. 4), where the equality holds up to a nonzero scale factor because $h$ and $v_z$ are homogeneous. Given $v_z$ and $h$, this equation determines $f$ in closed form under the calibration model above.
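
As a worked illustration of eq. 4 under the calibration model of eq. 3: write $(u_0, v_0) = (w/2, h_\mathrm{im}/2)$ and $v_z = (x, y, 1)^T$. Expanding $\omega v_z$ gives $h \propto (x - u_0,\; y - v_0,\; f^2 - u_0(x - u_0) - v_0(y - v_0))^T$, and eliminating the unknown scale yields $f^2 = d^2\,(a u_0 + b v_0 + c) / (a(x - u_0) + b(y - v_0))$, where $d$ is the distance from $v_z$ to the principal point. The Rust sketch below evaluates this expression. The function name and the `Option` return are illustrative assumptions, not taken from the paper, and because CNN outputs are noisy the two entities need not be exactly consistent, so this is a minimal sketch rather than the authors' procedure.

```rust
/// Recover the focal length f (pixels) from the vertical vanishing point and
/// the horizon line, assuming K = [[f, 0, w/2], [0, f, h_im/2], [0, 0, 1]].
/// Hypothetical helper: returns None for degenerate or inconsistent inputs.
pub fn focal_from_horizon_and_vz(
    vz: (f64, f64),           // vertical vanishing point, pixels
    horizon: (f64, f64, f64), // horizon line (a, b, c): a*x + b*y + c = 0
    w: f64,
    h_im: f64,
) -> Option<f64> {
    let (u0, v0) = (w / 2.0, h_im / 2.0);
    let (a, b, c) = horizon;
    let (dx, dy) = (vz.0 - u0, vz.1 - v0);

    // lambda * d^2, with lambda the unknown scale of the horizon line
    let lambda_d2 = a * dx + b * dy;
    // lambda * f^2: the horizon line evaluated at the principal point
    let lambda_f2 = a * u0 + b * v0 + c;

    if lambda_d2.abs() < f64::EPSILON {
        return None; // v_z at the principal point, or line normal orthogonal to it
    }
    // The scale lambda cancels in the ratio, so the sign of (a, b, c) is irrelevant.
    let f2 = (dx * dx + dy * dy) * lambda_f2 / lambda_d2;
    if f2.is_finite() && f2 > 0.0 {
        Some(f2.sqrt())
    } else {
        None
    }
}
```

For a 1280×720 image with $v_z = (640, 1360)$ and horizon $y = -280$, i.e. $(a, b, c) = (0, 1, 280)$, the expression gives $f^2 = 10^6 \cdot 640 / 1000$, so $f = 800$ pixels.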

Definition
Intermediate rotational homography

The combined roll-and-tilt correction is the homography

$$H_\mathrm{rot} = K R_\mathrm{tilt} K^{-1} R_\mathrm{roll}$$

(eq. 5). It maps image pixels to a roll- and tilt-corrected overhead coordinate frame.

Definition
Rectifying homography

The full bird's-eye-view homography is

$$H = R_\mathrm{align}\, T_\mathrm{scene}\, K R_\mathrm{tilt} K^{-1} R_\mathrm{roll}$$

(eq. 6). $T_\mathrm{scene}$ translates the warped image so all corners lie in the positive canvas region; $R_\mathrm{align}$ is optional.

Procedure

Algorithm
Geometric bird's-eye-view rectification
Input: Perspective RGB image $I$ of width $w$ and height $h_\mathrm{im}$
Output: Warped bird's-eye-view image; rectifying homography $H$
  1. Regress $v_z$ and $h$ from $I$ using the CNN (four scalars total, decoded from the stereographic-sphere representation).
  2. Recover $f$ and form $K$ (eq. 3) by solving $h = \omega v_z$ with $\omega = (KK^T)^{-1}$ (eq. 4).
  3. Compute roll $\theta_z = \arctan(-a/b)$ from the horizon line $ax + by + c = 0$; form $R_\mathrm{roll}$.
  4. Compute tilt $\theta_x = \tfrac{\pi}{2} - \arctan(v_z / f)$, where $v_z$ here denotes the scalar distance from the vertical vanishing point to the principal point, not the point itself; form $R_\mathrm{tilt}$.
  5. Form the intermediate rotational homography $H_\mathrm{rot} = K R_\mathrm{tilt} K^{-1} R_\mathrm{roll}$ (eq. 5).
  6. Map the four image corners through $H_\mathrm{rot}$ to determine the bounding box of the warped canvas; build $T_\mathrm{scene}$ to shift all corners into the positive quadrant.
  7. Compose the rectifying homography $H = R_\mathrm{align}\, T_\mathrm{scene}\, K R_\mathrm{tilt} K^{-1} R_\mathrm{roll}$ (eq. 6).
  8. Warp $I$ by $H$ (bilinear interpolation) to produce the bird's-eye-view image.
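
Step 6 of the procedure can be sketched as follows. The `Mat3` alias and both function names are illustrative assumptions; the paper does not prescribe an implementation. The sketch also assumes that all four corners map to finite points. A corner on or above the horizon has a non-positive homogeneous scale and no valid ground-plane image, so a practical implementation would first crop the input below the horizon.

```rust
/// Row-major 3x3 matrix (hypothetical alias for this sketch).
pub type Mat3 = [[f64; 3]; 3];

/// Apply a homography to a pixel and dehomogenise the result.
pub fn apply_h(h: &Mat3, p: (f64, f64)) -> (f64, f64) {
    let x = h[0][0] * p.0 + h[0][1] * p.1 + h[0][2];
    let y = h[1][0] * p.0 + h[1][1] * p.1 + h[1][2];
    let s = h[2][0] * p.0 + h[2][1] * p.1 + h[2][2];
    (x / s, y / s)
}

/// Map the four image corners through `h_rot`, take their bounding box, and
/// return (T_scene, canvas_width, canvas_height). T_scene is the translation
/// that moves the bounding box's top-left corner to the origin.
pub fn scene_translation(h_rot: &Mat3, w: f64, h_im: f64) -> (Mat3, f64, f64) {
    let corners = [(0.0, 0.0), (w, 0.0), (w, h_im), (0.0, h_im)];
    let (mut min_x, mut min_y) = (f64::INFINITY, f64::INFINITY);
    let (mut max_x, mut max_y) = (f64::NEG_INFINITY, f64::NEG_INFINITY);
    for &corner in &corners {
        let (x, y) = apply_h(h_rot, corner);
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    let t_scene = [
        [1.0, 0.0, -min_x],
        [0.0, 1.0, -min_y],
        [0.0, 0.0, 1.0],
    ];
    (t_scene, max_x - min_x, max_y - min_y)
}
```

The returned width and height give the output canvas size to pass to whatever warping routine is used in step 8.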

CNN regression target

The vertical vanishing point $v_z$ and the horizon line $h$ may lie at or near the image boundary or entirely outside the image frame. Representing them as raw pixel coordinates is numerically unsafe when values are large or infinite.

Stereographic-sphere encoding. A projective point or line that may be at infinity is mapped to a finite pair of scalars, each in $[-r, r]$, via a stereographic construction: the point is first lifted onto a sphere of radius $r$ centred at $(0, 0, r)$, and then projected orthogonally back to the plane. A line is encoded via the normal to the plane it defines with the sphere centre. The resulting four scalars (two for $v_z$, two for $h$) are bounded regardless of the projective position of the geometric entity. This is the regression target the CNN is trained to predict.

Regression-by-classification head. Each of the four scalars is discretised into $b = 500$ equal-width bins spanning the encoded range $[-r, r]$. The CNN head predicts a softmax distribution over these $500$ bins. The decoded scalar is the probability-weighted average of the top $c = 11$ bins by softmax probability. This decoding strategy reduces effective quantisation error relative to hard argmax and smooths the output without requiring a separate regression branch.
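
The top-$c$ decode described above can be sketched in a few lines of Rust. Two details are assumptions of this sketch rather than statements from the paper: each bin is represented by its centre, and the weights are renormalised over the selected $c$ bins. The function name is hypothetical.

```rust
/// Decode one scalar from a softmax distribution over equal-width bins
/// spanning [-r, r], using the probability-weighted average of the `c`
/// most probable bins.
/// Assumptions: bin centres are the representative values, and the weights
/// are renormalised over the selected bins.
pub fn decode_top_c(probs: &[f64], r: f64, c: usize) -> Option<f64> {
    let b = probs.len();
    let width = 2.0 * r / b as f64;

    // Bin indices sorted by descending probability.
    let mut idx: Vec<usize> = (0..b).collect();
    idx.sort_by(|&i, &j| probs[j].total_cmp(&probs[i]));
    let top = &idx[..c.min(b)];

    let mass: f64 = top.iter().map(|&i| probs[i]).sum();
    let weighted: f64 = top
        .iter()
        .map(|&i| probs[i] * (-r + (i as f64 + 0.5) * width))
        .sum();

    if mass > 0.0 {
        Some(weighted / mass)
    } else {
        None // empty input or zero probability mass
    }
}
```

With the settings in the text, the call would be `decode_top_c(&softmax_row, r, 11)` on a 500-element slice, once per scalar.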

If the focal length $f$ is known from prior calibration, only the horizon line $h$ (two scalars) is needed; the tilt and roll are then fully determined from $h$ alone.
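
To illustrate the calibrated case: under the calibration model of eq. 3, eq. 4 implies $d \cdot d_h = f^2$, where $d$ is the distance from the principal point to $v_z$ and $d_h$ is the distance from the principal point to the horizon. Substituting into the tilt formula gives $\theta_x = \tfrac{\pi}{2} - \arctan(f / d_h) = \arctan(d_h / f)$, so $v_z$ is no longer needed. The following Rust sketch computes both angles from $h$ and $f$; the function name is hypothetical, and the tilt is returned as a magnitude because the sign convention is not fixed by the text.

```rust
/// Roll and tilt (radians) from the horizon line and a known focal length,
/// assuming the principal point is the image centre.
/// Returns None if the line has a zero normal (a = b = 0).
pub fn roll_tilt_from_horizon(
    horizon: (f64, f64, f64), // (a, b, c): a*x + b*y + c = 0
    f: f64,                   // focal length, pixels
    w: f64,
    h_im: f64,
) -> Option<(f64, f64)> {
    let (a, b, c) = horizon;
    let norm = a.hypot(b);
    if norm == 0.0 {
        return None;
    }
    // Roll: arctan(-a / b), evaluated with atan2 so that b = 0 is handled.
    let roll = (-a).atan2(b);
    // Distance from the principal point to the horizon line.
    let d_h = (a * w / 2.0 + b * h_im / 2.0 + c).abs() / norm;
    // Tilt magnitude: arctan(d_h / f), equal to pi/2 - arctan(d / f) when d * d_h = f^2.
    let tilt = d_h.atan2(f);
    Some((roll, tilt))
}
```

For example, with $f = 800$, a 1280×720 image, and horizon $(0, 1, 280)$, the distance is $d_h = 640$ and the tilt is $\arctan(0.8)$, matching $\tfrac{\pi}{2} - \arctan(1000/800)$ from the uncalibrated formula with $d = 1000$.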

Implementation

Given the focal length $f$ recovered from $h = \omega v_z$ (eq. 4), the closed-form homography construction can be written in Rust as follows:

use nalgebra::Matrix3;

pub fn bev_homography(
    vz: (f64, f64),            // vertical vanishing point, pixels
    horizon: (f64, f64, f64),  // horizon line (a, b, c): a·x + b·y + c = 0
    f: f64,                    // focal length, recovered from h = ω·v_z (eq. 4)
    w: f64,
    h: f64,
) -> Matrix3<f64> {
    let (a, b, _) = horizon;

    // eq. 3 — calibration matrix K (square pixels, centred principal point)
    #[rustfmt::skip]
    let k = Matrix3::new(f,   0.0, w / 2.0,
                         0.0, f,   h / 2.0,
                         0.0, 0.0, 1.0);
    let k_inv = k.try_inverse().expect("K is invertible for f > 0");

    // Step A — roll θ_z from the horizon line
    let tz = (-a).atan2(b);
    #[rustfmt::skip]
    let r_roll = Matrix3::new(tz.cos(), -tz.sin(), 0.0,
                              tz.sin(),  tz.cos(), 0.0,
                              0.0,       0.0,      1.0);

    // Step B — tilt θ_x from the vertical vanishing point
    let d = (vz.0 - w / 2.0).hypot(vz.1 - h / 2.0);
    let tx = std::f64::consts::FRAC_PI_2 - d.atan2(f);
    #[rustfmt::skip]
    let r_tilt = Matrix3::new(1.0, 0.0,       0.0,
                              0.0, tx.cos(), -tx.sin(),
                              0.0, tx.sin(),  tx.cos());

    // eq. 5 — intermediate rotational homography
    let h_rot = k * r_tilt * k_inv * r_roll;

    // eq. 6 — H = R_align · T_scene · H_rot; T_scene shifts the warped corners
    // into the canvas, R_align is optional. Identity placeholders shown here.
    let (t_scene, r_align) = (Matrix3::<f64>::identity(), Matrix3::<f64>::identity());
    r_align * t_scene * h_rot
}

The CNN front-end — a standard image-classification trunk followed by the regression-by-classification head — is a conventional network and is not shown; the geometric construction above is the method's specific content.

Remarks

  • The closed-form construction is $O(1)$ per image given the CNN output; the dominant cost is the single CNN forward pass.
  • The method parameterises the BEV homography with 4 scalars for uncalibrated cameras, or 2 scalars when $f$ is known, versus 8 degrees of freedom for a general homography.
  • Focal-length recovery from $h = \omega v_z$ (eq. 4) is ill-conditioned when $v_z$ approaches the horizon, i.e. when the camera tilt $\theta_x \to 0$. Wide fields of view also amplify focal-length error due to the steep slope of $f \propto 1 / \tan(\gamma/2)$ near $\gamma = \pi/2$, with $\gamma$ the field of view.
  • A non-planar ground surface breaks the bird's-eye-view assumption; objects above the ground plane (vehicles, people) are not correctly rectified even when the ground plane is correctly handled.
  • The stereographic-sphere regression target is not image-observable: the encoded scalars do not correspond to any directly detectable feature in the image, a limitation noted in the paper.
  • The output is metric only up to an overall scale; one known reference distance is required for absolute measurements.
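
The last remark can be made concrete with a small Rust sketch: given two ground-plane points in the rectified image whose real-world separation is known, the ratio of the two distances fixes the scale. The function name and the choice of metres as the unit are assumptions for illustration.

```rust
/// Metres per bird's-eye-view pixel, from two rectified ground-plane points
/// `p` and `q` (BEV pixel coordinates) whose true separation is `known_metres`.
/// Returns None if the two points coincide.
pub fn metres_per_pixel(p: (f64, f64), q: (f64, f64), known_metres: f64) -> Option<f64> {
    let pixel_dist = (q.0 - p.0).hypot(q.1 - p.1);
    if pixel_dist > 0.0 {
        Some(known_metres / pixel_dist)
    } else {
        None
    }
}
```

Multiplying any distance measured in the rectified image by this factor converts it to metres, provided both endpoints lie on the ground plane.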

References

  1. S. A. Abbas, A. Zisserman. A Geometric Approach to Obtain a Bird's Eye View From an Image. ICCV Workshops, 2019. arXiv
    .02231
  2. K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. arXiv
    .03385
  3. R. Hartley, A. Zisserman. Multiple View Geometry in Computer Vision. 2nd ed., Cambridge University Press, 2003.
Pipeline overview (flowchart): perspective image → CNN regresses vertical vanishing point and horizon line → closed-form rectifying homography $H$ (eq. 6) → bird's-eye view.