Pose Estimation | VitaVision
Back to atlas

Pose Estimation

7 min readIntermediateView in graph
Based on
EPnP: An Accurate O(n) Solution to the PnP Problem
Lepetit, Moreno-Noguer, Fua · International Journal of Computer Vision 2009
DOI ↗

Definition

Pose estimation is the recovery of the 6-DOF rigid transformation that places a camera relative to a scene, an object, or a second camera. The transformation is a rotation RSO(3)R \in SO(3) and translation tR3t \in \mathbb{R}^3; together they form the extrinsic matrix [Rt][R \mid t] of the pinhole camera model. Two structurally distinct sub-problems are in common use, distinguished by the input correspondences.

Definition
Absolute pose — Perspective-n-Point

Given a calibrated camera with intrinsic matrix KK and nn correspondences between known 3-D world points M~i\tilde{M}_i and their observed image projections m~i\tilde{m}_i, recover [Rt][R \mid t] such that m~iK[Rt]M~i\tilde{m}_i \sim K\,[R \mid t]\,\tilde{M}_i for all ii. Input: n4n \geq 4 2D-3D point pairs and KK. Output: the camera pose [Rt][R \mid t].

Definition
Relative pose from two views

Given two calibrated views with corresponding image points xixi\mathbf{x}_i \leftrightarrow \mathbf{x}_i' in normalised coordinates, recover the rotation RR and translation direction T\mathbf{T} (unit norm, scale unobservable) satisfying the epipolar constraint xiTExi=0\mathbf{x}_i'^{\,T} E\, \mathbf{x}_i = 0, where E=[T]×RE = [\mathbf{T}]_\times R is the essential matrix. Input: n8n \geq 8 corresponding normalised points. Output: RR and T\mathbf{T} up to scale, with a four-fold ambiguity resolved by cheirality.

Mathematical Description

Absolute pose: Perspective-n-Point

The calibrated PnP problem asks for [Rt][R \mid t] given nn 2D-3D correspondences and KK. EPnP solves it in O(n)O(n) by expressing every reference point as a barycentric combination of four virtual control points,

piw=j=14αijcjw,j=14αij=1.\mathbf{p}_i^w = \sum_{j=1}^{4} \alpha_{ij}\,\mathbf{c}_j^w, \qquad \sum_{j=1}^{4} \alpha_{ij} = 1.

Because affine combinations are preserved by rigid motion, the same weights hold in camera coordinates, pic=jαijcjc\mathbf{p}_i^c = \sum_j \alpha_{ij}\,\mathbf{c}_j^c. Substituting the camera-frame decomposition into the perspective projection and eliminating the projective scale yields two linear equations per point; stacking nn points gives the 2n×122n \times 12 homogeneous system Mx=0M\mathbf{x} = \mathbf{0}, whose solution lies in the null space of the constant-size 12×1212 \times 12 matrix MMM^\top M,

x=i=1Nβivi,\mathbf{x} = \sum_{i=1}^{N} \beta_i\,\mathbf{v}_i,

with N{1,2,3,4}N \in \{1,2,3,4\} the effective null-space dimension. The βi\beta_i are fixed by requiring inter-control-point distances in camera coordinates to match the world distances, and the pose is extracted from the recovered control points by absolute orientation. The minimal cases are n=3n = 3 (planar) and n=4n = 4 (general position).

Relative pose: the essential matrix

For two calibrated views with normalised coordinates, the essential matrix EE encodes the relative pose through the bilinear epipolar constraint xiTExi=0\mathbf{x}_i'^{\,T} E\, \mathbf{x}_i = 0, with

E=RS,S=[T]×,E = R\,S, \qquad S = [\mathbf{T}]_\times,

the skew-symmetric matrix of the unit translation T\mathbf{T}. Each correspondence supplies one linear constraint on the nine entries of EE; eight correspondences determine their ratios by linear least squares, and the scale is fixed by tr(EE)=2\mathrm{tr}(E^\top E) = 2. The translation components follow from EEE^\top E, whose diagonal entries are 1Tλ21 - T_\lambda^2 and off-diagonal entries TλTμ-T_\lambda T_\mu; the rotation follows in closed form. The recovered pose carries a four-fold sign ambiguity, resolved by cheirality — all reconstructed points must lie in front of both cameras.

Pose from a calibration target

During calibration, per-view extrinsic pose is an implicit output. For a planar target the projection reduces to the homography sm~=HM~2Ds\,\tilde{m} = H\,\tilde{M}_{2D} with H=λK[r1    r2    t]H = \lambda K[r_1\;\;r_2\;\;t]; the rotation columns and translation are extracted directly,

r1=λK1h1,r2=λK1h2,t=λK1h3,r_1 = \lambda K^{-1} h_1, \qquad r_2 = \lambda K^{-1} h_2, \qquad t = \lambda K^{-1} h_3,

with λ=1/K1h1\lambda = 1/\|K^{-1}h_1\| enforcing unit norm and r3=r1×r2r_3 = r_1 \times r_2 completing the rotation.

Nonlinear refinement

Closed-form solutions seed a nonlinear refinement that minimises the total reprojection error idist2(K[Rt]M~i,m~i)\sum_i \mathrm{dist}^2(K[R\mid t]\tilde{M}_i,\, \tilde{m}_i) over the six pose degrees of freedom — and, during calibration, jointly over the intrinsics.

Numerical Concerns

Rotation parameterisation. A rotation matrix has nine entries under six constraints; unconstrained optimisation of the entries violates orthogonality. The Rodrigues 3-vector keeps all nine entries consistent and the Jacobian unconstrained; a unit quaternion is the alternative, with a unit-norm side constraint handled by normalisation.

Translation scale ambiguity. Relative pose recovers T\mathbf{T} only as a direction; absolute scale is unobservable from image data alone and requires a metric anchor — a known distance, a stereo baseline, or an IMU.

Planar and collinear degeneracy. A coplanar, tilted point set admits two valid PnP solutions; collinear points make the constraint matrix rank-deficient regardless of nn. For the eight-point algorithm, coplanar or collinear configurations similarly cause rank loss.

Sensitivity to correspondence noise. PnP solvers and the eight-point algorithm are pure model fitters — a single mismatch degrades the solution. The standard remedy wraps the solver in a RANSAC loop over minimal samples.

Minimal vs overdetermined solvers. The minimal calibrated absolute-pose solver is P3P; the minimal relative-pose solver is the five-point algorithm. EPnP targets the overdetermined regime (n4n \geq 4) and is suited to run after RANSAC inlier selection.

Reprojection error vs pose error. Reprojection error is the measurable surrogate for pose error, but minimising it does not strictly minimise rotation or translation error; near degeneracy or at high noise the two criteria disagree.

Where it appears

Pose estimation appears at every layer of the calibration and localisation pipeline.

  • epnp — the calibrated PnP solver: absolute pose in O(n)O(n) from n4n \geq 4 2D-3D correspondences with a known intrinsic matrix.
  • longuet-higgins-eight-point — the foundational linear algorithm for relative pose from n8n \geq 8 calibrated correspondences; introduces the essential matrix and cheirality resolution.
  • zhang-planar-calibration — recovers per-view extrinsic pose alongside intrinsics; pose extraction follows from the per-view homography factorisation.
  • tsai-versatile-calibration — recovers extrinsic pose from a 3-D calibration rig via the radial alignment constraint.

Pose is defined within the pinhole-camera-model: it is precisely the extrinsic component [Rt][R \mid t] of the central projection equation.

References

  1. V. Lepetit, F. Moreno-Noguer, P. Fua. EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 81(2)
    –166, 2009.
  2. H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293
    –135, 1981.
  3. Z. Zhang. A Flexible New Technique for Camera Calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11)
    –1334, 2000.
  4. G. Schweighofer, A. Pinz. Robust Pose Estimation from a Planar Target. IEEE TPAMI, 28(12)
    –2030, 2006.
  5. D. Nistér. An Efficient Solution to the Five-Point Relative Pose Problem. IEEE TPAMI, 26(6)
    –777, 2004.