Motivation
Produce dense, sub-pixel 2D-to-2D correspondences between an image pair without any keypoint detector. Input: a pair of grayscale images $(I^A, I^B)$. Output: a set of matched positions at sub-pixel accuracy, consumed by pose estimators, SfM pipelines, or visual localisation engines. The defining property is a coarse-to-fine, detector-free design: both images are encoded by a shared CNN backbone; the resulting coarse feature maps are processed by a stack of interleaved self- and cross-attention layers using the Linear Transformer approximation, which gives every position a global receptive field and context-aware representation; a differentiable matching layer then selects confident mutual nearest-neighbour pairs; and a fine-level module refines each selected pair to sub-pixel accuracy. The global attention mechanism allows the model to establish correspondences in low-texture and homogeneous regions where repeatability-based keypoint detectors fail to find interest points.
Architecture
Family & shape. Hybrid encoder–decoder with Transformer cross-attention. Input: image pair $(I^A, I^B)$. Stage 1 extracts coarse feature maps $\tilde{F}^A$, $\tilde{F}^B$ at $1/8$ resolution and fine feature maps $\hat{F}^A$, $\hat{F}^B$ at $1/2$ resolution via a shared CNN backbone (ResNet-like with FPN structure). Stages 2–3 operate on the flattened coarse maps and stage 4 refines on the fine maps, culminating in the final correspondence set $\mathcal{M}_f$.
Blocks. Four sub-modules executed in sequence:
- Local feature CNN. A shared ResNet-like backbone with FPN structure extracts two pairs of feature maps: coarse maps at $1/8$ resolution ($\tilde{F}^A$, $\tilde{F}^B$) and fine maps at $1/2$ resolution ($\hat{F}^A$, $\hat{F}^B$). 2D sinusoidal positional encoding (DETR-style) is added once to the coarse maps at backbone output.
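- Coarse-level Local Feature Transformer (LoFTR module). The coarse maps are flattened to 1D sequences of length $N$ and processed by interleaved self-attention and cross-attention layers. To reduce complexity from $O(N^2)$ to $O(N)$, each attention operation uses the Linear Transformer kernel $\mathrm{sim}(Q, K) = \phi(Q)\,\phi(K)^\top$ with $\phi(\cdot) = \mathrm{elu}(\cdot) + 1$:
$$\mathrm{LinearAttention}(Q, K, V) = \phi(Q)\,\bigl(\phi(K)^\top V\bigr),$$
where the key–value product $\phi(K)^\top V$ is computed once and shared across all queries (the per-query normaliser of the Linear Transformer is omitted here for brevity). The ELU+1 kernel replaces vanilla softmax attention with a non-negative kernel that admits this associativity trick, which reduces the cost when the feature dimension $D \ll N$.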
Self-attention layers aggregate context within one image; cross-attention layers aggregate context across the image pair. Outputs are context- and position-dependent representations $\tilde{F}^A_{tr}$, $\tilde{F}^B_{tr}$.
- Coarse matching module. A score matrix is formed as:
$$\mathcal{S}(i, j) = \frac{1}{\tau}\,\bigl\langle \tilde{F}^A_{tr}(i),\ \tilde{F}^B_{tr}(j) \bigr\rangle,$$
where $\tau$ is a temperature parameter. Two matching variants are offered: (a) dual-softmax (LoFTR-DS), which applies row-wise and column-wise softmax to $\mathcal{S}$ and multiplies the results pointwise to form the confidence matrix $\mathcal{P}_c$; (b) optimal transport (LoFTR-OT), which applies the Sinkhorn algorithm with 3 iterations. Coarse matches $\mathcal{M}_c$ are selected by a confidence threshold $\theta_c$ combined with a mutual-nearest-neighbour (MNN) criterion.
A simplified, single-head sketch of the self/cross attention update at each layer, illustrating the residual structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """Self- or cross-attention layer with the ELU+1 Linear Transformer kernel."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.to_out = nn.Linear(dim, dim)
        self.eps = eps

    def forward(self, x: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) queries; source: (B, M, D) keys/values (same as x for self-attention)
        q, _, _ = self.to_qkv(x).chunk(3, dim=-1)
        _, k, v = self.to_qkv(source).chunk(3, dim=-1)
        q, k = F.elu(q) + 1.0, F.elu(k) + 1.0  # ELU+1 kernel keeps features positive
        # O(N) associativity: build the (D, D) key-value summary once, then apply it to each query
        kv = torch.einsum("bmd,bme->bde", k, v)  # (B, D, D)
        # Per-query normaliser phi(q) . sum_m phi(k_m), as in the Linear Transformer
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + self.eps)  # (B, N)
        out = torch.einsum("bnd,bde,bn->bne", q, kv, z)  # (B, N, D)
        return x + self.to_out(out)  # residual connection
```
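The dual-softmax variant of the coarse matching module can be sketched in a few lines. This is a minimal NumPy illustration, not the official implementation: the function names, the default `temperature` and `threshold` values, and the use of unbatched `(N, D)` feature arrays are all assumptions made for the example.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_match_dual_softmax(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """Hypothetical helper: dual-softmax confidence plus MNN selection.

    feat_a: (N, D) transformed coarse features of image A
    feat_b: (M, D) transformed coarse features of image B
    Returns the (N, M) confidence matrix and a (K, 2) array of (i, j) matches.
    """
    scores = feat_a @ feat_b.T / temperature             # score matrix S
    conf = softmax(scores, axis=1) * softmax(scores, axis=0)  # confidence P_c
    # Mutual nearest neighbour: the entry is the maximum of both its row and its column
    row_best = conf == conf.max(axis=1, keepdims=True)
    col_best = conf == conf.max(axis=0, keepdims=True)
    mask = (conf > threshold) & row_best & col_best
    i_idx, j_idx = np.nonzero(mask)
    return conf, np.stack([i_idx, j_idx], axis=1)


# Usage: features of B are a permutation of those of A, so every row has one clear partner
feat_a = np.eye(4)
feat_b = np.eye(4)[[2, 0, 3, 1]]
conf, matches = coarse_match_dual_softmax(feat_a, feat_b)
```

Multiplying the row-wise and column-wise softmax outputs means an entry is only confident when it dominates in both directions, which is why the threshold and the MNN test together tend to reject ambiguous pairs.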
- Fine-level refinement. For each coarse match $(\tilde{i}, \tilde{j})$, a $w \times w$ window is cropped from the fine feature maps $\hat{F}^A$, $\hat{F}^B$. A correlation volume over this window produces logits; the expected sub-pixel position is computed as a probability-weighted sum over the window grid, yielding the final match set $\mathcal{M}_f$ with sub-pixel precision.
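The expectation step of the fine-level refinement can be illustrated as a soft-argmax over the correlation window. The sketch below is a NumPy approximation under stated assumptions: `refine_subpixel` is a hypothetical name, the window is given as a `(w, w, D)` array, no extra temperature is applied to the logits, and the returned offset is in normalised window coordinates in $[-1, 1]$ rather than pixels.

```python
import numpy as np


def refine_subpixel(center_feat: np.ndarray, window_feats: np.ndarray):
    """Hypothetical helper: expected position inside a correlation window.

    center_feat:  (D,) fine feature at the window centre in image A
    window_feats: (w, w, D) fine features of the matching window in image B
    Returns ((x, y) offset in normalised [-1, 1] coordinates, (w, w) heatmap).
    """
    w = window_feats.shape[0]
    logits = window_feats @ center_feat          # (w, w) correlation logits
    logits = logits - logits.max()               # stabilise the softmax
    heat = np.exp(logits)
    heat = heat / heat.sum()                     # matching probability over the window
    coords = np.linspace(-1.0, 1.0, w)
    grid_x, grid_y = np.meshgrid(coords, coords)  # grid_x varies along columns
    offset = np.array([(heat * grid_x).sum(), (heat * grid_y).sum()])
    return offset, heat


# Usage: a single strong response at row 1, column 3 of a 5x5 window
window = np.zeros((5, 5, 2))
window[1, 3] = [20.0, 0.0]
offset, heat = refine_subpixel(np.array([1.0, 0.0]), window)
```

Because the output is an expectation rather than a hard argmax, it stays differentiable and can land between grid cells, which is what provides the sub-pixel resolution.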
Training. The indoor model is trained on ScanNet and the outdoor model on MegaDepth, following the same protocol as SuperGlue. Ground-truth coarse matches $\mathcal{M}_c^{gt}$ are derived from camera poses and depth maps: mutual nearest neighbours of the $1/8$-resolution grids projected through known depth. The coarse loss is the negative log-likelihood of $\mathcal{P}_c$ over $\mathcal{M}_c^{gt}$ (NLL for dual-softmax; same formulation as SuperGlue for OT). The fine loss is an $\ell_2$ loss on the fine-level window predictions, weighted by the inverse variance of the predicted heatmap so that low-confidence predictions contribute less. Training runs end-to-end from random initialisation on 64 GTX 1080Ti GPUs, converging in approximately 24 hours for the indoor model. Images are resized to 640×480 for ScanNet and to 840 long-side for training on MegaDepth; MegaDepth evaluation uses 1200 long-side. Headline results: on ScanNet indoor relative pose estimation, LoFTR improves AUC@10° by 13% over SuperGlue and by 61% over DRC-Net. On HPatches homography estimation, LoFTR-DS achieves state-of-the-art AUC across the @3 px, @5 px, and @10 px corner-error thresholds.
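For the dual-softmax variant, the coarse loss described above amounts to averaging the negative log-confidence at the ground-truth match locations. The following is a minimal sketch, assuming a precomputed confidence matrix and an integer array of ground-truth index pairs; `coarse_nll_loss` and the `eps` guard are illustrative names and choices, not taken from the paper.

```python
import numpy as np


def coarse_nll_loss(conf: np.ndarray, gt_matches: np.ndarray, eps: float = 1e-9) -> float:
    """Hypothetical helper: mean negative log-likelihood over ground-truth coarse matches.

    conf:       (N, M) confidence matrix P_c
    gt_matches: (K, 2) integer array of ground-truth (i, j) index pairs
    """
    i, j = gt_matches[:, 0], gt_matches[:, 1]
    return float(-np.log(conf[i, j] + eps).mean())


# Usage: two ground-truth matches with confidences 0.5 and 0.25
conf = np.array([[0.5, 0.1],
                 [0.2, 0.25]])
gt = np.array([[0, 0], [1, 1]])
loss = coarse_nll_loss(conf, gt)
```

The loss only looks at entries that should be matches; the two softmax normalisations already force confidence at the wrong locations down whenever it rises at the right ones.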
Complexity. Runtime: 116 ms per 640×480 image pair for LoFTR-DS, 130 ms for LoFTR-OT (3 Sinkhorn iterations), measured on RTX 2080Ti. The LoFTR module operates on $1/8$-resolution feature sequences; at 640×480 this gives sequences of length $N = 80 \times 60 = 4800$, and the linear attention avoids the $O(N^2)$ cost of full softmax attention. The paper does not report a parameter count.
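The sequence length and the rough gap between softmax and linear attention can be checked with a short calculation. This is only a back-of-the-envelope sketch: the function names are hypothetical, the feature dimension `d=256` is an assumption for illustration, and the cost model counts single-head multiply-adds while ignoring constants, projections, and multi-head splitting.

```python
def coarse_sequence_length(height: int, width: int, stride: int = 8) -> int:
    """Number of coarse positions for a feature map at 1/stride resolution."""
    return (height // stride) * (width // stride)


def attention_cost(n: int, d: int) -> dict:
    """Rough multiply-add counts for one attention operation (assumed cost model)."""
    return {
        "softmax": n * n * d,  # form the N x N attention matrix, then apply it
        "linear": n * d * d,   # form the D x D key-value summary, then apply it
    }


n = coarse_sequence_length(480, 640)        # 60 * 80 positions
cost = attention_cost(n, d=256)             # d=256 is an illustrative assumption
ratio = cost["softmax"] / cost["linear"]    # equals n / d under this model
print(n, ratio)
```

Under this model the saving is simply $N / D$, so the benefit grows with image resolution and shrinks as the feature dimension increases.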
Implementations
Official PyTorch release with Apache-2.0 code and Apache-2.0 pretrained weights. Indoor and outdoor weights are distributed separately.
Assessment
Novelty.
- Replaces the conventional detect-describe-match pipeline (SIFT/ORB/SuperPoint → matcher) with an end-to-end dense matching architecture that requires no keypoint detector at any stage.
- Cross-attention in the LoFTR module provides every position a global receptive field, enabling matches to be established in textureless regions where CNN-based detectors find no repeatable interest points — a persistent blind spot of classical and learned detector-based methods since SIFT.
- Coarse-to-fine refinement combines Transformer-style global reasoning at $1/8$ resolution with sub-pixel correlation-based refinement at $1/2$ resolution, independent of any detector's localisation quality.
Strengths.
- On ScanNet indoor relative pose estimation, LoFTR improves AUC@10° by 13% over SuperGlue and by 61% over DRC-Net (Table §4.2). On HPatches homography estimation, LoFTR-DS achieves state-of-the-art AUC at @3 px, @5 px, and @10 px corner-error thresholds (§4.1).
- Strong in precisely the scenarios where detector-based methods fail: low-texture surfaces, wide-baseline indoor scenes (ScanNet), and outdoor landmark scenes with illumination variation (MegaDepth).
- Apache-2.0 code and weights license — commercial deployment is unrestricted.
Limitations.
- Dense attention at $1/8$ resolution imposes a memory cost proportional to the sequence length $N = HW/64$ even with linear attention. At 640×480 the coarse sequence length is 4800; on high-resolution inputs, the $N \times N$ coarse score matrix still scales quadratically in the image area, independent of any linear-attention savings.
- Slower per pair than sparse SuperPoint+SuperGlue at modest keypoint counts: 116 ms (LoFTR-DS) or 130 ms (LoFTR-OT) per 640×480 pair on RTX 2080Ti, whereas SuperGlue at 512 keypoints runs at approximately 69 ms on GTX 1080. The newer SuperPoint+LightGlue widens this gap further — Lindenberger et al. Fig. 1 reports approximately 8× the throughput of LoFTR at comparable accuracy on standard benchmarks.
- Two separate model weights exist for indoor (ScanNet) and outdoor (MegaDepth) scenes; inference-time domain mismatch between trained weights and scene type degrades performance, and no single universal model is available.
References
- J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou. LoFTR: Detector-Free Local Feature Matching with Transformers. CVPR, 2021. arXiv.00680
- P. Sarlin, D. DeTone, T. Malisiewicz, A. Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. CVPR, 2020. arXiv.11763
- D. DeTone, T. Malisiewicz, A. Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR Workshops, 2018. arXiv.07629
- G. Potje, F. Cadar, A. Araujo, R. Martins, E. R. Nascimento. XFeat: Accelerated Features for Lightweight Image Matching. CVPR, 2024. arXiv.19174
- P. Lindenberger, P. Sarlin, M. Pollefeys. LightGlue: Local Feature Matching at Light Speed. ICCV, 2023. arXiv.13643