Definition
The attention mechanism computes each output element as a weighted sum of value vectors, where the weights are determined by a learned, input-dependent compatibility function between query and key vectors.
Given a query matrix , key matrix , and value matrix , the attended representation is
Each output row is a softmax-weighted combination of the rows of , the weights reflecting the dot-product similarity between each query and all keys, scaled by to prevent saturation of the softmax. Input: , , . Output: , each row aggregating information from all positions weighted by their compatibility with the corresponding query.
The mechanism places no assumption on the positional arrangement of the inputs: every query can attend to every key regardless of distance. This distinguishes attention from convolution and recurrence, which constrain receptive fields by locality or sequential order.
Mathematical Description
Scaled dot-product attention
For query and key-value pairs , the scalar score is , the weight satisfies , and the output is . In the SuperGlue graph network each layer applies this as a residual update,
with linear projections of using independent parameters per layer.
Multi-head attention
Multi-head attention applies independent attention operations in parallel and concatenates them,
each using separate learned projections. SuperGlue, LightGlue, and LoFTR all use heads with hidden dimension ; each head learns a distinct compatibility function and the concatenated output is projected back to dimension .
Self-attention and cross-attention
The two attention modes differ in the source of queries, keys, and values. In self-attention all three derive from the same set, so each element aggregates context from the other elements of its own set. In cross-attention queries come from one set and keys and values from a second, so each element aggregates information from the other set. The matching networks alternate the two: in SuperGlue's attentional graph network, self-attention layers let a keypoint reason about the configuration of all keypoints in its own image, and cross-attention layers let it search for candidate correspondences in the other image. Ablation on the ScanNet benchmark shows that removing cross-attention degrades pose-estimation AUC@20° from 51.84% to 42.57%, and removing positional encoding drops it to 47.12% — both context types are essential.
Positional encoding
Attention weights depend only on query and key content, so without positional information the mechanism is permutation-invariant. LoFTR adds a 2-D sinusoidal absolute encoding to its backbone feature maps once, before the attention stack, allowing the transformer to distinguish positions on the -resolution grid even in textureless regions. LightGlue instead injects relative geometry at every self-attention layer via a rotary encoding,
where is block-diagonal with planar rotation blocks of angle for a learned basis . Keypoint positions must be normalised to ; raw pixel coordinates break the encoding.
Quadratic cost and the linear approximation
Full attention forms a dense score matrix, costing time and memory — for LoFTR's dense feature maps where . LoFTR applies a linear-attention approximation: replacing softmax with a non-negative kernel feature map and exploiting associativity,
reduces the cost to by computing the key-value aggregate once. The ELU+1 kernel keeps the weights non-negative — required for the factorisation to preserve normalisation — at the cost of a smoother, less peaked attention distribution than softmax.
Numerical Concerns
The scaling. Without it, the dot product has variance proportional to for unit-variance inputs; large magnitudes push the softmax into saturation where gradients vanish. Dividing by restores unit variance of the pre-softmax scores. With the denominator is .
Softmax stability. Naive exponentiation overflows in float32 once scores exceed roughly . The stable form subtracts the per-row maximum before exponentiating. SuperGlue's downstream Sinkhorn assignment runs entirely in the log domain to prevent overflow in its augmented score matrix, which is not -scaled.
Quadratic memory. At keypoints per image the cross-attention score matrix holds millions of float32 values per head; with 4 heads and both directions it reaches hundreds of megabytes. LightGlue prunes points classified as unmatchable after each layer, progressively shrinking the effective set size.
Linear-attention trade-off. The ELU+1 kernel produces smoother weights than softmax — attention collapse onto a single element is less likely, but so is strong selective attention to one dominant match, which slightly raises ambiguity in highly discriminative regions.
Low-precision inference. FP16 inference is standard; LightGlue's rotary rotation entries are trigonometric and well-conditioned because positions are normalised to . SuperGlue's matching descriptors are deliberately not -normalised — their magnitude encodes confidence — so scores are unbounded and 16-bit log-sum-exp precision in the Sinkhorn step matters.
Where it appears
Three registered models use the attention mechanism, each in a distinct way:
- superglue — an attentional graph network alternating self- and cross-attention layers; self-attention builds intra-image keypoint context, cross-attention discovers candidate matches. The attention layers are the context-enrichment stage feeding a Sinkhorn optimal-transport assignment.
- lightglue — extends the SuperGlue framework with rotary positional encoding and a per-layer confidence head that drives adaptive early exit; easy image pairs exit after an average of 4.7 of 9 layers, hard pairs run all 9, and token pruning reduces the quadratic cost at each layer.
- loftr — applies linear attention in interleaved self- and cross-attention layers over dense coarse feature maps at resolution, attending over all spatial positions so that matches emerge in low-texture regions where no detector fires.
The transformer architecture underlying all three originates outside computer vision, in sequence modelling for natural language processing; the matching networks adapt it to sets of local image features by pairing self-attention for intra-set context with cross-attention for inter-set correspondence search.
References
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention Is All You Need. NeurIPS, 2017.
- P.-E. Sarlin, D. DeTone, T. Malisiewicz, A. Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. IEEE CVPR, 2020.
- P. Lindenberger, P.-E. Sarlin, M. Pollefeys. LightGlue: Local Feature Matching at Light Speed. IEEE ICCV, 2023.
- J. Sun, Z. Shen, Y. Wang, H. Bao, X. Zhou. LoFTR: Detector-Free Local Feature Matching with Transformers. IEEE CVPR, 2021.
- J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, Y. Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv.09864, 2021.