Motivation
Jointly detect interest points and compute discriminative descriptors for full images in a single forward pass, trained without any human keypoint annotations. The detector head produces a sparse keypoint set (after threshold + NMS); the descriptor head produces a dense 256-D map sampled at the detected positions. The system replaces sequential detect-then-describe pipelines (SIFT's DoG + 128-D descriptor; LIFT's staged supervised architecture) with a single shared VGG-style encoder feeding two parallel decoder heads, and circumvents the absence of labelled real-image keypoint data via a self-supervised procedure called Homographic Adaptation.
Architecture
Family & shape. Fully-convolutional CNN. Input: grayscale image $I \in \mathbb{R}^{H \times W}$. Two simultaneous outputs: a per-pixel $H \times W$ keypoint score map (decoded from an $8 \times 8$ cell softmax) and a dense $H \times W \times 256$ descriptor map (bicubic-upsampled from $H/8 \times W/8$ and L2-normalised).
Blocks. Shared VGG-style encoder of eight $3 \times 3$ convolutions with channel widths 64-64-64-64-128-128-128-128 and a $2 \times 2$ max-pool after each of the first three pairs of convolutions; the three poolings yield $H_c = H/8$, $W_c = W/8$ at the encoder output (§3.1). Every conv layer is followed by ReLU + BatchNorm. Each decoder head is a single $3 \times 3$ conv (256 channels) followed by a $1 \times 1$ conv to 65 (detector) or 256 (descriptor) channels.
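The shape bookkeeping of the encoder can be traced with a few lines of plain Python. This is only a sketch: the layer list `ENCODER` and the function `trace_shapes` are illustrative names, and "same" padding for every convolution is an assumption rather than something stated in the text above.

```python
# Illustrative layer list: eight convs, a 2x2 max-pool after each of the first three pairs.
ENCODER = [
    ("conv", 64), ("conv", 64), ("pool",),
    ("conv", 64), ("conv", 64), ("pool",),
    ("conv", 128), ("conv", 128), ("pool",),
    ("conv", 128), ("conv", 128),
]


def trace_shapes(height, width):
    """Return (layer kind, height, width, channels) after every layer.

    Assumes 'same'-padded convolutions, so only the pools change the spatial size.
    """
    shapes = []
    channels = 1  # grayscale input
    for layer in ENCODER:
        if layer[0] == "conv":
            channels = layer[1]
        else:
            if height % 2 or width % 2:
                raise ValueError("spatial size must stay divisible by 2 at each pool")
            height, width = height // 2, width // 2
        shapes.append((layer[0], height, width, channels))
    return shapes


print(trace_shapes(480, 640)[-1])  # ('conv', 60, 80, 128)
```

For an input whose sides are multiples of 8, the last entry is the encoder output at one eighth of the input resolution with 128 channels.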
Detector head. A 65-channel softmax per cell: 64 positions inside an $8 \times 8$ image-cell plus a 65th "no-keypoint" dustbin class (§3.2). No learned upsampling: the 64 spatial classes are reshaped into the $8 \times 8$ block via depth-to-space ("pixel shuffle"), recovering the $H \times W$ resolution exactly. The dustbin lets the network express "no point in this cell" without inflating one of the 64 spatial bins.
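A minimal NumPy sketch of the decoding step described above: softmax over the 65 channels, drop the dustbin, then depth-to-space. The function name `decode_scores` is hypothetical, and the row-major ordering of the 64 channels inside each cell is an assumption about the channel layout, not something the text specifies.

```python
import numpy as np


def decode_scores(logits, cell=8):
    """Turn raw detector output (Hc, Wc, cell*cell + 1) into an (Hc*cell, Wc*cell) score map."""
    hc, wc, channels = logits.shape
    if channels != cell * cell + 1:
        raise ValueError("expected cell*cell spatial classes plus one dustbin")
    # Numerically stable softmax over the channel axis, dustbin included.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Discard the dustbin channel after normalisation.
    probs = probs[..., :-1]
    # Depth-to-space: channel k is assumed to map to (k // cell, k % cell) inside the cell.
    probs = probs.reshape(hc, wc, cell, cell).transpose(0, 2, 1, 3)
    return probs.reshape(hc * cell, wc * cell)
```

Because the softmax is taken before the dustbin is dropped, a cell whose dustbin logit dominates ends up with near-zero scores at all 64 positions.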
Descriptor head. A semi-dense 256-D map at the encoder resolution $H/8 \times W/8$ is bicubic-upsampled to $H \times W$ and L2-normalised at every spatial position (§3.3). Inspired by UCN (Choy et al. 2016).
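In practice descriptors are only needed at the detected positions, so the dense upsampling can be replaced by interpolating the coarse map at each keypoint. The sketch below does this with bilinear interpolation, which is a deliberate simplification of the bicubic upsampling described above; the function name `sample_descriptors` and the cell-centre coordinate alignment are assumptions made for illustration.

```python
import numpy as np


def sample_descriptors(coarse, keypoints, cell=8):
    """Sample unit-length descriptors from a coarse (Hc, Wc, D) map.

    keypoints: (N, 2) sequence of (row, col) pixel coordinates in the full image.
    Uses bilinear interpolation (a stand-in for bicubic) and L2-normalises each row.
    """
    hc, wc, _ = coarse.shape
    # Map pixel centres to coarse-grid coordinates (assumed alignment: cell centres).
    pts = (np.asarray(keypoints, dtype=np.float64) + 0.5) / cell - 0.5
    r = np.clip(pts[:, 0], 0, hc - 1)
    c = np.clip(pts[:, 1], 0, wc - 1)
    r0 = np.floor(r).astype(int)
    c0 = np.floor(c).astype(int)
    r1 = np.minimum(r0 + 1, hc - 1)
    c1 = np.minimum(c0 + 1, wc - 1)
    fr = (r - r0)[:, None]
    fc = (c - c0)[:, None]
    # Interpolate along columns first, then along rows.
    top = coarse[r0, c0] * (1 - fc) + coarse[r0, c1] * fc
    bottom = coarse[r1, c0] * (1 - fc) + coarse[r1, c1] * fc
    desc = top * (1 - fr) + bottom * fr
    norms = np.linalg.norm(desc, axis=1, keepdims=True)
    return desc / np.maximum(norms, 1e-12)
```

With unit-length descriptors, the dot product between two rows is their cosine similarity, which is what nearest-neighbour matching typically compares.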
```mermaid
flowchart TB
X["input<br/>H×W, gray"] --> E1["VGG encoder<br/>8 × conv 3×3<br/>3 × maxpool 2×2"]
E1 --> Hc["H/8 × W/8 × 128"]
Hc --> Dh["detector head<br/>3×3 conv → 1×1 conv<br/>65 channels"]
Hc --> Dd["descriptor head<br/>3×3 conv → 1×1 conv<br/>256 channels"]
Dh --> Sm["cell softmax + dustbin<br/>pixel-shuffle"]
Sm --> Out1["score map<br/>H × W"]
Dd --> Bi["bicubic upsample<br/>+ L2 normalise"]
Bi --> Out2["descriptor map<br/>H × W × 256"]
```
Homographic Adaptation. A self-supervised procedure for generating pseudo-ground-truth keypoint labels on real images. Apply $N_h$ random homographies $\mathcal{H}_i$ to an unlabelled image $I$, run the current detector $f_\theta$ on each warped copy, back-project detections through the inverse homography, and aggregate (Eq. 10):

$$\hat{F}(I; f_\theta) = \frac{1}{N_h} \sum_{i=1}^{N_h} \mathcal{H}_i^{-1} f_\theta\big(\mathcal{H}_i(I)\big)$$

The aggregated heatmap promotes points consistently detected across many viewpoints; isolated false positives wash out. Repeatability saturates at $N_h \approx 100$, with diminishing returns beyond that (§5.2).
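The warp, detect, un-warp, average loop described above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions: warping uses nearest-neighbour lookup, pixels that fall outside the frame contribute zero, `detector` is any caller-supplied function from image to response map, and all function names are hypothetical.

```python
import numpy as np


def apply_homography(H, points):
    """Map (N, 2) points given as (x, y) through the 3x3 homography H."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def warp_image(image, H):
    """Nearest-neighbour warp: output(p) = image(H^-1 p); out-of-frame pixels are zero."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    src = np.rint(apply_homography(np.linalg.inv(H), dst)).astype(int)
    valid = (src[:, 0] >= 0) & (src[:, 0] < w) & (src[:, 1] >= 0) & (src[:, 1] < h)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[src[valid, 1], src[valid, 0]]
    return out.reshape(h, w)


def homographic_adaptation(image, detector, homographies):
    """Average the detector responses after mapping each back to the original frame."""
    accumulated = np.zeros(image.shape, dtype=np.float64)
    for H in homographies:
        response = detector(warp_image(image, H))               # detect on the warped copy
        accumulated += warp_image(response, np.linalg.inv(H))   # undo the warp
    return accumulated / len(homographies)
```

A point that the detector fires on in every warped view keeps a response near 1 after averaging, while a detection that appears in only one view is divided by the number of homographies.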
Training. Two stages. (1) MagicPoint: train the encoder + detector head on Synthetic Shapes (rendered triangles, quadrilaterals, lines, ellipses, cubes, checkerboards, stars) with full supervision, since corner ground truth is unambiguous on these primitives (§4). Training runs for 200,000 iterations on data rendered on the fly. (2) SuperPoint: apply MagicPoint to MS-COCO 2014 (80k grayscale images at $240 \times 320$) under Homographic Adaptation with $N_h = 100$ to generate pseudo-labels, then jointly train both heads on image pairs related by random homographies (§6). Combined loss (Eq. 1):

$$\mathcal{L}(X, X', D, D'; Y, Y', S) = \mathcal{L}_p(X, Y) + \mathcal{L}_p(X', Y') + \lambda \, \mathcal{L}_d(D, D', S)$$

with $\lambda = 0.0001$. The detector loss $\mathcal{L}_p$ is a cell-wise cross-entropy over the 65 classes (Eq. 2–3); the descriptor loss $\mathcal{L}_d$ is a hinge loss on cell-pair correspondences induced by the homography (Eq. 5–6) with margins $m_p = 1$, $m_n = 0.2$ and class-balance weight $\lambda_d = 250$. Optimiser: Adam, lr $10^{-3}$, batch 32. Augmentation: Gaussian noise, motion blur, brightness changes.
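The two loss terms and their combination can be sketched in NumPy as below. The function names, array layouts, and the plain averaging over cells and cell pairs are assumptions made for illustration; margins and weights are left as arguments rather than fixed in the code.

```python
import numpy as np


def detector_loss(logits, labels):
    """Cell-wise cross-entropy. logits: (Hc, Wc, 65); labels: (Hc, Wc) ints in [0, 64]."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()


def descriptor_loss(desc_a, desc_b, corr, m_p, m_n, lambda_d):
    """Hinge loss over all cell pairs.

    desc_a, desc_b: (Hc, Wc, D) descriptor maps of the two images.
    corr: (Hc, Wc, Hc, Wc) array, 1 where two cells correspond under the homography, else 0.
    """
    dots = np.einsum("ijd,kld->ijkl", desc_a, desc_b)
    positive = lambda_d * corr * np.maximum(0.0, m_p - dots)
    negative = (1.0 - corr) * np.maximum(0.0, dots - m_n)
    return (positive + negative).mean()


def total_loss(lp_a, lp_b, ld, lam):
    """Sum of the two detector losses plus the weighted descriptor loss."""
    return lp_a + lp_b + lam * ld
```

The `lambda_d` factor scales up the positive term because corresponding cell pairs are far rarer than non-corresponding ones in the full pairwise grid.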
Complexity. ~1.3M trainable parameters (an estimate from the eight VGG-style conv layers plus the two decoder heads; not stated explicitly in the paper). Inference: ~11 ms per image on a Titan X GPU at $480 \times 640$, reported as 70 FPS (§7.1). Descriptor sampling at detected points: ~1.5 ms on CPU. No FLOPs figure reported.
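The parameter estimate can be reproduced by counting convolution weights and biases for the layer widths listed under Blocks. The sketch below assumes $3 \times 3$ encoder convolutions, a $3 \times 3$ then $1 \times 1$ convolution in each head, a single-channel input, and no BatchNorm parameters in the count; the function names are illustrative.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


ENCODER_WIDTHS = [64, 64, 64, 64, 128, 128, 128, 128]


def count_parameters():
    total, c_in = 0, 1  # grayscale input has one channel
    for c_out in ENCODER_WIDTHS:
        total += conv_params(c_in, c_out, 3)
        c_in = c_out
    # Detector head ends in 65 channels, descriptor head in 256.
    for head_out in (65, 256):
        total += conv_params(c_in, 256, 3) + conv_params(256, head_out, 1)
    return total


print(count_parameters())  # 1300865
```

Under these assumptions the count comes to about 1.3M, with the two heads contributing slightly more parameters than the shared encoder.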
Implementations
The official Magic Leap PyTorch release ships pretrained weights (superpoint_v1.pth) and an inference notebook; the LICENSE at the pinned commit restricts use to noncommercial academic research — see Limitations.
Assessment
Novelty.
- First system to jointly detect keypoints and compute descriptors in a single fully-convolutional forward pass over full images (not patches), with no human keypoint annotations and no SfM supervision (contrast: LIFT requires SfM-derived labels; SIFT uses handcrafted DoG; ORB uses FAST + steered BRIEF).
- Homographic Adaptation as a multi-homography aggregation procedure for self-supervised label generation from any unlabelled image collection (Eq. 10). The recipe is general — not tied to any specific architecture.
- 65-class cell softmax with dustbin as a parameter-free upsampling scheme that avoids the checkerboard artefacts of learned deconvolution (§3.2).
Strengths.
- Descriptor discriminability under illumination change: NN mAP 0.821 on HPatches vs SIFT 0.694, LIFT 0.664, ORB 0.735 (Table 4).
- Real-time on GPU: 70 FPS at $480 \times 640$ on Titan X (§7.1), a per-image latency well below SIFT's CPU runtime.
- Self-supervised — re-training on a new domain requires only unlabelled images and a re-run of Homographic Adaptation; no manual annotation.
- Foundational influence: the two-head (shared encoder + 65-class detector + descriptor) design is directly reused by XFeat and inspired R2D2, DISK, and ALIKE.
Limitations.
- No subpixel localisation. MLE 1.158 px on HPatches vs SIFT 0.833 px (Table 4). The cell-aligned softmax output has no subpixel correction; downstream tasks requiring sub-pixel accuracy must add a refinement step (Hessian saddle, mean-shift) at the detected positions.
- Not rotation-invariant. Fails on extreme in-plane rotation outside the training homography distribution (§7.3, Figure 8 caption: "failure case … due to extreme in-plane rotation not seen in the training examples"). ORB's steered BRIEF handles this regime better.
- Not scale-invariant by construction. No scale-space, no orientation normalisation. Scale coverage is empirical, limited to the training homography range.
- Supervision from homographies only. Generalisation to non-planar parallax scenes is unverified by the paper's HPatches-centric evaluation protocol.
- Keypoint density ceiling. The cell design caps effective keypoint density at one point per 64 pixels; at $480 \times 640$ the theoretical maximum is 4800 points (paper evaluates at ).
- Restrictive code and weights license. The official Magic Leap repository's LICENSE is a custom "academic or non-profit organization noncommercial research use only" agreement — the pretrained weights inherit the same restriction. Commercial deployment requires either a separate licensing agreement with Magic Leap or retraining from scratch on a redistributable dataset.
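The density ceiling listed among the limitations above is simple arithmetic: one keypoint per cell at most, so the bound is the number of cells. A small check, assuming $8 \times 8$ cells and a $480 \times 640$ input (the resolution consistent with the 4800-point figure); `max_keypoints` is a hypothetical helper name.

```python
def max_keypoints(height, width, cell=8):
    """Upper bound on detections if each cell contributes at most one keypoint."""
    return (height // cell) * (width // cell)


print(max_keypoints(480, 640))  # 4800
```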
When to choose SuperPoint over XFeat
XFeat (Potje 2024) is the direct architectural successor to SuperPoint, inheriting the two-head (detector + descriptor) shared-encoder design and the 65-class cell-softmax-with-dustbin output convention. The structural departure: XFeat's keypoint head operates on unfolded raw-pixel blocks rather than the deep encoder output — a radical simplification motivated by inference on hardware-constrained devices.
| | SuperPoint (2018) | XFeat (2024) |
|---|---|---|
| Encoder | VGG-style, 8 conv layers, ~1.3M params | Featherweight, 6 basic blocks, channel widths {4, 8, 24, 64, 64, 128} |
| Keypoint head | shared encoder output | unfolded raw pixel blocks (parallel branch) |
| Descriptor dim | 256 | 64 |
| Inference | 70 FPS on Titan X GPU at $480 \times 640$ | 27 FPS on i5-1135G7 CPU at VGA ($640 \times 480$) |
| Training data | MS-COCO + Synthetic Shapes (self-supervised) | MegaDepth + synthetic COCO warps (6) |
| HPatches MLE | 1.158 px (Table 4) | similar regime — both lack subpixel refinement |
| Match refinement | not built in | MLP head on coarse-NN descriptor pair (§3.2) |
Choose SuperPoint when (1) you have GPU inference and want maximum descriptor capacity (256-D vs 64-D matters for very large keypoint corpora and for re-ID-style retrieval); (2) you specifically need the Homographic Adaptation pipeline for self-supervised retraining on a new domain — XFeat uses MegaDepth supervision and does not provide a self-supervised retrain recipe; (3) you need the foundational reference for downstream comparisons (R2D2, DISK, ALIKE all benchmark against SuperPoint, not XFeat). Choose XFeat when CPU inference is the requirement — XFeat is purpose-built for hardware-constrained deployment (Orange Pi Zero 3, mobile robots) where SuperPoint's encoder is impractical.
References
- D. DeTone, T. Malisiewicz, A. Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. CVPR Workshops, 2018. arXiv:1712.07629
- G. Potje, F. Cadar, A. Araújo, R. Martins, E. R. Nascimento. XFeat: Accelerated Features for Lightweight Image Matching. IEEE CVPR, 2024. (Direct architectural successor.)
- E. Rosten, T. Drummond. Machine Learning for High-Speed Corner Detection. ECCV, 2006. (FAST — the prior art that "cast high-speed corner detection as a machine learning problem.")
- K. M. Yi, E. Trulls, V. Lepetit, P. Fua. LIFT: Learned Invariant Feature Transform. ECCV, 2016. (Supervised SfM-trained competitor; superseded on HPatches.)