Motivation
Extract sparse or semi-dense local correspondences between two images under a fixed CPU compute budget, in a single forward pass that produces keypoints, 64-D descriptors, and a per-feature reliability score. Input: grayscale image . Output: a keypoint heatmap , a descriptor field , and a reliability map ; at match time, an additional MLP refines coarse nearest-neighbour pairs to pixel-level offsets. The model is specific to an early-downsampling backbone with triple-rate channel progression and a keypoint head that is decoupled from the descriptor encoder, which together remove the resolution / depth trade-off that forces prior detectors either to share a heavy encoder (SuperPoint) or to evaluate high-resolution feature maps at match time (DISK, LoFTR).
Architecture
Family & shape. Three-headed convolutional encoder. Input: grayscale . Outputs: keypoint logits (64 intra-cell positions + dustbin), descriptors , reliability . A separate match-refinement MLP consumes concatenated nearest-neighbour descriptor pairs at inference time.
Blocks. Backbone is six basic blocks with channel progression and output resolutions respectively (§3.1, §B). A basic layer is a 2-D convolution with kernel , ReLU, and BatchNorm; the first basic layer of each block uses stride 2 for spatial halving. The defining choice is the triple-rate channel schedule — each spatial halving multiplies the channel count by roughly instead of VGG's (§3.1) — which starves the first two blocks of capacity where resolution is highest, and concentrates depth where the feature maps are small.
flowchart TB
I["H×W, 1"] --> B1["block1<br/>H/2×W/2, 4→8"]
B1 --> B2["block2<br/>H/4×W/4, 24"]
B2 --> B3["block3<br/>H/8×W/8, 64"]
B3 --> B4["block4<br/>H/16×W/16, 64"]
B4 --> B5["block5<br/>H/32×W/32, 128"]
B3 --> D["upsample+sum<br/>to H/8×W/8"]
B4 --> D
B5 --> D
D --> F["descriptor head<br/>F: H/8×W/8×64"]
D --> R["reliability head<br/>R: H/8×W/8"]
I --> KB["unfold 8×8<br/>→ H/8×W/8×64"]
KB --> KH["4× (1×1 conv)<br/>K: H/8×W/8×65"]
The three output heads share the encoder only for the descriptor and reliability branches (both feed from the fused multi-scale descriptor tensor). The keypoint branch is parallel and operates on a direct unfold of the 8×8 image grid into a 64-channel tensor, then applies four 1×1 convolutions (§3.2). This avoids the receptive-field conflict that a shared detector-descriptor encoder imposes on compact backbones, ablated in Table 5 row (iii).
Symmetric NLL of the diagonal of the descriptor similarity matrix under row-wise softmax in both directions, pulling matched features to mutual nearest neighbours.
At match time a refinement MLP converts a coarse descriptor pair into a pixel-level offset:
The MLP takes the two 64-D coarse descriptors, never high-resolution features — the whole refinement stage costs a single linear chain per matched pair (§3.2, Fig. 4).
Training. Data: MegaDepth + synthetically warped MS-COCO pairs in a 6
ratio, resized to (§B). Loss is a weighted sum (Eq. 7) of the dual-softmax descriptor loss above, an reliability loss supervised by the per-row maxima of the dual-softmax probabilities (Eq. 4), an NLL fine-offset loss on the offset logits (Eq. 5), and an NLL keypoint loss supervised by knowledge distillation from ALIKE-tiny's keypoint positions (Eq. 6). Optimiser: Adam, learning rate with exponential decay factor per gradient updates, batch size 10 image pairs, 160,000 iterations to convergence (§B). Reported results on Megadepth-1500 relative pose: / / at thresholds for XFeat sparse (4,096 keypoints) and / / for XFeat* semi-dense (10,000 keypoints) at and frames per second on an Intel i5-1135G7 CPU at VGA resolution (Table 1). HPatches homography MHA on the viewpoint split at -pixel thresholds: / / (Table 3).Complexity. Descriptors are 64-D at spatial resolution; 23 convolutional layers in the backbone (§B). The paper does not report a parameter count or FLOPs figure; the released PyTorch checkpoint at the pinned commit is on disk. Training peaked at VRAM on a single NVIDIA RTX 4090 (§B).
Implementations
Official PyTorch release with Apache-2.0 code and Apache-2.0 weights shipped in-tree at the pinned commit.
Assessment
Novelty.
- Replaces VGG's channel-doubling with a -per-halving schedule that starves the early (high-resolution) stages and concentrates depth at the small-resolution end (§3.1, ablation Table 5 row ii).
- Decouples keypoint detection from the descriptor encoder via a parallel -convolution branch on -unfolded raw pixel cells, contrasting SuperPoint's shared encoder and its ablation-confirmed degradation under a compact backbone (Table 5 row iii).
- Introduces a match-refinement MLP that consumes only coarse 64-D descriptor pairs, not high-resolution feature maps — contrast DISK's and LoFTR's refinement, which re-evaluate dense features at full resolution (§3.2, Fig. 4).
- Trains the keypoint head by distillation from ALIKE-tiny rather than from a hand-crafted ground truth, reusing the teacher's bias toward low-level corner/blob/line structures to match the receptive field of the small detector (Eq. 6).
Strengths.
- On Megadepth-1500, XFeat (sparse) reaches AUC@5° vs SuperPoint's at FPS vs FPS on the same i5-1135G7 CPU — throughput at matched descriptor dimensionality (Table 1).
- XFeat* (semi-dense) reaches AUC@5° vs DISK's at FPS vs FPS on the same CPU — throughput with AUC-point gap (Table 1).
- Generalises to indoor imagery: on ScanNet-1500, XFeat*'s AUC@5°/@10°/@20° / / outperforms SuperPoint's / / and DISK-10k's / / without retraining (Table 2).
- Official code and weights ship under Apache-2.0, covering commercial and redistribution use without explicit licensor permission.
Limitations.
- The keypoint head is trained by distillation from ALIKE-tiny (Eq. 6), so its recall ceiling is bounded by the teacher's bias toward low-level corner/blob structures — repetitive scenes without such structures are ablation-identified as a weak regime (Sec. 4.4).
- The paper reports neither parameter count nor FLOPs, and no inference-memory budget beyond the training footprint — exact deployment sizing requires measuring the checkpoint against the target device.
- Compact -D descriptors reach AUC@5° against DISK's on Megadepth-1500 (sparse setting, Table 1) — the capacity reduction costs absolute pose accuracy on wide-baseline pairs.
- Training used a -VRAM RTX 4090 for 36 hours on a 6 MegaDepth/synthetic-COCO mix; reproducing on consumer GPUs with less VRAM requires either gradient accumulation or a smaller batch, neither of which is evaluated in the paper.
- Compared with SuperPoint: see When to choose SuperPoint over XFeat on the SuperPoint page, which hosts the comparison per the older-paper-hosts rule.