Motivation
Takes an RGB image (224×224) as input and produces 1000-way ImageNet class logits. The architecture is discovered by platform-aware neural architecture search (NAS) that maximizes a multi-objective reward combining top-1 accuracy and directly measured on-device inference latency, rather than hand-designing the network or approximating latency with FLOPs. The search operates over a factorized hierarchical space of MBConv-based blocks, producing architectures that are Pareto-near-optimal for a specified latency target on a given mobile CPU.
Architecture
Family & shape. Searched CNN backbone; input RGB at 224×224; output 1000-way logits. The two primary released models are MnasNet-A1 (search space includes squeeze-and-excitation) and MnasNet-B1 (baseline without SE).
Blocks. The building block is the mobile inverted-bottleneck convolution (MBConv) inherited from MobileNetV2. Rather than stacking identical repeated cells (NASNet-style), MnasNet uses a factorized hierarchical search space (Sec. 4.1, Figure 4): the network is partitioned into predefined blocks with fixed input resolutions and filter-size schedules. Within each block , the search independently selects ConvOp {conv, dconv, MBConv}, KernelSize {3, 5}, SERatio {0, 0.25}, SkipOp {pool, identity, none}, output filter size , and layer count ; all layers within a block share one architecture. The total search space is approximately , compared with for a flat per-layer approach. Critically, latency is measured by running each sampled model on the single-thread big CPU core of a Pixel 1 phone (batch size 1) — not approximated by FLOPs, which the paper shows is an unreliable proxy: MobileNetV1 (575M MAdds, 113ms) and NASNet (564M MAdds, 183ms) have nearly identical FLOPs but a 1.6× latency difference (Table 1, Sec. 1).
Training. An RNN controller maps each architecture to a token sequence and is trained with Proximal Policy Optimization (PPO) to maximize the expected multi-objective reward over sampled architectures. The reward embeds latency directly into the objective:
Maximize accuracy and on-device latency jointly via a weighted product that approximates the Pareto frontier in a single search.
with , calibrated so that doubling latency trades for approximately 5% relative accuracy gain (Sec. 3, Eq. 2–3).
Approximately 8K models are sampled per search; the top 15 are transferred to full ImageNet training, and 1 is transferred to COCO. Each full search takes 4.5 days on 64 TPUv2 devices. Headline result: MnasNet-A1 achieves 75.2% top-1 at 78ms on a Pixel phone (Table 1).
Complexity. MnasNet-A1: ~3.9M parameters, ~312M MAdds; ~1.8× faster than MobileNetV2 at 0.5% higher accuracy, and ~2.3× faster than NASNet-A at 1.2% higher accuracy (Table 1, Sec. 1).
Implementations
Authors' TPU/TensorFlow release; torchvision provides the de-facto PyTorch port.
Assessment
Novelty.
- Embeds direct real-device latency in the NAS reward in place of FLOPs proxies used by NASNet and DARTS; the paper demonstrates the proxy failure empirically (MobileNetV1 vs. NASNet, Table 1).
- Introduces the factorized hierarchical search space — independently searching each predefined block rather than repeating a single cell — enabling per-block layer diversity.
- Uses the MBConv block from MobileNetV2 as the primary search primitive, extending it with optional squeeze-and-excitation (SERatio {0, 0.25}).
Strengths.
- MnasNet-A1 achieves 75.2% top-1 / 92.5% top-5 at 78ms on Pixel 1 with 3.9M parameters and 312M MAdds (Table 1).
- 1.8× faster than MobileNetV2 at 0.5% higher top-1 accuracy; 2.3× faster than NASNet-A at 1.2% higher accuracy (Table 1, Sec. 1).
- On COCO object detection (MnasNet-A1 + SSDLite): 23.0 mAP at 203ms with 4.9M parameters and 0.8B MAdds — comparable to SSD300 (23.2 mAP) at far fewer MAdds (Table 3).
- SE ablation (Table 2): removing SE moves MnasNet-A1 (75.2%) to MnasNet-B1 (74.5% at 77ms), confirming the search-space contribution.
Limitations.
- The latency reward is device-specific: an architecture optimized for the Pixel 1 ARM big core may not be Pareto-optimal on GPU, DSP, or newer ARM targets; the search must be re-run per hardware class.
- Search cost is high: ~8K model evaluations over 4.5 days on 64 TPUv2 devices; not reproducible without substantial TPU and device infrastructure.
- The 7-block macro-skeleton with fixed stride positions is a hard prior; the search cannot discover architectures requiring a different block partition.
- The reward exponent encodes a fixed accuracy-latency preference; different application requirements require recalibration, and the two regimes produce qualitatively different Pareto curves (Figure 3).
References
- M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, Q. V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. CVPR, 2019. arXiv.11626
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR, 2018. arXiv.04381
- A. Howard, M. Sandler, et al. Searching for MobileNetV3. ICCV, 2019. arXiv.02244
flowchart LR
C["RNN controller<br/>samples architecture"] --> T["Proxy-train<br/>ImageNet"]
T --> L["Measure latency<br/>on Pixel phone"]
L --> R["Multi-objective reward<br/>ACC × (LAT/T)^w"]
R --> U["PPO update<br/>controller weights"]
U --> C