One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search
Bingqian Lu, Jianyi Yang, Weiwen Jiang, Yiyu Shi, Shaolei Ren, Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 5, no. 3, Dec, 2021. (SIGMETRICS 2022)
@article{
luOneProxy2021,
title={One Proxy Device Is Enough for Hardware-Aware Neural Architecture Search},
author={Bingqian Lu and Jianyi Yang and Weiwen Jiang and Yiyu Shi and Shaolei Ren},
journal = {Proceedings of the ACM on Measurement and Analysis of Computing Systems},
month = Dec,
year = 2021,
volume = {5},
number = {3},
articleno = {34},
numpages = {35},
}
Given N target devices, our OneProxy approach can keep the total neural architecture search cost at O(1).
CNNs are used in numerous real-world applications such as vision-based autonomous driving and video content analysis. To run CNN inference on various target devices, hardware-aware neural architecture search (NAS) is crucial. A key requirement of efficient hardware-aware NAS is the fast evaluation of inference latencies in order to rank different architectures. While building a latency predictor for each target device has been commonly used in state of the art, this is a very time-consuming process, lacking scalability in the presence of extremely diverse devices.
Left: NAS without a supernet. Right: One-shot NAS with a supernet.
Cost Comparison of Hardware-aware NAS Algorithms for 𝑛 Target Devices.
We address the scalability challenge by exploiting latency monotonicity — the architecture latency rankings on different devices are often correlated. When strong latency monotonicity exists, we can re-use architectures searched for one proxy device on new target devices, without losing optimality.
To quantify the degree of latency monotonicity, we use the metric of Spearman’s Rank Correlation Coefficient (SRCC), which lies between -1 and 1 and assesses statistical dependence between the rankings of two variables using a monotonic function.
SRCC of 10k sampled models latencies in MobileNet-V2 space on different pairs of mobile and non-mobile devices.
We exploit the correlation among devices and propose efficient transfer learning to boost the otherwise possibly weak latency monotonicity for a target device.
In the MobileNet-V2 space, with S5e as default proxy device
In the NAS-Bench-201 search space on CIFAR-10 (left), CIFAR-100 (middle) and ImageNet16-120 (right) datasets, with Pixel3 as our proxy device
In the FBNet search spaces on CIFAR-100 (left) and ImageNet16-120 (right) datasets, with Pixel3 as our proxy device
SRCC for various devices in the NAS-Bench-201 search space with latencies collected from [19, 29, 49, 50]
Results for non-mobile target devices with the default S5e proxy and AdaProxy. The top row shows the evolutionary search results with real measured accuracies, and the bottom row shows the exhaustive search results based on 10k random architectures (in the MobileNet-V2 space) and predicted accuracies.
Exhaustive search results for different target devices on NAS-Bench-201 architectures (CIFAR-10 dataset). Pixel3 is the proxy.
HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark
Eagle: Efficient and Agile Performance Estimator and Dataset
Once for All: Train One Network and Specialize it for Efficient Deployment