Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.

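As a rough illustration of the idea in the abstract, the sketch below interleaves iterative magnitude pruning with training under fake-quantized weights in plain PyTorch. It is not the authors' implementation: the layer widths, the 6-bit uniform quantizer, the pruning schedule, and the random stand-in data are all placeholder assumptions chosen only to make the example self-contained.

```python
# Minimal sketch of quantization-aware pruning (QAP): magnitude pruning applied
# iteratively while the network trains with fake-quantized weights.
# All hyperparameters below (bit width, layer sizes, schedule) are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

BITS = 6  # assumed weight precision for this sketch


class FakeQuantSTE(torch.autograd.Function):
    """Uniform symmetric fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # gradients pass straight through the rounding


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""

    def forward(self, x):
        return F.linear(x, FakeQuantSTE.apply(self.weight, BITS), self.bias)


# Toy fully connected classifier; the sizes are placeholders, not the paper's benchmark model.
model = nn.Sequential(
    QuantLinear(16, 64), nn.BatchNorm1d(64), nn.ReLU(),
    QuantLinear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
    QuantLinear(32, 5),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for pruning_round in range(4):
    # Fine-tune with quantization in the loop (random data stands in for a real dataset).
    for step in range(100):
        x, y = torch.randn(256, 16), torch.randint(0, 5, (256,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    # Remove 20% of the *remaining* weights in each quantized layer by magnitude;
    # the pruning mask persists through the next fine-tuning round.
    for layer in model:
        if isinstance(layer, QuantLinear):
            prune.l1_unstructured(layer, name="weight", amount=0.2)
```

The essential point is that the pruning masks are chosen from weights that have already adapted to the quantized forward pass, which is what distinguishes quantization-aware pruning from pruning a full-precision model and quantizing it afterwards.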

Bibliographic Details
Main Authors: Benjamin Hawks (Fermi National Accelerator Laboratory, Batavia, IL, United States), Javier Duarte (University of California San Diego, La Jolla, CA, United States), Nicholas J. Fraser (Xilinx Research, Dublin, Ireland), Alessandro Pappalardo (Xilinx Research, Dublin, Ireland), Nhan Tran (Fermi National Accelerator Laboratory, Batavia, IL, United States; Northwestern University, Evanston, IL, United States), Yaman Umuroglu (Xilinx Research, Dublin, Ireland)
Format: Article
Language: English
Published: Frontiers Media S.A., 2021-07-01
Series: Frontiers in Artificial Intelligence
ISSN: 2624-8212
DOI: 10.3389/frai.2021.676564
Subjects: pruning, quantization, neural networks, generalizability, regularization, batch normalization
Online Access: https://www.frontiersin.org/articles/10.3389/frai.2021.676564/full