Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruni...
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Frontiers Media S.A.
2021-07-01
|
Series: | Frontiers in Artificial Intelligence |
Subjects: | |
Online Access: | https://www.frontiersin.org/articles/10.3389/frai.2021.676564/full |