ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
Master's thesis === National Tsing Hua University === Department of Computer Science === 107
Main Authors: | Hsu, Lien-Chih (徐連志) |
---|---|
Other Authors: | Chiu, Ching-Te (邱瀞德) |
Format: | Others |
Language: | en_US |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/859cgm |
id | ndltd-TW-107NTHU5392009 |
record_format | oai_dc |
spelling |
ndltd-TW-107NTHU5392009 2019-09-15T03:33:43Z http://ndltd.ncl.edu.tw/handle/859cgm ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator (Chinese title: 基於深度卷積神經網路之具功率意識位元序列串流加速器) Hsu, Lien-Chih (徐連志) === Advisor: Chiu, Ching-Te (邱瀞德) === Master's thesis, National Tsing Hua University, Department of Computer Science, 107 (2018), 67 pages === en_US |
collection | NDLTD |
language | en_US |
format | Others |
sources | NDLTD |
description |
Master's thesis === National Tsing Hua University === Department of Computer Science === 107 === Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in various visual recognition applications due to their extraordinary accuracy, which even surpasses that of humans. However, the high computational complexity and the massive amount of data storage are two challenges for CNN hardware design. Although GPUs can handle the high computational complexity, the large energy consumption caused by heavy external memory access has pushed researchers toward dedicated CNN accelerator designs. Generally, the precision of modern CNN accelerators is set to 16-bit fixed point. To reduce data storage, Sakr et al. [1] show that lower precision can be used under a constraint of at most 1% recognition accuracy degradation. Moreover, per-layer precision assignment can reach lower bit-width requirements than a uniform precision assignment across all layers.
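The per-layer precision idea can be illustrated with a small fixed-point quantization sketch. The bit-width choices, layer names, and weight shapes below are illustrative assumptions, not values from the thesis or from Sakr et al. [1]:

```python
import numpy as np

def quantize_fixed_point(weights, total_bits, frac_bits):
    """Round weights onto a signed fixed-point grid with the given total
    and fractional bit-widths, saturating values that overflow."""
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(weights * scale), q_min, q_max)
    return q / scale

# Hypothetical per-layer bit-width assignment (illustrative values only):
# a uniform design would spend 16 bits everywhere, while a per-layer
# assignment gives each layer only the precision it needs.
rng = np.random.default_rng(0)
layer_weights = {"conv1": rng.normal(0, 0.1, (96, 363)),
                 "conv2": rng.normal(0, 0.05, (256, 2400))}
per_layer_bits = {"conv1": (8, 6), "conv2": (10, 8)}

for name, w in layer_weights.items():
    bits, frac = per_layer_bits[name]
    err = np.abs(w - quantize_fixed_point(w, bits, frac)).max()
    print(f"{name}: {bits}-bit fixed point, max rounding error {err:.4f}")
```

As long as no value saturates, the worst-case rounding error is half a grid step, i.e. 2^-(frac_bits+1), which is why layers with small weight magnitudes tolerate fewer total bits.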
In this thesis, we propose an energy-aware bit-serial streaming deep CNN accelerator to tackle the computational complexity, data storage, and external memory access issues. With the ring streaming dataflow and the output reuse strategy to decrease data access, the amount of external DRAM access for the convolutional layers is reduced by 357.26x on AlexNet compared to the case without output reuse. In addition, we optimize hardware utilization and avoid unnecessary computations through loop tiling and by mapping non-unit strides of convolutional layers onto unit strides, enhancing computational performance. Furthermore, the bit-serial processing element (PE) is designed to use fewer bits for the weights, which reduces both the amount of computation and the external memory access.
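The bit-serial idea — consuming one weight bit per cycle, so narrower weights finish in fewer cycles — can be sketched in a few lines. This is a generic shift-and-add model for unsigned weights, not the thesis's actual PE datapath:

```python
def bit_serial_mac(activations, weights, weight_bits):
    """Multiply-accumulate where each weight is consumed one bit per
    cycle (LSB first): partial products are built by shift-and-add,
    so a narrower weight directly means fewer cycles."""
    acc = 0
    for bit in range(weight_bits):          # one "cycle" per weight bit
        for a, w in zip(activations, weights):
            if (w >> bit) & 1:              # current weight bit
                acc += a << bit             # shift-and-add partial product
    return acc

acts = [3, 1, 4, 1]
wts  = [5, 9, 2, 6]                         # all weights fit in 4 bits
print(bit_serial_mac(acts, wts, 4))         # → 38, same as sum(a*w)
```

Reducing a weight from 16 bits to, say, 8 bits halves the cycle count of this loop, which mirrors how fewer weight bits cut both computation and the traffic needed to fetch the weights.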
We evaluate our design with the well-known roofline model, an efficient way to evaluate a design compared to a full hardware implementation. The design space is explored to find the solution with the best computational performance and communication-to-computation (CTC) ratio. Assuming the same FPGA as Chen et al. [2], we achieve a 1.36x speedup and reduce the energy consumption of external memory access by 41% compared to the design in [2].
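The roofline-based selection can be sketched as taking the minimum of the computational roof and the bandwidth roof scaled by each candidate's CTC ratio. The peak, bandwidth, and tiling numbers below are invented for illustration and are not the thesis's design points:

```python
def attainable_gops(peak_gops, bw_gbytes, ctc_ops_per_byte):
    """Roofline bound: a design point is either compute-bound at the
    roof, or memory-bound at bandwidth x ops-per-byte (the CTC ratio)."""
    return min(peak_gops, bw_gbytes * ctc_ops_per_byte)

# Hypothetical tiling candidates: pick the best attainable performance,
# breaking ties by the higher CTC ratio (less DRAM traffic per op).
peak, bw = 100.0, 4.5
tilings = {"tile_A": 8.0, "tile_B": 25.0, "tile_C": 40.0}
best = max(tilings, key=lambda t: (attainable_gops(peak, bw, tilings[t]),
                                   tilings[t]))
print(best)  # → tile_C
```

Here tile_B and tile_C both hit the 100-GOPS compute roof, but tile_C wins the tie because its higher CTC ratio means less external memory traffic for the same performance — the same two-step criterion described above.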
As for the hardware implementation of our PE-array architecture, the design reaches an operating frequency of 119 MHz and occupies 68k gates with a power consumption of 10.08 mW under TSMC 90 nm technology. Compared to the 15.4 MB of external memory access for Eyeriss [3] on the convolutional layers of AlexNet, our work needs only 4.36 MB, dramatically reducing the most energy-consuming part of the power consumption.
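As a quick sanity check of the traffic comparison, the two figures above imply roughly a 3.5x reduction. The per-bit DRAM energy used to translate that into an energy saving is a generic ballpark assumption, not a number from the thesis:

```python
# External DRAM traffic on AlexNet conv layers, from the abstract.
eyeriss_mb, essa_mb = 15.4, 4.36
reduction = eyeriss_mb / essa_mb          # ≈ 3.53x less traffic

# Assumed DRAM access energy (ballpark figure, not from the thesis).
pj_per_bit = 70.0
saved_mj = (eyeriss_mb - essa_mb) * 8e6 * pj_per_bit * 1e-9   # pJ → mJ
print(f"{reduction:.2f}x traffic reduction, "
      f"~{saved_mj:.1f} mJ saved per inference")
```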
|
author2 | Chiu, Ching-Te |
author | Hsu, Lien-Chih (徐連志) |
title | ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator |
publishDate | 2018 |
url | http://ndltd.ncl.edu.tw/handle/859cgm |