ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
Master's thesis === National Tsing Hua University === Department of Computer Science === 107
Main Authors: | Hsu, Lien-Chih (徐連志) |
---|---|
Other Authors: | Chiu, Ching-Te (邱瀞德) |
Format: | Others |
Language: | en_US |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/859cgm |
id | ndltd-TW-107NTHU5392009 |
record_format | oai_dc |
spelling |
ndltd-TW-107NTHU5392009 2019-09-15T03:33:43Z http://ndltd.ncl.edu.tw/handle/859cgm ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator (Chinese title: 基於深度卷積神經網路之具功率意識位元序列串流加速器) Hsu, Lien-Chih (徐連志) === Advisor: Chiu, Ching-Te (邱瀞德) === Master's thesis, National Tsing Hua University, Department of Computer Science, 107 (2018), 67 pages === en_US |
collection | NDLTD |
language | en_US |
format | Others |
sources | NDLTD |
description |
Master's thesis === National Tsing Hua University === Department of Computer Science === 107 === Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in various visual recognition applications due to their extraordinary accuracy, which even surpasses that of humans. However, the high computational complexity and the massive amount of data storage are two challenges for CNN hardware design. Although GPUs can handle the high computational complexity, the large energy consumption caused by heavy external memory access has pushed researchers toward dedicated CNN accelerator designs. Generally, the precision of modern CNN accelerators is set to 16-bit fixed point. To reduce data storage, Sakr et al. [1] show that lower precision can be used under a constraint of at most 1% recognition accuracy degradation. Moreover, per-layer precision assignment can reach lower bit-width requirements than a uniform precision assignment across all layers.
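The per-layer precision idea can be illustrated with a small fixed-point quantization sketch. The bit-width choices, layer names, and weight shapes below are illustrative assumptions, not values from the thesis or from Sakr et al. [1]:

```python
import numpy as np

def quantize_fixed_point(weights, total_bits, frac_bits):
    """Round weights onto a signed fixed-point grid with the given total
    and fractional bit-widths, saturating values that overflow."""
    scale = 2.0 ** frac_bits
    q_min = -(2 ** (total_bits - 1))
    q_max = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(weights * scale), q_min, q_max)
    return q / scale

# Hypothetical per-layer bit-width assignment (illustrative values only):
# a uniform design would spend 16 bits everywhere, while a per-layer
# assignment gives each layer only the precision it needs.
rng = np.random.default_rng(0)
layer_weights = {"conv1": rng.normal(0, 0.1, (96, 363)),
                 "conv2": rng.normal(0, 0.05, (256, 2400))}
per_layer_bits = {"conv1": (8, 6), "conv2": (10, 8)}

for name, w in layer_weights.items():
    bits, frac = per_layer_bits[name]
    err = np.abs(w - quantize_fixed_point(w, bits, frac)).max()
    print(f"{name}: {bits}-bit fixed point, max rounding error {err:.4f}")
```

As long as no value saturates, the worst-case rounding error is half a grid step, i.e. 2^-(frac_bits+1), which is why layers with small weight magnitudes tolerate fewer total bits.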
In this thesis, we propose an energy-aware bit-serial streaming deep CNN accelerator to tackle the computational complexity, data storage, and external memory access issues. With the ring streaming dataflow and the output reuse strategy to decrease data access, the amount of external DRAM access for the convolutional layers is reduced by 357.26x on AlexNet compared to the case without output reuse. In addition, we optimize hardware utilization and avoid unnecessary computations through loop tiling and by mapping non-unit strides of convolutional layers onto unit strides, enhancing computational performance. Furthermore, the bit-serial processing element (PE) is designed to use fewer bits for the weights, which reduces both the amount of computation and the external memory access.
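The bit-serial idea — consuming one weight bit per cycle, so narrower weights finish in fewer cycles — can be sketched in a few lines. This is a generic shift-and-add model for unsigned weights, not the thesis's actual PE datapath:

```python
def bit_serial_mac(activations, weights, weight_bits):
    """Multiply-accumulate where each weight is consumed one bit per
    cycle (LSB first): partial products are built by shift-and-add,
    so a narrower weight directly means fewer cycles."""
    acc = 0
    for bit in range(weight_bits):          # one "cycle" per weight bit
        for a, w in zip(activations, weights):
            if (w >> bit) & 1:              # current weight bit
                acc += a << bit             # shift-and-add partial product
    return acc

acts = [3, 1, 4, 1]
wts  = [5, 9, 2, 6]                         # all weights fit in 4 bits
print(bit_serial_mac(acts, wts, 4))         # → 38, same as sum(a*w)
```

Reducing a weight from 16 bits to, say, 8 bits halves the cycle count of this loop, which mirrors how fewer weight bits cut both computation and the traffic needed to fetch the weights.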
We evaluate our design with the well-known roofline model, an efficient way to evaluate a design compared to a full hardware implementation. The design space is explored to find the solution with the best computational performance and communication-to-computation (CTC) ratio. Assuming the same FPGA as Chen et al. [2], we achieve a 1.36x speedup and reduce the energy consumption of external memory access by 41% compared to the design in [2].
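The roofline-based selection can be sketched as taking the minimum of the computational roof and the bandwidth roof scaled by each candidate's CTC ratio. The peak, bandwidth, and tiling numbers below are invented for illustration and are not the thesis's design points:

```python
def attainable_gops(peak_gops, bw_gbytes, ctc_ops_per_byte):
    """Roofline bound: a design point is either compute-bound at the
    roof, or memory-bound at bandwidth x ops-per-byte (the CTC ratio)."""
    return min(peak_gops, bw_gbytes * ctc_ops_per_byte)

# Hypothetical tiling candidates: pick the best attainable performance,
# breaking ties by the higher CTC ratio (less DRAM traffic per op).
peak, bw = 100.0, 4.5
tilings = {"tile_A": 8.0, "tile_B": 25.0, "tile_C": 40.0}
best = max(tilings, key=lambda t: (attainable_gops(peak, bw, tilings[t]),
                                   tilings[t]))
print(best)  # → tile_C
```

Here tile_B and tile_C both hit the 100-GOPS compute roof, but tile_C wins the tie because its higher CTC ratio means less external memory traffic for the same performance — the same two-step criterion described above.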
As for the hardware implementation of our PE-array architecture, the design reaches an operating frequency of 119 MHz and occupies 68k gates with a power consumption of 10.08 mW under TSMC 90 nm technology. Compared to the 15.4 MB of external memory access for Eyeriss [3] on the convolutional layers of AlexNet, our work needs only 4.36 MB, dramatically reducing the most energy-consuming part of the power consumption.
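As a quick sanity check of the traffic comparison, the two figures above imply roughly a 3.5x reduction. The per-bit DRAM energy used to translate that into an energy saving is a generic ballpark assumption, not a number from the thesis:

```python
# External DRAM traffic on AlexNet conv layers, from the abstract.
eyeriss_mb, essa_mb = 15.4, 4.36
reduction = eyeriss_mb / essa_mb          # ≈ 3.53x less traffic

# Assumed DRAM access energy (ballpark figure, not from the thesis).
pj_per_bit = 70.0
saved_mj = (eyeriss_mb - essa_mb) * 8e6 * pj_per_bit * 1e-9   # pJ → mJ
print(f"{reduction:.2f}x traffic reduction, "
      f"~{saved_mj:.1f} mJ saved per inference")
```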
|
author2 | Chiu, Ching-Te |
author | Hsu, Lien-Chih (徐連志) |
title | ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator |
publishDate | 2018 |
url | http://ndltd.ncl.edu.tw/handle/859cgm |