ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator

Bibliographic Details
Main Authors: Hsu, Lien-Chih, 徐連志
Other Authors: Chiu, Ching-Te
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/859cgm
id ndltd-TW-107NTHU5392009
record_format oai_dc
spelling ndltd-TW-107NTHU53920092019-09-15T03:33:43Z http://ndltd.ncl.edu.tw/handle/859cgm ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator 基於深度卷積神經網路之具功率意識位元序列串流加速器 Hsu, Lien-Chih 徐連志 Chiu, Ching-Te 邱瀞德 2018 學位論文 ; thesis 67 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Tsing Hua University === Department of Computer Science === 107 === Over the past decade, deep convolutional neural networks (CNNs) have been widely embraced in various visual recognition applications due to their extraordinary accuracy, which even surpasses that of human beings. However, high computational complexity and massive data storage requirements are two challenges for CNN hardware design. Although GPUs can handle the high computational complexity, the large energy consumption caused by heavy external memory access has pushed researchers toward dedicated CNN accelerator designs. The precision of modern CNN accelerators is generally set to 16-bit fixed-point. To reduce data storage, Sakr et al. [1] show that lower precision can be used under a constraint of no more than 1% recognition accuracy degradation; moreover, per-layer precision assignment achieves lower bit-width requirements than a uniform precision across all layers. In this paper, we propose an energy-aware bit-serial streaming deep CNN accelerator that tackles the computational complexity, data storage, and external memory access issues. With the ring streaming dataflow and the output reuse strategy to decrease data access, external DRAM access for the convolutional layers of AlexNet is reduced by 357.26x compared to the case without output reuse. In addition, we optimize hardware utilization and avoid unnecessary computation through loop tiling and by mapping strided convolutional layers to unit-stride ones, which improves computational performance. Furthermore, the bit-serial processing element (PE) is designed to exploit reduced weight bit-widths, which lowers both the amount of computation and the external memory access. We evaluate our design with the well-known roofline model, an efficient alternative to evaluation on real hardware, and explore the design space for the solution with the best computational performance and communication-to-computation (CTC) ratio. Assuming the same FPGA as Chen et al. [2], we reach a 1.36x speedup and reduce the energy consumed by external memory access by 41% compared to the design in [2]. As for the hardware implementation of our PE-array architecture, it reaches an operating frequency of 119 MHz and occupies 68 k gates with a power consumption of 10.08 mW in TSMC 90 nm technology. Compared to the 15.4 MB of external memory access required by Eyeriss [3] on the convolutional layers of AlexNet, our work needs only 4.36 MB, dramatically reducing the most energy-consuming part of the system.
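The output-reuse idea in the abstract can be pictured with a toy tiled convolution loop: partial sums for one output tile accumulate in a local buffer across every input channel and kernel tap, so each output value crosses the DRAM boundary once. This is a minimal illustrative sketch only, not the thesis's dataflow or RTL; the function name conv_tiled, the NumPy formulation, and the tile size are assumptions for exposition.

    import numpy as np

    def conv_tiled(x, w, tile=8):
        # x: input feature map (C, H, W); w: weights (M, C, K, K); valid convolution.
        C, H, W = x.shape
        M, _, K, _ = w.shape
        out = np.zeros((M, H - K + 1, W - K + 1))
        for m in range(M):
            for r0 in range(0, out.shape[1], tile):
                rows = min(tile, out.shape[1] - r0)
                buf = np.zeros((rows, out.shape[2]))      # on-chip output tile buffer
                for c in range(C):                        # accumulate all input channels in place
                    for dr in range(K):
                        for dc in range(K):
                            for r in range(rows):
                                buf[r] += w[m, c, dr, dc] * x[c, r0 + r + dr, dc:dc + out.shape[2]]
                out[m, r0:r0 + rows] = buf                # one DRAM write-back per output tile
        return out

Without output reuse, the partial sums held in buf would be spilled to and re-fetched from external memory once per input channel; that repeated partial-sum traffic is roughly what the reported 357.26x reduction eliminates.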
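The payoff of the bit-serial PE follows from how a shift-add multiplier consumes the weight one bit per cycle, so a layer quantized to fewer weight bits finishes in proportionally fewer cycles. Below is a minimal functional sketch assuming unsigned weights; the actual PE's signed arithmetic, pipelining, and streaming interface are not modeled, and bit_serial_mac is a hypothetical name.

    def bit_serial_mac(acc, activation, weight, bits):
        # Accumulate activation * weight into acc, one weight bit (LSB first) per cycle.
        for b in range(bits):
            if (weight >> b) & 1:
                acc += activation << b   # add the shifted partial product
        return acc

    assert bit_serial_mac(0, 13, 11, 4) == 13 * 11  # a 4-bit weight costs 4 cycles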
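Likewise, the roofline evaluation the abstract mentions reduces to one formula: attainable throughput is the lesser of the platform's peak compute rate and the product of its DRAM bandwidth with the design point's CTC ratio (operations per byte of external traffic). The sketch below uses placeholder numbers, not values from the thesis or from [2].

    def attainable_throughput(peak_gops, dram_bw_gbs, ctc_ops_per_byte):
        # Roofline model: a design point is either compute-bound or bandwidth-bound.
        return min(peak_gops, dram_bw_gbs * ctc_ops_per_byte)

    print(attainable_throughput(100.0, 4.5, 10.0))  # 45.0 GOP/s, bandwidth-bound
    print(attainable_throughput(100.0, 4.5, 40.0))  # 100.0 GOP/s, compute-bound

Raising the CTC ratio (e.g., via output reuse) slides a design point up the bandwidth roof until the compute roof caps it, which is how the design-space exploration picks the best configuration.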
author2 Chiu, Ching-Te
author_facet Chiu, Ching-Te
Hsu, Lien-Chih
徐連志
author Hsu, Lien-Chih
徐連志
spellingShingle Hsu, Lien-Chih
徐連志
ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
author_sort Hsu, Lien-Chih
title ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
title_short ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
title_full ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
title_fullStr ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
title_full_unstemmed ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator
title_sort essa: an energy-aware bit-serial streaming deep convolutional neural network accelerator
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/859cgm
work_keys_str_mv AT hsulienchih essaanenergyawarebitserialstreamingdeepconvolutionalneuralnetworkaccelerator
AT xúliánzhì essaanenergyawarebitserialstreamingdeepconvolutionalneuralnetworkaccelerator
AT hsulienchih jīyúshēndùjuǎnjīshénjīngwǎnglùzhījùgōnglǜyìshíwèiyuánxùlièchuànliújiāsùqì
AT xúliánzhì jīyúshēndùjuǎnjīshénjīngwǎnglùzhījùgōnglǜyìshíwèiyuánxùlièchuànliújiāsùqì
_version_ 1719250593295368192