Design of an Inference Accelerator for Compressed Convolution Neural Networks


Bibliographic Details
Main Authors: Ho, Pin-Hui, 何品蕙
Other Authors: Huang, Chih-Tsun
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/bxwfdn
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 106 === State-of-the-art convolutional neural networks (CNNs) are widely used in intelligent applications such as AI systems, natural language processing, and image recognition. The large and growing computation latency of high-dimensional convolution has become a critical issue. Most of the multiplications in the convolution layers are ineffectual because one or both of the operands are zero. By pruning the redundant connections in the CNN model and clamping negative features to zero with the rectified linear unit (ReLU), the input data of the convolution layers achieve higher sparsity. This high data sparsity leaves substantial room for improvement by compressing the data and skipping the ineffectual multiplications. In this thesis, we propose a CNN accelerator design that accelerates the convolution layers by performing sparse matrix multiplications and reducing the amount of off-chip data transfer. The state-of-the-art Eyeriss architecture is an energy-efficient, reconfigurable CNN accelerator with a specialized dataflow, a multi-level memory hierarchy, and limited hardware resources. Building on the Eyeriss architecture, our approach performs sparse matrix multiplications effectively. With pruned and compressed kernel data, and dynamic encoding of the input features into the compressed sparse row (CSR) format, our accelerator reduces a significant amount of off-chip data transfer, minimizing memory accesses and skipping ineffectual multiplications. One disadvantage of the compression scheme is that the workload becomes dynamically imbalanced after compression, which lowers data reusability and increases off-chip data transfer. To analyze the data transfer further, we explore the relationship between the on-chip buffer size and the off-chip data transfer: our design must re-access off-chip data when the on-chip buffer cannot hold all the output features. As a result, by significantly reducing computation and memory accesses, our accelerator still achieves better performance. With AlexNet, a popular CNN model with 35% input data sparsity and 89.5% kernel data sparsity, as the benchmark, our accelerator achieves a 1.12X speedup compared with Eyeriss and a 3.42X speedup compared with our baseline architecture.
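The core idea, encoding the sparse input features in CSR form and multiplying only the stored non-zeros, can be illustrated with a small software sketch. The Python snippet below is an illustrative model only, not the thesis's hardware implementation over the Eyeriss dataflow; the function names and the toy data are hypothetical. It encodes a dense feature map into CSR arrays and computes a row-by-column dot product that skips the ineffectual zero multiplications.

```python
import numpy as np

def encode_csr(dense):
    """Encode a dense 2-D feature map into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for col, v in enumerate(row):
            if v != 0:                      # store only non-zero features
                values.append(v)
                col_idx.append(col)
        row_ptr.append(len(values))         # running count of non-zeros per row
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def sparse_row_dot(values, col_idx, row_ptr, row, kernel_col):
    """Dot product of one CSR-encoded feature row with a dense kernel column.
    Only the stored non-zeros are multiplied; zero operands are skipped."""
    start, end = row_ptr[row], row_ptr[row + 1]
    return sum(values[i] * kernel_col[col_idx[i]] for i in range(start, end))

# Toy example: two sparse feature rows against one kernel column.
features = np.array([[0, 3, 0, 0, 5],
                     [1, 0, 0, 2, 0]])
kernel_col = np.array([2, 4, 1, 1, 3])

vals, cols, ptrs = encode_csr(features)
print(sparse_row_dot(vals, cols, ptrs, 0, kernel_col))   # 3*4 + 5*3 = 27
print(sparse_row_dot(vals, cols, ptrs, 1, kernel_col))   # 1*2 + 2*1 = 4
```

In this sketch, each feature row with 40% density requires only two multiply-accumulates instead of five, mirroring how the accelerator's compression of features and pruned kernels reduces both the arithmetic work and the volume of data moved on and off chip.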