An Energy-Efficient Accelerator SOC for Convolutional Neural Network Training
高效能卷積神經網路訓練系統加速晶片
Master's thesis === National Taiwan University === Graduate Institute of Electronics Engineering === Academic year 107 (2018-2019)
Main Authors: | Kung, Chu King 江子近 |
---|---|
Other Authors: | Tzi-Dar Chiueh 闕志達 |
Format: | Others |
Language: | en_US |
Published: | 2019 |
Online Access: | http://ndltd.ncl.edu.tw/handle/y475rn |
id | ndltd-TW-107NTU05428058 |
collection | NDLTD |
description |
The recent resurgence of artificial intelligence is due to advances in deep learning. Deep neural networks (DNNs) have exceeded human capability in many computer vision applications, such as object detection, image classification, and playing games like Go. The idea of deep learning dates back to as early as the 1950s, with the key algorithmic breakthroughs occurring in the 1980s. Yet it is only in the past few years that powerful hardware accelerators have become available to train neural networks.
The demand for machine learning algorithms continues to grow, and it is affecting almost every industry. Designing a powerful and efficient hardware accelerator for deep learning algorithms is therefore of critical importance. Accelerators that run deep learning algorithms must be general enough to support deep neural networks with various computational structures. For instance, general-purpose graphics processing units (GP-GPUs) have been widely adopted for deep learning tasks because they allow users to execute arbitrary code on them.
Beyond graphics processing units, researchers have also paid considerable attention to hardware acceleration of deep neural networks (DNNs) in the last few years. Google developed its own chip, the Tensor Processing Unit (TPU), to power its machine learning services [8], while Intel unveiled Nervana, its first generation of ASIC processors for deep learning, a few years ago [9]. ASICs usually provide better performance than FPGA and software implementations. However, existing accelerators mostly focus on inference, whereas local DNN training is still required to meet the needs of new applications such as incremental learning and on-device personalization. Unlike inference, training requires high dynamic range in order to deliver high learning quality.
In this work, we introduce the floating-point signed digit (FloatSD) data representation format for reducing the computational complexity of both inference and training of convolutional neural networks (CNNs). By co-designing the data representation and the circuit, we demonstrate that we can achieve high raw performance and high energy and area efficiency without sacrificing training quality.
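To make the idea concrete, the following is a minimal sketch of a signed-digit weight representation in the same spirit as FloatSD. It assumes a simplified scheme in which a weight is approximated by a short sum of signed powers of two, so that multiplying an activation by a weight reduces to a few shifts and adds; the exact digit grouping, bit widths, and exponent handling used in the thesis are not reproduced here.

```python
import math

# Hedged sketch only: a generic signed-power-of-two approximation, not the exact
# FloatSD format from the thesis. Each weight is stored as a few (sign, exponent)
# pairs, so weight * activation becomes a handful of shift-and-add operations.

def to_signed_digits(w, num_terms=3):
    """Greedily approximate w as a sum of at most num_terms signed powers of two."""
    terms = []                                   # list of (sign, exponent) pairs
    residual = w
    for _ in range(num_terms):
        if residual == 0.0:
            break
        sign = 1 if residual > 0 else -1
        exp = round(math.log2(abs(residual)))    # nearest power of two on a log scale
        terms.append((sign, exp))
        residual -= sign * (2.0 ** exp)
    return terms

def sd_multiply(terms, activation):
    """Multiply activation by the signed-digit weight using shifts (ldexp) and adds."""
    acc = 0.0
    for sign, exp in terms:
        acc += sign * math.ldexp(activation, exp)   # activation * 2**exp
    return acc

digits = to_signed_digits(0.8125)                # -> [(1, 0), (-1, -2), (1, -4)]
print(digits, sd_multiply(digits, 3.0), 0.8125 * 3.0)
```

In hardware, each (sign, exponent) pair maps to a shifter and an adder in the multiply-accumulate datapath, which is where the reduction in computational complexity comes from.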
This work focuses on the design of a FloatSD-based system on chip (SOC) for AI training and inference. The SOC integrates an AI IP, a DDR3 controller, and an ARC HS34 CPU through standard AMBA AXI/AHB interfaces. The platform can be programmed by the CPU via the AHB slave port to support various neural network topologies.
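As a rough illustration of what such programming might look like, the sketch below writes a hypothetical set of memory-mapped configuration registers exposed through the AHB slave port. All register names, offsets, and fields are invented for illustration; the thesis does not specify the actual register map here.

```python
# Hypothetical register map: every name and offset below is assumed for illustration
# and does not come from the thesis. The CPU would issue these writes through the
# accelerator's AHB slave port as ordinary memory-mapped stores.

ACCEL_BASE = 0x4000_0000        # assumed base address of the AI IP's register window

REG_LAYER_TYPE   = 0x00         # e.g. 0 = convolution, 1 = fully connected
REG_IN_CHANNELS  = 0x04
REG_OUT_CHANNELS = 0x08
REG_KERNEL_SIZE  = 0x0C
REG_WEIGHT_ADDR  = 0x10         # DDR3 address of the layer's FloatSD weights
REG_START        = 0x14         # write 1 to launch the layer

def write_reg(offset, value):
    """Stand-in for a 32-bit store to ACCEL_BASE + offset over the AHB slave port."""
    print(f"store 0x{value:08x} -> 0x{ACCEL_BASE + offset:08x}")

def configure_conv_layer(in_ch, out_ch, kernel, weight_addr):
    """Program one convolutional layer, then kick off execution."""
    write_reg(REG_LAYER_TYPE, 0)
    write_reg(REG_IN_CHANNELS, in_ch)
    write_reg(REG_OUT_CHANNELS, out_ch)
    write_reg(REG_KERNEL_SIZE, kernel)
    write_reg(REG_WEIGHT_ADDR, weight_addr)
    write_reg(REG_START, 1)

configure_conv_layer(in_ch=64, out_ch=128, kernel=3, weight_addr=0x8010_0000)
```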
The completed SOC has been tested and validated on the HAPS-80 FPGA platform. After the correctness of the SOC was verified, a synthesis and automated place-and-route (APR) flow was used to tape out a 28 nm test chip. At its nominal operating condition (400 MHz), the accelerator achieves a peak performance of 1.38 TFLOPS and an energy efficiency of 2.34 TFLOPS/W.
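For context, the headline numbers can be related with a quick back-of-the-envelope calculation, assuming the peak-performance and efficiency figures refer to the same 400 MHz operating point (an assumption, since the abstract does not state this explicitly).

```python
# Derived figures are estimates under the stated assumption, not numbers from the thesis.
peak_flops = 1.38e12       # reported peak performance, FLOP/s
efficiency = 2.34e12       # reported energy efficiency, FLOP/s per watt
clock_hz   = 400e6         # reported operating frequency

flops_per_cycle = peak_flops / clock_hz      # ~3450 floating-point operations per cycle
implied_power_w = peak_flops / efficiency    # ~0.59 W at peak throughput

print(f"{flops_per_cycle:.0f} FLOPs per cycle, about {implied_power_w:.2f} W implied power")
```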