Performance Prediction Model on HSA-Compatible General-Purpose GPU System


Bibliographic Details
Main Authors: Kuan-Chieh Hsu, 許冠傑
Other Authors: Chung-Ho Chen
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/pja3jx
Description
Summary: Master's === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === In this thesis, we present the memory subsystem of a customized general-purpose GPU (GPGPU) architecture. For fast development, the C++ simulated architecture should remain light-weight while staying timing-accurate, since most of the benchmark simulation time comes from memory-subsystem-related latencies. For example, a level-one cache miss triggers Network-on-Chip (NoC) traffic, and the cache coherence and memory controller scheduling policies also affect the latency seen by a streaming multiprocessor in this GPGPU architecture. We also discuss memory space partitioning methods in a following section, covering both coarse-grain and fine-grain partitioning. For the NoC module, we adopt previous research and discuss the geometric features of the chosen topology, a mesh structure selected for robustness. Another contribution of this work is the use of two machine learning models to predict architecture performance and depict the performance trend across a wide range of hardware configuration settings. We aim to estimate a reasonable summit value on the performance surface by the following procedure. First, the k-means algorithm clusters the training benchmarks into a chosen number of clusters. A multi-class Support Vector Machine (SVM) model is then trained on memory-related features only. During the validation phase, the summit performance values of the testing benchmarks are predicted from the training results. Under the eight-cluster setting, 46.48% of the predicted cycle counts across all tested benchmarks fall within 10% error of the real performance values; by varying the number of clusters, up to 57.97% of the points fall within 10% error. We also show that the summit performance does not necessarily occur under maximum hardware resources. Some discussions point out memory traffic issues that significantly slow the execution of certain access patterns in the benchmarks.
Combining these contributions, we aim to provide a reliable and accurate early-stage simulation platform for future IC chip implementation in an efficient way.
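The prediction flow described in the abstract (cluster training benchmarks with k-means on memory-related features, then map a testing benchmark to a cluster and report that cluster's summit performance, finally counting predictions within 10% error) can be sketched in plain Python. Everything here is illustrative: the feature vectors and cycle counts are made up, and a nearest-centroid assignment stands in for the thesis's multi-class SVM classifier; this is a minimal sketch of the procedure, not the actual implementation.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over benchmark feature vectors; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[min(range(k), key=lambda c: dist2(p, centroids[c]))].append(p)
        for i, members in enumerate(buckets):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def nearest(centroids, p):
    """Assign p to the closest centroid (stand-in for the multi-class SVM)."""
    return min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))

# Hypothetical training data: (memory-feature vector, summit cycle count).
train = [([0.10, 0.20], 1000.0), ([0.15, 0.25], 1100.0),
         ([0.80, 0.90], 5000.0), ([0.85, 0.95], 5200.0)]
k = 2
centroids = kmeans([f for f, _ in train], k, seed=1)

# Each cluster's predicted summit: mean summit value of its training members.
sums, counts = [0.0] * k, [0] * k
for feats, perf in train:
    c = nearest(centroids, feats)
    sums[c] += perf
    counts[c] += 1
summit = [sums[c] / counts[c] if counts[c] else 0.0 for c in range(k)]

# Validation: fraction of testing benchmarks predicted within 10% error.
test = [([0.12, 0.22], 1050.0), ([0.82, 0.92], 5100.0)]
preds = [summit[nearest(centroids, feats)] for feats, _ in test]
within = sum(abs(p - t) / t < 0.10 for p, (_, t) in zip(preds, test))
print(within / len(test))  # 1.0 on this toy data
```

The final ratio plays the role of the "46.48% of points within 10% error" metric reported in the abstract, evaluated here on two toy benchmarks.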