GPU Array Access Auto-Tuning

GPUs have been used for years in compute intensive applications. Their massive parallel processing capabilities can speedup calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only on...

Full description

Bibliographic Details
Main Author: Weber, Nicolas
Format: Others
Language:en
Published: 2017
Online Access:https://tuprints.ulb.tu-darmstadt.de/6507/1/main.pdf
Weber, Nicolas <http://tuprints.ulb.tu-darmstadt.de/view/person/Weber=3ANicolas=3A=3A.html> (2017): GPU Array Access Auto-Tuning.Darmstadt, Technische Universität, [Ph.D. Thesis]
Description
Summary:GPUs have been used for years in compute intensive applications. Their massive parallel processing capabilities can speedup calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece to achieve high performance. Nearly as important as suitable algorithms is the actual implementation and the usage of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time consuming task that requires deep understanding of the algorithms and the underlying hardware. Unlike CPUs, the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore it does not suffice to optimize the code once during the development, but it has to be optimized for each new GPU generation that is released. To efficiently (re-)optimize code towards the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA -- the leading manufacturer for GPUs today -- applied significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive objective for an auto-tuner. In this thesis we introduce the MATOG auto-tuner that automatically optimizes array access for NVIDIA CUDA applications. In order to achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This allows to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime. To show MATOG's capabilities, we evaluated it on a variety of different applications, ranging from simple algorithms up to complex applications on the last four hardware generations, with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-, mid-, high-end and HPC) and generations. In some cases it is able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU by dynamically changing data layouts throughout the execution.