Accelerating Applications with Pattern-specific Optimizations on Accelerators and Coprocessors

Bibliographic Details
Main Author: Chen, Linchuan
Language: English
Published: The Ohio State University / OhioLINK, 2015
Subjects: Computer Science; accelerators; pattern-specific optimizations
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=osu1435407747
Rights: This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
Abstract

Because of the bottleneck in increasing clock frequency, multi-core processors emerged as a way of improving the overall performance of CPUs. In the past decade, many-core processors have begun to play an increasingly important role in scientific computing. The highly cost-effective nature of many-cores makes them extremely suitable for data-intensive computations. Specifically, many-cores take the form of GPUs (e.g., NVIDIA or AMD GPUs) and, more recently, coprocessors (Intel MIC). Even though these highly parallel architectures offer significant computational power, they are very hard to program, and harder still to fully exploit. Combining the power of multi-cores and many-cores, i.e., making use of heterogeneous cores, is extremely complicated.

Our efforts are devoted to optimizing important classes of applications on such parallel systems. We address this issue from the perspective of communication patterns. Scientific applications can be classified based on their communication patterns, as characterized by the Berkeley Dwarfs. By investigating the characteristics of each class, we are able to derive efficient execution strategies across different levels of parallelism. We design a high-level programming API and implement an efficient runtime system with pattern-specific optimizations that account for the characteristics of the hardware platform. Thus, instead of providing a general programming model, we provide separate APIs for each communication pattern. We have worked on a selected subset of the communication patterns, including MapReduce, generalized reductions, irregular reductions, stencil computations, and graph processing. Our targeted platforms are single GPUs, coupled CPU-GPU systems, heterogeneous clusters, and Intel Xeon Phis. Our work not only focuses on efficiently executing a communication pattern on a single multi-core or many-core, but also considers inter-device and inter-node task scheduling. While implementing a specific communication pattern, we consider aspects including lock reduction, data locality, and load balancing.

Our work starts with the optimization of MapReduce on a single GPU, specifically aiming to utilize the shared memory efficiently. We design a reduction-based approach, which keeps memory consumption low by avoiding the storage of intermediate key-value pairs. To support this approach, we design a general data structure, referred to as the reduction object, which is placed in the memory hierarchy of the GPU. The limited memory requirement of the reduction object allows us to extensively utilize the small but fast shared memory. Our approach performs well for a popular set of MapReduce applications, especially the reduction-intensive ones. A comparison with earlier state-of-the-art accelerator-based approaches shows that our approach is much more efficient at utilizing the shared memory.
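To make the idea concrete, here is a minimal CPU-side C++ sketch of a reduction-object-style structure: each (key, value) pair produced by map() is reduced directly into a small fixed-capacity hash table, so intermediate pairs are never stored. All names here (ReductionObject, reduce, kBuckets) are illustrative assumptions, not the dissertation's actual API.

```cpp
// Sketch of the reduction-object idea: map() reduces each (key, value)
// pair directly into a small fixed-capacity hash table. On the GPU, a
// structure this small can live largely in fast shared memory.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct ReductionObject {
    static constexpr size_t kBuckets = 1024;          // shared-memory friendly
    struct Slot { std::string key; long value = 0; bool used = false; };
    std::vector<Slot> slots{kBuckets};

    // Reduce a pair into the object, using open addressing on collisions.
    void reduce(const std::string& key, long value) {
        size_t h = std::hash<std::string>{}(key) % kBuckets;
        for (size_t probe = 0; probe < kBuckets; ++probe) {
            Slot& s = slots[(h + probe) % kBuckets];
            if (!s.used) { s = {key, value, true}; return; }
            if (s.key == key) { s.value += value; return; }  // user combine
        }
        // A real runtime would spill to device memory here.
    }
};

int main() {
    ReductionObject ro;                                // word count example
    for (const char* w : {"gpu", "simd", "gpu", "mic", "gpu"})
        ro.reduce(w, 1);                               // map emits (word, 1)
    for (const auto& s : ro.slots)
        if (s.used) std::printf("%s: %ld\n", s.key.c_str(), s.value);
}
```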
Even though MapReduce significantly reduces the complexity of parallel programming, it is not easy to achieve efficient execution of complicated applications on heterogeneous clusters with multi-core CPUs and multiple GPUs within each node. In view of this, we design a programming framework that aims to reduce programming difficulty as well as provide automatic optimizations to applications. Our approach is to classify applications based on communication patterns. The patterns we study include generalized reductions, irregular reductions, and stencil computations, which are important patterns frequently used in scientific and data-intensive computations. For each pattern, we design a simple API, as well as a runtime with pattern-specific optimizations at different parallelism levels.

We also investigate graph applications. We design a graph processing system over the Intel Xeon Phi and CPU, with a vertex-centric programming API and a novel condensed static message buffer that reduces memory consumption and supports SIMD message reduction. We also use a pipelining scheme to avoid frequent locking. A hybrid graph partitioning achieves load balance between the CPU and the Xeon Phi, and reduces communication overhead. (A sketch of such a vertex-centric interface appears at the end of this abstract.)

Executing irregular applications on SIMD architectures is always challenging. The irregularity leads to problems including poor data access locality, data dependencies, and inefficient utilization of SIMD lanes. We propose a general optimization methodology for irregular applications, including irregular reductions, graph algorithms, and sparse matrix-matrix multiplication. The key observation behind our approach is that the major data structures accessed by irregular applications can be treated as sparse matrices. The steps of our methodology are matrix tiling, data access pattern identification, and conflict removal. As a consequence, our approach is able to efficiently utilize both SIMD and MIMD parallelism on the Intel Xeon Phi.
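As a concrete illustration of this methodology, the sketch below tiles a toy sparse matrix and splits each tile into conflict-free groups before performing a sparse matrix-vector product. The greedy grouping and all names are illustrative assumptions, not the dissertation's actual algorithm.

```cpp
// Tiling + conflict removal: nonzeros are bucketed into 2D tiles (locality),
// then packed into groups in which no two entries share an output row, so a
// group can be processed by SIMD lanes without atomic updates.
#include <cstdio>
#include <set>
#include <vector>

struct NZ { int row, col; double val; };   // one nonzero, COO form

// Step 1: bucket nonzeros of an n x n sparse matrix into t x t tiles.
std::vector<std::vector<NZ>> makeTiles(const std::vector<NZ>& nz, int n, int t) {
    int dim = (n + t - 1) / t;
    std::vector<std::vector<NZ>> tiles(dim * dim);
    for (const NZ& e : nz)
        tiles[(e.row / t) * dim + (e.col / t)].push_back(e);
    return tiles;
}

// Step 2 (conflict removal): greedily split one tile into groups whose
// entries all have distinct output rows.
std::vector<std::vector<NZ>> conflictFreeGroups(std::vector<NZ> tile) {
    std::vector<std::vector<NZ>> groups;
    while (!tile.empty()) {
        std::set<int> rows;
        std::vector<NZ> group, rest;
        for (const NZ& e : tile)
            (rows.insert(e.row).second ? group : rest).push_back(e);
        groups.push_back(group);
        tile = rest;
    }
    return groups;
}

int main() {
    // Toy SpMV y = A*x on a 4x4 matrix with a write conflict in row 0.
    std::vector<NZ> a = {{0,0,1}, {0,3,2}, {1,1,3}, {2,0,4}, {3,2,5}};
    double x[4] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};
    for (auto& tileNZ : makeTiles(a, 4, 2))
        for (auto& group : conflictFreeGroups(tileNZ))
            for (auto& e : group)          // each group: SIMD-safe updates
                y[e.row] += e.val * x[e.col];
    for (double v : y) std::printf("%g ", v);
    std::printf("\n");
}
```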
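Returning to the graph processing system described earlier: below is a minimal single-threaded sketch of a vertex-centric style of computation in which messages to a vertex are combined on the fly into one slot per vertex (the spirit of a condensed message buffer), shown for single-source shortest paths. The interface and names are illustrative assumptions rather than the system's actual API.

```cpp
// Vertex-centric SSSP with a condensed, one-slot-per-vertex "inbox":
// min() acts as the message combiner, so messages are never queued.
#include <cstdio>
#include <limits>
#include <vector>

struct Edge { int dst; double w; };

int main() {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<std::vector<Edge>> graph = {
        {{1, 4}, {2, 1}}, {{3, 1}}, {{1, 2}, {3, 5}}, {}};
    std::vector<double> dist(graph.size(), INF);
    std::vector<double> inbox(graph.size(), INF);  // condensed message buffer
    inbox[0] = 0;                                  // seed the source vertex

    bool active = true;
    while (active) {                               // one sweep per iteration
        active = false;
        for (size_t v = 0; v < graph.size(); ++v) {
            if (inbox[v] >= dist[v]) continue;     // no improving message
            dist[v] = inbox[v];                    // "compute" on the vertex
            for (const Edge& e : graph[v]) {       // send; min is the combiner
                double cand = dist[v] + e.w;
                if (cand < inbox[e.dst]) { inbox[e.dst] = cand; active = true; }
            }
        }
    }
    for (size_t v = 0; v < dist.size(); ++v)
        std::printf("dist[%zu] = %g\n", v, dist[v]);
}
```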