Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs

Scientific applications that operate on large data sets require a huge amount of computation power and memory. These applications are typically run on High Performance Computing (HPC) systems that consist of multiple compute nodes connected over a network interconnect such as InfiniBand. Each compute...


Bibliographic Details
Main Author:	Chandan, G
Other Authors:	Bondhugula, Uday
Language:	en_US
Published:	2018
Subjects:	High Performance Computing Systems; Computer Placement; Data Analysis and Management; Distributed-memory; Automatic Parallelization; Data-distribution; Polyhedral Model; Computation Placement; Hyperplanes and Polyhedra; HPC Systems; Data Allocation and Management; Data Movement Code Generation; Computer Science
Online Access:	http://hdl.handle.net/2005/3111 http://etd.ncsi.iisc.ernet.in/abstracts/3971/G26343-Abs.pdf

id	ndltd-IISc-oai-etd.ncsi.iisc.ernet.in-2005-3111
record_format	oai_dc
spelling	ndltd-IISc-oai-etd.ncsi.iisc.ernet.in-2005-3111; 2018-03-06T03:35:43Z; Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs; Chandan, G; Bondhugula, Uday; 2018-02-14T21:05:14Z; 2018-02-14T21:05:14Z; 2018-02-15; 2014; Thesis; http://hdl.handle.net/2005/3111; http://etd.ncsi.iisc.ernet.in/abstracts/3971/G26343-Abs.pdf; en_US; G26343
collection	NDLTD
language	en_US
sources	NDLTD
topic	High Performance Computing Systems; Computer Placement; Data Analysis and Management; Distributed-memory; Automatic Parallelization; Data-distribution; Polyhedral Model; Computation Placement; Hyperplanes and Polyhedra; HPC Systems; Data Allocation and Management; Data Movement Code Generation; Computer Science
spellingShingle	High Performance Computing Systems; Computer Placement; Data Analysis and Management; Distributed-memory; Automatic Parallelization; Data-distribution; Polyhedral Model; Computation Placement; Hyperplanes and Polyhedra; HPC Systems; Data Allocation and Management; Data Movement Code Generation; Computer Science; Chandan, G; Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
description	Scientific applications that operate on large data sets require a huge amount of computation power and memory. These applications are typically run on High Performance Computing (HPC) systems that consist of multiple compute nodes connected over a network interconnect such as InfiniBand. Each compute node has its own memory and does not share its address space with other nodes. A significant amount of work has been done in the past two decades on parallelizing for distributed-memory architectures. A majority of this work went into developing compiler technologies such as High Performance Fortran (HPF) and partitioned global address space (PGAS). However, several steps involved in achieving good performance remained manual. Hence, the approach currently used to obtain the best performance is to rely on highly tuned libraries such as ScaLAPACK. The objective of this work is to improve automatic compiler and runtime support for distributed-memory clusters for regular programs. Regular programs typically use arrays as their main data structure, and their array accesses are affine functions of outer loop indices and program parameters. Many scientific applications, such as linear-algebra kernels, stencils, partial differential equation solvers, data-mining applications and dynamic programming codes, fall in this category. In this work, we propose techniques for finding computation mapping and data allocation when compiling regular programs for distributed-memory clusters. Techniques for transformation and detection of parallelism that rely on the polyhedral framework already exist. We propose automatic techniques to determine computation placements for identified parallelism and allocation of data. We model the problem of finding a good computation placement as a graph partitioning problem with constraints to minimize both communication volume and load imbalance for the entire program. We show that our approach for computation mapping is more effective than those that can be developed using vendor-supplied libraries. Our approach for data allocation is driven by tiling of data spaces along with a compiler-assisted runtime scheme to allocate and deallocate tiles on demand and reuse them. Experimental results on some sequences of BLAS calls demonstrate a mean speedup of 1.82× over versions written with ScaLAPACK. Besides enabling weak scaling for distributed memory, data tiling also improves locality for shared-memory parallelization. Experimental results on a 32-core shared-memory SMP system show a mean speedup of 2.67× over code that is not data tiled.
author2	Bondhugula, Uday
author_facet	Bondhugula, Uday Chandan, G
author	Chandan, G
author_sort	Chandan, G
title	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
title_short	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
title_full	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
title_fullStr	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
title_full_unstemmed	Effective Automatic Computation Placement and Data Allocation for Parallelization of Regular Programs
title_sort	effective automatic computation placement and data allocation for parallelization of regular programs
publishDate	2018
url	http://hdl.handle.net/2005/3111 http://etd.ncsi.iisc.ernet.in/abstracts/3971/G26343-Abs.pdf
work_keys_str_mv	AT chandang effectiveautomaticcomputationplacementanddataallocationforparallelizationofregularprograms
_version_	1718615366751485952


Similar Items