Overlapping communications and computations in GPU-based iterative linear solvers

Bibliographic Details
Main Authors: V. Platonov, A. Monakov
Format: Article
Language: English
Published: Ivannikov Institute for System Programming of the Russian Academy of Sciences 2018-10-01
Series: Труды Института системного программирования РАН (Proceedings of the Institute for System Programming of the RAS)
Online Access: https://ispranproceedings.elpub.ru/jour/article/view/28
Description
Summary: Krylov subspace methods such as the Conjugate Gradient and Biconjugate Gradient Stabilized methods are well-known approaches for solving symmetric and nonsymmetric systems of linear algebraic equations, such as the systems that typically arise from partial differential equations in computational mathematics problems, for example the Navier-Stokes equations in fluid dynamics. As mesh sizes and numbers of computational nodes grow, network communication time can become a bottleneck: stalls during global reductions grow longer and prevent useful computation. This happens because, in the original formulations of these methods, computing a dot product requires a global reduce operation whose value is needed in the next step, so each process has to wait until all the others reach that point, as in a barrier synchronization. We study alternative formulations of conjugate gradient methods (Preconditioned Conjugate Gradient and BiCGStab) for GPU-based iterative linear system solvers. These formulations allow parallel computations to overlap with communications, at the cost of additional computation and memory. We describe an implementation of our approach for GPU-accelerated hybrid systems in OpenFOAM, an open-source framework for computational fluid dynamics. Asynchronous collective communications from the MPI-3 parallel programming API are used to avoid full barrier synchronization and reduce latency. Experimental results on 2- and 4-million-cell cases from standard OpenFOAM problems are presented.
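
The overlap described in the summary hinges on MPI-3 non-blocking collectives: the global reduction for a dot product is started with MPI_Iallreduce, independent local work proceeds while the reduction is in flight, and MPI_Wait is called only when the reduced value is actually needed. Below is a minimal C++/MPI sketch of this pattern; it is not the authors' OpenFOAM or GPU implementation, and the vector length, the stand-in local kernel, and the surrounding structure are placeholder assumptions for illustration.

#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Local part of a dot product; the global value comes from the reduction below.
static double local_dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const std::size_t n = 1 << 16;  // local vector length (placeholder)
    std::vector<double> r(n, 1.0), p(n, 1.0), w(n, 0.0);

    // 1. Start the global reduction for the dot product needed later in the iteration.
    double local = local_dot(r, r);
    double global = 0.0;
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    // 2. Overlap: perform local work that does not depend on the reduced value
    //    (in a pipelined solver this would be the matrix-vector product or
    //    preconditioner application; here a trivial stand-in kernel).
    for (std::size_t i = 0; i < n; ++i) w[i] = 2.0 * p[i];

    // 3. Wait for the reduction only when its result is needed; the network
    //    latency is hidden behind the local computation instead of a stall.
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) std::printf("global dot product = %g\n", global);

    MPI_Finalize();
    return 0;
}

In a pipelined CG or BiCGStab iteration, step 2 would typically be the sparse matrix-vector product or preconditioner application, which is what makes the extra computations and storage of the reformulated methods worthwhile.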
ISSN: 2079-8156, 2220-6426