Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

The Basic Linear Algebra Subprograms (BLAS) are a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in BLAS. On multi-core and many-core processors, the workload of GEMM is partitioned and scheduled across multiple threads to exploit the parallel hardware. Typically, the workload is partitioned equally among threads, and all threads are expected to finish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures: the NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
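The core idea the abstract describes, fast threads stealing fine-grained chunks of GEMM work from slow ones after a coarse per-thread partition, can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's Phytium 2000+ implementation: the function name `blocked_gemm_work_stealing`, the round-robin coarse partition, and the per-queue locks are all assumptions made for illustration.

```python
# Illustrative sketch (NOT the paper's implementation): GEMM row-blocks are
# first coarsely partitioned into per-thread queues; a thread that drains its
# own queue steals remaining fine-grained chunks from other threads' queues.
import threading

def blocked_gemm_work_stealing(A, B, n_threads=4, chunk=2):
    n, k, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    # Fine-grained tasks: each task computes `chunk` consecutive rows of C.
    tasks = [range(i, min(i + chunk, n)) for i in range(0, n, chunk)]
    # Coarse partition: deal tasks round-robin into per-thread queues.
    queues = [tasks[t::n_threads] for t in range(n_threads)]
    locks = [threading.Lock() for _ in range(n_threads)]

    def pop(t):
        # Take one task from queue t, or None if it is empty.
        with locks[t]:
            return queues[t].pop() if queues[t] else None

    def worker(t):
        while True:
            rows = pop(t)
            if rows is None:
                # Own queue drained: try to steal a task from any other queue.
                for v in range(n_threads):
                    rows = pop(v)
                    if rows is not None:
                        break
                else:
                    return  # no work left anywhere
            for i in rows:  # compute the assigned rows of C = A * B
                for j in range(p):
                    C[i][j] = sum(A[i][x] * B[x][j] for x in range(k))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return C
```

Because each row of C belongs to exactly one task and each task is taken under a lock, no two threads ever write the same row, so stealing changes only who computes a chunk, not the result. The paper's hybrid-grained scheme additionally tunes the chunk granularity to trade stealing flexibility against synchronization overhead.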

Bibliographic Details
Main Authors: Xing Su, Fei Lei
Format: Article
Language: English
Published: MDPI AG, 2018-11-01
Series: Electronics
Subjects: GEMM, BLAS, high-performance computing, linear algebra
Online Access: https://www.mdpi.com/2079-9292/7/12/359
Record ID: doaj-df705d4017b348f58309dfc08a8acbee
DOI: 10.3390/electronics7120359
ISSN: 2079-9292
Author Affiliation: National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China (both authors)