Provably Efficient Algorithms for Numerical Tensor Algebra

This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well as computations with higher-order tensors. Communication is a growing bottleneck in the execution of most algorithms on parallel computers, which...

Bibliographic Details
Main Author: Solomonik, Edgar
Language: EN
Published: University of California, Berkeley 2015
Subjects: Computer Science
Online Access: http://pqdtopen.proquest.com/#viewpdf?dispub=3686016
id ndltd-PROQUEST-oai-pqdtoai.proquest.com-3686016
record_format oai_dc
collection NDLTD
language EN
sources NDLTD
topic Computer Science
description This thesis targets the design of parallelizable algorithms and communication-efficient parallel schedules for numerical linear algebra as well as computations with higher-order tensors. Communication is a growing bottleneck in the execution of most algorithms on parallel computers, which manifests itself as data movement, both through the network connecting different processors and through the memory hierarchy of each processor, as well as synchronization between processors. We provide a rigorous theoretical model of communication and derive lower bounds as well as algorithms in this model. Our analysis concerns two broad areas: linear algebra and tensor contractions. We demonstrate the practical quality of the new theoretically improved algorithms by presenting results that show our implementations outperform standard libraries and traditional algorithms.
We model the costs associated with local computation, interprocessor communication and synchronization, as well as memory-to-cache data transfers of a parallel schedule, based on the most expensive execution path in the schedule. We introduce a new technique for deriving lower bounds on tradeoffs between these costs and apply it to algorithms in both dense and sparse linear algebra as well as graph algorithms. These lower bounds are attained by what we refer to as 2.5D algorithms, which we give for matrix multiplication, Gaussian elimination, QR factorization, the symmetric eigenvalue problem, and the Floyd-Warshall all-pairs shortest-paths algorithm. 2.5D algorithms achieve lower interprocessor bandwidth cost by exploiting auxiliary memory. Algorithms employing this technique are well known for matrix multiplication, and have been derived in the BSP model for LU and QR factorization, as well as the Floyd-Warshall algorithm. We introduce alternate versions of LU and QR algorithms which have measurable performance improvements over their BSP counterparts, and we give the first evaluations of their performance. We also explore network-topology-aware mapping on torus networks for matrix multiplication and LU, showing how 2.5D algorithms can efficiently exploit collective communication, as well as introducing an adaptation of Cannon's matrix multiplication algorithm that is better suited for torus networks with dimension larger than two. For the symmetric eigenvalue problem, we give the first 2.5D algorithms, additionally solving challenges with memory-bandwidth efficiency that arise for this problem. We also give a new memory-bandwidth-efficient algorithm for Krylov subspace methods (repeated multiplication of a vector by a sparse matrix), which is motivated by the application of our lower-bound techniques to this problem.
The latter half of the thesis contains algorithms for higher-order tensors, in particular tensor contractions. The motivating application for this work is the family of coupled-cluster methods, which solve the many-body Schrödinger equation to provide a chemically accurate model of the electronic structure of molecules and chemical reactions where electron correlation plays a significant role. The numerical computation of these methods is dominated in cost by contraction of antisymmetric tensors. We introduce Cyclops Tensor Framework, which provides an automated mechanism for network-topology-aware decomposition and redistribution of tensor data. It leverages 2.5D matrix multiplication to perform tensor contractions communication-efficiently. The framework is capable of exploiting symmetry and antisymmetry in tensors and utilizes a distributed packed-symmetric storage format. Finally, we consider a theoretically novel technique for exploiting tensor symmetry to lower the number of multiplications necessary to perform a contraction by computing some redundant terms that allow preservation of symmetry and then cancelling them out with low-order cost. We analyze the numerical stability and communication efficiency of this technique and give adaptations to antisymmetric and Hermitian matrices. This technique has promising potential for accelerating coupled-cluster methods both in terms of computation and communication cost, and additionally provides a potential improvement for BLAS routines on complex matrices.
author Solomonik, Edgar
title Provably Efficient Algorithms for Numerical Tensor Algebra
publisher University of California, Berkeley
publishDate 2015
url http://pqdtopen.proquest.com/#viewpdf?dispub=3686016
spelling ndltd-PROQUEST-oai-pqdtoai.proquest.com-3686016 2015-04-02T03:59:50Z Provably Efficient Algorithms for Numerical Tensor Algebra Solomonik, Edgar Computer Science University of California, Berkeley 2015-03-27 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=3686016 EN