A Scalable Architecture for Accelerating Multi-Operation and Continuous Floating-Point Matrix Computing on FPGAs

Bibliographic Details
Main Authors: Longlong Zhang, Yuanxi Peng, Ahui Huang, Xiao Hu
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9093911/
Description
Summary: Matrix computing is a basic operational model that is broadly used in science and engineering applications. In this study, we first propose a novel optimization method to obtain a high-performance and scalable architecture for matrix multiplication, including reducing data transmission, optimizing data flow, improving resource utilization, and dynamically changing the length of the linear array. Based on the optimized architecture, we present a multi-operation floating-point matrix computing unit (design-I), which extends matrix computing from a single matrix multiplication operation to matrix addition, matrix subtraction, matrix-vector multiplication, and matrix-scalar multiplication. With low storage demand and high computing efficiency, design-I can be used to compute dense matrices of arbitrary sizes. Moreover, we propose a continuous floating-point matrix computing unit (design-II), which provides the same multi-operation functions, meets the requirement of continuous matrix computing in practical engineering, and avoids a large amount of intermediate data transfer. Finally, we use these unit cores to build a matrix computing acceleration system according to different engineering requirements. Experiments implemented on the Xilinx 585T FPGA device show that the accelerator achieves a maximum frequency of 195 MHz with 256 processing elements (PEs) and performs 99.8 GFLOPS. Compared with state-of-the-art methods, the architecture offers a broader application scope and better prospects.
ISSN: 2169-3536