Summary: | Matrix computing is a fundamental operation that is widely used in science and engineering applications. In this study, we first propose a novel optimization method to obtain a high-performance, scalable architecture for matrix multiplication; the method reduces data transmission, optimizes data flow, improves resource utilization, and dynamically changes the length of the linear array. Based on the optimized architecture, we present a multi-operation floating-point matrix computing unit (design-I), which extends matrix computing from matrix multiplication alone to matrix addition, matrix subtraction, matrix-vector multiplication, and matrix-scalar multiplication. With low storage demand and high computing efficiency, design-I can operate on dense matrices of arbitrary size. Moreover, we propose a continuous floating-point matrix computing unit (design-II), which supports the same set of operations while also meeting the requirement of continuous matrix computing in practical engineering, avoiding a large amount of intermediate data transfer. Finally, we use these unit cores to build a matrix computing acceleration system tailored to different engineering requirements. Experiments on a Xilinx 585T FPGA device show that the accelerator reaches a maximum frequency of 195 MHz with 256 processing elements (PEs) and delivers 99.8 GFLOPS. Compared with state-of-the-art methods, the architecture offers a broader application scope and better prospects.
|
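The reported throughput can be sanity-checked with a short calculation. The sketch below assumes each PE completes one floating-point multiply and one floating-point add per clock cycle (2 FLOPs per cycle). The summary does not state this, so it is an assumption, and the function name `peak_gflops` is hypothetical. Under that assumption, the peak for 256 PEs at 195 MHz is consistent with the quoted 99.8 GFLOPS.

```python
def peak_gflops(num_pes: int, freq_mhz: float, flops_per_pe_per_cycle: int = 2) -> float:
    """Theoretical peak throughput of a PE array, in GFLOPS.

    Assumption (not stated in the summary): every PE issues one multiply
    and one add per cycle, i.e. 2 floating-point operations per cycle.
    """
    flops_per_second = num_pes * flops_per_pe_per_cycle * freq_mhz * 1e6
    return flops_per_second / 1e9


# Figures taken from the summary: 256 PEs at a maximum frequency of 195 MHz.
peak = peak_gflops(num_pes=256, freq_mhz=195.0)
print(f"Peak throughput: {peak:.2f} GFLOPS")
```

If the PEs performed a different number of operations per cycle, the third argument would change and the estimate would scale linearly with it.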