A large dimensional matrix chain matrix multiplier for extremely low IO bandwidth requirements

Bibliographic Details
Main Authors: Song Yukun, Zheng Qiangqiang, Wang Zezhong, Zhang Duoli
Format: Article
Language: zho
Published: National Computer System Engineering Research Institute of China 2019-09-01
Series: Dianzi Jishu Yingyong
Online Access: http://www.chinaaet.com/article/3000108356
Description
Summary: Large-dimensional matrix multiplication is usually implemented with the submatrix block method, in which the maximum submatrix size determines the speed of the whole multiplication. Because the matrix size that a classical systolic structure can process directly is severely limited by IO bandwidth, this paper proposes a large-dimensional matrix chain multiplier structure with extremely low IO bandwidth requirements, and completes its hardware design, implementation, and performance verification. The main work is as follows. First, the data organization of matrix multiplication is optimized so that the input matrix size is independent of IO bandwidth and the internal logic and storage resources of the device are used to the fullest. Second, based on the optimized data organization, chain multiplier hardware is designed that overlaps source-data computation with data transfer. Third, the multiplier's adaptability to matrix scale is improved: the chain multiplier can be configured in real time as multiple independent chains that run multiple sets of operations in parallel. Finally, chain multipliers of different sizes are implemented and tested on the Xilinx C7V2000T FPGA chip. On this chip, the proposed chain multiplier supports up to 800 arithmetic units, 8 times the size of the classical systolic structure. With the same number of arithmetic units, the proposed chain multiplier needs only 1/8 of the IO bandwidth of the classical systolic structure to reach equal performance.
ISSN: 0258-7998
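
The submatrix block method mentioned in the summary can be illustrated with a minimal software sketch. It is not the paper's hardware design: the function name `blocked_matmul`, the `block` parameter, and the traffic model (one external fetch per element of every submatrix pair loaded) are assumptions made only for illustration. Under that simple model, a larger submatrix means fewer operand fetches for the same product, which is one way to read the claim that submatrix size governs overall speed.

```python
def blocked_matmul(a, b, block):
    """Multiply a (n x k) by b (k x m) one submatrix pair at a time.

    Returns (c, words_loaded). words_loaded is a hypothetical traffic
    count: each time a submatrix pair is loaded, every element of both
    submatrices counts as one fetch from external memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    words_loaded = 0
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Clip the block at the matrix edges.
                i1 = min(i0 + block, n)
                j1 = min(j0 + block, m)
                k1 = min(k0 + block, k)
                # Assumed cost of loading one block of a and one block of b.
                words_loaded += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                # Accumulate the partial product into the output block.
                for i in range(i0, i1):
                    for kk in range(k0, k1):
                        a_ik = a[i][kk]
                        for j in range(j0, j1):
                            c[i][j] += a_ik * b[kk][j]
    return c, words_loaded


if __name__ == "__main__":
    size = 8
    a = [[i + j for j in range(size)] for i in range(size)]
    b = [[i * j + 1 for j in range(size)] for i in range(size)]
    for block in (2, 4, 8):
        _, words = blocked_matmul(a, b, block)
        print(block, words)
```

For square n x n inputs with a block size that divides n, this count works out to 2n³/block, so doubling the block size halves the modeled traffic. A real design would also have to account for on-chip storage limits and for writing results back, which this sketch ignores.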