Summary: | Doctoral === National Chiao-Tung University === Department of Electronics Engineering === 89 === Arithmetic Module Design and its Application to FFT
Student: Wen-Chang Yeh
Advisor: Prof. Chein-Wei Jen
Department of Electronics Engineering, National Chiao-Tung University
Abstract
Addition and multiplication are the most fundamental operations in digital signal processing and are widely employed in modern computers and various applications. Although they have been studied extensively in the literature, most of the research focuses on algorithm and architecture exploration without considering operation scheduling at the same time. In this dissertation, we study the design of arithmetic modules and consider the bit-level and word-level scheduling problems together, using a systematic methodology.
For two's complement addition, a set of operators and notations has been developed to explore the relationships among the conventional carry-lookahead based and conditional-sum based algorithms. From the resulting formulae, we can describe the path that generates the carry and the path that generates the sum separately. Moreover, using these formulae we can prove that these algorithms can have almost identical topologies from an algorithmic perspective. Hence, we can clarify and identify the distinctive features of carry generation and sum generation in each algorithm. By exploiting these features, two timing-driven generalized addition algorithms, dual-bit forward prediction (DFP) and generalized earliest-first (GEF), have been proposed to achieve high performance. Unlike traditional algorithms, the proposed GEF algorithm can use both conditional-sum and carry-lookahead rules to generate an optimized adder that fully exploits the features of the input delay profile.
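As a small illustration of the two conventional families compared above, the following Python sketch models a carry-lookahead adder (generate/propagate signals combined by a parallel-prefix scan) and a conditional-sum adder (each block precomputes its result for carry-in 0 and 1, then selects). It is a bit-level behavioral model written for this summary, not the DFP or GEF algorithm of the dissertation; the function names and the Kogge-Stone style scan are assumptions chosen for brevity.

```python
def to_bits(x, n):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits for a little-endian bit list."""
    return sum(b << i for i, b in enumerate(bits))


def carry_lookahead_add(a, b, n):
    """n-bit addition (mod 2**n) using a parallel-prefix carry network."""
    x, y = to_bits(a, n), to_bits(b, n)
    g = [xi & yi for xi, yi in zip(x, y)]   # bit generate
    p = [xi ^ yi for xi, yi in zip(x, y)]   # bit propagate
    G, P = g[:], p[:]
    d = 1
    while d < n:
        # Combine each position with the group ending d positions below it.
        G_new, P_new = G[:], P[:]
        for i in range(d, n):
            G_new[i] = G[i] | (P[i] & G[i - d])
            P_new[i] = P[i] & P[i - d]
        G, P = G_new, P_new
        d *= 2
    # After the scan, G[i] is the carry out of bit i for carry-in 0.
    carries = [0] + G[:-1]
    return from_bits([p[i] ^ carries[i] for i in range(n)])


def conditional_sum_block(x, y):
    """Return ((sum_bits, cout) for cin=0, (sum_bits, cout) for cin=1)."""
    if len(x) == 1:
        s0, c0 = x[0] ^ y[0], x[0] & y[0]
        s1, c1 = s0 ^ 1, x[0] | y[0]
        return ([s0], c0), ([s1], c1)
    h = len(x) // 2
    lo0, lo1 = conditional_sum_block(x[:h], y[:h])
    hi0, hi1 = conditional_sum_block(x[h:], y[h:])

    def merge(lo):
        # The low half's carry-out selects one of the two high-half results.
        s_lo, c_lo = lo
        s_hi, c_hi = hi1 if c_lo else hi0
        return s_lo + s_hi, c_hi

    return merge(lo0), merge(lo1)


def conditional_sum_add(a, b, n):
    """n-bit addition (mod 2**n) using conditional-sum selection."""
    (sum_bits, _cout), _ = conditional_sum_block(to_bits(a, n), to_bits(b, n))
    return from_bits(sum_bits)
```

Both functions return the same n-bit sum for every input pair; they differ only in how the carry information is organized, which is the kind of structural similarity the formulae in the dissertation make explicit.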
Bit-parallel multiplication consists of three steps: partial product generation, partial product reduction, and final addition. For partial product generation, direct generation and several popular modified-Booth encoding (MBE) schemes are studied. A novel MBE algorithm has been proposed that generates the partial products within a delay of two exclusive-or (XOR) gates while minimizing power consumption by removing spurious signal glitches in the partial product array. A new partial product array for MBE has been derived to improve the performance of the LSB part. We also examine the performance of a partial product reduction tree (PPRT) optimized via the three-dimensional minimization (TDM) algorithm together with the proposed MBE scheme. For the TDM algorithm, we evaluate the performance improvement achievable with different full adders and present a powerful sum-carry separation technique to improve the output delay profile. When the generalized addition algorithms developed herein are applied to the final addition of the multiplication, a speed improvement of more than 10% can be achieved, while the hardware cost and the power consumption of the adder are also reduced.
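For readers unfamiliar with modified-Booth encoding, the sketch below shows the standard radix-4 recoding of a two's complement multiplier into digits from {-2, -1, 0, 1, 2}, which halves the number of partial products. It models the arithmetic only; it is not the novel MBE circuit proposed in the dissertation, and the function names are hypothetical.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's complement multiplier (n even) into radix-4 digits.

    Digit i is y[2i-1] + y[2i] - 2*y[2i+1], with y[-1] taken as 0.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for radix-4 recoding")
    # Python's >> on negative ints is arithmetic, so this yields two's complement bits.
    bits = [(y >> i) & 1 for i in range(n)]
    digits = []
    prev = 0
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(x, y, n):
    """Multiply x by an n-bit two's complement y by summing n/2 partial products."""
    partial_products = [d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n))]
    return sum(partial_products)
```

For example, `booth_radix4_digits(7, 4)` gives `[-1, 2]`, since 7 = -1 + 2 * 4, so a 4-bit multiplier needs only two partial products instead of four.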
To study word-level operation scheduling, we explore the design space of the fast Fourier transform (FFT) algorithm, which consists of a series of additions and multiplications. At the algorithm level, we inspect the algorithms based on Cooley-Tukey decomposition by examining their signal flow graphs. Furthermore, for any algorithm based on Cooley-Tukey decomposition, we provide a systematic design methodology to obtain regular hardware architectures. The first is a one-dimensional (1D) pipeline architecture. With the aid of the proposed design methodology, we can overcome the irregularity of the split-radix FFT algorithm and obtain a regular 1D pipeline architecture for it. Through a similar design procedure, pipeline architectures for other higher-radix algorithms can also be obtained. Based on the pipeline architecture, we further derive the design methodology for a single-processing-element architecture. For both the pipeline-based and the single-processing-element-based designs, we have derived properties regarding performance, throughput, and hardware requirements.
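The Cooley-Tukey decomposition referred to above can be sketched in its simplest form, radix-2 decimation in time, where an N-point transform is split into two N/2-point transforms joined by butterflies. This is a software illustration only, assuming a power-of-two length; the dissertation's split-radix and higher-radix pipeline architectures are not reproduced here.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # N/2-point transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # N/2-point transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one twiddle multiplication, one addition, one subtraction.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each butterfly contains exactly the mix of additions and multiplications whose word-level scheduling the methodology addresses.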
Various issues related to the design of an FFT for OFDM systems are discussed in the last part of this dissertation. A one-dimensional pipeline architecture has been adopted to meet the requirements of both high performance and low power. For high-speed applications, a delay-balanced hardware architecture has been proposed to remove unnecessary carry-propagation additions: the final addition of the multiplication is merged into the butterfly operation so that carry-propagation adders (CPA) are replaced with carry-save adders (CSA). For low-speed applications, we also present a three-multiplication scheme to further reduce the hardware cost and the power consumption. Based on post-synthesis and post-layout simulations, the split-radix pipeline architecture is recognized as a good candidate for both high-speed and low-speed applications. Compared with a conventional 64-point radix-2^2 pipeline, the proposed split-radix design reduces power by 15% and latency by 14.5% at 100 MHz, 3.3 V, and 25 °C.
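The abstract does not spell out the three-multiplication scheme. One widely known way to compute a complex product with three real multiplications instead of four is sketched below, on the assumption that the dissertation refers to something of this kind for the twiddle-factor multiplications; the function name and the particular factoring are illustrative, not taken from the source.

```python
def complex_mul_3(a, b, c, d):
    """Compute (a + jb) * (c + jd) with three real multiplications.

    Returns (real, imag). The extra additions trade multiplier area for adders.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    real = k1 - k3   # equals a*c - b*d
    imag = k1 + k2   # equals a*d + b*c
    return real, imag
```

When (c + jd) is a fixed twiddle factor, the terms d - c and c + d can be precomputed, so each product costs three multiplications and three additions at run time.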
|