VLSI Architecture and Analysis of Discrete Wavelet Transform and Motion-Compensated Temporal Filtering

Doctoral === National Taiwan University === Graduate Institute of Electronics Engineering === 93 === Discrete Wavelet Transform (DWT) has led the revolution of block-based image coding and closed-loop video coding systems. In this dissertation, VLSI architectures and memory analysis of DWT in three dimensions are discussed in three different parts: One-Dimension...


Bibliographic Details
Main Authors: Chao-Tsung Huang, 黃朝宗
Other Authors: 陳良基
Format: Others
Language: en_US
Published: 2005
Online Access: http://ndltd.ncl.edu.tw/handle/10050557960481141470
id ndltd-TW-093NTU05428050
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral === National Taiwan University === Graduate Institute of Electronics Engineering === 93 === Discrete Wavelet Transform (DWT) has led the revolution of block-based image coding and closed-loop video coding systems. In this dissertation, VLSI architectures and memory analysis of DWT in three dimensions are discussed in three parts: One-Dimensional (1-D) DWT, Two-Dimensional (2-D) DWT, and Motion-Compensated Temporal Filtering (MCTF), which performs DWT in the temporal direction. Because 1-D DWT, 2-D DWT, and MCTF are pixel-level, frame-level, and group-of-pictures-level operations, the designs target the processing element, module, and system levels, respectively. The implementation method of 1-D DWT is closely tied to the algorithm view. In Part I of this dissertation, several algorithm views of DWT are surveyed first: two-channel filter bank, polyphase decomposition, lifting scheme, and B-spline factorization. Previous 1-D DWT architectures can be classified as convolution-based or lifting-based. Second, we propose a flipping structure that shortens the critical path of lifting-based DWT architectures without any hardware overhead. Lifting-based architectures are usually adopted because of their lower computational complexity and in-place implementation. However, their critical path is potentially long owing to the serial connection of triangular matrices. Instead of the conventional pipelining technique, which introduces many registers, the flipping structure shortens the critical path by multiplying by the inverses on the timing accumulation path. Case studies of the JPEG 2000 (9,7) filter and the linear (6,10) filter demonstrate the efficiency of the flipping structure. Third, a new category of DWT implementation based on B-spline factorization is proposed, which can use fewer multipliers. For Daubechies wavelets, it is guaranteed to reduce the number of multiplications by about one half compared with convolution-based architectures.
By contrast, the lifting scheme cannot reduce the computational complexity for even linear DWT filters. In case studies of the (6,10), (10,18), and (9,7) filters, the proposed B-spline-based architecture shows superior performance in terms of logic gate count. The 2-D DWT is a frame-based computation, so the performance of a hardware implementation is dominated by external memory bandwidth and internal memory size. In Part II of this dissertation, a detailed survey of scan methods is given first, classifying them into five categories. An overlapped stripe-based scan is proposed to provide a better trade-off in memory requirements. Second, generic line-based 2-D DWT architectures are proposed, which can adopt any kind of 1-D DWT module. For the 1-level 2-D line-based architecture, the line buffer is separated into a data buffer and a temporal buffer. We propose a data flow for the data buffer and a mapping method for the temporal buffer, which together minimize the line buffer size. Two multi-level line-based 2-D DWT architectures are also proposed; both minimize external memory access but differ in hardware utilization. Third, we propose a memory-efficient implementation of line-based 2-D DWT, called the multiple-lifting scheme. The implementation issues of the temporal buffer are discussed first. The proposed multiple-lifting scheme then provides a new implementation method for the temporal buffer. It reduces the temporal buffer access frequency so that the two-port SRAM can be replaced by single-port SRAM. The lower access frequency also decreases the power consumption of the temporal buffer proportionally. Hardware designs for the (9,7) filter, evaluated with the Artisan 0.18 µm cell library and RAM compiler, confirm the area and power reductions. Fourth, an efficient VLSI implementation of 2-D Shape-Adaptive DWT (SA-DWT) is proposed. SA-DWT requires the capability to process boundary extension for very short signal segments.
It is implemented with a stage-based boundary extension strategy and a shape-adaptive boundary handling unit. SA-DWT with the JPEG 2000 lossy (9,7) filter and with the MPEG-4 VTC (9,3) filter is implemented to demonstrate the efficiency. Furthermore, the SA-DWT implementation with the (9,7) filter is fabricated with a core area of 1.68 × 1.68 mm² in a TSMC 0.25 µm process. When operating at 50 MHz, this chip can perform 1-level 2-D SA-DWT in real time for HDTV 1080p sequences at 30 fps. MCTF performs DWT in the temporal direction with Motion Compensation (MC). It has become the core technology in interframe wavelet video coding and in the next-generation video coding standard MPEG SVC. In Part III of this dissertation, the first research work on VLSI architecture and memory analysis of MCTF is presented. First, memory issues of one-level MCTF are discussed. The 5/3 MCTF consists of prediction and update stages. The former is analyzed in terms of macroblock-level and frame-level data reuse schemes separately. After reviewing previous macroblock-level reuse schemes, we propose a new Level C+ scheme that provides a good trade-off between the Level C and Level D schemes. Based on the open-loop nature of the prediction, we propose three frame-level data reuse schemes: Double Reference Frames ME, Double Current Frames ME, and modified Double Current Frames ME. The analysis of 5/3 MCTF is based on the combination of prediction and update stages and covers external memory bandwidth and storage size. Second, system issues of multi-level MCTF are discussed, including computation complexity, external memory bandwidth and storage size, and coding delay. The computation complexities are very similar across MCTF configurations, but the other three system issues differ considerably. They depend on the adopted macroblock- and frame-level reuse schemes, the decomposition level, inter- or intra-coded lowpass frames, and the choice of 5/3 or 1/3 MCTF.
Based on simulation results, the impact of the latter three parameters on coding performance is evaluated. Accordingly, a flexible and efficient system architecture is proposed for multi-level MCTF. It can adapt the temporal prediction structure to any-level 5/3 MCTF, 1/3 MCTF, or Hierarchical B-frames, and even to closed-loop MCP with two reference frames. In summary, this dissertation presents a fast lifting-based architecture, named the flipping structure, and a new design category based on B-spline factorization that can provide the smallest gate count for 1-D DWT processing element design. For 2-D DWT, a generic line-based architecture is proposed that minimizes the on-chip memory and can adopt any kind of DWT module. Furthermore, a memory-efficient implementation of the temporal buffer, called the multiple-lifting scheme, is presented to reduce memory area and access power. In addition, the boundary extension of SA-DWT is handled by the proposed stage-based boundary handling units. As for MCTF, system-level implementation issues are considered. Both block-level and frame-level data reuse schemes are discussed for one-level MCTF. Based on the analysis results, a flexible and efficient system architecture is proposed for multi-level MCTF, which can support many MCTF system configurations.
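The description above refers to the lifting scheme, its in-place implementation, and the prediction and update stages of a 5/3 filter. As a rough illustration only, the following Python sketch applies one level of the reversible integer 5/3 lifting filter (the one used for lossless coding in JPEG 2000) to a 1-D signal and inverts it. It is not the dissertation's architecture: the function names, the even-length restriction, and the mirror handling at the signal edges are assumptions made for this example, and motion compensation, which MCTF adds to the temporal prediction, is not modeled.

```python
def lifting_53_forward(x):
    """One level of the reversible integer 5/3 lifting transform.

    Hypothetical helper for illustration; expects an even-length list of ints.
    Returns (low, high) subbands, each half the input length.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even-length signals only"
    even, odd = x[0::2], x[1::2]
    half = n // 2

    # Prediction stage: each odd sample is predicted from its two even
    # neighbours; the residual becomes the highpass coefficient.
    high = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # mirror at right edge
        high.append(odd[i] - (even[i] + right) // 2)

    # Update stage: each even sample is corrected with the neighbouring
    # residuals to form the lowpass coefficient.
    low = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[i]  # mirror at left edge
        low.append(even[i] + (left + high[i] + 2) // 4)
    return low, high


def lifting_53_inverse(low, high):
    """Undo lifting_53_forward by running the stages in reverse order."""
    half = len(low)
    even = []
    for i in range(half):
        left = high[i - 1] if i > 0 else high[i]
        even.append(low[i] - (left + high[i] + 2) // 4)
    odd = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        odd.append(high[i] + (even[i] + right) // 2)
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


signal = [10, 12, 14, 13, 9, 7, 8, 20]
low, high = lifting_53_forward(signal)
restored = lifting_53_inverse(low, high)
```

Because each stage only adds a function of the other half of the samples, the inverse subtracts exactly the same quantity, so reconstruction is exact even with integer rounding. That property is what makes the scheme suitable for in-place computation.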
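The real-time figure quoted for the fabricated SA-DWT chip (HDTV 1080p at 30 fps with a 50 MHz clock) can be turned into a rough throughput requirement. The short Python calculation below is a back-of-the-envelope sketch: it assumes a 1920 × 1080 frame and one sample per pixel, neither of which is stated in the record, and the helper name is hypothetical.

```python
def required_samples_per_cycle(width, height, fps, clock_hz):
    """Average number of samples that must be consumed per clock cycle."""
    return width * height * fps / clock_hz


# Assumed frame size for "1080p": 1920 x 1080, one sample per pixel.
rate = required_samples_per_cycle(1920, 1080, 30, 50_000_000)
print(f"{rate:.5f} samples per cycle")
```

Under these assumptions the datapath has to sustain a little over one sample per cycle on average; including chroma samples would raise the figure.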
author2 陳良基
author_facet 陳良基
Chao-Tsung Huang
黃朝宗
author Chao-Tsung Huang
黃朝宗
spellingShingle Chao-Tsung Huang
黃朝宗
VLSI Architecture and Analysis of Discrete Wavelet Transform and Motion-Compensated Temporal Filtering
author_sort Chao-Tsung Huang
title VLSI Architecture and Analysis of Discrete Wavelet Transform and Motion-Compensated Temporal Filtering
publishDate 2005
url http://ndltd.ncl.edu.tw/handle/10050557960481141470
spelling ndltd-TW-093NTU05428050 2015-12-21T04:04:05Z http://ndltd.ncl.edu.tw/handle/10050557960481141470 VLSI Architecture and Analysis of Discrete Wavelet Transform and Motion-Compensated Temporal Filtering 針對離散小波轉換以及移動補償式時間濾波之積體電路架構設計與分析 Chao-Tsung Huang 黃朝宗 Doctoral National Taiwan University Graduate Institute of Electronics Engineering 93 陳良基 2005 學位論文 ; thesis 194 en_US