Summary: | Ph.D. === National Taiwan University === Graduate Institute of Electronics Engineering === 94 === Temporal prediction is the most critical component of a video coding system: it not only significantly improves coding performance but also dominates the computational resources the system requires. Because of its huge computational complexity and large memory bandwidth requirements, hardware acceleration is a must for temporal prediction, and such acceleration is the core of this dissertation. There are four major hardware design challenges in temporal prediction. The first is the architecture design of the processing elements (PEs), driven by the huge computational complexity. The second is the data reuse strategy, necessitated by the large memory bandwidth. The on-chip memory arrangement is another design challenge, required to satisfy the memory demands of the PEs and of the data reuse strategy. Scheduling is the last one: irregular memory access usually reduces the utilization of memory bandwidth, so a schedule that guarantees regular memory access is important. These four design challenges are related to three system issues: hardware area, system memory bandwidth, and system memory size. Different systems place different constraints and weights on these issues, so different design strategies are required. In the following, we classify temporal prediction into three categories for discussion: local motion estimation (LME), global motion estimation (GME), and motion-compensated temporal filtering (MCTF). We not only overcome the four design challenges but also provide different design strategies for different systems.
In the first part of LME, we focus on the architecture design for VBSME. Among the many methods to support VBSME, the most efficient is to use the SADs of the smallest blocks to derive those of the larger blocks. With this method, the overhead of VBSME in different architectures depends on the data flow of the partial sums of absolute differences (SADs). We classify the data flows of partial SADs into three types: stored in registers of the PEs, propagated through propagation registers, and no partial SADs. Among the three, the first requires the largest VBSME overhead and the last requires the smallest. In the second part of LME, we discuss the data reuse strategy and propose a macroblock-level data reuse scheme, the Level C+ scheme, in which the overlapped search region is fully reused in the horizontal direction and partially reused in the vertical direction. Compared to the Level C scheme, the Level C+ scheme with the corresponding scan order saves 46% of memory bandwidth with only a 12% increase in on-chip memory size, for HDTV 720p with a search range of [-128, 128).
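The SAD-merging method for VBSME described above can be illustrated with a minimal software sketch: compute the SADs of the sixteen 4x4 sub-blocks of a macroblock once, then derive every larger block size by summing groups of smaller-block SADs instead of re-reading pixels. The function names below are illustrative, not taken from the dissertation's hardware design.

```python
def sad_4x4_grid(cur, ref):
    """SADs of all sixteen 4x4 sub-blocks of a 16x16 macroblock.

    cur, ref: 16x16 lists of pixel values (current block, reference candidate).
    Returns a 4x4 grid where grid[i][j] is the SAD of sub-block (i, j).
    """
    grid = [[0] * 4 for _ in range(4)]
    for y in range(16):
        for x in range(16):
            grid[y // 4][x // 4] += abs(cur[y][x] - ref[y][x])
    return grid

def merge_2x2(grid):
    """Derive the SADs of blocks twice as large by adding 2x2 groups of
    smaller-block SADs -- no pixel data is touched again."""
    n = len(grid) // 2
    return [[grid[2 * i][2 * j] + grid[2 * i][2 * j + 1] +
             grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]
             for j in range(n)] for i in range(n)]
```

Usage: `s4 = sad_4x4_grid(cur, ref)` gives the 4x4 SADs; `s8 = merge_2x2(s4)` yields the four 8x8 SADs, and `merge_2x2(s8)[0][0]` the 16x16 SAD. In hardware, where these partial sums are held (PE registers, propagation registers, or nowhere) determines the VBSME overhead, as discussed above.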
In the GME part, we use the GME architecture to discuss memory arrangement and scheduling. The major design challenges of GME are the irregular memory access caused by scaling and rotation, and the memory access requirements of the interpolation and differential values. We propose reference-based scheduling to eliminate the irregular memory access and adopt an interleaved memory arrangement to satisfy the memory access requirements. Finally, a hardware accelerator for GME is implemented; it requires 131 K gates with 7.9 Kbits of memory and can process MPEG-4 ASP@L3 in real time at 30 MHz. Compared to previous work, the proposed architecture requires much less on-chip memory and memory bandwidth.
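To give intuition for why an interleaved memory arrangement helps here: sub-pixel interpolation needs a small window of neighboring pixels per access (e.g. a 2x2 window for bilinear interpolation), and if each neighbor lives in a different memory bank, all of them can be fetched in one cycle with no bank conflict. The 2x2 interleaving pattern and function names below are an illustrative assumption, not necessarily the exact arrangement used in the dissertation.

```python
def bank_of(x, y):
    """Illustrative 2x2 interleaving: pixel (x, y) is assigned to one of
    four banks according to the parity of its coordinates."""
    return (y & 1) * 2 + (x & 1)

def bilinear_window_banks(x0, y0):
    """Banks touched by the 2x2 pixel window anchored at integer position
    (x0, y0), as needed for bilinear interpolation at a sub-pixel point."""
    return {bank_of(x0 + dx, y0 + dy) for dy in (0, 1) for dx in (0, 1)}
```

For any anchor position, the window touches all four banks exactly once, so the four interpolation inputs are available simultaneously regardless of where scaling or rotation lands the access.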
In the MCTF part, frame-level data reuse and the hardware architecture of MCTF are the two issues. The former focuses on data reuse strategies, and the latter involves all four design challenges. Frame-level data reuse means that system memory size can be spent to further reduce the required system memory bandwidth. We develop a methodology for frame-level data reuse analysis and evaluate the tradeoffs between on-chip memory size and system memory usage. In the second part, we present the first hardware accelerator for MCTF, which is also a computation-aware engine. By adopting the frame-level data reuse schemes, 20%--42% of the memory bandwidth of the prediction stages can be saved across different coding schemes. A new MB-pipelining scheme is developed to reduce the data buffer overhead of frame-level data reuse. As for the update stage, the proposed techniques save 50% of memory bandwidth and 75% of hardware cost. The reconfigurable concept is also adopted in the on-chip memory to support the different data reuse strategies. The proposed accelerator can process CIF format with a search range of [-32, 32) in real time at 60 MHz. In total, six coding schemes are supported, so the accelerator can adapt to dynamic system resource constraints by selecting a suitable coding scheme. The implemented chip is 3.82 mm x 3.57 mm in TSMC 0.18 um technology.
In brief, this dissertation studies the analysis and VLSI architecture of temporal prediction methods in video coding standards. By classifying data flows and adopting various data-processing viewpoints, new schedules, architectures, and data reuse schemes are developed for different systems. For architecture design, we not only analyze the impact of VBSME on hardware architectures but also propose a new MB pipeline to eliminate the overhead of frame-level data reuse. For data reuse strategies, we discuss data reuse schemes from the MB level to the frame level and provide various tradeoffs between on-chip memory size and system memory usage. For on-chip memory arrangement, we adopt the interleaved and reconfigurable concepts to satisfy the memory requirements of the PEs and the data reuse strategies. For scheduling, reference-based scheduling is developed to solve the irregular memory access problem.
|