Summary: | Master's === National Chung Cheng University === Graduate Institute of Electrical Engineering === 93 === In the H.264 video standard, intra and inter prediction are the most important components for increasing the compression ratio. Existing designs implement intra and inter prediction separately, which makes the controller complex. In addition, the intra prediction units of existing designs do not optimize the local memory required to buffer the neighboring reconstructed samples. Moreover, supporting the interlaced video of H.264 Main Profile coding raises the computational complexity of inter and intra prediction, which becomes the bottleneck for real-time, high-quality video applications. The associated operations require substantial memory bandwidth and computation time, which motivates this thesis to pursue a high-performance, low-cost, low-power design for H.264 motion compensation.
The proposed predictive pixel compensator (PPC) for an H.264 decoder IP core combines intra and inter prediction to reduce the complexity of the data-processing and control circuits. The intra prediction generator adopts a shared adder-based architecture that supports all 17 intra prediction modes. Sharing common terms reduces the computation in intra prediction by up to 50%. In addition, a distributed memory access scheme reduces the memory needed to buffer the neighboring pixels by 44%.
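The idea of shared terms can be illustrated with the three-tap smoothing (a + 2b + c + 2) >> 2 that H.264 uses in several directional 4x4 intra modes. The sketch below is only an illustration of term sharing under that assumption, not the adder architecture of the thesis; the function names are hypothetical. Since a + 2b + c = (a + b) + (b + c), each pairwise sum can be computed once and reused by two adjacent outputs.

```python
def filter_direct(p):
    """Compute (a + 2b + c + 2) >> 2 independently for every output sample."""
    return [(p[i] + 2 * p[i + 1] + p[i + 2] + 2) >> 2 for i in range(len(p) - 2)]


def filter_shared(p):
    """Same outputs, but each pairwise sum p[i] + p[i+1] is computed only once.

    Adjacent outputs share one pairwise sum, so the number of distinct
    additions on the input samples is reduced (illustrative, hypothetical helper).
    """
    pair = [p[i] + p[i + 1] for i in range(len(p) - 1)]
    return [(pair[i] + pair[i + 1] + 2) >> 2 for i in range(len(pair) - 1)]


# Example neighboring reconstructed samples (made-up values).
neighbors = [10, 20, 30, 40, 50, 60, 70, 80]
assert filter_direct(neighbors) == filter_shared(neighbors)
```

Both functions return identical predictions; only the number of additions differs, which is the kind of saving a shared-term datapath targets.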
A new inter prediction design is also presented to reduce memory bandwidth and speed up interpolation. By exploiting data reuse between the interpolation windows of neighboring blocks, the proposed inter prediction generator saves about 50% of external memory bandwidth and 23% of local memory access time. In addition, applying 4x4-block-based data parallelism and a mixed six-tap FIR filter architecture to luma interpolation reduces the hardware cost by up to 27% compared with the designs in [8,11]. As a result, the proposed design achieves a throughput of 272K MBs/s at 87 MHz. Synthesis results show a maximum speed of 166 MHz. When synthesized with a clock constraint of 125 MHz, the design costs about 60,854 gates in a 0.18 μm CMOS technology and meets the real-time processing requirement for H.264 decoding of HD1080i video at 30 Hz.
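For reference, the six-tap FIR filter used for H.264 luma half-sample interpolation has taps (1, -5, 20, 20, -5, 1), followed by rounding, a shift by 5, and clipping to 8 bits. The sketch below shows that arithmetic in software for a single one-dimensional half-sample position; it says nothing about the mixed filter architecture or the 4x4-block parallelism of the proposed hardware, and the function names are hypothetical.

```python
def clip8(value):
    """Clip an intermediate result to the 8-bit sample range 0..255."""
    return max(0, min(255, value))


def half_sample(samples):
    """Half-sample luma value between the two middle samples of `samples`.

    `samples` holds six consecutive integer-position luma samples along one
    row or column. Taps: (1, -5, 20, 20, -5, 1); result is rounded and
    divided by 32 via (acc + 16) >> 5, then clipped.
    """
    e, f, g, h, i, j = samples
    acc = e - 5 * f + 20 * g + 20 * h - 5 * i + j
    return clip8((acc + 16) >> 5)


# A flat region interpolates to the same value; a sharp peak is clipped.
flat = half_sample([100, 100, 100, 100, 100, 100])
peak = half_sample([0, 0, 255, 255, 0, 0])
```

Because the taps sum to 32, a constant input passes through unchanged, which is a quick sanity check for any implementation of this filter.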
|