Summary: | Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Multimedia applications have become important workloads for modern computer systems. The latest video coding standard, H.264/AVC, adopts many coding tools that improve coding efficiency and visual quality but also substantially increase implementation complexity. The growing computation and storage requirements make real-time video playback on general-purpose processors (GPPs) challenging. In this thesis, I study and analyze the performance of a software implementation of the H.264/AVC decoder on GPPs in order to identify the performance bottlenecks of running the decoder on a modern GPP. Understanding the characteristics of the H.264 decoder makes it possible to tune both the processor architecture and the software implementation for performance. I analyze three important program characteristics: the intrinsic available instruction-level parallelism (ILP), program locality, and control flow predictability. Furthermore, I investigate which application features (sequence content, resolution, bitrate) and which newly added coding tools (multiple reference frames, CABAC) have a direct impact on performance. I adopt a simulation-based approach to workload characterization, which allows the design space to be explored thoroughly and different architectural enhancements to be evaluated. The important findings of this study include the following. 1) The H.264 decoder exhibits significant instruction-level parallelism. 2) H.264 decoding is computation-bound rather than memory-bound, because block-level data reuse can be captured by the data cache. 3) H.264 decoding has poor branch predictability because of nested loops and content-dependent branches. Loop unrolling and an absolute-value instruction can reduce branch stall time significantly. 4) Among application features, video content with low motion and lower resolution offers more opportunities for inter-frame prediction, which in turn raises cache miss rates. A higher bitrate increases the execution time of entropy decoding. 
Among the newly added coding tools, multiple reference frames have no direct impact on cache performance, since inter-frame reuse cannot be captured in the data cache. CABAC has lower control flow predictability than CAVLC because of its bit-wise access to the bitstream.
|
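As a rough illustration of the third finding (loop unrolling and an absolute-value operation reducing branch stalls), the following C sketch contrasts a looped, branching sum of absolute differences over four pixels with an unrolled, branch-free version. The function names `abs_branchy`, `abs_branchless`, `sad_row4_loop`, and `sad_row4_unrolled` are hypothetical and are not taken from the thesis or from any particular H.264 decoder. The mask-based absolute value is only one common software stand-in for a hardware absolute-value instruction and is assumed here for illustration.

```c
#include <stdint.h>

/* Branching form: one data-dependent (content-dependent) branch per call. */
int32_t abs_branchy(int32_t x) {
    if (x < 0) {
        return -x;
    }
    return x;
}

/* Branch-free form (assumption: stands in for an absolute-value instruction).
 * mask is 0 for x >= 0 and -1 (all ones) for x < 0.
 * Not valid for INT32_MIN, whose magnitude does not fit in int32_t. */
int32_t abs_branchless(int32_t x) {
    int32_t mask = -(int32_t)(x < 0);
    return (x ^ mask) - mask;
}

/* Looped version: a loop branch plus a data-dependent branch per pixel. */
int sad_row4_loop(const uint8_t *a, const uint8_t *b) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += abs_branchy((int32_t)a[i] - (int32_t)b[i]);
    }
    return sum;
}

/* Unrolled version: the loop branch is removed by writing out all four
 * iterations, and the per-pixel branch is replaced by the branch-free abs. */
int sad_row4_unrolled(const uint8_t *a, const uint8_t *b) {
    int sum = 0;
    sum += abs_branchless((int32_t)a[0] - (int32_t)b[0]);
    sum += abs_branchless((int32_t)a[1] - (int32_t)b[1]);
    sum += abs_branchless((int32_t)a[2] - (int32_t)b[2]);
    sum += abs_branchless((int32_t)a[3] - (int32_t)b[3]);
    return sum;
}
```

Both versions return the same value for the same inputs; the difference lies only in how many conditional branches the processor has to predict. A modern compiler may already perform either transformation on its own, so any gain would need to be measured on the target processor.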