Summary: | PhD === National Tsing Hua University === Department of Computer Science === 101 === To meet continuously growing performance requirements, modern digital signal processing (DSP) processors are commonly equipped with subword instructions that accelerate signal processing workloads such as audio processing and video encoding/decoding. Beyond subword instructions, very long instruction word (VLIW) DSP processors can add multiple functional units to process multiple data streams in parallel, further increasing computation power. However, because of power and area concerns, many embedded VLIW DSP processors adopt distributed register file designs, which give each cluster of functional units its own private register file in order to reduce the number of register-file read/write ports and the wiring between register files and functional units. Distributed register file designs impose several access constraints on register files, and they present great challenges to compilers and assembly programmers in distributing single instruction, multiple data (SIMD) workloads across the clustered functional units of VLIW processors. To support VLIW DSP processors with distributed register files, several compiler phases have to take the register-file access constraints into account and minimize their impact. In this dissertation, we address the problem of supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Industry practice has adopted intrinsics to let developers exploit hardware resources and compete with hand-coded assembly in performance. However, providing such a solution for VLIW DSP processors with distributed register files remains an open issue.
In this work, we propose a SIMD intrinsics design that allows programmers to write highly optimized code by following our programming guides. The intrinsics let programmers parallelize C/C++ programs with direct access to the DSP subword instructions and clustered functional units of VLIW DSP processors. We also propose collaborative compiler optimizations that enable efficient code generation for SIMD programs written with the intrinsics. The collaborative optimizations include (1) a register-file assignment scheme, applied before conventional register allocation, that attempts to avoid register-file access constraints by assigning data to the proper register files, and (2) two data replication techniques that reduce inter-cluster communication overhead and avoid register-file constraints in highly optimized SIMD programs written with intrinsics. In our experiments, we use the DSPstone benchmark and H.264 kernels to evaluate the proposed intrinsics programming and compiler optimization scheme. The intrinsics support and compiler optimizations are implemented in an Open64 compiler that has been optimized for a VLIW DSP processor with distributed register files. We rewrite the DSPstone benchmark and the H.264 kernels with the SIMD intrinsics by following the programming guides. The results show that the intrinsics and compiler optimizations achieve speedups of 2.9 and 3.5 for the DSPstone benchmark and the H.264 kernels, respectively. In addition to the performance improvement over the original C programs, we also provide a performance comparison between hand-coded assembly and C programs with SIMD intrinsics for the H.264 kernels.
|