Summary: | Doctoral dissertation === National Chiao Tung University === Department of Electronics Engineering === 94 === Today's wireless and multimedia applications demand billions of operations per second. Owing to advances in IC technology, it is not difficult to fabricate tens to hundreds of arithmetic units in a processor running at hundreds of MHz to a few GHz to achieve the required performance. However, the complexity of data generation and of coordinating and synchronizing the operations of these parallel arithmetic units is prohibitive in most embedded systems. This dissertation first studies microarchitectural techniques that reduce the communication complexity of parallel arithmetic units. We propose a simple inter-cluster communication (ICC) mechanism with load/store instruction pairs and a novel distributed and ping-pong register organization for digital signal processors. In our experiments in the UMC 0.13 μm 1P8M Copper Logic Process, area and timing are reduced by 76.8% and 46.9%, respectively. This dissertation also studies very long instruction word (VLIW) execution schemes with improved code density. We propose a unified VLIW encoding scheme that combines flexible variable-length instruction encoding, NOP removal, and automatic instruction replication to improve code density. In our simulations with both hand-optimized and compiled code, the proposed approach reduces code size by 74.0% to 75.9%. Finally, a complete VLIW DSP with the proposed improvements is implemented and verified through instruction set simulation in C++, microarchitecture exploration in SystemC, FPGA prototyping, and chip tapeout. The silicon implementation in the UMC 0.13 μm 1P8M Copper Logic Process operates at 333 MHz. Its core size is 3.2 mm × 3.15 mm, including 128 KB of data memory and 32 KB of instruction memory. The average power consumption is 189 mW.
|