Summary: | Increasing levels of VLSI integration present new opportunities, and new challenges, for designers of high-performance microprocessor-based systems. With more transistors at their disposal, architects are faced with complex decisions regarding processor features, cache hierarchies, and support for several uniprocessor and multiprocessor target systems. In addition, as the speed gap between microprocessors and board-level technology continues to widen, a robust system-level design becomes a critical element for attaining acceptable performance. This dissertation describes STATS, a comprehensive, semi-automated, trade-off analysis toolset. STATS overcomes the limitations of previous approaches by including the processor, cache hierarchy, system interconnect, and main memory designs, technology and architectural considerations, and both uniprocessor and multiprocessor analysis, within a single framework. STATS employs a judicious combination of compilation, execution-driven simulation, analytical modeling, and Spice analysis tools to achieve a reasonable balance of accuracy and analysis time. STATS is used in three architectural investigations. The first, an in-depth analysis of cache hierarchy alternatives for the Alpha 21064A processor design, includes a comparison of employing one, two, or three levels of hierarchy. A detailed analysis demonstrates the importance of precisely characterizing all aspects of cache hierarchy design, including traffic rates, miss ratios, cycle time, latency, and bandwidth, to avoid incorrect design decisions. The second explores trade-offs in the design of a next-generation 8-way superscalar microprocessor-based workstation. Among the conclusions are that accepting a smaller L1 Dcache in exchange for more arithmetic units provides the best overall performance, and that only marginal performance gains are obtained by using the package pins for an L3 cache rather than a direct main memory connection.
Novel mechanisms for multi-porting L1 Dcaches and pipelining large, on-chip L2 caches are shown to achieve up to an 81% performance improvement over conventional methods. The third investigation concerns the cluster design of CC-NUMA multiprocessors using the 8-way superscalar microprocessor. The results demonstrate that integrating the main memory controller onto the microprocessor die considerably reduces bus utilization and improves multiprocessor performance by as much as 35%. Interleaving alternatives for the distributed main memory are explored, as well as options for managing bus utilization in future cluster designs.
|