Summary: | Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === Recent x86 microprocessors rely heavily on superscalar techniques, which raise performance by executing multiple instructions in parallel. To exploit more of the instruction-level parallelism in current commercial programs on x86 superscalar microprocessors, we study the topics that are critical to achieving a high issue rate on x86. These topics include
i) instruction decoding at a high issue rate, and
ii) predictive data load/store scheduling,
both of which differ considerably from their counterparts in RISC processors. Furthermore, to build an efficient simulation environment for x86 research, we develop
iii) single-pass trace simulation techniques for the x86 superscalar micro-architecture.
In the first topic, high issue rate decoding, we examine strategies for translating x86 instructions into primitive operations (POPs) and the decoding rules needed to achieve a higher degree of parallel execution. The semantics of x86 instructions can be too complex, so the decoders need to translate the instructions into POPs. There are two POP translation strategies: one merges address generation into the load/store operations, and the other uses separate address generation operations. Simulation results show that, in high issue rate decoders, the latter strategy improves performance by 20% to 25%. In addition, we find that equipping the UMAB with the ability to snoop the result buses exposes a still higher degree of parallel execution. Weighing hardware cost against performance, we recommend a cost-effective decoding rule.
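The two translation strategies can be contrasted with a minimal sketch. The POP mnemonics (`AGEN`, `LOAD`, `ADD`), the temporary register names, and the helper `translate_add_reg_mem` are hypothetical and chosen only for illustration; the thesis's actual POP set is not given in this abstract.

```python
from dataclasses import dataclass


@dataclass
class MemOperand:
    """An x86-style memory operand: base + index*scale + disp."""
    base: str
    index: str
    scale: int
    disp: int

    def expr(self) -> str:
        return f"{self.base}+{self.index}*{self.scale}+{self.disp}"


def translate_add_reg_mem(dest: str, mem: MemOperand, separate_agen: bool) -> list:
    """Translate `ADD dest, [mem]` into a list of hypothetical POPs.

    separate_agen=False: address generation is merged into the load.
    separate_agen=True:  address generation is its own POP.
    """
    if separate_agen:
        return [
            f"AGEN t0, {mem.expr()}",   # compute the effective address
            "LOAD t1, [t0]",            # load using the precomputed address
            f"ADD {dest}, {dest}, t1",  # perform the arithmetic
        ]
    return [
        f"LOAD t1, [{mem.expr()}]",     # load computes its own address
        f"ADD {dest}, {dest}, t1",
    ]


if __name__ == "__main__":
    operand = MemOperand("EBX", "ESI", 4, 8)
    for flag in (False, True):
        print(translate_add_reg_mem("EAX", operand, separate_agen=flag))
```

In this sketch the separate strategy emits one more POP per memory instruction, but the address computation no longer has to wait inside the load, which gives the scheduler an extra operation it can issue independently.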
In the second topic, predictive load/store scheduling, we develop several predictive scheduling policies for loads and stores that are suited to x86 superscalar processors. The proportion of memory access instructions in x86 programs is relatively high, so exploiting parallelism among memory accesses becomes crucial at a high superscalar degree. Traditional prediction techniques developed for RISC processors suffer from a longer misprediction penalty on x86 and therefore do not work effectively there. To increase prediction accuracy, we develop new address and dependency prediction policies. We improve dependency prediction by adding forwarding prediction, refining the predictions with a 2-bit counter, and filtering out predictions that are likely to be wrong with a second 2-bit counter. To reduce the misprediction penalty, we consider the stage at which prediction takes place and the strategies for handling loaded data. Experimental results show that, by reducing the misprediction penalty and increasing prediction accuracy, the predictive scheduling proposed in this work significantly improves performance.
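The 2-bit counter refinement mentioned above can be illustrated with a minimal sketch of a saturating counter. The class name, the initial state, the prediction threshold, and the per-address table are assumptions made for illustration, not the thesis's predictor design.

```python
class TwoBitCounter:
    """Saturating counter with states 0..3; states 2 and 3 predict 'dependent'."""

    def __init__(self, state: int = 1):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, outcome: bool) -> None:
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if outcome:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes) -> int:
    """Count correct predictions over (load_pc, was_dependent) pairs."""
    table = {}  # hypothetical table indexed by the load's program counter
    correct = 0
    for pc, was_dependent in outcomes:
        counter = table.setdefault(pc, TwoBitCounter())
        if counter.predict() == was_dependent:
            correct += 1
        counter.update(was_dependent)
    return correct
```

Because the counter needs two consecutive contrary outcomes to flip its prediction, a single atypical execution of a load does not disturb an otherwise stable prediction.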
In the third topic, single-pass trace simulation, we develop single-pass techniques to build an efficient trace-driven simulator for whole x86 superscalar processors. Single-pass trace simulation techniques were developed to evaluate many sets of design configurations in one simulation run. However, these techniques are suited only to storage structures that have the inclusion property. The major difficulty in this topic is that neither the pipeline with its out-of-order mechanism nor the branch target buffer (BTB) exhibits the inclusion property. We therefore develop single-pass simulation techniques for the BTB and for the out-of-order mechanism separately. For the out-of-order mechanism, we put the incoming instructions in a unified instruction progression queue and enumerate the possible pipeline states in a pipeline state vector. For the BTB, the difficulty arises because the prediction information it holds has no inclusion property. We propose the state vector method and the state link method to overcome this difficulty. The state vector method enumerates the states of the various possibilities, whereas the state link method records only the locations in the state vector where the states change. By integrating our single-pass simulation of the out-of-order mechanism and the BTB with the traditional single-pass simulation of caches, our simulator becomes a platform for complete single-pass simulation of whole x86 superscalar microprocessors. When 10 sets of configurations are evaluated, the single-pass simulation is 4.15 times faster than conventional simulation in terms of simulation time.
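To make the inclusion property concrete, the following minimal sketch shows the traditional single-pass idea for the simplest case, fully associative LRU caches of several sizes. It is not the thesis's technique for the BTB or the out-of-order pipeline; the function name and the trace format are assumptions for illustration.

```python
def single_pass_lru(trace, sizes):
    """Count hits for several fully associative LRU cache sizes in one pass.

    Under LRU, a cache of k blocks always holds a subset of what a cache of
    k+1 blocks holds (the inclusion property). A reference found at 1-based
    depth d of the LRU stack therefore hits in every cache with size >= d.
    """
    stack = []  # most recently used block at index 0
    hits = {size: 0 for size in sizes}
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr) + 1
            for size in sizes:
                if depth <= size:
                    hits[size] += 1
            stack.remove(addr)
        stack.insert(0, addr)  # the referenced block becomes most recent
    return hits


if __name__ == "__main__":
    print(single_pass_lru([1, 2, 3, 1, 2, 3, 4, 1], sizes=[1, 2, 3, 4]))
```

A structure without this property, such as the BTB described above, cannot be handled by a single shared stack, which is why separate state-tracking methods are needed.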
We further apply the state vector method to the single-pass simulation of multi-processor (MP) cache coherence protocols. By inserting a bubble state to imitate the inclusion property, we develop a single-pass simulation that measures not only performance, as traditional approaches do, but also the bus traffic of MP caches with various coherence protocols and sizes.
With the critical topics discussed in this dissertation addressed, an efficient simulation environment, and hence a high issue rate x86 micro-architecture, can be built. We hope the efforts in this research contribute to the design of future high issue rate x86 microprocessors.
|