Summary: In this thesis, the performance and energy efficiency of four different implementations of matrix multiplication, written in OmpSs and OpenCL, are tested and evaluated. The benchmarking is done on an Intel Ivy Bridge Core i7-3770K. The results are evaluated and discussed with respect to different optimization configurations, such as vectorization and multi-threading. Energy measurements are taken using PAPI, which in turn uses the Running Average Power Limit (RAPL) interface of the Intel processor to obtain energy readings. Performance is reported in MFLOPS, while energy efficiency is compared using MFLOPS/W, power draw in watts, the energy-delay product, and the energy-delay-squared product. The OpenCL versions are compared with and without vectorization. One of the OmpSs applications is also evaluated with respect to vectorization as well as the number of threads. The last OmpSs version uses the BLAS implementation ATLAS, which is already vectorized, and is therefore only compared with respect to the number of threads. SSE and AVX vectorization are shown to significantly improve performance while using little to no extra energy per second for all implementations. Multi-threading also yields higher performance; however, it consumes more energy. With ATLAS, running with eight threads was shown to consume more energy while performing worse than running with four. The OmpSs version using ATLAS was both the fastest and the most energy efficient, peaking at 125 GFLOPS and 2.7 GFLOPS/W when running with four threads and using AVX.
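For clarity, the derived energy metrics compared above are assumed to follow their conventional definitions: with E the energy consumed in joules and t the execution time (delay) in seconds, the energy-delay product is EDP = E * t and the energy-delay-squared product is ED2P = E * t^2, while MFLOPS/W is the achieved MFLOPS divided by the average power draw in watts.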
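As context for the measurement method mentioned above, the following is a minimal sketch of how package energy can be read through PAPI's RAPL component. It is not the thesis's measurement code; the event name "rapl:::PACKAGE_ENERGY:PACKAGE0", the nanojoule scaling, and the error handling are assumptions about a typical PAPI installation with the RAPL component enabled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long energy_nj = 0;

    /* Initialize the PAPI library and create an empty event set. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(EXIT_FAILURE);
    if (PAPI_create_eventset(&eventset) != PAPI_OK)
        exit(EXIT_FAILURE);

    /* Add a RAPL package-energy event; the exact event name depends
     * on the PAPI version and the machine's RAPL configuration. */
    if (PAPI_add_named_event(eventset, "rapl:::PACKAGE_ENERGY:PACKAGE0") != PAPI_OK)
        exit(EXIT_FAILURE);

    PAPI_start(eventset);
    /* ... run the matrix multiplication kernel being measured ... */
    PAPI_stop(eventset, &energy_nj);

    /* The RAPL component is assumed to report energy in nanojoules. */
    printf("Package energy: %.3f J\n", energy_nj / 1e9);
    return 0;
}
```

Dividing the energy reading by the kernel's wall-clock time gives the average power in watts, from which MFLOPS/W, EDP, and ED2P can be derived as defined above.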