Summary: | Master's Thesis === National Tsing Hua University === Department of Computer Science === 103 === Scalability of algorithms and implementations — the property that computational efficiency is sustained as more machines are added — is one of the most crucial performance factors in big data processing. Nowadays, the number of machines and the amount of storage can easily be extended to match the growth of data. However, without scalable algorithms, adding machines can even slow down data processing.
In this thesis, we investigated and improved the scalability of algorithms and implementations of QR decomposition for tall-and-skinny matrices on cloud platforms. Our algorithm is based on TSQR (Tall-and-Skinny QR), proposed by Demmel et al., which has been shown to be communication-optimal for QR decomposition of tall-and-skinny matrices. However, our analysis shows that disk I/O dominates the overall performance of the MapReduce implementation. Therefore, we implemented TSQR using Apache Spark, an in-memory processing programming model for distributed computing environments.
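To illustrate the structure of TSQR described above, the following is a minimal single-level sketch in NumPy. It is not the thesis's Spark implementation: the function name `tsqr`, the `num_blocks` parameter, and the use of `numpy.array_split` are illustrative choices; a distributed version would perform the local QRs on separate machines (the "map" step) and the second-level QR as a reduction.

```python
import numpy as np

def tsqr(A, num_blocks=4):
    """Single-reduction TSQR sketch for a tall-and-skinny matrix A.

    Each row block is QR-factored locally; the stacked small R factors
    are factored once more, and the second-level Q is folded back into
    the local Q factors. Assumes each block has at least A.shape[1] rows.
    """
    n = A.shape[1]
    blocks = np.array_split(A, num_blocks, axis=0)
    # Local QR on each row block (done in parallel in a distributed run).
    local = [np.linalg.qr(b) for b in blocks]
    # Stack the n-by-n R factors and factor them once more (the reduction).
    R_stack = np.vstack([r for _, r in local])
    Q2, R = np.linalg.qr(R_stack)
    # Propagate the second-level Q back into each block's local Q.
    Q2_blocks = np.array_split(Q2, num_blocks, axis=0)
    Q = np.vstack([q_loc @ q2 for (q_loc, _), q2 in zip(local, Q2_blocks)])
    return Q, R
```

The communication-optimal property comes from the reduction step exchanging only the small n-by-n R factors rather than the tall blocks themselves.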
We applied our TSQR implementation to SSVD-based Collaborative Filtering (CF). CF is a computational kernel commonly used in e-commerce, such as Amazon recommendations, Google Ads, and Facebook friend suggestions. SSVD-based CF has superior performance and accuracy compared to existing methods; however, the QR decomposition step within the SSVD (Stochastic SVD) computation is a performance bottleneck. Experiments show that our implementation of TSQR in Spark is more efficient than that in Hadoop MapReduce, and the overall performance of TSQR is improved by up to 400% on several benchmarks.
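To show where the QR bottleneck sits inside SSVD, here is a hedged NumPy sketch of a basic randomized SVD. The function name `ssvd`, the `n_oversample` parameter, and the single-pass structure are assumptions for illustration (practical SSVD variants add power iterations); the marked QR line is the step where a distributed TSQR would be substituted, since the sample matrix `Y` is tall and skinny.

```python
import numpy as np

def ssvd(A, rank, n_oversample=10, rng=None):
    """Basic randomized (stochastic) SVD sketch of an m-by-n matrix A."""
    rng = np.random.default_rng(rng)
    k = rank + n_oversample
    # Random projection captures an approximate column space of A.
    Omega = rng.standard_normal((A.shape[1], k))
    Y = A @ Omega                # tall-and-skinny sample matrix
    Q, _ = np.linalg.qr(Y)       # the QR step where TSQR applies
    # Project A onto the small subspace and take an exact SVD there.
    B = Q.T @ A                  # small k-by-n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :rank], s[:rank], Vt[:rank]
```

Everything after the QR step operates on small k-by-n matrices, which is why the QR of `Y` dominates the cost at scale.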
|