Implementations of TSQR for Cloud Platforms and Its Applications of SSVD and Collaborative Filtering

Bibliographic Details
Main Authors: Yu, Hsiu-Cheng, 余修丞
Other Authors: Lee, Che-Rung
Format: Others
Language: en_US
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/01260555691292068543
Description
Summary: Master's === National Tsing Hua University === Department of Computer Science === 103 === Scalability of algorithms and implementations, meaning that computational efficiency is sustained as more machines are added, is one of the most crucial performance factors in big data processing. Nowadays, the number of machines and the amount of storage can easily be extended to match the growth of data size. However, without scalable algorithms, adding machines can even slow down data processing. In this thesis, we investigated and improved the scalability of the algorithms and implementations of the QR decomposition for tall-and-skinny matrices on cloud platforms. Our algorithm is based on the TSQR (Tall-and-Skinny QR) algorithm proposed by Demmel et al., which has been shown to be optimal in communication cost for the QR decomposition of tall-and-skinny matrices. However, our analysis shows that disk I/O dominates the performance of the MapReduce implementation. Therefore, we implemented it using Apache Spark, an in-memory processing programming model for distributed computing environments. We applied our TSQR implementation to SSVD-based Collaborative Filtering (CF). CF is a computational kernel commonly used in e-commerce, for example in Amazon recommendations, Google Ads, and Facebook friend suggestions. SSVD-based CF has superior performance and accuracy compared to existing methods. However, its performance bottleneck is the QR decomposition within the SSVD (Stochastic SVD) step. Experiments show that our implementation of TSQR in Spark is more efficient than the one in Hadoop MapReduce, and that the overall performance of TSQR can be improved by up to 400% for several benchmarks.
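The TSQR idea mentioned in the summary can be illustrated with a minimal single-process sketch; it is not the thesis's Spark or MapReduce code. The matrix is split into row blocks, each block is factored independently (the part that would run in parallel), and the small R factors are stacked and factored once more. The function names `qr_r_factor` and `tsqr_r` are hypothetical, only the R factor is computed, and modified Gram-Schmidt is swapped in for the Householder QR that a production implementation would more likely use. The sketch assumes every block has full column rank.

```python
import math
import random


def qr_r_factor(rows):
    """Return the n-by-n upper-triangular R of a QR factorization of `rows`
    (a list of m rows with m >= n), computed with modified Gram-Schmidt.
    Assumes full column rank; the diagonal of R is positive."""
    m, n = len(rows), len(rows[0])
    # cols[j] holds the j-th column as a list of length m.
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        # Remove the component along q from every remaining column.
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def tsqr_r(rows, block_size):
    """One-level TSQR sketch: factor each row block locally, then factor
    the stacked local R factors to obtain the R factor of the full matrix."""
    n = len(rows[0])
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    # Merge a too-short trailing block so every block has at least n rows.
    if len(blocks) > 1 and len(blocks[-1]) < n:
        blocks[-2:] = [blocks[-2] + blocks[-1]]
    # "Map" step: independent local factorizations (parallelizable).
    local_rs = [qr_r_factor(block) for block in blocks]
    # "Reduce" step: only the small n-by-n factors are combined.
    stacked = [row for local_r in local_rs for row in local_r]
    return qr_r_factor(stacked)


# Illustrative data: a random 1000-by-4 tall-and-skinny matrix.
random.seed(0)
a = [[random.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
r_tsqr = tsqr_r(a, block_size=250)
r_direct = qr_r_factor(a)
max_diff = max(
    abs(x - y) for rx, ry in zip(r_tsqr, r_direct) for x, y in zip(rx, ry)
)
```

Because both results are normalized to a positive diagonal, `r_tsqr` and `r_direct` should agree up to rounding error. In a distributed setting only the n-by-n factors would need to move between machines, which is the source of the low communication cost the summary refers to.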