Implementations of TSQR for Cloud Platforms and Its Applications of SSVD and Collaborative Filtering

Master's === National Tsing Hua University === Department of Computer Science === 103 === Scalability, the ability of algorithms and implementations to sustain computational efficiency as machines are added, is one of the most crucial performance factors in big data processing. Nowadays, the number of machines and the amount of storage can be extended to match...


Bibliographic Details
Main Authors: Yu, Hsiu-Cheng, 余修丞
Other Authors: Lee, Che-Rung
Format: Others
Language: en_US
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/01260555691292068543
id ndltd-TW-103NTHU5392002
record_format oai_dc
spelling ndltd-TW-103NTHU5392002 2016-12-19T04:14:35Z http://ndltd.ncl.edu.tw/handle/01260555691292068543 Implementations of TSQR for Cloud Platforms and Its Applications of SSVD and Collaborative Filtering 以雲端平台實作瘦長QR分解及其應用 Yu, Hsiu-Cheng 余修丞 Master's National Tsing Hua University Department of Computer Science 103 Lee, Che-Rung 李哲榮 2014 thesis 70 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Tsing Hua University === Department of Computer Science === 103 === Scalability, the ability of algorithms and implementations to sustain computational efficiency as machines are added, is one of the most crucial performance factors in big data processing. Nowadays, the number of machines and the amount of storage can easily be extended to match the growth of data size. However, without scalable algorithms, adding machines can even slow down data processing. In this thesis, we investigated and improved the scalability of algorithms and implementations of the QR decomposition for tall-and-skinny matrices on cloud platforms. Our algorithm is based on the TSQR (Tall-and-Skinny QR) algorithm, proposed by Demmel et al., which has been shown to be optimal in communication cost for the QR decomposition of tall-and-skinny matrices. However, our analysis shows that disk I/O dominates the performance of the MapReduce implementation. Therefore, we implemented it using Apache Spark, an in-memory processing programming model for distributed computing environments. We applied our TSQR implementation to SSVD-based Collaborative Filtering (CF). CF is a computational kernel commonly used in e-commerce, for example in Amazon recommendations, Google Ads, and Facebook friend suggestions. SSVD-based CF has superior performance and accuracy compared to existing methods. However, the QR decomposition step within the SSVD (Stochastic SVD) is a performance bottleneck. Experiments show that our implementation of TSQR in Spark is more efficient than the one in Hadoop MapReduce, and that the overall performance of TSQR can be improved by up to 400% for several benchmarks.
author2 Lee, Che-Rung
author_facet Lee, Che-Rung
Yu, Hsiu-Cheng
余修丞
author Yu, Hsiu-Cheng
余修丞
title Implementations of TSQR for Cloud Platforms and Its Applications of SSVD and Collaborative Filtering
publishDate 2014
url http://ndltd.ncl.edu.tw/handle/01260555691292068543