Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets
Organizing data into groups with unsupervised learning algorithms such as k-means clustering and Gaussian mixture models (GMMs) is among the most widely used techniques in data exploration and data mining. Because these clustering algorithms are iterative by nature, finding clusters quickly becomes increasingly challenging as datasets grow. The iterative nature of k-means makes it inherently difficult to optimize for modern hardware, especially since pushing data through the memory hierarchy is the main bottleneck in modern systems. On-the-fly unsupervised learning is therefore particularly challenging. In this thesis, we address this challenge by presenting an ensemble of algorithms for hardware-aware clustering, along with a road map for hardware-aware machine learning algorithms. We move beyond simple yet aggressive parallelization, which helps only the embarrassingly parallel parts of the algorithms, by employing data reduction, refactoring of the algorithm, and parallelization through the SIMD instructions of a general-purpose processor. We find that careful engineering with the processor's SIMD instructions and hand-tuning reduces response time by about 4x. Further, by reducing both the data dimensionality and the number of data points through PCA followed by coreset-based sampling, we obtain a highly representative sample of the dataset. Running clustering on the reduced dataset yields a significant speedup: the reduction lowers the cost of k-means by cutting both the number of iterations and the total amount of computation. Last but not least, we can save pre-computed data to compute cluster variations on the fly. Compared to the state of the art using k-means++, our approach offers comparable accuracy while running about 14x faster, by moving less data fewer times through the memory hierarchy.
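The randomized SVD named in the title is not detailed in this record; as background, a minimal sketch of the standard Halko-style construction (Gaussian sketch, QR, SVD of the small projected matrix) is shown below. All function names, the `oversample` parameter, and the example data are illustrative, not taken from the thesis.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k truncated SVD via random projection (generic sketch)."""
    rng = np.random.default_rng(seed)
    # Sketch the range of A with a Gaussian test matrix.
    omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ omega)           # orthonormal basis for the sketch
    # Exact SVD of the much smaller projected matrix B = Q^T A.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Reduce n points in d dimensions to k dimensions before clustering.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 50))
U, s, Vt = randomized_svd(X, k=5)
X_reduced = X @ Vt.T                         # shape (1000, 5)
```

The point of the sketch step is that the expensive SVD runs on a (k + oversample) x d matrix instead of the full n x d data, which is what makes dimensionality reduction cheap enough to pay for itself before clustering.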
Main Author: | Moon, Tarik Adnan |
---|---|
Format: | Others |
Language: | en |
Published: | Harvard University, 2015 |
Subjects: | Applied Mechanics |
Online Access: | http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398541 |
id |
ndltd-harvard.edu-oai-dash.harvard.edu-1-14398541 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-harvard.edu-oai-dash.harvard.edu-1-14398541 2017-07-27T15:51:33Z. Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets. Moon, Tarik Adnan. Applied Mechanics. [Abstract as given in the description field below.] 2015-04-09T13:56:02Z 2015-05 2015-04-08 2015 2015-04-09T13:56:02Z. Thesis or Dissertation. text. application/pdf. Moon, Tarik Adnan. 2015. Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets. Bachelor's thesis, Harvard College. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398541. en. open. http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA. Harvard University |
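The coreset-based sampling mentioned in the abstract is not specified in this record; one common construction is the "lightweight" coreset, which samples points with probability proportional to a mix of uniform mass and squared distance to the data mean, then assigns importance weights. The sketch below follows that generic recipe; the function name and parameters are illustrative, not the thesis's implementation.

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample m weighted points: q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum d^2)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    d2 = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    q = 0.5 / n + 0.5 * d2 / d2.sum()        # sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])             # importance weights for k-means cost
    return X[idx], weights

rng = np.random.default_rng(3)
X = rng.standard_normal((5000, 20))
coreset, w = lightweight_coreset(X, 250)     # 5000 points shrink to 250
```

Because the weights are inverse sampling probabilities, the weighted k-means cost on the coreset is an unbiased estimate of the cost on the full dataset, which is why clustering the small sample can stand in for clustering everything.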
collection |
NDLTD |
language |
en |
format |
Others
|
sources |
NDLTD |
topic |
Applied Mechanics |
spellingShingle |
Applied Mechanics Moon, Tarik Adnan Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
description |
Organizing data into groups with unsupervised learning algorithms such as k-means clustering and Gaussian mixture models (GMMs) is among the most widely used techniques in data exploration and data mining. Because these clustering algorithms are iterative by nature, finding clusters quickly becomes increasingly challenging as datasets grow. The iterative nature of k-means makes it inherently difficult to optimize for modern hardware, especially since pushing data through the memory hierarchy is the main bottleneck in modern systems. On-the-fly unsupervised learning is therefore particularly challenging.
In this thesis, we address this challenge by presenting an ensemble of algorithms for hardware-aware clustering, along with a road map for hardware-aware machine learning algorithms. We move beyond simple yet aggressive parallelization, which helps only the embarrassingly parallel parts of the algorithms, by employing data reduction, refactoring of the algorithm, and parallelization through the SIMD instructions of a general-purpose processor. We find that careful engineering with the processor's SIMD instructions and hand-tuning reduces response time by about 4x. Further, by reducing both the data dimensionality and the number of data points through PCA followed by coreset-based sampling, we obtain a highly representative sample of the dataset.
Running clustering on the reduced dataset yields a significant speedup. This data reduction lowers the cost of the k-means algorithm by cutting both the number of iterations and the total amount of computation. Last but not least, we can save pre-computed data to compute cluster variations on the fly. Compared to the state of the art using k-means++, our approach offers comparable accuracy while running about 14x faster, by moving less data fewer times through the memory hierarchy. |
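The reduce-then-cluster flow the abstract describes (shrink the data, run k-means on the small weighted sample) can be sketched end to end as below. This is a generic weighted Lloyd iteration on a subsample, not the thesis's tuned implementation; the synthetic three-cluster data and the uniform-weight subsample standing in for a coreset are illustrative assumptions.

```python
import numpy as np

def weighted_kmeans(X, w, k, iters=20, seed=0):
    """Plain Lloyd iterations on a weighted (e.g. coreset) sample."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the weighted mean of its members.
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers, labels

# Synthetic data: three well-separated Gaussian clusters in 8 dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (500, 8)) for c in (-3, 0, 3)])
# Stand-in for a coreset: a uniform subsample with unit weights.
sample_idx = rng.choice(len(X), 200, replace=False)
centers, labels = weighted_kmeans(X[sample_idx], np.ones(200), k=3)
```

Running Lloyd iterations on 200 points instead of 1,500 is where the speedup comes from: each iteration touches far less data, so fewer bytes move through the memory hierarchy, which is exactly the bottleneck the abstract targets.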
author |
Moon, Tarik Adnan |
author_facet |
Moon, Tarik Adnan |
author_sort |
Moon, Tarik Adnan |
title |
Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
title_short |
Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
title_full |
Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
title_fullStr |
Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
title_full_unstemmed |
Turning Big Data Into Small Data: Hardware Aware Approximate Clustering With Randomized SVD and Coresets |
title_sort |
turning big data into small data: hardware aware approximate clustering with randomized svd and coresets |
publisher |
Harvard University |
publishDate |
2015 |
url |
http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398541 |
work_keys_str_mv |
AT moontarikadnan turningbigdataintosmalldatahardwareawareapproximateclusteringwithrandomizedsvdandcoresets |
_version_ |
1718507023292694528 |