An Efficient and Balanced Platform for Data-Parallel Subsampling Workloads

Bibliographic Details
Main Author: Kambhampati, Satya Sundeep
Language: English
Published: The Ohio State University / OhioLINK 2014
Subjects: Computer Engineering; Computer Science
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=osu1397695501
Abstract: With the advent of internet services, data has been growing faster than it can be processed. To personalize the user experience, this enormous volume of data has to be processed interactively, in real time. To process data faster, a statistical method called subsampling is often adopted; such workloads are called subsampling workloads. Subsampling workloads compute statistics from a set of observed samples using a random subset of the sample data (i.e., a subsample). Data-parallel platforms group these samples into tasks; each task subsamples its data in parallel. Current state-of-the-art platforms such as Hadoop are built for large tasks that run for long periods of time, but applications with smaller average task sizes suffer large overheads on these platforms. Tasks in subsampling workloads are sized to minimize the overall number of cache misses, and these tasks can complete in seconds. This technique can reduce the overall length of a map-reduce job, but only when the savings from the reduced cache miss rate are not eclipsed by the platform overhead of task creation and data distribution. In this thesis, we propose a data-parallel platform with an efficient data distribution component that breaks data-parallel subsampling workloads into compute clusters with tiny tasks. Each tiny task completes in a few hundred milliseconds to a few seconds. Tiny tasks reduce the processor cache misses caused by random subsampling, which speeds up per-task running time. However, they cause significant scheduling overheads and data distribution challenges. We propose a task knee-pointing algorithm and a dynamic scheduler that assigns tasks to worker nodes based on the availability and response times of the data nodes.
Since we know the task size and the number of worker nodes prior to execution, we choose a few initial data nodes that all worker nodes access. Data is fully replicated across these nodes. Based on the response times from the initial set of data nodes, we estimate the cache interference between task execution and data fetch cycles; the replication factor (the number of data nodes) is varied accordingly to meet the SLOs of tiny tasks. In this document, we discuss the challenges of our proposal and propose a task execution framework that can support tiny tasks with an efficient data distribution platform. We compare our framework against various configurations of BashReduce and Hadoop. A detailed discussion of the tiny-task approach on two workloads, EAGLET and Netflix movie rating, is presented.
Date: 2014-09-08
Rights: Unrestricted. This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
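As an illustration of the subsampling workload described in the abstract, the following is a minimal Python sketch, not the thesis's platform. It groups observed samples into small tasks, has each task draw a random subsample of its own data, and combines the per-task results into one statistic (here, a mean). The function names, the fixed task size, and the subsampling fraction are all hypothetical choices made for this example. The tasks run sequentially here, whereas a data-parallel platform would run them on separate workers.

```python
import random


def make_tasks(samples, task_size):
    """Group the observed samples into fixed-size tasks (the last may be smaller)."""
    return [samples[i:i + task_size] for i in range(0, len(samples), task_size)]


def run_task(task, fraction, rng):
    """Map step: draw a random subsample of one task and return (sum, count)."""
    k = max(1, int(len(task) * fraction))  # subsample at least one element
    subsample = rng.sample(task, k)        # sampling without replacement
    return sum(subsample), k


def subsampled_mean(samples, task_size, fraction, seed=0):
    """Reduce step: combine per-task partial sums into one estimate of the mean."""
    rng = random.Random(seed)
    partials = [run_task(t, fraction, rng) for t in make_tasks(samples, task_size)]
    total = sum(s for s, _ in partials)
    count = sum(k for _, k in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(1000))
    # Each task holds 50 samples and subsamples 20% of them.
    print(subsampled_mean(data, task_size=50, fraction=0.2))
```

Keeping each task small is what lets its working set stay close to the processor cache in the setting the abstract describes; the sketch only shows the task structure and does not model cache behavior, scheduling, or data distribution.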