Automatic Tuning of Data-Intensive Analytical Workloads

Modern industrial, government, and academic organizations are collecting massive amounts of data ("Big Data") at an unprecedented scale and pace. The ability to perform timely and cost-effective analytical processing of such large datasets in order to extract deep insights is now...

Bibliographic Details
Main Author:	Herodotou, Herodotos
Other Authors:	Babu, Shivnath
Published:	2012
Subjects:	Computer science; cost-based optimization; Database systems; MapReduce systems; self-tuning systems
Online Access:	http://hdl.handle.net/10161/5415

id	ndltd-DUKE-oai-dukespace.lib.duke.edu-10161-5415
record_format	oai_dc
collection	NDLTD
sources	NDLTD
topic	Computer science; cost-based optimization; Database systems; MapReduce systems; self-tuning systems
description	Modern industrial, government, and academic organizations are collecting massive amounts of data ("Big Data") at an unprecedented scale and pace. The ability to perform timely and cost-effective analytical processing of such large datasets in order to extract deep insights is now a key ingredient for success. These insights can drive automated processes for advertisement placement, improve customer relationship management, and lead to major scientific breakthroughs.

Existing database systems are adapting to the new status quo while large-scale dataflow systems (like Dryad and MapReduce) are becoming popular for executing analytical workloads on Big Data. Ensuring good and robust performance automatically on such systems poses several challenges. First, workloads often analyze a hybrid mix of structured and unstructured datasets stored in nontraditional data layouts. The structure and properties of the data may not be known upfront, and will evolve over time. Complex analysis techniques and rapid development needs necessitate the use of both declarative and procedural programming languages for workload specification. Finally, the space of workload tuning choices is very large and high-dimensional, spanning configuration parameter settings, cluster resource provisioning (spurred by recent innovations in cloud computing), and data layouts.

We have developed a novel dynamic optimization approach that can form the basis for tuning workload performance automatically across different tuning scenarios and systems. Our solution is based on (i) collecting monitoring information in order to learn the run-time behavior of workloads, (ii) deploying appropriate models to predict the impact of hypothetical tuning choices on workload behavior, and (iii) using efficient search strategies to find tuning choices that give good workload performance. The dynamic nature enables our solution to overcome the new challenges posed by Big Data, and also makes our solution applicable to both MapReduce and Database systems. We have developed the first cost-based optimization framework for MapReduce systems for determining the cluster resources and configuration parameter settings to meet desired requirements on execution time and cost for a given analytic workload. We have also developed a novel tuning-based optimizer in Database systems to collect targeted run-time information, perform optimization, and repeat as needed to perform fine-grained tuning of SQL queries.

Document type: Dissertation
author2	Babu, Shivnath
author_facet	Babu, Shivnath Herodotou, Herodotos
author	Herodotou, Herodotos
author_sort	Herodotou, Herodotos
title	Automatic Tuning of Data-Intensive Analytical Workloads
publishDate	2012
url	http://hdl.handle.net/10161/5415

Similar Items