PAC-optimal, Non-parametric Algorithms and Bounds for Exploration in Concurrent MDPs with Delayed Updates

As the reinforcement learning community has shifted its focus from heuristic methods to methods that have performance guarantees, PAC-optimal exploration algorithms have received significant attention. Unfortunately, the majority of current PAC-optimal exploration algorithms are inapplicable in realistic scenarios: 1) They scale poorly to domains of realistic size. 2) They are only applicable to discrete state-action spaces. 3) They assume that experience comes from a single, continuous trajectory. 4) They assume that value function updates are instantaneous. The goal of this work is to bridge the gap between theory and practice by introducing an efficient and customizable PAC-optimal exploration algorithm that is able to explore in multiple continuous- or discrete-state MDPs simultaneously. Our algorithm does not assume that value function updates can be completed instantaneously, and it maintains PAC guarantees in real-time environments. Not only do we extend the applicability of PAC-optimal exploration algorithms to new, realistic settings, but even when instant value function updates are possible, our bounds present a significant improvement over previous single-MDP exploration bounds and a drastic improvement over previous concurrent PAC bounds. We also present Bellman error MDPs, a new analysis methodology for online and offline reinforcement learning algorithms, and TCE, a new, fine-grained metric for the cost of exploration.
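
The abstract describes a setting with concurrent exploration of several MDPs and value-function updates that are not applied instantaneously. The sketch below is only a minimal illustration of that setting, not the dissertation's algorithm or bounds: one shared, optimistically initialized agent acts in several toy MDPs while its value-function updates are applied after a fixed delay. All names and parameters (ConcurrentExplorer, update_delay, optimistic_value, etc.) are hypothetical.

```python
# Minimal sketch: concurrent exploration with delayed value-function updates.
# This is NOT the dissertation's algorithm; it only illustrates the setting
# described in the abstract. All names/parameters here are hypothetical.
import random
from collections import deque


class ConcurrentExplorer:
    """Optimistic Q-learning-style agent shared across several MDPs,
    with value-function updates applied only after a fixed delay."""

    def __init__(self, n_actions, optimistic_value=1.0, update_delay=5,
                 step_size=0.1, discount=0.95):
        self.n_actions = n_actions
        self.q = {}                      # (state, action) -> value estimate
        self.optimistic_value = optimistic_value
        self.step_size = step_size
        self.discount = discount
        self.update_delay = update_delay
        self.pending = deque()           # transitions waiting to be applied

    def value(self, state, action):
        # Unvisited pairs default to an optimistic value, driving exploration.
        return self.q.get((state, action), self.optimistic_value)

    def act(self, state):
        # Greedy with respect to the (possibly stale) value function.
        return max(range(self.n_actions), key=lambda a: self.value(state, a))

    def observe(self, state, action, reward, next_state):
        # Experience from any of the concurrent MDPs is queued; the value
        # function is only updated update_delay samples later, mimicking
        # non-instantaneous updates.
        self.pending.append((state, action, reward, next_state))
        while len(self.pending) > self.update_delay:
            self._apply(self.pending.popleft())

    def _apply(self, transition):
        s, a, r, s2 = transition
        target = r + self.discount * max(self.value(s2, b)
                                         for b in range(self.n_actions))
        self.q[(s, a)] = self.value(s, a) + self.step_size * (target - self.value(s, a))


def demo():
    # Two toy chain MDPs explored simultaneously by one shared agent.
    random.seed(0)
    agent = ConcurrentExplorer(n_actions=2)
    states = [0, 0]                      # current state of each MDP
    for _ in range(200):
        for i in range(len(states)):     # one step in each MDP per "tick"
            s = states[i]
            a = agent.act(s)
            s2 = min(s + 1, 5) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == 5 else 0.0
            agent.observe(s, a, r, s2)
            states[i] = 0 if s2 == 5 else s2
    print("learned value of (state=4, action=1):", round(agent.value(4, 1), 3))


if __name__ == "__main__":
    demo()
```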

Bibliographic Details
Main Author: Pazis, Jason
Other Authors: Parr, Ronald
Published: 2015
Format: Dissertation
Subjects:
Computer science
Artificial intelligence
Concurrent
Delay
Exploration
MDP
PAC-optimal
Reinforcement Learning
Online Access: http://hdl.handle.net/10161/11334