PAC-optimal, Non-parametric Algorithms and Bounds for Exploration in Concurrent MDPs with Delayed Updates

As the reinforcement learning community has shifted its focus from heuristic methods to methods that have performance guarantees, PAC-optimal exploration algorithms have received significant attention. Unfortunately, the majority of current PAC-optimal exploration algorithms are inapplicable in realistic scenarios: 1) They scale poorly to domains of realistic size. 2) They are only applicable to discrete state-action spaces. 3) They assume that experience comes from a single, continuous trajectory. 4) They assume that value function updates are instantaneous. The goal of this work is to bridge the gap between theory and practice by introducing an efficient and customizable PAC-optimal exploration algorithm that is able to explore in multiple continuous- or discrete-state MDPs simultaneously. Our algorithm does not assume that value function updates can be completed instantaneously, and it maintains PAC guarantees in real-time environments. Not only do we extend the applicability of PAC-optimal exploration algorithms to new, realistic settings, but even when instant value function updates are possible, our bounds present a significant improvement over previous single-MDP exploration bounds and a drastic improvement over previous concurrent PAC bounds. We also present Bellman error MDPs, a new analysis methodology for online and offline reinforcement learning algorithms, and TCE, a new, fine-grained metric for the cost of exploration.
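
The abstract describes a setting with concurrent exploration of several MDPs and value-function updates that are not applied instantaneously. The sketch below is only a minimal illustration of that setting, not the dissertation's algorithm or bounds: one shared, optimistically initialized agent acts in several toy MDPs while its value-function updates are applied after a fixed delay. All names and parameters (ConcurrentExplorer, update_delay, optimistic_value, etc.) are hypothetical.

```python
# Minimal sketch: concurrent exploration with delayed value-function updates.
# This is NOT the dissertation's algorithm; it only illustrates the setting
# described in the abstract. All names/parameters here are hypothetical.
import random
from collections import deque


class ConcurrentExplorer:
    """Optimistic Q-learning-style agent shared across several MDPs,
    with value-function updates applied only after a fixed delay."""

    def __init__(self, n_actions, optimistic_value=1.0, update_delay=5,
                 step_size=0.1, discount=0.95):
        self.n_actions = n_actions
        self.q = {}                      # (state, action) -> value estimate
        self.optimistic_value = optimistic_value
        self.step_size = step_size
        self.discount = discount
        self.update_delay = update_delay
        self.pending = deque()           # transitions waiting to be applied

    def value(self, state, action):
        # Unvisited pairs default to an optimistic value, driving exploration.
        return self.q.get((state, action), self.optimistic_value)

    def act(self, state):
        # Greedy with respect to the (possibly stale) value function.
        return max(range(self.n_actions), key=lambda a: self.value(state, a))

    def observe(self, state, action, reward, next_state):
        # Experience from any of the concurrent MDPs is queued; the value
        # function is only updated update_delay samples later, mimicking
        # non-instantaneous updates.
        self.pending.append((state, action, reward, next_state))
        while len(self.pending) > self.update_delay:
            self._apply(self.pending.popleft())

    def _apply(self, transition):
        s, a, r, s2 = transition
        target = r + self.discount * max(self.value(s2, b)
                                         for b in range(self.n_actions))
        self.q[(s, a)] = self.value(s, a) + self.step_size * (target - self.value(s, a))


def demo():
    # Two toy chain MDPs explored simultaneously by one shared agent.
    random.seed(0)
    agent = ConcurrentExplorer(n_actions=2)
    states = [0, 0]                      # current state of each MDP
    for _ in range(200):
        for i in range(len(states)):     # one step in each MDP per "tick"
            s = states[i]
            a = agent.act(s)
            s2 = min(s + 1, 5) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == 5 else 0.0
            agent.observe(s, a, r, s2)
            states[i] = 0 if s2 == 5 else s2
    print("learned value of (state=4, action=1):", round(agent.value(4, 1), 3))


if __name__ == "__main__":
    demo()
```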

Bibliographic Details
Main Author: Pazis, Jason
Other Authors: Parr, Ronald
Published: 2015
Format: Dissertation
Subjects:
Computer science
Artificial intelligence
Concurrent
Delay
Exploration
MDP
PAC-optimal
Reinforcement Learning
Online Access: http://hdl.handle.net/10161/11334