Minimal Exploration in Episodic Reinforcement Learning

The exploration-exploitation trade-off is a fundamental dilemma that reinforcement learning algorithms face. This dilemma is also central to the design of various state-of-the-art bandit algorithms. We take inspiration from these algorithms and try to design reinforcement learning algorithms in an episo...

Bibliographic Details
Main Author:	Tripathi, Ardhendu Shekhar
Format:	Others
Language:	English
Published:	KTH, Skolan för elektroteknik och datavetenskap (EECS) 2018
Subjects:	Reinforcement Learning; Exploitation; Exploration; Regret; Optimism in Face of Uncertainty; Bayesian; Engineering and Technology; Teknik och teknologier
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233579

id	ndltd-UPSALLA1-oai-DiVA.org-kth-233579
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-233579 | 2018-09-03T06:42:56Z | Minimal Exploration in Episodic Reinforcement Learning | eng | Tripathi, Ardhendu Shekhar | KTH, Skolan för elektroteknik och datavetenskap (EECS) | 2018 | Reinforcement Learning | Exploitation | Exploration | Regret | Optimism in Face of Uncertainty | Bayesian | Engineering and Technology | Teknik och teknologier | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233579 | TRITA-EECS-EX ; 2018:533 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Reinforcement Learning; Exploitation; Exploration; Regret; Optimism in Face of Uncertainty; Bayesian; Engineering and Technology; Teknik och teknologier
spellingShingle	Reinforcement Learning; Exploitation; Exploration; Regret; Optimism in Face of Uncertainty; Bayesian; Engineering and Technology; Teknik och teknologier; Tripathi, Ardhendu Shekhar; Minimal Exploration in Episodic Reinforcement Learning
description	The exploration-exploitation trade-off is a fundamental dilemma that reinforcement learning algorithms face. This dilemma is also central to the design of various state-of-the-art bandit algorithms. We take inspiration from these algorithms and try to design reinforcement learning algorithms in an episodic setting. In this work, we develop two algorithms based on the principle of optimism in the face of uncertainty to minimize exploration. The idea is that the agent follows the optimal policy for a surrogate model, called the optimistic model, which is close enough to the true model but leads to a higher long-term reward. We show through extensive experiments on synthetic toy MDPs that the performance of our algorithms is in line with (and even better than, in the case where the reward dynamics are known) that of algorithms based on the Bayesian treatment of the problem and of other algorithms based on the optimism in the face of uncertainty principle. The algorithms proposed in this thesis outperform the Bayesian algorithms in terms of the variance of the regret over multiple runs. Another contribution is the derivation of several regret lower bounds, namely a problem-specific lower bound (both asymptotic and non-asymptotic) and a minimax lower bound, for any uniformly good algorithm in an episodic setting.
author	Tripathi, Ardhendu Shekhar
author_facet	Tripathi, Ardhendu Shekhar
author_sort	Tripathi, Ardhendu Shekhar
title	Minimal Exploration in Episodic Reinforcement Learning
title_short	Minimal Exploration in Episodic Reinforcement Learning
title_full	Minimal Exploration in Episodic Reinforcement Learning
title_fullStr	Minimal Exploration in Episodic Reinforcement Learning
title_full_unstemmed	Minimal Exploration in Episodic Reinforcement Learning
title_sort	minimal exploration in episodic reinforcement learning
publisher	KTH, Skolan för elektroteknik och datavetenskap (EECS)
publishDate	2018
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233579
work_keys_str_mv	AT tripathiardhendushekhar minimalexplorationinepisodicreinforcementlearning
_version_	1718728037166481408

Similar Items