Efficient Preference-based Reinforcement Learning

Common reinforcement learning algorithms assume access to a numeric feedback signal. The numeric feedback contains a large amount of information and can be maximized efficiently. However, the definition of a numeric feedback signal can be difficult in practice due to several limitations and badly def...

Bibliographic Details
Main Author:	Wirth, Christian
Format:	Others
Language:	en
Published:	2017
Online Access:	https://tuprints.ulb.tu-darmstadt.de/6952/1/ThesisColorMerged.pdf Wirth, Christian <http://tuprints.ulb.tu-darmstadt.de/view/person/Wirth=3AChristian=3A=3A.html> (2017): Efficient Preference-based Reinforcement Learning. Darmstadt, Technische Universität, [Ph.D. Thesis]

id	ndltd-tu-darmstadt.de-oai-tuprints.ulb.tu-darmstadt.de-6952
record_format	oai_dc
spelling	ndltd-tu-darmstadt.de-oai-tuprints.ulb.tu-darmstadt.de-6952 2020-07-15T07:09:31Z http://tuprints.ulb.tu-darmstadt.de/6952/ Efficient Preference-based Reinforcement Learning Wirth, Christian 2017 Ph.D. Thesis NonPeerReviewed text CC-BY-NC-ND 4.0 International - Creative Commons, Attribution Non-commercial, No-derivatives https://tuprints.ulb.tu-darmstadt.de/6952/1/ThesisColorMerged.pdf Wirth, Christian <http://tuprints.ulb.tu-darmstadt.de/view/person/Wirth=3AChristian=3A=3A.html> (2017): Efficient Preference-based Reinforcement Learning. Darmstadt, Technische Universität, [Ph.D. Thesis] en info:eu-repo/semantics/doctoralThesis info:eu-repo/semantics/openAccess
collection	NDLTD
language	en
format	Others
sources	NDLTD
description	Common reinforcement learning algorithms assume access to a numeric feedback signal. Numeric feedback contains a large amount of information and can be maximized efficiently. However, defining a numeric feedback signal can be difficult in practice due to several limitations, and badly defined values may lead to unintended outcomes. For humans, it is usually easier to define qualitative feedback signals than quantitative ones. Hence, we want to solve reinforcement learning problems with a qualitative signal, potentially capable of overcoming several of the limitations of numeric feedback. Preferences have several advantages over other qualitative settings, such as ordinal feedback or advice: they are scale-free and do not require assumptions about the optimal outcome. However, preferences are difficult to use for solving sequential decision problems, because it is unknown which decisions are responsible for the observed preference. Hence, we analyze different approaches for learning from preferences and show the design principles that can be used, as well as the advantages and problems that occur. We also survey the field of preference-based reinforcement learning and categorize the algorithms according to these design principles. Efficiency is of special interest in this setting: because preferences depend on human evaluation, the number of required preferences must be kept low. Hence, our focus is on efficient use of the preferences. Generalizing the obtained preferences is important, as this keeps the number of required preferences low. Therefore, we consider methods that are able to generalize the obtained preferences to models not yet evaluated. However, this introduces uncertain feedback, and the exploration/exploitation problem already known from classical reinforcement learning has to be considered with the preferences in mind. We show how to efficiently solve this dual exploration problem by interleaving both tasks in an undirected manner. We use undirected exploration methods because they scale better to high-dimensional spaces. Furthermore, human feedback has to be assumed to be error-prone, and we analyze the problems that arise when using human evaluation. We show that noise is the most substantial problem when dealing with human preferences and present a solution to this problem.
author	Wirth, Christian
spellingShingle	Wirth, Christian Efficient Preference-based Reinforcement Learning
author_facet	Wirth, Christian
author_sort	Wirth, Christian
title	Efficient Preference-based Reinforcement Learning
title_sort	efficient preference-based reinforcement learning
publishDate	2017
url	https://tuprints.ulb.tu-darmstadt.de/6952/1/ThesisColorMerged.pdf Wirth, Christian <http://tuprints.ulb.tu-darmstadt.de/view/person/Wirth=3AChristian=3A=3A.html> (2017): Efficient Preference-based Reinforcement Learning. Darmstadt, Technische Universität, [Ph.D. Thesis]
work_keys_str_mv	AT wirthchristian efficientpreferencebasedreinforcementlearning
_version_	1719327459302703104

Similar Items