Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning

Many goal-reaching reinforcement learning (RL) tasks have empirically verified that rewarding the agent on subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework to quantify the computational benefits of rewarding the completion of subgoals, in t...


Bibliographic Details
Main Authors:	Baek, C. (Author), Jiao, J. (Author), Ma, Y. (Author), Zhai, Y. (Author), Zhou, Z. (Author)
Format:	Article
Language:	English
Published:	AI Access Foundation 2022
Subjects:	Computational complexity; Convergence speed; Economic and social effects; Graph theory; Intermediate state; Multipath; Performance; Policy learning; Reinforcement learning; Short-path; Single path; Subgoals; Theoretical framework; Value iteration
Online Access:	View Fulltext in Publisher (https://doi.org/10.1613/JAIR.1.13326)


LEADER	02474nam a2200349Ia 4500
001	10.1613-JAIR.1.13326
008	220425s2022 CNT 000 0 und d
020			\|a 10769757 (ISSN)
245	1	0	\|a Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning
260		0	\|b AI Access Foundation \|c 2022
856			\|z View Fulltext in Publisher \|u https://doi.org/10.1613/JAIR.1.13326
520	3		\|a Many goal-reaching reinforcement learning (RL) tasks have empirically verified that rewarding the agent on subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework to quantify the computational benefits of rewarding the completion of subgoals, in terms of the number of synchronous value iterations. In particular, we consider subgoals as one-way intermediate states, which can only be visited once per episode and propose two settings that consider these one-way intermediate states: the one-way single-path (OWSP) and the one-way multi-path (OWMP) settings. In both OWSP and OWMP settings, we demonstrate that adding intermediate rewards to subgoals is more computationally efficient than only rewarding the agent once it completes the goal of reaching a terminal state. We also reveal a trade-off between computational complexity and the pursuit of the shortest path in the OWMP setting: adding intermediate rewards significantly reduces the computational complexity of reaching the goal but the agent may not find the shortest path, whereas with sparse terminal rewards, the agent finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms. © 2022 AI Access Foundation. All rights reserved.
650	0	4	\|a Computational complexity
650	0	4	\|a Convergence speed
650	0	4	\|a Economic and social effects
650	0	4	\|a Graph theory
650	0	4	\|a Intermediate state
650	0	4	\|a Multipath
650	0	4	\|a Performance
650	0	4	\|a Policy learning
650	0	4	\|a Reinforcement learning
650	0	4	\|a Short-path
650	0	4	\|a Single path
650	0	4	\|a Subgoals
650	0	4	\|a Theoretical framework
650	0	4	\|a Value iteration
700	1		\|a Baek, C. \|e author
700	1		\|a Jiao, J. \|e author
700	1		\|a Ma, Y. \|e author
700	1		\|a Zhai, Y. \|e author
700	1		\|a Zhou, Z. \|e author
773			\|t Journal of Artificial Intelligence Research
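
The abstract measures the benefit of subgoal rewards in synchronous value iterations. The following is a minimal illustrative sketch of that idea, not the authors' construction: it assumes a hypothetical forward-only chain (so each state is entered at most once, loosely mimicking one-way intermediate states), a discount factor of 0.9, and unit rewards. The function name `sweeps_until_goal_reaching` and the reward layouts are assumptions made for illustration only.

```python
def sweeps_until_goal_reaching(n_states, rewards, gamma=0.9):
    """Count synchronous value-iteration sweeps needed before the greedy
    policy walks from state 0 all the way to the goal.

    Toy chain (an assumption, not the paper's setting): states 0..n_states,
    goal at n_states. Each state offers two actions: "stay" (reward 0) or
    "forward" (collects rewards[s + 1] on entering state s + 1).
    """
    goal = n_states
    v = [0.0] * (goal + 1)  # V_0 = 0 everywhere; the goal is terminal
    for sweeps in range(goal + 1):
        # Greedy policy w.r.t. the current values, ties broken toward "stay":
        # it reaches the goal only if "forward" strictly wins in every state.
        if all(rewards[s + 1] + gamma * v[s + 1] > gamma * v[s]
               for s in range(goal)):
            return sweeps
        # One synchronous sweep: all states are updated from the old values.
        v = [max(gamma * v[s], rewards[s + 1] + gamma * v[s + 1])
             for s in range(goal)] + [0.0]
    return None  # no goal-reaching greedy policy found within the budget


n = 12
# Sparse setting: reward only on reaching the terminal state.
sparse = [0.0] * n + [1.0]
# Intermediate setting: also reward hypothetical subgoals at states 3, 6, 9.
dense = [1.0 if s > 0 and s % 3 == 0 else 0.0 for s in range(n + 1)]

print(sweeps_until_goal_reaching(n, sparse))  # 11
print(sweeps_until_goal_reaching(n, dense))   # 2
```

In this toy chain, reward information propagates one state per sweep, so the sparse setting needs a number of sweeps that grows with the distance to the goal, while the intermediate-reward setting needs only as many as the gap between consecutive subgoals. The shortest-path trade-off described for the OWMP setting is not modeled here, since the chain has a single path.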


Similar Items