Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning

Even with the same algorithm and hyperparameter configuration, deep reinforcement learning (DRL) can produce drastically different results across multiple experimental trials, and most of these results are unsatisfactory. Because of this instability, researchers have to perform many trials to validate an algorithm or a set of hyperparameters in DRL. In this article, we present the policy return method, a new design for reducing the number of trials needed when training a DRL model. The method allows the learned policy to return to a previous state when it becomes divergent or stagnant at any stage of training. When returning, a certain percentage of stochastic data is added to the weights of the neural networks to prevent a repeated decline. Extensive experiments on challenging tasks and various target scores demonstrate that the policy return method can reduce the required number of trials by 10% to 40% compared with the corresponding original algorithms, and by 10% to 30% compared with state-of-the-art algorithms.
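The abstract describes the policy return mechanism only in prose: checkpoint the policy, roll back when training diverges or stagnates, and perturb the restored weights with a small amount of stochastic data. A minimal sketch of how such a safeguard could be attached to a DRL training loop is given below; the class name, the divergence and stagnation tests, and the noise_fraction parameter are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a policy-return-style safeguard for a DRL training loop.
# The thresholds, noise scheme, and class API are assumptions made for illustration;
# they are not taken from the paper.
import copy

import torch


class PolicyReturnGuard:
    """Checkpoints policy weights and rolls back when training diverges or stalls."""

    def __init__(self, policy_net, noise_fraction=0.05, patience=20):
        self.policy_net = policy_net
        self.noise_fraction = noise_fraction  # share of stochastic data mixed into the weights
        self.patience = patience              # evaluations without improvement before "stagnant"
        self.best_score = float("-inf")
        self.best_state = copy.deepcopy(policy_net.state_dict())
        self.stale_evals = 0

    def update(self, eval_score):
        """Call after each policy evaluation; returns True if a rollback was performed."""
        if eval_score > self.best_score:
            self.best_score = eval_score
            self.best_state = copy.deepcopy(self.policy_net.state_dict())
            self.stale_evals = 0
            return False

        self.stale_evals += 1
        diverged = eval_score < 0.5 * self.best_score  # crude divergence test (assumption)
        stagnant = self.stale_evals >= self.patience   # crude stagnation test (assumption)
        if diverged or stagnant:
            self._return_to_previous_state()
            self.stale_evals = 0
            return True
        return False

    def _return_to_previous_state(self):
        # Restore the best checkpoint, then add a small percentage of random noise
        # to the weights so the policy does not simply repeat the same decline.
        self.policy_net.load_state_dict(self.best_state)
        with torch.no_grad():
            for param in self.policy_net.parameters():
                noise = torch.randn_like(param) * param.abs().mean() * self.noise_fraction
                param.add_(noise)
```

In use, a training loop would call guard.update(score) after each evaluation and continue training from the restored, perturbed weights whenever it returns True, instead of abandoning the trial and starting a new one.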

Bibliographic Details
Main Authors: Feng Liu, Shuling Dai, Yongjia Zhao
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Subjects: Deep reinforcement learning; policy return method; fewer trials; stochastic data
Online Access: https://ieeexplore.ieee.org/document/9298771/
id doaj-ce74b71c5e014fbbac96e1e91fa156c3
record_format Article
spelling doaj-ce74b71c5e014fbbac96e1e91fa156c3 (indexed 2021-03-30T04:25:07Z)
Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning
IEEE Access, vol. 8, pp. 228099-228107, 2020-01-01. Publisher: IEEE. ISSN: 2169-3536. Language: English.
DOI: 10.1109/ACCESS.2020.3045835. IEEE article number: 9298771.
Feng Liu (https://orcid.org/0000-0002-9006-4520), State Key Laboratory of VR Technology & Systems, Beihang University (BUAA), Beijing, China
Shuling Dai (https://orcid.org/0000-0002-2934-9033), State Key Laboratory of VR Technology & Systems, Beihang University (BUAA), Beijing, China
Yongjia Zhao (https://orcid.org/0000-0002-4557-9066), State Key Laboratory of VR Technology & Systems, Beihang University (BUAA), Beijing, China
Online access: https://ieeexplore.ieee.org/document/9298771/
Keywords: Deep reinforcement learning; policy return method; fewer trials; stochastic data
collection DOAJ
language English
format Article
sources DOAJ
author Feng Liu
Shuling Dai
Yongjia Zhao
title Policy Return: A New Method for Reducing the Number of Experimental Trials in Deep Reinforcement Learning
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description Even with the same algorithm and hyperparameter configuration, deep reinforcement learning (DRL) can produce drastically different results across multiple experimental trials, and most of these results are unsatisfactory. Because of this instability, researchers have to perform many trials to validate an algorithm or a set of hyperparameters in DRL. In this article, we present the policy return method, a new design for reducing the number of trials needed when training a DRL model. The method allows the learned policy to return to a previous state when it becomes divergent or stagnant at any stage of training. When returning, a certain percentage of stochastic data is added to the weights of the neural networks to prevent a repeated decline. Extensive experiments on challenging tasks and various target scores demonstrate that the policy return method can reduce the required number of trials by 10% to 40% compared with the corresponding original algorithms, and by 10% to 30% compared with state-of-the-art algorithms.
topic Deep reinforcement learning
policy return method
fewer trials
stochastic data
url https://ieeexplore.ieee.org/document/9298771/