Offline Multi-Policy Gradient for Latent Mixture Environments

Reinforcement learning has been widely applied to sequential decision-making problems in many real-world domains, including recommendation and e-learning. The combination of multiple policies, latent mixture environments, and offline learning implied by many real applications poses a new challenge for reinforcement learning. To address this challenge, the paper proposes a reinforcement learning approach called offline multi-policy gradient for latent mixture environments. The proposed method optimizes the expected return of a trajectory with respect to the joint distribution of trajectory and model, and adopts a multi-policy search algorithm based on expectation maximization to find the optimal policies. The paper also proves that the off-policy techniques of importance sampling and the advantage function can be used for offline multi-policy learning with fixed historical trajectories. The effectiveness of the approach is demonstrated by experiments on both synthetic and real datasets.
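
The abstract combines three ideas: an expected-return objective taken over the joint distribution of trajectory and latent model, an expectation-maximization (EM) search over multiple policies, and off-policy corrections (importance sampling with an advantage function) so that learning can run on a fixed log of historical trajectories. The sketch below is only a minimal illustration of how such an EM-style loop could be organized; it is not the authors' algorithm. The tabular softmax policies, the uniform behavior propensities `beta`, the mean-return baseline standing in for a learned advantage function, and all constants are assumptions made for the example.

```python
# Hedged sketch of an EM-style offline multi-policy gradient loop.
# Everything here (problem sizes, data, baseline) is illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

# Toy offline log: N fixed trajectories of length T in a tabular problem with
# S states and A actions, recorded together with the behavior policy's propensities.
N, T, S, A, K = 200, 10, 5, 3, 2
states  = rng.integers(0, S, size=(N, T))
actions = rng.integers(0, A, size=(N, T))
rewards = rng.normal(size=(N, T))
beta    = np.full((N, T), 1.0 / A)      # behavior propensities (uniform, purely illustrative)

# K softmax policies (one per latent environment) and mixture weights over models.
logits = rng.normal(scale=0.1, size=(K, S, A))
mix = np.full(K, 1.0 / K)

def policy_probs(logits_k):
    """Row-wise softmax: state-conditional action probabilities."""
    z = np.exp(logits_k - logits_k.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

returns = rewards.sum(axis=1)           # trajectory returns
baseline = returns.mean()               # crude baseline used in place of an advantage function

for _ in range(50):
    # E-step: responsibility of each latent model/policy for each logged trajectory,
    # proportional to the mixture weight times the trajectory's action likelihood.
    loglik = np.zeros((N, K))
    for k in range(K):
        p = policy_probs(logits[k])
        loglik[:, k] = np.log(p[states, actions] + 1e-12).sum(axis=1)
    log_resp = np.log(mix + 1e-12) + loglik
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: one importance-weighted policy-gradient step per policy on the fixed log.
    for k in range(K):
        p = policy_probs(logits[k])
        iw = p[states, actions] / beta              # per-step importance ratios
        adv = returns - baseline                    # "advantage" of each trajectory
        grad = np.zeros_like(logits[k])
        for n in range(N):
            for t in range(T):
                s, a = states[n, t], actions[n, t]
                g = -p[s].copy()                    # gradient of log-softmax ...
                g[a] += 1.0                         # ... w.r.t. the logits of state s
                grad[s] += resp[n, k] * iw[n, t] * adv[n] * g
        logits[k] += 0.01 * grad / N
    mix = resp.mean(axis=0)                         # M-step for the mixture weights

print("mixture weights:", mix)
```

In each iteration, the E-step assigns every logged trajectory a responsibility under each latent model, and the M-step re-weights the per-step policy-gradient terms by those responsibilities and by the importance ratios between the candidate policy and the logging policy, so no new interaction with the environment is needed.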

Bibliographic Details
Main Authors: Xiaoguang Li, Xin Zhang, Lixin Wang, Ge Yu
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects: Reinforcement learning; latent mixture environments; multi-policy gradient; offline learning
Online Access: https://ieeexplore.ieee.org/document/9296301/
Record ID: doaj-9c0f84ce2eb44de095e43fbd323b529f
DOI: 10.1109/ACCESS.2020.3045300
Published in: IEEE Access, Vol. 9, pp. 801-812, 2021 (IEEE document 9296301)
ISSN: 2169-3536
Author details:
Xiaoguang Li (https://orcid.org/0000-0003-4580-4865), College of Information, Liaoning University, Shenyang, China
Xin Zhang (https://orcid.org/0000-0001-6228-4908), College of Information, Liaoning University, Shenyang, China
Lixin Wang (https://orcid.org/0000-0003-1502-4665), College of Information, Liaoning University, Shenyang, China
Ge Yu (https://orcid.org/0000-0002-3171-8889), College of Computer Science and Engineering, Northeastern University (China), Shenyang, China