Offline Multi-Policy Gradient for Latent Mixture Environments
Reinforcement learning has been widely applied to sequential decision-making problems in many real-world domains, including recommendation and e-learning. Many such applications imply three features that together pose a new challenge for reinforcement learning: multiple policies, latent mixture environments, and offline learning. To address this challenge, this paper proposes a reinforcement learning approach called offline multi-policy gradient for latent mixture environments. The proposed method maximizes the expected return of a trajectory with respect to the joint distribution of trajectory and environment model, and adopts a multi-policy search algorithm based on expectation maximization to find the optimal policies. We also prove that the off-policy techniques of importance sampling and the advantage function can be used for offline multi-policy learning with fixed historical trajectories. The effectiveness of the approach is demonstrated by experiments on both synthetic and real datasets.
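To make the ideas in the abstract more concrete, the following is a minimal NumPy sketch of how expectation maximization over a latent mixture of policies can be combined with per-trajectory importance sampling and an advantage baseline when learning from fixed historical trajectories. Everything here is an illustrative assumption: the linear-softmax policy class, the uniform behavior policy, the synthetic placeholder data, and all names and hyperparameters are chosen only for the example and do not reproduce the paper's algorithm.

```python
# Minimal illustrative sketch (not the paper's algorithm): EM-style offline
# multi-policy gradient with per-trajectory importance sampling and an
# advantage baseline. All data below is a synthetic placeholder.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, N_ACTIONS, K = 4, 3, 2        # feature size, discrete actions, mixture size
N_TRAJ, HORIZON = 200, 10                # size of the fixed offline dataset

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def action_probs(theta, states):
    """pi_theta(a|s): softmax over a linear score of the state features."""
    return softmax(states @ theta)       # shape (HORIZON, N_ACTIONS)

# Fixed historical trajectories, assumed to be logged by a uniform behavior policy.
behavior_prob = 1.0 / N_ACTIONS
dataset = []
for _ in range(N_TRAJ):
    states = rng.normal(size=(HORIZON, STATE_DIM))
    actions = rng.integers(N_ACTIONS, size=HORIZON)
    rewards = rng.normal(size=HORIZON)   # placeholder rewards
    dataset.append((states, actions, rewards))

# K candidate policies and mixing weights for the latent mixture.
thetas = [rng.normal(scale=0.1, size=(STATE_DIM, N_ACTIONS)) for _ in range(K)]
mix = np.full(K, 1.0 / K)

def traj_log_lik(theta, states, actions):
    """Log-likelihood of the logged actions under policy theta."""
    p = action_probs(theta, states)
    return np.log(p[np.arange(len(actions)), actions] + 1e-12).sum()

lr = 0.05
for _ in range(50):
    # E-step: responsibility of each latent policy/model for each trajectory.
    log_r = np.array([[np.log(mix[k] + 1e-12) + traj_log_lik(thetas[k], s, a)
                       for k in range(K)] for s, a, _ in dataset])
    resp = softmax(log_r)                # shape (N_TRAJ, K)

    # Advantage estimate: trajectory return minus a mean-return baseline.
    returns = np.array([r.sum() for _, _, r in dataset])
    advantages = returns - returns.mean()

    # M-step: responsibility-weighted off-policy gradient step for each policy.
    for k in range(K):
        grad = np.zeros_like(thetas[k])
        for i, (states, actions, _) in enumerate(dataset):
            p = action_probs(thetas[k], states)
            # Per-trajectory importance weight: prod_t pi_k(a_t|s_t) / beta(a_t|s_t).
            w = np.prod(p[np.arange(HORIZON), actions] / behavior_prob)
            # Score function of a softmax policy: phi(s) (one_hot(a) - pi(.|s)).
            score = states.T @ (np.eye(N_ACTIONS)[actions] - p)
            grad += resp[i, k] * w * advantages[i] * score
        thetas[k] += lr * grad / N_TRAJ
    mix = resp.mean(axis=0)              # re-estimate the mixing weights
```

With real logged data, the placeholder trajectories and rewards would be replaced by the fixed historical dataset, and the behavior-policy probabilities would come from the logging policy rather than being assumed uniform.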
Main Authors: | Xiaoguang Li (ORCID 0000-0003-4580-4865), Xin Zhang (ORCID 0000-0001-6228-4908), Lixin Wang (ORCID 0000-0003-1502-4665), Ge Yu (ORCID 0000-0002-3171-8889) |
---|---|
Author Affiliations: | College of Information, Liaoning University, Shenyang, China (Li, Zhang, Wang); College of Computer Science and Engineering, Northeastern University (China), Shenyang, China (Yu) |
Format: | Article |
Language: | English |
Published: | IEEE, 2021-01-01 |
Series: | IEEE Access, Volume 9 |
ISSN: | 2169-3536 |
DOI: | 10.1109/ACCESS.2020.3045300 |
Subjects: | Reinforcement learning; latent mixture environments; multi-policy gradient; offline learning |
Online Access: | https://ieeexplore.ieee.org/document/9296301/ |