Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning
Thesis: S.M., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2020 === Cataloged from PDF of thesis. === Includes bibliographical references (pages 89-97). === Abstract: see the description field below.
Main Author: | Kim, Dong Ki |
---|---|
Other Authors: | Jonathan P. How (advisor) |
Format: | Others |
Language: | English |
Published: | Massachusetts Institute of Technology, 2020 |
Subjects: | Aeronautics and Astronautics |
Online Access: | https://hdl.handle.net/1721.1/128312 |
id |
ndltd-MIT-oai-dspace.mit.edu-1721.1-128312 |
record_format |
oai_dc |
spelling |
ndltd-MIT-oai-dspace.mit.edu-1721.1-128312 2020-11-05T05:10:05Z. Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning. Kim, Dong Ki, S.M., Massachusetts Institute of Technology; advisor: Jonathan P. How; Department of Aeronautics and Astronautics; subject: Aeronautics and Astronautics. Thesis: S.M., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2020. Cataloged from PDF of thesis. Includes bibliographical references (pages 89-97). Abstract as given in the description field below. by Dong Ki Kim. S.M. Accessioned and made available 2020-11-03T20:29:57Z; issued 2020. Thesis. https://hdl.handle.net/1721.1/128312. 1201259574. eng. MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy, available through http://dspace.mit.edu/handle/1721.1/7582. 97 pages. application/pdf. Massachusetts Institute of Technology |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Aeronautics and Astronautics. |
spellingShingle |
Aeronautics and Astronautics. Kim, Dong Ki, S.M., Massachusetts Institute of Technology. Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
description |
Thesis: S.M., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2020 === Cataloged from PDF of thesis. === Includes bibliographical references (pages 89-97). === Learning optimal policies in the presence of the non-stationary policies of other, simultaneously learning agents is a major challenge in multiagent reinforcement learning (MARL). The difficulty is compounded by other challenges, including multiagent credit assignment, the high dimensionality of the problems, and the lack of convergence guarantees. As a result, many experiences are often required to learn effective multiagent policies. This thesis introduces two frameworks that reduce the sample complexity of MARL. The first framework reduces sample complexity by exchanging knowledge between agents. In particular, recent work on agents that learn to teach their teammates has demonstrated that action advising accelerates team-wide learning. === However, prior work simplified the learning of advising policies by using simple function approximations and by only considering advising with primitive (low-level) actions, both of which limit the scalability of learning and teaching to more complex domains. This thesis introduces a novel learning-to-teach framework, called hierarchical multiagent teaching (HMAT), that improves scalability to complex environments by using a deep representation for student policies and by advising with more expressive extended-action sequences over multiple levels of temporal abstraction. Our empirical evaluations demonstrate that HMAT improves team-wide learning progress in large, complex domains where previous approaches fail. HMAT also learns teaching policies that effectively transfer knowledge to teammates learning different tasks, even when those teammates have heterogeneous action spaces. === The second framework introduces the first policy gradient theorem based on meta-learning, which enables fast adaptation (i.e., requiring only a few iterations) to the non-stationary fellow agents in MARL. The policy gradient theorem that we prove inherently includes both a self-shaping term, which accounts for the impact of a meta-agent's initial policy on its adapted policy, and an opponent-shaping term, which exploits the learning dynamics of the other agents. We demonstrate that our meta-policy gradient enables agents to meta-learn about the different sources of non-stationarity in the environment and thereby improve their learning performance. === by Dong Ki Kim. === S.M. === S.M. Massachusetts Institute of Technology, Department of Aeronautics and Astronautics |
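The advising mechanism described above lends itself to a small illustration. Below is a minimal sketch of one plausible learning-to-teach loop in the spirit of HMAT: a teacher occasionally proposes extended-action advice (a subgoal the student pursues for several primitive steps) and is ultimately credited by the student's learning progress. Every name in this sketch (`TeacherPolicy`, `StudentPolicy`, `run_episode`, the 10% advising rate, the horizon of 8) is a hypothetical illustration rather than the thesis's actual implementation, and the environment is assumed to follow a Gym-style `reset`/`step` API.

```python
# Minimal sketch of a learning-to-teach advising loop in the spirit of HMAT.
# Every name below is a hypothetical illustration; the thesis's actual
# architecture and training procedure differ in detail.
import numpy as np


class StudentPolicy:
    """Stand-in for a deep student policy over primitive (low-level) actions."""

    def act(self, state, subgoal=None):
        # A real student would condition a neural network on (state, subgoal);
        # here we return a random primitive action as a placeholder.
        return np.random.randint(4)

    def update(self, transitions):
        # Placeholder for the student's policy update from collected experience.
        pass


class TeacherPolicy:
    """Stand-in teacher that occasionally advises with an extended action:
    a subgoal the student pursues for several primitive steps."""

    def advise(self, state):
        if np.random.rand() < 0.1:  # advise sparingly (advice has a budget/cost)
            return np.asarray(state) + np.random.randn(*np.shape(state))
        return None  # no advice: the student acts on its own

    def update(self, student_progress):
        # The teacher is rewarded by the student's learning progress,
        # e.g., the improvement in the student's episode returns.
        pass


def run_episode(env, teacher, student, advice_horizon=8):
    """One episode: the teacher injects subgoals; the student acts and learns."""
    state = env.reset()
    transitions, done = [], False
    while not done:
        subgoal = teacher.advise(state)
        # Advice commits the student to several primitive steps; otherwise
        # the student takes a single step of its own.
        for _ in range(advice_horizon if subgoal is not None else 1):
            action = student.act(state, subgoal)
            state_next, reward, done, _ = env.step(action)
            transitions.append((state, action, reward, state_next))
            state = state_next
            if done:
                break
    student.update(transitions)
    return sum(t[2] for t in transitions)  # episode return as a progress signal
```

The design choice the sketch tries to convey is that a single teacher decision commits the student to several primitive steps, so the teacher reasons over a shorter, more abstract decision horizon than it would by advising one primitive action at a time.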
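The self-shaping and opponent-shaping terms in the second framework can likewise be illustrated with a hedged chain-rule sketch. Assume two agents with policy parameters $\theta^1_0$ (the meta-agent) and $\theta^2_0$ (a peer), each taking a single inner-loop policy-gradient step of size $\alpha$ on the current return $J_0$, after which the meta-agent differentiates the post-adaptation return $J_1$ with respect to its initial parameters. This notation is simplified and illustrative; the theorem proved in the thesis is more general.

```latex
% Illustrative one-inner-step decomposition (not the thesis's exact statement).
\begin{align*}
  \theta^1_1 &= \theta^1_0 + \alpha\,\nabla_{\theta^1_0} J_0(\theta^1_0, \theta^2_0)
    && \text{meta-agent's inner update}\\
  \theta^2_1 &= \theta^2_0 + \alpha\,\nabla_{\theta^2_0} J_0(\theta^1_0, \theta^2_0)
    && \text{peer's inner update, a function of } \theta^1_0\\
  \nabla_{\theta^1_0} J_1 &=
    \underbrace{\Big(\tfrac{\partial \theta^1_1}{\partial \theta^1_0}\Big)^{\!\top}
      \nabla_{\theta^1_1} J_1(\theta^1_1, \theta^2_1)}_{\text{self-shaping}}
    \;+\;
    \underbrace{\Big(\tfrac{\partial \theta^2_1}{\partial \theta^1_0}\Big)^{\!\top}
      \nabla_{\theta^2_1} J_1(\theta^1_1, \theta^2_1)}_{\text{opponent-shaping}}
\end{align*}
```

The first term captures how the meta-agent's initial policy determines its own adapted policy; the second exploits the fact that the peer's update direction itself depends on the meta-agent's behavior, which is what allows a meta-agent to anticipate and steer the other agents' learning.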
author2 |
Jonathan P. How. |
author_facet |
Jonathan P. How. Kim, Dong Ki, S.M., Massachusetts Institute of Technology. |
author |
Kim, Dong Ki, S.M., Massachusetts Institute of Technology. |
author_sort |
Kim, Dong Ki, S.M., Massachusetts Institute of Technology. |
title |
Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
title_short |
Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
title_full |
Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
title_fullStr |
Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
title_full_unstemmed |
Learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
title_sort |
learning to teach and meta-learning for sample-efficient multiagent reinforcement learning |
publisher |
Massachusetts Institute of Technology |
publishDate |
2020 |
url |
https://hdl.handle.net/1721.1/128312 |
work_keys_str_mv |
AT kimdongkismmassachusettsinstituteoftechnology learningtoteachandmetalearningforsampleefficientmultiagentreinforcementlearning |
_version_ |
1719354876689907712 |