Q-learning with nearest neighbors

Bibliographic Details
Main Authors: Shah, Devavrat (Author), Xie, Qiaomin (Author)
Format: Article
Language:English
Published: 2021-11-09T16:08:56Z.
Subjects:
Online Access:Get fulltext
LEADER 01479 am a22001453u 4500
001 137946
042 |a dc 
100 1 0 |a Shah, Devavrat  |e author 
700 1 0 |a Xie, Qiaomin  |e author 
245 0 0 |a Q-learning with nearest neighbors 
260 |c 2021-11-09T16:08:56Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/137946 
520 |a © 2018 Curran Associates Inc. All rights reserved. We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with "covering time" L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε^3 (1 − γ)^7)) samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as Õ(1/ε^d), so the sample complexity scales as Õ(1/ε^(d+3)). Indeed, we establish a lower bound arguing that a dependence of Ω̃(1/ε^(d+2)) is necessary. 
546 |a en 
655 7 |a Article
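
The abstract above describes Q-learning driven by nearest neighbor regression over a continuous state space. The following is a minimal sketch of that idea, assuming a finite action set, states in a bounded set covered by a fixed grid of centers, and a single sample path of (state, action, reward, next_state) transitions; the covering centers, the 1/n step-size schedule, and the class interface are illustrative assumptions, not the authors' exact NNQL construction or analysis.

```python
# Minimal sketch of nearest-neighbor Q-learning (hypothetical interface).
# Q-values are maintained at a finite set of covering centers; each observed
# transition updates the estimate at the center nearest to the visited state,
# with the bootstrap target evaluated at the center nearest to the next state.
import numpy as np


class NearestNeighborQLearner:
    def __init__(self, centers, n_actions, gamma=0.9):
        self.centers = np.asarray(centers)                  # (K, d) covering centers
        self.n_actions = n_actions
        self.gamma = gamma                                   # discount factor in (0, 1)
        self.Q = np.zeros((len(self.centers), n_actions))    # Q estimate per (center, action)
        self.visits = np.zeros((len(self.centers), n_actions))

    def _nearest(self, state):
        # Index of the covering center closest to `state` in Euclidean norm.
        return int(np.argmin(np.linalg.norm(self.centers - np.asarray(state), axis=1)))

    def update(self, state, action, reward, next_state):
        i = self._nearest(state)
        j = self._nearest(next_state)
        self.visits[i, action] += 1
        # Simple 1/n step size; the paper's finite-sample guarantee relies on a
        # specific schedule, this choice is only for illustration.
        alpha = 1.0 / self.visits[i, action]
        target = reward + self.gamma * np.max(self.Q[j])
        self.Q[i, action] += alpha * (target - self.Q[i, action])

    def greedy_action(self, state):
        # Greedy policy with respect to the current Q estimate.
        return int(np.argmax(self.Q[self._nearest(state)]))
```

As a usage example, one would iterate over the transitions of a single sample path, calling `update` on each, and read off the learned policy via `greedy_action`; the abstract's guarantee says that, with an appropriate construction, roughly Õ(L/(ε^3 (1 − γ)^7)) such transitions suffice for an ε-accurate Q estimate, where L is the path's covering time.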