Convergence Results for Some Temporal Difference Methods Based on Least Squares

We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(lambda). These are temporal difference methods for constructing a linear function approximation of the cost func...

Bibliographic Details
Main Authors: Yu, Huizhen (Contributor), Bertsekas, Dimitri P. (Contributor)
Other Authors: Massachusetts Institute of Technology. Laboratory for Information and Decision Systems (Contributor)
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers, 2012-10-18T19:03:35Z.
Online Access: http://hdl.handle.net/1721.1/74102
LEADER 01640 am a22002053u 4500
001 74102
042 |a dc 
100 1 0 |a Yu, Huizhen  |e author 
100 1 0 |a Massachusetts Institute of Technology. Laboratory for Information and Decision Systems  |e contributor 
100 1 0 |a Bertsekas, Dimitri P.  |e contributor 
100 1 0 |a Yu, Huizhen  |e contributor 
700 1 0 |a Bertsekas, Dimitri P.  |e author 
245 0 0 |a Convergence Results for Some Temporal Difference Methods Based on Least Squares 
260 |b Institute of Electrical and Electronics Engineers,   |c 2012-10-18T19:03:35Z. 
856 |z Get fulltext  |u http://hdl.handle.net/1721.1/74102 
520 |a We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(lambda). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant stepsize choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(lambda), as well as more reliable. 
546 |a en_US 
655 7 |a Article 
773 |t IEEE Transactions on Automatic Control