MDP planning tutorial pointers
MDP background
- Puterman, Martin L. 2005. Markov Decision Processes (Discrete Stochastic Dynamic Programming). Wiley-Interscience.
General background on MDPs; many different criteria. Rigorous build-up of foundations.
- Kallenberg, Lodewijk. n.d. Markov Decision Processes: Lecture Notes. Available online.
Another general introduction; variations, occupancy measures. Some of the proofs are nice. A few too many examples for my taste, but, hey, one can skip those.
- Arapostathis, Aristotle, Vivek S. Borkar, Emmanuel Fernández-Gaucherand, Mrinal K. Ghosh, and Steven I. Marcus. 1993. “Discrete-Time Controlled Markov Processes with Average Cost Criterion: A Survey.” SIAM Journal on Control and Optimization 31 (2): 282–344.
Survey of the average-cost case.
- Lattimore, Tor, and Csaba Szepesvari. 2020. Chapter 38 of Bandit Algorithms. Cambridge University Press. https://tor-lattimore.com/downloads/book/book.pdf
The chapter rigorously builds up the theory for finite state-action space average-reward MDPs with “finite diameter”. Short, self-contained. It rigorously argues for the sufficiency of deterministic stationary policies. The only catch is that quite a few of the proofs are left to the reader as exercises. If needed, the solutions are available from the authors. The proof uses the vanishing-discount approach, so most of the tools needed to handle the discounted case are also developed in this chapter.
- Altman, Eitan. 2009. Constrained Markov Decision Processes.
Classic on constrained MDPs; structural results. The proof of the sufficiency of stationary policies is from here.
- Brief intro to CMDPs:
http://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html
The sufficiency of stationary policies is reproduced here. The post also discusses the ways in which the single-objective MDP framework may not be rich enough to capture many problems of practical interest. Interestingly, the algorithms and techniques developed for the MDP framework continue to be useful beyond MDPs.
- Bertsekas, Dimitri P. 2005. Dynamic Programming and Optimal Control (Volumes 1,2). Athena Scientific.
Another rigorous introduction to MDPs; covers the undiscounted case, among other topics.
- Braziunas, Darius. n.d. “POMDP Solution Methods.” https://dariusb.bitbucket.io/papers/POMDP_survey.pdf.
POMDP background; with a focus on some specific computational approaches.
- Mykel J. Kochenderfer’s slides on MDPs and POMDPs. https://web.stanford.edu/~mykel/pomdps.pdf
Comes with the book:
- Kochenderfer, Mykel J. 2015. Decision Making Under Uncertainty: Theory and Application. 1st ed. The MIT Press.
Gives an overview of various approaches; the coverage of POMDPs is quite nice. The emphasis is not on rigor or theory, but the book is meant to be easy to read and helpful to build intuition.
- Åström, Karl Johan. 1965. “Optimal Control of Markov Processes with Incomplete State Information.” Journal of Mathematical Analysis and Applications 10: 174–205.
The paper that introduced POMDPs(?).
Local planning; tree building/lookahead search
- Kearns, M., Y. Mansour, and A. Ng. 2002. “A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes.” Machine Learning 49 (January): 193–208.
The journal version of the method that builds a tree of depth equal to the effective horizon.
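A minimal sketch of the sparse-sampling recursion, assuming a generative-model interface simulate(s, a) that returns a sampled reward and next state (the names simulate, actions, C, and the horizon argument are mine, not the paper's):

    def sparse_sampling_q(simulate, actions, s, h, C, gamma):
        # Estimate Q_h(s, a) for every action by building a sampled lookahead tree.
        # simulate(s, a) -> (reward, next_state) is one draw from the simulator,
        # h is the remaining depth (the effective horizon at the root),
        # C is the number of sampled successors per (state, action) pair.
        if h == 0:
            return {a: 0.0 for a in actions}
        q = {}
        for a in actions:
            total = 0.0
            for _ in range(C):  # C independent successor samples for this action
                r, s_next = simulate(s, a)
                v_next = max(sparse_sampling_q(simulate, actions, s_next,
                                               h - 1, C, gamma).values())
                total += r + gamma * v_next
            q[a] = total / C
        return q

    # At the root: q = sparse_sampling_q(...); act greedily via max(q, key=q.get).

The per-call work is (|A|C)^h, independent of the number of states, which is the point of the method.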
- Kearns, M., Y. Mansour, and A. Ng. 2000. “Approximate Planning in Large POMDPs via Reusable Trajectories.” In Advances in Neural Information Processing Systems. http://www.robotics.stanford.edu/~ang/papers/pomdp-long.ps.
A follow-up to the earlier conference version of the above paper that uses separate trees with a single sampled successor state per action and proves the basic uniform convergence results. This gives query-complexity bounds that are independent of the size of the state space but exponential in the effective horizon.
- Munos, Rémi. 2014. “From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning.” Foundations and Trends® in Machine Learning 7 (1): 1–129.
Summarizes, in short book form, how to be more clever in building these trees. The goal is to avoid the exponential dependence on the planning horizon, which can only be done when the MDP is “nice”. This led to a number of non-uniform query/computational complexity results. Rémi also has amazing demonstrations for deterministic MDPs :)
- Lee, Wee S., Nan Rong, and David Hsu. 2008. “What Makes Some POMDP Problems Easy to Approximate?” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 689–96. Curran Associates, Inc.
Since we touched on POMDPs: a “classic”, perhaps overlooked, paper that defines a space whose covering number governs the hardness of computation in a POMDP. The algorithm is “point-based value iteration”; it is a close relative of Rust’s algorithm (see below).
- Lim, Michael H., Claire J. Tomlin, and Zachary N. Sunberg. 2019. “Sparse Tree Search Optimality Guarantees in POMDPs with Continuous Observation Spaces.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1910.04332.
Because continuous observation spaces are also real.
Local planning in smooth MDPs
- Rust, John. 1997. “Using Randomization to Break the Curse of Dimensionality.” Econometrica: Journal of the Econometric Society 65 (3): 487.
This effectively considers local planning in smooth MDPs, showing a separation between global and local planning: the Chow–Tsitsiklis paper (below) proved a lower bound for global planning over the same class that scales exponentially with the dimension of the state-action space.
- Szepesvári, C. 2001. “Efficient Approximate Planning in Continuous Space Markovian Decision Problems.” AI Communications. The European Journal on Artificial Intelligence 14 (January): 163–76.
Since the previous paper only bounded the bias of the value function estimate (which is at best an auxiliary question), there remained the question of whether the method can actually be used for efficient action computation. This is answered positively here.
- Chow, Chee-Seng, and John N. Tsitsiklis. 1989. “The Complexity of Dynamic Programming.” Journal of Complexity 5 (4): 466–88.
The paper mentioned above. Lower bound for global planning, but only for value function approximation. Todo: show that the same bound holds for policy computation.
Policy search
- Vlassis, Nikos, Michael L. Littman, and David Barber. 2012. “On the Computational Complexity of Stochastic Controller Optimization in POMDPs.” ACM Transactions on Computation Theory 4 (4): 12:1–12:8.
One can use a result from this paper to show that policy search over constant policies (policies that choose the action from the same probability distribution no matter the state) is NP-hard.
- Hansen, E. 1998. “Solving POMDPs by Searching in Policy Space.” In Proceedings of the Fourteenth Conference on …. http://dl.acm.org/citation.cfm?id=2074119.
Gives basic discretization results. I did not talk about this.
- Ng, Andrew Y., and Michael Jordan. 2000. “PEGASUS: A Policy Search Method for Large MDPs and POMDPs.” In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 406–15. UAI’00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Another paper that I did not talk about. It considers what is known as the scenario method in operations research. The question, in a way, is whether stochastic search problems are much harder than deterministic ones. The scenario method assumes a simulator with controlled random seeds; the idea is to fix finitely many seeds and approximate the value of a policy by its average return across those seeds (see the sketch below). Uniform convergence results are shown; smoothness is needed when there are infinitely many actions.
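A minimal sketch of the scenario idea as described above, assuming a hypothetical simulator step(state, action, rng) whose randomness comes entirely from the supplied generator (all names here are illustrative, not the paper's):

    import random

    def scenario_value(policy, step, init_state, seeds, horizon, gamma):
        # Approximate the value of `policy` by averaging its return over a fixed
        # set of random seeds ("scenarios"). Because the same seeds are reused
        # for every candidate policy, comparing policies becomes a deterministic
        # optimization problem.
        total = 0.0
        for seed in seeds:
            rng = random.Random(seed)      # all randomness of this scenario
            s, ret, discount = init_state, 0.0, 1.0
            for _ in range(horizon):
                a = policy(s)
                r, s = step(s, a, rng)     # simulator driven by the fixed rng
                ret += discount * r
                discount *= gamma
            total += ret
        return total / len(seeds)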
- Bagnell, J., S. Kakade, A. Ng, and J. Schneider. 2004. “Policy Search by Dynamic Programming.” In Advances in Neural Information Processing Systems. http://papers.nips.cc/paper/2378-policy-search-by-dynamic-programming.
Another paper I had to skip. Finite-horizon MDPs. The algorithm is given a distribution over states for each round. The goal is to reduce policy search to a sequence of “simpler” searches. It works backwards, finding for round t the policy that maximizes the value when starting from the given distribution and using the previously found policies for the remaining rounds (see the sketch below). An (additive) error bound is given on how well the resulting (nonstationary) policy does, as a function of how much the supplied distributions are mismatched with those of the optimal policy and of the accuracy of finding the best policy in each round.
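A sketch of the backward pass, with a hypothetical oracle best_policy(d, future) that (approximately) finds the best one-step policy for a round when the start-state distribution is d and the remaining rounds are played by the policies in future:

    def psdp(distributions, best_policy, horizon):
        # Policy Search by Dynamic Programming: build a nonstationary policy
        # backwards. distributions[t] is the state distribution supplied for
        # round t; best_policy(d, future) returns a policy (a function of the
        # state) that approximately maximizes the value from distribution d
        # when rounds t+1, ..., H-1 are played with the policies in `future`.
        policies = []
        for t in reversed(range(horizon)):
            pi_t = best_policy(distributions[t], policies)  # policies for t+1..H-1
            policies.insert(0, pi_t)                        # prepend for round t
        return policies  # policies[t] is the policy used in round t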
Policy gradients
- Williams, Ronald J. 1992. “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning 8 (3-4): 229–56.
The REINFORCE algorithm; the score-function (likelihood-ratio) gradient method as it appears in ML.
- Glynn, Peter W. 1990. “Likelihood Ratio Gradient Estimation for Stochastic Systems.” Communications of the ACM 33 (10): 75–84.
The same idea, from the OR literature.
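For reference, the likelihood-ratio (score-function) identity that both papers build on, written here in a standard episodic form (not a quote from either paper), with trajectory \tau, return R(\tau), and policy \pi_\theta:

    \nabla_\theta J(\theta)
      = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
      = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau)\,\sum_{t} \nabla_\theta \log \pi_\theta(A_t \mid S_t)\Big].

Both methods estimate the right-hand side from sampled trajectories, optionally subtracting a baseline to reduce variance.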
- Sutton, R., D. McAllester, S. Singh, and Y. Mansour. 2000. “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” In NIPS-12, 1057–63.
I did not talk about this; it gives an alternative expression for the gradient. The paper also introduces the notion of compatible function approximation, which I did not talk about either.
Check the bibliographies of these papers and the papers citing them! New papers appear every day!
- Bhandari, Jalaj, and Daniel Russo. 2019. “Global Optimality Guarantees For Policy Gradient Methods,” June. https://arxiv.org/abs/1906.01786v1.
- Bhandari, Jalaj, and Daniel Russo. 2020. “A Note on the Linear Convergence of Policy Gradient Methods.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2007.11120
- Fazel, Maryam, Rong Ge, Sham Kakade, and Mehran Mesbahi. 2018. “Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator.” Edited by Jennifer Dy and Andreas Krause, Proceedings of Machine Learning Research, 80: 1467–76.
- Agarwal, Alekh, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. 2019. “Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes,” August. https://arxiv.org/abs/1908.00261v2.
- Zhang, Junyu, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvari, and Mengdi Wang. 2020. “Variational Policy Gradient Method for Reinforcement Learning with General Utilities.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2007.02151.
- Mei, Jincheng, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. 2020. “On the Global Convergence Rates of Softmax Policy Gradient Methods.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2005.06392.
Value function approximation
- Whitt, Ward. 1978. “Approximations of Dynamic Programs, I.” Mathematics of Operations Research 3 (3): 231–43.
Classic paper on approximate dynamic programming. I did not cover any of this, but calculations similar to those in this paper come up all the time. The paper is about getting performance guarantees for the optimal policy of MDP A when that policy is used in MDP B; the goal is to show that the optimal policy of A is not bad in B when A and B are, in an appropriate sense, close to each other. A bit hard to read because of the archaic notation, but it's totally worth reading. Check out Theorem 3.1!
- Schweitzer, Paul J., and Abraham Seidmann. 1985. “Generalized Polynomial Approximations in Markovian Decision Processes.” Journal of Mathematical Analysis and Applications 110 (2): 568–82.
The paper that explicitly introduces linear value function approximation in the context of planning in MDPs. No theoretical results, just ideas. The paper introduces essentially least-squares policy iteration (LSPI, no Q-functions though), LSVI (least squares value iteration), and ALP (approximate linear programming).
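As a reminder (in standard modern notation, not a quote from the paper): the value function is approximated by a linear combination of fixed basis functions \phi_k, and ALP restricts the exact linear program of dynamic programming to this subspace,

    V_w(s) = \sum_{k=1}^{d} w_k\, \phi_k(s), \qquad
    \min_{w}\ \sum_{s} c(s)\, V_w(s)
    \quad \text{s.t.} \quad
    V_w(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_w(s') \quad \text{for all } s, a,

where c is a positive state-relevance weighting.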
- Jin, Chi, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. 2019. “Provably Efficient Reinforcement Learning with Linear Function Approximation,” July. https://arxiv.org/abs/1907.05388v2.
This is the paper that introduces linear MDPs (as discussed in the lecture; some other authors mean other things by linear MDPs). This is done in the context of exploration (exploration will be covered later). Earlier papers made weaker assumptions, like small/zero inherent Bellman error of the function space considered under all policies. Nevertheless, it is a nice discovery that this class of MDPs is ideal for policy iteration with linear function approximation. In particular, in linear MDPs the action-value function of any policy lies in the linear space spanned by the features, i.e., the action-value functions are all realizable.
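For concreteness, the linear MDP assumption in the sense of this paper (feature map \phi, unknown measures \mu and vector \theta), together with the realizability it implies:

    P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle, \qquad
    r(s, a) = \langle \phi(s,a), \theta \rangle
    \quad \Longrightarrow \quad
    Q^\pi(s,a) = \langle \phi(s,a), w_\pi \rangle \ \text{for some } w_\pi, \ \text{for every policy } \pi.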
- Du, Simon S., Sham M. Kakade, Ruosong Wang, and Lin F. Yang. 2019. “Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1910.03016.
This is the paper that shows that a slight departure from the realizability assumption (that is, that all Q^π lie in the linear span of the features) can cause trouble: the planner may need to issue as many queries as there are states even if realizability is only slightly violated.
- Lattimore, Tor, Csaba Szepesvari, and Gellert Weisz. 2019. “Learning with Good Feature Representations in Bandits and in RL with a Generative Model.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1911.07676.
This paper refined the argument of the previous one. The moral of the story is that demanding a precision as good as the realizability error is what ruins the query complexity: if we allow the final error to exceed the representation error by a factor related to the number of features, the blow-up disappears. The paper also introduces other refinements, such as that not all feature maps require the same error relaxation to avoid the query-complexity blow-up (e.g., state aggregation definitely won't need it). Another refinement is the treatment of exploration in unrealizable linear bandits.
- Shariff, Roshan, and Csaba Szepesvári. 2020. “Efficient Planning in Large MDPs with Weak Linear Function Approximation.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2007.06184.
While I did not talk about this paper, it is highly relevant: it considers linear function approximation but aims to address the case when the only assumption is that Q^* is close to the span of the features. In the end, a bit more is assumed: access to a set of “core” states such that the convex hull of the features at these states contains the features of all other states (see below). An ALP approach is followed and a fully polynomial-time algorithm is derived; the runtime depends on the number of core states (in addition to the number of features), though it is independent of the cardinality of the state space.
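In symbols, the core-state condition described above says, roughly (with \phi denoting the relevant feature vectors and s_1, \dots, s_m the core states; the notation is mine, not the paper's), that every feature vector is a convex combination of the core features:

    \phi(s) = \sum_{i=1}^{m} \lambda_i(s)\, \phi(s_i), \qquad
    \lambda_i(s) \ge 0, \quad \sum_{i=1}^{m} \lambda_i(s) = 1.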
Other notable sources
- Agarwal, Alekh, Nan Jiang, and Sham M. Kakade. 2019. Reinforcement Learning: Theory and Algorithms. https://rltheorybook.github.io/rl_monograph_AJK.pdf
Course notes. Covers a whole spectrum of things.
- Mausam, and Andrey Kolobov. 2012. “Planning with Markov Decision Processes: An AI Perspective.” Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (1): 1–210.
Connections to heuristic search, determinization, and the classical planning literature; the stochastic shortest path problem is really at the center (dead ends and whatnot). It is just missing theoretical results beyond the basics.