Course Outline (Very Tentative)
Module 1: Introduction to Markov Decision Processes and Dynamic Programming
Basic theory
General algorithms for MDPs with small state spaces
Value iteration (sketch after this module's topics)
Policy iteration
Linear programming
Asynchronous value iteration
Problem-specific structures:
We'll cover a few of the following examples: linear systems with quadratic costs, dynamic portfolio management, optimal stopping and myopic policies, scheduling and interchange arguments, and multi-armed bandit problems.
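To make the value iteration topic concrete, below is a minimal sketch for a finite MDP. The array layout (P[a, s, s'] for transition probabilities, r[a, s] for expected rewards) and the stopping tolerance are illustrative assumptions, not part of the course materials.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP (illustrative sketch).

    P: transition probabilities, shape (A, S, S); P[a, s, s'] = Pr(s' | s, a).
    r: expected rewards, shape (A, S).
    Returns the optimal value function and a greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q[a, s] = r[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        Q = r + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=0)
```

The loop stops once the sup-norm change in V falls below tol; because the Bellman backup is a gamma-contraction, this bounds the remaining distance to the optimal value function.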
Module 2: Estimating Long-Run Value
Estimating the value of a fixed policy
Monte Carlo vs temporal difference methods
Batch methods and incremental temporal difference learning
Value function approximation
Simultaneous control and value function estimation
Q-learning (sketch after this module's topics)
Divergence issues and approximate policy iteration
Experience replay and other improvements
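As a concrete companion to the Q-learning topic above, here is a minimal tabular sketch with epsilon-greedy exploration. The Gym-style environment interface (reset() returning a state, step() returning a (state, reward, done, info) tuple) and all hyperparameters are assumptions made for illustration.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s2, reward, done, _ = env.step(a)
            # Off-policy TD update toward the greedy one-step target
            target = reward + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```

Note the off-policy character: the behavior is epsilon-greedy, but the update bootstraps from the greedy value max_a Q[s2, a], which is one source of the divergence issues mentioned above when combined with function approximation.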
Module 3: Exploration
Bandit problems and the need for 'deep exploration' in RL
Optimistic algorithms
Thompson sampling and extensions (sketch after this module's topics)
Exploration via randomized value functions
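To illustrate the Thompson sampling topic, here is a minimal sketch for a Bernoulli bandit with independent Beta(1, 1) priors. The simulated arm means and horizon are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(arm_means, horizon=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit (illustrative sketch).

    arm_means: true success probabilities, used only to simulate rewards.
    Returns per-arm pull counts.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    successes = np.ones(k)  # Beta posterior alpha parameters
    failures = np.ones(k)   # Beta posterior beta parameters
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        # Sample a mean for each arm from its posterior, then play the argmax
        theta = rng.beta(successes, failures)
        a = int(theta.argmax())
        reward = rng.random() < arm_means[a]
        successes[a] += reward
        failures[a] += 1 - reward
        pulls[a] += 1
    return pulls
```

Acting greedily with respect to a posterior sample, rather than the posterior mean, is what drives exploration here: arms with uncertain posteriors occasionally sample high and get tried.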