Course Outline (Very Tentative)
Module 1: Introduction to Markov Decision Processes and Dynamic Programming
Basic theory
General algorithms for MDPs with small state spaces
Value iteration (sketch after this module's topics)
Policy iteration
Linear programming
Asynchronous value iteration
Problem-specific structures:
We'll cover a few of the following examples: linear systems with quadratic costs, dynamic portfolio management, optimal stopping and myopic policies, scheduling and interchange arguments, and multi-armed bandit problems.
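To make the value iteration topic concrete, below is a minimal sketch for a finite MDP. The array layout (P[a, s, s'] for transition probabilities, r[a, s] for expected rewards) and the stopping tolerance are illustrative assumptions, not part of the course materials.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP (illustrative sketch).

    P: transition probabilities, shape (A, S, S); P[a, s, s'] = Pr(s' | s, a).
    r: expected rewards, shape (A, S).
    Returns the optimal value function and a greedy policy.
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup: Q[a, s] = r[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        Q = r + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=0)
```

The loop stops once the sup-norm change in V falls below tol; because the Bellman backup is a gamma-contraction, this bounds the remaining distance to the optimal value function.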
Module 2: Estimating Long-Run Value
Estimating the value of a fixed policy
Monte Carlo vs temporal difference methods
Batch methods and incremental temporal difference learning
Value function approximation
Simultaneous control and value function estimation
Q-learning (sketch after this module's topics)
Divergence issues and approximate policy iteration
Experience replay and other improvements
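As a concrete companion to the Q-learning topic above, here is a minimal tabular sketch with epsilon-greedy exploration. The Gym-style environment interface (reset() returning a state, step() returning a (state, reward, done, info) tuple) and all hyperparameters are assumptions made for illustration.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s2, reward, done, _ = env.step(a)
            # Off-policy TD update toward the greedy one-step target
            target = reward + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```

Note the off-policy character: the behavior is epsilon-greedy, but the update bootstraps from the greedy value max_a Q[s2, a], which is one source of the divergence issues mentioned above when combined with function approximation.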
Module 3: Exploration
Bandit problems and the need for 'deep exploration' in RL
Optimistic algorithms
Thompson sampling and extensions (sketch after this module's topics)
Exploration via randomized value functions
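To illustrate the Thompson sampling topic, here is a minimal sketch for a Bernoulli bandit with independent Beta(1, 1) priors. The simulated arm means and horizon are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(arm_means, horizon=10_000, seed=0):
    """Thompson sampling for a Bernoulli bandit (illustrative sketch).

    arm_means: true success probabilities, used only to simulate rewards.
    Returns per-arm pull counts.
    """
    rng = np.random.default_rng(seed)
    k = len(arm_means)
    successes = np.ones(k)  # Beta posterior alpha parameters
    failures = np.ones(k)   # Beta posterior beta parameters
    pulls = np.zeros(k, dtype=int)
    for _ in range(horizon):
        # Sample a mean for each arm from its posterior, then play the argmax
        theta = rng.beta(successes, failures)
        a = int(theta.argmax())
        reward = rng.random() < arm_means[a]
        successes[a] += reward
        failures[a] += 1 - reward
        pulls[a] += 1
    return pulls
```

Acting greedily with respect to a posterior sample, rather than the posterior mean, is what drives exploration here: arms with uncertain posteriors occasionally sample high and get tried.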