
Markov Decision Processes and Dynamic Programming

Formalising the RL interface

Markov Decision Process (MDP)

A Markov decision process is a mathematical formulation of the agent-environment interaction: it formalises both the objective and how to achieve it. It rests on a simplifying assumption: the environment is fully observable.

Almost all RL problems can be formalised as MDPs, e.g.,

  • Optimal control primarily deals with continuous MDPs
  • Partially observable problems can be converted into MDPs
  • Bandits are MDPs with one state

Joint Distributions

Alternative Definition

Markov Property: the future is independent of the past given the present.

Markov Property in an MDP
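In the usual notation, the Markov property and its MDP version read:

```latex
% Markov property: the future is independent of the past given the present
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]
% In an MDP, the joint dynamics depend only on the current state and action
p(s', r \mid s, a) \doteq \mathbb{P}[S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a]
```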


Formalising the objective

Returns

Discounted Return

Most Markov decision processes are discounted
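Written out, the discounted return from time t is:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1]
```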

Policies

Goal of an RL agent

Value Functions

value function & (state-)action values
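In standard notation, the state-value and (state-)action-value functions of a policy π are:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s]
q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s,\, A_t = a]
```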

Optimal Value Function

The solution takes two forms: the optimal state-value function and the optimal action-value function.
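Both are defined by maximising over policies:

```latex
v_\ast(s) \doteq \max_\pi v_\pi(s), \qquad q_\ast(s, a) \doteq \max_\pi q_\pi(s, a)
```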

Optimal Policy

Define a partial ordering over policies
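The standard ordering compares policies state by state, which is why it is only partial:

```latex
\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \quad \text{for all } s
```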

Theorem (Optimal Policies)

For any MDP there exists an optimal policy π* that is better than or equal to all other policies (π* ≥ π for all π); all optimal policies achieve the optimal state-value function and the optimal action-value function; and there is always a deterministic optimal policy.

Finding an Optimal Policy

Bellman Equations

Value Function

Action values (also called state-action values)

Bellman Equations
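For a finite MDP, the Bellman expectation equations decompose each value into immediate reward plus the discounted value of the successor state:

```latex
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr]
```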

The Bellman Optimality Equations
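Replacing the expectation over the policy by a maximum over actions gives the optimality equations:

```latex
v_\ast(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\ast(s')\bigr]
q_\ast(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_\ast(s', a')\bigr]
```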


Problems in RL: prediction vs. control. Prediction means evaluating the value function of a given policy; control means finding the optimal policy.

Bellman Equation in Matrix Form
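For a fixed policy the Bellman expectation equation is linear, v = r + γPv, so small problems can be solved directly as v = (I - γP)^{-1} r. A minimal NumPy sketch, using a hypothetical 2-state MDP whose numbers are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy (made-up numbers):
# P[i, j] = probability of moving from state i to state j,
# r[i]    = expected immediate reward in state i.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 2.0])
gamma = 0.9

# Bellman expectation equation in matrix form: v = r + gamma * P v.
# Rearranged: (I - gamma * P) v = r, a linear system in v.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
```

The direct solve costs O(n^3) in the number of states, which is why larger problems need iterative methods.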

There are iterative methods for larger problems

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-Difference learning

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear

Cannot use the same direct matrix solution as for policy evaluation (in general)

Many iterative solution methods

Using models / dynamic programming:

  • Value iteration
  • Policy iteration

Using samples:

  • Monte Carlo
  • Q-learning
  • Sarsa

Dynamic Programming

"Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP)." — Sutton & Barto 2018

Dynamic programming methods solve MDPs. They consist of two important parts: policy evaluation and policy improvement.

Policy evaluation:

  • this algorithm always converges to the policy's value function (for discount γ < 1)
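Iterative policy evaluation applies the Bellman expectation backup repeatedly until the values stop changing. A minimal NumPy sketch, assuming the per-policy transition matrix and reward vector are given (hypothetical numbers):

```python
import numpy as np

def policy_evaluation(P, r, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation for a fixed policy.

    P[i, j] : probability that the next state is j given state i,
              with actions already marginalised out under the policy.
    r[i]    : expected immediate reward in state i under the policy.
    """
    v = np.zeros(len(r))
    while True:
        v_new = r + gamma * P @ v   # one synchronous Bellman expectation backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Hypothetical 2-state MDP with made-up dynamics and rewards.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
r = np.array([1.0, 2.0])
v = policy_evaluation(P, r)
```

Each sweep contracts the error by a factor of at most γ, which is the sense in which the algorithm always converges.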

Policy Improvement:
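Policy improvement acts greedily with respect to the current value function:

```latex
\pi'(s) = \operatorname*{arg\,max}_a q_\pi(s, a)
        = \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```

The policy improvement theorem guarantees the greedy policy is no worse than the old one in every state.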

Policy Iteration

Value Iteration
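Value iteration repeats the Bellman optimality backup directly, without an explicit policy in the loop. A minimal NumPy sketch, using a hypothetical 2-state, 2-action MDP whose dynamics are made up for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration via repeated Bellman optimality backups.

    P[a, i, j] : probability of next state j given state i and action a.
    R[a, i]    : expected immediate reward for taking action a in state i.
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v          # q[a, i]: value of taking a in state i
        v_new = q.max(axis=0)          # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)  # optimal values and greedy policy
        v = v_new

# Hypothetical 2-state, 2-action MDP (made-up numbers).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay put
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch states
R = np.array([[0.0, 2.0],                  # action 0 rewards per state
              [1.0, 0.0]])                 # action 1 rewards per state
v, policy = value_iteration(P, R)
```

In this toy MDP the greedy policy switches out of state 0 (collecting reward 1) and then stays in state 1 forever (reward 2 per step), so v converges to (19, 20) with γ = 0.9.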

Preliminaries → Functional Analysis

Normed Vector Spaces

Contraction Mapping

Fixed point

Banach Fixed Point Theorem
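The convergence results above rest on the following standard statement:

```latex
\textbf{Banach Fixed Point Theorem.}\ \text{Let } (X, d) \text{ be a complete metric space and }
T : X \to X \text{ a contraction, i.e. } d(Tu, Tv) \le \gamma\, d(u, v)
\text{ for some } \gamma \in [0, 1).
\text{Then } T \text{ has a unique fixed point } x^\ast,
\text{ and } T^n x_0 \to x^\ast \text{ for every } x_0 \in X.
```

In RL, the Bellman expectation and optimality operators are γ-contractions in the max-norm, so policy evaluation and value iteration each converge to a unique fixed point (the value function of the policy, and the optimal value function, respectively).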

  • Title:
  • Author: wy
  • Created at: 2023-07-20 15:03:26
  • Updated at: 2023-07-22 22:39:00
  • Link: https://yuuee-www.github.io/blog/2023/07/20/RL/step3/RLstep3/
  • License: This work is licensed under CC BY-NC-SA 4.0.