hola

hola

wy Lv3

2023-07-20 15:03:26 2023-07-20 15:03:26 Created 2023-07-22 22:39 2023-07-22 22:39 Updated

Markov Decision Processes and Dynamic Programming

Formalising the RL interface

Markov Decision Process (MDP)

a mathematical formulation of the agent-environment interaction

the objective and how to achieve it

a simplifying assumption

assume the environment is fully observable

Almost all RL problems can be formalised as MDPs, e.g.,

Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state

Joint Distributions

Alternative Definition

Markov Property The future is independent of the past given the present

Markov Property in a MDP

Formalising the objective

Returns

Discounted Return

Most Markov decision processes are discounted

Policies

Goal of an RL agent

Value Functions

value function & (state-)action values

Optimal Value Function

solution | state-value function | action-value function

Optimal Policy

Define a partial ordering over policies

Theorem (Optimal Policies)

Finding an Optimal Policy

Bellman Equations

Value Function

Action values ｜ state-action values

Bellman Equations

The Bellman Optimality Equations

Problems in RL prediction vs control

Bellman Equation in Matrix Form

There are iterative methods for larger problems

Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear

Cannot use the same direct matrix solution as for policy optimisation (in general)

Many iterative solution methods

Using models / dynamic programming

Value iteration

Policy iteration

Using samples

Monte Carlo

Q-learning

Sarsa

Dynamic Programming

Dynamic programming refers to a collection of algorithms that can be used

to compute optimal policies given a perfect model of the environment as a

Markov decision process (MDP). — Sutton & Barto 2018

dynamic programming methods:

solve MDPs

two important parts: policy evaluation and policy improvement

Policy evaluation:

this algorithm always converges

Policy Improvement:

Policy Iteration

Value Iteration

…

Preliminaries → Functional Analysis

Normed Vector Spaces

Contraction Mapping

Fixed point

Banach Fixed Point Theorem

…

Title:
Author: wy
Created at : 2023-07-20 15:03:26
Updated at : 2023-07-22 22:39:00
Link: https://yuuee-www.github.io/blog/2023/07/20/RL/step3/RLstep3/
License: This work is licensed under CC BY-NC-SA 4.0.

Comments

On this page

Returns
Policies
Value Functions
Bellman Equations
Solving the Bellman Optimality Equation
Preliminaries → Functional Analysis