
Policy Gradients and Actor Critics

The three approaches: model-based RL, value-based RL, and policy-based RL.

Policy-Based Reinforcement Learning

In the previous posts on model-free reinforcement learning, we approximated the value function (or action-value function) and derived a policy from it. Now, we parametrise the policy directly.
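As a concrete illustration of parametrising the policy directly, here is a minimal sketch assuming a linear softmax parametrisation over state features; the shapes and names are illustrative, not from the original notes.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities of a linearly parametrised softmax policy.

    theta : (num_actions, num_features) parameter matrix (assumed shape)
    phi_s : (num_features,) feature vector for the current state
    """
    prefs = theta @ phi_s                 # action preferences h(s, a; theta)
    prefs = prefs - prefs.max()           # numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()    # pi(a | s)

# Usage: sample an action from the stochastic policy
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))           # 3 actions, 4 features (illustrative)
phi_s = rng.normal(size=4)
probs = softmax_policy(theta, phi_s)
action = rng.choice(len(probs), p=probs)
```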

Value-based and policy-based RL: terminology

Stochastic policies

Policy Objective Functions

Episodic: $J(\theta) = \mathbb{E}\!\left[ G_0 \mid \pi_\theta \right] = \mathbb{E}\!\left[ v_{\pi_\theta}(S_0) \right]$, the expected return from the start of an episode.

Average Reward: $J(\theta) = \sum_{s} d_{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, \mathbb{E}\!\left[ R_{t+1} \mid S_t = s, A_t = a \right]$, the expected reward per step under the stationary state distribution $d_{\pi_\theta}$ induced by the policy.

Policy Gradients

Policy Optimisation

Policy Gradient

Gradients on parametrised policies

Contextual Bandits Policy Gradient

We cannot sample the gradient of the expectation directly; use the identity $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$ instead.

REINFORCE (Williams, 1992)

The right-hand side gives an expected gradient that can be sampled:

The score function trick
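Written out for the one-step (contextual bandit) case, the trick turns the gradient of an expectation into an expectation of a gradient:

$$
\nabla_\theta \mathbb{E}\big[ R \mid S = s \big]
= \nabla_\theta \sum_{a} \pi_\theta(a \mid s)\, \mathbb{E}\big[ R \mid S = s, A = a \big]
= \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, \mathbb{E}\big[ R \mid S = s, A = a \big]
= \mathbb{E}\big[ R\, \nabla_\theta \log \pi_\theta(A \mid s) \,\big|\, S = s \big] .
$$

Sampling $A \sim \pi_\theta(\cdot \mid s)$ and observing $R$ therefore gives an unbiased estimate $R\, \nabla_\theta \log \pi_\theta(A \mid s)$ of the gradient.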

Contextual Bandit Policy Gradient
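A minimal sketch of the resulting update for a contextual bandit, assuming the linear softmax parametrisation from the earlier sketch; `sample_reward` and the step size are hypothetical placeholders.

```python
import numpy as np

def softmax(prefs):
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_softmax(theta, phi_s, action):
    """Score function of a linear softmax policy:
    d log pi(a|s) / d theta_b = (1[a == b] - pi(b|s)) * phi(s) for each action row b."""
    probs = softmax(theta @ phi_s)
    indicator = np.zeros_like(probs)
    indicator[action] = 1.0
    return np.outer(indicator - probs, phi_s)        # same shape as theta

def reinforce_bandit_step(theta, phi_s, sample_reward, rng, step_size=0.1):
    """One REINFORCE update for a contextual bandit: sample A ~ pi_theta(.|s),
    observe R, then ascend the sampled gradient R * grad log pi(A|s)."""
    probs = softmax(theta @ phi_s)
    action = rng.choice(len(probs), p=probs)
    reward = sample_reward(phi_s, action)            # hypothetical reward function
    theta = theta + step_size * reward * grad_log_softmax(theta, phi_s, action)
    return theta, action, reward
```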

Policy gradients: reduce variance

Example: Softmax Policy
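For a softmax policy over linear action preferences, $\pi_\theta(a \mid s) \propto \exp\!\big(\theta^\top \phi(s,a)\big)$ (an assumed parametrisation), the score function has a simple closed form:

$$
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s, a) - \sum_{b} \pi_\theta(b \mid s)\, \phi(s, b) .
$$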

Policy Gradient Theorem

Policy gradient theorem (episodic): $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, q_{\pi_\theta}(S_t, A_t) \right]$, so the gradient only needs the score of the policy and the action values under the current policy.

Episodic policy gradients algorithm:
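A minimal sketch of the episodic algorithm (REINFORCE): run an episode under the current policy, compute Monte Carlo returns, then ascend the sampled gradient. It reuses `softmax` and `grad_log_softmax` from the contextual-bandit sketch above, and the `env`/`features` interface is an illustrative assumption.

```python
import numpy as np

def reinforce_episode(env, theta, features, rng, gamma=0.99, step_size=0.01):
    """One episode of REINFORCE with a linear softmax policy.

    env      : assumed interface, reset() -> state and step(a) -> (state, reward, done)
    features : function mapping a state to its feature vector phi(s)
    """
    trajectory = []                                   # (phi_s, action, reward) per step
    state, done = env.reset(), False
    while not done:
        phi_s = features(state)
        probs = softmax(theta @ phi_s)
        action = rng.choice(len(probs), p=probs)
        state, reward, done = env.step(action)
        trajectory.append((phi_s, action, reward))

    # Monte Carlo returns G_t, accumulated backwards through the episode
    G, returns = 0.0, []
    for _, _, reward in reversed(trajectory):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    # Gradient ascent: theta <- theta + alpha * G_t * grad log pi(A_t | S_t)
    for (phi_s, action, _), G_t in zip(trajectory, returns):
        theta = theta + step_size * G_t * grad_log_softmax(theta, phi_s, action)
    return theta
```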

Policy gradient theorem (average reward): $\nabla_\theta J(\theta) = \mathbb{E}_{S \sim d_{\pi_\theta},\, A \sim \pi_\theta(\cdot \mid S)}\!\left[ \nabla_\theta \log \pi_\theta(A \mid S)\, q_{\pi_\theta}(S, A) \right]$, where $d_{\pi_\theta}$ is the stationary state distribution and $q_{\pi_\theta}$ the differential action-value function.

Alternatively (but equivalently)

Policy gradients: reduce variance
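The standard device, not spelled out in the notes above, is to subtract a state-dependent baseline $b(s)$; this leaves the expected gradient unchanged because $\sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$, but it can greatly reduce the variance of the samples:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \big(q_{\pi_\theta}(S_t, A_t) - b(S_t)\big)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right],
\qquad \text{commonly } b(s) = v_{\pi_\theta}(s) .
$$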


Actor Critics

Critics

A critic is a value function for the current policy, learnt via policy evaluation (for example, Monte Carlo or TD learning of $v_w \approx v_{\pi_\theta}$ or $q_w \approx q_{\pi_\theta}$).

Actor-Critic
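A minimal sketch of a one-step actor-critic, assuming a linear critic $v_w(s) = w^\top \phi(s)$ and the linear softmax actor from the earlier sketches (it reuses `grad_log_softmax`); names and step sizes are illustrative.

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, action, reward, phi_next, done,
                      gamma=0.99, alpha_actor=0.01, alpha_critic=0.1):
    """One transition of a one-step actor-critic with linear function approximation.

    Critic: v_w(s) = w . phi(s), updated by semi-gradient TD(0) policy evaluation.
    Actor:  linear softmax policy, updated along delta * grad log pi(a|s).
    """
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_next
    td_error = reward + gamma * v_next - v_s          # delta_t, also an advantage estimate

    w = w + alpha_critic * td_error * phi_s                                           # critic update
    theta = theta + alpha_actor * td_error * grad_log_softmax(theta, phi_s, action)   # actor update
    return theta, w, td_error
```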

Policy gradient variations

This is different from supervised learning, where the data distribution is fixed and does not depend on what is being learnt.

Increasing robustness with trust regions
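The notes do not give a formula here; one common instantiation of the idea is a trust-region constraint in the style of TRPO, which limits how far the new policy can move away from the policy that generated the data:

$$
\max_{\theta}\; \mathbb{E}\!\left[ \frac{\pi_\theta(A \mid S)}{\pi_{\theta_{\text{old}}}(A \mid S)}\, \hat{A}(S, A) \right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[ \mathrm{KL}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid S)\,\|\, \pi_\theta(\cdot \mid S)\big) \right] \le \delta ,
$$

where $\hat{A}$ is an advantage estimate and states and actions are sampled under the old policy.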


Continuous action spaces

Gaussian policy

Policy gradient with Gaussian policy
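For a Gaussian policy with a parametrised mean and fixed standard deviation, say $\pi_\theta(a \mid s) = \mathcal{N}\!\big(a;\ \mu_\theta(s),\ \sigma^2\big)$ (an assumed parametrisation), the score function needed for the policy gradient is:

$$
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^{2}}\, \nabla_\theta \mu_\theta(s) .
$$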

Gradient ascent on value

Continuous actor-critic learning automaton (Cacla)
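A minimal sketch of the Cacla idea: the critic is a TD-learned value function, and the actor is moved towards the action actually taken only when the temporal-difference error is positive, i.e. when the exploratory action did better than expected. Linear function approximation, Gaussian exploration, and all names and step sizes are illustrative assumptions.

```python
import numpy as np

def cacla_action(theta, phi_s, rng, sigma=0.3):
    """Actor output mu(s) = theta . phi(s), with Gaussian exploration around it."""
    mu = theta @ phi_s
    return mu + sigma * rng.normal()

def cacla_update(theta, w, phi_s, action, reward, phi_next, done,
                 gamma=0.99, alpha_actor=0.01, alpha_critic=0.1):
    """Cacla: TD(0) critic, and an actor moved towards the taken action
    only when the TD error is positive."""
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_next
    td_error = reward + gamma * v_next - v_s

    w = w + alpha_critic * td_error * phi_s                       # critic: policy evaluation
    if td_error > 0:
        mu = theta @ phi_s
        theta = theta + alpha_actor * (action - mu) * phi_s       # move mu(s) towards the taken action
    return theta, w
```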

  • Title: Policy Gradients and Actor Critics
  • Author: wy
  • Created at: 2023-07-23 17:07:17
  • Updated at: 2023-07-23 17:21:24
  • Link: https://yuuee-www.github.io/blog/2023/07/23/RL/step7/RLstep7/
  • License: This work is licensed under CC BY-NC-SA 4.0.