
Policy Gradients and Actor Critics
Approaches to reinforcement learning: model-based RL, value-based RL, and policy-based RL.
Policy-Based Reinforcement Learning
In previous posts we considered model-free reinforcement learning, in which we learn value functions and derive a policy from them. Now, we parametrise the policy directly.
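As a minimal formalisation (standard notation, not specific to this post), a parametrised policy maps each state to a distribution over actions:

$$\pi_\theta(a \mid s) = p(A_t = a \mid S_t = s, \theta),$$

and the goal is to adjust $\theta$ so that following $\pi_\theta$ yields high value.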

Value-based and policy-based RL: terminology

Value-based methods learn a value function and derive a policy from it (e.g., by acting greedily). Policy-based methods learn a parametrised policy directly, possibly without learning any value function. Actor-critic methods learn both: a policy (the actor) and a value function (the critic).

Stochastic policies

Policy Objective Functions

Episodic: maximise the expected total (discounted) return from the start of an episode.

Average reward: maximise the average reward per time step in continuing problems. Both objectives are sketched below.
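A sketch of these two objectives in standard notation (the exact definitions in the original slides may differ slightly):

$$J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \gamma^{t} R_{t+1}\Big] \qquad \text{(episodic)},$$

$$\rho(\theta) = \lim_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{\pi_\theta}\Big[\sum_{t=1}^{T} R_{t}\Big] \qquad \text{(average reward)}.$$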

Policy Gradients
Policy Optimisation

Policy Gradient

Gradients on parametrised policies
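Policy optimisation can then be framed as gradient ascent on the objective; the standard update (assuming a step size $\alpha$) is

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta J(\theta_t).$$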

Contextual Bandit Policy Gradient

In the contextual bandit setting the objective is the expected reward under the policy. We cannot sample the gradient of this expectation directly, because the distribution over actions itself depends on the policy parameters, so we use an identity instead: the score function trick, which underlies REINFORCE (Williams, 1992). The right-hand side of this identity gives an expected gradient that can be sampled, as shown below.
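A sketch of the score-function (log-likelihood-ratio) identity for the bandit objective, in standard notation:

$$\nabla_\theta\, \mathbb{E}\big[R(S, A)\big] = \mathbb{E}\big[R(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S)\big],$$

so $R_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ is an unbiased sample of the gradient.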


Policy gradients: reduce variance
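One standard way to reduce the variance of this estimator is to subtract a baseline $b(s)$ that does not depend on the action; this leaves the expected gradient unchanged because

$$\mathbb{E}\big[b(S)\,\nabla_\theta \log \pi_\theta(A \mid S)\big] = \mathbb{E}\Big[b(S) \sum_a \nabla_\theta\, \pi_\theta(a \mid S)\Big] = \mathbb{E}\big[b(S)\, \nabla_\theta 1\big] = 0,$$

so we can use $(R_t - b(S_t))\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)$ instead.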

Example: softmax policy
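As a concrete example (standard softmax parametrisation, assuming action preferences $h_\theta(s, a)$):

$$\pi_\theta(a \mid s) = \frac{e^{h_\theta(s,a)}}{\sum_b e^{h_\theta(s,b)}}, \qquad \nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta h_\theta(s, a) - \sum_b \pi_\theta(b \mid s)\, \nabla_\theta h_\theta(s, b).$$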

Policy Gradient Theorem

Policy gradient theorem (episodic):
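A sketch of the episodic policy gradient theorem in one common form (notation may differ from the original slides):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\Big],$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_{k+1}$ is the return from time $t$; the return can equivalently be replaced by the action value $q_{\pi_\theta}(S_t, A_t)$.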

Episodic policy gradients algorithm:
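The following is a minimal sketch of the episodic (Monte Carlo) policy gradient algorithm, REINFORCE, with a tabular softmax policy on a hypothetical corridor task; the environment, names, and hyperparameters are illustrative and not taken from the post.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2   # corridor of 5 states; actions: 0 = left, 1 = right
GAMMA, ALPHA = 0.99, 0.1

def softmax_probs(theta, s):
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def run_episode(theta, rng):
    """Roll out one episode; reward +1 only on reaching the right end of the corridor."""
    s, trajectory = 0, []
    for _ in range(50):                 # cap on episode length
        probs = softmax_probs(theta, s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        trajectory.append((s, a, r))
        s = s_next
        if r > 0:
            break
    return trajectory

def reinforce_update(theta, trajectory):
    """theta <- theta + alpha * gamma^t * G_t * grad log pi(A_t | S_t)."""
    returns, G = [], 0.0
    for (_, _, r) in reversed(trajectory):   # compute returns backwards
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    for t, ((s, a, _), G_t) in enumerate(zip(trajectory, returns)):
        probs = softmax_probs(theta, s)
        grad_log = -probs                    # d log pi(a|s) / d h(s, .) = e_a - pi(.|s)
        grad_log[a] += 1.0
        theta[s] += ALPHA * (GAMMA ** t) * G_t * grad_log
    return theta

rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))
for _ in range(500):
    theta = reinforce_update(theta, run_episode(theta, rng))
print("P(right) per state:", [round(softmax_probs(theta, s)[1], 2) for s in range(N_STATES)])
```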

Policy gradient theorem (average reward):
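A sketch of the average-reward version, in one common form (assuming the expectation is taken under the stationary distribution induced by $\pi_\theta$, with $q_{\pi_\theta}$ the differential action values):

$$\nabla_\theta\, \rho(\theta) = \mathbb{E}\big[q_{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S)\big].$$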

The same gradient can also be written in alternative (but equivalent) forms.

Policy gradients: reduce variance
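In the full RL case, a common variance-reduction strategy (a sketch, not necessarily the exact form in the original slides) is to subtract a state-value baseline from the action value, giving the advantage, or to use the TD error as a sampled estimate of the advantage:

$$\nabla_\theta J(\theta) \propto \mathbb{E}\big[\big(q_\pi(S,A) - v_\pi(S)\big)\, \nabla_\theta \log \pi_\theta(A \mid S)\big], \qquad \delta_t = R_{t+1} + \gamma\, v(S_{t+1}) - v(S_t).$$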

Actor Critics
Critics
A critic is a value function, learnt via policy evaluation:
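For example, a critic $v_{\mathbf{w}}(s) \approx v_{\pi_\theta}(s)$ can be learnt with TD(0) (one standard choice; other policy-evaluation methods work as well):

$$\mathbf{w} \leftarrow \mathbf{w} + \beta\,\big(R_{t+1} + \gamma\, v_{\mathbf{w}}(S_{t+1}) - v_{\mathbf{w}}(S_t)\big)\, \nabla_{\mathbf{w}}\, v_{\mathbf{w}}(S_t).$$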

Actor-Critic
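Below is a minimal sketch of a one-step actor-critic, combining a TD(0) critic with a policy-gradient actor that uses the TD error as an advantage estimate. It reuses the same hypothetical corridor task as the REINFORCE example; all names and hyperparameters are illustrative.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2
GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.99, 0.1, 0.2

def softmax_probs(theta, s):
    prefs = theta[s] - theta[s].max()     # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))   # actor: action preferences h(s, a)
v = np.zeros(N_STATES)                    # critic: state-value estimates

for _ in range(500):
    s = 0
    for _step in range(50):
        probs = softmax_probs(theta, s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next = max(0, s - 1) if a == 0 else s + 1
        done = (s_next == N_STATES - 1)
        r = 1.0 if done else 0.0

        # Critic: TD(0) policy evaluation of the current policy.
        td_error = r + (0.0 if done else GAMMA * v[s_next]) - v[s]
        v[s] += ALPHA_CRITIC * td_error

        # Actor: policy-gradient step, using the TD error as an advantage estimate.
        grad_log = -probs                 # d log pi(a|s) / d h(s, .) = e_a - pi(.|s)
        grad_log[a] += 1.0
        theta[s] += ALPHA_ACTOR * td_error * grad_log

        s = s_next
        if done:
            break

print("P(right) per state:", [round(softmax_probs(theta, s)[1], 2) for s in range(N_STATES)])
```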

Policy gradient variations
In policy-gradient methods the data distribution depends on the current policy, so updating the policy also changes the data we will collect. This is different from supervised learning, where the data distribution is independent of learning.
Increasing robustness with trust regions
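One way to make updates more robust (a sketch of the trust-region idea, as used in algorithms such as TRPO; not necessarily the formulation in the original slides) is to limit how much the policy may change in a single update, for instance by constraining the KL divergence between consecutive policies:

$$\max_\theta\; \mathbb{E}\Big[\frac{\pi_\theta(A \mid S)}{\pi_{\theta_{\text{old}}}(A \mid S)}\, \hat{A}(S, A)\Big] \quad \text{subject to} \quad \mathbb{E}\big[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid S)\,\|\, \pi_\theta(\cdot \mid S)\big)\big] \le \delta.$$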

Continuous action spaces
Gaussian policy

Policy gradient with Gaussian policy
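A common choice for continuous actions (a sketch with a fixed standard deviation $\sigma$, where the mean $\mu_\theta(s)$ is the learnt part) is

$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\, \mu_\theta(s),\, \sigma^2\big), \qquad \nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2}\, \nabla_\theta\, \mu_\theta(s),$$

which can be plugged directly into the policy gradient updates above.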

Gradient ascent on value
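An alternative for continuous actions (a sketch of the chain-rule idea used in deterministic policy gradient methods such as DPG; this may not be exactly the form in the original slides) is to follow the gradient of an estimated action value with respect to the action, through the policy:

$$\nabla_\theta\, q\big(s, \pi_\theta(s)\big) = \nabla_\theta\, \pi_\theta(s)\; \nabla_a\, q(s, a)\big|_{a = \pi_\theta(s)}.$$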

Continuous actor-critic learning automaton (Cacla)
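A minimal sketch of the Cacla update (van Hasselt & Wiering): the critic is learnt with TD(0), and the actor mean is moved towards the action actually taken only when the TD error is positive, i.e. when the action turned out better than expected. The tabular, one-dimensional setup and all names are illustrative.

```python
import numpy as np

def cacla_step(mu, v, s, a, r, s_next, done, gamma=0.99, alpha=0.1, beta=0.2):
    """One Cacla update for a tabular actor mu[s] (1-d action mean) and critic v[s]."""
    td_target = r + (0.0 if done else gamma * v[s_next])
    td_error = td_target - v[s]
    v[s] += beta * td_error           # critic: TD(0) policy evaluation
    if td_error > 0:                  # actor: only learn from actions that turned out
        mu[s] += alpha * (a - mu[s])  # better than expected; move the mean towards them
    return mu, v

# Usage sketch: during interaction, sample a = mu[s] + sigma * rng.standard_normal()
# and feed each observed transition into cacla_step.
mu, v = np.zeros(5), np.zeros(5)
mu, v = cacla_step(mu, v, s=0, a=0.7, r=0.0, s_next=1, done=False)
```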

- Title: Policy Gradients and Actor Critics
- Author: wy
- Created at: 2023-07-23 17:07:17
- Updated at: 2023-07-23 17:21:24
- Link: https://yuuee-www.github.io/blog/2023/07/23/RL/step7/RLstep7/
- License: This work is licensed under CC BY-NC-SA 4.0.