Policy Gradients and Actor Critics
Approaches to reinforcement learning: model-based RL, value-based RL, and policy-based RL.
Policy-Based Reinforcement Learning
In the previous steps on model-free reinforcement learning we approximated value functions and derived policies from them (e.g. greedily). Now, we parametrise the policy directly and adjust its parameters to maximise value.
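Concretely, the policy becomes an explicit parametric function of the state. A standard way to write this (a sketch; $\theta$ denotes the policy parameters) is

$$
\pi_\theta(a \mid s) = p(a \mid s, \theta),
$$

and the goal is then to adjust $\theta$ so that the behaviour it induces collects as much reward as possible.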
Value-based and policy-based RL: terminology
Value-based methods learn a value function and derive the policy from it; policy-based methods learn the policy directly; actor-critic methods learn both a policy (the actor) and a value function (the critic).
Stochastic policies
Policy Objective Functions
Episodic: maximise the expected (discounted) return from the start of an episode.
Average Reward: in continuing problems, maximise the long-run average reward per time step.
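One common way to make these two objectives precise (a sketch; $G_0$ is the return from the start of an episode, and the second form assumes an ergodic continuing problem) is

$$
J_{\text{episodic}}(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ G_0 \right] = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \gamma^t R_{t+1} \right],
\qquad
J_{\text{avg}}(\theta) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=1}^{T} R_t \right].
$$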
Policy Gradients
Policy Optimisation
Policy Gradient
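At its core, gradient-based policy optimisation is plain gradient ascent on the objective (a sketch; $\alpha$ is a step size):

$$
\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta J(\theta_t).
$$

The rest of the post is about how to estimate $\nabla_\theta J(\theta)$ from sampled experience.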
Gradients on parameterized policies
Contextual Bandits Policy Gradient
In a contextual bandit the objective is the expected reward under the policy, $J(\theta) = \mathbb{E}[R \mid \pi_\theta]$. We cannot sample its gradient directly, because the expectation itself depends on $\theta$, so we use the identity below instead.
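The identity in question is the likelihood-ratio (score-function) identity; one standard form (a sketch in the notation above) is

$$
\nabla_\theta\, \mathbb{E}\!\left[ R \mid \pi_\theta \right]
= \mathbb{E}\!\left[ R\, \nabla_\theta \log \pi_\theta(A \mid S) \right].
$$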
REINFORCE (Williams, 1992)
The right-hand side gives an expected gradient that can be sampled:
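Sampling $A_t \sim \pi_{\theta_t}(\cdot \mid S_t)$ and observing the reward $R_t$ gives the stochastic-gradient-ascent update (a sketch, with step size $\alpha$)

$$
\theta_{t+1} = \theta_t + \alpha\, R_t\, \nabla_\theta \log \pi_{\theta_t}(A_t \mid S_t).
$$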
The score function trick
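A short derivation for the bandit case (a sketch, assuming we may swap gradient and sum and that $\pi_\theta(a \mid s) > 0$):

$$
\nabla_\theta\, \mathbb{E}\!\left[ R \mid s \right]
= \nabla_\theta \sum_a \pi_\theta(a \mid s)\, \mathbb{E}[R \mid s, a]
= \sum_a \pi_\theta(a \mid s)\, \mathbb{E}[R \mid s, a]\, \nabla_\theta \log \pi_\theta(a \mid s)
= \mathbb{E}\!\left[ R\, \nabla_\theta \log \pi_\theta(A \mid s) \right],
$$

using $\nabla_\theta \pi_\theta = \pi_\theta\, \nabla_\theta \log \pi_\theta$.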
Contextual Bandit Policy Gradient
Policy gradients: reduce variance
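The standard first step (a sketch): any baseline $b(s)$ that does not depend on the action can be subtracted from the reward without biasing the gradient, because

$$
\mathbb{E}\!\left[ b(S)\, \nabla_\theta \log \pi_\theta(A \mid S) \right]
= \mathbb{E}\!\left[ b(S) \sum_a \nabla_\theta\, \pi_\theta(a \mid S) \right]
= \mathbb{E}\!\left[ b(S)\, \nabla_\theta 1 \right] = 0,
$$

so we may use $\mathbb{E}\!\left[ (R - b(S))\, \nabla_\theta \log \pi_\theta(A \mid S) \right]$ instead, which can have much lower variance.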
Example: Softmax Policy
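A common concrete choice (a sketch; $h(s, a, \theta)$ denotes action preferences, e.g. linear in features or given by a neural network):

$$
\pi_\theta(a \mid s) = \frac{e^{h(s, a, \theta)}}{\sum_b e^{h(s, b, \theta)}},
\qquad
\nabla_\theta \log \pi_\theta(a \mid s)
= \nabla_\theta h(s, a, \theta) - \sum_b \pi_\theta(b \mid s)\, \nabla_\theta h(s, b, \theta).
$$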
Policy Gradient Theorem
Policy gradient theorem (episodic):
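One common statement of the theorem (a sketch; $q^{\pi_\theta}$ is the action-value function of the current policy, and the expectation is over trajectories generated by $\pi_\theta$):

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T} \gamma^t\, q^{\pi_\theta}(S_t, A_t)\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right].
$$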
Episodic policy gradients algorithm:
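Below is a minimal Monte-Carlo sketch of such an algorithm in Python. The `env` interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`), the `features(state)` encoder, and all hyperparameters are assumptions for illustration, not taken from the original slides; the policy is a linear softmax.

```python
import numpy as np

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    prefs = prefs - prefs.max()
    e = np.exp(prefs)
    return e / e.sum()

def reinforce(env, features, n_features, n_actions,
              episodes=1000, alpha=0.01, gamma=0.99):
    """Monte-Carlo policy gradient with a linear softmax policy (illustrative sketch)."""
    theta = np.zeros((n_actions, n_features))   # preference weights per action
    for _ in range(episodes):
        # 1. Generate one episode by following the current policy.
        xs, acts, rews = [], [], []
        s, done = env.reset(), False
        while not done:
            x = features(s)
            probs = softmax(theta @ x)
            a = np.random.choice(n_actions, p=probs)
            s, r, done = env.step(a)
            xs.append(x); acts.append(a); rews.append(r)
        # 2. Compute returns G_t backwards from the end of the episode.
        returns, G = [], 0.0
        for r in reversed(rews):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # 3. One policy-gradient update per visited time step.
        for t, (x, a, G_t) in enumerate(zip(xs, acts, returns)):
            probs = softmax(theta @ x)
            # Score of the softmax policy: grad log pi(a|s) w.r.t. theta.
            grad_log_pi = -np.outer(probs, x)
            grad_log_pi[a] += x
            # gamma**t matches the episodic theorem; it is often dropped in practice.
            theta += alpha * (gamma ** t) * G_t * grad_log_pi
    return theta
```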
Policy gradient theorem (average reward):
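One common way to write the average-reward version (a sketch; $d^{\pi_\theta}$ is the stationary state distribution under the policy, and $q^{\pi_\theta}$ the corresponding differential action-value function):

$$
\nabla_\theta J_{\text{avg}}(\theta)
= \sum_s d^{\pi_\theta}(s) \sum_a q^{\pi_\theta}(s, a)\, \nabla_\theta\, \pi_\theta(a \mid s).
$$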
Alternatively (but equivalently), the same gradient can be written as an expectation over the states and actions encountered while following the policy.
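In symbols (same sketch-level caveat and notation as above):

$$
\nabla_\theta J_{\text{avg}}(\theta)
= \mathbb{E}_{S \sim d^{\pi_\theta},\, A \sim \pi_\theta(\cdot \mid S)}\!\left[ q^{\pi_\theta}(S, A)\, \nabla_\theta \log \pi_\theta(A \mid S) \right],
$$

which is the form that can be estimated directly from samples generated while following $\pi_\theta$.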
Policy gradients: reduce variance
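Here the usual variance-reduction step (a sketch) is to subtract a state-value baseline, so that the score function is weighted by an advantage:

$$
\nabla_\theta J(\theta)
= \mathbb{E}\!\left[ \big( q^{\pi_\theta}(S, A) - v^{\pi_\theta}(S) \big)\, \nabla_\theta \log \pi_\theta(A \mid S) \right],
$$

where $q^{\pi_\theta}(s, a) - v^{\pi_\theta}(s)$ is the advantage of taking action $a$ in state $s$; subtracting the state value does not bias the gradient, for the same reason as before.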
Actor Critics
Critics
A critic is a value function, learnt via policy evaluation:
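For instance (a sketch, assuming a parametric state-value critic $v_w$ trained with TD(0)):

$$
\delta_t = R_{t+1} + \gamma\, v_w(S_{t+1}) - v_w(S_t),
\qquad
w \leftarrow w + \beta\, \delta_t\, \nabla_w v_w(S_t),
$$

where $\beta$ is the critic's step size.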
Actor-Critic
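A minimal one-step actor-critic sketch in Python, combining the TD(0) critic above with the policy-gradient actor and using the TD error $\delta_t$ in place of the advantage. It reuses the `softmax` helper and the assumed `env`/`features` interface from the REINFORCE sketch; again, the specifics are illustrative assumptions, not the exact algorithm from the original slides.

```python
import numpy as np

def actor_critic(env, features, n_features, n_actions,
                 episodes=1000, alpha=0.01, beta=0.1, gamma=0.99):
    """One-step actor-critic with a linear critic and a linear softmax actor (sketch)."""
    theta = np.zeros((n_actions, n_features))   # actor: softmax preferences
    w = np.zeros(n_features)                    # critic: state-value weights
    for _ in range(episodes):
        s, done = env.reset(), False
        x = features(s)
        while not done:
            probs = softmax(theta @ x)          # softmax() as in the REINFORCE sketch
            a = np.random.choice(n_actions, p=probs)
            s, r, done = env.step(a)
            x_next = features(s)
            # TD error of the critic; also used as a (biased) advantage estimate.
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x
            # Critic update: semi-gradient TD(0) on the state values.
            w += beta * delta * x
            # Actor update: policy gradient with the TD error in place of the advantage.
            grad_log_pi = -np.outer(probs, x)
            grad_log_pi[a] += x
            theta += alpha * delta * grad_log_pi
            x = x_next
    return theta, w
```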
Policy gradient variations
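The variations mostly differ in what multiplies the score function; a sketch of the common choices, in the notation above:

$$
\nabla_\theta J(\theta) \approx \mathbb{E}\!\left[ X_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right],
\qquad
X_t \in \left\{ G_t,\;\; q_w(S_t, A_t),\;\; q_w(S_t, A_t) - v_w(S_t),\;\; \delta_t \right\},
$$

i.e. the full return (REINFORCE), an estimated action value, an estimated advantage, or the TD error.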
Note that the policy determines which data we observe, so learning and data collection interact; this is different from supervised learning (where learning and data are independent).
Increasing robustness with trust regions
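One standard way to make this precise, in the spirit of trust-region policy optimisation (a sketch, not necessarily the exact formulation in the original slides): take the best step that keeps the new policy close, in KL divergence, to the old one,

$$
\max_\theta\; J(\theta)
\quad \text{subject to} \quad
\mathbb{E}_{S}\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid S)\, \big\|\, \pi_\theta(\cdot \mid S) \right) \right] \le \delta .
$$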
Continuous action spaces
Gaussian policy
Policy gradient with Gaussian policy
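A standard parameterisation (a sketch; the mean $\mu_\theta(s)$ is a learnt function and the standard deviation $\sigma$ is kept fixed here for simplicity):

$$
\pi_\theta(a \mid s) = \mathcal{N}\!\left( a;\, \mu_\theta(s),\, \sigma^2 \right),
\qquad
\nabla_\theta \log \pi_\theta(a \mid s)
= \frac{a - \mu_\theta(s)}{\sigma^2}\, \nabla_\theta\, \mu_\theta(s),
$$

so the same score-function machinery as before applies, with this score plugged in.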
Gradient ascent on value
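This presumably refers to updating a deterministic (or mean) action by following the gradient of a learnt action-value function, an interpretation in the spirit of deterministic policy gradients rather than something confirmed by the original slides. A sketch:

$$
\theta \leftarrow \theta + \alpha\, \nabla_\theta\, q_w\!\left(s, \pi_\theta(s)\right)
= \theta + \alpha\, \nabla_\theta\, \pi_\theta(s)\, \nabla_a\, q_w(s, a)\big|_{a = \pi_\theta(s)} .
$$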
Continuous actor-critic learning automaton (Cacla)
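A rough sketch of the Cacla update (van Hasselt & Wiering, 2007; the details here may differ from the original): act with exploration around the actor's output, and move the actor towards the executed action only when that action turned out better than expected, i.e. when the TD error is positive:

$$
\text{if } \delta_t > 0:\qquad
\theta \leftarrow \theta + \alpha\, \big( A_t - \pi_\theta(S_t) \big)\, \nabla_\theta\, \pi_\theta(S_t),
$$

where $A_t$ is the continuous action actually executed and $\pi_\theta(S_t)$ is the actor's current output for state $S_t$.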
- Title: Policy Gradients and Actor Critics
- Author: wy
- Created at: 2023-07-23 17:07:17
- Updated at: 2023-07-23 17:21:24
- Link: https://yuuee-www.github.io/blog/2023/07/23/RL/step7/RLstep7/
- License: This work is licensed under CC BY-NC-SA 4.0.