Exploration and Exploitation
Setting
Learning agents need to trade off two things:
- Exploitation: maximise performance based on current knowledge
- Exploration: increase knowledge
The Multi-Armed Bandit
A multi-armed bandit is a set of actions (arms) $\mathcal{A}$, each with an unknown reward distribution. At every step $t$ the agent selects an action $A_t \in \mathcal{A}$ and receives a reward $R_t$.
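As a running illustration, here is a minimal sketch of such a bandit in Python, assuming Bernoulli (0/1) rewards; the class name `BernoulliBandit` and the success probabilities are illustrative choices, not part of the lecture.

```python
import random

class BernoulliBandit:
    """A k-armed bandit where arm a pays reward 1 with probability p[a], else 0."""

    def __init__(self, success_probs):
        self.success_probs = list(success_probs)  # unknown to the agent

    @property
    def num_actions(self):
        return len(self.success_probs)

    def pull(self, action):
        """Sample a reward for the chosen arm."""
        return 1.0 if random.random() < self.success_probs[action] else 0.0

# Example: three arms; the agent must discover that arm 2 is best.
bandit = BernoulliBandit([0.3, 0.5, 0.7])
reward = bandit.pull(2)
```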
Values and Regret
The action value for action $a$ is the expected reward:
$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
The optimal value is
$$v_* = \max_a q(a)$$
The regret of an action $a$ is the gap to the optimal value:
$$\Delta_a = v_* - q(a)$$
The regret for the optimal action is zero.
We want to minimise the total regret:
$$L_t = \mathbb{E}\left[\sum_{n=1}^{t} v_* - q(A_n)\right]$$
Maximise cumulative reward ≡ minimise total regret
The summation spans over the full ‘lifetime of learning’
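To make the definition concrete, a small self-contained sketch (the helper name `total_regret` and the numbers are made up for illustration) sums the per-step gaps $v_* - q(A_n)$ for a given action history; the total regret $L_t$ is the expectation of this quantity.

```python
def total_regret(action_values, chosen_actions):
    """Sum of per-step gaps (v* - q(A_n)) for a given sequence of chosen actions."""
    v_star = max(action_values)
    return sum(v_star - action_values[a] for a in chosen_actions)

# Example: true values q(a) and a short action history.
q = [0.3, 0.5, 0.7]
history = [0, 2, 2, 1, 2]
print(total_regret(q, history))  # gaps: 0.4 + 0 + 0 + 0.2 + 0 -> 0.6 (up to float rounding)
```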
Algorithms
- Greedy
- ε-greedy
- UCB
The first three all use action value estimates $Q_t(a) \approx q(a)$.
- Thompson sampling
- Policy gradients
Action values
The action value for action $a$ is the expected reward:
$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$
A simple estimate is the average of the sampled rewards:
$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathbb{1}(A_n = a)\, R_n}{\sum_{n=1}^{t} \mathbb{1}(A_n = a)}$$
The count for action $a$ is
$$N_t(a) = \sum_{n=1}^{t} \mathbb{1}(A_n = a)$$
This can also be updated incrementally:
$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left(R_t - Q_{t-1}(A_t)\right)$$
where $Q_t(a) = Q_{t-1}(a)$ for all $a \neq A_t$, and $\alpha_t = \frac{1}{N_t(A_t)}$ recovers the sample average.
More generally, $\alpha_t$ is a step size; a constant $\alpha$ leads to tracking (a recency-weighted average) rather than averaging, which is useful when the reward distributions change over time.
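A minimal sketch of the incremental rule, contrasting the sample-average step size $\alpha_t = 1/N_t(a)$ with a constant $\alpha$; the class and parameter names are illustrative.

```python
class ActionValueEstimator:
    """Tracks Q(a) with the incremental rule Q <- Q + alpha * (R - Q)."""

    def __init__(self, num_actions, step_size=None):
        # step_size=None means sample averaging: alpha = 1 / N(a).
        self.q = [0.0] * num_actions
        self.counts = [0] * num_actions
        self.step_size = step_size

    def update(self, action, reward):
        self.counts[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.counts[action]
        self.q[action] += alpha * (reward - self.q[action])

# Sample averaging: converges to the mean for stationary rewards.
averaging = ActionValueEstimator(num_actions=3)
# Constant step size: tracks a recency-weighted average, useful if rewards drift.
tracking = ActionValueEstimator(num_actions=3, step_size=0.1)
```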
The greedy policy
One of the simplest policies: always select the action with the highest estimated value, $A_t = \operatorname*{argmax}_a Q_t(a)$. A purely greedy policy can lock onto a suboptimal action forever.
ε-Greedy Algorithm
With probability $1-\epsilon$ select the greedy action, and with probability $\epsilon$ select an action uniformly at random.
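A sketch of greedy and ε-greedy action selection over the current estimates $Q_t(a)$; the function names are illustrative.

```python
import random

def greedy_action(q_values):
    """Select argmax_a Q(a) (ties broken by the first index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy_action(q_values)

# Example usage with some current estimates Q_t(a):
q_estimates = [0.2, 0.6, 0.4]
action = epsilon_greedy_action(q_estimates, epsilon=0.1)
```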
Policy gradients
Policy search: learn action preferences $H_t(a)$
Learn policies $\pi_t(a)$ directly, instead of learning values:
$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$
The preferences are not values: they are just learnable policy parameters
Goal: learn a good policy by optimising the preferences
Policy gradients
Update the policy parameters such that the expected value increases, using gradient ascent:
$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \mathbb{E}\left[R_t \mid \pi_{\theta_t}\right]$$
Gradient bandits
Log-likelihood trick (also known as the REINFORCE trick, Williams 1992):
$$\nabla_\theta \mathbb{E}\left[R_t \mid \pi_\theta\right] = \mathbb{E}\left[R_t \nabla_\theta \log \pi_\theta(A_t)\right]$$
We can sample this, giving the stochastic gradient-ascent update
$$\theta_{t+1} = \theta_t + \alpha R_t \nabla_\theta \log \pi_{\theta_t}(A_t)$$
For softmax preferences this becomes $H_{t+1}(a) = H_t(a) + \alpha R_t \left(\mathbb{1}(a = A_t) - \pi_t(a)\right)$.
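A sketch of the resulting gradient bandit update on the softmax preferences $H_t(a)$. Subtracting a running average of rewards as a baseline is a common variance-reduction choice and an assumption here; the class name is illustrative.

```python
import math
import random

class GradientBandit:
    """Softmax policy over preferences H(a), updated with
    H(a) <- H(a) + alpha * (R - baseline) * (1{a == A_t} - pi(a))."""

    def __init__(self, num_actions, step_size=0.1):
        self.h = [0.0] * num_actions
        self.step_size = step_size
        self.baseline = 0.0   # running average reward (assumed baseline)
        self.steps = 0

    def policy(self):
        max_h = max(self.h)   # subtract the max for numerical stability
        exp_h = [math.exp(h - max_h) for h in self.h]
        z = sum(exp_h)
        return [e / z for e in exp_h]

    def act(self):
        return random.choices(range(len(self.h)), weights=self.policy())[0]

    def update(self, action, reward):
        self.steps += 1
        self.baseline += (reward - self.baseline) / self.steps
        pi = self.policy()
        for a in range(len(self.h)):
            indicator = 1.0 if a == action else 0.0
            self.h[a] += self.step_size * (reward - self.baseline) * (indicator - pi[a])
```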
Theory
Theorem (Lai and Robbins). Asymptotically, the total regret is at least logarithmic in the number of steps:
$$\lim_{t \to \infty} L_t \geq \log t \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\left(\mathcal{R}_a \,\|\, \mathcal{R}_{a^*}\right)}$$
where $\mathcal{R}_a$ is the reward distribution of action $a$ and $a^*$ is an optimal action.
Counting Regret
The total regret can be written in terms of the expected action counts and the gaps:
$$L_t = \sum_a \mathbb{E}\left[N_t(a)\right] \Delta_a$$
A good algorithm ensures small counts for actions with large gaps.
Optimism in the Face of Uncertainty
The more uncertain we are about an action's value, the more important it is to explore that action.
Which action should we pick?
Algorithms: UCB
Upper Confidence Bounds
Estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \leq Q_t(a) + U_t(a)$ with high probability, and select the action that maximises the upper confidence bound:
$$A_t = \operatorname*{argmax}_a \left[ Q_t(a) + U_t(a) \right]$$
Theory: the optimality of UCB
The UCB algorithm achieves logarithmic expected total regret, matching the order of the Lai–Robbins lower bound.
Hoeffding’s Inequality
Let $X_1, \ldots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then
$$p\left(\bar{X}_n + u \leq \mu\right) \leq e^{-2nu^2}$$
Calculating Upper Confidence Bounds
…
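Since the exact bound is omitted above, the sketch below assumes the common choice of exploration bonus $c \sqrt{\log t / N_t(a)}$; the constant $c$ and the function name are illustrative.

```python
import math

def ucb_action(q_values, counts, t, c=2.0):
    """Select argmax_a [ Q(a) + c * sqrt(log t / N(a)) ].
    Untried actions (N(a) == 0) get an infinite bonus so every arm is tried once."""
    best_action, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_values, counts)):
        bonus = float("inf") if n == 0 else c * math.sqrt(math.log(t) / n)
        if q + bonus > best_score:
            best_action, best_score = a, q + bonus
    return best_action

# Example: at step t = 10, with current estimates and visit counts.
action = ucb_action(q_values=[0.4, 0.6, 0.5], counts=[3, 4, 3], t=10)
```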
Bayesian approaches
Bayesian Bandits
Bayesian approach: model distributions over values, $p(q(a) \mid \theta_t)$, where the parameters $\theta_t$ are updated from the observed interactions.
Bayesian Learning
Example: for Bernoulli rewards, a Beta prior over each $q(a)$ can be updated in closed form after every observed reward, so the posterior remains a Beta distribution.
Bayesian Bandits with Upper Confidence Bounds
The posterior can be used to construct upper confidence bounds, e.g. by selecting the action with the highest posterior mean plus a multiple of the posterior standard deviation.
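One way to realise this for Bernoulli rewards with Beta posteriors is sketched below; the Beta–Bernoulli model, the constant `c`, and the function name are assumptions for illustration.

```python
import math

def bayes_ucb_action(alphas, betas, c=2.0):
    """Select argmax_a [ posterior mean + c * posterior std ] under Beta(alpha, beta) posteriors."""
    def score(alpha, beta):
        mean = alpha / (alpha + beta)
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        return mean + c * math.sqrt(var)
    scores = [score(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(scores)), key=lambda i: scores[i])

# Example: posteriors after a few observed successes/failures per arm.
action = bayes_ucb_action(alphas=[2, 5, 3], betas=[4, 3, 3])
```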
Thompson sampling
Probability Matching
Probability matching selects each action according to the probability that it is the optimal action:
$$\pi_t(a) = p\left( q(a) = \max_{a'} q(a') \,\middle|\, \theta_t \right)$$
Probability matching is optimistic in the face of uncertainty:
Actions have a higher selection probability when either the estimated value is high or the uncertainty is high
It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically). Thompson sampling implements probability matching by sampling: draw one value from each action's posterior, $q_a \sim p(q(a) \mid \theta_t)$, and select the action with the highest sample.
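A sketch of Thompson sampling for Bernoulli rewards, using Beta posteriors (the standard conjugate choice, assumed here); the class name is illustrative.

```python
import random

class BetaBernoulliThompson:
    """Thompson sampling: sample one value per arm from its Beta posterior,
    act greedily with respect to the samples, then update the chosen arm's posterior."""

    def __init__(self, num_actions):
        # Beta(1, 1) uniform prior for each arm.
        self.alphas = [1.0] * num_actions
        self.betas = [1.0] * num_actions

    def act(self):
        samples = [random.betavariate(a, b) for a, b in zip(self.alphas, self.betas)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, action, reward):
        # Bernoulli reward: 1 counts as a success, 0 as a failure.
        if reward > 0.5:
            self.alphas[action] += 1.0
        else:
            self.betas[action] += 1.0
```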
Planning to explore
Information State Space
The bandit can also be viewed as a sequential decision problem: the agent maintains an information state $s$ that summarises everything observed so far, and each action causes a transition to a new information state.
Example: Bernoulli Bandits. Here the information state can be summarised by the counts of observed successes and failures for each arm.
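A minimal sketch of this information state and its transitions (function names are illustrative):

```python
def initial_information_state(num_actions):
    """Information state: a (successes, failures) pair for each arm."""
    return tuple((0, 0) for _ in range(num_actions))

def transition(state, action, reward):
    """Return the next information state after observing a Bernoulli reward."""
    successes, failures = state[action]
    if reward > 0.5:
        successes += 1
    else:
        failures += 1
    return state[:action] + ((successes, failures),) + state[action + 1:]

# Example: no observations on 2 arms, then a success on arm 0.
s0 = initial_information_state(2)
s1 = transition(s0, action=0, reward=1.0)   # ((1, 0), (0, 0))
```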
Solving Information State Space Bandits
E.g., learn a Bayesian reward distribution and plan into the future over information states.
Bayes-adaptive RL: optimally trades off exploration and exploitation with respect to the prior distribution.
In the full RL setting, the information state must also capture uncertainty about the transition model.
All included screenshots are credited to the Lecture 2 slides, the wiki, and the comp0142 lectures.
- Title:
- Author: wy
- Created at: 2023-07-10 14:57:52
- Updated at: 2023-08-02 12:58:29
- Link: https://yuuee-www.github.io/blog/2023/07/10/RL/EE/RLEE/
- License: This work is licensed under CC BY-NC-SA 4.0.