# Exploration and Exploitation
## Setting

Learning agents need to trade off two things:

- Exploitation: maximise performance based on current knowledge
- Exploration: increase knowledge
## The Multi-Armed Bandit

A multi-armed bandit is a set of actions ("arms") $\mathcal{A}$, each with an unknown reward distribution. At each step $t$ the agent selects an action $A_t \in \mathcal{A}$ and the environment generates a reward $R_t$. The goal is to maximise the cumulative reward $\sum_{\tau=1}^{t} R_\tau$.
## Values and Regret

The action value for action a is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

The optimal value is

$$v_* = \max_a q(a)$$

The regret of an action a is the gap to the optimal value:

$$\Delta_a = v_* - q(a)$$

The regret of the optimal action is zero. We want to minimise the total regret:

$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} v_* - q(A_t) \right] = \mathbb{E}\left[ \sum_{t=1}^{T} \Delta_{A_t} \right]$$

Maximising cumulative reward is equivalent to minimising total regret. The summation spans the full 'lifetime of learning'.
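
As a tiny worked example (the numbers are made up for illustration): with two arms

$$q(a_1) = 0.8, \qquad q(a_2) = 0.5 \quad \Rightarrow \quad v_* = 0.8, \quad \Delta_{a_1} = 0, \quad \Delta_{a_2} = 0.3,$$

an algorithm that selects $a_2$ on 100 of its first 1000 steps incurs total regret $L_{1000} = 100 \times 0.3 = 30$.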
## Algorithms

- Greedy
- ε-greedy
- UCB
- Thompson sampling
- Policy gradients

The first three all use action value estimates $Q_t(a) \approx q(a)$.
### Action values

The action value for action a is the expected reward:

$$q(a) = \mathbb{E}\left[ R_t \mid A_t = a \right]$$

A simple estimate is the average of the sampled rewards:

$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathbb{1}(A_n = a) R_n}{N_t(a)}$$

where the count for action a is

$$N_t(a) = \sum_{n=1}^{t} \mathbb{1}(A_n = a)$$

This can also be updated incrementally:

$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left( R_t - Q_{t-1}(A_t) \right), \qquad Q_t(a) = Q_{t-1}(a) \;\; \forall a \neq A_t,$$

where $\alpha_t$ is a step size. Choosing $\alpha_t = \frac{1}{N_t(A_t)}$ recovers the sample average; a constant $\alpha$ would lead to tracking (recent rewards count more), rather than averaging.
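
A minimal sketch of this incremental estimate in Python (the class name and interface are my own, not from the lecture):

```python
import numpy as np

class ActionValues:
    """Sample-average action-value estimates, updated incrementally."""

    def __init__(self, num_actions):
        self.q = np.zeros(num_actions)  # Q(a) estimates
        self.n = np.zeros(num_actions)  # N(a) action counts

    def update(self, action, reward):
        self.n[action] += 1
        alpha = 1.0 / self.n[action]  # 1/N(a) -> exact sample average
        # alpha = 0.1                 # constant step size -> tracking instead
        self.q[action] += alpha * (reward - self.q[action])
```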
### The greedy policy

One of the simplest policies: always select the action with the highest estimated value,

$$A_t = \operatorname*{argmax}_a Q_t(a)$$

The greedy policy can lock onto a suboptimal action forever, so it can have linear expected total regret.
### ε-Greedy Algorithm

With probability $1 - \varepsilon$ select the greedy action, and with probability $\varepsilon$ select an action uniformly at random:

$$\pi_t(a) = \begin{cases} (1 - \varepsilon) + \varepsilon / |\mathcal{A}| & \text{if } Q_t(a) = \max_b Q_t(b) \\ \varepsilon / |\mathcal{A}| & \text{otherwise} \end{cases}$$

ε-greedy keeps exploring, but with a constant ε it also has linear expected total regret.
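
A short sketch of the action selection in Python (the function name is mine):

```python
import numpy as np

def eps_greedy_action(q, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))  # explore
    return int(np.argmax(q))              # exploit
```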
### Policy gradients

Policy search: learn policies $\pi(a)$ directly, instead of learning values, by parametrising the policy with action preferences $H_t(a)$:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$

The preferences are not values: they are just learnable policy parameters. Goal: learn the policy by optimising the preferences.
Idea: update the policy parameters $\theta$ such that the expected value increases, using gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \mathbb{E}\left[ R_t \mid \pi_{\theta_t} \right]$$
### Gradient bandits

Log-likelihood trick (also known as the REINFORCE trick; Williams, 1992):

$$\nabla_\theta \mathbb{E}\left[ R_t \mid \pi_\theta \right] = \mathbb{E}\left[ R_t \, \nabla_\theta \log \pi_\theta(A_t) \right]$$

The right-hand side is an expectation over the policy, so we can sample it, giving the stochastic gradient-ascent update

$$\theta_{t+1} = \theta_t + \alpha R_t \nabla_\theta \log \pi_{\theta_t}(A_t)$$

For the softmax policy over preferences this becomes $H_{t+1}(a) = H_t(a) + \alpha R_t \left( \mathbb{1}(a = A_t) - \pi_t(a) \right)$.
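
A self-contained sketch of a gradient bandit in Python (the `pull` callback and all names are assumptions for illustration):

```python
import numpy as np

def gradient_bandit(pull, num_actions, steps=1000, alpha=0.1, seed=0):
    """Softmax policy over preferences H(a), trained by the sampled REINFORCE update.

    `pull(a)` is assumed to return a sampled reward for action a.
    """
    rng = np.random.default_rng(seed)
    h = np.zeros(num_actions)                           # preferences H(a)
    for _ in range(steps):
        pi = np.exp(h - h.max())                        # numerically stable softmax
        pi /= pi.sum()
        a = rng.choice(num_actions, p=pi)
        r = pull(a)
        h += alpha * r * (np.eye(num_actions)[a] - pi)  # H(a) += alpha R (1(a=A_t) - pi(a))
    return h

# Example: two Bernoulli arms with success probabilities 0.3 and 0.7.
print(gradient_bandit(lambda a: float(np.random.random() < (0.3, 0.7)[a]), 2))
```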
## Theory

Theorem (Lai and Robbins). Asymptotically, the total regret of any algorithm is at least logarithmic in the number of steps:

$$\lim_{T \to \infty} L_T \geq \log T \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\!\left( \mathcal{R}_a \,\|\, \mathcal{R}_{a^*} \right)}$$

where $\mathrm{KL}(\mathcal{R}_a \| \mathcal{R}_{a^*})$ is the KL divergence between the reward distribution of arm a and that of the optimal arm.

### Counting Regret

The total regret can be rewritten in terms of the expected action counts and the gaps:

$$L_T = \mathbb{E}\left[ \sum_{t=1}^{T} \Delta_{A_t} \right] = \sum_a \mathbb{E}\left[ N_T(a) \right] \Delta_a$$

A good algorithm ensures small counts for actions with large gaps.
## Optimism in the Face of Uncertainty

The more uncertainty there is about an action's value, the more important it is to explore that action: it could turn out to be the best one. Which action should we pick? Optimism in the face of uncertainty says: the action that is most plausibly optimal, i.e., whose upper bound on value is highest.
## Algorithms: UCB

Upper Confidence Bounds: estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \leq Q_t(a) + U_t(a)$ with high probability, and select the action that maximises the bound:

$$A_t = \operatorname*{argmax}_a \left( Q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right)$$

where $c$ is a hyper-parameter trading off exploration and exploitation.
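
A minimal sketch of the selection rule in Python (trying every untried action first is a common implementation choice, not from the slide):

```python
import numpy as np

def ucb_action(q, n, t, c=2.0):
    """Select argmax_a Q(a) + c * sqrt(log t / N(a))."""
    if np.any(n == 0):
        return int(np.argmax(n == 0))  # try every action once before using the bound
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```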
## Theory: the optimality of UCB

Hoeffding's Inequality: let $X_1, \ldots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu$, and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ be the sample mean. Then

$$p\left( \bar{X}_n + u \leq \mu \right) \leq e^{-2nu^2}$$
### Calculating Upper Confidence Bounds

Apply Hoeffding's inequality to the rewards of an action: pick a probability $p$ that the true value exceeds the bound and solve for the bound,

$$p\left( q(a) > Q_t(a) + U_t(a) \right) \leq e^{-2 N_t(a) U_t(a)^2} = p \quad \Rightarrow \quad U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reducing $p$ as we observe more rewards, e.g. $p = 1/t$, then yields a bonus of the form $c \sqrt{\log t / N_t(a)}$, i.e. the UCB algorithm above, which achieves logarithmic expected total regret.
## Bayesian approaches

### Bayesian Bandits

Bayesian approach: model distributions over values, $p(q(a) \mid \theta_t)$, where $\theta_t$ are the parameters of the posterior after t steps. This quantifies the uncertainty about each value, which we can then use to guide exploration.

### Bayesian Learning

Update the posterior over the value parameters with Bayes' rule:

$$p(\theta \mid R_1, \ldots, R_t) \propto p(R_t \mid \theta) \, p(\theta \mid R_1, \ldots, R_{t-1})$$

Example: for Bernoulli rewards, a $\mathrm{Beta}(x_a, y_a)$ prior on each arm's success probability is conjugate; after each pull, increment $x_a$ on a reward of 1 and $y_a$ on a reward of 0.

### Bayesian Bandits with Upper Confidence Bounds

The posterior gives an upper confidence bound directly: e.g. set $U_t(a) = c\, \sigma_t(a)$, where $\sigma_t(a)$ is the posterior standard deviation of $q(a)$, and select

$$A_t = \operatorname*{argmax}_a \left( Q_t(a) + c\, \sigma_t(a) \right)$$
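
A small sketch for the Bernoulli case in Python (the $\mathrm{Beta}(1, 1)$ prior and the function name are my own choices):

```python
import numpy as np

def bayes_ucb_action(x, y, c=2.0):
    """x, y: per-arm arrays of observed 1s and 0s, with Beta(1, 1) priors."""
    a, b = 1.0 + x, 1.0 + y                              # Beta posterior parameters
    mean = a / (a + b)                                   # posterior mean of q(a)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior std of q(a)
    return int(np.argmax(mean + c * std))
```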
### Thompson sampling

Probability matching selects an action according to the probability that it is optimal:

$$\pi_t(a) = p\left( q(a) = \max_b q(b) \,\middle|\, \theta_t \right)$$

Probability matching is optimistic in the face of uncertainty: actions have higher probability of being selected when either the estimated value is high or the uncertainty is high. It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically).

Thompson sampling implements probability matching by sampling: draw one value $\tilde{q}(a) \sim p(q(a) \mid \theta_t)$ per action and select $A_t = \operatorname*{argmax}_a \tilde{q}(a)$.
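
A minimal Bernoulli Thompson sampling sketch in Python (the Beta(1, 1) prior and names are assumptions):

```python
import numpy as np

def thompson_action(x, y, rng=None):
    """Sample q~(a) ~ Beta(1 + x_a, 1 + y_a) for each arm, then act greedily."""
    rng = rng or np.random.default_rng()
    return int(np.argmax(rng.beta(1.0 + x, 1.0 + y)))

# After observing reward r in {0, 1} from arm a, update the posterior counts:
#   x[a] += r;  y[a] += 1 - r
```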
## Planning to explore

### Information State Space

We can treat the agent's own uncertainty as part of the state: an information state $\tilde{s}$ summarises all the information accumulated so far, and each action causes a transition to a new information state (by adding information). This defines an MDP over information states, so the exploration problem itself can be solved by planning.

Example: Bernoulli Bandits. The information state can be the counts $\tilde{s} = (x_a, y_a)_a$ of rewards 1 and 0 observed for each arm; pulling arm a increments $x_a$ or $y_a$.

Solving information state space bandits: e.g., learn a Bayesian reward distribution and plan into the future. Bayes-adaptive RL optimally trades off exploration and exploitation with respect to the prior distribution. In full RL, the same idea extends to the transition model as well.
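
A toy sketch of planning in the Bernoulli information state space (horizon, priors, and names are all my own assumptions; real Bayes-adaptive planning is usually intractable and must be approximated):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(s, steps_left):
    """Bayes-optimal expected future reward from information state s.

    s = ((x1, y1), (x2, y2)): counts of rewards 1/0 per arm,
    on top of a Beta(1, 1) prior for each arm.
    """
    if steps_left == 0:
        return 0.0
    best = float("-inf")
    for a in (0, 1):
        x, y = s[a]
        p = (x + 1) / (x + y + 2)  # posterior mean of arm a
        win, lose = list(s), list(s)
        win[a], lose[a] = (x + 1, y), (x, y + 1)
        q = p * (1.0 + value(tuple(win), steps_left - 1)) \
            + (1 - p) * value(tuple(lose), steps_left - 1)
        best = max(best, q)
    return best

# Expected reward of the Bayes-optimal policy over 10 pulls, from uniform priors:
print(value(((0, 0), (0, 0)), 10))
```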
All included screenshots are credited to the Lecture 2 slides, the wiki, and the COMP0142 lectures.
- Title: Exploration and Exploitation
- Author: wy
- Created at: 2023-07-10 06:57:52
- Updated at: 2023-08-02 04:58:29
- Link: https://yue-ruby-w.site/2023/07/10/2023-07-10-RL-EE-RLEE/
- License: This work is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0).