
Exploration and Exploitation

Setting


Learning agents need to trade off two things:

  • Exploitation: maximise performance based on current knowledge
  • Exploration: increase knowledge

The Multi-Armed Bandit

A multi-armed bandit is a set of actions $\mathcal{A}$ (the "arms"), each with an unknown reward distribution $\mathcal{R}(r \mid a)$. At each step $t$ the agent selects an action $A_t \in \mathcal{A}$, the environment generates a reward $R_t \sim \mathcal{R}(\cdot \mid A_t)$, and the goal is to maximise the cumulative reward $\sum_{\tau=1}^{t} R_\tau$.
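To make the setting concrete, here is a minimal sketch of a Bernoulli multi-armed bandit environment in Python (my own illustration; the class name and arm probabilities are made up, not from the lecture):

```python
import numpy as np

class BernoulliBandit:
    """k-armed bandit: arm a pays reward 1 with probability probs[a], else 0."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.probs)

    def step(self, action):
        """Pull arm `action` and return a stochastic reward."""
        return float(self.rng.random() < self.probs[action])

# Example: 3 arms whose success probabilities are unknown to the agent
bandit = BernoulliBandit(probs=[0.3, 0.5, 0.7])
print(bandit.step(2))  # 1.0 or 0.0
```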

Values and Regret

The action value for action $a$ is the expected reward:

$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$

The optimal value is

$$v_* = \max_a q(a)$$

The regret of an action $a$ is the gap between its value and the optimal value:

$$\Delta_a = v_* - q(a)$$

The regret for the optimal action is zero

We want to minimise the total regret:

$$L_t = \mathbb{E}\left[\sum_{n=1}^{t} v_* - q(A_n)\right]$$

Maximise cumulative reward ≡ minimise total regret

The summation spans over the full ‘lifetime of learning’
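As a small sketch of these definitions (not from the lecture; the true values and the action sequence are made-up numbers), the gaps and the total regret of a fixed action sequence can be computed directly when the true action values are known:

```python
import numpy as np

# True (hidden) action values q(a) for a 3-armed bandit -- example numbers
q = np.array([0.3, 0.5, 0.7])
v_star = q.max()                 # optimal value v* = max_a q(a)
gaps = v_star - q                # regret of each action: Delta_a = v* - q(a)

# Total regret of a sequence of chosen actions A_1, ..., A_t
actions = [0, 2, 1, 2, 2]        # example action sequence
total_regret = sum(gaps[a] for a in actions)
print(gaps)          # [0.4, 0.2, 0.0] -- the optimal action has zero regret
print(total_regret)  # 0.4 + 0.0 + 0.2 + 0.0 + 0.0 = 0.6
```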

Algorithms

  • Greedy
  • ε-greedy
  • UCB
  • Thompson sampling
  • Policy gradients

The first three all use action value estimates $Q_t(a) \approx q(a)$.

Action values

The action value for action $a$ is the expected reward, $q(a) = \mathbb{E}[R_t \mid A_t = a]$.

A simple estimate is the average of the sampled rewards:

$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathbb{1}(A_n = a)\, R_n}{\sum_{n=1}^{t} \mathbb{1}(A_n = a)}$$

The count for action $a$ is

$$N_t(a) = \sum_{n=1}^{t} \mathbb{1}(A_n = a)$$

This can also be updated incrementally:

$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left(R_t - Q_{t-1}(A_t)\right)$$

where

$$\alpha_t = \frac{1}{N_t(A_t)}, \qquad N_t(A_t) = N_{t-1}(A_t) + 1$$

(and $Q_t(a) = Q_{t-1}(a)$, $N_t(a) = N_{t-1}(a)$ for all $a \neq A_t$).

$\alpha_t$ is a step size; a constant $\alpha$ would lead to tracking (recent rewards weighted more heavily), rather than averaging.
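A minimal sketch of this incremental estimator (class and variable names are my own): with the default step size $\alpha_t = 1/N_t(a)$ it recovers the sample average, while a constant step size would track instead:

```python
import numpy as np

class ActionValueEstimator:
    """Incremental estimates Q_t(a) of the action values q(a)."""

    def __init__(self, num_actions, step_size=None):
        self.q = np.zeros(num_actions)   # Q_t(a)
        self.n = np.zeros(num_actions)   # N_t(a)
        self.step_size = step_size       # None -> sample average (alpha = 1/N)

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # Q_t(a) = Q_{t-1}(a) + alpha * (R_t - Q_{t-1}(a))
        self.q[action] += alpha * (reward - self.q[action])

est = ActionValueEstimator(num_actions=3)
for r in [1.0, 0.0, 1.0]:
    est.update(action=1, reward=r)
print(est.q[1])  # 2/3: the sample average of the observed rewards
```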

The greedy policy

One of the simplest policies: always select the action with the highest estimated value,

$$A_t = \arg\max_a Q_t(a)$$

Because it never explores, the greedy policy can lock onto a suboptimal action forever.

ε-Greedy Algorithm

With probability $1 - \varepsilon$ select the greedy action, and with probability $\varepsilon$ select an action uniformly at random. This guarantees continual exploration; see the sketch below.
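A minimal sketch of ε-greedy action selection over such value estimates (function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore
    return int(np.argmax(q_estimates))               # exploit: argmax_a Q_t(a)

q_estimates = np.array([0.2, 0.6, 0.4])   # example Q_t(a) values
print(epsilon_greedy(q_estimates, epsilon=0.1))      # usually 1, occasionally random
```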

Policy gradients

Policy search: learn action preferences $H_t(a)$.

Learn policies $\pi_t(a)$ directly, instead of learning values. For instance, define the policy as a softmax over the preferences:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$

The preferences are not values: they are just learnable policy parameters

Goal: learn by optimising the preferences

Policy gradients

Update the policy parameters such that the expected value increases, using gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \mathbb{E}\left[R_t \mid \pi_\theta\right]$$

(here the parameters $\theta$ are the preferences $H_t(a)$).

Gradient bandits

Log-likelihood trick (also known as the REINFORCE trick, Williams 1992):

$$\nabla_\theta \mathbb{E}\left[R_t \mid \pi_\theta\right] = \mathbb{E}\left[R_t \, \nabla_\theta \log \pi_\theta(A_t)\right]$$

The right-hand side is an expectation over actions sampled from the policy, so we can sample it: $R_t \nabla_\theta \log \pi_\theta(A_t)$ is an unbiased estimate of the gradient and can be used for stochastic gradient ascent. For the softmax policy this gives the gradient-bandit update

$$H_{t+1}(a) = H_t(a) + \alpha R_t \left(\mathbb{1}(a = A_t) - \pi_t(a)\right)$$
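A minimal sketch of this gradient-bandit update for a softmax policy, without a baseline (function names, step size, and the test arms are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(h):
    z = np.exp(h - h.max())          # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_step(h, alpha, reward_fn):
    pi = softmax(h)                          # pi_t(a) = softmax(H_t)
    a = rng.choice(len(h), p=pi)             # sample A_t ~ pi_t
    r = reward_fn(a)                         # observe R_t
    one_hot = np.eye(len(h))[a]
    # H_{t+1}(a) = H_t(a) + alpha * R_t * (1{a = A_t} - pi_t(a))
    return h + alpha * r * (one_hot - pi)

# Example run on made-up Bernoulli arms
probs = np.array([0.3, 0.5, 0.7])
reward_fn = lambda a: float(rng.random() < probs[a])
h = np.zeros(3)                              # preferences H_t(a)
for _ in range(1000):
    h = gradient_bandit_step(h, alpha=0.1, reward_fn=reward_fn)
print(softmax(h))   # the policy should concentrate on the best arm over time
```

In practice a baseline (e.g. the average reward so far) is often subtracted from $R_t$ to reduce the variance of the gradient estimate.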



Theory

Theorem (Lai and Robbins)

Asymptotically, the total regret is at least logarithmic in the number of steps:

$$\lim_{t \to \infty} L_t \;\ge\; \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\left(\mathcal{R}_a \,\|\, \mathcal{R}_{a^*}\right)}$$

Counting Regret

Total regret can be decomposed into the gaps and the expected counts of each action:

$$L_t = \sum_a \mathbb{E}\left[N_t(a)\right] \Delta_a$$

A good algorithm ensures small counts for actions with large gaps.

Optimism in the Face of Uncertainty

The more uncertain we are about an action's value, the more important it is to explore that action.

Which action should we pick?


Algorithms: UCB

Upper Confidence Bounds

Estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \le Q_t(a) + U_t(a)$ with high probability, and select the action that maximises the upper confidence bound:

$$A_t = \arg\max_a \left[Q_t(a) + U_t(a)\right]$$

The uncertainty depends on the count $N_t(a)$: a small count means a large $U_t(a)$ (the estimated value is uncertain), a large count means a small $U_t(a)$ (the estimated value is accurate).

Theory: the optimality of UCB

Hoeffding’s Inequality

Let $X_1, \dots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu = \mathbb{E}[X]$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then

$$p\left(\bar{X}_n + u \le \mu\right) \le e^{-2nu^2}$$

Calculating Upper Confidence Bounds

Apply Hoeffding's inequality to the rewards of each arm: $p\big(Q_t(a) + U_t(a) \le q(a)\big) \le e^{-2 N_t(a) U_t(a)^2}$. Pick a small probability $p$ that the true value exceeds the bound and solve for $U_t(a)$:

$$U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reducing $p$ as we observe more rewards, e.g. $p = 1/t$, gives

$$U_t(a) = \sqrt{\frac{\log t}{2 N_t(a)}}$$

and the UCB algorithm

$$A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\right]$$
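A minimal sketch of UCB action selection using this bound (the exploration constant `c` and the handling of untried arms are my choices):

```python
import numpy as np

def ucb_action(q, n, t, c=1.0):
    """Select argmax_a [ Q_t(a) + c * sqrt(log t / N_t(a)) ].

    Untried actions (N_t(a) = 0) get an infinite bonus and are tried first.
    """
    with np.errstate(divide="ignore"):
        bonus = c * np.sqrt(np.log(t) / n)
    bonus[n == 0] = np.inf
    return int(np.argmax(q + bonus))

q = np.array([0.4, 0.55, 0.5])   # example Q_t(a)
n = np.array([10, 2, 0])         # example counts N_t(a)
print(ucb_action(q, n, t=12))    # 2: the untried arm gets an infinite bonus
```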

Bayesian approaches

Bayesian Bandits

Bayesian approach:

model distributions over values, $p(q(a) \mid \theta_t)$, where $\theta_t$ are the parameters of our current beliefs about the rewards (e.g., counts, or means and variances)

Bayesian Learning

Start with a prior over the unknown reward distribution and compute a posterior with Bayes' rule as rewards are observed. Example: for Bernoulli rewards with a Beta prior on each arm's success probability, the posterior remains a Beta distribution, with its counts updated by the observed successes and failures.

Bayesian Bandits with Upper Confidence Bounds

The posterior gives us uncertainty estimates directly: for example, select the action with the highest posterior mean plus a multiple of the posterior standard deviation, $A_t = \arg\max_a \left[\mu_t(a) + c\,\sigma_t(a)\right]$.
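A minimal sketch of this Bayesian UCB rule for Bernoulli rewards with Beta posteriors (the constant `c` and the example counts are made up):

```python
import numpy as np

# Beta(alpha, beta) posterior per arm, e.g. after some observed 1s and 0s
alpha = np.array([3.0, 8.0, 1.0])   # 1 + number of successes (example values)
beta = np.array([5.0, 6.0, 1.0])    # 1 + number of failures  (example values)

mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
std = np.sqrt(var)

c = 2.0
action = int(np.argmax(mean + c * std))   # Bayesian UCB: mu_t(a) + c * sigma_t(a)
print(mean + c * std, action)             # the untried arm (uniform posterior) wins here
```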

Thompson sampling

Probability Matching

Probability matching selects each action according to the probability that it is the optimal action:

$$\pi_t(a) = p\left(q(a) = \max_{a'} q(a') \,\middle|\, \theta_t\right)$$

Probability matching is optimistic in the face of uncertainty:

Actions have higher probability when either the estimated value is high, or the uncertainty is high

It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically). Thompson sampling implements probability matching by sampling: draw $\hat{q}_t(a) \sim p(q(a) \mid \theta_t)$ for each action and select $A_t = \arg\max_a \hat{q}_t(a)$.
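A minimal sketch of Thompson sampling for Bernoulli rewards with Beta posteriors (a standard conjugate choice; the class name and arm probabilities are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonBernoulli:
    """Thompson sampling with a Beta(1, 1) prior on each arm's success probability."""

    def __init__(self, num_arms):
        self.alpha = np.ones(num_arms)   # 1 + successes
        self.beta = np.ones(num_arms)    # 1 + failures

    def select_action(self):
        # Sample q_hat(a) ~ p(q(a) | theta_t) for every arm, then act greedily on the sample
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, action, reward):
        self.alpha[action] += reward
        self.beta[action] += 1.0 - reward

# Example run on made-up Bernoulli arms
probs = np.array([0.3, 0.5, 0.7])
agent = ThompsonBernoulli(num_arms=3)
for _ in range(1000):
    a = agent.select_action()
    r = float(rng.random() < probs[a])
    agent.update(a, r)
print(agent.alpha / (agent.alpha + agent.beta))   # posterior means; arm 2 (p=0.7) should look best
```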

Planning to explore

Information State Space

We can view the bandit as a sequential decision problem over information states: the information state $\tilde{s}_t$ summarises everything observed so far, and each action causes a transition to a new information state by updating our beliefs. This defines a Markov decision process over information states.

Example: Bernoulli Bandits

For Bernoulli bandits the information state is simply the counts of observed successes and failures for each arm (equivalently, the parameters of a Beta posterior per arm).

Solving Information State Space Bandits

E.g., learn a Bayesian reward distribution and plan into the future over the resulting information states.

Bayes-adaptive RL optimally trades off exploration and exploitation with respect to the prior distribution.

In full RL, the information state would also capture uncertainty about the transition model, not just the rewards.
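As a concrete sketch of planning over Bernoulli information states (my own toy implementation, feasible only for small horizons), the information state is the tuple of Beta counts per arm and the Bayes-optimal value can be computed by recursion:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(state, horizon):
    """Bayes-optimal expected future reward from an information state.

    `state` is a tuple of (alpha, beta) Beta counts per arm; rewards are 0/1.
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a, (alpha, beta) in enumerate(state):
        p = alpha / (alpha + beta)                        # posterior predictive p(R=1 | a)
        win = list(state); win[a] = (alpha + 1, beta)     # next information state if R=1
        lose = list(state); lose[a] = (alpha, beta + 1)   # next information state if R=0
        q_a = p * (1.0 + value(tuple(win), horizon - 1)) \
            + (1.0 - p) * value(tuple(lose), horizon - 1)
        best = max(best, q_a)
    return best

# Two arms: one with 3 observed wins and 1 loss (Beta(4, 2)), one untried (Beta(1, 1))
start = ((4, 2), (1, 1))
print(value(start, horizon=5))   # Bayes-optimal expected reward over the next 5 pulls
```

The number of reachable information states grows rapidly with the horizon, which is why approximate planning (e.g., tree search) or simpler heuristics such as UCB and Thompson sampling are used in practice.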

All included screenshots are credited to the Lecture 2 slides, the wiki, and the comp0142 lectures.

  • Title:
  • Author: wy
  • Created at: 2023-07-10 14:57:52
  • Updated at: 2023-08-02 12:58:29
  • Link: https://yuuee-www.github.io/blog/2023/07/10/RL/EE/RLEE/
  • License: This work is licensed under CC BY-NC-SA 4.0.