
Exploration and Exploitation

Setting


Learning agents need to trade off two things:

  • Exploitation: maximise performance based on current knowledge
  • Exploration: increase knowledge

The Multi-Armed Bandit

A multi-armed bandit is a set of actions $\mathcal{A}$ (the "arms"), each with an unknown reward distribution $\mathcal{R}(r \mid a)$. At each step $t$ the agent selects an action $A_t \in \mathcal{A}$, the environment generates a reward $R_t \sim \mathcal{R}(\cdot \mid A_t)$, and the goal is to maximise the cumulative reward $\sum_{\tau=1}^{t} R_\tau$.
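To make the setting concrete, here is a minimal sketch of a Bernoulli multi-armed bandit environment in Python (my own illustration; the class name and arm probabilities are made up, not from the lecture):

```python
import numpy as np

class BernoulliBandit:
    """k-armed bandit: arm a pays reward 1 with probability probs[a], else 0."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs, dtype=float)
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.probs)

    def step(self, action):
        """Pull arm `action` and return a stochastic reward."""
        return float(self.rng.random() < self.probs[action])

# Example: 3 arms whose success probabilities are unknown to the agent
bandit = BernoulliBandit(probs=[0.3, 0.5, 0.7])
print(bandit.step(2))  # 1.0 or 0.0
```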

Values and Regret

The action value for action $a$ is the expected reward:

$$q(a) = \mathbb{E}\left[R_t \mid A_t = a\right]$$

The optimal value is

$$v_* = \max_a q(a)$$

The regret of an action $a$ is the gap between its value and the optimal value:

$$\Delta_a = v_* - q(a)$$

The regret for the optimal action is zero

We want to minimise the total regret:

$$L_t = \mathbb{E}\left[\sum_{n=1}^{t} v_* - q(A_n)\right]$$

Maximise cumulative reward ≡ minimise total regret

The summation spans over the full ‘lifetime of learning’
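As a small sketch of these definitions (not from the lecture; the true values and the action sequence are made-up numbers), the gaps and the total regret of a fixed action sequence can be computed directly when the true action values are known:

```python
import numpy as np

# True (hidden) action values q(a) for a 3-armed bandit -- example numbers
q = np.array([0.3, 0.5, 0.7])
v_star = q.max()                 # optimal value v* = max_a q(a)
gaps = v_star - q                # regret of each action: Delta_a = v* - q(a)

# Total regret of a sequence of chosen actions A_1, ..., A_t
actions = [0, 2, 1, 2, 2]        # example action sequence
total_regret = sum(gaps[a] for a in actions)
print(gaps)          # [0.4, 0.2, 0.0] -- the optimal action has zero regret
print(total_regret)  # 0.4 + 0.0 + 0.2 + 0.0 + 0.0 = 0.6
```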

Algorithms

  • Greedy
  • ε-greedy
  • UCB
  • Thompson sampling
  • Policy gradients

The first three all use action value estimates $Q_t(a) \approx q(a)$.

Action values

The action value for action $a$ is the expected reward, $q(a) = \mathbb{E}[R_t \mid A_t = a]$.

A simple estimate is the average of the sampled rewards:

$$Q_t(a) = \frac{\sum_{n=1}^{t} \mathbb{1}(A_n = a)\, R_n}{\sum_{n=1}^{t} \mathbb{1}(A_n = a)}$$

The count for action $a$ is

$$N_t(a) = \sum_{n=1}^{t} \mathbb{1}(A_n = a)$$

This can also be updated incrementally:

$$Q_t(A_t) = Q_{t-1}(A_t) + \alpha_t \left(R_t - Q_{t-1}(A_t)\right)$$

where

$$\alpha_t = \frac{1}{N_t(A_t)}, \qquad N_t(A_t) = N_{t-1}(A_t) + 1$$

(and $Q_t(a) = Q_{t-1}(a)$, $N_t(a) = N_{t-1}(a)$ for all $a \neq A_t$).

$\alpha_t$ is a step size; a constant $\alpha$ would lead to tracking (recent rewards weighted more heavily), rather than averaging.
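A minimal sketch of this incremental estimator (class and variable names are my own): with the default step size $\alpha_t = 1/N_t(a)$ it recovers the sample average, while a constant step size would track instead:

```python
import numpy as np

class ActionValueEstimator:
    """Incremental estimates Q_t(a) of the action values q(a)."""

    def __init__(self, num_actions, step_size=None):
        self.q = np.zeros(num_actions)   # Q_t(a)
        self.n = np.zeros(num_actions)   # N_t(a)
        self.step_size = step_size       # None -> sample average (alpha = 1/N)

    def update(self, action, reward):
        self.n[action] += 1
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        # Q_t(a) = Q_{t-1}(a) + alpha * (R_t - Q_{t-1}(a))
        self.q[action] += alpha * (reward - self.q[action])

est = ActionValueEstimator(num_actions=3)
for r in [1.0, 0.0, 1.0]:
    est.update(action=1, reward=r)
print(est.q[1])  # 2/3: the sample average of the observed rewards
```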

The greedy policy

One of the simplest policies: always select the action with the highest estimated value,

$$A_t = \arg\max_a Q_t(a)$$

Because it never explores, the greedy policy can lock onto a suboptimal action forever.

ε-Greedy Algorithm

With probability $1 - \varepsilon$ select the greedy action, and with probability $\varepsilon$ select an action uniformly at random. This guarantees continual exploration; see the sketch below.
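A minimal sketch of ε-greedy action selection over such value estimates (function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_estimates, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_estimates)))   # explore
    return int(np.argmax(q_estimates))               # exploit: argmax_a Q_t(a)

q_estimates = np.array([0.2, 0.6, 0.4])   # example Q_t(a) values
print(epsilon_greedy(q_estimates, epsilon=0.1))      # usually 1, occasionally random
```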

Policy gradients

Policy search: learn action preferences $H_t(a)$.

Learn policies $\pi_t(a)$ directly, instead of learning values. For instance, define the policy as a softmax over the preferences:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}} \qquad \text{(softmax)}$$

The preferences are not values: they are just learnable policy parameters

Goal: learn by optimising the preferences

Policy gradients

Update the policy parameters such that the expected value increases, using gradient ascent:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \mathbb{E}\left[R_t \mid \pi_\theta\right]$$

(here the parameters $\theta$ are the preferences $H_t(a)$).

Gradient bandits

Log-likelihood trick (also known as the REINFORCE trick, Williams 1992):

$$\nabla_\theta \mathbb{E}\left[R_t \mid \pi_\theta\right] = \mathbb{E}\left[R_t \, \nabla_\theta \log \pi_\theta(A_t)\right]$$

The right-hand side is an expectation over actions sampled from the policy, so we can sample it: $R_t \nabla_\theta \log \pi_\theta(A_t)$ is an unbiased estimate of the gradient and can be used for stochastic gradient ascent. For the softmax policy this gives the gradient-bandit update

$$H_{t+1}(a) = H_t(a) + \alpha R_t \left(\mathbb{1}(a = A_t) - \pi_t(a)\right)$$
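A minimal sketch of this gradient-bandit update for a softmax policy, without a baseline (function names, step size, and the test arms are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(h):
    z = np.exp(h - h.max())          # subtract max for numerical stability
    return z / z.sum()

def gradient_bandit_step(h, alpha, reward_fn):
    pi = softmax(h)                          # pi_t(a) = softmax(H_t)
    a = rng.choice(len(h), p=pi)             # sample A_t ~ pi_t
    r = reward_fn(a)                         # observe R_t
    one_hot = np.eye(len(h))[a]
    # H_{t+1}(a) = H_t(a) + alpha * R_t * (1{a = A_t} - pi_t(a))
    return h + alpha * r * (one_hot - pi)

# Example run on made-up Bernoulli arms
probs = np.array([0.3, 0.5, 0.7])
reward_fn = lambda a: float(rng.random() < probs[a])
h = np.zeros(3)                              # preferences H_t(a)
for _ in range(1000):
    h = gradient_bandit_step(h, alpha=0.1, reward_fn=reward_fn)
print(softmax(h))   # the policy should concentrate on the best arm over time
```

In practice a baseline (e.g. the average reward so far) is often subtracted from $R_t$ to reduce the variance of the gradient estimate.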



Theory

Theorem (Lai and Robbins)

Asymptotically, the total regret is at least logarithmic in the number of steps:

$$\lim_{t \to \infty} L_t \;\ge\; \log t \sum_{a \,:\, \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}\left(\mathcal{R}_a \,\|\, \mathcal{R}_{a^*}\right)}$$

Counting Regret

Total regret can be decomposed into the gaps and the expected counts of each action:

$$L_t = \sum_a \mathbb{E}\left[N_t(a)\right] \Delta_a$$

A good algorithm ensures small counts for actions with large gaps.

Optimism in the Face of Uncertainty

The more uncertain we are about an action's value, the more important it is to explore that action.

Which action should we pick?


Algorithms: UCB

Upper Confidence Bounds

Estimate an upper confidence $U_t(a)$ for each action value, such that $q(a) \le Q_t(a) + U_t(a)$ with high probability, and select the action that maximises the upper confidence bound:

$$A_t = \arg\max_a \left[Q_t(a) + U_t(a)\right]$$

The uncertainty depends on the count $N_t(a)$: a small count means a large $U_t(a)$ (the estimated value is uncertain), a large count means a small $U_t(a)$ (the estimated value is accurate).

Theory: the optimality of UCB

Hoeffding’s Inequality

Let $X_1, \dots, X_n$ be i.i.d. random variables in $[0, 1]$ with true mean $\mu = \mathbb{E}[X]$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean. Then

$$p\left(\bar{X}_n + u \le \mu\right) \le e^{-2nu^2}$$

Calculating Upper Confidence Bounds

Apply Hoeffding's inequality to the rewards of each arm: $p\big(Q_t(a) + U_t(a) \le q(a)\big) \le e^{-2 N_t(a) U_t(a)^2}$. Pick a small probability $p$ that the true value exceeds the bound and solve for $U_t(a)$:

$$U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}$$

Reducing $p$ as we observe more rewards, e.g. $p = 1/t$, gives

$$U_t(a) = \sqrt{\frac{\log t}{2 N_t(a)}}$$

and the UCB algorithm

$$A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\frac{\log t}{N_t(a)}}\right]$$
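A minimal sketch of UCB action selection using this bound (the exploration constant `c` and the handling of untried arms are my choices):

```python
import numpy as np

def ucb_action(q, n, t, c=1.0):
    """Select argmax_a [ Q_t(a) + c * sqrt(log t / N_t(a)) ].

    Untried actions (N_t(a) = 0) get an infinite bonus and are tried first.
    """
    with np.errstate(divide="ignore"):
        bonus = c * np.sqrt(np.log(t) / n)
    bonus[n == 0] = np.inf
    return int(np.argmax(q + bonus))

q = np.array([0.4, 0.55, 0.5])   # example Q_t(a)
n = np.array([10, 2, 0])         # example counts N_t(a)
print(ucb_action(q, n, t=12))    # 2: the untried arm gets an infinite bonus
```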

Bayesian approaches

Bayesian Bandits

Bayesian approach:

model distributions over values, $p(q(a) \mid \theta_t)$, where $\theta_t$ are the parameters of our current beliefs about the rewards (e.g., counts, or means and variances)

Bayesian Learning

Start with a prior over the unknown reward distribution and compute a posterior with Bayes' rule as rewards are observed. Example: for Bernoulli rewards with a Beta prior on each arm's success probability, the posterior remains a Beta distribution, with its counts updated by the observed successes and failures.

Bayesian Bandits with Upper Confidence Bounds

The posterior gives us uncertainty estimates directly: for example, select the action with the highest posterior mean plus a multiple of the posterior standard deviation, $A_t = \arg\max_a \left[\mu_t(a) + c\,\sigma_t(a)\right]$.
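A minimal sketch of this Bayesian UCB rule for Bernoulli rewards with Beta posteriors (the constant `c` and the example counts are made up):

```python
import numpy as np

# Beta(alpha, beta) posterior per arm, e.g. after some observed 1s and 0s
alpha = np.array([3.0, 8.0, 1.0])   # 1 + number of successes (example values)
beta = np.array([5.0, 6.0, 1.0])    # 1 + number of failures  (example values)

mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
std = np.sqrt(var)

c = 2.0
action = int(np.argmax(mean + c * std))   # Bayesian UCB: mu_t(a) + c * sigma_t(a)
print(mean + c * std, action)             # the untried arm (uniform posterior) wins here
```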

Thompson sampling

Probability Matching

Probability matching selects each action according to the probability that it is the optimal action:

$$\pi_t(a) = p\left(q(a) = \max_{a'} q(a') \,\middle|\, \theta_t\right)$$

Probability matching is optimistic in the face of uncertainty:

Actions have higher probability when either the estimated value is high, or the uncertainty is high

It can be difficult to compute $\pi_t(a)$ analytically from the posterior (but it can be done numerically). Thompson sampling implements probability matching by sampling: draw $\hat{q}_t(a) \sim p(q(a) \mid \theta_t)$ for each action and select $A_t = \arg\max_a \hat{q}_t(a)$.
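A minimal sketch of Thompson sampling for Bernoulli rewards with Beta posteriors (a standard conjugate choice; the class name and arm probabilities are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

class ThompsonBernoulli:
    """Thompson sampling with a Beta(1, 1) prior on each arm's success probability."""

    def __init__(self, num_arms):
        self.alpha = np.ones(num_arms)   # 1 + successes
        self.beta = np.ones(num_arms)    # 1 + failures

    def select_action(self):
        # Sample q_hat(a) ~ p(q(a) | theta_t) for every arm, then act greedily on the sample
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, action, reward):
        self.alpha[action] += reward
        self.beta[action] += 1.0 - reward

# Example run on made-up Bernoulli arms
probs = np.array([0.3, 0.5, 0.7])
agent = ThompsonBernoulli(num_arms=3)
for _ in range(1000):
    a = agent.select_action()
    r = float(rng.random() < probs[a])
    agent.update(a, r)
print(agent.alpha / (agent.alpha + agent.beta))   # posterior means; arm 2 (p=0.7) should look best
```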

Planning to explore

Information State Space

We can view the bandit as a sequential decision problem over information states: the information state $\tilde{s}_t$ summarises everything observed so far, and each action causes a transition to a new information state by updating our beliefs. This defines a Markov decision process over information states.

Example: Bernoulli Bandits

For Bernoulli bandits the information state is simply the counts of observed successes and failures for each arm (equivalently, the parameters of a Beta posterior per arm).

Solving Information State Space Bandits

E.g., learn a Bayesian reward distribution and plan into the future over the resulting information states.

Bayes-adaptive RL optimally trades off exploration and exploitation with respect to the prior distribution.

In full RL, the information state would also capture uncertainty about the transition model, not just the rewards.
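As a concrete sketch of planning over Bernoulli information states (my own toy implementation, feasible only for small horizons), the information state is the tuple of Beta counts per arm and the Bayes-optimal value can be computed by recursion:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(state, horizon):
    """Bayes-optimal expected future reward from an information state.

    `state` is a tuple of (alpha, beta) Beta counts per arm; rewards are 0/1.
    """
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a, (alpha, beta) in enumerate(state):
        p = alpha / (alpha + beta)                        # posterior predictive p(R=1 | a)
        win = list(state); win[a] = (alpha + 1, beta)     # next information state if R=1
        lose = list(state); lose[a] = (alpha, beta + 1)   # next information state if R=0
        q_a = p * (1.0 + value(tuple(win), horizon - 1)) \
            + (1.0 - p) * value(tuple(lose), horizon - 1)
        best = max(best, q_a)
    return best

# Two arms: one with 3 observed wins and 1 loss (Beta(4, 2)), one untried (Beta(1, 1))
start = ((4, 2), (1, 1))
print(value(start, horizon=5))   # Bayes-optimal expected reward over the next 5 pulls
```

The number of reachable information states grows rapidly with the horizon, which is why approximate planning (e.g., tree search) or simpler heuristics such as UCB and Thompson sampling are used in practice.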

All included screenshots are credited to the Lecture 2 slides, the wiki, and the comp0142 lectures.

  • Title:
  • Author: wy
  • Created at: 2023-07-10 14:57:52
  • Updated at: 2023-08-02 12:58:29
  • Link: https://yuuee-www.github.io/blog/2023/07/10/RL/EE/RLEE/
  • License: This work is licensed under CC BY-NC-SA 4.0.